Multiple column RDD
There's no such thing, really, but you don't need one either. You can create an RDD of objects of any type T. This type should model a record, so a record with multiple columns can be of type Array[String], Seq[AnyRef], or whatever best models your data. In Scala, the best choice (for type safety and code readability) is usually a case class that represents a record.
For example, if your CSV looks like this:
+---------+-------------------+--------+-------------+
| ID | Name | Age | Department |
+---------+-------------------+--------+-------------+
| 1 | John Smith | 29 | Dev |
| 2 | William Black | 31 | Finance |
| 3 | Nancy Stevens | 32 | Dev |
+---------+-------------------+--------+-------------+
You could, for example:
case class Record(id: Long, name: String, age: Int, department: String)

val input: RDD[String] = sparkContext.textFile("./file")
val parsed: RDD[Record] = input.map { line =>
  val Array(id, name, age, dept) = line.split(",").map(_.trim) // assumes comma-separated values, no header line
  Record(id.toLong, name, age.toInt, dept)
}
Now you can conveniently perform transformations on this RDD; for example, to turn it into a PairRDD with the ID as the key, simply call keyBy:
val keyed: RDD[(Long, Record)] = parsed.keyBy(_.id)
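The result supports the full PairRDD API. As a purely illustrative sketch (the department aggregation is an assumption, not part of the question), keying by department instead lets you count records per group:

// key by department, then count records per department
val byDept: RDD[(String, Record)] = parsed.keyBy(_.department)
val perDept: RDD[(String, Int)] = byDept.mapValues(_ => 1).reduceByKey(_ + _)
perDept.collect().foreach(println) // with the sample data: (Dev,2), (Finance,1)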
That said, even though you're more interested in "batch processing" than analysis, this could still be achieved more easily (and, depending on what you do with the data, perhaps perform better) using the DataFrames API. It has good facilities for reading CSVs safely (e.g. spark-csv) and for treating data as named columns, without the need to create a case class matching each type of record.
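For illustration, here is a minimal sketch of that route, assuming the external spark-csv package is on the classpath (on Spark 2.x and later an equivalent CSV reader is built into spark.read):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")      // first line holds the column names
  .option("inferSchema", "true") // derive column types from the data
  .load("./file")
df.groupBy("Department").count().show()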