
Working with multiple column RDD in Spark?

I've read through a number of Spark examples, and I can't seem to find out how to create an RDD with a key column and multiple value columns from a CSV file.

I've read a little bit about Spark SQL, and I don't think it's what I want in this case. I'm not looking for interactive analysis of this data; it's more of a batch-type process.

I'm interested in Java or Scala syntax.

Can you point me in the right direction?


1 Reply


Multiple column RDD

There's no such thing, really, but you don't need one either. You can create an RDD of objects of any type T. This type should model a record, so a record with multiple columns can be of type Array[String], Seq[AnyRef], or whatever best models your data. In Scala, the best choice (for type safety and code readability) is usually a case class that represents a record.

For example, if your CSV looks like this:

+---------+-------------------+--------+-------------+
| ID      | Name              | Age    | Department  |
+---------+-------------------+--------+-------------+
| 1       | John Smith        | 29     | Dev         |
| 2       | William Black     | 31     | Finance     |
| 3       | Nancy Stevens     | 32     | Dev         |
+---------+-------------------+--------+-------------+

You could, for example:

import org.apache.spark.rdd.RDD

case class Record(id: Long, name: String, age: Int, department: String)

val input: RDD[String] = sparkContext.textFile("./file")
val parsed: RDD[Record] = input.map(/* split string and create new Record */)
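
Filling in that placeholder, a minimal parsing sketch might look like the following - it assumes plain comma-separated fields with a header row starting with "ID" and no quoted or embedded commas, so it is illustrative rather than a robust CSV parser:

val parsed: RDD[Record] = input
  .filter(line => !line.startsWith("ID"))   // skip the header row
  .map { line =>
    // Will throw if a line doesn't have exactly four comma-separated fields
    val Array(id, name, age, department) = line.split(",").map(_.trim)
    Record(id.toLong, name, age.toInt, department)
  }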

Now you can conveniently perform transformations on this RDD. For example, if you want to turn it into a PairRDD with the ID as the key, simply call keyBy:

val keyed: RDD[(Long, Record)] = parsed.keyBy(_.id)
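
As one hedged illustration of what keying buys you (the salaries RDD below is hypothetical, purely to show the shape of a join; it is not part of the question), a PairRDD can be joined against any other dataset that shares the same key type:

// Hypothetical second dataset, keyed by the same ID
val salaries: RDD[(Long, Double)] = sparkContext.parallelize(
  Seq((1L, 70000.0), (2L, 85000.0), (3L, 90000.0)))

// Inner join on the ID key: each result pairs a Record with its salary
val withSalary: RDD[(Long, (Record, Double))] = keyed.join(salaries)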

That said - even though you're more interested in "batch processing" than in analysis - this could still be achieved more easily (and perhaps perform better, depending on what you do with this RDD) using the DataFrames API. It has good facilities for reading CSVs safely (e.g. via spark-csv) and for treating data as columns, without the need to create case classes matching each type of record.
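
For reference, a minimal sketch of that DataFrame route, assuming the external spark-csv package on Spark 1.x (in Spark 2.x and later the equivalent spark.read.option(...).csv(...) is built in); the path and options here are illustrative:

import org.apache.spark.sql.{DataFrame, SQLContext}

val sqlContext = new SQLContext(sparkContext)

val df: DataFrame = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")        // first line holds the column names
  .option("inferSchema", "true")   // infer column types instead of all strings
  .load("./file")

// Columns can then be used directly, no case class required
df.groupBy("Department").count().show()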

