Welcome to the OGeek Q&A Community for programmers and developers: Open, Learning and Share

0 votes
160 views
in Technique by (71.8m points)

Spark scala read multiple files from S3 using Seq(paths)

I have a Scala program that reads JSON files into a DataFrame using DataFrameReader, with a file pattern like "s3n://bucket/filepath/*.json" to specify the files. Now I need to read both ".json" and ".json.gz" (gzip) files into the DataFrame.

The current approach uses a wildcard, like this:

session.read.json("s3n://bucket/filepath/*.json")

I want to read both json and json-gzip files, but I have not found documentation for the wildcard pattern syntax. I was tempted to compose a more complex wildcard, but the lack of documentation led me to consider another approach.

Reading the Spark documentation, I see that DataFrameReader has these relevant methods:

  • json(path: String): DataFrame
  • json(paths: String*): DataFrame

Which would produce code more like this:

// spark.isInstanceOf[SparkSession]
// val reader: DataFrameReader = spark.read
val df: DataFrame = spark.read.json(path: String)
// or
val df: DataFrame = spark.read.json(paths: String*)

I need to read json and json-gzip files, and I may need to read other filename formats later. The second method (above) accepts variable arguments (String*), which means I can build a Seq of path patterns and add other filename wildcards to it later.

// session: SparkSession
val s3json: String = "s3n://bucket/filepath/*.json"
val s3gzip: String = "s3n://bucket/filepath/*.json.gz"
val paths: Seq[String] = Seq(s3json, s3gzip)
val df: DataFrame = session.read.json(paths)

Please comment on this approach: is it idiomatic?

I have also seen examples of the last line with the splat operator (": _*") appended to the paths sequence. Is that needed? Can you explain what the ": _*" part does?

val df: DataFrame = session.read.json(paths: _*)
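Outside of Spark, the splat behavior can be demonstrated with a plain varargs function. The `listPaths` helper below is hypothetical (not part of any Spark API); it only mirrors the shape of `DataFrameReader.json(paths: String*)`:

```scala
// A minimal sketch of what ": _*" does, using a hypothetical varargs
// function with the same shape as DataFrameReader.json(paths: String*).
def listPaths(paths: String*): Seq[String] = paths.toSeq

// Individual String arguments bind to the varargs parameter directly:
val a: Seq[String] = listPaths(
  "s3n://bucket/filepath/*.json",
  "s3n://bucket/filepath/*.json.gz"
)

// A Seq[String] must be expanded with ": _*" so the compiler passes
// its elements as separate varargs rather than as a single argument:
val paths: Seq[String] = Seq(
  "s3n://bucket/filepath/*.json",
  "s3n://bucket/filepath/*.json.gz"
)
val b: Seq[String] = listPaths(paths: _*)

// Without the ascription, listPaths(paths) does not compile:
// a Seq[String] is not a String.
```

So the splat is needed: `json(paths)` without it fails to type-check, because the single-String overload cannot accept a Seq.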

Question from: https://stackoverflow.com/questions/65853273/spark-scala-read-multiple-files-from-s3-using-seqpaths


1 Reply

0 votes
by (71.8m points)

Adding to blackbishop's answer: you can use val df = spark.read.json(paths: _*) to read files from entirely independent buckets/folders.

    val paths = Seq("s3n://bucket1/filepath1/","s3n://bucket2/filepath/2")
    val df = spark.read.json(paths: _*)

The ": _*" ascription expands the Seq into the variable-length arguments (varargs) that the json(paths: String*) method expects.
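As a small illustration of why the Seq-based approach composes well: patterns can be appended incrementally, with a single splat-expanded call at the end. The bucket paths below are placeholders, and the commented-out read assumes a SparkSession named spark:

```scala
// Build the path patterns incrementally, then expand once with ": _*".
val base: Seq[String] = Seq(
  "s3n://bucket/filepath/*.json",
  "s3n://bucket/filepath/*.json.gz"
)

// Later, another filename wildcard can be appended without touching
// the read call itself:
val patterns: Seq[String] = base :+ "s3n://bucket/filepath/*.json.bz2"

// val df = spark.read.json(patterns: _*)  // single varargs expansion
```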

