Welcome to the OGeek Q&A Community for programmers and developers: Open, Learning and Share

0 votes
160 views
in Technique by (71.8m points)

Spark scala read multiple files from S3 using Seq(paths)

I have a Scala program that reads JSON files into a DataFrame using DataFrameReader, with a file pattern like "s3n://bucket/filepath/*.json" to specify the files. Now I need to read both ".json" and ".json.gz" (gzip) files into the DataFrame.

The current approach uses a wildcard, like this:

session.read.json("s3n://bucket/filepath/*.json")

I want to read both json and json-gzip files, but I have not found documentation for the wildcard pattern syntax. I was tempted to compose a more complex wildcard, but the lack of documentation led me to consider another approach.

Reading the Spark documentation, I see that DataFrameReader has these relevant methods:

  • json(path: String): DataFrame
  • json(paths: String*): DataFrame

Which would produce code more like this:

// spark.isInstanceOf[SparkSession]
// val reader: DataFrameReader = spark.read
val df: DataFrame = spark.read.json(path: String)
// or
val df: DataFrame = spark.read.json(paths: String*)

I need to read json and json-gzip files, and I may need to read other filename formats later. The second method (above) accepts variable arguments (String*), which means I can build a Seq of path patterns and add other filename wildcards to it later.

// session: SparkSession
val s3json: String = "s3n://bucket/filepath/*.json"
val s3gzip: String = "s3n://bucket/filepath/*.json.gz"
val paths: Seq[String] = Seq(s3json, s3gzip)
val df: DataFrame = session.read.json(paths)

Please comment on this approach: is it idiomatic?

I have also seen examples of the last line with the splat operator (": _*") appended to the paths sequence. Is that needed? Can you explain what the ": _*" part does?

val df: DataFrame = session.read.json(paths: _*)
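Outside of Spark, the splat behavior can be demonstrated with a plain varargs function. The `listPaths` helper below is hypothetical (not part of any Spark API); it only mirrors the shape of `DataFrameReader.json(paths: String*)`:

```scala
// A minimal sketch of what ": _*" does, using a hypothetical varargs
// function with the same shape as DataFrameReader.json(paths: String*).
def listPaths(paths: String*): Seq[String] = paths.toSeq

// Individual String arguments bind to the varargs parameter directly:
val a: Seq[String] = listPaths(
  "s3n://bucket/filepath/*.json",
  "s3n://bucket/filepath/*.json.gz"
)

// A Seq[String] must be expanded with ": _*" so the compiler passes
// its elements as separate varargs rather than as a single argument:
val paths: Seq[String] = Seq(
  "s3n://bucket/filepath/*.json",
  "s3n://bucket/filepath/*.json.gz"
)
val b: Seq[String] = listPaths(paths: _*)

// Without the ascription, listPaths(paths) does not compile:
// a Seq[String] is not a String.
```

So the splat is needed: `json(paths)` without it fails to type-check, because the single-String overload cannot accept a Seq.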

Question from: https://stackoverflow.com/questions/65853273/spark-scala-read-multiple-files-from-s3-using-seqpaths


1 Reply

0 votes
by (71.8m points)

Adding to blackbishop's answer: you can use val df = spark.read.json(paths: _*) to read files from entirely independent buckets/folders.

    val paths = Seq("s3n://bucket1/filepath1/","s3n://bucket2/filepath/2")
    val df = spark.read.json(paths: _*)

The ": _*" ascription expands the Seq into the variable-length arguments (varargs) that the json(paths: String*) method expects.
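As a small illustration of why the Seq-based approach composes well: patterns can be appended incrementally, with a single splat-expanded call at the end. The bucket paths below are placeholders, and the commented-out read assumes a SparkSession named spark:

```scala
// Build the path patterns incrementally, then expand once with ": _*".
val base: Seq[String] = Seq(
  "s3n://bucket/filepath/*.json",
  "s3n://bucket/filepath/*.json.gz"
)

// Later, another filename wildcard can be appended without touching
// the read call itself:
val patterns: Seq[String] = base :+ "s3n://bucket/filepath/*.json.bz2"

// val df = spark.read.json(patterns: _*)  // single varargs expansion
```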

