I have a Scala program that reads JSON files into a DataFrame using DataFrameReader, with a file pattern like "s3n://bucket/filepath/*.json" to specify the files. Now I need to read both ".json" and ".json.gz" (gzip) files into the DataFrame.
The current approach uses a wildcard, like this:
session.read.json("s3n://bucket/filepath/*.json")
I want to read both JSON and gzipped JSON files, but I have not found documentation for the wildcard pattern syntax. I was tempted to compose a more complex wildcard, but the lack of documentation led me to consider another approach.
The Spark documentation says that DataFrameReader has these relevant methods:
- json(path: String): DataFrame
- json(paths: String*): DataFrame
These would produce code more like this:
// spark.isInstanceOf[SparkSession]
// val reader: DataFrameReader = spark.read
val df: DataFrame = spark.read.json(path: String)
// or
val df: DataFrame = spark.read.json(paths: String*)
I need to read JSON and gzipped JSON files now, but I may need to read other filename formats later. The second method (above) appears to accept a Scala Seq, so I could pass a Seq of paths and add other filename wildcards to it later.
// session.isInstanceOf[SparkSession]
val s3json: String = "s3n://bucket/filepath/*.json"
val s3gzip: String = "s3n://bucket/filepath/*.json.gz"
val paths: Seq[String] = Seq(s3json, s3gzip)
val df: DataFrame = session.read.json(paths)
Please comment on this approach. Is it idiomatic?
I have also seen examples of the last line with the splat operator ("_*") added to the paths sequence. Is that needed? Can you explain what the ": _*" part does?
val df: DataFrame = session.read.json(paths: _*)
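To make the question concrete without needing Spark, here is a minimal sketch of how I understand a String* parameter to behave. The object VarargsSketch and the method joinPaths are hypothetical stand-ins for DataFrameReader.json(paths: String*), not part of Spark, and the comment about the compile error reflects my understanding rather than anything I found documented.

```scala
// Hypothetical stand-in for DataFrameReader.json(paths: String*); no Spark required.
object VarargsSketch {
  // Inside the method body, a String* parameter is seen as a Seq[String].
  def joinPaths(paths: String*): String = paths.mkString(",")

  def main(args: Array[String]): Unit = {
    val s3json: String = "s3n://bucket/filepath/*.json"
    val s3gzip: String = "s3n://bucket/filepath/*.json.gz"
    val paths: Seq[String] = Seq(s3json, s3gzip)

    // joinPaths(paths)  // as far as I can tell, a type mismatch: Seq[String] where String is required

    // ": _*" asks the compiler to pass each element of the Seq as a separate argument.
    val viaSplat: String = joinPaths(paths: _*)

    // The same call with the arguments written out by hand.
    val viaArgs: String = joinPaths(s3json, s3gzip)

    println(viaSplat == viaArgs)
  }
}
```

If this sketch is right, the Seq can grow to include other filename patterns and the call site stays the same, as long as the ": _*" ascription is present.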
Examples of the splat operator in use are here:
question from:
https://stackoverflow.com/questions/65853273/spark-scala-read-multiple-files-from-s3-using-seqpaths