I am trying to read the files at a sequence of paths in Scala. Below is the sample (pseudo) code:
val paths: Seq[String] = ??? // Seq of paths (placeholder)
val dataframe = spark.read.parquet(paths: _*)
In the above sequence, some paths exist and some don't. Is there any way to ignore the missing paths while reading Parquet files (to avoid org.apache.spark.sql.AnalysisException: Path does not exist)?
I have tried the following and it seems to work, but I end up reading each path twice, which I would like to avoid:
import scala.util.Try
val filteredPaths = paths.filter(p => Try(spark.read.parquet(p)).isSuccess)
I checked the options method of DataFrameReader, but it does not seem to have any option similar to ignore_if_missing.
Also, these paths can be hdfs or s3 (this Seq is passed as a method argument), and while reading I don't know whether a given path is s3 or hdfs, so I can't use an s3-specific or hdfs-specific API to check for existence.
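One possible way to check existence without knowing the scheme is Hadoop's generic FileSystem API, which resolves the implementation from the path's URI. The sketch below is only an illustration: existingPaths is a hypothetical helper name, and treating any lookup failure as "missing" is an assumption that may not suit every case. It also assumes the relevant connector (for example the S3 one) is on the classpath and configured.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import scala.util.Try

// Hypothetical helper (not part of Spark): keep only the paths that exist.
// Path.getFileSystem selects the FileSystem implementation from the URI
// scheme (hdfs://, s3a://, file://, ...), so the caller need not know it.
def existingPaths(paths: Seq[String], conf: Configuration): Seq[String] =
  paths.filter { p =>
    val path = new Path(p)
    // Assumption: any failure (unknown scheme, no access) counts as "missing".
    Try(path.getFileSystem(conf).exists(path)).getOrElse(false)
  }
```

With a SparkSession in scope, this could be used as spark.read.parquet(existingPaths(paths, spark.sparkContext.hadoopConfiguration): _*). This issues one metadata call per path rather than reading each path twice. Note that exists does not expand glob patterns, and the read may still fail if no path survives the filter.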