scala - Spark : Read file only if the path exists

Question

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I am trying to read the files at a sequence of paths in Scala. Below is sample (pseudo) code:

val paths: Seq[String] = ??? // Seq of paths, supplied by the caller
val dataframe = spark.read.parquet(paths: _*)

Now, in the above sequence, some paths exist whereas some don't. Is there any way to ignore the missing paths while reading parquet files (to avoid org.apache.spark.sql.AnalysisException: Path does not exist)?

I have tried the following, and it seems to work, but I end up reading each path twice, which I would like to avoid:

import scala.util.Try

val filteredPaths = paths.filter(p => Try(spark.read.parquet(p)).isSuccess)

I checked the option methods of DataFrameReader, but there does not seem to be any option similar to ignore_if_missing.

Also, these paths can be on HDFS or S3 (the Seq is passed in as a method argument). At read time I don't know whether a given path is S3 or HDFS, so I can't use an S3-specific or HDFS-specific API to check whether it exists.


1 Reply

深蓝 · Answer 1 · 2021-10-23T19:19:13+0000

You can filter out the paths that do not exist, as in @Psidom's answer. In Spark, a good way to do so is to use the Hadoop configuration held by the Spark context. Given that the Spark session variable is called "spark", you can do:

import org.apache.hadoop.fs.Path

val hadoopConf = spark.sparkContext.hadoopConfiguration

// Returns true only if the path exists and is a directory.
def testDirExists(path: String): Boolean = {
  val p = new Path(path)
  // Resolve the FileSystem from the path's own scheme (hdfs://, s3a://, ...),
  // so a Seq that mixes HDFS and S3 paths is handled.
  val fs = p.getFileSystem(hadoopConf)
  fs.exists(p) && fs.getFileStatus(p).isDirectory
}
val filteredPaths = paths.filter(p => testDirExists(p))
val dataframe = spark.read.parquet(filteredPaths: _*)
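
One case the filtering approach leaves open is what happens when none of the paths exist. As far as I know, spark.read.parquet then fails on schema inference because it has nothing to read. Below is a minimal sketch that wraps the check and the read together. The helper names pathExists and readExistingParquet are made up for illustration and are not part of Spark or Hadoop. It assumes that treating a failed lookup (unknown scheme, missing credentials) as "missing" is acceptable, which may hide real configuration problems. It also accepts single files as well as directories.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}
import scala.util.Try

// Hypothetical helper: true if the path exists on the filesystem named by its scheme.
def pathExists(path: String, conf: Configuration): Boolean = {
  val p = new Path(path)
  // Assumption: any non-fatal lookup failure is treated as "path is missing".
  Try(p.getFileSystem(conf).exists(p)).getOrElse(false)
}

// Hypothetical helper: reads only the existing paths; None if no path exists.
def readExistingParquet(spark: SparkSession, paths: Seq[String]): Option[DataFrame] = {
  val conf = spark.sparkContext.hadoopConfiguration
  val existing = paths.filter(pathExists(_, conf))
  if (existing.isEmpty) None else Some(spark.read.parquet(existing: _*))
}
```

The caller can then write readExistingParquet(spark, paths) and pattern match on the result. The existence check is a filesystem metadata lookup rather than a Parquet read, so each path's data is still only read once.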
