I am trying to access gz files on s3 that start with _
in Apache Spark. Unfortunately spark deems these files invisible and returns Input path does not exist: s3n:.../_1013.gz
. If I remove the underscore it finds the file just fine.
I tried adding a custom PathFilter to the hadoopConfig:
package CustomReader
import org.apache.hadoop.fs.{Path, PathFilter}
class GFilterZip extends PathFilter {
override def accept(path: Path): Boolean = {
true
}
}
// in spark settings
sc.hadoopConfiguration.setClass("mapreduce.input.pathFilter.class", classOf[CustomReader.GFilterZip], classOf[org.apache.hadoop.fs.PathFilter])
but I still have the same problem. Any ideas?
System: Apache Spark 1.6.0 with Hadoop 2.3
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…