Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


scala - How to read a file from HDFS in map() quickly with Spark

I need to read a different file in each map() call. The files are in HDFS.

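  import java.io.{BufferedReader, InputStreamReader}
  import org.apache.hadoop.fs.{FileSystem, Path}

  val rdd = sc.parallelize(1 to 10000)
  val rdd2 = rdd.map { x =>
    // A FileSystem handle is requested for every single element
    val hdfs = FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new org.apache.hadoop.conf.Configuration())
    val path = new Path("/user/zhc/" + x + "/")
    val t = hdfs.listStatus(path)
    val in = hdfs.open(t(0).getPath)   // first file in the directory
    val reader = new BufferedReader(new InputStreamReader(in))
    val l = reader.readLine()
    reader.close()
    l                                  // return the first line
  }
  rdd2.count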

My problem is that this line

val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new org.apache.hadoop.conf.Configuration())

takes too much time, because every call inside map() has to create a new FileSystem value. Can I move this line outside the map() function so that hdfs is not created every time? Or is there another way to read files quickly inside map()?

My code runs on multiple machines. Thank you!


1 Reply


In your case, I recommend the wholeTextFiles method, which returns a pair RDD where the key is the full path of each file and the value is the content of that file as a string.

// RDD of (full file path, file content)
val filesPairRDD = sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/")
// RDD of (full file path, number of lines in the file); any other function could be applied to the content instead
val filesLineCount = filesPairRDD.map(x => (x._1, x._2.split("\n").length))
filesLineCount.collect()
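
Since the question only reads the first line of each file, the same idea can be narrowed down to that. The following is a minimal sketch, not part of the original answer: the object name `FirstLines` and the helpers `firstLine` and `firstLines` are hypothetical names introduced for illustration, and it assumes each file is small enough to be held in memory as one string, which is what wholeTextFiles requires anyway.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object FirstLines {
  // Hypothetical helper: the first line of a file's content,
  // or the empty string if the file is empty.
  def firstLine(content: String): String =
    content.split("\r?\n", 2).headOption.getOrElse("")

  // Hypothetical helper: RDD of (full file path, first line of that file).
  // `root` is whatever path would be passed to wholeTextFiles.
  def firstLines(sc: SparkContext, root: String): RDD[(String, String)] =
    sc.wholeTextFiles(root).mapValues(firstLine)
}
```

In a Spark shell this could be called as `FirstLines.firstLines(sc, "hdfs://ITS-Hadoop10:9000/")`. Spark opens the files itself, so no FileSystem value is created in user code.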

Edit

If your files are in subdirectories of the same parent directory (as mentioned in the comments), you could use a wildcard (glob pattern) in the path:

val filesPairRDD = sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/*/")
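
With a wildcard, the key of each pair is the full file path, so the numbered directory from the question (`/user/zhc/<x>/`) has to be recovered from that key if it is still needed. A minimal sketch in plain Scala, assuming the directory name is the path component directly above the file name; `DirIds` and `dirId` are hypothetical names, not part of Spark or Hadoop:

```scala
import scala.util.Try

object DirIds {
  // Hypothetical helper: for a path such as ".../user/zhc/42/somefile"
  // return Some(42); return None if the parent directory is not a number.
  def dirId(fullPath: String): Option[Int] = {
    val parts = fullPath.split('/').filter(_.nonEmpty)
    if (parts.length < 2) None
    else Try(parts(parts.length - 2).toInt).toOption
  }
}
```

It could then be applied to the pair RDD, for example `filesPairRDD.flatMap { case (path, content) => DirIds.dirId(path).map(id => (id, content)) }`. Note that wholeTextFiles reads every file in each matched directory, whereas the code in the question reads only the first one.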

Hope this is clear and helpful
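
As a separate option that is not part of the answer above: a FileSystem value cannot simply be created on the driver and used inside map(), because it is not serializable. A common pattern is mapPartitions, which lets the handle be created once per partition on the executor instead of once per element. The sketch below reuses the URI and directory layout from the question; `FirstLinePerDir`, `dirPath`, and `read` are hypothetical names, and the code assumes every directory contains at least one file.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object FirstLinePerDir {
  // Directory layout taken from the question: /user/zhc/<x>/
  def dirPath(x: Int): String = "/user/zhc/" + x + "/"

  // RDD with the first line of the first file in each numbered directory.
  def read(sc: SparkContext): RDD[String] =
    sc.parallelize(1 to 10000).mapPartitions { ids =>
      // Created once per partition, on the executor that runs it
      val hdfs = FileSystem.get(new URI("hdfs://ITS-Hadoop10:9000/"), new Configuration())
      ids.map { x =>
        val files = hdfs.listStatus(new Path(dirPath(x)))
        val reader = new BufferedReader(new InputStreamReader(hdfs.open(files(0).getPath)))
        // readLine() returns null for an empty file
        try reader.readLine() finally reader.close()
      }
    }
}
```

Whether this is faster than wholeTextFiles depends on the number and size of the files, so it is worth measuring both on the actual data.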

