scala - Spark textFile vs wholeTextFiles

Question

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

I understand the basic theory: textFile generates partitions for each file, while wholeTextFiles generates an RDD of key-value pairs, where the key is the path of each file and the value is the content of that file.

Now, from a technical point of view, what's the difference between:

val textFile = sc.textFile("my/path/*.csv", 8)
textFile.getNumPartitions

and

val textFile = sc.wholeTextFiles("my/path/*.csv", 8)
textFile.getNumPartitions

In both methods I'm generating 8 partitions. So why should I use wholeTextFiles in the first place, and what's its benefit over textFile?

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:48:35+0000

The main difference, as you mentioned, is that textFile returns an RDD with each line as an element, while wholeTextFiles returns a PairRDD with the file path as the key and the file content as the value. If there is no need to separate the data by file, simply use textFile.

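When reading uncompressed files, textFile splits the data into chunks of at most 32MB by default on a local file system (on HDFS the upper bound is the block size). This is advantageous from a memory perspective, since no task has to hold a whole file at once. It also means that the lines of one file can end up in different partitions, so the file is no longer available as one ordered unit. If the order within each file should be preserved, use wholeTextFiles.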
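To make the chunking concrete, here is a rough sketch of how the partition count for textFile could be estimated. It is a simplified model of the split logic in Hadoop's old-API FileInputFormat as I understand it, not Spark's actual code. The object and method names are made up for illustration, and it assumes uncompressed, non-empty files.

```scala
// Simplified, hypothetical model of how input splits are sized for textFile.
// Assumes splittable (uncompressed), non-empty files; all names are illustrative.
object SplitSizeSketch {
  // A trailing chunk up to 10% larger than the split size is not split again.
  val SplitSlop = 1.1

  // Split size is capped by the block size and floored by the minimum split size.
  def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
    math.max(minSize, math.min(goalSize, blockSize))

  // Number of splits a single file of the given length is cut into.
  def splitsPerFile(fileLength: Long, splitSize: Long): Int = {
    var remaining = fileLength
    var count = 0
    while (remaining.toDouble / splitSize > SplitSlop) {
      count += 1
      remaining -= splitSize
    }
    if (remaining != 0) count += 1
    count
  }

  // Estimated partition count for a set of files read with a given minPartitions.
  def estimatePartitions(fileLengths: Seq[Long],
                         minPartitions: Int,
                         blockSize: Long,
                         minSize: Long = 1L): Int = {
    val totalSize = fileLengths.sum
    val goalSize = totalSize / math.max(minPartitions, 1)
    val splitSize = computeSplitSize(goalSize, minSize, blockSize)
    fileLengths.map(splitsPerFile(_, splitSize)).sum
  }
}
```

Under this model, three 100MB files read with minPartitions = 8 and a 32MB block size yield 12 partitions rather than 8, because the second argument is only a lower bound. wholeTextFiles does not follow this logic, since it never splits a file.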

wholeTextFiles reads the complete content of a file at once; it won't be partially spilled to disk or partially garbage collected. Each file is handled by one core, and the data for each file sits on a single machine, which makes it harder to distribute the load.
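As an illustration of per-file processing, the sketch below turns one (path, content) pair, the element type of the RDD returned by wholeTextFiles, into lines tagged with their file and position. The object and method names are hypothetical, and the commented Spark call assumes an existing SparkContext named sc.

```scala
// Hypothetical helper: expands one (path, content) pair into
// (path, lineNumber, line) triples, keeping the original line order.
object WholeFileLines {
  def indexedLines(file: (String, String)): Seq[(String, Int, String)] = {
    val (path, content) = file
    content
      .split("\r?\n")   // handles both Unix and Windows line endings
      .toSeq
      .zipWithIndex
      .map { case (line, i) => (path, i, line) }
  }
}

// Possible use with Spark (assumes a SparkContext `sc` is in scope):
// val lines = sc.wholeTextFiles("my/path/*.csv").flatMap(WholeFileLines.indexedLines)
```

Because the whole file is a single String here, the line numbers stay meaningful per file, at the cost of holding each file fully in memory in one task.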
