python - PySpark: spit out single file when writing instead of multiple part files

Question


posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

Is there a way to prevent PySpark from creating several small files when writing a DataFrame to a JSON file?

If I run:

df.write.format('json').save('myfile.json')

or

df.write.json('myfile.json')

it creates a folder named myfile.json, and within it I find several small files named part-***, the HDFS way. Is there any way to have it write a single file instead?

1 Reply

深蓝 · Answer 1 · 2021-10-23T19:22:55+0000

Well, the answer to your exact question is the coalesce function. But as already mentioned, it is not efficient at all, because it forces a single worker to fetch all the data and write it sequentially.

df.coalesce(1).write.format('json').save('myfile.json')
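Note that even with coalesce(1), Spark still writes a directory that holds one part-* file plus marker files such as _SUCCESS. If you want a plain file at a path of your choosing, a rough sketch like the one below may help. It assumes the output went to a local filesystem (not HDFS or S3) and to a temporary directory name such as 'myfile_tmp'; the helper name collapse_to_single_file is made up for illustration and is not part of PySpark.

```python
import glob
import os
import shutil


def collapse_to_single_file(output_dir, target_path):
    """Move the lone part file out of a Spark output directory.

    Hypothetical helper, not a PySpark API. Assumes a local filesystem
    and that target_path lies outside output_dir.
    """
    # Hidden checksum files (.part-*.crc) are not matched by this pattern.
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {output_dir}, "
            f"found {len(part_files)}"
        )
    shutil.move(part_files[0], target_path)
    # Remove what is left behind: _SUCCESS and any checksum files.
    shutil.rmtree(output_dir)
    return target_path


# Possible usage after a Spark job (paths are assumptions):
# df.coalesce(1).write.format('json').save('myfile_tmp')
# collapse_to_single_file('myfile_tmp', 'myfile.json')
```

On HDFS or object storage the same idea applies, but the move would have to go through that filesystem's own tools rather than shutil.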

P.S. By the way, the resulting file is not a valid JSON document. It is a file with one JSON object per line.
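If you do need a single valid JSON document, one option is to post-process the line-per-object file into a JSON array. This is only a sketch: the function name json_lines_to_array and the paths are hypothetical, and it reads everything into memory, so it only makes sense for output small enough to fit on one machine.

```python
import json


def json_lines_to_array(lines_path, array_path):
    """Convert a one-object-per-line file into a single JSON array file.

    Hypothetical helper for illustration; loads all records into memory.
    """
    with open(lines_path, encoding="utf-8") as src:
        # Parse each non-empty line as its own JSON object.
        records = [json.loads(line) for line in src if line.strip()]
    with open(array_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst)
    return records
```

To read the original file back without converting it, parsing it line by line with json.loads as above is enough.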
