Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.4k views
in Technique[技术] by (71.8m points)

json - How to avoid generating crc files and SUCCESS files while saving a DataFrame?

I am using the following code to save a spark DataFrame to JSON file

unzipJSON.write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")

the output result is:

part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00000-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8
.part-r-00001-704b5725-15ea-4705-b347-285a4b0e7fd8.crc
_SUCCESS
._SUCCESS.crc
  1. How do I generate a single JSON file and not a file per line?
  2. How can I avoid the *crc files?
  3. How can I avoid the SUCCESS file?
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

If you want a single file, you need to do a coalesce to a single partition before calling write, so:

unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")

Personally, I find it rather annoying that the number of output files depend on number of partitions you have before calling write - especially if you do a write with a partitionBy - but as far as I know, there are currently no other way.

I don't know if there is a way to disable the .crc files - I don't know of one - but you can disable the _SUCCESS file by setting the following on the hadoop configuration of the Spark context.

sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")

Note, that you may also want to disable generation of the metadata files with:

sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

Apparently, generating the metadata files takes some time (see this blog post) but aren't actually that important (according to this). Personally, I always disable them and I have had no issues.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...