If you want a single file, you need to coalesce to a single partition before calling write:
unzipJSON.coalesce(1).write.mode("append").json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")
Personally, I find it rather annoying that the number of output files depends on the number of partitions you have before calling write, especially if you do a write with a partitionBy, but as far as I know, there is currently no other way.
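For the partitionBy case, the same idea (controlling the partitioning before the write) can be sketched as below. This is only an illustration: the object name, the helper `writeOneFilePerValue`, and its parameters are hypothetical and not part of the original question. It assumes that repartitioning on the same column used in partitionBy sends all rows with a given value to one task, so each output directory receives a single file per write.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object PartitionedWriteSketch {
  // Hypothetical helper: `df`, `partitionColumn`, and `outputDir` are placeholders.
  def writeOneFilePerValue(df: DataFrame, partitionColumn: String, outputDir: String): Unit = {
    df.repartition(col(partitionColumn)) // rows sharing a value land in the same task
      .write
      .partitionBy(partitionColumn)      // one sub-directory per distinct value
      .mode("append")
      .json(outputDir)
  }
}
```

The trade-off is an extra shuffle, and a very large partition value is then written by a single task.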
I don't know of a way to disable the .crc files, but you can disable the _SUCCESS file by setting the following on the Hadoop configuration of the Spark context:
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Note that you may also want to disable generation of the Parquet summary metadata files with:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
Apparently, generating the metadata files takes some time (see this blog post), but they aren't actually that important (according to this). Personally, I always disable them and have had no issues.
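Putting the pieces together, a minimal sketch might look like the following. It assumes Spark 2.x or later with a `SparkSession` (the original snippets use a bare `sc`), a local master, and a placeholder input path; the object name is made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object SingleFileJsonWrite {
  def main(args: Array[String]): Unit = {
    // Assumption: local session; on a cluster the master is usually set by spark-submit.
    val spark = SparkSession.builder()
      .appName("single-file-json")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Skip the _SUCCESS marker file.
    sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    // Skip Parquet summary metadata files (only relevant when writing Parquet).
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // Placeholder input path: replace with the real source.
    val unzipJSON = spark.read.json("/path/to/input")

    // A single partition means a single part file in the output directory.
    unzipJSON.coalesce(1)
      .write
      .mode("append")
      .json("/home/eranw/Workspace/JSON/output/unCompressedJson.json")

    spark.stop()
  }
}
```

Keep in mind that the path passed to json(...) is treated as a directory: Spark writes a part-* file inside it rather than a file with that exact name.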