scala - How to split parquet files into many partitions in Spark?

Question


posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)


So I have just one parquet file that I'm reading with Spark (using Spark SQL), and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100, and I've also tried changing the parquet compression from gzip to none. No matter what I do, the first stage of the Spark job has only a single partition (once a shuffle occurs it gets repartitioned into 100, and from then on things are obviously much, much faster).

Now, according to a few sources (like the one below), parquet should be splittable (even when using gzip!), so I'm super confused and would love some advice.

https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html

I'm using Spark 1.0.0, and apparently the default value for spark.sql.shuffle.partitions is 200, so it can't be that. In fact, all the parallelism defaults are much greater than 1, so I don't understand what's going on.



1 Reply

深蓝 · Answer 1 · 2021-10-17T02:58:09+0000

You should write your parquet files with a smaller block size. The default is 128 MB per block, but it's configurable by setting the parquet.block.size option in the writer.
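As a rough illustration of the suggestion above, here is a minimal sketch that rewrites a file with smaller blocks. It assumes a recent Spark with the SparkSession API (not the 1.0.0 from the question), and the paths and the 16 MB value are made up for the example. Note that in recent Spark versions the read-side split size (spark.sql.files.maxPartitionBytes) also affects how many partitions you end up with, so treat the printed count as something to check rather than a guarantee.

```scala
import org.apache.spark.sql.SparkSession

object RewriteWithSmallerBlocks {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-block-size-sketch")
      .master("local[*]") // assumption: local run, just for trying it out
      .getOrCreate()

    // Hypothetical paths, replace with your own.
    val inputPath  = "/data/input.parquet"
    val outputPath = "/data/input_small_blocks.parquet"

    // Ask the Parquet writer for 16 MB blocks instead of the 128 MB default.
    // The value is in bytes; 16 MB is an arbitrary choice for illustration.
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", 16 * 1024 * 1024)

    // Rewrite the data so the new files are laid out with the smaller blocks.
    spark.read.parquet(inputPath)
      .write
      .mode("overwrite")
      .parquet(outputPath)

    // Read it back and see how many partitions the first stage would get.
    val reread = spark.read.parquet(outputPath)
    println(s"read partitions: ${reread.rdd.getNumPartitions}")

    spark.stop()
  }
}
```

The block size has to be set before the write, since it only affects how new files are laid out; existing files keep whatever block size they were written with.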

The source of ParquetOutputFormat is worth a look if you want to dig into the details.

The block size is the minimum amount of data you can read out of a parquet file that is still logically readable (since parquet is columnar, you can't just split by line or something similarly trivial), so you can't have more reading threads than input blocks.
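To put numbers on that limit, here is a back-of-the-envelope sketch. It assumes the one-reader-per-block model described above and that blocks come out close to the configured size, which is only approximately true in practice; the object and method names are made up for the example.

```scala
object BlockSizeEstimate {
  // Rough upper bound on parallel readers: one per block (ceiling division).
  def maxReadPartitions(fileBytes: Long, blockBytes: Long): Long = {
    require(fileBytes >= 0 && blockBytes > 0, "sizes must be non-negative / positive")
    (fileBytes + blockBytes - 1) / blockBytes
  }

  // Largest block size (in bytes) that still gives at least `target` blocks.
  def blockSizeFor(fileBytes: Long, target: Int): Long = {
    require(target > 0, "target must be positive")
    math.max(1L, fileBytes / target)
  }

  def main(args: Array[String]): Unit = {
    val oneGiB = 1L << 30
    // With the 128 MB default, a 1 GiB file has room for only a handful of readers.
    println(maxReadPartitions(oneGiB, 128L << 20))
    // Block size to aim for if you want roughly 100 readers on the same file.
    println(blockSizeFor(oneGiB, 100))
  }
}
```

So for a file smaller than one default block, a single partition in the first stage is what this model predicts, whatever the parallelism settings say.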
