So I have just 1 parquet file I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism
to 100, we have also tried changing the compression of the parquet to none (from gzip). No matter what we do the first stage of the spark job only has a single partition (once a shuffle occurs it gets repartitioned into 100 and thereafter obviously things are much much faster).
Now according to a few sources (like below) parquet should be splittable (even if using gzip!), so I'm super confused and would love some advice.
https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html
I'm using spark 1.0.0, and apparently the default value for spark.sql.shuffle.partitions
is 200, so it can't be that. In fact all the defaults for parallelism are much more than 1, so I don't understand what's going on.
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…