pyspark - Spark parquet compression and encoding schemes

Question

Posted Oct 7, 2021 in Technique by 深蓝 (71.8m points)

I need the Parquet files produced by my PySpark script to be encoded with RLE_DICTIONARY (https://www.slideshare.net/databricks/the-parquet-format-and-performance-optimization-opportunities).

Secondly, I need compression to be applied at the row group (split unit) level rather than to the file as a whole, ideally with Snappy, so that we can support parallel reads from Redshift Spectrum (https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-data-files.html).
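To make the distinction concrete, here is a toy sketch of whole-file versus per-chunk compression. It is not Parquet's actual layout: zlib stands in for Snappy (which is not in the Python standard library), and the rows, `chunk_size`, and variable names are made up for illustration. The point is only that independently compressed chunks can each be decompressed without touching the others, which is the property a parallel reader relies on.

```python
import zlib

# Made-up sample data: 10,000 small CSV-like rows.
rows = [f"row-{i},value-{i % 7}\n".encode() for i in range(10_000)]
chunk_size = 2_500  # hypothetical stand-in for a row group

# "File-level" compression: one stream, must be decompressed from the start.
whole = zlib.compress(b"".join(rows))

# "Chunk-level" compression: each chunk is its own independent stream.
chunks = [
    zlib.compress(b"".join(rows[start:start + chunk_size]))
    for start in range(0, len(rows), chunk_size)
]

# A reader can decode the third chunk alone, without the first two.
third = zlib.decompress(chunks[2])
first_row_of_third = third.splitlines()[0]
```

With the whole-file stream, a reader that only wants the rows in the third chunk still has to decompress everything before them.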

However, looking at the official Spark documentation for Parquet, there are only a few Parquet-related properties that can be set (https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html#configuration). This property:

spark.sql.parquet.compression.codec

defaults to snappy, but does it apply file-level or split-level compression? That is, does Spark first produce the Parquet file and then Snappy-compress it, or does it first Snappy-compress the row groups (splits) and then assemble the file?
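For reference, a minimal sketch of the two places I know of where the codec can be set from PySpark: the session-level property above and the writer's `compression` option. The helper name `write_parquet_with_codec` and its arguments are hypothetical; `spark` and `df` are assumed to be an existing SparkSession and DataFrame.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # only needed for type hints; Spark is not imported at load time
    from pyspark.sql import DataFrame, SparkSession


def write_parquet_with_codec(
    spark: "SparkSession", df: "DataFrame", path: str, codec: str = "snappy"
) -> None:
    """Hypothetical helper: write `df` as Parquet using the given codec."""
    # Session-wide default for Parquet writes.
    spark.conf.set("spark.sql.parquet.compression.codec", codec)
    # Per-write option on the DataFrameWriter.
    df.write.option("compression", codec).parquet(path)
```

Setting both is redundant and is done here only to show the two spellings. Neither setting says anything about encodings; this only selects the compression codec.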

What is the default behavior here? Does it meet my requirement of compressing each split chunk rather than the whole file? Is RLE_DICTIONARY the default encoding used by Spark? I couldn't find an option to set the encoding itself.
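One way to check what actually got written, rather than relying on the docs, is to read the footer metadata of an output file. The sketch below assumes the pyarrow package is installed and that one of the part files is reachable at a local path; `describe_parquet` is a hypothetical helper name.

```python
def describe_parquet(path):
    """Hypothetical helper: list codec and encodings for every column chunk."""
    import pyarrow.parquet as pq  # imported lazily; assumes pyarrow is installed

    metadata = pq.ParquetFile(path).metadata
    report = []
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            column = row_group.column(col_index)
            report.append(
                {
                    "row_group": rg_index,
                    "column": column.path_in_schema,
                    "compression": column.compression,
                    "encodings": tuple(column.encodings),
                }
            )
    return report
```

Each entry describes one column chunk inside one row group, so the output shows directly at which level the codec was recorded and which encodings were used for that chunk.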

Question from: https://stackoverflow.com/questions/65844890/spark-parquet-compression-and-encoding-schemes
