Say I have a 1TB Parquet dataset stored in S3, written out as individual files of roughly 1GB each. The data contains just two columns, COL1 and COL2.
I use Spark SQL to filter the data down based on one of the columns and then take the maximum of the other column to get a single scalar result:
SELECT MAX(COL2) FROM TABLE WHERE COL1 = VALUE
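For concreteness, this is roughly how I'm running it from PySpark (a minimal sketch; the bucket path and the filter literal 'VALUE' are placeholders for my actual data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("max-col2").getOrCreate()

# Read the Parquet dataset directly from S3; the bucket/prefix is a placeholder.
df = spark.read.parquet("s3a://my-bucket/my-dataset/")
df.createOrReplaceTempView("TABLE")

# Same query as above; 'VALUE' stands in for the actual filter value.
result = spark.sql("SELECT MAX(COL2) FROM TABLE WHERE COL1 = 'VALUE'").collect()[0][0]
print(result)
```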
Let's say I have a 5-node cluster, each node with 4 cores and 32GB of memory. The full data set clearly doesn't fit into memory. What is actually going on under the hood: how is the file loaded into partitions spread across the nodes, and what work is done on each node, including spilling to disk, to get this query done? And which of the multitude of config settings are likely to have an impact on performance?
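To make that last part concrete, these are the kinds of settings I assume are the relevant knobs (a sketch only; the config keys are standard Spark settings, but the values are arbitrary placeholders, not something I've tuned):

```python
from pyspark.sql import SparkSession

# Hypothetical submit-time configuration for the 5-node / 4-core / 32GB cluster above.
spark = (
    SparkSession.builder
    .appName("max-col2")
    # Size of each input split read from the Parquet files (default 128MB).
    .config("spark.sql.files.maxPartitionBytes", "134217728")
    # Number of partitions used after a shuffle (this aggregation needs very few).
    .config("spark.sql.shuffle.partitions", "20")
    # Per-executor resources; one executor per node in this sketch.
    .config("spark.executor.instances", "5")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "24g")
    # Fraction of heap shared between execution and storage memory.
    .config("spark.memory.fraction", "0.6")
    # Push the COL1 filter down into the Parquet reader (on by default).
    .config("spark.sql.parquet.filterPushdown", "true")
    .getOrCreate()
)
```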
question from:
https://stackoverflow.com/questions/65854311/how-does-spark-perform-filters-and-aggregations-on-datasets-which-dont-fit-in-m