
How does Spark perform filters and aggregations on datasets which don't fit in memory?

Say I have a 1 TB Parquet dataset stored in S3, written out as individual files of 1 GB each. The data contains just two columns, COL1 and COL2.

I use Spark SQL to filter the dataset down based on one of the columns and then take the maximum of the other column to get a single scalar result:

SELECT MAX(COL2) FROM TABLE WHERE COL1 = VALUE
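
For concreteness, the same query expressed with the PySpark DataFrame API might look like the sketch below; the S3 path and the literal value are hypothetical placeholders:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("filter-then-max").getOrCreate()

# Reading Parquet only touches file metadata at this point; no rows are loaded.
df = spark.read.parquet("s3a://my-bucket/my-table/")  # hypothetical path

# Build the plan lazily: filter on COL1, then take the maximum of COL2.
result = df.filter(F.col("COL1") == "VALUE").agg(F.max("COL2"))

# The action triggers the scan: each task computes a partial max over its
# input partitions, and the partials are merged into the single scalar.
result.show()

# explain() prints the physical plan, including any pushed-down filters.
result.explain()

The explain() output is a handy way to check that the COL1 predicate is pushed down into the Parquet reader rather than applied after a full scan.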

Let's say I have a 5-node cluster, each node with 4 cores and 32 GB of memory. The full dataset clearly doesn't fit in memory. What is actually going on under the hood in terms of loading the file into partitions spread across the nodes, and what work is done on each node, including spilling to disk, to complete this query? Which of the multitude of config settings are likely to have an impact on performance?
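
As a reference point for the config question, the sketch below names a few real Spark settings that typically matter for a scan-and-aggregate job like this; the values are illustrative assumptions for the 5-node, 4-core, 32 GB cluster above, not tuned recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("filter-then-max")
    # Input split size: each file is cut into ~128 MB partitions (the default),
    # so a 1 TB dataset yields roughly 8000 scan tasks.
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
    # Push the COL1 = VALUE predicate down into the Parquet reader, so row
    # groups whose min/max statistics exclude VALUE can be skipped entirely.
    .config("spark.sql.parquet.filterPushdown", "true")
    # Fraction of heap shared by execution (aggregation buffers, spills) and
    # storage (caching); execution memory spills to disk under pressure.
    .config("spark.memory.fraction", "0.6")
    # Partition count after a shuffle; a global MAX reduces to a single row,
    # so the shuffle here is tiny and the default of 200 is more than enough.
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate()
)

With no GROUP BY, each task only ever holds one running maximum, so memory pressure should come mostly from the scan itself; the per-task input size (maxPartitionBytes) likely matters far more here than any shuffle setting.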

Question from: https://stackoverflow.com/questions/65854311/how-does-spark-perform-filters-and-aggregations-on-datasets-which-dont-fit-in-m


1 Reply

Waiting for answers
