Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
493 views
in Technique[技术] by (71.8m points)

python - Strategy for partitioning dask dataframes efficiently

The documentation for Dask talks about repartioning to reduce overhead here.

They however seem to indicate you need some knowledge of what your dataframe will look like beforehand (ie that there will 1/100th of the data expected).

Is there a good way to repartition sensibly without making assumptions? At the moment I just repartition with npartitions = ncores * magic_number, and set force to True to expand partitions if need be. This one size fits all approach works but is definitely suboptimal as my dataset varies in size.

The data is time series data, but unfortunately not at regular intervals, I've used repartition by time frequency in the past but this would be suboptimal because of how irregular the data is (sometimes nothing for minutes then thousands in seconds)

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

As of Dask 2.0.0 you may call .repartition(partition_size="100MB").

This method performs an object-considerate (.memory_usage(deep=True)) breakdown of partition size. It will join smaller partitions, or split partitions that have grown too large.

Dask's Documentation also outlines the usage.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...