Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Sorting in pandas for large datasets

I would like to sort my data by a given column, specifically p-values. However, I cannot load the entire dataset into memory, so the following works only for small datasets.

data = data.sort_values(by="P_VALUE", ascending=True)

Is there a quick way to sort my data by a given column that only takes chunks into account and doesn't require loading entire datasets in memory?


1 Reply


In the past, I've used the venerable Linux sort and split utilities to sort massive files that choked pandas.

I don't want to disparage the other answer on this page. However, since your data is in text format (as you indicated in the comments), I think it's a tremendous complication to start transferring it into other formats (HDF, SQL, etc.) for something that GNU/Linux utilities have been solving very efficiently for the last 30-40 years.


Say your file is called stuff.csv, and looks like this:

4.9,3.0,1.4,0.6
4.8,2.8,1.3,1.2

Then the following command will sort it numerically by the 3rd column, in descending order (drop the r for ascending order):

sort --parallel=8 -t , -k3,3nr stuff.csv

Note that the number of threads here is set to 8.
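
For readers more at home in Python, the same in-memory step can be sketched with the standard library. This is only an illustration, not part of the original answer: the helper name sort_rows_by_column is hypothetical, and the sketch assumes every value in the key column parses as a float.

```python
import csv
import io


def sort_rows_by_column(csv_text, column=2, descending=True):
    """Sort CSV rows numerically by a zero-based column index (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    # float() plays the role of sort's numeric flag; reverse=True the role of reverse order.
    return sorted(rows, key=lambda row: float(row[column]), reverse=descending)


sample = "4.8,2.8,1.3,1.2\n4.9,3.0,1.4,0.6\n"
for row in sort_rows_by_column(sample):
    print(",".join(row))
# prints:
# 4.9,3.0,1.4,0.6
# 4.8,2.8,1.3,1.2
```

Like the command, this holds every row in memory at once, so it only suits data that fits.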


The above works for files that fit into main memory. When the file is too large, first split it into a number of parts:

split -l 100000 stuff.csv stuff_

This splits the file into pieces of at most 100000 lines each, named stuff_aa, stuff_ab, and so on.
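
If you would rather do the splitting from Python, a rough equivalent looks like the sketch below. It is an assumption-laden illustration: split_lines is a hypothetical helper, it numbers the pieces (0000, 0001, ...) instead of using the letter suffixes that split generates, and it assumes the file has no header row that would need repeating in each piece.

```python
import itertools


def split_lines(path, lines_per_chunk, out_prefix):
    """Write consecutive runs of at most lines_per_chunk lines to numbered files."""
    chunk_paths = []
    with open(path, "r", newline="") as src:
        for index in itertools.count():
            # islice pulls at most lines_per_chunk lines, so only one piece
            # is held in memory at a time.
            lines = list(itertools.islice(src, lines_per_chunk))
            if not lines:
                break
            chunk_path = f"{out_prefix}{index:04d}"
            with open(chunk_path, "w", newline="") as dst:
                dst.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths
```

Calling split_lines("stuff.csv", 100000, "stuff_") would return the list of piece paths in order, ready to be sorted one at a time.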

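Now sort each part individually, as above, writing each result to a file whose name starts with sorted_stuff_. Finally, merge the sorted parts, again through (wait for it...) sort. Give sort -m the same delimiter and key options that were used to sort the parts (here: comma-delimited, 3rd column, numeric, descending); otherwise it merges by comparing whole lines:

sort -m -t , -k3,3nr sorted_stuff_* > final_sorted_stuff.csv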
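
The merge step is an ordinary k-way merge, which Python exposes as heapq.merge. The sketch below is a stand-in for the merge command, not a description of how sort itself works: merge_sorted_csv is a hypothetical helper, and it assumes every input file is already sorted on the chosen column in the same direction and that the column always parses as a float.

```python
import csv
import heapq
from contextlib import ExitStack


def merge_sorted_csv(chunk_paths, out_path, column=2, descending=True):
    """K-way merge of CSV files already sorted on `column` (hypothetical helper)."""

    def key(row):
        return float(row[column])

    with ExitStack() as stack:
        # Open every sorted piece; ExitStack closes them all on exit.
        readers = [
            csv.reader(stack.enter_context(open(path, newline="")))
            for path in chunk_paths
        ]
        with open(out_path, "w", newline="") as out:
            writer = csv.writer(out, lineterminator="\n")
            # heapq.merge keeps only one pending row per input file in memory.
            writer.writerows(heapq.merge(*readers, key=key, reverse=descending))
```

Because all pieces are open at once, a very large number of pieces can hit the operating system's open-file limit; merging in several rounds is one way around that.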

Finally, if your file is not in CSV format (say it is a tgz file), find a way to pipe a CSV version of it into split (split reads standard input when given - as the file name).
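
The same idea carries over to Python: read the compressed archive as a stream and hand out rows in fixed-size batches, without unpacking the whole file first. The sketch below assumes a gzip-compressed tar archive containing a UTF-8 CSV member; the name iter_csv_chunks_from_tgz and its parameters are hypothetical.

```python
import csv
import io
import itertools
import tarfile


def iter_csv_chunks_from_tgz(tgz_path, member_name, rows_per_chunk):
    """Yield lists of CSV rows from one member of a .tgz without unpacking to disk."""
    with tarfile.open(tgz_path, mode="r:gz") as archive:
        raw = archive.extractfile(member_name)  # binary file-like object
        if raw is None:
            raise ValueError(f"{member_name!r} is not a regular file in the archive")
        reader = csv.reader(io.TextIOWrapper(raw, encoding="utf-8", newline=""))
        while True:
            # Take the next batch of rows; an empty batch means end of file.
            chunk = list(itertools.islice(reader, rows_per_chunk))
            if not chunk:
                return
            yield chunk
```

Each yielded batch can then be sorted and written to its own file, exactly like the pieces produced by split.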

