
0 votes
246 views
in Technique by (71.8m points)

Python: Read multiple large CSVs at the same time

I have 9 large CSVs (12GB each) with exactly the same column structure and row order, just different values in each file. I need to go through the CSVs row by row and compare the data inside them, but they are far too large to store in memory. Maintaining row order is highly important because the row position is used as an index for comparing the data between CSVs, so appending the tables together isn't ideal.

I'd rather avoid 9 nested "with open() as csv:" blocks using DictReader, which seems very messy.

I've tried to use pandas and concatenate:

files = [list_of_csv_paths]
result = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

but it simply tries to load all the data into memory, and I don't have nearly enough RAM. Changing pd.read_csv to use a specific chunksize returns a TypeError.
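
For reference, I believe the TypeError comes from pd.concat being handed the iterator objects that read_csv returns once chunksize is set; a minimal sketch of what I think happens (reusing the files list above):

import pandas as pd

# With chunksize set, read_csv returns a TextFileReader iterator rather
# than a DataFrame, and pd.concat accepts only Series/DataFrame objects.
readers = [pd.read_csv(f, chunksize=10 ** 5) for f in files]  # files as above
result = pd.concat(readers, ignore_index=True)  # raises TypeError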

I've seen that Dask could possibly be used for this, but I'm not experienced with Dask.

I'm open to any suggestions.

Question from: https://stackoverflow.com/questions/65888056/python-read-multiple-large-csvs-at-the-same-time


1 Reply

0 votes
by (71.8m points)

I think this might be a good start: reading by chunks, where chunksize is the number of lines per chunk according to the documentation. That should be the best way to read huge files. You can also try threading to process the chunks faster.

Simple example:

import pandas as pd

chunksize = 10 ** 8  # rows per chunk; tune this to the RAM you have available
for chunk in pd.read_csv(filename, chunksize=chunksize):  # filename = path to one CSV
    process(chunk)  # placeholder for your per-chunk logic
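
To extend this to all nine files while keeping rows aligned, one option is to zip one chunk iterator per file; a minimal sketch, assuming the identical row order stated in the question, with hypothetical paths and a placeholder compare function:

import pandas as pd

files = ["data_1.csv", "data_2.csv", "data_3.csv"]  # hypothetical paths, up to nine
chunksize = 10 ** 5  # rows per chunk, per file; tune to available RAM

readers = [pd.read_csv(f, chunksize=chunksize) for f in files]
for chunks in zip(*readers):
    # chunks[i] is the same row range from file i, because every file
    # shares one row order; only one chunk per file is in memory at a time.
    compare(*chunks)  # placeholder for the row-wise comparison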

Check the skiprows and nrows parameters as well. The next example reads lines 1000 to 2000 (with chunksize instead of nrows you would get an iterator back, not a DataFrame).

Example:

df = pd.read_csv('file.csv', sep=',', header=None, skiprows=1000, nrows=1000)  # rows 1000-1999
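
If you would rather stay with the csv module, contextlib.ExitStack avoids the nine nested "with open()" blocks the question mentions; a minimal sketch, again with hypothetical paths and a placeholder compare function:

import csv
from contextlib import ExitStack

files = ["data_1.csv", "data_2.csv", "data_3.csv"]  # hypothetical paths, up to nine

with ExitStack() as stack:
    readers = [csv.reader(stack.enter_context(open(f, newline=""))) for f in files]
    for rows in zip(*readers):
        # rows holds one line (a list of strings) from each file, all at the
        # same position, so memory use stays at one row per file.
        compare(*rows)  # placeholder for the row-wise comparison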

