python - Pandas read_stata() with large .dta files

Question

posted Oct 17, 2021 in Technique by 深蓝 (71.8m points)

I am working with a Stata .dta file of about 3.3 gigabytes, which is large but not excessively so. I would like to use IPython, and I tried to import the .dta file with Pandas, but the load never completes. My machine has 32 gigabytes of RAM, and attempting to load the file consumes all of it (after ~30 minutes) and causes the computer to stall. This doesn't 'feel' right, since I can open the same file in R using read.dta() from the foreign package without any problem, and working with the file in Stata is fine. The code I am using is:

%time myfile = pd.read_stata(data_dir + 'my_dta_file.dta')

I am running IPython in Enthought's Canopy program. I use '%time' because I want to benchmark this against R's read.dta().

My questions are:

Is there something I am doing wrong that is resulting in Pandas having issues?
Is there a workaround to get the data into a Pandas dataframe?

1 Reply

深蓝 · Answer 1 · 2021-10-17T00:46:15+0000

Here is a little function that has been handy for me, using some pandas features that might not have been available when the question was originally posed:

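import pandas as pd

def load_large_dta(fname):
    rows_per_chunk = 100 * 1000
    chunks = []

    # The context manager closes the file once reading stops
    with pd.read_stata(fname, iterator=True) as reader:
        try:
            chunk = reader.get_chunk(rows_per_chunk)
            while len(chunk) > 0:
                # Collect chunks in a list; DataFrame.append was removed in pandas 2.0
                chunks.append(chunk)
                chunk = reader.get_chunk(rows_per_chunk)
                print('.', end='', flush=True)
        except (StopIteration, KeyboardInterrupt):
            # End of file, or the user hit Ctrl-C: keep whatever was read
            pass

    # Combine everything in a single concat instead of growing df chunk by chunk
    df = pd.concat(chunks, ignore_index=True) if chunks else pd.DataFrame()
    print('\nloaded {} rows'.format(len(df)))

    return df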

I loaded an 11 GB Stata file in 100 minutes with this. Because KeyboardInterrupt is caught, the function returns the rows read so far, so it's nice to have something to play with if I get tired of waiting and hit Ctrl-C.

This notebook shows it in action.
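
On recent pandas versions, much the same thing can probably be done with the chunksize argument of read_stata, which returns a reader that can be iterated over directly. The sketch below is not part of the original answer: the function name load_dta_in_chunks and its parameters are made up for illustration, and it assumes a pandas version in which the reader can be used as a context manager.

```python
import pandas as pd


def load_dta_in_chunks(path, chunksize=100_000, columns=None):
    """Read a Stata file piece by piece and return a single DataFrame.

    Hypothetical helper for illustration only. `columns` optionally
    restricts the read to the named variables, which can lower peak
    memory when only part of the file is needed.
    """
    chunks = []
    # With chunksize set, read_stata returns a reader that yields
    # DataFrames; the context manager closes the file when done.
    with pd.read_stata(path, chunksize=chunksize, columns=columns) as reader:
        for chunk in reader:
            chunks.append(chunk)

    if not chunks:
        # Nothing was read (for example, a file with zero observations)
        return pd.DataFrame()

    # One concat at the end avoids repeatedly copying a growing frame
    return pd.concat(chunks, ignore_index=True)
```

If only a few variables are needed, passing something like columns=['a', 'b'] should keep the rest of the file out of memory. Whether this is any faster than the function above on a multi-gigabyte file is something to measure rather than assume.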
