Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
120 views
in Technique[技术] by (71.8m points)

python - Splitting dataframe into multiple dataframes

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents).

I would like to split the dataframe into 60 dataframes (a dataframe for each participant).

In the dataframe, data, there is a variable called 'name', which is the unique code for each participant.

I have tried the following, but nothing happens (or execution does not stop within an hour). What I intend to do is to split the data into smaller dataframes, and append these to a list (datalist):

import pandas as pd

def splitframe(data, name='name'):
    
    n = data[name][0]

    df = pd.DataFrame(columns=data.columns)

    datalist = []

    for i in range(len(data)):
        if data[name][i] == n:
            df = df.append(data.iloc[i])
        else:
            datalist.append(df)
            df = pd.DataFrame(columns=data.columns)
            n = data[name][i]
            df = df.append(data.iloc[i])
        
    return datalist

I do not get an error message, the script just seems to run forever!

Is there a smart way to do it?

question from:https://stackoverflow.com/questions/66058369/divide-dataframe-with-unique-key

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

Firstly your approach is inefficient because the appending to the list on a row by basis will be slow as it has to periodically grow the list when there is insufficient space for the new entry, list comprehensions are better in this respect as the size is determined up front and allocated once.

However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?

I would sort the dataframe by column 'name', set the index to be this and if required not drop the column.

Then generate a list of all the unique entries and then you can perform a lookup using these entries and crucially if you only querying the data, use the selection criteria to return a view on the dataframe without incurring a costly data copy.

Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:

# sort the dataframe
df.sort_values(by='name', axis=1, inplace=True)

# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)

# get a list of names
names=df['name'].unique().tolist()

# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']

# now you can query all 'joes'

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...