Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Dataframe Slice does not remove Index Values

I recently had this issue with a large dataframe and its associated multi index. This simplified example will demonstrate the issue.

import pandas as pd
import numpy as np

np.random.seed(1)
idx = pd.MultiIndex.from_product([['A', 'B'], [5, 6]])

df = pd.DataFrame(data=np.random.randint(1, 100, 4), index=idx, columns=['P'])
print(df)

Which yields:

      P
A 5  38
  6  13
B 5  73
  6  10

Now take a quick look at the index

print(df.index)

MultiIndex(levels=[[u'A', u'B'], [5, 6]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])

If I slice this dataframe, I notice that the MultiIndex never condenses: its levels still list every original value, even after a deep copy.

What is the best way to reduce the memory footprint of the index in a slice operation?

df_slice = df[df['P'] > 20]
print(df_slice)
print(df_slice.index)

      P
A 5  38
B 5  73

The dataframe has been reduced, but the index levels have not:

MultiIndex(levels=[[u'A', u'B'], [5, 6]],
           labels=[[0, 1], [0, 0]])

The same happens even with .copy(deep=True):

df_slice = df[df['P'] > 20].copy(deep=True)
print(df_slice.index)

MultiIndex(levels=[[u'A', u'B'], [5, 6]],
           labels=[[0, 1], [0, 0]])

I would have expected MultiIndex to have the 6 removed as shown:

MultiIndex(levels=[[u'A', u'B'], [5]],
           labels=[[0, 1], [0, 0]])

In practice, this becomes a problem when the dataframe is large.


1 Reply


I understand your concern, but I believe you have to look at what pandas is doing at a lower level.

First, note that indexes are supposed to be immutable. The documentation covers this here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#setting-metadata
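To illustrate the immutability point, here is a small sketch showing that assigning to an element of an index raises a TypeError. The replacement tuple ('C', 7) is arbitrary and only there for illustration.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], [5, 6]])

# Item assignment on an index is rejected with a TypeError.
try:
    idx[0] = ('C', 7)  # arbitrary value, just to trigger the error
    mutated = True
except TypeError:
    mutated = False

print(mutated)  # False
```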

When you create a dataframe, let's call it df, and want to select some of its rows, all you do is pass a boolean series that pandas aligns with the dataframe's index.

Follow this example:

index = pd.MultiIndex.from_product([['A','B'],[5,6]])
df = pd.DataFrame(data=np.random.randint(1,100,(4)), index=index, columns=["P"])

      P
A 5   5
  6  51
B 5  93
  6  76

Now, let's say we want to select the rows with P > 90. How would you do that? df[df["P"] > 90], right? But look at what df["P"] > 90 actually returns.

A  5    False
   6    False
B  5     True
   6    False
Name: P, dtype: bool

As you can see, it returns a boolean series with the same index as the original. Why? Because pandas needs to know which index entries map to True so it can select the right rows. So during slice operations you always carry this index along, because it is what maps the mask onto the object.
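A short sketch of this, with the P values hard-coded to match the example above instead of drawn at random: the mask shares the dataframe's index, and after filtering only one row is left while the levels still list every original value.

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], [5, 6]])
df = pd.DataFrame({'P': [5, 51, 93, 76]}, index=idx)  # hard-coded values

mask = df['P'] > 90   # boolean series aligned with df.index
selected = df[mask]   # keeps only the rows where the mask is True

same_index = mask.index.equals(df.index)
rows_left = len(selected)
levels_after = [list(level) for level in selected.index.levels]

print(same_index)    # True
print(rows_left)     # 1
print(levels_after)  # [['A', 'B'], [5, 6]]
```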

However, all is not lost. Depending on your application, if you believe the index really is taking up a large share of your memory, you can spend a little time doing the following:

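def df_sliced_index(df):
    # Rebuild the frame row by row so that the new MultiIndex is created
    # only from the index tuples that are still present in the slice.
    new_index = []
    rows = []
    for ind, row in df.iterrows():
        new_index.append(ind)  # index tuple, e.g. ('B', 5)
        rows.append(row)       # row data as a Series
    return pd.DataFrame(data=rows, index=pd.MultiIndex.from_tuples(new_index))

print(df_sliced_index(df[df['P'] > 90]).index)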

Which yields what I believe is the desired output:

MultiIndex(levels=[[u'B'], [5]], labels=[[0], [0]])

But if the data is large enough that the size of the index worries you, I wonder how much this row-by-row rebuild will cost you in terms of time.
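As a possible alternative to rebuilding the frame row by row, newer pandas releases provide MultiIndex.remove_unused_levels(), which returns a new index whose levels contain only the values still in use. A minimal sketch, assuming a pandas version that includes this method, with the P values again hard-coded to match the example above:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([['A', 'B'], [5, 6]])
df = pd.DataFrame({'P': [5, 51, 93, 76]}, index=idx)  # hard-coded values

df_slice = df[df['P'] > 90].copy()

# Swap in a new index that drops the level values no row refers to.
df_slice.index = df_slice.index.remove_unused_levels()

compact_levels = [list(level) for level in df_slice.index.levels]
print(compact_levels)  # [['B'], [5]]
```

Because the index is replaced rather than edited in place, this stays consistent with indexes being immutable, and it avoids the Python-level loop over iterrows().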

