I recently ran into this issue with a large DataFrame and its associated MultiIndex.
This simplified example demonstrates it.
import pandas as pd
import numpy as np
np.random.seed(1)
idx = pd.MultiIndex.from_product([['A', 'B'], [5, 6]])
df = pd.DataFrame(data=np.random.randint(1, 100, 4), index=idx, columns=['P'])
print(df)
Which yields:
      P
A 5  38
  6  13
B 5  73
  6  10
Now take a quick look at the index:
print(df.index)
MultiIndex(levels=[[u'A', u'B'], [5, 6]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
If I slice this DataFrame, the levels of the MultiIndex are never condensed, even with a deep copy.
What is the best way to reduce the memory footprint of the index after a slice operation?
df_slice = df[df['P'] > 20]
print(df_slice)
print(df_slice.index)
      P
A 5  38
B 5  73
See how the DataFrame has been reduced, but the index levels have not (6 is still listed):
MultiIndex(levels=[[u'A', u'B'], [5, 6]],
           labels=[[0, 1], [0, 0]])
The same happens with .copy(deep=True):
df_slice = df[df['P'] > 20].copy(deep=True)
print(df_slice.index)
MultiIndex(levels=[[u'A', u'B'], [5, 6]],
           labels=[[0, 1], [0, 0]])
I would have expected the MultiIndex to have the 6 removed, as shown:
MultiIndex(levels=[[u'A', u'B'], [5]],
           labels=[[0, 1], [0, 0]])
This becomes a problem in practice when the DataFrame is large, since the unused level values are carried along with every slice.
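For what it's worth, one possible approach is sketched below. It assumes pandas 0.20 or later, where MultiIndex has a remove_unused_levels() method that returns a new index containing only the level values still referenced. The data is hard-coded to the four values printed above rather than drawn from the random generator, and this is only a minimal illustration, not necessarily the best option for every case.

```python
import pandas as pd

# Same frame as in the example above, with the four values written out
idx = pd.MultiIndex.from_product([['A', 'B'], [5, 6]])
df = pd.DataFrame(data=[38, 13, 73, 10], index=idx, columns=['P'])

# Slice, then copy so the result is an independent frame
df_slice = df[df['P'] > 20].copy()

# remove_unused_levels() returns a new MultiIndex, so assign it back
df_slice.index = df_slice.index.remove_unused_levels()

# The second level should now list only the values still in use
print(df_slice.index.levels)
```

The original df keeps its full set of levels; only the index assigned to df_slice is rebuilt.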