I'd like to be able to compute descriptive statistics on data in a Pandas DataFrame, but I only care about duplicated entries. For example, let's say I have the DataFrame created by:
import pandas as pd
data={'key1':[1,2,3,1,2,3,2,2],'key2':[2,2,1,2,2,4,2,2],'data':[5,6,2,6,1,6,2,8]}
frame=pd.DataFrame(data,columns=['key1','key2','data'])
print(frame)
   key1  key2  data
0     1     2     5
1     2     2     6
2     3     1     2
3     1     2     6
4     2     2     1
5     3     4     6
6     2     2     2
7     2     2     8
As you can see, rows 0, 1, 3, 4, 6, and 7 are all duplicates (using 'key1' and 'key2'). However, if I index this DataFrame like so:
frame[frame.duplicated(['key1','key2'])]
I get
   key1  key2  data
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8
(i.e., rows 0 and 1 do not show up, because by default the duplicated method marks only the second and later occurrences of a key pair as True, not the first one).
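One possible way around this, sketched here under the assumption that the installed pandas version supports the `keep` parameter of `duplicated`, is to pass `keep=False` so that every row belonging to a repeated key pair is flagged, including the first occurrence. The names `mask` and `all_dups` are only illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# keep=False flags every row whose (key1, key2) pair occurs more than once,
# rather than only the second and later occurrences (the default, keep='first').
mask = frame.duplicated(['key1', 'key2'], keep=False)

# 'all_dups' is an illustrative name for the subset of fully duplicated rows.
all_dups = frame[mask]
print(all_dups)
```

With the example data, this should select rows 0, 1, 3, 4, 6, and 7.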
That is my first problem. My second problem deals with how to extract the descriptive statistics from this information. Forgetting the missing duplicates for the moment, let's say I want to compute the .min() and .max() for the duplicate entries (so that I can get a range). I can use groupby and these methods on the groupby object like so (here a is the filtered DataFrame from above):
a.groupby(['key1','key2']).min()
which gives
           key1  key2  data
key1 key2
1    2        1     2     6
2    2        2     2     1
The data I want is obviously here, but what's the best way for me to extract it? How do I index the resulting object to get what I want (which is the key1,key2,data info)?
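As a rough sketch of one way to pull this out, assuming all members of each duplicated group are wanted (via `keep=False`) and that only the 'data' column needs summarizing, the grouped result can be aggregated with several functions at once and then flattened back into ordinary columns with `reset_index()`. The names `all_dups` and `stats` and the extra 'range' column are illustrative, not part of the original code.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

# Assumption: every member of a duplicated group should count, so keep=False.
all_dups = frame[frame.duplicated(['key1', 'key2'], keep=False)]

# Aggregate only the 'data' column, then turn the (key1, key2) index
# back into regular columns so the result can be indexed like any DataFrame.
stats = (all_dups.groupby(['key1', 'key2'])['data']
         .agg(['min', 'max'])
         .reset_index())

# Illustrative extra column: the spread of 'data' within each group.
stats['range'] = stats['max'] - stats['min']
print(stats)
```

After `reset_index()`, the keys are plain columns, so something like `stats[['key1', 'key2', 'min']]` or `stats.loc[0, 'max']` works as usual. If the grouped index is preferred instead, dropping the `reset_index()` call and using `stats.loc[(2, 2)]` would be an alternative.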