Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - How to analyze all duplicate entries in this Pandas DataFrame?

I'd like to be able to compute descriptive statistics on data in a Pandas DataFrame, but I only care about duplicated entries. For example, let's say I have the DataFrame created by:

import pandas as pd
data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2], 'key2': [2, 2, 1, 2, 2, 4, 2, 2], 'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])
print(frame)


     key1  key2  data
0     1     2     5
1     2     2     6
2     3     1     2
3     1     2     6
4     2     2     1
5     3     4     6
6     2     2     2
7     2     2     8

As you can see, rows 0, 1, 3, 4, 6, and 7 are all duplicates (using 'key1' and 'key2'). However, if I index this DataFrame like so:

frame[frame.duplicated(['key1','key2'])]

I get

   key1  key2  data
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

(i.e., rows 0 and 1 do not show up because the duplicated method does not mark them True).

That is my first problem. My second problem is how to extract descriptive statistics from this information. Setting aside the missing duplicates for the moment, let's say I want to compute the .min() and .max() for the duplicate entries (so that I can get a range). I can use groupby and these methods on the groupby object like so:

a = frame[frame.duplicated(['key1','key2'])]  # the filtered frame shown above
a.groupby(['key1','key2']).min()

which gives

           key1  key2  data
key1 key2                  
1    2        1     2     6
2    2        2     2     1

The data I want is obviously here, but what's the best way for me to extract it? How do I index the resulting object to get what I want (which is the key1,key2,data info)?


1 Reply


EDIT for Pandas 0.17 or later:

The take_last argument of the duplicated() method was deprecated in Pandas 0.17 in favour of the new keep argument, so the correct approach (see the linked answer) is now:

  • Invoke the duplicated() method with keep=False, i.e. frame.duplicated(['key1', 'key2'], keep=False).

Therefore, in order to extract the required data for this specific question, the following suffices:

In [81]: frame[frame.duplicated(['key1', 'key2'], keep=False)].groupby(['key1', 'key2']).min()
Out[81]: 
           data
key1 key2      
1    2        5
2    2        1

[2 rows x 1 columns]
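
Since the question also asks for a range and for a way to get the key1, key2, data information back out as plain columns, here is a minimal sketch of one way to do that. It assumes Pandas 0.17 or later (for keep=False); the names keys, dupes, stats and flat are introduced here for illustration only.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

keys = ['key1', 'key2']  # illustrative name, not from the question

# keep=False flags every row of a duplicated group, including the first one.
dupes = frame[frame.duplicated(keys, keep=False)]

# Min and max of 'data' per group; the group keys form a MultiIndex.
stats = dupes.groupby(keys)['data'].agg(['min', 'max'])
stats['range'] = stats['max'] - stats['min']

# reset_index() moves the key levels back into ordinary columns.
flat = stats.reset_index()
print(flat)
```

After reset_index() the keys can be selected like any other column, e.g. flat[['key1', 'key2', 'range']].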

Interestingly enough, this change in Pandas 0.17 may be partially attributed to this question, as referred to in this issue.


For versions preceding Pandas 0.17:

We can play with the take_last argument of the duplicated() method:

take_last: boolean, default False

For a set of distinct duplicate rows, flag all but the last row as duplicated. Default is for all but the first row to be flagged.

If we set take_last to True, we flag all but the last duplicate row. Combining this with the default value of False, which flags all but the first duplicate row, allows us to flag all duplicated rows:

In [76]: frame.duplicated(['key1', 'key2'])
Out[76]: 
0    False
1    False
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool

In [77]: frame.duplicated(['key1', 'key2'], take_last=True)
Out[77]: 
0     True
1     True
2    False
3    False
4     True
5    False
6     True
7    False
dtype: bool

In [78]: frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])
Out[78]: 
0     True
1     True
2    False
3     True
4     True
5    False
6     True
7     True
dtype: bool

In [79]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])]
Out[79]: 
   key1  key2  data
0     1     2     5
1     2     2     6
3     1     2     6
4     2     2     1
6     2     2     2
7     2     2     8

[6 rows x 3 columns]
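
For anyone re-running this on a Pandas version where take_last no longer exists, the OR-combination above can be checked against keep=False. This is only a sketch: keep='last' stands in for take_last=True, and the variable names are made up for illustration.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

keys = ['key1', 'key2']  # illustrative name

# keep='first' is the default: every occurrence except the first is flagged.
not_first = frame.duplicated(keys, keep='first')
# keep='last' replaces take_last=True: every occurrence except the last is flagged.
not_last = frame.duplicated(keys, keep='last')

# A row in a group of two or more is never both first and last,
# so OR-ing the two masks flags the whole group.
either = not_first | not_last
all_dupes = frame.duplicated(keys, keep=False)

print(either.equals(all_dupes))  # True
```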

Now we just need to use the groupby and min methods, and I believe the output is in the required format:

In [81]: frame[frame.duplicated(['key1', 'key2'], take_last=True) | frame.duplicated(['key1', 'key2'])].groupby(('key1', 'key2')).min()
Out[81]: 
           data
key1 key2      
1    2        5
2    2        1

[2 rows x 1 columns]
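
The question mentions descriptive statistics in general, and min() is only one of them. As a rough sketch, assuming a recent Pandas (keep=False needs 0.17 or later, and describe() on a grouped column returning one row per group is the behaviour of recent versions), the same filtered frame can be summarised in one call. The names keys, dupes and desc are illustrative.

```python
import pandas as pd

data = {'key1': [1, 2, 3, 1, 2, 3, 2, 2],
        'key2': [2, 2, 1, 2, 2, 4, 2, 2],
        'data': [5, 6, 2, 6, 1, 6, 2, 8]}
frame = pd.DataFrame(data, columns=['key1', 'key2', 'data'])

keys = ['key1', 'key2']  # illustrative name

# All rows that belong to a duplicated (key1, key2) group.
dupes = frame[frame.duplicated(keys, keep=False)]

# count, mean, std, min, quartiles and max of 'data' for each group.
desc = dupes.groupby(keys)['data'].describe()
print(desc)

# A single statistic for a single group can be looked up by its key tuple.
print(desc.loc[(2, 2), 'max'])
```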
