python - find duplicate rows in a pandas dataframe

Question

posted Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I am trying to find duplicate rows in a pandas DataFrame.

import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]], columns=['col1', 'col2'])

df
Out[15]: 
   col1  col2
0     1     2
1     3     4
2     1     2
3     1     4
4     1     2

duplicate_bool = df.duplicated(subset=['col1', 'col2'], keep='first')
duplicate = df.loc[duplicate_bool]

duplicate
Out[16]: 
   col1  col2
2     1     2
4     1     2

Is there a way to add a column containing the index of the first occurrence (the one kept)? The desired output is:

duplicate
Out[16]: 
   col1  col2  index_original
2     1     2               0
4     1     2               0

Note: df could be very large in my case.

1 Reply

深蓝 · Answer 1 · 2021-10-23T17:55:53+0000

Use groupby, create a new column of indexes, and then call duplicated:

df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')    
df[df.duplicated(subset=['col1','col2'], keep='first')]

   col1  col2  index_original
2     1     2               0
4     1     2               0

Details

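I group by the two key columns and then call transform with idxmin to get the first index label of each group. Because col1 is constant within each group, every value ties for the minimum, and idxmin returns the label of the first row among the ties.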

df.groupby(['col1', 'col2']).col1.transform('idxmin') 

0    0
1    1
2    0
3    3
4    0
Name: col1, dtype: int64
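As a minimal sketch of the same idea spelled differently, the index labels themselves can be grouped and each group's first label broadcast back to every row. This is an alternative to idxmin, not the approach used above, and the string index labels below are hypothetical, chosen only to show that labels (not positions) are returned.

```python
import pandas as pd

# Same data as in the question, but with hypothetical string labels as the index.
df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
    index=["a", "b", "c", "d", "e"],
)

# Group the index labels by the key columns and broadcast each group's
# first label to all rows of that group.
first_label = (
    df.index.to_series()
    .groupby([df["col1"], df["col2"]])
    .transform("first")
)

print(first_label.tolist())  # ['a', 'b', 'a', 'd', 'a']
```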

duplicated gives me a boolean mask that is True for every row repeating an earlier one; these are the rows I want to select:

df.duplicated(subset=['col1','col2'], keep='first')

0    False
1    False
2     True
3    False
4     True
dtype: bool
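Which rows the mask flags depends on the keep argument. The sketch below compares the three documented values on the question's data; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)
keys = ["col1", "col2"]

# keep="first": flag every repeat except the first occurrence.
later = df.duplicated(subset=keys, keep="first")
# keep="last": flag every repeat except the last occurrence.
earlier = df.duplicated(subset=keys, keep="last")
# keep=False: flag every row that has at least one twin.
every = df.duplicated(subset=keys, keep=False)

print(later.tolist())    # [False, False, True, False, True]
print(earlier.tolist())  # [True, False, True, False, False]
print(every.tolist())    # [True, False, True, False, True]
```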

The rest is just boolean indexing.
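The steps can also be wrapped in a small helper that leaves the original frame unmodified and attaches the new column only to the selected rows. This is a sketch under assumptions: the function name duplicates_with_original is hypothetical, and it uses the first label of each group rather than idxmin, which gives the same result here.

```python
import pandas as pd


def duplicates_with_original(frame, keys):
    """Return repeated rows plus the index label of their first occurrence."""
    # True for every row that repeats an earlier row on the key columns.
    mask = frame.duplicated(subset=keys, keep="first")
    # First index label of each key group, broadcast to every row.
    first_label = (
        frame.index.to_series()
        .groupby([frame[k] for k in keys])
        .transform("first")
    )
    # assign returns a new frame, so the input is not modified.
    return frame.loc[mask].assign(index_original=first_label[mask])


df = pd.DataFrame(
    data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
    columns=["col1", "col2"],
)
duplicate = duplicates_with_original(df, ["col1", "col2"])
print(duplicate)
```

For the sample data this returns rows 2 and 4, each with index_original equal to 0, and df keeps only its two original columns.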
