python - Remove pandas rows with duplicate indices

Question

Welcome To Ask or Share your Answers For Others

python - Remove pandas rows with duplicate indices

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

python - Remove pandas rows with duplicate indices

How to remove rows with duplicate index values?

In the weather DataFrame below, sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending a duplicate row to the end of a file.

I'm reading some automated weather data from the web (observations occur every 5 minutes, and compiled into monthly files for each weather station.) After parsing a file, the DataFrame looks like:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date                                                                                      
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

Example of a duplicate case:

import pandas 
import datetime

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
index = pandas.DatetimeIndex(start=startdate, end=enddate, freq='H')
data1 = {'A' : range(6), 'B' : range(6)}
data2 = {'A' : [20, -30, 40], 'B' : [-50, 60, -70]}
df1 = pandas.DataFrame(data=data1, index=index)
df2 = pandas.DataFrame(data=data2, index=index[:3])
df3 = df2.append(df1)

df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2

And so I need df3 to eventually become:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the group_by or pivot (or ???) statements to make that work.

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T21:21:38+0000

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')]

While all the other methods work, .drop_duplicates is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

Using the sample data provided:

>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop

>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 μs per loop

>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 μs per loop

Note that you can keep the last element by changing the keep argument to 'last'.

It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):

>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 μs per loop

>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 μs per loop

Categories

python - Remove pandas rows with duplicate indices

python - Remove pandas rows with duplicate indices

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags