Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
151 views
in Technique[技术] by (71.8m points)

python - Keep the data with latest date from a pandas dataframe

I have a csv file with company software details along with installed dates. I am reading the csv file using pandas in a dataframe. Below is the sample data containing two different software:

software_id software_name                               installed_date    software_version
8331        Intel(R) Graphics Media Accelerator Driver  2009-05-23 0:00   8.15.10.2008
8331        Intel(R) Graphics Media Accelerator Driver  2010-09-15 0:00   8.15.10.2008
8331        Intel(R) Graphics Media Accelerator Driver  2009-12-27 0:00   8.15.10.2008
8332        Wireless Switch Utility                     2009-12-22 0:00   4.3.1400.0
8332        Wireless Switch Utility                     2010-11-22 0:00   4.3.1400.0
8332        Wireless Switch Utility                     2011-01-25 0:00   4.3.1400.0

So from the above data I just need to keep one row with the latest date from each software code. For example, the output of the above file should be:

software_id     software_name                               installed_date    software_version
    8331        Intel(R) Graphics Media Accelerator Driver  2010-09-15 0:00   8.15.10.2008
    8332        Wireless Switch Utility                     2011-01-25 0:00   4.3.1400.0

How can I choose the set of rows for one software code and remove all rows but the one with the latest date and then move to the next software code until the file only has one entry per software code. I cannot hardcode the software_id to check as there are thousands of them.

My logic is to read and store the first software_id and installed_date in two variables and start reading the file line by line. Next line will check if the software_id matches the one stored in the variable and then compare the date and store the latest one in the variable. When the software_id does not match the stored software id then it means that new software_id block has started. It will then store the previous values in the dataframe and start executing next blocks and so on.

FYI - I am a pandas noob.

Thanks for all the help.

question from:https://stackoverflow.com/questions/65859559/keep-the-data-with-latest-date-from-a-pandas-dataframe

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

First you need to convert the installed_date column to datetime:

df['installed_date'] = pd.to_datetime(df['installed_date'])

Then you can use one of the options below:

Option 1: sort the values on installed_date then drop_duplicates keeping only the last row per software_id.

df.sort_values('installed_date').drop_duplicates('software_id', keep='last')

Option 2: group the dataframe on softaware_id and aggregate using idxmax to get the index of most recent date per software_id group, then use loc with this index to filter the required rows:

idx = df.groupby('software_id')['installed_date'].idxmax()
df.loc[idx]

Result:

   software_id                               software_name installed_date software_version
1         8331  Intel(R) Graphics Media Accelerator Driver     2010-09-15     8.15.10.2008
5         8332                     Wireless Switch Utility     2011-01-25       4.3.1400.0

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...