What methods are available to merge columns which have timestamps that do not exactly match?
DF1:
date        start_time           employee_id  session_id
01/01/2016  01/01/2016 06:03:13  7261824      871631182
DF2:
date        start_time           employee_id  session_id
01/01/2016  01/01/2016 06:03:37  7261824      871631182
I could join on ['date', 'employee_id', 'session_id'], but sometimes the same employee has multiple identical sessions on the same date, which produces duplicate rows. I could drop the rows where this happens, but then I would lose valid sessions.
Is there an efficient way to join when the DF2 timestamp is less than 5 minutes from the DF1 timestamp, and the session_id and employee_id also match? If there is a matching record, the DF2 timestamp will always be slightly later than the DF1 timestamp, because the event is triggered at some future point.
['employee_id', 'session_id', 'timestamp<5minutes']
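One way the join described above might be expressed is with pandas' `merge_asof`, which matches each left row to the nearest right row on an ordered key. The sketch below is only an illustration under some assumptions: `start_time` is already a datetime column, the sample rows are made up to mirror the ones above, and `df2_time` is a hypothetical helper column added so the DF2 timestamp survives the merge.

```python
import pandas as pd

# Made-up sample rows mirroring the frames above (assumption: datetime dtype).
df1 = pd.DataFrame({
    "start_time": pd.to_datetime(["2016-01-01 06:03:13", "2016-01-01 09:00:00"]),
    "employee_id": [7261824, 7261824],
    "session_id": [871631182, 871631182],
})
df2 = pd.DataFrame({
    "start_time": pd.to_datetime(["2016-01-01 06:03:37", "2016-01-01 09:20:00"]),
    "employee_id": [7261824, 7261824],
    "session_id": [871631182, 871631182],
})

# Hypothetical helper column: keep DF2's own timestamp visible after the merge.
df2 = df2.assign(df2_time=df2["start_time"])

# Both frames must be sorted on the "on" key.
merged = pd.merge_asof(
    df1.sort_values("start_time"),
    df2.sort_values("start_time"),
    on="start_time",
    by=["employee_id", "session_id"],   # exact-match keys
    direction="forward",                # DF2 timestamp is at or after DF1's
    tolerance=pd.Timedelta("5min"),     # ignore matches further away than this
)
print(merged)
```

In this sample the first DF1 row picks up the 06:03:37 record, and the second stays unmatched because the next DF2 record is 20 minutes later. Note that `merge_asof` does not reserve a DF2 row once it is used, so two DF1 rows close together could both match the same DF2 row; whether that matters depends on the data.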
Edit - I assumed someone would have run into this issue before.
I was thinking of doing this:
- Take my timestamp on each dataframe
- Create a column which is the timestamp + 5 minutes (rounded)
- Create a column which is the timestamp - 5 minutes (rounded)
- Create a 10 minute interval string to join the files on
from datetime import timedelta
df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
df1['interval_string'] = df1['low_time'].astype(str) + '-' + df1['high_time'].astype(str)
Does someone know how to round those 5 minute intervals to the nearest 5 minute mark?
02:59:37 - 5 min = 02:55:00
02:59:37 + 5 min = 03:05:00
interval_string = '02:55:00-03:05:00'
pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])
Does anyone know how to round the time like that? This seems like it could work: you still match on the date, employee, and session, and then you look for times that fall within the same 10-minute interval.
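The rounding step asked about above could be sketched with the `.dt.round` accessor, which rounds datetimes to the nearest multiple of a frequency. This is a minimal illustration, assuming `start_time` is a datetime column; the single sample row is made up to reproduce the 02:59:37 example.

```python
import pandas as pd

# Made-up single row reproducing the 02:59:37 example.
df1 = pd.DataFrame({"start_time": pd.to_datetime(["2016-01-01 02:59:37"])})

# Shift by 5 minutes each way, then round to the nearest 5-minute mark.
low = (df1["start_time"] - pd.Timedelta(minutes=5)).dt.round("5min")
high = (df1["start_time"] + pd.Timedelta(minutes=5)).dt.round("5min")

# Build the interval key as 'HH:MM:SS-HH:MM:SS'.
df1["interval_string"] = (
    low.dt.strftime("%H:%M:%S") + "-" + high.dt.strftime("%H:%M:%S")
)
print(df1["interval_string"].iloc[0])  # 02:55:00-03:05:00
```

One thing to watch with this key: two timestamps only seconds apart can still land in different intervals when they straddle a rounding boundary (for example 02:57:20 and 02:57:40), so an exact join on `interval_string` may miss some pairs that are well within 5 minutes of each other.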