I think the biggest problem here is to identify when the time intervals are overlapping, the rest is just grouping and addition.
First, be sure (if not already done) to convert your dates to datetime
and days in timedelta
. This will help to compare dates and durations and perform some math on them.
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
df['days'] = pd.to_timedelta(df['days'], unit='D')
This code produces your expected result:
def join_times(x):
startdf = pd.DataFrame({'time':x['start'], 'what':1})
enddf = pd.DataFrame({'time':x['end'], 'what':-1})
mergdf = pd.concat([startdf, enddf]).sort_values('time')
mergdf['running'] = mergdf['what'].cumsum()
mergdf['newwin'] = mergdf['running'].eq(1) & mergdf['what'].eq(1)
mergdf['group'] = mergdf['newwin'].cumsum()
x['group'] = mergdf['group'].loc[mergdf['what'].eq(1)]
res = x.groupby('group').agg({'days':'sum', 'start':'first'})
res['end'] = res.apply(lambda x : x['start'] + x['days'] + pd.to_timedelta(5, unit='D'), axis=1)
return res
ddf = df.groupby(['ID', 'ingredient', 'dose']).apply(join_times).reset_index().drop('group', axis=1)
This needs to be explained. As you see, I use groupby
to identify the subsamples. Then the job is done by the custom join_times
function.
join_times
function joins together in the same column of a single dataframe (column 'time'
) start and end times, sorted in order.
A second column 'what'
marks with +1 starting times, and with -1 ending time. These are used to keep track of how many intervals are overlapping (in colum 'running'
using cumsum()
).
Then a boolean column 'newwin'
is buildt to identify the beginning of a new non overlapping time interval and a column 'group'
is buildt to mark with the same integer the rows belonging to the same overlapping time interval.
The a 'group'
column is added to the original subsample, copying the values in the previously buildt 'group'
column. Finally, we can identify, for each subsample, which rows have overlapping.
So we can use groupby
again and sum the 'days'
column, keeping the first date from 'start'
column.
'end'
column is calculated by adding to 'start'
the duration 'days'
plus 5 days.
The above code, using your data sample, gives:
ID ingredient dose days start end
0 1000 Metron... 125.0 17 days 2018-10-03 2018-10-25
1 1001 Cefalexin 150.0 7 days 2018-03-08 2018-03-20
2 1001 Cefalexin 150.0 13 days 2018-09-17 2018-10-05
3 1002 Amoxiclav 75.0 7 days 2018-05-18 2018-05-30
4 1002 Amoxiclav 100.0 7 days 2018-05-25 2018-06-06
5 1003 Amoxiclav 50.0 30 days 2018-07-01 2018-08-05
Which is your expected result. Column order is different due to groupby
operations with indexing.