I have the following test DataFrame:
import random
from datetime import timedelta
import pandas as pd
import datetime
#create test range of dates
rng=pd.date_range(datetime.date(2015,1,1),datetime.date(2015,7,31))
rnglist=rng.tolist()
testpts = range(100,121)
#create test dataframe
d={'jid':[i for i in range(100,121)], 'cid':[random.randint(1,2) for _ in testpts],
'stdt':[rnglist[random.randint(0,len(rng))] for _ in testpts]}
df=pd.DataFrame(d)
df['enddt'] = df['stdt']+timedelta(days=random.randint(2,32))
Which gives a dataframe like the below, with a company id column 'cid', a unique id column 'jid', a start date 'stdt', and an enddt 'enddt'.
cid jid stdt enddt
0 1 100 2015-07-06 2015-07-13
1 1 101 2015-07-15 2015-07-22
2 2 102 2015-07-12 2015-07-19
3 2 103 2015-07-07 2015-07-14
4 2 104 2015-07-14 2015-07-21
5 1 105 2015-07-11 2015-07-18
6 1 106 2015-07-12 2015-07-19
7 2 107 2015-07-01 2015-07-08
8 2 108 2015-07-10 2015-07-17
9 2 109 2015-07-09 2015-07-16
What I need to do is the following: Count the number of jid that occur by cid, for each date(newdate) between the min(stdt)
and max(enddt), where the newdate is between the stdt and
the enddt.
The resulting data set should be a dataframe that has for each cid, a column range of dates (newdate) that is between the min(stdt) and the max(enddt) specific to each cid, and a count (cnt) of the number of jid that the newdate is between of the min(stdt) and max(enddt). That resulting DataFrame should look like (this is just for 1 cid using above data):
cid newdate cnt
1 2015-07-06 1
1 2015-07-07 1
1 2015-07-08 1
1 2015-07-09 1
1 2015-07-10 1
1 2015-07-11 2
1 2015-07-12 3
1 2015-07-13 3
1 2015-07-14 2
1 2015-07-15 3
1 2015-07-16 3
1 2015-07-17 3
1 2015-07-18 3
1 2015-07-19 2
1 2015-07-20 1
1 2015-07-21 1
1 2015-07-22 1
I believe there should be a way to use pandas groupby (groupby cid), and some form of lambda(?) to pythonically create this new dataframe.
I currently run a loop that for each cid (I slice the cid rows out of the master df), in the loop determine the relevant date range (min stdt and max enddt for each cid frame, then for each of those newdates (range mindate-maxdate) it counts the number of jid where the newdate is between the stdt and enddt of each jid. Then I append each resulting dataset into a new dataframe which looks like the above.
But this is very expensive from a resource and time perspective. Doing this on millions of jid for thousands of cid literally takes a full day. I am hoping there is a simple(r) pandas solution here.
See Question&Answers more detail:
os