Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - Pandas Dataframe - for each row, return count of other rows with overlapping dates

I've got a dataframe with projects, start dates, and end dates. For each row, I would like to return the number of other projects that were in progress when that project started. How do you nest loops when using df.apply()? I've tried using a for loop, but my dataframe is large and it takes far too long.

import datetime as dt
import pandas as pd

data = {'project' :['A', 'B', 'C'],
        'pr_start_date':[dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
        'pr_end_date': [dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1), dt.datetime(2019, 8, 1)]}

df = pd.DataFrame(data)

def cons_overlap(start):
    overlaps = 0
    for i in df.index:
        other_start = df.loc[i, 'pr_start_date']
        other_end = df.loc[i, 'pr_end_date']
        if (start > other_start) and (start < other_end):
            overlaps += 1

    return overlaps

df['overlap'] = df.apply(lambda row: cons_overlap(row['pr_start_date']), axis=1)

This is the output I'm looking for:

  project pr_start_date pr_end_date  overlap
0       A    2018-09-01  2019-06-15        0
1       B    2019-04-01  2019-12-01        1
2       C    2019-06-08  2019-08-01        2

1 Reply


I suggest you take advantage of numpy broadcasting:

ends = df.pr_start_date.values < df.pr_end_date.values[:, None]
starts = df.pr_start_date.values > df.pr_start_date.values[:, None]
df['overlap'] = (ends & starts).sum(0)
print(df)

Output

  project pr_start_date pr_end_date  overlap
0       A    2018-09-01  2019-06-15        0
1       B    2019-04-01  2019-12-01        1
2       C    2019-06-08  2019-08-01        2

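Both ends and starts are 3x3 boolean matrices that are True where the condition is met. Entry [i, j] compares the start date of project j (the column) with the dates of project i (the row):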

# ends   
[[ True  True  True]  
 [ True  True  True]
 [ True  True  True]]

# starts
[[False  True  True]
 [False False  True]
 [False False False]]

Then take the intersection with the logical & and sum along axis 0 (sum(0)), i.e. down each column, which gives one count per project.
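
One caveat for large frames: the broadcasting version materializes n x n boolean matrices, so its memory use grows quadratically with the number of rows. Below is a minimal sketch of a sort-based alternative using np.searchsorted, which is not part of the answer above. It assumes every project starts strictly before it ends and that no dates are missing; the function names overlap_broadcast and overlap_sorted are made up for illustration.

```python
import datetime as dt

import numpy as np
import pandas as pd


def overlap_broadcast(df):
    """Broadcasting version: builds an n x n boolean matrix."""
    starts = df["pr_start_date"].to_numpy()
    ends = df["pr_end_date"].to_numpy()
    # Entry [i, j] is True when project i is running at the moment
    # project j starts (strict comparisons, so a project never counts itself).
    running = (starts[:, None] < starts) & (starts < ends[:, None])
    # Summing down each column gives one count per project j.
    return running.sum(axis=0)


def overlap_sorted(df):
    """Sort-based version: O(n log n) time, O(n) extra memory.

    Assumption: start < end for every row and no missing dates.
    """
    starts = df["pr_start_date"].to_numpy()
    ends = df["pr_end_date"].to_numpy()
    sorted_starts = np.sort(starts)
    sorted_ends = np.sort(ends)
    # Number of projects that started strictly before each start date ...
    started_before = np.searchsorted(sorted_starts, starts, side="left")
    # ... minus the number that had already ended on or before it.
    ended_by = np.searchsorted(sorted_ends, starts, side="right")
    return started_before - ended_by


# Sample data from the question.
df = pd.DataFrame({
    "project": ["A", "B", "C"],
    "pr_start_date": [dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
    "pr_end_date": [dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1), dt.datetime(2019, 8, 1)],
})
df["overlap"] = overlap_sorted(df)
print(df)
```

On the sample data both functions return [0, 1, 2]. The sorted version works because, when every project ends after it starts, any project that has ended by a given date must also have started before it, so subtracting the two counts leaves only the projects still running.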



...