Grouping data by id, var1 into consecutive dates in python using pandas

Question

Welcome To Ask or Share your Answers For Others

Grouping data by id, var1 into consecutive dates in python using pandas

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

Grouping data by id, var1 into consecutive dates in python using pandas

I have some data that looks like:

df_raw_dates = pd.DataFrame({"id": [102, 102, 102, 103, 103, 103, 104], "var1": ['a', 'b', 'a', 'b', 'b', 'a', 'c'],
                             "val": [9, 2, 4, 7, 6, 3, 2],
                             "dates": [pd.Timestamp(2020, 1, 1),
                                       pd.Timestamp(2020, 1, 1),
                                       pd.Timestamp(2020, 1, 2),
                                       pd.Timestamp(2020, 1, 2),
                                       pd.Timestamp(2020, 1, 3),
                                       pd.Timestamp(2020, 1, 5),
                                       pd.Timestamp(2020, 3, 12)]})

I want group this data into IDs and var1 where the dates are consecutive, if a day is missed I want to start a new record.

For example the final output should be:

df_end_result = pd.DataFrame({"id": [102, 102, 103, 103, 104], "var1": ['a', 'b', 'b', 'a', 'c'],
                              "val": [13, 2, 13, 3, 2],
                              "start_date": [pd.Timestamp(2020, 1, 1),
                                             pd.Timestamp(2020, 1, 1),
                                             pd.Timestamp(2020, 1, 2),
                                             pd.Timestamp(2020, 1, 5),
                                             pd.Timestamp(2020, 3, 12)],
                              "end_date": [pd.Timestamp(2020, 1, 2),
                                           pd.Timestamp(2020, 1, 1),
                                           pd.Timestamp(2020, 1, 3),
                                           pd.Timestamp(2020, 1, 5),
                                           pd.Timestamp(2020, 3, 12)]})

I have tried this a few ways and keep failing, the length of time that something can exist for is unknown and the possible number of var1 can change with each id and with date window as well.

For example I have tried to identify consecutive days like this, but it always returns ['count_days'] == 0 (clearly something is wrong!). Then I thought I could take date(min) and date(min)+count_days to get 'start_date' and 'end_date'

s = df_raw_dates.groupby(['id','var1']).dates.diff().eq(pd.Timedelta(days=1))
s1 = s | s.shift(-1, fill_value=False)
df['count_days'] = np.where(s1, s1.groupby(df.id).cumsum(), 0)

I have also tried:

df = df_raw_dates.groupby(['id', 'var1']).agg({'val': 'sum', 'date': ['first', 'last']}).reset_index()

Which gets me closer, but I don't think this deals with the consecutive days problem but instead provides the earliest and latest day which unfortunately isn't something that I can take forward.

EDIT: adding more context

Another approach is:

df = df_raw_dates.groupby(['id', 'dates']).size().reset_index().rename(columns={0: 'del'}).drop('del', axis=1)

which provides a list of ids and dates, but I am getting stuck with finding min max consecutive dates within this new window

Extended example that has a break in the date range for group (102,'a').

df_raw_dates = pd.DataFrame(
    {
        "id": [102, 102, 102, 103, 103, 103, 104, 102, 102, 102, 102, 108, 108],
        "var1": ["a", "b", "a", "b", "b", "a", "c", "a", "a", "a", "a", "a", "a"],
        "val": [9, 2, 4, 7, 6, 3, 2, 1, 2, 3, 4, 99, 99],
        "dates": [
            pd.Timestamp(2020, 1, 1),
            pd.Timestamp(2020, 1, 1),
            pd.Timestamp(2020, 1, 2),
            pd.Timestamp(2020, 1, 2),
            pd.Timestamp(2020, 1, 3),
            pd.Timestamp(2020, 1, 5),
            pd.Timestamp(2020, 3, 12),
            pd.Timestamp(2020, 1, 3),
            pd.Timestamp(2020, 1, 7),
            pd.Timestamp(2020, 1, 8),
            pd.Timestamp(2020, 1, 9),
            pd.Timestamp(2020, 1, 21),
            pd.Timestamp(2020, 1, 25),
        ],
    }
)

Further example

This is using the anwser below from wwii

import pandas as pd
import collections

df_raw_dates1 = pd.DataFrame(
    {
        "id": [100,105,105,105,100,105,100,100,105,105,105,105,105,105,105,105,105,105,105,105,105,105,105],
        "var1": ["a","b","d","a","d","c","b","b","b","a","c","d","c","a","d","b","a","d","b","b","d","c","a"],
        "val": [0, 2, 0, 0, 0, 0, 0, 0, 9, 1, 0, 1, 1, 0, 9, 5, 10, 12, 13, 15, 0, 1, 2 ],
        "dates": [
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 22),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 21),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 20),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 19),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18),
            pd.Timestamp(2021, 1, 18)

        ],
    }
)

day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates1.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)

for k,g in gb:
    # print(g)
    eyed, var1 = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accomodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(g.val.sum())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(), date_range['dates'].max()
        val = date_range.val.sum()
        new_df['id'].append(eyed)
        new_df['var1'].append(var1)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)

print(pd.DataFrame(new_df))

>>>    id var1   val      start        end
0   100    a   0.0 2021-01-22 2021-01-22
1   100    b   0.0 2021-01-22 2021-01-22
2   100    d   0.0 2021-01-22 2021-01-22

3   105    a   0.0 2021-01-22 2021-01-22
4   105    a   1.0 2021-01-21 2021-01-21
5   105    a   0.0 2021-01-20 2021-01-20
6   105    a  10.0 2021-01-19 2021-01-19

7   105    b   2.0 2021-01-22 2021-01-22
8   105    b   9.0 2021-01-21 2021-01-21
9   105    b   5.0 2021-01-20 2021-01-20
10  105    b  13.0 2021-01-19 2021-01-19

From the above I would have expected the rows 3,4,5,6 to be grouped together and 7,8,9,10 also. I am not sure why this example now breaks?

Not sure what the difference with this example and the extended example above is and why this seems to not work?

question from:https://stackoverflow.com/questions/65849441/grouping-data-by-id-var1-into-consecutive-dates-in-python-using-pandas

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

1.4m articles

1.4m replys

5 comments

57.0k users

Most popular tags

javascript python c# java How android c++ php ios html sql r c node.js .net iphone asp.net css reactjs jquery ruby What Android objective mysql linux Is git Python windows Why regex angular swift amazon excel algorithm macos Java visual how bash Can multithreading PHP Using scala angularjs typescript apache spring performance postgresql database flutter json rust arrays C# dart vba django wpf xml vue.js In go Get google jQuery xcode jsf http Google mongodb string shell oop powershell SQL C++ security assembly docker Javascript Android: Does haskell Convert azure debugging delphi vb.net Spring datetime pandas oracle math Django

联盟问答网站-Union QA website

Xstack问答社区

生活宝问答社区

OverStack问答社区

Ostack问答社区

在这了问答社区

在哪了问答社区

Xstack问答社区

无极谷问答社区

TouSu问答社区

SQlite问答社区

Qi-U问答社区

MLink问答社区

Jonic问答社区

Jike问答社区

16892问答社区

Vigges问答社区

55276问答社区

OGeek问答社区

深圳家问答社区

深圳家问答社区

深圳家问答社区

Vigges问答社区

Vigges问答社区

在这了问答社区

DevDocs API Documentations

Xstack问答社区

生活宝问答社区

OverStack问答社区

Ostack问答社区

在这了问答社区

在哪了问答社区

Xstack问答社区

无极谷问答社区

TouSu问答社区

SQlite问答社区

Qi-U问答社区

MLink问答社区

Jonic问答社区

Jike问答社区

16892问答社区

Vigges问答社区

55276问答社区

OGeek问答社区

深圳家问答社区

深圳家问答社区

深圳家问答社区

Vigges问答社区

Vigges问答社区

在这了问答社区

在这了问答社区

DevDocs API Documentations

Xstack问答社区

生活宝问答社区

OverStack问答社区

Ostack问答社区

在这了问答社区

在哪了问答社区

Xstack问答社区

无极谷问答社区

TouSu问答社区

SQlite问答社区

Qi-U问答社区

MLink问答社区

Jonic问答社区

Jike问答社区

16892问答社区

Vigges问答社区

55276问答社区

OGeek问答社区

深圳家问答社区

深圳家问答社区

深圳家问答社区

Vigges问答社区

Vigges问答社区

在这了问答社区

DevDocs API Documentations

广告位招租

深蓝 · Answer 1 · 2021-10-06T19:29:33+0000

I don't have Pandas superpowers so I never try to do groupby one-liners, maybe someday.

Adapting the accepted answer to SO question Find group of consecutive dates in Pandas DataFrame - first group by ['id','var1']; for each group group by consecutive date ranges.

import pandas as pd
sep = "************************************
"
day = pd.Timedelta('1d')
# using the extended example in the question.
gb = df_raw_dates.groupby(['id', 'var1'])

for k,g in gb:
    print(g)
    dt = g['dates']
    # find difference in days between rows
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)

    # create a Series to identify consecutive ranges to group by
    # this cumsum trick can be found in many SO answers
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    # split into date ranges
    date_groups = g.groupby(groups)
    for _,date_range in date_groups:
        print(date_range)
    print(sep)

You can see that the (102,'a') group has been split into two groups.

    id var1  val      dates
0  102    a    9 2020-01-01
2  102    a    4 2020-01-02
7  102    a    1 2020-01-03
     id var1  val      dates
8   102    a    2 2020-01-07
9   102    a    3 2020-01-08
10  102    a    4 2020-01-09

Going a bit further: while iterating construct a dictionary to make a new DataFrame with.

import pandas as pd
import collections
day = pd.Timedelta('1d')
# again using the extended example in the question
gb = df_raw_dates.groupby(['id', 'var1'])
new_df = collections.defaultdict(list)
for k,g in gb:
    # print(g)
    eyed,var = k
    dt = g['dates']
    in_block = ((dt - dt.shift(-1)).abs() == day) | (dt.diff() == day)
    filt = g.loc[in_block]
    breaks = filt['dates'].diff() != day
    groups = breaks.cumsum()
    date_groups = g.groupby(groups)
    # print(k,groups,groups.any())
    # accomodate groups with only one date
    if not groups.any():
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(g.val.mean())
        new_df['start'].append(g.dates.min())
        new_df['end'].append(g.dates.max())
        continue

    for _,date_range in date_groups:
        start,end = date_range['dates'].min(),date_range['dates'].max()
        val = date_range.val.mean()
        new_df['id'].append(eyed)
        new_df['var1'].append(var)
        new_df['val'].append(val)
        new_df['start'].append(start)
        new_df['end'].append(end)


print(pd.DataFrame(new_df))

>>>
    id var1        val      start        end
0  102    a   4.666667 2020-01-01 2020-01-03
1  102    a   3.000000 2020-01-07 2020-01-09
2  102    b   2.000000 2020-01-01 2020-01-01
3  103    a   3.000000 2020-01-05 2020-01-05
4  103    b   6.500000 2020-01-02 2020-01-03
5  104    c   2.000000 2020-03-12 2020-03-12
6  108    a  99.000000 2020-01-21 2020-01-25

Seems pretty tedious, maybe someone will come along with a less-verbose solution. Maybe some of the operations could be put in functions and .apply or .transform or .pipe could be used making it a little cleaner.

It does not account for ('id','var1') groups that have more than one date but only single date ranges. e.g.

     id var1  val      dates
11  108    a   99 2020-01-21
12  108    a   99 2020-01-25

You might need to detect if there are any gaps in a datetime Series and use that fact to accommodate.

Categories

Grouping data by id, var1 into consecutive dates in python using pandas

Grouping data by id, var1 into consecutive dates in python using pandas

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags