How do I calculate a "rolling" statistic on this pandas table, but with the time-window centered on the datapoint?

posted Oct 7, 2021 in Technique[技术] by 深蓝 (71.8m points)

Suppose I have the following pandas table:

import pandas as pd
import math
l = [['f8196bb6d34a9f44e950e30f15e1a2ab_6862', 1605148870, 51.98157826, 5.85744811], ['f8196bb6d34a9f44e950e30f15e1a2ab_6862', 1605141900, 51.98157842, 5.85744476], ['f8196bb6d34a9f44e950e30f15e1a2ab_6862', 1605145244, 51.98157826, 5.85744811], ['f8196bb6d34a9f44e950e30f15e1a2ab_6862', 1605153343, 51.98157826, 5.85744811], ['f8196bb6d34a9f44e950e30f15e1a2ab_6862', 1605143645, 51.98157842, 5.85744476], ['f8196bb6d34a9f44e950e30f15e1a2ab_6862', 1605159323, 51.98157826, 5.85744811], ['f8196bb6d34a9f44e950e30f15e1a2ab_6862', 1605157740, 51.98157826, 5.85744811], ['f8196bb6d34a9f44e950e30f15e1a2ab_6862', 1605150342, 51.98157826, 5.85744811]]
d = pd.DataFrame.from_records(l, columns=['device_zip', 'ts', 'lat', 'lon'])
d.sort_values(by=['ts'], inplace=True)
d['t'] = pd.to_datetime(d['ts'].astype(int), unit='s')
d['dummy'] = d.t.dt.hour

How do I calculate a new column that, for every row, counts the number of rows with dummy > 40 in the interval from 1 minute before to 1 minute after that row's timestamp? I've played around with the rolling function, which accepts a time-window parameter, but I don't think it can center the time window on each row.

I've been able to do what I want with an ugly loop construct, but it's quite slow. There must be a faster and more elegant way to do this.

question from:https://stackoverflow.com/questions/65851888/how-do-i-calculate-a-rolling-statistic-on-this-pandas-table-but-with-the-time

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:29:30+0000

Indeed, centered rolling over a datetime window does not seem to be possible. One workaround is to compute two rolling sums, each over half of the window you want, with the second one applied to the reversed data ([::-1]) so that it looks forward in time, and then subtract the row's own value, since it has been counted twice. The data you provided make it hard to demonstrate this (dummy is the hour of the day, so it is never above 40), so here is some random data:

# random data: 20 distinct timestamps drawn from a 200-second range,
# with dummy equal to 39 or 41
import numpy as np
import pandas as pd
np.random.seed(2)
nb_rows = 20
d = pd.DataFrame(
    {'t':np.sort(
            np.random.choice(
                pd.date_range('2020-01-22 12:00:00',periods=nb_rows*10, freq='1s'),
                size=nb_rows, replace=False)),
     'dummy':40+np.random.choice([1,-1], size=nb_rows)})

Next, create a boolean column that marks the rows meeting your criterion, so that a rolling sum counts them. Then define the half window and add the backward-looking and forward-looking rolling sums:

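# True where the row meets the criterion; summing booleans counts the matches
d['dummy_count'] = d['dummy']>40
semi_win = '1min' # half window: one minute on each side gives a centered 2-minute window ('1T' is the older alias)
d['roll_2T'] = (
    # matches in the minute up to and including the row
    d.rolling(window=semi_win, min_periods=1, on='t')['dummy_count'].sum()
    # matches in the minute starting at the row, computed on the reversed frame
    + d[::-1].rolling(window=semi_win, min_periods=1, on='t')['dummy_count'].sum()
    # the row itself is included in both sums, so remove it once
    - d['dummy_count']
)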

print(d)
                     t  dummy  dummy_count  roll_2T
0  2020-01-22 12:00:02     39        False      3.0 # value is 41 for 3 times up to 12:01:02
1  2020-01-22 12:00:03     39        False      3.0
2  2020-01-22 12:00:10     39        False      3.0
3  2020-01-22 12:00:12     39        False      3.0
4  2020-01-22 12:00:13     39        False      3.0
5  2020-01-22 12:00:14     41         True      3.0 
6  2020-01-22 12:00:29     39        False      3.0
7  2020-01-22 12:00:35     39        False      4.0
8  2020-01-22 12:00:44     41         True      4.0
9  2020-01-22 12:00:54     41         True      5.0 
10 2020-01-22 12:01:25     39        False      4.0 # 4 times 41 between 12:00:25 and 12:02:25
11 2020-01-22 12:01:32     41         True      4.0
12 2020-01-22 12:01:52     41         True      3.0
13 2020-01-22 12:01:53     39        False      3.0
14 2020-01-22 12:01:55     39        False      3.0
15 2020-01-22 12:02:06     39        False      3.0
16 2020-01-22 12:02:54     41         True      2.0
17 2020-01-22 12:03:02     39        False      2.0
18 2020-01-22 12:03:13     41         True      2.0
19 2020-01-22 12:03:19     39        False      2.0
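
As a cross-check on the double-rolling trick above, the same kind of centered count can be sketched without rolling at all, using a prefix sum and np.searchsorted. This is only a minimal sketch under stated assumptions: the helper name centered_count, the column name count_2min and the sample data are hypothetical (not from the question), the timestamps must already be sorted in ascending order, and both window edges are treated as inclusive, which may differ from the rolling-based version for rows lying exactly one minute away.

```python
import numpy as np
import pandas as pd


def centered_count(times, flags, half_window):
    """Count flagged rows within +/- half_window of each row (edges included).

    Hypothetical helper; assumes `times` is sorted in ascending order.
    """
    t = pd.Series(times).to_numpy().astype("datetime64[ns]")
    half = pd.Timedelta(half_window).to_timedelta64()
    # Prefix sums: csum[k] is the number of flagged rows among the first k rows.
    csum = np.concatenate(([0], np.cumsum(np.asarray(flags, dtype=int))))
    # Position of the first row with time >= t - half.
    lo = np.searchsorted(t, t - half, side="left")
    # Position one past the last row with time <= t + half.
    hi = np.searchsorted(t, t + half, side="right")
    return csum[hi] - csum[lo]


# Hypothetical sample data, not taken from the question.
demo = pd.DataFrame({
    "t": pd.Timestamp("2020-01-22 12:00:00")
         + pd.to_timedelta([0, 10, 50, 75, 100, 140, 210], unit="s"),
    "dummy": [41, 39, 41, 41, 39, 41, 39],
})
demo["count_2min"] = centered_count(demo["t"], demo["dummy"] > 40, "1min")
print(demo)
```

Because every row is handled by two binary searches, this avoids a Python-level loop. Newer pandas releases (1.3 and later, to my knowledge) also accept center=True together with an offset window such as '2min', which may make both workarounds unnecessary; if you go that way, check how the closed parameter treats the window edges against your requirement.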
