Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
454 views
in Technique[技术] by (71.8m points)

python - Pandas rolling max for time series data

I get 2 different behaviours when applying rolling("1D").max() on 2 datasets in Jupyter notebook.

I need to calculate rolling max for each day.

Sample:
df = pd.DataFrame({'B': [0, 4, 3, 3, 4, 2, 1, 2, 3, 4]},
                  index = [pd.Timestamp('20130101 09:00:00'),
                           pd.Timestamp('20130101 09:02:02'),
                           pd.Timestamp('20130101 09:03:03'),
                           pd.Timestamp('20130101 09:04:05'),
                           pd.Timestamp('20130101 09:15:06'),                          
                           pd.Timestamp('20130102 09:16:06'),
                           pd.Timestamp('20130102 09:17:06'),
                           pd.Timestamp('20130102 09:35:06'),
                           pd.Timestamp('20130102 09:36:06'),
                           pd.Timestamp('20130102 09:37:06')])

df.rolling("1D").max() #gives desired output

                        B
2013-01-01 09:00:00     0.0
2013-01-01 09:02:02     4.0
2013-01-01 09:03:03     4.0
2013-01-01 09:04:05     4.0
2013-01-01 09:15:06     4.0
2013-01-02 09:16:06     2.0 # <- 2 is the highest value for new day
2013-01-02 09:17:06     2.0
2013-01-02 09:35:06     2.0
2013-01-02 09:36:06     3.0
2013-01-02 09:37:06     4.0

When I try to apply to actual data I get

# Sample data
data = '{"High":{"1611221400000":0.99615,"1611222300000":0.9751,"1611223200000":1.035,"1611224100000":0.9894,"1611225000000":1.385,"1611225900000":1.345,"1611226800000":1.235,"1611227700000":1.245,"1611228600000":1.315,"1611229500000":1.295,"1611230400000":1.28,"1611231300000":1.295,"1611232200000":1.415,"1611233100000":1.415,"1611234000000":1.355,"1611234900000":1.385,"1611235800000":1.335,"1611236700000":1.325,"1611237600000":1.365,"1611238500000":1.445,"1611239400000":1.515,"1611240300000":1.475,"1611241200000":1.405,"1611242100000":1.375,"1611243000000":1.255,"1611243900000":1.225,"1611307800000":1.375,"1611308700000":1.415,"1611309600000":1.495}}'
df2 = pd.read_json(data)

df2.rolling("1D").max()
# keeps rolling from previous day

    High
Date    
2021-01-21 09:30:00     0.99615
2021-01-21 09:45:00     0.99615
2021-01-21 10:00:00     1.03500
2021-01-21 10:15:00     1.03500
2021-01-21 10:30:00     1.38500
2021-01-21 10:45:00     1.38500
2021-01-21 11:00:00     1.38500
2021-01-21 11:15:00     1.38500
2021-01-21 11:30:00     1.38500
2021-01-21 11:45:00     1.38500
2021-01-21 12:00:00     1.38500
2021-01-21 12:15:00     1.38500
2021-01-21 12:30:00     1.41500
2021-01-21 12:45:00     1.41500
2021-01-21 13:00:00     1.41500
2021-01-21 13:15:00     1.41500
2021-01-21 13:30:00     1.41500
2021-01-21 13:45:00     1.41500
2021-01-21 14:00:00     1.41500
2021-01-21 14:15:00     1.44500
2021-01-21 14:30:00     1.51500
2021-01-21 14:45:00     1.51500
2021-01-21 15:00:00     1.51500
2021-01-21 15:15:00     1.51500
2021-01-21 15:30:00     1.51500
2021-01-21 15:45:00     1.51500
2021-01-22 09:30:00     1.51500 # <- value got rolled from previous day
2021-01-22 09:45:00     1.51500
2021-01-22 10:00:00     1.51500

Pandas version = 0.25.1

Both DFs have DatetimeIndex, dtype='datetime64[ns]', freq=None

Any idea why this is happening?

question from:https://stackoverflow.com/questions/65848107/pandas-rolling-max-for-time-series-data

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

In both cases the rolling window opens a filter of one day (equals 24 hours).

I chaged your first example a bit, please see the output:

df = pd.DataFrame({'B': [0, 4, 3, 3, 4, 2, 1, 2, 3, 4]},
                  index = [pd.Timestamp('20130101 09:00:00'),
                           pd.Timestamp('20130101 09:02:02'),
                           pd.Timestamp('20130101 09:03:03'),
                           pd.Timestamp('20130101 09:04:05'),
                           pd.Timestamp('20130101 09:15:06'),                          
                           pd.Timestamp('20130102 09:13:06'), # <-- minus 3 minutes
                           pd.Timestamp('20130102 09:17:06'),
                           pd.Timestamp('20130102 09:35:06'),
                           pd.Timestamp('20130102 09:36:06'),
                           pd.Timestamp('20130102 09:37:06')])
df.rolling("1D").max()
>>> 
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:02:02  4.0
2013-01-01 09:03:03  4.0
2013-01-01 09:04:05  4.0
2013-01-01 09:15:06  4.0
2013-01-02 09:13:06  4.0 # <-- overlap of days
2013-01-02 09:17:06  2.0
2013-01-02 09:35:06  2.0
2013-01-02 09:36:06  3.0
2013-01-02 09:37:06  4.0

This means in both cases the rolling is doing the same.

If you want to get the rolling maximum per day you maybe want to do somthing like this:

df = df.groupby(df.index.day).rolling('1D').max()

and

df2 = df2.groupby(df2.index.day).rolling('1D').max()

which will return you a DataFrame with MultiIndex.

The MultiIndex can be reduced in the next step using

df.index = df.index.droplevel(0) 

and

df2.index = df2.index.droplevel(0) 

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...