I have some timeseries data as a Pandas dataframe which starts off with observations at 15 mins past the hour and 45 mins past (time intervals of 30 mins) then changes frequency to every minute. I want to resample the data so that it has a regular frequency of every 30 minutes, at 15 past and 45 past the hours for the whole dataframe.
I thought of two ways of achieving this.
1. Simply filter the dataframe for all observations at 15min and 45min, using the time-series data as a column in the dataframe.
2. Re-set the index so the time-series data is part of a multi-index (the 0th level of the index is the weather station, the 1st level is the time of the observation) and use the Pandas date-time timeseries functionality such as resample()
.
The original dataframe, weather, looks like this:
parsed_time Pressure Temp Hum
Station (index)
Bow 1 2018-04-15 14:15:00 1012 20.0 87
2 2018-04-15 14:45:00 1013 20.0 87
3 2018-04-15 15:15:00 1012 21.0 87
4 2018-04-15 15:45:00 1014 22.0 86
5 2018-04-15 16:00:00 1015 22.0 86
6 2018-04-15 16:01:00 1012 25.0 86
7 2018-04-15 16:02:00 1012 25.0 86
Stratford 8 2018-04-15 14:15:00 1011 18.0 87
9 2018-04-15 14:45:00 1011 18.0 87
10 2018-04-15 15:15:00 1012 18.0 87
11 2018-04-15 15:45:00 1014 19.0 86
12 2018-04-15 16:00:00 1014 19.0 86
13 2018-04-15 16:01:00 1015 19.0 86
14 2018-04-15 16:02:00 1016 20.0 86
15 2018-04-15 16:04:00 1016 20.0 86
With method 1, I have the problem that my boolean select operations don't seem to work as expected. For example
weather_test = weather[weather['parsed_time'].dt.minute == (15 & 45)]
gives parsed_time values like this:
2018-04-15 14:13:00
2018-04-15 15:13:00
weather_test = weather[weather['parsed_time'].dt.minute == (15 | 45)]
results in parsed_time values like this:
2018-04-15 14:47:00
2018-04-15 14:47:00
I can't find anything in the docs to explain this behaviour. What I want is pressure, temp, humidity by station at the following times:
2018-04-15 14:45:00
2018-04-15 15:15:00
2018-04-15 15:45:00
2018-04-15 16:15:00
and so on.
With method 2, I thought of resampling the data so that observations for which I have minute-by-minute data are replaced by the mean of the previous 30 minutes. This functionality only seems to work if the parsed_time column is part of the index, so I used the following code to set the parsed_time as part of a multi-index:
weather.set_index('parsed_time', append=True, inplace=True)
weather.index.set_names('station', level=0, inplace=True)
weather = weather.reset_index(level=1, drop=True)
to end up with a dataframe that looks like this:
Pressure Temp Hum
Station parsed_time
Bow 2018-04-15 14:15:00 1012 20.0 87
2018-04-15 14:45:00 1013 20.0 87
2018-04-15 15:15:00 1012 21.0 87
2018-04-15 15:45:00 1014 22.0 86
2018-04-15 16:00:00 1015 22.0 86
2018-04-15 16:01:00 1012 25.0 86
2018-04-15 16:02:00 1012 25.0 86
Stratford 2018-04-15 14:15:00 1011 18.0 87
2018-04-15 14:45:00 1011 18.0 87
2018-04-15 15:15:00 1012 18.0 87
2018-04-15 15:45:00 1014 19.0 86
2018-04-15 16:00:00 1014 19.0 86
2018-04-15 16:01:00 1015 19.0 86
2018-04-15 16:02:00 1016 20.0 86
2018-04-15 16:04:00 1016 20.0 86
Note that the sampling of observations varies from every 30 minutes at :15 past and :45 past to every minute (e.g. :01, :02, :14, etc), and it also varies by station - not all stations have every observation.
I tried this:
weather_test = weather.resample('30min', level=1).mean()
but this resamples without an offset and also gets rid of the station level in the multi-index.
The desired result is this:
Pressure Temp Hum
Station parsed_time
Bow 2018-04-15 14:15:00 1012 20.0 87
2018-04-15 14:45:00 1013 20.0 87
2018-04-15 15:15:00 1012 21.0 87
2018-04-15 15:45:00 1014 22.0 86
2018-04-15 16:15:00 1013 24.0 86
Stratford 2018-04-15 14:15:00 1011 18.0 87
2018-04-15 14:45:00 1011 18.0 87
2018-04-15 15:15:00 1012 18.0 87
2018-04-15 15:45:00 1014 19.0 86
2018-04-15 16:15:00 1015 19.5 86
where the minute-by-minute observations have been resampled as the mean over a 30-minute interval at :15 and :45 past the hour.
Keeping the station as a level in the multi-index is essential. I can't use the time index as an index on its own as the values repeat for each station (and are not unique).
All help appreciated as I have been going round in circles with this one for a while now. Thanks!
I have looked at quite a few previous posts including:
Boolean filter using a timestamp value on a dataframe in Python
How do I round datetime column to nearest quarter hour
and: Resampling a pandas dataframe with multi-index containing timeseries
which seems a bit complicated for something that should be quite simple ...
and the docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.resample.html
Thanks!
See Question&Answers more detail:
os