Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share

python - Filling in missing hourly data in Pandas

I have a DataFrame of hourly time-series measurements with the columns name, time, and output. For each name, the measurements come from roughly the same time period. I am trying to fill in the missing timestamps so that, for each day that has data, all 24 hours appear in the time column.

So I'm expecting a table like this:

 name  time                   output 
 x     2018-02-22 00:00:00    100 
       ...                    
 x     2018-02-22 23:00:00    200 
 x     2018-02-24 00:00:00    300 
       ...                    
 x     2018-02-24 23:00:00    300 
 y     2018-02-22 00:00:00    100 
       ...                   
 y     2018-02-22 23:00:00    200 
 y     2018-02-25 00:00:00    300 
       ...                         
 y     2018-02-25 23:00:00    300

To do this, I group by name and then apply a custom function that adds the missing timestamps to the corresponding DataFrame.

def add_missing_hours(df):
    start_date = df.time.iloc[0].date()
    end_date = df.time.iloc[-1].date()
    dates_range = pd.date_range(start_date, end_date, freq = '1H')
    new_dates = set(dates_range) - set(df.time)
    name = df["name"].iloc[0]
    df = df.append(pd.DataFrame({'GSRN':[name]*len(new_dates), 'time': new_dates}))
    return df

For some reason the name column is dropped when I create the DataFrame, but I can't understand why. Does anyone know why or have a better idea how to fill in the missing timestamps?
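
One likely cause, sketched here on made-up toy data: the new rows are built under the key 'GSRN' rather than 'name', so they end up in a separate GSRN column with NaN in name, which can look as if the name column was dropped. The snippet below reproduces that with pd.concat (DataFrame.append was removed in pandas 2.0). The sample values and the variable names `missing`, `wrong`, and `right` are illustrative only.

```python
import pandas as pd

# Toy data (illustrative values, not from the original dataset)
df = pd.DataFrame({
    "name": ["x", "x"],
    "time": pd.to_datetime(["2018-02-22 00:00", "2018-02-22 02:00"]),
    "output": [100, 200],
})
missing = [pd.Timestamp("2018-02-22 01:00")]

# Mismatched key: the added row lands in a new 'GSRN' column and 'name' is NaN
wrong = pd.concat(
    [df, pd.DataFrame({"GSRN": ["x"] * len(missing), "time": missing})],
    ignore_index=True,
)

# Matching key: the added row carries the name and only 'output' is NaN
right = pd.concat(
    [df, pd.DataFrame({"name": ["x"] * len(missing), "time": missing})],
    ignore_index=True,
).sort_values("time")

print(wrong)
print(right)
```

Separately, pd.date_range(start_date, end_date, ...) with plain dates stops at midnight of the last day, so the final day would only get its 00:00 row.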

Edit 1:

This differs from the [question here][1] because that asker didn't need all 24 values per day; resampling between 2pm and 10pm only gives the values in between.

Edit 2:

I found a (not great) solution: create a MultiIndex with all name-timestamp pairs and combine it with the table. The code is below for anyone interested, but I would still like a better solution:

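import datetime
import pandas as pd

# Span whole days: midnight of the first day through 23:00 of the last day
start_date = datetime.datetime.combine(df.time.min().date(), datetime.time(0, 0))
end_date = datetime.datetime.combine(df.time.max().date(), datetime.time(23, 0))
new_idx = pd.date_range(start_date, end_date, freq='1H')

# Build every (name, hour) pair, then attach the measurements by name and time
mux = pd.MultiIndex.from_product([df['name'].unique(), new_idx], names=('name', 'time'))
df_complete = pd.DataFrame(index=mux).reset_index().merge(df, on=['name', 'time'], how='left')
# Keep only the (name, day) groups that have at least one measurement
df_complete = df_complete.groupby(["name", df_complete.time.dt.date]).filter(lambda g: g["output"].count() > 0)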

The last line removes any days that were completely missing for the specific name in the initial dataframe.

question from:https://stackoverflow.com/questions/65848171/filling-in-missing-hourly-data-in-pandas


1 Reply


Try this:

First create a DataFrame indexed by every hour from the minimum to the maximum timestamp, then join the original data onto it.

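import pandas as pd

df.time = pd.to_datetime(df.time)
min_date = df.time.min()
max_date = df.time.max()
# Every hour between the first and last timestamp
dates_range = pd.date_range(min_date, max_date, freq='1H')
df.set_index('time', inplace=True)
# Empty frame indexed by the full hourly range; the left join keeps every hour
df3 = pd.DataFrame(dates_range).set_index(0)
df4 = df3.join(df)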

df4:

                     name  output
2018-02-22 00:00:00  x     100.0
2018-02-22 00:00:00  y     100.0
2018-02-22 01:00:00  NaN   NaN
2018-02-22 02:00:00  NaN   NaN
2018-02-22 03:00:00  NaN   NaN
...                  ...   ...
2018-02-25 19:00:00  NaN   NaN
2018-02-25 20:00:00  NaN   NaN
2018-02-25 21:00:00  NaN   NaN
2018-02-25 22:00:00  NaN   NaN
2018-02-25 23:00:00  y     300.0
98 rows × 2 columns
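
Note that the join above leaves name as NaN on the filled rows and covers the whole range rather than only the days each name has data. As a minimal sketch of the per-name, per-day fill the question asks for, assuming the columns name, time, and output, timestamps that fall exactly on the hour, and pandas 1.2 or later (for how="cross"), something like this may work. The function name `fill_hours_per_day` and the sample values are hypothetical.

```python
import pandas as pd

def fill_hours_per_day(df):
    """Return df with all 24 hourly rows for every (name, day) present in df."""
    df = df.copy()
    df["time"] = pd.to_datetime(df["time"])
    # One row per (name, day) that has at least one measurement
    days = (
        df.assign(day=df["time"].dt.normalize())[["name", "day"]]
        .drop_duplicates()
    )
    # Cross each (name, day) with the hours 0..23
    hours = pd.DataFrame({"hour": range(24)})
    grid = days.merge(hours, how="cross")
    grid["time"] = grid["day"] + pd.to_timedelta(grid["hour"], unit="h")
    # Left-merge the measurements; hours without data get NaN in output
    out = grid[["name", "time"]].merge(df, on=["name", "time"], how="left")
    return out.sort_values(["name", "time"]).reset_index(drop=True)

# Illustrative data: x has readings on two days, y on one
sample = pd.DataFrame({
    "name": ["x", "x", "y"],
    "time": ["2018-02-22 00:00", "2018-02-24 23:00", "2018-02-25 05:00"],
    "output": [100, 300, 300],
})
filled = fill_hours_per_day(sample)
print(filled.groupby("name").size())
```

Days with no measurements for a given name (2018-02-23 for x in the sample) are never added, so no filtering step is needed afterwards. Rows whose timestamps are not on the hour would not match the grid and would be lost in the left merge, so they may need flooring first.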


...