I have a dataframe with a column for the time zone and a column for the datetime. I would like to convert these to UTC first to join with other data, and then I'll have some calculations to convert from UTC to the viewers local time zone eventually.
datetime time_zone
2016-09-19 01:29:13 America/Bogota
2016-09-19 02:16:04 America/New_York
2016-09-19 01:57:54 Africa/Cairo
def create_utc(df, column, time_format='%Y-%m-%d %H:%M:%S'):
timezone = df['TZ']
df[column + '_utc'] = df[column].dt.tz_localize(timezone).dt.tz_convert('UTC').dt.strftime(time_format)
df[column + '_utc'].replace('NaT', np.nan, inplace=True)
df[column + '_utc'] = pd.to_datetime(df[column + '_utc'])
return df
That was my flawed attempt. The error is that the truth is ambiguous which makes sense because the 'timezone' variable is referring to a column. How do I refer to the value in the same row?
Edit: here are some results from the answers below on one day of data (394,000 rows and 22 unique time zones). Edit2: I added a groupby example in case someone wants to see the results. It is the fastest, by far.
%%timeit
for tz in df['TZ'].unique():
df.ix[df['TZ'] == tz, 'datetime_utc2'] = df.ix[df['TZ'] == tz, 'datetime'].dt.tz_localize(tz).dt.tz_convert('UTC')
df['datetime_utc2'] = df['datetime_utc2'].dt.tz_localize(None)
1 loops, best of 3: 1.27 s per loop
%%timeit
df['datetime_utc'] = [d['datetime'].tz_localize(d['TZ']).tz_convert('UTC') for i, d in df.iterrows()]
df['datetime_utc'] = df['datetime_utc'].dt.tz_localize(None)
1 loops, best of 3: 50.3 s per loop
df['datetime_utc'] = pd.concat([d['datetime'].dt.tz_localize(tz).dt.tz_convert('UTC') for tz, d in df.groupby('TZ')])
**1 loops, best of 3: 249 ms per loop**
See Question&Answers more detail:
os 与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…