I'm using Python 2.7, pandas 0.14.1-2, numpy 1.8.1-1. I have to use Python 2.7 because I'm coupling it with something that doesn't work on Python 3.
I'm trying to analyze a CSV file that stores Month, Day and Hour in separate columns, and looks something like the following:
Month Day Hour Value
1 1 1 105
1 1 2 30
1 1 3 85
1 1 4 52
1 1 5 65
I basically want to create a timestamp from those columns, using 2005 as the year, and set this new timestamp column as the index.
I've read a lot of similar questions (here and here) but they all rely on doing it during read_csv(). I don't have a year column, so I don't think this applies to me (aside from loading the dataframe, inserting the column, writing it out, and redoing read_csv(), which seems convoluted).
After loading the dataframe, I insert a Year column at position 0:
df.insert(0, "Year", 2005)
So now I've got
Year Month Day Hour Value
2005 1 1 1 105
2005 1 1 2 30
2005 1 1 3 85
2005 1 1 4 52
2005 1 1 5 65
df.dtypes tells me that all columns are of type int64.
Then I've tried doing this:
df['Datetime'] = pd.to_datetime(df.Year*1000000 + df.Month*10000 + df.Day*100 + df.Hour, format="%Y%M%d%H")
But I'm getting "TypeError: 'long' object is unsliceable"
On the other hand, the following runs without errors.
df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")
Since Python 2.7 doesn't accept the %Y%M%d%H format, as pointed out by @EdChum, I've tried doing it in two steps: creating a datetime without hours, then adding the hours afterwards. But the output is not what I wanted:
In [1]: # Do it without hours first (otherwise doesn't work in Python 2.7)
df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")
In [2]: df['Datetime']
Out [2]:
0 2005-01-01 00:01:00
1 2005-01-01 00:01:00
...
13 2005-01-01 00:01:00
14 2005-01-01 00:01:00
...
8745 2005-01-31 00:12:00
8746 2005-01-31 00:12:00
...
8758 2005-01-31 00:12:00
8759 2005-01-31 00:12:00
Row 8758, for example, is supposed to be 2005-12-31. What is wrong with that?
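A possible explanation, sketched with only the standard library rather than pandas itself (so it assumes pandas interprets the format string with the same strptime directives): %M means minutes, while %m means month. Parsing the date with %M would put 12 into the minutes and leave the month at its default of 1, which matches the output above.

```python
from datetime import datetime

# %M is the minute directive: "12" is read as minutes, and the month,
# which is never parsed, falls back to its default of 1 (January).
with_minutes = datetime.strptime("20051231", "%Y%M%d")

# %m is the month directive: "12" is read as December.
with_month = datetime.strptime("20051231", "%Y%m%d")
```

Here with_minutes comes out as 2005-01-31 00:12:00 and with_month as 2005-12-31 00:00:00.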
Once I resolve that, I'll be able to re-add the hours:
In [3]: # Then add the hours
df['Datetime'] = df['Datetime'] + pd.to_timedelta(df['Hour'], unit="h")
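For comparison, the same two-step idea (build the date first, then add the hours) can be sketched for a single row with the standard library alone. The helper name make_timestamp is hypothetical, and the default year of 2005 is taken from the description above. This is only an illustration of the logic, not a replacement for the vectorized pandas calls.

```python
from datetime import datetime, timedelta

def make_timestamp(month, day, hour, year=2005):
    """Hypothetical helper: build the date first, then add the hours."""
    # Adding the hour as a timedelta, instead of parsing it, means an
    # hour value of 24 rolls over to midnight of the following day.
    return datetime(year, month, day) + timedelta(hours=hour)

first_row = make_timestamp(1, 1, 1)    # Month=1, Day=1, Hour=1
rollover = make_timestamp(12, 31, 24)  # assumes Hour may run from 1 to 24
```

If the Hour column does run from 1 to 24, this may be part of why adding the hours separately is convenient: an hour of 24 is outside the 0 to 23 range that %H accepts.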