It seems that for OLS linear regression to work well in Pandas, the arguments must be floats. I'm starting with a csv (called "gameAct.csv") of the form:
date, city, players, sales
2014-04-28,London,111,1091.28
2014-04-29,London,100,1100.44
2014-04-28,Paris,87,1001.33
...
I want to perform linear regression of how sales depend on date (as time moves forward, how do sales move?). The problem with my code below seems to be with dates not being float values. I would appreciate help on how to resolve this indexing problem in Pandas.
My current (non-working, but compiling code):
import pandas as pd
from pandas import DataFrame, Series
import statsmodels.formula.api as sm
df = pd.read_csv('gameAct.csv')
df.columns = ['date', 'city', 'players', 'sales']
city_data = df[df['city'] == 'London']
result = sm.ols(formula = 'sales ~ date', data = city_data).fit()
As I vary the city value, I get R^2 = 1 results, which is wrong. I have also attempted index_col = 0, parse_dates == True'
in defining the dataframe df
, but without success.
I suspect there is a better way to read in such csv files to perform basic regression over dates, and also for more general time series analysis. Help, examples, and resources are appreciated!
Note, with the above code, if I convert the dates index (for a given city) to an array, the values in this array are of the form:
'xefxbbxbf2014-04-28'
How does one produce an AIC analysis over all of the non-sales parameters? (e.g. the result might be that sales depend most linearly on date and city).
See Question&Answers more detail:
os