Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
304 views
in Technique[技术] by (71.8m points)

python - Regression with Date variable using Scikit-learn

I have a Pandas DataFrame with a date column (eg: 2013-04-01) of dtype datetime.date. When I include that column in X_train and try to fit the regression model, I get the error float() argument must be a string or a number. Removing the date column avoided this error.

What is the proper way to take the date into account in the regression model?

Code

data = sql.read_frame(...)
X_train = data.drop('y', axis=1)
y_train = data.y

rf = RandomForestRegressor().fit(X_train, y_train)

Error

TypeError                                 Traceback (most recent call last)
<ipython-input-35-8bf6fc450402> in <module>()
----> 2 rf = RandomForestRegressor().fit(X_train, y_train)

C:Python27libsite-packagessklearnensembleforest.pyc in fit(self, X, y, sample_weight)
    292                 X.ndim != 2 or
    293                 not X.flags.fortran):
--> 294             X = array2d(X, dtype=DTYPE, order="F")
    295 
    296         n_samples, self.n_features_ = X.shape

C:Python27libsite-packagessklearnutilsvalidation.pyc in array2d(X, dtype, order, copy)
     78         raise TypeError('A sparse matrix was passed, but dense data '
     79                         'is required. Use X.toarray() to convert to dense.')
---> 80     X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
     81     _assert_all_finite(X_2d)
     82     if X is X_2d and copy:

C:Python27libsite-packages
umpycore
umeric.pyc in asarray(a, dtype, order)
    318 
    319     """
--> 320     return array(a, dtype, copy=False, order=order)
    321 
    322 def asanyarray(a, dtype=None, order=None):

TypeError: float() argument must be a string or a number
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Reply

0 votes
by (71.8m points)

The best way is to explode the date into a set of categorical features encoded in boolean form using the 1-of-K encoding (e.g. as done by DictVectorizer). Here are some features that can be extracted from a date:

  • hour of the day (24 boolean features)
  • day of the week (7 boolean features)
  • day of the month (up to 31 boolean features)
  • month of the year (12 boolean features)
  • year (as many boolean features as they are different years in your dataset) ...

That should make it possible to identify linear dependencies on periodic events on typical human life cycles.

Additionally you can also extract the date a single float: convert each date as the number of days since the min date of your training set and divide by the difference of the number of days between the max date and the number of days of the min date. That numerical feature should make it possible to identify long term trends between the output of the event date: e.g. a linear slope in a regression problem to better predict evolution on forth-coming years that cannot be encoded with the boolean categorical variable for the year feature.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
OGeek|极客中国-欢迎来到极客的世界,一个免费开放的程序员编程交流平台!开放,进步,分享!让技术改变生活,让极客改变未来! Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...