Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

scikit-learn - Random Forest on Panel Data using Python

I am having trouble running a random forest regression on panel data.

The data currently looks like this:

(image not available)

I want to run a random forest regression that predicts KwH for each ID over time, based on the variables I have. I have split my data into training and test samples using the following code:

from sklearn.model_selection import train_test_split
X = df[['hour', 'day', 'month', 'dayofweek', 'apparentTemperature',
       'summary', 'household_size', 'work_from_home', 'num_rooms',
       'int_in_renew', 'int_in_gen', 'conc_abt_cc', 'feel_abt_lifestyle',
       'smrt_meter_help', 'avg_gender', 'avg_age', 'house_type', 'sum_insul',
       'total_lb', 'total_fridges', 'bigg_apps', 'small_apps',
       'look_at_meter']]
y = df[['KwH']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

I then want to train my model and evaluate it on the test sample, but I am unsure how to do this. I have tried this code:

from sklearn.ensemble import RandomForestRegressor
rfc = RandomForestRegressor(n_estimators=200)
rfc.fit(X_train, y_train)

However, I get the following error message:

A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().

I'm not sure whether the problem lies in the way my data is arranged or in the way I am fitting the random forest. Any help with this, and with testing the model against the test sample afterwards, would be greatly appreciated.

Thanks in advance.

question from:https://stackoverflow.com/questions/65891664/random-forest-on-panel-data-using-python

1 Reply

Switching y = df[['KwH']] to y = df['KwH'] or y = df.KwH should solve this.

scikit-learn expects y here to be one-dimensional, with shape (n_samples,). Selecting columns with double brackets [[...]] returns a DataFrame of shape (n_samples, 1), even for a single column, whereas single brackets return a one-dimensional Series.
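To make the difference concrete, and to cover the second half of the question (testing against the test sample), here is a minimal sketch. It assumes a small synthetic DataFrame as a stand-in for the real data: the values are made up, only three of the original feature columns are used, and the choice of R^2 and mean absolute error as metrics is an assumption, not something the question specifies.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data (hypothetical values).
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "apparentTemperature": rng.normal(10, 5, n),
    "household_size": rng.integers(1, 6, n),
})
df["KwH"] = 0.05 * df["hour"] + 0.2 * df["household_size"] + rng.normal(0, 0.1, n)

X = df[["hour", "apparentTemperature", "household_size"]]
y_2d = df[["KwH"]]  # DataFrame of shape (n, 1): this is what triggers the message
y = df["KwH"]       # Series of shape (n,): the shape fit() expects
print(y_2d.shape, y.shape)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rfc = RandomForestRegressor(n_estimators=200, random_state=0)
rfc.fit(X_train, y_train)

# Evaluate on the held-out test sample.
y_pred = rfc.predict(X_test)
r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print("R^2:", r2)
print("MAE:", mae)
```

If y has to stay a DataFrame for other reasons, passing y_train.values.ravel() to fit(), as the message itself suggests, should have the same effect.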

