Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Machine learning - Structure of input data

I am trying out machine learning in Python to predict future values. My data (X1, ..., X8, Y) is shown in the attached figure.

[Figure: Description of my data]

For testing, I have started with sklearn's RandomForestRegressor because the value I am trying to predict is a float. My input data is originally a mix of data types (strings, floats, integers and True/False values), all of which I have converted to numbers. Each string is represented by a unique integer. Each True/False value is represented by either 1 or 0.
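The conversion described above could look roughly like the following. This is only a sketch: the column names (`site`, `load`, `month`, `active`) and their values are made up for illustration, and `pd.factorize` is one possible way to assign a unique integer per string, not necessarily the one used here.

```python
import pandas as pd

# Hypothetical mixed-type frame; names and values are illustrative only.
df = pd.DataFrame({
    "site": ["north", "south", "north", "east"],  # strings
    "load": [641.1, 647.0, 629.1, 491.3],         # floats
    "month": [9, 1, 12, 11],                      # integers
    "active": [False, False, True, True],         # booleans
})

# factorize assigns 0, 1, 2, ... in order of first appearance
# and also returns the lookup table of distinct strings.
codes, categories = pd.factorize(df["site"])
df["site"] = codes

# True/False -> 1/0
df["active"] = df["active"].astype(int)

# Reuse the same string -> integer mapping for new data,
# so that a given string always gets the same code.
mapping = {name: code for code, name in enumerate(categories)}
new_sites = pd.Series(["south", "east", "west"]).map(mapping)

print(df)
print(new_sites)  # strings not seen before ("west") become NaN
```

Keeping the mapping around matters because the data to predict has to be encoded with the same codes as the training data.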

The examples I find online usually use either numbers (regression problems?) or strings (classification problems?).

Is this the correct approach for mixed input data types?

I am grateful for any input.

'''
X1,X2,X3,X4,X5,X6,X7,X8,Y
93,150,18,10,63,641.1024566,9,0,49.87777112
93,371,19,3,62,641.1024566,1,0,48.85200719
93,150,19,4,62,641.1024566,12,1,41.67165968
93,196,19,6,62,641.1024566,11,1,47.1851408
93,416,19,9,414,641.1024566,5,1,46.67225884
93,196,19,9,375,647.0940683,7,0,43.35530258
93,416,19,10,428,641.1024566,1,1,46.80047933
93,196,19,10,430,641.1024566,6,0,50.19832235
93,196,19,11,579,629.1192331,4,1,46.55482325
93,416,20,2,422,641.1024566,3,1,48.21090473
93,196,20,3,429,641.1024566,10,1,47.95446375
93,150,20,3,429,641.1024566,11,1,48.08268424
93,196,20,4,430,641.1024566,12,1,47.69802277
93,196,20,5,427,641.1024566,11,1,46.99281007
93,196,20,5,424,641.1024566,10,1,47.31336129
93,206,20,6,6,641.1024566,2,1,47.1851408
93,196,20,6,427,491.312163,11,1,35.66926303
93,196,20,9,430,641.1024566,4,1,47.24925105
93,416,20,8,362,641.1024566,8,1,48.08268424
'''


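# Imports needed by the code below
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# df1 (training data), df2 (data to predict) and target_column
# (assumed to be a list such as ['Y']) are defined earlier

# Normalize input values
# A list comprehension keeps the column order stable; a set difference does not
predictors = [c for c in df1.columns if c not in target_column]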
maximumPredictor = df1[predictors].max()
df1[predictors] = df1[predictors]/maximumPredictor
df1.describe().transpose()

df2[predictors] = df2[predictors]/maximumPredictor
df2.describe().transpose()

X = df1[predictors].values
y = df1[target_column].values
X_predict = df2[predictors].values


# Split data to evaluate the model with a portion of input data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)


# This is the regressor
regressor = RandomForestRegressor(n_estimators=50)

# Train regressor
regressor.fit(X_train, y_train.ravel())

# Make a prediction from test data
y_pred = regressor.predict(X_test)
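
To judge whether the encoding works, the test-set predictions can be compared with the held-out targets. The sketch below is self-contained and uses the first ten sample rows from the question. The metric choices (`mean_absolute_error`, `r2_score`) and the fixed `random_state=0` on the regressor are assumptions added for illustration, not part of the original code.

```python
import io

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# First ten rows of the sample data posted in the question
csv_text = """X1,X2,X3,X4,X5,X6,X7,X8,Y
93,150,18,10,63,641.1024566,9,0,49.87777112
93,371,19,3,62,641.1024566,1,0,48.85200719
93,150,19,4,62,641.1024566,12,1,41.67165968
93,196,19,6,62,641.1024566,11,1,47.1851408
93,416,19,9,414,641.1024566,5,1,46.67225884
93,196,19,9,375,647.0940683,7,0,43.35530258
93,416,19,10,428,641.1024566,1,1,46.80047933
93,196,19,10,430,641.1024566,6,0,50.19832235
93,196,19,11,579,629.1192331,4,1,46.55482325
93,416,20,2,422,641.1024566,3,1,48.21090473
"""
df = pd.read_csv(io.StringIO(csv_text))

X = df.drop(columns="Y")
y = df["Y"]

# Hold out 20% of the rows for evaluation, as in the question
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=40
)

regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Compare predictions with the held-out targets
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}  R^2: {r2:.3f}")
```

The sketch skips the division by the column maximum. Tree splits depend only on the ordering of each feature's values, so rescaling by a positive constant should not change what a random forest learns. With so few rows the scores are only a smoke test, not a meaningful estimate.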

Question from: https://stackoverflow.com/questions/66063190/machine-learning-structure-of-input-data


1 Reply


If the strings repeat and you have encoded string 1 as 93, string 2 as 94, and so on, your approach is fine. Otherwise, you cannot use this approach; this resource might be useful: https://stats.stackexchange.com/questions/339656/mix-of-text-and-numeric-data
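
For the repeating-strings case, a common alternative to integer codes is one-hot encoding, which avoids implying an order between categories. This is not something the answer above proposes; it is a different technique, sketched here with scikit-learn's `ColumnTransformer` and `OneHotEncoder`. The column names and values are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with one string column; values are illustrative.
train = pd.DataFrame({
    "site": ["north", "south", "north", "east", "south", "east"],
    "load": [641.1, 647.0, 629.1, 491.3, 641.1, 641.1],
    "active": [0, 0, 1, 1, 1, 0],
    "y": [49.9, 43.4, 46.6, 35.7, 47.2, 48.1],
})

# One-hot encode the string column and pass the numeric columns through.
# handle_unknown="ignore" encodes unseen categories as all zeros.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["site"])],
    remainder="passthrough",
)

model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
])
model.fit(train.drop(columns="y"), train["y"])

# New rows, including a category ("west") not present during training
new = pd.DataFrame({
    "site": ["south", "west"],
    "load": [640.0, 500.0],
    "active": [1, 0],
})
pred = model.predict(new)
print(pred)
```

Putting the encoder inside the pipeline means the same mapping learned from the training data is applied to the data being predicted.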

