If you read the docs for OneHotEncoder
you'll see the input for fit
is "Input array of type int". So you need to do two steps for your one hot encoded data
from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)
print new_cat_features # [1 2 0]
new_cat_features = new_cat_features.reshape(-1, 1) # Needs to be the correct shape
ohe = preprocessing.OneHotEncoder(sparse=False) #Easier to read
print ohe.fit_transform(new_cat_features)
Output:
[[ 0. 1. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]]
EDIT
As of 0.20
this became a bit easier, not only because OneHotEncoder
now handles strings nicely, but also because we can transform multiple columns easily using ColumnTransformer
, see below for an example
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np
X = np.array([['apple', 'red', 1, 'round', 0],
['orange', 'orange', 2, 'round', 0.1],
['bannana', 'yellow', 2, 'long', 0],
['apple', 'green', 1, 'round', 0.2]])
ct = ColumnTransformer(
[('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3]),], # the column numbers I want to apply this to
remainder='passthrough' # This leaves the rest of my columns in place
)
print(ct2.fit_transform(X)) # Notice the output is a string
Output:
[['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…