I am trying to encode a number of columns containing categorical data ("Yes" and "No") in a large pandas dataframe. The complete dataframe contains over 400 columns, so I am looking for a way to encode all the desired columns without having to encode them one by one. I use scikit-learn's LabelEncoder to encode the categorical data.
The first part of the dataframe does not have to be encoded. However, I am looking for a method to encode all the desired columns containing categorical data directly, without splitting and concatenating the dataframe.
To demonstrate my question, I first tried to solve it on a small part of the dataframe. However, I get stuck at the last step, where the data is fitted and transformed, and get ValueError: bad input shape (4, 3). The code as I ran it:
# Import pandas (needed for pd.DataFrame below)
import pandas as pd
# Create a simple dataframe resembling the large dataframe
data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})
# Import required module
from sklearn.preprocessing import LabelEncoder
# Create an object of the label encoder class
labelencoder = LabelEncoder()
# Apply labelencoder object on columns; the first column does not need to be encoded
# (.ix was removed in pandas 1.0; data.iloc[:, 1:] selects the same columns)
labelencoder.fit_transform(data.ix[:, 1:])
Complete error report:
labelencoder.fit_transform(data.ix[:, 1:])
Traceback (most recent call last):
  File "<ipython-input-47-b4986a719976>", line 1, in <module>
    labelencoder.fit_transform(data.ix[:, 1:])
  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform
    y = column_or_1d(y, warn=True)
  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (4, 3)
Does anyone know how to do this?
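For reference, the traceback shows fit_transform passing its input through column_or_1d, which suggests LabelEncoder only accepts a single column at a time. One possible workaround, offered as a minimal sketch rather than a definitive solution, is to fit a separate encoder per column in a loop. The names cols, encoders and encoded are hypothetical and introduced only for this illustration, and selecting "every column except the first" is an assumption taken from the small example above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same toy dataframe as in the question
data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

# Columns to encode: everything except the first (assumption for this sketch)
cols = data.columns[1:]

# LabelEncoder works on 1-D input, so fit one encoder per column.
# Keeping the fitted encoders allows inverse_transform later on.
encoders = {col: LabelEncoder() for col in cols}

encoded = data.copy()
for col in cols:
    encoded[col] = encoders[col].fit_transform(data[col])

print(encoded)
```

LabelEncoder sorts the class labels, so here "No" becomes 0 and "Yes" becomes 1. The untouched column A stays in place, so no splitting and concatenating is needed.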