Well, I am surprised, but it turns out that sklearn's decision tree indeed cannot handle categorical data. There is a GitHub issue on this (#4899) from June 2015, but it is still open (UPDATE: it is now closed, but continued in #12866, so the issue is still not resolved).
The problem with encoding categorical variables as integers, as you have done here, is that it imposes an order on them, which may or may not be meaningful, depending on the case; for example, you could encode `['low', 'medium', 'high']` as `[0, 1, 2]`, since `'low' < 'medium' < 'high'` (we call such categorical variables ordinal), although you are still implicitly making the additional (and possibly undesired) assumption that the distance between `'low'` and `'medium'` is the same as the distance between `'medium'` and `'high'` (of no impact in decision trees, but of importance e.g. in k-NN and clustering). But this approach fails completely in cases like, say, `['red', 'green', 'blue']` or `['male', 'female']`, since we cannot claim any meaningful relative order between them.
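To make the ordinal case concrete, here is a minimal sketch using sklearn's `OrdinalEncoder`; the column values are made up for illustration, and passing the category order explicitly is the key point:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single feature with an inherent order
X = np.array([['low'], ['high'], ['medium'], ['low']])

# Pass the categories explicitly; otherwise they are sorted alphabetically,
# which would give 'high' < 'low' < 'medium' -- not what we want
enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
print(enc.fit_transform(X))
# [[0.]
#  [2.]
#  [1.]
#  [0.]]
```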
So, for non-ordinal categorical variables, the way to properly encode them for use in sklearn's decision tree is to use the `OneHotEncoder` class. The Encoding categorical features section of the user guide might also be helpful.
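Here is a short sketch of the one-hot approach feeding a decision tree; the feature values and labels are invented purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical non-ordinal feature and labels
X = np.array([['red'], ['green'], ['blue'], ['green']])
y = [0, 1, 1, 0]

# Each category becomes its own binary column, so no artificial order is imposed
enc = OneHotEncoder(sparse_output=False)  # use sparse=False on sklearn < 1.2
X_encoded = enc.fit_transform(X)
print(enc.get_feature_names_out())  # ['x0_blue' 'x0_green' 'x0_red']

clf = DecisionTreeClassifier().fit(X_encoded, y)
print(clf.predict(enc.transform([['red']])))
```

Note that the same fitted encoder must be reused to transform any new data before prediction, so that the binary columns line up with those seen during training.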