Well, I am surprised, but it turns out that sklearn's decision tree indeed cannot handle categorical data. There is a GitHub issue on this (#4899) from June 2015, but it is still open (UPDATE: it is now closed, but continued in #12866, so the issue is still not resolved).
The problem with encoding categorical variables as integers, as you have done here, is that it imposes an order on them, which may or may not be meaningful, depending on the case; for example, you could encode `['low', 'medium', 'high']` as `[0, 1, 2]`, since `'low' < 'medium' < 'high'` (we call such categorical variables ordinal), although you are still implicitly making the additional (and possibly undesired) assumption that the distance between `'low'` and `'medium'` is the same as the distance between `'medium'` and `'high'` (of no impact in decision trees, but of importance e.g. in k-NN and clustering). But this approach fails completely in cases like, say, `['red', 'green', 'blue']` or `['male', 'female']`, since we cannot claim any meaningful relative order between them.
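To make the ordinal case concrete, here is a minimal sketch using sklearn's `OrdinalEncoder`; the column values are made up for illustration, and passing the category order explicitly is the key point:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single feature with an inherent order
X = np.array([['low'], ['high'], ['medium'], ['low']])

# Pass the categories explicitly; otherwise they are sorted alphabetically,
# which would give 'high' < 'low' < 'medium' -- not what we want
enc = OrdinalEncoder(categories=[['low', 'medium', 'high']])
print(enc.fit_transform(X))
# [[0.]
#  [2.]
#  [1.]
#  [0.]]
```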
So, for non-ordinal categorical variables, the way to properly encode them for use in sklearn's decision tree is to use the `OneHotEncoder` class. The Encoding categorical features section of the user guide might also be helpful.
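Here is a short sketch of the one-hot approach feeding a decision tree; the feature values and labels are invented purely for illustration:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical non-ordinal feature and labels
X = np.array([['red'], ['green'], ['blue'], ['green']])
y = [0, 1, 1, 0]

# Each category becomes its own binary column, so no artificial order is imposed
enc = OneHotEncoder(sparse_output=False)  # use sparse=False on sklearn < 1.2
X_encoded = enc.fit_transform(X)
print(enc.get_feature_names_out())  # ['x0_blue' 'x0_green' 'x0_red']

clf = DecisionTreeClassifier().fit(X_encoded, y)
print(clf.predict(enc.transform([['red']])))
```

Note that the same fitted encoder must be reused to transform any new data before prediction, so that the binary columns line up with those seen during training.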