python - Using the predict_proba() function of RandomForestClassifier in the safe and right way

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

I'm using scikit-learn to apply machine learning algorithms to my data sets. Sometimes I need the probabilities of the labels/classes instead of the labels/classes themselves. For example, instead of labelling an email Spam/Not Spam, I want something like: a 0.78 probability that a given email is Spam.

For this purpose, I'm using predict_proba() with RandomForestClassifier as follows:

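from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# X, y: training features and labels; Xtest: features of the emails to score
clf = RandomForestClassifier(n_estimators=10, max_depth=None,
    min_samples_split=2, random_state=0)  # min_samples_split must be at least 2
scores = cross_val_score(clf, X, y)  # cross-validated accuracy per fold
print(scores.mean())

classifier = clf.fit(X, y)
predictions = classifier.predict_proba(Xtest)  # one row per sample, one column per class
print(predictions)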

And I got these results:

 [ 0.4  0.6]
 [ 0.1  0.9]
 [ 0.2  0.8]
 [ 0.7  0.3]
 [ 0.3  0.7]
 [ 0.3  0.7]
 [ 0.7  0.3]
 [ 0.4  0.6]

Here the second column is for the class Spam. However, there are two things about these results that I'm not confident about. First, do the results really represent the probabilities of the labels, regardless of the size of my data? Second, the results show only one decimal digit, which is not specific enough in cases where a probability of 0.701 is very different from 0.708. Is there any way to get, for example, five decimal digits?

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:37:12+0000

A RandomForestClassifier is a collection of DecisionTreeClassifier instances. When a tree is grown to full depth, as with max_depth=None here, its leaves are normally pure, so no matter how big your training set is, the tree simply returns a decision: one class has probability 1 and the other classes have probability 0.
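The single-tree behaviour described above can be checked with a minimal sketch. It assumes a synthetic data set from make_classification as a stand-in for the email data (not the asker's data) and scikit-learn's default tree settings, which grow the tree until the leaves are pure. The second tree, with min_samples_leaf=25, is an illustrative contrast only: once leaves are allowed to stay impure, a tree typically reports fractional values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

# Fully grown tree: every leaf is pure, so each row is a hard 0/1 decision.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
full_values = np.unique(full_tree.predict_proba(X_test))

# Tree with larger leaves: a leaf may hold both classes, and predict_proba
# then reports the class fractions found in that leaf.
coarse_tree = DecisionTreeClassifier(min_samples_leaf=25, random_state=0)
coarse_tree.fit(X_train, y_train)
coarse_values = np.unique(coarse_tree.predict_proba(X_test))

print("fully grown tree:", full_values)
print("min_samples_leaf=25:", coarse_values)
```

Printing both arrays side by side makes the difference visible without relying on any particular data set.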

The RandomForest simply combines the results of its trees. In scikit-learn, predict_proba() returns the mean of the class probabilities predicted by the individual trees. With fully grown trees each tree chooses exactly one class, so this is the number of votes for each class divided by the number of trees in the forest. Hence the resolution of your output is exactly 1/n_estimators, which is why 10 estimators give you one decimal digit. Want more "precision"? Add more estimators. If you want to see variation at the 5th digit, you will need 10**5 = 100,000 estimators, which is excessive. You normally don't want more than 100 estimators, and often not that many.
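As a rough check of the 1/n_estimators granularity, here is a minimal sketch. It assumes synthetic data from make_classification in place of the email data, and the helper name spam_probabilities is hypothetical. It fits forests of 10 and 200 fully grown trees, tests that every predicted value sits on the 1/n_estimators grid, and compares the forest output with the average of the per-tree outputs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the email data (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, y_train, X_test = X[:400], y[:400], X[400:]

def spam_probabilities(n_estimators):
    """Fit a forest of fully grown trees; return it with the class-1 column."""
    clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    clf.fit(X_train, y_train)
    return clf, clf.predict_proba(X_test)[:, 1]

clf10, p10 = spam_probabilities(10)
clf200, p200 = spam_probabilities(200)

# With pure leaves, every value is a multiple of 1/n_estimators.
on_grid_10 = np.allclose(p10 * 10, np.round(p10 * 10))
on_grid_200 = np.allclose(p200 * 200, np.round(p200 * 200))

# The forest output equals the mean of the individual trees' outputs.
per_tree_mean = np.mean(
    [tree.predict_proba(X_test)[:, 1] for tree in clf10.estimators_], axis=0
)

print("distinct values with 10 trees:", np.unique(p10).size)
print("distinct values with 200 trees:", np.unique(p200).size)
print("all on the 1/n grid:", on_grid_10, on_grid_200)
print("matches per-tree mean:", np.allclose(per_tree_mean, p10))
```

More trees give a finer grid, but the extra digits reflect the number of voters rather than better-calibrated probabilities.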
