You are very likely using the average='micro' parameter to calculate the F1-score. According to the docs, specifying 'micro' as the averaging strategy will:
Calculate metrics globally by counting the total true positives, false negatives and false positives.
In classification tasks where every test case is guaranteed to be assigned to exactly one class, computing a micro F1-score is equivalent to computing the accuracy score. Just check it out:
from sklearn.metrics import accuracy_score, f1_score
# 1000 one-hot encoded samples: 950 of class 0, 30 of class 1, 20 of class 2
y_true = [[1, 0, 0]]*950 + [[0, 1, 0]]*30 + [[0, 0, 1]]*20
# the model always predicts class 0
y_pred = [[1, 0, 0]]*1000
print(accuracy_score(y_true, y_pred)) # 0.95
print(f1_score(y_true, y_pred, average='micro')) # 0.9500000000000001
You basically computed the same metric twice. By specifying average='macro' instead, the F1-score will be computed for each label independently first, and then averaged:
print(f1_score(y_true, y_pred, average='macro')) # 0.3247863247863248
As you can see, the overall F1-score depends on the averaging strategy: since the macro score is the unweighted mean over the three labels, a value below 0.33 is a clear indicator that the model effectively only works for one of them.
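If you want to see where the 0.3248 comes from, here is a minimal sketch (reusing y_true and y_pred from above): with average=None, f1_score returns the per-label scores, and the macro score is simply their unweighted mean.
import numpy as np
from sklearn.metrics import f1_score

# per-label F1 scores (no averaging); zero_division=0 silences the
# warning for labels 1 and 2, which are never predicted
per_label_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
print(per_label_f1)           # approx. [0.974, 0.0, 0.0]
print(np.mean(per_label_f1))  # approx. 0.3248 -> the macro F1 from above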
EDIT:
Since the OP asked when to choose which strategy, and I think it might be useful for others as well, I will try to elaborate a bit on this issue.
scikit-learn actually implements four different strategies for metrics that support averaging in multiclass and multilabel classification tasks. Conveniently, classification_report will return all of those that apply for a given classification task for Precision, Recall and F1-score:
from sklearn.metrics import classification_report
# The same example, but without nested lists. This keeps sklearn from interpreting it as a multilabel problem.
y_true = [0]*950 + [1]*30 + [2]*20
y_pred = [0]*1000
print(classification_report(y_true, y_pred, zero_division=0))
######################### output ####################
              precision    recall  f1-score   support

           0       0.95      1.00      0.97       950
           1       0.00      0.00      0.00        30
           2       0.00      0.00      0.00        20

    accuracy                           0.95      1000
   macro avg       0.32      0.33      0.32      1000
weighted avg       0.90      0.95      0.93      1000
All of them provide a different perspective, depending on how much emphasis one puts on the class distribution.
The micro average is a global strategy that basically ignores the distinction between classes. This might be useful or justified if one is really just interested in the overall disagreement in terms of true positives, false negatives and false positives, and is not concerned about differences between the classes. As hinted before, if the underlying problem is not a multilabel classification task, this actually equals the accuracy score. (This is also why classification_report returns an accuracy row instead of a micro avg row.)
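Just to confirm this on the multiclass version of the example (a quick sketch reusing y_true and y_pred from the classification_report snippet):
from sklearn.metrics import accuracy_score, f1_score

# without the nested lists, micro F1 and accuracy coincide
print(accuracy_score(y_true, y_pred))             # 0.95
print(f1_score(y_true, y_pred, average='micro'))  # 0.95 (up to floating-point noise)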
The macro average calculates each metric for each label separately and returns their unweighted mean. This is suitable if each class is of equal importance and the result should not be skewed in favor of any of the classes in the dataset.
The weighted average also calculates each metric for each label separately first, but then weights the average by the classes' support. This is desirable if the importance of a class is proportional to its frequency in the data, i.e. an underrepresented class is considered less important.
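To make the link between the per-class rows and the two averaged rows of the report explicit, here is a minimal sketch (again reusing the multiclass y_true and y_pred) that reproduces the macro avg and weighted avg rows from the per-class scores:
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# per-class precision, recall, F1 and support
p, r, f1, support = precision_recall_fscore_support(y_true, y_pred, zero_division=0)

# unweighted means -> the 'macro avg' row
print(p.mean(), r.mean(), f1.mean())          # approx. 0.32 0.33 0.32

# support-weighted means -> the 'weighted avg' row
print(np.average(p, weights=support),
      np.average(r, weights=support),
      np.average(f1, weights=support))        # approx. 0.90 0.95 0.93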
The samples average is only meaningful for multilabel classification and is therefore not returned by classification_report in this example and also not discussed here ;)
So the choice of averaging strategy, and which of the resulting numbers to trust, really depends on the importance of the classes: Do I even care about class differences at all (if no --> micro average)? If I do, are all classes equally important (if yes --> macro average), or are classes with higher support more important (--> weighted average)?
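As a compact side-by-side comparison (a sketch reusing the multiclass y_true and y_pred from above), the three strategies give quite different numbers for the exact same predictions:
from sklearn.metrics import f1_score

for avg in ('micro', 'macro', 'weighted'):
    print(avg, f1_score(y_true, y_pred, average=avg, zero_division=0))
# micro    ~0.95 (classes ignored, equals accuracy here)
# macro    ~0.32 (every class counts equally)
# weighted ~0.93 (classes count in proportion to their support)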