I'm having trouble understanding how micro-averaging works for the F1-score. First, I'll explain the problem I have and the approach I've tried.
I have a dataset of documents which are further subcategorised by events. These documents are to be classified into True and False based on some category, but I also want my F1-score to capture the imbalance in events.
I initially evaluated my model with an F1-score using "binary" averaging via sklearn.metrics.precision_recall_fscore_support. However, this only captures precision and recall for whether or not samples of the positive class (True) are classified correctly.
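For reference, this first evaluation looks roughly like the following sketch, where true and pred are my binary target and prediction lists:
from sklearn.metrics import precision_recall_fscore_support
# "binary" averaging only scores the positive class (1)
precision, recall, f1, _ = precision_recall_fscore_support(true, pred, average="binary")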
I then transformed my targets and predictions into a multi-class problem by treating each event as a separate class, mapped to a numeric value.
For example, with 4 events, I would map each positive-class target and prediction to one of 1, 2, 3, or 4. I would then calculate precision, recall, and F1-score with micro averaging in an attempt to highlight the event imbalance.
I'm not sure if I'm tackling this correctly, but as I imagined the problem, if a bunch of positive predictions were overwhelmingly in one event class, that would be captured by micro averaging?
So say, as an example:
true = [1,0,0,0,0,1,1,0,1,1,1]
pred = [0,1,1,0,0,1,1,0,1,1,1]
cats = [2,1,4,1,3,4,4,3,4,3,4]
would change into
true = [2, 0, 0, 0, 0, 4, 4, 0, 4, 3, 4]
pred = [0, 1, 4, 0, 0, 4, 4, 0, 4, 3, 4]
where each positive class would be assigned a value corresponding to its event.
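Concretely, the transformation I apply looks something like this sketch, using the true, pred, and cats lists above (cat_true and cat_pred are just my names for the transformed arrays):
# Replace each positive (1) entry with its event id; negatives stay 0
cat_true = [c if t == 1 else 0 for t, c in zip(true, cats)]
cat_pred = [c if p == 1 else 0 for p, c in zip(pred, cats)]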
I then run:
precision_recall_fscore_support(cat_true, cat_pred, average="micro", labels=list(set(cats)))
where I would expect to see the imbalance between, say, labels 2 and 4 highlighted. However, I get exactly the same scores from binary averaging on the binary labels as I do from my multi-class approach, which was intended to capture the label imbalance across events.
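To illustrate, both calls on the example arrays above give me the same numbers (roughly 0.714 precision, 0.833 recall, 0.769 F1):
# Binary scores on the original 0/1 labels
precision_recall_fscore_support(true, pred, average="binary")
# Micro-averaged scores on the transformed labels, restricted to the event labels
precision_recall_fscore_support(cat_true, cat_pred, average="micro", labels=list(set(cats)))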
Am I going about this the wrong way and misunderstanding the purpose of micro-averaging? I want to exclude correctly classified negative examples (0) from the calculation, as my document set is very imbalanced in True/False values.
My use case is as follows: some events contain fewer examples of the positive class than others, and I want to capture this imbalance across the correctly classified examples.