Using softmax for multilabel classification (as per Facebook paper)
I came across this paper by some Facebook researchers where they found that using softmax + cross-entropy (CE) loss during training led to improved results over sigmoid + binary cross-entropy (BCE). They do this by changing the multi-hot label vector such that each '1' is divided by the number of labels for the given image (e.g. [0, 1, 1, 0] becomes [0, 0.5, 0.5, 0]).
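For reference, here's a rough sketch of that training-time target transformation as I understand it (PyTorch, with my own variable names and an invented example batch, not taken from the paper):

```python
import torch
import torch.nn.functional as F

# Multi-hot ground truth for a batch of 2 images over 4 classes
# (made-up example, not from the paper).
targets = torch.tensor([[0., 1., 1., 0.],
                        [1., 0., 0., 0.]])

# Divide each '1' by the number of positive labels in its row,
# so every row sums to 1 and can serve as a softmax target.
soft_targets = targets / targets.sum(dim=1, keepdim=True)

logits = torch.randn(2, 4)  # raw model outputs

# Cross-entropy against the soft target distribution:
# -sum(p * log_softmax(logits)), averaged over the batch.
loss = -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```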
However, they do not mention how this is then used at inference time, since it is not clear what threshold should be applied to select the predicted labels.
Does anyone know how this would work?
Solution 1:[1]
I've just stumbled upon this paper as well and asked myself the same question. Here's how I'd go about it.
If there is one ground-truth tag, the ideal predicted vector would have a single 1 and all other predictions at 0. If there are two tags, the ideal prediction would have two entries at 0.5 and all others at 0. It therefore makes sense to sort the predicted values by descending confidence and look at the cumulative probability as we increase the number of candidate tags.
We need to distinguish which option was the (sorted) ground truth:
1, 0, 0, 0, 0, ...
0.5, 0.5, 0, 0, 0, ...
1/3, 1/3, 1/3, 0, 0, ...
1/4, 1/4, 1/4, 1/4, 0, ...
1/5, 1/5, 1/5, 1/5, 1/5, 0, ...
The same tag position could have completely different ground truth values: 1.0 when alone, 0.5 when together with another one, 0.1 with 10 of them, and so on. A fixed threshold couldn't tell which was the correct case.
Instead, we can sort the predicted values in descending order and look at the corresponding cumulative sum. As soon as that sum exceeds a certain number (let's say 0.95), that's the number of tags we predict. Tweaking the exact cumulative-sum threshold provides a way to trade off precision and recall.
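Here's a rough sketch of that inference rule (the function name and the 0.95 threshold are my own choices, just to illustrate the idea):

```python
import torch

def predict_tags(logits, cum_threshold=0.95):
    """Pick tags by accumulating sorted softmax probabilities
    until the cumulative sum exceeds cum_threshold."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Number of tags = smallest k whose cumulative probability
    # passes the threshold (always at least 1 tag).
    k = int((cumulative < cum_threshold).sum().item()) + 1
    return sorted_idx[:k].tolist()

# Example: the model strongly favours two classes -> two tags predicted.
logits = torch.tensor([0.1, 3.0, 2.9, -1.0])
print(predict_tags(logits))  # [1, 2]
```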
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | meferne |