Why does DETR need an empty class?

Why does DETR need an empty class? It defines a "background" class that means "no object". Why?



Solution 1:[1]

TL;DR

By default, DETR always predicts 100 bounding boxes. The empty class is used to filter out the meaningless ones.

Full explanation

If you look at the source code, the transformer decoder transforms each query from self.query_embed.weight into the output hs:

hs = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])[0]

Then a linear layer self.class_embed maps hs to the class logits outputs_class, which have num_classes + 1 entries per query (the extra entry is the empty class). Another linear layer self.bbox_embed maps the same hs to the bounding box outputs_coord:

outputs_class = self.class_embed(hs)
outputs_coord = self.bbox_embed(hs).sigmoid()
out = {'pred_logits': outputs_class[-1], 'pred_boxes': outputs_coord[-1]}
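
To make the box branch more concrete, here is a minimal sketch in plain Python (no PyTorch needed) of what the sigmoid does to one query's four raw outputs. It assumes, as in the reference DETR repository, that the four values are normalized (center x, center y, width, height). The helper name to_pixel_xyxy and the image size are hypothetical, chosen only for illustration.

```python
import math

def sigmoid(x):
    """Squash a raw value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def to_pixel_xyxy(raw_box, img_w, img_h):
    """Map 4 raw head outputs to pixel (x_min, y_min, x_max, y_max).

    Assumption: after the sigmoid, the values are normalized
    (center_x, center_y, width, height), relative to the image size.
    """
    cx, cy, w, h = (sigmoid(v) for v in raw_box)
    x_min = (cx - 0.5 * w) * img_w
    y_min = (cy - 0.5 * h) * img_h
    x_max = (cx + 0.5 * w) * img_w
    y_max = (cy + 0.5 * h) * img_h
    return x_min, y_min, x_max, y_max

# Raw outputs of 0 become 0.5 after the sigmoid: a centered box
# covering half the image width and half the image height.
box = to_pixel_xyxy([0.0, 0.0, 0.0, 0.0], img_w=640, img_h=480)
print(box)
```

Every query produces such a box, whether or not it corresponds to a real object, which is why the class output is needed to decide which boxes to keep.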

The number of bounding boxes is set to num_queries (by default 100).

detr = DETR(backbone_with_pos_enc, transformer, num_classes=num_classes, num_queries=100)

As you can see, DETR always outputs 100 bounding boxes (one per query), even when there is only one object in the image. Without the empty class, there would be no way to mark the surplus boxes as meaningless.

Now consider the example below. There are only two meaningful objects (two birds), but DETR still predicts 100 bounding boxes. The 98 boxes assigned to the empty class are discarded (the green box and the blue box below, plus the remaining 96 boxes not shown in the picture). Only the red box and the yellow box, whose output class is "bird", are meaningful and hence kept as predictions.

[Image: two birds, with red and yellow boxes labeled "bird" and green and blue boxes assigned to the empty class]
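
The filtering described above can be sketched in plain Python (no PyTorch needed). This is an illustration, not the code from the DETR repository: it assumes the empty class is the last of the num_classes + 1 logits, and the function name keep_real_objects, the 0.7 threshold, and the toy logits and boxes are all made up for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def keep_real_objects(pred_logits, pred_boxes, threshold=0.7):
    """Return (class_index, score, box) for queries that found an object.

    pred_logits: one list of num_classes + 1 logits per query; the LAST
                 entry is assumed to be the empty ("no object") class.
    pred_boxes:  one box per query, in the same order.
    """
    kept = []
    for logits, box in zip(pred_logits, pred_boxes):
        probs = softmax(logits)
        object_probs = probs[:-1]      # ignore the empty-class slot
        score = max(object_probs)
        if score > threshold:          # confident about a real class
            kept.append((object_probs.index(score), score, box))
    return kept

# Toy data: 2 real classes (0 = "bird", 1 = "cat") plus the empty class,
# and only 4 queries instead of 100 to keep the example short.
pred_logits = [
    [6.0, 0.0, 0.0],   # query 0: confident "bird"
    [0.0, 0.0, 6.0],   # query 1: confident empty class
    [5.0, 0.0, 1.0],   # query 2: confident "bird"
    [0.5, 0.0, 4.0],   # query 3: empty class
]
pred_boxes = [
    [0.30, 0.40, 0.20, 0.30],
    [0.50, 0.50, 0.90, 0.90],
    [0.70, 0.45, 0.25, 0.35],
    [0.10, 0.10, 0.05, 0.05],
]

detections = keep_real_objects(pred_logits, pred_boxes)
for cls, score, box in detections:
    print(cls, round(score, 3), box)
```

Only queries 0 and 2 survive; the other two are dominated by the empty class and are dropped, just like the 98 discarded boxes in the bird example.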

That is how DETR predicts a variable number of objects. It can predict any number of objects up to num_queries, but not more. If you want DETR to predict more than 100 objects, say 500, set num_queries to 500 or above. Note that the queries are learned embeddings, so changing num_queries means training the model with the new value.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Raven Cheuk