PyTorch nn.CrossEntropyLoss() always returns 0

I am building a multi-class Vision Transformer network. When I pass my values through my loss function, it always returns zero. My output layer consists of 37 dense layers with a softmax unit on each one of them. criterion is created with nn.CrossEntropyLoss(). The output of criterion is 0.0 for every iteration. I am using a Colab notebook. I printed out the output and label for one iteration:

for output, label in zip(iter(ouputs_t), iter(labels_t)):
    loss += criterion(
        output,
        # reshape label from (Batch_Size) to (Batch_Size, 1)
        torch.reshape(label, (label.shape[0], 1)),
    )

output: tensor([[0.1534],
        [0.5797],
        [0.6554],
        [0.4066],
        [0.2683],
        [0.1773],
        [0.7410],
        [0.5136],
        [0.5695],
        [0.3970],
        [0.4317],
        [0.7216],
        [0.8336],
        [0.4517],
        [0.4004],
        [0.5963],
        [0.3079],
        [0.5956],
        [0.3876],
        [0.2327],
        [0.7919],
        [0.2722],
        [0.3064],
        [0.9779],
        [0.8358],
        [0.1851],
        [0.2869],
        [0.3128],
        [0.4301],
        [0.4740],
        [0.6689],
        [0.7588]], device='cuda:0', grad_fn=<UnbindBackward0>)

label: tensor([[0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [0.]], device='cuda:0')

My Model:

class vit_large_patch16_224_multiTaskNet(nn.Module):
    def __init__(self, output_classes, frozen_feature_layers=False):
        super().__init__()
        
        vit_base_patch16_224 = timm.create_model('vit_large_patch16_224',pretrained=True)
        self.is_frozen = frozen_feature_layers
        # here we get all the modules(layers) before the fc layer at the end
        self.features = nn.ModuleList(vit_base_patch16_224.children())[:-1]
        self.features = nn.Sequential(*self.features)
        if frozen_feature_layers:
            self.freeze_feature_layers()

        # now lets add our new layers 
        in_features = vit_base_patch16_224.head.in_features
        # it helps with performance. you can play with it
        # create more layers, play/experiment with them. 
        self.fc0 = nn.Linear(in_features, 512)
        self.bn_pu = nn.BatchNorm1d(512, eps = 1e-5)
        self.output_modules = nn.ModuleList()
        for i in range(output_classes):
            self.output_modules.append(nn.Linear(512, 1))
        # initialize all fc layers to xavier
        for m in self.modules():
            if isinstance(m, nn.Linear):
                torch.nn.init.xavier_normal_(m.weight, gain = 1)


    def forward(self, input_imgs):
        output = self.features(input_imgs)
        final_cs_token = output[:, 0]
        output = self.bn_pu(F.relu(self.fc0(final_cs_token)))
        output_list = list()
        for output_modul in self.output_modules:
            output_list.append(torch.sigmoid(output_modul(output)))
        # Convert List to Tensor
        output_tensor = torch.stack(output_list)
        # swap axes so the batch dimension comes first: (labels, batch, 1) -> (batch, labels, 1)
        output_tensor = torch.swapaxes(output_tensor, 0, 1)
        return output_tensor
    
    def _set_freeze_(self, status):
        for n,p in self.features.named_parameters():
            p.requires_grad = status
        # for m in self.features.children():
        #     for p in m.parameters():
        #         p.requires_grad=status    


    def freeze_feature_layers(self):
        self._set_freeze_(False)

    def unfreeze_feature_layers(self):
        self._set_freeze_(True)


Solution 1:[1]

You are in a multi-label classification scenario (each instance can belong to several classes at once), which means you can treat your problem as c binary classifications done in parallel (where c is the total number of classes). Let output_t be the logit tensor containing the values output by your model's last linear layer, and target the ground-truth tensor containing the true class states for each instance in the batch. You can apply nn.BCEWithLogitsLoss, since it works with multi-dimensional tensors out of the box.

With dummy inputs:

>>> output_t = torch.rand(47, 32, 1)
>>> target = torch.randint(0, 2, (47, 32, 1)).float()

Then initializing and calling the loss function:

>>> loss = nn.BCEWithLogitsLoss()
>>> loss(output_t, target)
tensor(0.7246)
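
To see why the original setup yields zero, here is a minimal sketch using random stand-in tensors rather than the asker's real model outputs. The sizes (32 for the batch, 37 for the labels) come from the question; the names scores, label, logits and targets are illustrative assumptions. nn.CrossEntropyLoss treats dimension 1 as the class dimension, so a (batch, 1) input has a single class, its softmax is always 1, and the log of that is 0 regardless of the label. Passing a float target of the same shape as the input (interpreted as class probabilities) assumes a reasonably recent PyTorch version.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, num_labels = 32, 37  # sizes taken from the question

# One per-label slice, shaped (batch, 1) like the tensors printed in the question.
# These are random stand-ins, not real model outputs.
scores = torch.rand(batch_size, 1)
label = torch.randint(0, 2, (batch_size, 1)).float()

# Dim 1 is the class dimension, so there is exactly one class here.
# log_softmax over a single class is log(1) = 0, hence the loss is 0.
ce_loss = nn.CrossEntropyLoss()(scores, label)

# Multi-label alternative: one raw logit per label, no sigmoid applied beforehand.
logits = torch.randn(batch_size, num_labels, 1)
targets = torch.randint(0, 2, (batch_size, num_labels, 1)).float()
bce_loss = nn.BCEWithLogitsLoss()(logits, targets)

# BCEWithLogitsLoss combines a sigmoid with binary cross-entropy,
# so it should match sigmoid followed by nn.BCELoss up to rounding.
manual_loss = nn.BCELoss()(torch.sigmoid(logits), targets)

print("cross-entropy on a single-class slice:", ce_loss.item())
print("BCE with logits over all labels:", bce_loss.item())
```

Note that nn.BCEWithLogitsLoss expects raw logits. The forward method in the question applies torch.sigmoid to each head, so one option would be to drop that call when switching to this loss; another would be to keep it and use nn.BCELoss instead.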

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ivan