Hugging Face sequence classification: unfreezing layers

I am using Longformer for sequence classification on a binary problem, and I have downloaded the required model files.

from transformers import LongformerForSequenceClassification, LongformerTokenizerFast

# load model and tokenizer and define length of the text sequence
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096',
                                                            gradient_checkpointing=False,
                                                            attention_window=512)
tokenizer = LongformerTokenizerFast.from_pretrained('allenai/longformer-base-4096', max_length=1024)
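
For context, a minimal sketch of a single forward pass for the binary task (the input text and label below are placeholders of mine, not from the original setup):

import torch

# Hypothetical single example; labels for the binary head are 0 or 1.
enc = tokenizer("some long document ...", truncation=True, max_length=1024,
                return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**enc, labels=labels)
print(outputs.loss)    # cross-entropy loss against the label
print(outputs.logits)  # shape (1, 2): one logit per class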

Then, to see which parameters are trainable, I ran the code below:

for name, param in model.named_parameters():
    print(name, param.requires_grad)


longformer.embeddings.word_embeddings.weight True
longformer.embeddings.position_embeddings.weight True
longformer.embeddings.token_type_embeddings.weight True
longformer.embeddings.LayerNorm.weight True
longformer.embeddings.LayerNorm.bias True
longformer.encoder.layer.0.attention.self.query.weight True
longformer.encoder.layer.0.attention.self.query.bias True
longformer.encoder.layer.0.attention.self.key.weight True
longformer.encoder.layer.0.attention.self.key.bias True
longformer.encoder.layer.0.attention.self.value.weight True
longformer.encoder.layer.0.attention.self.value.bias True
longformer.encoder.layer.0.attention.self.query_global.weight True
longformer.encoder.layer.0.attention.self.query_global.bias True
longformer.encoder.layer.0.attention.self.key_global.weight True
longformer.encoder.layer.0.attention.self.key_global.bias True
longformer.encoder.layer.0.attention.self.value_global.weight True
longformer.encoder.layer.0.attention.self.value_global.bias True
longformer.encoder.layer.0.attention.output.dense.weight True
longformer.encoder.layer.0.attention.output.dense.bias True
longformer.encoder.layer.0.attention.output.LayerNorm.weight True
longformer.encoder.layer.0.attention.output.LayerNorm.bias True
longformer.encoder.layer.0.intermediate.dense.weight True
longformer.encoder.layer.0.intermediate.dense.bias True
longformer.encoder.layer.0.output.dense.weight True
longformer.encoder.layer.0.output.dense.bias True
longformer.encoder.layer.0.output.LayerNorm.weight True
longformer.encoder.layer.0.output.LayerNorm.bias True
... (encoder layers 1 through 10 repeat the same pattern, all True) ...
longformer.encoder.layer.11.attention.self.query.weight True
longformer.encoder.layer.11.attention.self.query.bias True
longformer.encoder.layer.11.attention.self.key.weight True
longformer.encoder.layer.11.attention.self.key.bias True
longformer.encoder.layer.11.attention.self.value.weight True
longformer.encoder.layer.11.attention.self.value.bias True
longformer.encoder.layer.11.attention.self.query_global.weight True
longformer.encoder.layer.11.attention.self.query_global.bias True
longformer.encoder.layer.11.attention.self.key_global.weight True
longformer.encoder.layer.11.attention.self.key_global.bias True
longformer.encoder.layer.11.attention.self.value_global.weight True
longformer.encoder.layer.11.attention.self.value_global.bias True
longformer.encoder.layer.11.attention.output.dense.weight True
longformer.encoder.layer.11.attention.output.dense.bias True
longformer.encoder.layer.11.attention.output.LayerNorm.weight True
longformer.encoder.layer.11.attention.output.LayerNorm.bias True
longformer.encoder.layer.11.intermediate.dense.weight True
longformer.encoder.layer.11.intermediate.dense.bias True
longformer.encoder.layer.11.output.dense.weight True
longformer.encoder.layer.11.output.dense.bias True
longformer.encoder.layer.11.output.LayerNorm.weight True
longformer.encoder.layer.11.output.LayerNorm.bias True
classifier.dense.weight True
classifier.dense.bias True
classifier.out_proj.weight True
classifier.out_proj.bias True

My questions:

  1. Why is param.requires_grad True for all layers? Shouldn't it be False at least for the classifier.* layers? Aren't we training those?
  2. Does param.requires_grad == True mean that the particular layer is frozen? I am confused by the wording requires_grad. Does it mean frozen?
  3. If I want to train only some of the layers, should I use code like the below to freeze the others?

for name, param in model.named_parameters():
    if name.startswith("..."):  # choose whatever prefix you like here
        param.requires_grad = False

  4. Considering that training takes a lot of time, is there a specific recommendation on which layers to train? To begin with, I am planning to train all layers whose names start with longformer.encoder.layer.11., plus the classifier head (see the sketch after this list):

`classifier.dense.weight`
`classifier.dense.bias`
`classifier.out_proj.weight`
`classifier.out_proj.bias`
  5. Do I need to add any additional layers, such as dropout, or is that already taken care of by LongformerForSequenceClassification.from_pretrained? I am not seeing any dropout layers in the output above, which is why I am asking.
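
For question 4, here is a minimal sketch of that selective-freezing plan, assuming the parameter names printed above (the prefix tuple is my own choice, not an official recipe):

# Freeze everything except the last encoder layer and the classification head.
TRAINABLE_PREFIXES = ("longformer.encoder.layer.11.", "classifier.")
for name, param in model.named_parameters():
    param.requires_grad = name.startswith(TRAINABLE_PREFIXES)

Re-running the named_parameters() loop afterwards should then print True only for the layer-11 and classifier tensors.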

#------------------ Update 1

How can I tell which layers are frozen by the code below, from @joe32140's answer? My guess is that everything except the last four parameters in the output shown in my original question gets frozen, but is there an easier way to check?

for param in model.base_model.parameters():
    param.requires_grad = False
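
One easy way to check (a small sketch of mine, not from the thread): group the parameter names by their requires_grad flag after freezing.

frozen    = [n for n, p in model.named_parameters() if not p.requires_grad]
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(frozen)} frozen / {len(trainable)} trainable parameter tensors")
print("still trainable:", trainable)  # expected here: only the classifier.* tensors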


Solution 1:

  1. requires_grad == True means that we will compute the gradient of this tensor, so the default setting is to train/finetune all layers.
  2. You can train only the output layer by freezing the encoder with:

for param in model.base_model.parameters():
    param.requires_grad = False

  3. Yes, dropout is used in the huggingface output-layer implementation; see https://github.com/huggingface/transformers/blob/198c335d219a5eb4d3f124fdd1ce1a9cd9f78a9b/src/transformers/models/longformer/modeling_longformer.py#L1938. Dropout modules have no learnable parameters, which is why no dropout entries appear in the named_parameters() output.

  4. As for update 1: yes, base_model refers to the layers excluding the output classification head. However, the head is actually two layers rather than four; each layer has one weight and one bias tensor.
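
To see those two layers directly, one could print the head module (a quick check of mine; the exact repr can vary across transformers versions):

print(model.classifier)
# Expect a dense Linear and an out_proj Linear (plus a parameter-free Dropout),
# matching the four classifier.* tensors listed in the question's output.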
