How to train a BERT model from scratch with Hugging Face?

I found an answer about training a model from scratch in this question: How to train BERT from scratch on a new domain for both MLM and NSP?

One answer uses Trainer and TrainingArguments like this:

from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir= "/path/to/output/dir/for/training/arguments"
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=16,  # formerly per_gpu_train_batch_size, now deprecated
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model("path/to/your/model")

But the Hugging Face official doc Fine-tuning a pretrained model also uses Trainer and TrainingArguments in the same way to fine-tune. So when I use Trainer and TrainingArguments to train a model, do I train the model from scratch or just fine-tune?
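(For context, the snippet above leaves model, data_collator, and dataset undefined. A minimal sketch of the data side for masked-language-model training might look like the following; the file path is illustrative, and the tokenizer is loaded from a pretrained checkpoint only to keep the sketch short.)

from transformers import BertTokenizerFast, DataCollatorForLanguageModeling, LineByLineTextDataset

# Tokenizer: for a true from-scratch setup you would usually train your own vocabulary;
# a pretrained tokenizer is used here only for brevity.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Dataset: one training sentence per line in a plain-text file (path is illustrative).
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/path/to/corpus.txt",
    block_size=128,
)

# Collator that randomly masks 15% of the tokens to create MLM labels on the fly.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)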



Solution 1:[1]

Hey, first of all, thank you for linking my question; I will do my best to clarify it :)

First of all, there is no big difference between pre-training and fine-tuning. The only difference is that in pre-training you train your model from scratch; in other words, the weights start from some initial value (random or zero). In fine-tuning, you load an already pre-trained model and train it again for a downstream task, so you are essentially initializing the weights from the pre-trained model. That way you can reuse the knowledge captured by the pre-trained model.
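To make this concrete, here is a minimal sketch of the two ways the model variable could be created before it is handed to Trainer (the checkpoint name bert-base-uncased is just an example):

from transformers import BertConfig, BertForMaskedLM

# Pre-training from scratch: the config defines the architecture, weights are randomly initialized.
config = BertConfig()
model = BertForMaskedLM(config)

# Fine-tuning: the same architecture, but weights are loaded from a pre-trained checkpoint.
model = BertForMaskedLM.from_pretrained("bert-base-uncased")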

Let's try to understand the fine-tuning and pre-training architectures. The following diagram shows an overview of the pre-training architecture.

[Diagram: overview of the BERT pre-training architecture]

When you fine-tune a BERT model, you change that task-specific area and the labels. Changing the task-specific area changes the overall architecture: you replace the heads. This change is also reflected in the model names used in Transformers. For example, BertForPreTraining uses both an MLM head and an NSP head at the same time, whereas BertForSequenceClassification uses a single linear layer as its head, just like the NSP head. What they have in common is that they both wrap the BERT model, so we only change the "task specific" area (architecture). This is what the BERT paper means by stating "There is minimal difference between the pre-trained architecture and the final downstream architecture."
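You can see this head-swapping directly in code. The sketch below builds both models from the same config and compares their heads; the attribute names (.bert, .cls, .classifier) are as found in recent Transformers releases and may differ slightly in older ones:

from transformers import BertConfig, BertForPreTraining, BertForSequenceClassification

config = BertConfig()

# Both models wrap the same BertModel backbone; only the task-specific head differs.
pretraining_model = BertForPreTraining(config)
classification_model = BertForSequenceClassification(config)

print(pretraining_model.bert.__class__.__name__)      # BertModel
print(classification_model.bert.__class__.__name__)   # BertModel
print(pretraining_model.cls)                          # MLM head + NSP head
print(classification_model.classifier)                # single Linear layer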

If you want to change BERT's pre-training task, you must change the architecture of the task-specific area and the input labels, just like I did in "Same Sentence Prediction: A new Pre-training Task for BERT" (GitHub repo). The same goes for the fine-tuning process: if you want to customize the fine-tuning architecture for a downstream task, all you need to do is change the architecture of the task-specific area.
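A generic sketch of what "changing the task-specific area" can look like, assuming a hypothetical two-class head on top of the standard BertModel backbone (the class name and head are illustrative, not the exact code from the repo above; self.post_init() is the weight-initialization hook in recent Transformers versions):

import torch.nn as nn
from transformers import BertModel, BertPreTrainedModel

class BertForCustomTask(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config)                 # shared BERT backbone
        self.head = nn.Linear(config.hidden_size, 2)  # new task-specific head (illustrative)
        self.post_init()                              # standard transformers weight initialization

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        outputs = self.bert(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
        )
        pooled = outputs.pooler_output                # [CLS] representation
        return self.head(pooled)                      # logits for the new task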

So it doesn't matter whether you use Trainer for pre-training or fine-tuning: Trainer basically updates the weights of the model according to the training loss. If you use a pre-trained BERT with downstream task-specific heads, it will update the weights of both the BERT model and the task-specific heads (unless you tell it otherwise by freezing the weights of the BERT model). If you use an untrained BERT model with task-specific heads, it will also update the weights.
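If you do want to freeze the BERT backbone and only train the head, a short sketch (assuming the model exposes its backbone as model.bert, as the standard BertFor* classes do):

# Freeze every parameter of the BERT backbone; Trainer will then only update the head.
for param in model.bert.parameters():
    param.requires_grad = False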

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: Khan9797