Solving "CUDA out of memory" when fine-tuning GPT-2 (HuggingFace)
I keep getting a recurring CUDA out of memory error when using the HuggingFace Transformers library to fine-tune a GPT-2 model and can't seem to solve it, even though my GPU has 6 GB of memory, which I thought should be enough for fine-tuning on text. The error reads as follows:
File "GPT\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "GPT\lib\site-packages\transformers\modeling_utils.py", line 1763, in forward
x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 6.00 GiB total capacity; 4.28 GiB already allocated; 24.50 MiB free; 4.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
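(For what it's worth, the allocator hint the error mentions can be set via an environment variable before anything touches the GPU; a minimal sketch, assuming training is launched from a Python script, with 128 as a purely illustrative value:)
import os
# Must be set before the CUDA caching allocator is initialised,
# i.e. before the first tensor is moved to the GPU.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
import torch  # imported only after the variable is set
# ... rest of the fine-tuning script as before ...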
I already set the batch size as low as 2 and reduced the number of training examples, without success. I also tried to move the code to Colab, where the 12 GB of RAM were quickly consumed as well. My examples are rather long, some running to 2,400 characters, but they should be truncated by the model automatically. My (German) examples look like this:
Er geht in fremde Wohnungen, balgt sich mit Freund und Feind, ist
zudringlich zu unsern Sämereien und Kirschen. Wenn die Gesellschaft nicht groß
ist, lasse ich sie gelten und streue ihnen sogar Getreide. Sollten sie hier
aber doch zu viel werden, so hilft die Windbüchse, und sie werden in den
Meierhof hinabgescheucht. Als einen bösen Feind zeigte sich der Rotschwanz. Er
flog zu dem Bienenhause und schnappte die Tierchen weg. Da half nichts, als ihn
ohne Gnade mit der Windbüchse zu töten.
Ich wollte
Ihnen mein Wort halten, liebe Mama, aber die Versuchung war zu groß. Da bin ich
eines Abends in den Keller gegangen und hab' aus allen Fässern den Spund
herausgeklopft. Bis auf den letzten Tropfen ist das Gift ausgeronnen aus den
Fässern. Der Schade war groß, aber der Teufel war aus dem Haus. «
Andor lachte. »Mama, das Geschrei hätten Sie hören sollen! Als ob der
Weltuntergang gekommen wäre. Er bedauerte beinahe seine
Schroffheit. Nun, nachlaufen wird er ihnen nicht, die werden schon selber
kommen. Aber bewachen wird er seine Kolonie bei Tag und bei Nacht lassen
müssen. Hol' der Teufel diesen Mercy. Muß der gerade in Högyész ein Kastell
haben. Wenn einer von den Schwarzwäldern dahin kommt und ihn verklagt.
Is there maybe a problem with the data formatting? If anyone has a hint on how to solve this, it would be very welcome.
EDIT: Thank you, Timbus Calin, for the answer. As described in the comments, adding the block_size
flag to the config.json solved the problem. Here is the whole configuration for reference:
{
"model_name_or_path": "dbmdz/german-gpt2",
"train_file": "Fine-Tuning Dataset/train.txt",
"validation_file": "Fine-Tuning Dataset/test.txt",
"output_dir": "Models",
"overwrite_output_dir": true,
"per_device_eval_batch_size": 8,
"per_device_train_batch_size": 8,
"block_size": 100,
"task_type": "text-generation",
"do_train": true,
"do_eval": true
}
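As I understand it, block_size is the number of tokens per training chunk, which is what actually bounds the per-example memory. A rough sketch of the idea (tokenizer name and file path taken from the config above; the chunking only mirrors what the language-modeling example scripts do, it is not their actual code):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")

# A ~2,400-character example produces far more than 100 tokens.
long_example = open("Fine-Tuning Dataset/train.txt", encoding="utf-8").read()[:2400]
ids = tokenizer(long_example)["input_ids"]
print(len(ids))

# With block_size = 100 the token stream is concatenated and re-split into
# fixed chunks of 100 tokens, so no single training sample exceeds that length.
block_size = 100
chunks = [ids[i:i + block_size] for i in range(0, len(ids), block_size)]
print(len(chunks), "chunks of at most", block_size, "tokens")
If you are fine-tuning via the run_clm.py example script (or a similar wrapper), such a JSON file can, as far as I know, be passed directly as the single command-line argument, e.g. python run_clm.py config.json.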
Solution 1:[1]
- If the memory problems still persist, you could opt for DistilGPT2, as it has 33% fewer parameters than GPT-2 (and its forward pass is roughly twice as fast). Particularly for a small GPU with 6 GB of VRAM, it could be a solution/alternative to your problem.
- At the same time, it depends on how you preprocess the data. The model can "receive" at most N tokens (for example 512 or 768), depending on the model you choose. I recently trained a named entity recognition model with a maximum length of 768 tokens. However, when I manually set the padded sequence length in my PyTorch DataLoader() to a large number, I also got OOM errors (even on a 3090 with 24 GB of VRAM). Once I reduced the sequence length to a much smaller value (512 instead of 768, for example), training started to work and I did not run into any memory issues.
TL;DR: Reducing the number of tokens per sequence in the preprocessing phase, regardless of the network's maximum capacity, can also help to solve your memory problem.
Note that reducing the number of tokens to process in a sequence is different from reducing the dimension of a single token.
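As a rough illustration of that advice (placeholder values, not the exact NER setup described above), capping the sequence length during tokenization is usually enough:
from transformers import AutoTokenizer

# The question's model; "distilgpt2" would be the smaller alternative mentioned above.
tokenizer = AutoTokenizer.from_pretrained("dbmdz/german-gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers define no pad token by default

def tokenize(batch_of_texts):
    # Truncation plus a smaller max_length is what keeps the batches small,
    # even if the model itself could accept longer sequences.
    return tokenizer(batch_of_texts, truncation=True, max_length=512, padding="max_length")

encoded = tokenize(["Er geht in fremde Wohnungen, balgt sich mit Freund und Feind, ..."])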
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow