Pretraining a language model on a small custom corpus

I was curious whether transfer learning can be used for text generation, that is, whether a pre-trained model can be further trained on a specific kind of text.

For example, given a pre-trained BERT model and a small corpus of medical text (or text from any other domain), build a language model that can generate medical text. The assumption is that you do not have a large amount of medical text, which is why you have to use transfer learning.

Putting it as a pipeline, I would describe this as:

  1. Using a pre-trained BERT tokenizer.
  2. Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
  3. Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
  4. Generating text that resembles the text within the small custom corpus.
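Step 2 of this pipeline can be sketched roughly as follows. This is only an illustration, assuming the `transformers` 4.x API and PyTorch: the helper `extend_vocabulary`, the toy vocabulary, and the two medical terms are made up for the example, and a tiny randomly initialised model stands in for real BERT weights so that nothing has to be downloaded.

```python
import os
import tempfile

from transformers import BertConfig, BertForMaskedLM, BertTokenizer


def extend_vocabulary(tokenizer, model, new_tokens):
    """Add unseen tokens to the tokenizer and grow the embedding matrix to match."""
    num_added = tokenizer.add_tokens(new_tokens)
    # The new embedding rows are randomly initialised; they only become
    # useful after further training on the custom corpus.
    model.resize_token_embeddings(len(tokenizer))
    return num_added


# Tiny stand-in so the sketch runs offline. In practice you would load real weights:
#   tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
#   model = BertForMaskedLM.from_pretrained("bert-base-uncased")
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "the", "patient", "has", "a", "fever"]
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    tokenizer = BertTokenizer(vocab_file=vocab_path)

config = BertConfig(vocab_size=len(tokenizer), hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
model = BertForMaskedLM(config)

# Hypothetical domain terms that are missing from the toy vocabulary.
num_added = extend_vocabulary(tokenizer, model, ["tachycardia", "stent"])
```

Note that this extends the original tokenizer rather than replacing it, so the IDs of all existing tokens stay the same and the pre-trained embeddings remain valid.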

Does this sound familiar? Is it possible with hugging-face?



Solution 1:[1]

I have not heard of the pipeline you just mentioned. In order to construct an LM for your use-case, you have basically two options:

  1. Further train a BERT (-base/-large) model on your own corpus. This process is called domain adaptation, as described in this recent paper. It adapts the learned parameters of the BERT model to your specific domain (biomedical text). For this setting, however, you need quite a large corpus for BERT to update its parameters well.

  2. Use a language model that has already been pre-trained on a large amount of domain-specific text, either from scratch or by further training the vanilla BERT model. As you might know, the vanilla BERT model released by Google was trained on Wikipedia and BookCorpus text. Since then, researchers have trained the BERT architecture on other domains as well. These pre-trained models have a deep understanding of domain-specific language, and you may be able to use one of them. For your case, such models include BioBERT, BlueBERT, and SciBERT.

Is it possible with hugging-face?

I am not sure whether the huggingface developers have a robust approach for pre-training a BERT model on custom corpora, since they say their code is still in progress. If you want to do this step, I suggest using Google Research's bert code, which is written in TensorFlow and is robust (it was released by BERT's authors). The exact process is described in their README, under the "Pre-training with BERT" section. This gives you a TensorFlow checkpoint, which can easily be converted to a PyTorch checkpoint if you'd like to work with PyTorch/Transformers.

Solution 2:[2]

It is entirely possible to both pre-train and further pre-train BERT (or almost any other model that is available in the huggingface library).
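As a rough illustration of what further pre-training looks like, here is a minimal sketch of one masked-language-modelling step, assuming the `transformers` 4.x API and PyTorch. The toy vocabulary and corpus are invented, and a tiny randomly initialised model stands in for a real checkpoint so the code runs offline. A real run would load `bert-base-uncased` (or similar) with `from_pretrained` and loop over many batches.

```python
import os
import tempfile

import torch
from transformers import (BertConfig, BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling)

# Toy vocabulary and tokenizer (stand-in for BertTokenizer.from_pretrained(...)).
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the", "patient", "has",
         "a", "fever", "was", "given", "aspirin", "and", "rest"]
with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("\n".join(vocab) + "\n")
    tokenizer = BertTokenizer(vocab_file=vocab_path)

# Tiny model (stand-in for BertForMaskedLM.from_pretrained(...)).
config = BertConfig(vocab_size=len(tokenizer), hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64)
model = BertForMaskedLM(config)

# Made-up "domain corpus", repeated so the batch contains plenty of tokens to mask.
corpus = ["the patient has a fever",
          "the patient was given aspirin",
          "the patient was given aspirin and rest",
          "a fever and rest"] * 8

# The collator pads the batch, randomly masks about 15% of the tokens, and
# sets the labels to -100 everywhere except at the masked positions.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)
batch = collator([tokenizer(text) for text in corpus])

# One optimisation step on the masked-LM loss.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(**batch).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

The objective is the same one BERT was originally trained with, which is why continuing from an existing checkpoint works without any architectural change.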

Regarding the tokenizer: if you are pre-training on a small custom corpus (and therefore starting from a trained BERT checkpoint), you have to use the tokenizer that was used to train BERT. Otherwise the token IDs no longer match the embeddings the checkpoint learned, and you will just confuse the model.

If your use case is text generation (from an initial sentence or part of a sentence), I advise you to check GPT-2 (https://huggingface.co/gpt2). I haven't used GPT-2 myself, but from some basic research I think you can do:

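from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# Load the pre-trained GPT-2 tokenizer and weights (TensorFlow version).
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# The LM-head class is needed for generation: the bare TFGPT2Model returns hidden states only.
model = TFGPT2LMHeadModel.from_pretrained('gpt2')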

and follow this tutorial: https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171 on how to train the gpt-2 model.
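To give an idea of what the training and generation steps boil down to, here is a minimal sketch. It is not taken from the tutorial and makes several assumptions: it uses the PyTorch class `GPT2LMHeadModel` from `transformers` 4.x, a tiny randomly initialised model instead of the real `gpt2` weights, and random token IDs in place of a tokenised corpus, so the generated IDs are meaningless. In practice you would load the model with `from_pretrained("gpt2")` and produce the IDs with `GPT2Tokenizer`.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny stand-in model so the sketch runs offline.
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=32, n_layer=1, n_head=2,
                    bos_token_id=0, eos_token_id=0)
model = GPT2LMHeadModel(config)

# Stand-in for a tokenised domain corpus: 4 sequences of 16 token IDs.
torch.manual_seed(0)
input_ids = torch.randint(1, 100, (4, 16))

# One fine-tuning step. Passing labels=input_ids is enough: the model shifts
# them internally so that each position is trained to predict the next token.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(input_ids=input_ids, labels=input_ids).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Continue a 3-token prompt with greedy decoding (at most 5 new tokens).
model.eval()
prompt = input_ids[:1, :3]
with torch.no_grad():
    generated = model.generate(prompt, attention_mask=torch.ones_like(prompt),
                               max_new_tokens=5, do_sample=False, pad_token_id=0)
```

With real weights and a real tokenizer, `tokenizer.decode(generated[0])` would turn the result back into text.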

Note: I am not sure whether DeBERTa-V3, for example, can be pre-trained as usual. I have checked their GitHub repo, and it seems there is no official pre-training code for V3 (https://github.com/microsoft/DeBERTa/issues/71). However, I think it can be done with huggingface. Once I have time I will run a pre-training script and verify this.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2