Tokenization with Hugging Face BartTokenizer
I am trying to use a pretrained BART model to train a pointer-generator network with the Hugging Face Transformers library. Here is an example input/target pair for the task:
from transformers import BartTokenizer
source = "remind me to write thank you letters to invited"
target = "[IN:CREATE_REMINDER remind [SL:PERSON_REMINDED me ] to [SL:TODO write thank you letters to invited ] ]"
First, I added the special tokens to the tokenizer:
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
ontologyToken = ["[IN:CREATE_REMINDER", "[SL:PERSON_REMINDED", "[SL:TODO", "]"]  # etc.
for item in ontologyToken:
    tokenizer.add_tokens(item, special_tokens=True)
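As a quick sanity check (a sketch; the exact ids depend on the base vocabulary size, which is 50265 for bart-large), each added ontology token should now map to a single id at the end of the vocabulary:

print(tokenizer.convert_tokens_to_ids("[IN:CREATE_REMINDER"))  # 50265
print(len(tokenizer))  # base vocabulary size plus the added tokens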
Then I tried to tokenize the source and the target:
sourceToken = tokenizer(source)["input_ids"]
targetToken = tokenizer(target)["input_ids"]
print(sourceToken)
print(targetToken)
Output (the highlighted ids are the added special tokens):
sourceToken: [0, 5593, 2028, 162, 7, 3116, 3392, 47, 5430, 7, 4036, 2]
targetToken: [0, **50265**, 5593, 2028, **50266**, 1794, **742**, 560, **50267**, 29631, 3392, 47, 5430, 7, 4036, **742**, **742**, 2]
My model is a pointer-generator network: it computes attention over the source, points to a specific source token, and copies that token as an output token. This means every token in sourceToken must also appear in targetToken. But clearly not every source token is present in the target, because the two sequences have been tokenized differently.
In other words, I want the target to be tokenized so that, if all the special tokens were removed, the remaining target ids would be identical to the source ids.
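Concretely, the invariant I need is the following check (a sketch; special_ids is just the set of ids for the ontology tokens added above):

special_ids = set(tokenizer.convert_tokens_to_ids(ontologyToken))
stripped = [t for t in targetToken if t not in special_ids]
print(stripped == sourceToken)  # currently False; the goal is True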
So I decoded the ids to see what was going on:
print([tokenizer.decode(x) for x in sourceToken])
print([tokenizer.decode(x) for x in targetToken])
Output:
['<s>', 'rem', 'ind', ' me', ' to', ' write', ' thank', ' you', ' letters', ' to', ' invited', '</s>']
['<s>', '[IN:CREATE_REMINDER', 'rem', 'ind', '[SL:PERSON_REMINDED', 'me', ']', 'to', '[SL:TODO', 'write', ' thank', ' you', ' letters', ' to', ' invited', ']', ']', '</s>']
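The cause is easier to see with convert_ids_to_tokens, which shows the raw BPE pieces including the leading-space marker "Ġ" that decode() renders as a space:

print(tokenizer.convert_ids_to_tokens(sourceToken))  # ..., 'Ġme', 'Ġto', 'Ġwrite', ...
print(tokenizer.convert_ids_to_tokens(targetToken))  # ..., 'me', 'to', 'write', ... (no 'Ġ' after a special token)

Note also that "]" never got a new id (742 is below 50265), presumably because add_tokens skips tokens that already exist in the base vocabulary.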
We can see that every word that comes directly after a special token is tokenized differently. For example, in sourceToken the word "me" is tokenized as " me", with a leading space, while in targetToken it is not. How do I make the target tokenize the same way as the source?
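One workaround that might work (a sketch, not a confirmed fix; tokenize_like_source is a hypothetical helper) is to split the target on the special tokens, tokenize each plain-text segment with an explicit leading space so mid-sentence words keep their "Ġ" marker, and splice the special-token ids back in:

import re

def tokenize_like_source(target, tokenizer, special_tokens):
    # Split on the ontology tokens, keeping them as separate parts.
    pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
    parts = [p for p in re.split(pattern, target) if p.strip()]
    ids = [tokenizer.bos_token_id]
    first_plain = True  # the sentence-initial word has no leading space
    for part in parts:
        if part in special_tokens:
            ids.append(tokenizer.convert_tokens_to_ids(part))
        else:
            # Re-attach the leading space the split removed so that each
            # mid-sentence word tokenizes with its "Ġ" prefix, exactly as
            # it does inside the source sentence.
            text = part.strip() if first_plain else " " + part.strip()
            first_plain = False
            ids.extend(tokenizer(text, add_special_tokens=False)["input_ids"])
    ids.append(tokenizer.eos_token_id)
    return ids

targetToken = tokenize_like_source(target, tokenizer, ontologyToken)

With this, removing the special-token ids from targetToken leaves exactly sourceToken. The lstrip/rstrip options of tokenizers.AddedToken, which control how an added token absorbs surrounding whitespace, may also be worth experimenting with.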
Sources
Source: Stack Overflow, licensed under CC BY-SA 3.0 per its attribution requirements.