Tokenization with Hugging Face BartTokenizer

I am trying to use a pretrained BART model to train a pointer-generator network with the Hugging Face Transformers library. An example input/target pair for the task:

from transformers import BartTokenizer

source = "remind me to write thank you letters to invited"
target = "[IN:CREATE_REMINDER remind [SL:PERSON_REMINDED me ] to [SL:TODO write thank you letters to invited ] ]"

First, I added my special tokens to the tokenizer:

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
ontologyToken = ["[IN:CREATE_REMINDER", "[SL:PERSON_REMINDED", "[SL:TODO", "]"]  # etc.
for item in ontologyToken:
    tokenizer.add_tokens(item, special_tokens=True)
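
As an aside, since the added tokens get ids beyond the original vocabulary (50265 and up), the model's embedding matrix also has to grow before training. A minimal sketch, assuming a standard BartForConditionalGeneration model:

from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
# grow the input/output embeddings to cover the newly added special tokens
model.resize_token_embeddings(len(tokenizer))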

Then I tried to tokenize them:

sourceToken = tokenizer(source)["input_ids"]
targetToken = tokenizer(target)["input_ids"]
print(sourceToken)
print(targetToken)

Output (the ids in bold are the special tokens):

sourceToken: [0, 5593, 2028, 162, 7, 3116, 3392, 47, 5430, 7, 4036, 2]
targetToken: [0, **50265**, 5593, 2028, **50266**, 1794, **742**, 560, **50267**, 29631, 3392, 47, 5430, 7, 4036, **742**, **742**, 2]

Since my model is a pointer-generator network, it computes attention over the source tokens and can point at a specific source token and copy it as an output token, so the target token sequence has to contain all of the tokens present in the source sequence. But clearly not all of the source tokens are present in the target token sequence, because the two strings appear to have been tokenized differently.

In other words, I want my target to be tokenized so that, if all of my special tokens were removed, the remaining target token sequence would be identical to the source token sequence.
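
To make that requirement concrete, here is a quick check (assuming ontologyToken holds every added token): stripping the special ids out of targetToken should reproduce sourceToken exactly, but with the tokenization above it does not:

specialIds = {tokenizer.convert_tokens_to_ids(t) for t in ontologyToken}
stripped = [t for t in targetToken if t not in specialIds]
print(stripped == sourceToken)  # should be True, but prints False here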

So I decoded them to see what was going on:

print([tokenizer.decode(x) for x in sourceToken])
print([tokenizer.decode(x) for x in targetToken])

Output:

['<s>', 'rem', 'ind', ' me', ' to', ' write', ' thank', ' you', ' letters', ' to', ' invited', '</s>']
['<s>', '[IN:CREATE_REMINDER', 'rem', 'ind', '[SL:PERSON_REMINDED', 'me', ']', 'to', '[SL:TODO', 'write', ' thank', ' you', ' letters', ' to', ' invited', ']', ']', '</s>']
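
Note that decode() hides BART's byte-level space marker. Printing the raw vocabulary entries with convert_ids_to_tokens makes the difference explicit, since a leading 'Ġ' marks a token that begins with a space (e.g. 'Ġme' in the source versus 'me' in the target):

print(tokenizer.convert_ids_to_tokens(sourceToken))
print(tokenizer.convert_ids_to_tokens(targetToken))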

We can see that every word that comes right after a special token is tokenized differently. For example, in sourceToken the word "me" is tokenized as " me", with a leading space, while in targetToken it is tokenized as "me", without one. How do I make the target tokenize the same way as the source?
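
One workaround I can think of (a sketch, not necessarily the canonical fix): tokenize both sequences word by word, giving every plain word an explicit leading space so BPE picks the same " word" pieces on both sides, and mapping the ontology tokens straight to their ids:

def encode_words(words):
    ids = [tokenizer.bos_token_id]
    for w in words:
        if w in ontologyToken:
            # added tokens always map to a single fixed id
            ids.append(tokenizer.convert_tokens_to_ids(w))
        else:
            # the leading space forces the " word" BPE variant on both sides
            ids.extend(tokenizer(" " + w, add_special_tokens=False)["input_ids"])
    ids.append(tokenizer.eos_token_id)
    return ids

sourceToken = encode_words(source.split())
targetToken = encode_words(target.split())

This makes the first word tokenize with a leading space as well, so the ids differ from a plain tokenizer(source) call, but since source and target now go through the identical per-word encoding, stripping the special ids from targetToken yields exactly sourceToken.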



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow