- BPE was introduced by Sennrich et al. in the paper "Neural Machine Translation of Rare Words with Subword Units".
The first step in BPE is to split the text into words using a pre-tokenizer. After word tokenization, let's assume we have the following words with their frequencies:
[('car', 5), ('cable', 3), ('tablet', 1), ('watch', 2), ('chair', 5), ('mouse', 1)]
The desired vocabulary size is a hyperparameter for BPE. In this example, let's assume we want a total of 17 tokens in the vocabulary. All the unique characters and symbols in the words are included as the base vocabulary, which here would be
['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w'] size=14
Next, all the words are split into the base vocabulary characters, which can be represented as follows:
[('c','a','r', 5), ('c','a','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','c','h', 2), ('c','h','a','i','r', 5), ('m','o','u','s','e', 1)]
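The setup above can be sketched in a few lines of Python (a minimal illustration using the example corpus; the variable names are my own):

```python
# Word frequencies from the example corpus.
word_freqs = {"car": 5, "cable": 3, "tablet": 1, "watch": 2, "chair": 5, "mouse": 1}

# Base vocabulary: every unique character across all words, sorted.
base_vocab = sorted({ch for word in word_freqs for ch in word})
print(len(base_vocab))  # 14

# Each word split into single-character symbols.
splits = {word: tuple(word) for word in word_freqs}
print(splits["car"])    # ('c', 'a', 'r')
```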
The BPE algorithm then counts the occurrence of every adjacent symbol pair and picks the one with the highest frequency. In the above example, the pair 'ca' occurs 5 times in car and 3 times in cable, for a total of 8 occurrences, the highest of all pairs. It is followed by 7 occurrences of 'ch' (2 from watch and 5 from chair), and so on. So 'c' and 'a' are merged, and 'ca' is added to the vocabulary:
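The pair-counting step can be sketched as follows (a simplified illustration; each pair's count is weighted by the frequency of the word it appears in):

```python
from collections import Counter

word_freqs = {"car": 5, "cable": 3, "tablet": 1, "watch": 2, "chair": 5, "mouse": 1}
splits = {word: tuple(word) for word in word_freqs}

# Count every adjacent symbol pair, weighted by word frequency.
pair_counts = Counter()
for word, freq in word_freqs.items():
    symbols = splits[word]
    for left, right in zip(symbols, symbols[1:]):
        pair_counts[(left, right)] += freq

print(pair_counts[('c', 'a')])  # 8 (5 from car + 3 from cable)
print(pair_counts[('c', 'h')])  # 7 (2 from watch + 5 from chair)
```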
['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca'] size=15
And the tokenized words become
[('ca','r', 5), ('ca','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','c','h', 2), ('c','h','a','i','r', 5), ('m','o','u','s','e', 1)]
The next highest occurrence is 'ch', which is added to the vocabulary, and all adjacent occurrences of 'c' and 'h' are merged.
Vocab: ['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca', 'ch'] size=16
Tokenized input: [('ca','r', 5), ('ca','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','ch', 2), ('ch','a','i','r', 5), ('m','o','u','s','e', 1)]
Since the target vocab size is 17, BPE will choose the next most frequent pair, 'ca' and 'r', which occurs 5 times. They are merged, and 'car' is added to the vocabulary.
Final vocab: ['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca', 'ch', 'car'] size=17
Final tokenized input: [('car', 5), ('ca','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','ch', 2), ('ch','a','i','r', 5), ('m','o','u','s','e', 1)]
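The whole training procedure above can be sketched as a single loop (a simplified illustration, not a production tokenizer; ties between equally frequent pairs are broken by whichever pair was counted first):

```python
from collections import Counter

def train_bpe(word_freqs, vocab_size):
    """Repeatedly merge the most frequent adjacent pair until the
    vocabulary reaches vocab_size. Returns (vocab, merges, splits)."""
    vocab = sorted({ch for w in word_freqs for ch in w})
    splits = {w: tuple(w) for w in word_freqs}
    merges = []
    while len(vocab) < vocab_size:
        # Count all adjacent pairs, weighted by word frequency.
        counts = Counter()
        for w, f in word_freqs.items():
            syms = splits[w]
            for pair in zip(syms, syms[1:]):
                counts[pair] += f
        if not counts:
            break
        best = max(counts, key=counts.get)
        merges.append(best)
        vocab.append(best[0] + best[1])
        # Apply the new merge to every word.
        for w, syms in splits.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            splits[w] = tuple(out)
    return vocab, merges, splits

word_freqs = {"car": 5, "cable": 3, "tablet": 1, "watch": 2, "chair": 5, "mouse": 1}
vocab, merges, splits = train_bpe(word_freqs, 17)
print(merges)          # [('c', 'a'), ('c', 'h'), ('ca', 'r')]
print(splits["car"])   # ('car',)
```

Running this on the example reproduces the three merges and the final tokenization shown above.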
Now that BPE has been trained, the same merges are applied, in order, to new words. Say we get a new word 'cab': it gets tokenized into ['ca', 'b']. However, if the new word is 'card', it gets split into ['car', '[UNK]'], since the letter 'd' is not in the vocabulary. In practice this rarely happens for letters, because every character of the alphabet usually occurs in the training corpus at least once. However, an [UNK] (unknown) token may still be emitted if a symbol such as a punctuation mark or a digit was not added to the vocabulary but appears in a new word.
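Applying the learned merges to a new word can be sketched as follows (an illustration using the merges and vocabulary from the example; mapping out-of-vocabulary symbols to an "[UNK]" token is a common convention, not something prescribed by BPE itself):

```python
# Merges and vocabulary learned in the worked example above.
merges = [("c", "a"), ("c", "h"), ("ca", "r")]
vocab = {"a", "b", "c", "e", "h", "i", "l", "m", "o", "r",
         "s", "t", "u", "w", "ca", "ch", "car"}

def encode(word, merges, vocab):
    """Apply the learned merges in training order, then replace any
    symbol missing from the vocabulary with '[UNK]'."""
    syms = list(word)
    for left, right in merges:
        out, i = [], 0
        while i < len(syms):
            if i + 1 < len(syms) and syms[i] == left and syms[i + 1] == right:
                out.append(left + right)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        syms = out
    return [s if s in vocab else "[UNK]" for s in syms]

print(encode("cab", merges, vocab))   # ['ca', 'b']
print(encode("card", merges, vocab))  # ['car', '[UNK]']
```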