The first step in BPE is to split the training corpus into words using a pre-tokenizer (for example, splitting on whitespace). After word tokenization, let's assume we have the following words with their frequencies:

[("car", 5), ("cable", 3), ("tablet", 1), ("watch", 2), ("chair", 5), ("mouse", 1)]

The desired vocabulary size is a hyperparameter for BPE. In this example, let's assume we want a total of 17 tokens in the vocabulary. All the unique characters and symbols appearing in the words form the base vocabulary. Here, the base vocabulary would be

['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w']  size = 14

Next, all the words are split into the base vocabulary characters, which can be represented as follows:

[('c','a','r', 5), ('c','a','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','c','h', 2), ('c','h','a','i','r', 5), ('m','o','u','s','e', 1)]
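These first two steps can be reproduced with a short sketch (variable names such as `word_freqs` and `splits` are illustrative, not from any library):

```python
# Toy corpus from the example above: word -> frequency.
word_freqs = {"car": 5, "cable": 3, "tablet": 1, "watch": 2, "chair": 5, "mouse": 1}

# Base vocabulary: every unique character appearing in the corpus.
base_vocab = sorted({ch for word in word_freqs for ch in word})
print(base_vocab, len(base_vocab))

# Each word is represented as a sequence of base-vocabulary symbols.
splits = {word: list(word) for word in word_freqs}
print(splits["cable"])
```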

The BPE algorithm then counts the occurrences of every adjacent symbol pair and picks the one with the highest frequency. In the above example, the pair "ca" occurs 5 times in car and 3 times in cable, for a total of 8 occurrences, the highest of all pairs. It is followed by 7 occurrences of "ch" (2 from watch and 5 from chair), and so on.
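The pair-frequency count can be sketched with a `collections.Counter` (the `count_pairs` helper is illustrative, assuming the toy corpus above):

```python
from collections import Counter

word_freqs = {"car": 5, "cable": 3, "tablet": 1, "watch": 2, "chair": 5, "mouse": 1}
splits = {word: list(word) for word in word_freqs}

def count_pairs(splits, word_freqs):
    # Weight each adjacent symbol pair by the frequency of its word.
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

pairs = count_pairs(splits, word_freqs)
print(pairs.most_common(2))  # [(('c', 'a'), 8), (('c', 'h'), 7)]
```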

The winning pair "ca" is merged into a new token and added to the vocabulary:

['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca']  size = 15

And the tokenized words become:

[('ca','r', 5), ('ca','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','c','h', 2), ('c','h','a','i','r', 5), ('m','o','u','s','e', 1)]

The next most frequent pair is "ch" (7 occurrences), so it is added to the vocabulary and all adjacent occurrences of c and h are merged:

Vocab: ['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca', 'ch']  size = 16

Tokenized input: [('ca','r', 5), ('ca','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','ch', 2), ('ch','a','i','r', 5), ('m','o','u','s','e', 1)]

Since the target vocabulary size is 17, BPE performs one more merge. The next most frequent pair is ('ca', 'r'), which occurs 5 times (it is tied with ('ch', 'a'), ('a', 'i'), and ('i', 'r'); ties are typically broken by order of first occurrence). The pair is merged, and 'car' is added to the vocabulary:

Final vocab: ['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca', 'ch', 'car']  size = 17

Final tokenized input: [('car', 5), ('ca','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','ch', 2), ('ch','a','i','r', 5), ('m','o','u','s','e', 1)]
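The whole training loop above can be sketched end to end as follows. This is a minimal illustration, not a production implementation; it assumes ties are broken by first occurrence, which reproduces the three merges of the worked example:

```python
from collections import Counter

word_freqs = {"car": 5, "cable": 3, "tablet": 1, "watch": 2, "chair": 5, "mouse": 1}
vocab = sorted({ch for w in word_freqs for ch in w})   # base vocabulary, size 14
splits = {w: list(w) for w in word_freqs}
merges = []
target_vocab_size = 17

while len(vocab) < target_vocab_size:
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for w, freq in word_freqs.items():
        for a, b in zip(splits[w], splits[w][1:]):
            pairs[(a, b)] += freq
    # Pick the most frequent pair (max keeps the first-seen pair on ties).
    best = max(pairs, key=pairs.get)
    merges.append(best)
    vocab.append(best[0] + best[1])
    # Merge every adjacent occurrence of the best pair in each word.
    for w, symbols in splits.items():
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[w] = merged

print(merges)          # [('c', 'a'), ('c', 'h'), ('ca', 'r')]
print(splits["car"])   # ['car']
```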

Now that BPE has been trained, the same merges are applied, in the order they were learned, to new words. Say we get a new word "cab": it will be tokenized into ["ca", "b"]. However, if the new word is "card", it will be split into ["car", "[UNK]"], since the letter d is not in the vocabulary. In practice this rarely happens for letters, because every character in the training corpus occurs at least once and is therefore in the base vocabulary. An UNK (unknown) token may still appear, though, when a new word contains a symbol, such as a punctuation mark or digit, that was never added to the vocabulary.
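Applying the learned merges to unseen words can be sketched as below (the `tokenize` helper and the `[UNK]` placeholder are illustrative; the merge list and vocabulary are taken from the worked example):

```python
# Merges and vocabulary from the worked example above.
merges = [("c", "a"), ("c", "h"), ("ca", "r")]
vocab = set("abcehilmorstuw") | {"ca", "ch", "car"}

def tokenize(word, merges, vocab, unk="[UNK]"):
    symbols = list(word)
    # Apply each merge in the order it was learned during training.
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Any leftover symbol not in the vocabulary becomes the unknown token.
    return [s if s in vocab else unk for s in symbols]

print(tokenize("cab", merges, vocab))   # ['ca', 'b']
print(tokenize("card", merges, vocab))  # ['car', '[UNK]']
```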

