The first step in BPE is to split the training corpus into words using a pre-tokenizer (for example, splitting on whitespace). After word tokenization, let's assume we have the following words with their frequencies:

[("car", 5), ("cable", 3), ("tablet", 1), ("watch", 2), ("chair", 5), ("mouse", 1)]

The desired vocabulary size is a hyperparameter for BPE. In this example, let's assume we want a total of 17 tokens in the vocabulary. All the unique characters and symbols appearing in the words form the base vocabulary. Here, the base vocabulary would be

['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w']  size = 14

Next, all the words are split into the base vocabulary characters, which can be represented as follows:

[('c','a','r', 5), ('c','a','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','c','h', 2), ('c','h','a','i','r', 5), ('m','o','u','s','e', 1)]
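These first two steps can be reproduced with a short sketch (variable names such as `word_freqs` and `splits` are illustrative, not from any library):

```python
# Toy corpus from the example above: word -> frequency.
word_freqs = {"car": 5, "cable": 3, "tablet": 1, "watch": 2, "chair": 5, "mouse": 1}

# Base vocabulary: every unique character appearing in the corpus.
base_vocab = sorted({ch for word in word_freqs for ch in word})
print(base_vocab, len(base_vocab))

# Each word is represented as a sequence of base-vocabulary symbols.
splits = {word: list(word) for word in word_freqs}
print(splits["cable"])
```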

The BPE algorithm then counts the occurrences of every adjacent symbol pair and picks the one with the highest frequency. In the above example, the pair "ca" occurs 5 times in car and 3 times in cable, for a total of 8 occurrences, the highest of all pairs. It is followed by 7 occurrences of "ch" (2 from watch and 5 from chair), and so on.
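The pair-frequency count can be sketched with a `collections.Counter` (the `count_pairs` helper is illustrative, assuming the toy corpus above):

```python
from collections import Counter

word_freqs = {"car": 5, "cable": 3, "tablet": 1, "watch": 2, "chair": 5, "mouse": 1}
splits = {word: list(word) for word in word_freqs}

def count_pairs(splits, word_freqs):
    # Weight each adjacent symbol pair by the frequency of its word.
    pairs = Counter()
    for word, freq in word_freqs.items():
        symbols = splits[word]
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

pairs = count_pairs(splits, word_freqs)
print(pairs.most_common(2))  # [(('c', 'a'), 8), (('c', 'h'), 7)]
```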

The winning pair "ca" is merged into a new token and added to the vocabulary:

['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca']  size = 15

And the tokenized words become:

[('ca','r', 5), ('ca','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','c','h', 2), ('c','h','a','i','r', 5), ('m','o','u','s','e', 1)]

The next most frequent pair is "ch" (7 occurrences), so it is added to the vocabulary and all adjacent occurrences of c and h are merged:

Vocab: ['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca', 'ch']  size = 16

Tokenized input: [('ca','r', 5), ('ca','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','ch', 2), ('ch','a','i','r', 5), ('m','o','u','s','e', 1)]

Since the target vocabulary size is 17, BPE performs one more merge. The next most frequent pair is ('ca', 'r'), which occurs 5 times (it is tied with ('ch', 'a'), ('a', 'i'), and ('i', 'r'); ties are typically broken by order of first occurrence). The pair is merged, and 'car' is added to the vocabulary:

Final vocab: ['a', 'b', 'c', 'e', 'h', 'i', 'l', 'm', 'o', 'r', 's', 't', 'u', 'w', 'ca', 'ch', 'car']  size = 17

Final tokenized input: [('car', 5), ('ca','b','l','e', 3), ('t','a','b','l','e','t', 1), ('w','a','t','ch', 2), ('ch','a','i','r', 5), ('m','o','u','s','e', 1)]
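The whole training loop above can be sketched end to end as follows. This is a minimal illustration, not a production implementation; it assumes ties are broken by first occurrence, which reproduces the three merges of the worked example:

```python
from collections import Counter

word_freqs = {"car": 5, "cable": 3, "tablet": 1, "watch": 2, "chair": 5, "mouse": 1}
vocab = sorted({ch for w in word_freqs for ch in w})   # base vocabulary, size 14
splits = {w: list(w) for w in word_freqs}
merges = []
target_vocab_size = 17

while len(vocab) < target_vocab_size:
    # Count adjacent symbol pairs, weighted by word frequency.
    pairs = Counter()
    for w, freq in word_freqs.items():
        for a, b in zip(splits[w], splits[w][1:]):
            pairs[(a, b)] += freq
    # Pick the most frequent pair (max keeps the first-seen pair on ties).
    best = max(pairs, key=pairs.get)
    merges.append(best)
    vocab.append(best[0] + best[1])
    # Merge every adjacent occurrence of the best pair in each word.
    for w, symbols in splits.items():
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        splits[w] = merged

print(merges)          # [('c', 'a'), ('c', 'h'), ('ca', 'r')]
print(splits["car"])   # ['car']
```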

Now that BPE has been trained, the same merges are applied, in the order they were learned, to new words. Say we get a new word "cab": it will be tokenized into ["ca", "b"]. However, if the new word is "card", it will be split into ["car", "[UNK]"], since the letter d is not in the vocabulary. In practice this rarely happens for letters, because every character in the training corpus occurs at least once and is therefore in the base vocabulary. An UNK (unknown) token may still appear, though, when a new word contains a symbol, such as a punctuation mark or digit, that was never added to the vocabulary.
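Applying the learned merges to unseen words can be sketched as below (the `tokenize` helper and the `[UNK]` placeholder are illustrative; the merge list and vocabulary are taken from the worked example):

```python
# Merges and vocabulary from the worked example above.
merges = [("c", "a"), ("c", "h"), ("ca", "r")]
vocab = set("abcehilmorstuw") | {"ca", "ch", "car"}

def tokenize(word, merges, vocab, unk="[UNK]"):
    symbols = list(word)
    # Apply each merge in the order it was learned during training.
    for a, b in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Any leftover symbol not in the vocabulary becomes the unknown token.
    return [s if s in vocab else unk for s in symbols]

print(tokenize("cab", merges, vocab))   # ['ca', 'b']
print(tokenize("card", merges, vocab))  # ['car', '[UNK]']
```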

