Tokenisation is the task of splitting text into tokens, which are then converted to numbers (ids). These ids are what machine learning models use for further processing.

Splitting text into tokens is not as trivial as it sounds.


Simplest Tokenisation

  • The simplest way to split tokens is on spaces.
  • For example,
    `"Let's go to the beach today!"`
    ->  `["Let's", "go", "to", "the", "beach", "today!"]`
    
  • Notice the words “Let’s” and “today!”. If we don’t pay attention to punctuation and simply split on spaces, our vocabulary will explode. In this case we haven’t lowercased the text either, so every word will also have an opposite-case variant, and words with every possible trailing punctuation mark will become separate vocabulary entries.
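The space-based split above can be sketched in a couple of lines (the sample sentence is taken from the example):

```python
# Naive whitespace tokenisation: punctuation stays attached to the words,
# so "today!" and "today" would become two separate vocabulary entries.
text = "Let's go to the beach today!"
tokens = text.split()
print(tokens)  # ["Let's", 'go', 'to', 'the', 'beach', 'today!']
```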

Rule based Tokenisation

  • A better way to tokenise is to use more rules.
  • For example, if we split on spaces and punctuation we get
    `["Let", "'", "s", "go", "to", "the", "beach", "today", "!"]`
    
  • While this is decent, rule-based tokenisation can still have an exploding vocabulary problem. Why? Every distinct word form still needs its own id, so large corpora yield very large vocabularies.
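One simple way to express such a rule is a regular expression that matches either runs of word characters or single punctuation marks; the pattern below is just one illustrative choice, and real rule-based tokenisers use many more rules:

```python
import re

# Split into runs of word characters (\w+) or single
# non-word, non-space characters ([^\w\s]).
text = "Let's go to the beach today!"
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Let', "'", 's', 'go', 'to', 'the', 'beach', 'today', '!']
```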

Rule based Tokenisation with OOV

  • Keep the most frequently occurring words in the vocabulary; everything else becomes out-of-vocabulary (OOV).
  • When a new word is encountered at prediction time, it is either ignored or assigned the OOV token.
  • Although this seems like a reasonable workaround, the OOV token carries no information. If two very different words like “bank” and “bake” are both OOV, they get the same id, no matter how different their meanings are.
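A minimal sketch of this failure mode, assuming a small hand-picked vocabulary and the common `<unk>` convention for the OOV token (both are illustrative, not from any real tokeniser):

```python
# Fixed vocabulary: only frequent words get their own id.
vocab = {"<unk>": 0, "go": 1, "to": 2, "the": 3, "beach": 4}

def encode(tokens):
    # Any token outside the vocab collapses to the single <unk> id.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

# "bank" and "bake" are unrelated words, yet both map to id 0.
print(encode(["go", "to", "the", "bank"]))  # [1, 2, 3, 0]
print(encode(["go", "to", "the", "bake"]))  # [1, 2, 3, 0]
```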

Character Tokenisation

Character tokens solve the OOV problem, but representing the input as a sequence of characters greatly increases the sequence length, which makes it hard for a model to learn how characters combine into meaningful words.


Subword Tokenisation

Frequently occurring words should be in the vocabulary, whereas rare words should be split into frequent subwords.

E.g. the word “refactoring” can be split into “re”, “factor”, and “ing”. These subwords occur more frequently than “refactoring” itself, and the word’s overall meaning is largely preserved.
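A toy greedy longest-match split over a hand-picked subword vocabulary illustrates the idea; real subword vocabularies (BPE, WordPiece, etc.) are learned from corpus statistics, and this sketch skips their training step entirely:

```python
# Illustrative subword vocabulary; in practice this is learned from data.
subwords = {"re", "factor", "ing", "fact", "or"}

def split_subwords(word):
    # Greedily take the longest vocab entry matching at each position.
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return None  # this vocab cannot cover the word
    return pieces

print(split_subwords("refactoring"))  # ['re', 'factor', 'ing']
```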

