Tokenisation is the task of splitting text into tokens, which are then converted to numbers. These numbers are what machine learning models actually consume for further processing.
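
For instance, a minimal sketch of this text → tokens → numbers pipeline; the token list and vocabulary below are made up purely for illustration:

```python
# A toy illustration of how tokens become the numbers a model consumes.
# The token list and vocabulary here are made up purely for illustration.
tokens = ["let", "'s", "go", "to", "the", "beach", "today", "!"]

# Assign each distinct token an integer ID: this mapping is the vocabulary.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Look up each token's ID; this ID sequence is what the model actually sees.
token_ids = [vocab[token] for token in tokens]
print(token_ids)
```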

Splitting text into tokens is not as trivial as it sounds.


Simplest Tokenisation

  • The simplest way to split text into tokens is to split on spaces.
  • For example,
    `“Let’s go to the beach today!”`
    ->  `[“Let’s”, “go”, “to”, “the”, “beach”, “today!”]`
    
  • Notice the tokens “Let’s” and “today!”. If we don’t pay attention to punctuation and simply split on spaces, our vocabulary will explode (see the sketch below). In this case we also haven’t lowercased the text, so every word will additionally appear in opposite-case variants, and words with every possible trailing punctuation mark will end up in the vocabulary.
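
A minimal sketch of this naive whitespace split on the sentence above, showing how punctuation and casing stay attached to the words:

```python
# Naive whitespace tokenisation: punctuation and casing stay attached to
# words, so "today!" and "today" would be separate vocabulary entries.
text = "Let's go to the beach today!"
tokens = text.split()
print(tokens)  # ["Let's", 'go', 'to', 'the', 'beach', 'today!']
```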

Rule-based Tokenisation

  • A better way to tokenise is to apply more rules.
  • For example, if we split on both spaces and punctuation (see the regex sketch after this list), we can get
    `[“Let”, “’”, “s”, “go”, “to”, “the”, “beach”, “today”, “!”]`
    
  • While this is decent, word-level tokenisation still suffers from an exploding vocabulary problem: every distinct surface form (plurals, conjugations, misspellings, rare words) needs its own entry.
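
A rough sketch of such a space-and-punctuation split using a regular expression; real rule-based tokenisers (e.g. in spaCy or NLTK) use more elaborate rules, so this pattern is only illustrative:

```python
import re

# Split into runs of word characters or single punctuation marks, so
# punctuation becomes its own token instead of sticking to a word.
text = "Let's go to the beach today!"
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Let', "'", 's', 'go', 'to', 'the', 'beach', 'today', '!']
```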

Rule-based Tokenisation with OOV


Character Tokenisation


Subword Tokenisation

