Tokenisation is the task of splitting text into tokens, which are then converted to numbers. These numbers are what machine learning models actually consume for further processing.
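
For instance, a minimal sketch of this text → tokens → numbers pipeline; the token list and vocabulary below are made up purely for illustration:

```python
# A toy illustration of how tokens become the numbers a model consumes.
# The token list and vocabulary here are made up purely for illustration.
tokens = ["let", "'s", "go", "to", "the", "beach", "today", "!"]

# Assign each distinct token an integer ID: this mapping is the vocabulary.
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Look up each token's ID; this ID sequence is what the model actually sees.
token_ids = [vocab[token] for token in tokens]
print(token_ids)
```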

Splitting text into tokens is not as trivial as it sounds.


Simplest Tokenisation

  • The simplest way to split text into tokens is to split on spaces.
  • For example,
    `“Let’s go to the beach today!”`
    ->  `[“Let’s”, “go”, “to”, “the”, “beach”, “today!”]`
    
  • Notice the tokens “Let’s” and “today!”. If we don’t pay attention to punctuation and simply split on spaces, our vocabulary will explode (see the sketch below). In this case we also haven’t lowercased the text, so every word will additionally appear in opposite-case variants, and words with every possible trailing punctuation mark will end up in the vocabulary.
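
A minimal sketch of this naive whitespace split on the sentence above, showing how punctuation and casing stay attached to the words:

```python
# Naive whitespace tokenisation: punctuation and casing stay attached to
# words, so "today!" and "today" would be separate vocabulary entries.
text = "Let's go to the beach today!"
tokens = text.split()
print(tokens)  # ["Let's", 'go', 'to', 'the', 'beach', 'today!']
```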

Rule-based Tokenisation

  • A better way to tokenise is to apply more rules.
  • For example, if we split on both spaces and punctuation (see the regex sketch after this list), we can get
    `[“Let”, “’”, “s”, “go”, “to”, “the”, “beach”, “today”, “!”]`
    
  • While this is decent, word-level tokenisation still suffers from an exploding vocabulary problem: every distinct surface form (plurals, conjugations, misspellings, rare words) needs its own entry.
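
A rough sketch of such a space-and-punctuation split using a regular expression; real rule-based tokenisers (e.g. in spaCy or NLTK) use more elaborate rules, so this pattern is only illustrative:

```python
import re

# Split into runs of word characters or single punctuation marks, so
# punctuation becomes its own token instead of sticking to a word.
text = "Let's go to the beach today!"
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['Let', "'", 's', 'go', 'to', 'the', 'beach', 'today', '!']
```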

Rule-based Tokenisation with OOV


Character Tokenisation


Subword Tokenisation

