Things to look out for when tokenizing text

  1. Small words (short, frequent words such as "a", "of", "the")
  2. Hyphenated and non-hyphenated words
  3. Special characters
  4. Capitalized words
  5. Numbers
  6. Periods (sentence-final vs. inside abbreviations and decimal numbers)
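
The six points above can be illustrated with a small regex-based tokenizer. This is only a minimal sketch, not a recommended implementation: the names TOKEN_RE, SMALL_WORDS and tokenize are hypothetical, the small-word list is illustrative, and the pattern assumes ASCII English text. It keeps hyphenated words, decimal numbers and dotted abbreviations as single tokens, emits other special characters one at a time, and leaves lowercasing and small-word removal as options.

```python
import re

# Illustrative list only; real projects usually use a larger stop-word list.
SMALL_WORDS = {"a", "an", "the", "of", "for"}

# Alternatives are tried left to right, so the more specific ones come first.
TOKEN_RE = re.compile(
    r"""
    \d+(?:[.,]\d+)*             # numbers, keeping an internal period or comma
  | (?:[A-Za-z]\.){2,}          # dotted abbreviations such as U.S. or e.g.
  | [A-Za-z]+(?:-[A-Za-z]+)*    # words, keeping internal hyphens
  | [^\w\s]                     # any other single special character
    """,
    re.VERBOSE,
)


def tokenize(text, lowercase=False, drop_small_words=False):
    """Split text into tokens; both options are off by default."""
    tokens = TOKEN_RE.findall(text)
    if lowercase:
        # Folding case merges "The" and "the" but loses proper-noun cues.
        tokens = [t.lower() for t in tokens]
    if drop_small_words:
        tokens = [t for t in tokens if t.lower() not in SMALL_WORDS]
    return tokens


if __name__ == "__main__":
    sample = "The U.S. paid $3.50 for a state-of-the-art e-mail filter."
    print(tokenize(sample))
    print(tokenize(sample, lowercase=True, drop_small_words=True))
```

On the sample sentence, the first call yields ['The', 'U.S.', 'paid', '$', '3.50', 'for', 'a', 'state-of-the-art', 'e-mail', 'filter', '.']: the period inside "U.S." and "3.50" stays attached, while the sentence-final period becomes its own token. Whether hyphenated words should instead be split, or small words dropped, depends on the downstream task.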