Steps to clean data

  • Read the data file
  • Convert the text into a word list
  • Remove punctuation
  • Tokenize sentences
  • Tokenize words
  • Remove stop words
  • Stem words

In [1]:
import pathlib
import string
import nltk

Data Source: http://www.gutenberg.org/ebooks/1661 (Plain Text UTF-8 version)

In [2]:
file_path = pathlib.Path('./data/sherlock_holmes/the_adventures_of_sherlock_holmes.txt')

# The file is closed automatically when the with-block exits
with open(file_path, 'r', encoding='utf-8') as f:
    text = f.read()

# Drop the first character (the byte-order mark) and split on whitespace
words = text[1:].split()

print(words[:10])
['Project', "Gutenberg's", 'The', 'Adventures', 'of', 'Sherlock', 'Holmes,', 'by', 'Arthur', 'Conan']

Remove punctuation

In [3]:
# Build a translation table that maps every punctuation character to None
punctuation_table = str.maketrans('', '', string.punctuation)

# Strip punctuation from each word and lowercase it
words = [word.translate(punctuation_table).lower() for word in words]

print(words[:10])
['project', 'gutenbergs', 'the', 'adventures', 'of', 'sherlock', 'holmes', 'by', 'arthur', 'conan']

Sentence Tokenization

In [4]:
from nltk import sent_tokenize

# sent_tokenize needs the Punkt models: nltk.download('punkt')
sentences = sent_tokenize(text)
print(sentences[100])
It is peculiarly
strong and stiff."

Word Tokenization

In [5]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text[1:])

# Keep only purely alphabetic tokens (drops punctuation and numbers) and lowercase them
words = [token.lower() for token in tokens if token.isalpha()]
print(f'Number of words: {len(words)}')
print(words[:10])
Number of words: 105766
['project', 'gutenberg', 'the', 'adventures', 'of', 'sherlock', 'holmes', 'by', 'arthur', 'conan']

Remove Stop Words

In [6]:
from nltk.corpus import stopwords

# The stop word list requires: nltk.download('stopwords')
# A set makes the membership test O(1) instead of a scan of the whole list
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

print(f'Number of words after removing Stop Words: {len(words)}')
print(words[:10])
Number of words after removing Stop Words: 46660
['project', 'gutenberg', 'adventures', 'sherlock', 'holmes', 'arthur', 'conan', 'doyle', 'ebook', 'use']

Stemming

In [7]:
from nltk.stem.porter import PorterStemmer

# Reduce each word to its stem, e.g. 'adventures' -> 'adventur'
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
print(stemmed[:10])
['project', 'gutenberg', 'adventur', 'sherlock', 'holm', 'arthur', 'conan', 'doyl', 'ebook', 'use']

Additional possible cleaning steps for other data

  • Handling large documents and large collections of text documents that do not fit into memory.
  • Extracting text from markup like HTML, PDF, or other structured document formats.
  • Transliteration of characters from other languages into English.
  • Normalizing Unicode text and decoding it into a consistent encoding such as UTF-8 (see the sketch after this list).
  • Handling of domain specific words, phrases, and acronyms.
  • Handling or removing numbers, such as dates and amounts.
  • Locating and correcting common typos and misspellings.
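
As a rough illustration of two of these extra steps, the sketch below normalizes Unicode text and replaces standalone numbers with a placeholder token. The sample string and the '<num>' token are only stand-ins for this sketch, not part of the notebook's dataset.

import re
import unicodedata

sample = 'Café No. 221B, anno 1891'

# Normalize to a decomposed form, then drop non-ASCII marks for a plain-ASCII approximation
normalized = unicodedata.normalize('NFKD', sample)
ascii_text = normalized.encode('ascii', 'ignore').decode('ascii')

# Replace standalone numbers with a single placeholder token
cleaned = re.sub(r'\b\d+\b', '<num>', ascii_text)

print(ascii_text)
print(cleaned)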

Tomas Mikolov (one of the developers of word2vec) on Text Cleaning for Word Embeddings

There is no universal answer. It all depends on what you plan to use the vectors for. In my experience, it is usually good to disconnect (or remove) punctuation from words, and sometimes also convert all characters to lowercase. One can also replace all numbers (possibly greater than some constant) with a single placeholder token.

All these pre-processing steps aim to reduce the vocabulary size without removing any important content (which in some cases may not hold when you lowercase certain words, i.e. ‘Bush’ is different from ‘bush’, while ‘Another’ usually has the same sense as ‘another’). The smaller the vocabulary is, the lower the memory complexity, and the more robustly the parameters for the words are estimated. You also have to pre-process the test data in the same way.

In short, you will understand all this much better if you run experiments.
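
To make the vocabulary-size point concrete, here is a rough sketch (reusing the text variable loaded earlier in this notebook) that separates punctuation from words and lowercases everything, then compares vocabulary sizes before and after; the exact counts depend on the edition of the file.

import re

# Pad punctuation with spaces so it splits off as separate tokens, then lowercase
separated = re.sub(r'([^\w\s])', r' \1 ', text).lower()

raw_vocab = set(text.split())
clean_vocab = set(separated.split())

print(f'Vocabulary size before cleaning: {len(raw_vocab)}')
print(f'Vocabulary size after cleaning:  {len(clean_vocab)}')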