Steps to clean data

  • Read the data file
  • Convert the text into a word list
  • Remove punctuation
  • Tokenize sentences
  • Tokenize words
  • Remove stop words
  • Stem words

In [1]:
import pathlib
import string
import nltk

Data Source: http://www.gutenberg.org/ebooks/1661 (Plain Text UTF-8 version)

In [2]:
file_path = pathlib.Path('./data/sherlock_holmes/the_adventures_of_sherlock_holmes.txt')

# The file is closed automatically when the with-block exits
with open(file_path, 'r', encoding='utf-8') as f:
    text = f.read()

# Drop the first character (the byte-order mark) and split on whitespace
words = text[1:].split()

print(words[:10])
['Project', "Gutenberg's", 'The', 'Adventures', 'of', 'Sherlock', 'Holmes,', 'by', 'Arthur', 'Conan']

Remove punctuation

In [3]:
# Build a translation table that maps every punctuation character to None
punctuation_table = str.maketrans('', '', string.punctuation)

# Strip punctuation from each word and lowercase it
words = [word.translate(punctuation_table).lower() for word in words]

print(words[:10])
['project', 'gutenbergs', 'the', 'adventures', 'of', 'sherlock', 'holmes', 'by', 'arthur', 'conan']

Sentence Tokenization

In [4]:
from nltk import sent_tokenize

# sent_tokenize needs the Punkt models: nltk.download('punkt')
sentences = sent_tokenize(text)
print(sentences[100])
It is peculiarly
strong and stiff."

Word Tokenization

In [5]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text[1:])

# Keep only purely alphabetic tokens (drops punctuation and numbers) and lowercase them
words = [token.lower() for token in tokens if token.isalpha()]
print(f'Number of words: {len(words)}')
print(words[:10])
Number of words: 105766
['project', 'gutenberg', 'the', 'adventures', 'of', 'sherlock', 'holmes', 'by', 'arthur', 'conan']

Remove Stop Words

In [6]:
from nltk.corpus import stopwords

# The stop word list requires: nltk.download('stopwords')
# A set makes the membership test O(1) instead of a scan of the whole list
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]

print(f'Number of words after removing Stop Words: {len(words)}')
print(words[:10])
Number of words after removing Stop Words: 46660
['project', 'gutenberg', 'adventures', 'sherlock', 'holmes', 'arthur', 'conan', 'doyle', 'ebook', 'use']

Stemming

In [7]:
from nltk.stem.porter import PorterStemmer

# Reduce each word to its stem, e.g. 'adventures' -> 'adventur'
porter = PorterStemmer()
stemmed = [porter.stem(word) for word in words]
print(stemmed[:10])
['project', 'gutenberg', 'adventur', 'sherlock', 'holm', 'arthur', 'conan', 'doyl', 'ebook', 'use']

Additional possible cleaning steps for other data

  • Handling large documents and large collections of text documents that do not fit into memory.
  • Extracting text from markup like HTML, PDF, or other structured document formats.
  • Transliteration of characters from other languages into English.
  • Normalizing Unicode text and decoding it into a consistent encoding such as UTF-8 (see the sketch after this list).
  • Handling of domain specific words, phrases, and acronyms.
  • Handling or removing numbers, such as dates and amounts.
  • Locating and correcting common typos and misspellings.
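
As a rough illustration of two of these extra steps, the sketch below normalizes Unicode text and replaces standalone numbers with a placeholder token. The sample string and the '<num>' token are only stand-ins for this sketch, not part of the notebook's dataset.

import re
import unicodedata

sample = 'Café No. 221B, anno 1891'

# Normalize to a decomposed form, then drop non-ASCII marks for a plain-ASCII approximation
normalized = unicodedata.normalize('NFKD', sample)
ascii_text = normalized.encode('ascii', 'ignore').decode('ascii')

# Replace standalone numbers with a single placeholder token
cleaned = re.sub(r'\b\d+\b', '<num>', ascii_text)

print(ascii_text)
print(cleaned)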

Tomas Mikolov (one of the developers of word2vec) on Text Cleaning for Word Embeddings

There is no universal answer. It all depends on what you plan to use the vectors for. In my experience, it is usually good to disconnect (or remove) punctuation from words, and sometimes also convert all characters to lowercase. One can also replace all numbers (possibly greater than some constant) with a single placeholder token.

All these pre-processing steps aim to reduce the vocabulary size without removing any important content (which in some cases may not hold when you lowercase certain words, i.e. ‘Bush’ is different from ‘bush’, while ‘Another’ usually has the same sense as ‘another’). The smaller the vocabulary is, the lower the memory complexity, and the more robustly the parameters for the words are estimated. You also have to pre-process the test data in the same way.

In short, you will understand all this much better if you run experiments.
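
To make the vocabulary-size point concrete, here is a rough sketch (reusing the text variable loaded earlier in this notebook) that separates punctuation from words and lowercases everything, then compares vocabulary sizes before and after; the exact counts depend on the edition of the file.

import re

# Pad punctuation with spaces so it splits off as separate tokens, then lowercase
separated = re.sub(r'([^\w\s])', r' \1 ', text).lower()

raw_vocab = set(text.split())
clean_vocab = set(separated.split())

print(f'Vocabulary size before cleaning: {len(raw_vocab)}')
print(f'Vocabulary size after cleaning:  {len(clean_vocab)}')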