
In this article, we’ll walk through some basic concepts related to Natural Language Processing (NLP). The focus here is on the theory rather than on programming practice.

Introduction

Why should we pre-process text at all? Because computers are best at understanding numerical data. So we convert strings into a numerical form and then feed these numbers into models to make them work.

We’ll learn about tokenization, normalization, stemming, lemmatization, corpora, stop words, parts of speech, bag of words, n-grams, and tf-idf vectorization. These concepts are enough to get a computer working with text data.

Tokenization

Tokenization is the process of breaking long strings of text into smaller pieces, i.e. tokens. Suppose we have the string “Tokenize this sentence for the testing purposes.”. After tokenization it looks like {“Tokenize”, “this”, “sentence”, “for”, “the”, “testing”, “purposes”, “.”}.

This is word tokenization; similarly, we can perform character tokenization.
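As a minimal sketch (assuming NLTK is installed and its “punkt” tokenizer data has been downloaded), word tokenization can be done like this:

import nltk
nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

from nltk.tokenize import word_tokenize

sentence = "Tokenize this sentence for the testing purposes."
tokens = word_tokenize(sentence)
print(tokens)
# ['Tokenize', 'this', 'sentence', 'for', 'the', 'testing', 'purposes', '.']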

Normalization

Normalization is the process of putting all words on an equal footing: converting everything to the same case, removing punctuation, expanding contractions, or converting numbers to their word equivalents. After normalization, our sentence looks like {“tokenize”, “this”, “sentence”, “for”, “the”, “testing”, “purposes”}.
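A rough sketch of this kind of normalization, starting from the token list above, is simply lower-casing everything and dropping punctuation tokens:

import string

tokens = ["Tokenize", "this", "sentence", "for", "the", "testing", "purposes", "."]
# Lower-case every token and drop tokens that are pure punctuation.
normalized = [t.lower() for t in tokens if t not in string.punctuation]
print(normalized)
# ['tokenize', 'this', 'sentence', 'for', 'the', 'testing', 'purposes']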

Stemming

Stemming is the process of removing affixes (suffixes, prefixes, infixes, circumfixes) from a word. For example, “running” becomes “run”. After stemming, our sentence looks like {“tokenize”, “this”, “sentence”, “for”, “the”, “test”, “purpose”}.
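A minimal sketch with NLTK’s PorterStemmer. Note that rule-based stemmers do not always produce dictionary words; for example, “purposes” comes out as “purpos” rather than the simplified “purpose” shown above.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "testing", "purposes"]
print([stemmer.stem(w) for w in words])
# ['run', 'test', 'purpos']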

Lemmatization

Lemmatization is the process of reducing a word to its canonical (dictionary) form, called its lemma. In simple terms, we replace inflected or irregular forms with a base form so the corpus stays uniform. For example, “better” becomes “good”.
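A minimal sketch with NLTK’s WordNetLemmatizer (it requires the “wordnet” corpus, and it needs a part-of-speech hint to resolve irregular forms like “better”):

import nltk
nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'  (tagged as adjective)
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'   (tagged as verb)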

Corpus

A corpus (Latin for “body”) is a collection of text; it refers to the collection built from our text data. You might also come across “corpora”, which is the plural of corpus. The corpus gives our NLP models their dictionary. Since computers work with numbers rather than strings, every word is mapped to a numerical form as follows: {“tokenize”:1, “this”:2, “sentence”:3, “for”:4, “the”:5, “test”:6, “purpose”:7}
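A tiny sketch of building such a word-to-index mapping from our processed tokens (the numbering here starts at 1 purely for readability; real vocabularies often start at 0 or reserve special ids):

tokens = ["tokenize", "this", "sentence", "for", "the", "test", "purpose"]
# Map each distinct word to a numerical index, starting from 1.
vocab = {word: index for index, word in enumerate(dict.fromkeys(tokens), start=1)}
print(vocab)
# {'tokenize': 1, 'this': 2, 'sentence': 3, 'for': 4, 'the': 5, 'test': 6, 'purpose': 7}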

Stop words

Some words in a sentence contribute little to its context or meaning. These are stop words, and we remove them from the corpus before passing the data to a model. Stop words include words like “the”, “a”, and “and”: basically, the words that tend to occur most frequently.
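A minimal sketch of stop-word removal with NLTK’s English stop-word list (requires the “stopwords” corpus):

import nltk
nltk.download("stopwords", quiet=True)

from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["tokenize", "this", "sentence", "for", "the", "test", "purpose"]
print([t for t in tokens if t not in stop_words])
# ['tokenize', 'sentence', 'test', 'purpose']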

Part Of Speech (POS)

POS tagging assigns a category tag to each token in a sentence, so that every word falls under one of the categories: noun, verb, adjective, and so on. This helps in understanding a word’s role in the sentence.
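A minimal sketch with NLTK’s default tagger (it requires the “averaged_perceptron_tagger” model data, and the exact tags may vary between versions):

import nltk
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = ["Tokenize", "this", "sentence", "for", "the", "testing", "purposes"]
print(nltk.pos_tag(tokens))
# e.g. [('Tokenize', 'VB'), ('this', 'DT'), ('sentence', 'NN'), ('for', 'IN'),
#       ('the', 'DT'), ('testing', 'NN'), ('purposes', 'NNS')]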

Bag of Words

Bag of words is a representation of sentences in a form that a machine learning model can understand. The main focus is on how often each word occurs rather than on word order. The dictionary generated for our sentence looks like {“tokenize”:1, “this”:1, “sentence”:1, “for”:1, “test”:1, “purpose”:1}. This algorithm has several limitations.

It fails to convey the meaning of a sentence, and because it only counts occurrences, frequently occurring words dominate the representation. Other techniques address these limitations, as shown later with tf-idf.
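A minimal sketch with scikit-learn’s CountVectorizer (assuming scikit-learn 1.0 or newer for get_feature_names_out). Each row of the resulting matrix holds the word counts for one sentence, with word order discarded:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "Tokenize this sentence for the testing purposes.",
    "This sentence is another test sentence.",
]
vectorizer = CountVectorizer()            # lower-cases and tokenizes internally
counts = vectorizer.fit_transform(sentences)
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # per-sentence word counts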

n-grams

Instead of just counting occurrences, we can capture sequences of N consecutive items from the text. This is much more useful for preserving the context of a sentence. Here N can be any number of consecutive words; for example, trigrams contain 3 consecutive words:

{“tokenize this sentence”, “this sentence for”, “sentence for test”, “for test purpose”}

Even to a human this looks more informative, because it retains information about the order in which words occur.
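A minimal sketch using NLTK’s ngrams helper to build the trigrams shown above:

from nltk import ngrams

tokens = ["tokenize", "this", "sentence", "for", "test", "purpose"]
trigrams = [" ".join(gram) for gram in ngrams(tokens, 3)]
print(trigrams)
# ['tokenize this sentence', 'this sentence for', 'sentence for test', 'for test purpose']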

tf-idf vectorizer

tf-idf stands for term frequency-inverse document frequency. For each word we count how often it appears in a given document (the term frequency) and weight that count by how rare the word is across all documents (the inverse document frequency, usually the logarithm of the total number of documents divided by the number of documents containing the word). You can think of the score roughly as term frequency / document frequency: words that appear everywhere are weighted down, while words specific to a document are weighted up.

This vectorizer also works well without removing stop words, because it automatically gives low importance to words that occur in many documents. tf-idf is one of the most commonly used ways of vectorizing text in NLP.
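A minimal sketch with scikit-learn’s TfidfVectorizer: words that appear in many documents (such as “sentence” here) receive lower weights than words that are specific to a single document.

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Tokenize this sentence for the testing purposes.",
    "This sentence is another test sentence.",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)
print(vectorizer.get_feature_names_out())  # vocabulary
print(tfidf.toarray().round(2))            # tf-idf weights per document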

Conclusion

Here, we went through most of the terms used in Natural Language Processing in layman’s terms. You can try these concepts out with Python libraries like NLTK and spaCy.

Read this article if you want to learn to create a Neural Network with NLP.