In this article, we’ll try to understand some basic concepts related to Natural Language Processing (NLP). I will be focusing on the theoretical aspects rather than on programming practice.
Why should one pre-process text, anyway? Because computers are best at understanding numerical data. So we convert strings into numerical form and then pass this numerical data to our models.
We’ll be looking into techniques like tokenization, normalization, stemming, lemmatization, corpora, stop words, part-of-speech tagging, bag of words, n-grams, tf-idf, and word embeddings. Together, these techniques are enough to get a computer working with text data.
It is the process of converting long strings of text into smaller pieces, or tokens, hence the name: tokenization.
Suppose we have a string like, “Tokenize this sentence for the testing purposes.”
In this case, after tokenization, the sentence would look like: {“Tokenize”, “this”, “sentence”, “for”, “the”, “testing”, “purposes”, “.”}
This is an example of word tokenization; character tokenization can be performed similarly.
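As a quick illustration, here is a minimal sketch using NLTK’s word_tokenize (this assumes NLTK is installed and its “punkt” tokenizer data has been downloaded):

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer data, needed only on the first run

sentence = "Tokenize this sentence for the testing purposes."

# Word tokenization: split the string into word and punctuation tokens
word_tokens = word_tokenize(sentence)
print(word_tokens)
# ['Tokenize', 'this', 'sentence', 'for', 'the', 'testing', 'purposes', '.']

# Character tokenization: simply treat every character as a token
char_tokens = list(sentence)
```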
Normalization is the process of generalizing all words by converting them to the same case, removing punctuation, expanding contractions, or converting words to their equivalents.
In the aforementioned example, normalization would get rid of punctuation and case-sensitivity, and our sentence would then look like this: {“tokenize”, “this”, “sentence”, “for”, “the”, “testing”, “purposes”}.
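Here is a minimal normalization sketch (only lower-casing and punctuation removal; contraction expansion is left out for brevity):

```python
import string

tokens = ["Tokenize", "this", "sentence", "for", "the", "testing", "purposes", "."]

# Lower-case every token and drop tokens that are pure punctuation
normalized = [t.lower() for t in tokens if t not in string.punctuation]
print(normalized)
# ['tokenize', 'this', 'sentence', 'for', 'the', 'testing', 'purposes']
```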
Stemming is the process of removing affixes (suffixes, prefixes, infixes, circumfixes) from a word. For example, “running” becomes “run”. So after stemming, our sentence would look like: {“tokenize”, “this”, “sentence”, “for”, “the”, “test”, “purpose”}.
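A small sketch using NLTK’s Porter stemmer, one of several stemmers NLTK provides:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The stemmer strips affixes such as "-ing" and "-s"
print(stemmer.stem("running"))   # 'run'
print(stemmer.stem("testing"))   # 'test'
print(stemmer.stem("purposes"))  # 'purpos' -- stems are not always dictionary words
```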
Lemmatization is the process of capturing the canonical form of a word based on its lemma. In simple terms, for uniformity in the corpus, we reduce words to their dictionary forms. For example, the word “better” will be converted to “good”.
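A sketch using NLTK’s WordNet lemmatizer (this assumes the “wordnet” corpus has been downloaded; note that the lemmatizer needs a part-of-speech hint to map “better” to “good”):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")  # lexical database used by the lemmatizer

lemmatizer = WordNetLemmatizer()

# With the adjective POS tag ("a"), "better" is reduced to its lemma "good"
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'
```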
Corpus, Latin for “body”, is a collection of text. It refers to the collection generated from our text data; you might see “corpora” in some places, which is the plural of corpus. From the corpus we build a vocabulary that acts as a dictionary for our NLP models. Since computers work with numbers instead of strings, every token is mapped to a numerical id as follows: {“tokenize”: 1, “this”: 2, “sentence”: 3, “for”: 4, “the”: 5, “test”: 6, “purpose”: 7}
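A minimal sketch of building such a word-to-id mapping from a list of tokens:

```python
tokens = ["tokenize", "this", "sentence", "for", "the", "test", "purpose"]

# Assign each unique token an integer id, starting from 1
vocab = {}
for token in tokens:
    if token not in vocab:
        vocab[token] = len(vocab) + 1

print(vocab)
# {'tokenize': 1, 'this': 2, 'sentence': 3, 'for': 4, 'the': 5, 'test': 6, 'purpose': 7}
```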
Some words in a sentence play little part in its context or meaning. These are called stop words, and we usually remove them from the data before passing it to a model. Stop words include words like “the”, “a”, and “and”, which occur frequently in almost every sentence.
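A sketch of stop-word removal with NLTK’s built-in English stop-word list (note that this list is broader than just “the”, so it also drops words like “this” and “for”):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # stop-word lists shipped with NLTK

stop_words = set(stopwords.words("english"))

tokens = ["tokenize", "this", "sentence", "for", "the", "test", "purpose"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# ['tokenize', 'sentence', 'test', 'purpose']
```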
POS tagging consists of assigning a category tag to each token in the sentence, such that every word falls under one of these categories: noun, verb, adjective, etc. This helps in understanding the role of a word in the sentence.
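A sketch using NLTK’s default tagger (the exact name of the data package can vary between NLTK versions):

```python
import nltk

nltk.download("averaged_perceptron_tagger")  # tagger model, needed on the first run

tokens = ["Tokenize", "this", "sentence", "for", "the", "testing", "purposes"]

# Returns a list of (token, tag) pairs, e.g. 'DT' for determiners and 'NN' for nouns
print(nltk.pos_tag(tokens))
```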
Bag of words is a representation of text that a machine learning model can understand. Here, the main focus is on how often each word occurs rather than on the order of the words. With the stop word “the” removed, the generated dictionary for our sentence looks like this: {“tokenize”: 1, “this”: 1, “sentence”: 1, “for”: 1, “test”: 1, “purpose”: 1}. This algorithm has some important limitations.
It fails to convey the meaning of a sentence: since it only looks at counts, words with high occurrence counts dominate the representation. We then have to rely on other techniques to address these limitations.
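A minimal bag-of-words sketch using a plain counter (scikit-learn’s CountVectorizer does the same job at scale):

```python
from collections import Counter

tokens = ["tokenize", "this", "sentence", "for", "test", "purpose"]

# Count how many times each word occurs, ignoring word order
bag_of_words = Counter(tokens)
print(dict(bag_of_words))
# {'tokenize': 1, 'this': 1, 'sentence': 1, 'for': 1, 'test': 1, 'purpose': 1}
```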
Instead of storing only occurrence counts, we can capture sequences of N consecutive items, called n-grams, which preserve more of the context of a sentence. Here N can be any number of consecutive words; for example, trigrams contain 3 consecutive words:
{“tokenize this sentence”, “this sentence for”, “sentence for test”, “for test purpose”}
Even for humans, this seems more informative, as it conveys the order in which words occur.
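A small helper for generating n-grams (NLTK also offers nltk.util.ngrams for the same purpose):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens joined into a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["tokenize", "this", "sentence", "for", "test", "purpose"]
print(ngrams(tokens, 3))
# ['tokenize this sentence', 'this sentence for', 'sentence for test', 'for test purpose']
```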
tf-idf stands for term frequency-inverse document frequency. In this vectorizer, a word’s score in a document grows with how often it appears in that document (the term frequency) and shrinks with how many documents in the corpus contain it (the document frequency). Roughly, the weight behaves like term frequency / document frequency, so words that are frequent in one document but rare across the corpus get the highest weights.
This vectorizer works reasonably well even without removing stop words, since it assigns low importance to words that appear in most documents. tf-idf is one of the most commonly used vectorizers for text in NLP.
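A sketch using scikit-learn’s TfidfVectorizer on a tiny toy corpus (assuming scikit-learn 1.0 or later is installed; the three “documents” below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny toy corpus of three "documents"
documents = [
    "tokenize this sentence for test purpose",
    "this sentence is for another test",
    "a completely different sentence",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Words that appear in every document (like "sentence") receive the lowest weights
print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))
```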
Here, we went through most of the terms used in Natural Language Processing in layman’s terms. You can try working with these concepts using Python libraries like NLTK and spaCy.
Also, please check out our blog on learning RNN, GRU, and LSTM with examples of sentiment analysis.