Text Normalization - In text normalization, we carry out several steps to reduce the text to a simpler, more uniform level so that it becomes easier for the machine to process.
Sentence Segmentation - Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is treated as a separate unit of data, so the whole corpus is now reduced to a list of sentences.
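For illustration, here is a minimal sketch of sentence segmentation using NLTK's sent_tokenize; the sample corpus is made up, and the code assumes the 'punkt' tokenizer data can be downloaded.

```python
import nltk

nltk.download("punkt", quiet=True)  # fetch the sentence tokenizer data once

corpus = "Raj plays cricket. He also likes music! Does he sing?"  # hypothetical corpus
sentences = nltk.sent_tokenize(corpus)  # split the corpus into sentences
print(sentences)
# ['Raj plays cricket.', 'He also likes music!', 'Does he sing?']
```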
Tokenisation - After segmenting the sentences, each sentence is further divided into tokens.
A token is a term used for any word, number, or special character occurring in a sentence.
Under tokenisation, every word, number, and special character is considered separately, and each of them becomes a separate token.
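A minimal sketch of tokenisation with NLTK's word_tokenize; the sample sentence is an assumption:

```python
import nltk

nltk.download("punkt", quiet=True)

sentence = "Raj scored 98 marks!"  # hypothetical sentence from the corpus
tokens = nltk.word_tokenize(sentence)  # every word, number, and symbol becomes a token
print(tokens)
# ['Raj', 'scored', '98', 'marks', '!']
```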
Removing Stop words, Special Characters, and Numbers - In this step, the tokens that are not necessary are removed from the token list. Stop words are frequently occurring words such as 'a', 'an', 'the', and 'is' that add little meaning to the text on their own.
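A sketch of this filtering step using NLTK's English stop word list; the token list is hypothetical, and here non-alphabetic tokens (numbers and special characters) are dropped along with the stop words:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

tokens = ['Raj', 'scored', '98', 'marks', 'in', 'the', 'test', '!']  # hypothetical token list
stop_words = set(stopwords.words("english"))

# keep only alphabetic tokens that are not stop words
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stop_words]
print(filtered)
# ['Raj', 'scored', 'marks', 'test']
```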
Converting text to a common case - After stop word removal, we convert the whole text to a single case, preferably lower case.
This ensures that the machine, which is case sensitive, does not treat the same word as two different words just because of a difference in case.
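A one-line sketch of this step on an assumed token list:

```python
tokens = ['Raj', 'scored', 'marks', 'Test']  # hypothetical token list
lowered = [t.lower() for t in tokens]  # one spelling per word, regardless of case
print(lowered)
# ['raj', 'scored', 'marks', 'test']
```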
Stemming - In this step, the remaining words are reduced to their root words. In other words, stemming is the process in which the affixes of words are removed and the words are converted to their base form. The stemmed word may not be a meaningful word; for example, 'studies' gets stemmed to 'studi'.
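A sketch of stemming using NLTK's Porter stemmer; the word list is an assumption:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ['studies', 'playing', 'happiness']  # hypothetical token list
stems = [stemmer.stem(w) for w in words]
print(stems)
# ['studi', 'play', 'happi']  -- the stems need not be meaningful words
```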
Lemmatization - Lemmatization also removes the affixes of words, but it makes sure that the word we get after affix removal (known as the lemma) is a meaningful word. For example, 'studies' is lemmatized to 'study'.
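A sketch of lemmatization using NLTK's WordNet lemmatizer, assuming the 'wordnet' data can be downloaded; the words are assumptions:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("studies"))           # 'study' -- a meaningful word
print(lemmatizer.lemmatize("playing", pos="v"))  # 'play'  -- a part-of-speech hint helps
```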
With this, we have normalized our text to tokens, which are the simplest forms of the words present in the corpus.
Now it is time to convert the tokens into numbers. For this, we use the Bag of Words algorithm.
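As a preview of that step, here is a minimal Bag of Words sketch using scikit-learn's CountVectorizer; the two normalized sample documents are assumptions:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["raj likes cricket", "raj likes music and cricket"]  # hypothetical documents
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(documents)  # each document becomes a vector of word counts

print(vectorizer.get_feature_names_out())  # the vocabulary of the corpus
print(bow.toarray())                       # one row of counts per document
# ['and' 'cricket' 'likes' 'music' 'raj']
# [[0 1 1 0 1]
#  [1 1 1 1 1]]
```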