Natural Language Processing (Unit 7, Class 10 Artificial Intelligence, CBSE) covers the connection between human languages and machine processing.

Here we discuss the applications of NLP, chatbots, text classification, and NLP processes such as Text Normalisation and the Bag of Words algorithm.


Introduction to NLP

Till now we have covered two domains of AI: Data Science and Computer Vision.

  1. Data Science: It is all about applying mathematical and statistical principles to data. In simple words, Data Science is the study of data. This data can be of 3 types: Audio, Visual and Textual.
    • Data Science works around numbers and tabular data.
  2. Computer Vision: In simple words, it is about identifying symbols or patterns in a given object (pictures), learning from them, and alerting about or predicting future objects using the camera.
    • Computer Vision is all about visual data like images and videos.

Natural Language Processing (NLP)

Natural Language Processing (NLP) is the sub-field of AI that focuses on the ability of a computer to understand human language (commands), as spoken or written, and to give an output by processing it.

Applications of Natural Language Processing

Some of the applications of Natural Language Processing that are used in the real-life scenario:

Automatic Summarization

  1. Automatic summarization is relevant not only for summarizing the meaning of documents and information, but also for understanding the emotional meanings within the information (such as when collecting data from social media).

Sentiment Analysis

  1. Definition: Identify sentiment among several posts or even in the same post where emotion is not always explicitly expressed.
  2. Companies use it to identify opinions and sentiments to understand what customers think about their products and services.

Text classification

  1. Text classification makes it possible to assign predefined categories to a document and organize it to help you find the information you need or simplify some activities.
  2. For example, an application of text categorization is spam filtering in email.

Virtual Assistants

  1. Nowadays Google Assistant, Cortana, Siri, Alexa, etc have become an integral part of our lives. Not only can we talk to them but they also have the ability to make our lives easier.
  2. By accessing our data, they can help us in keeping notes of our tasks, making calls for us, sending messages, and a lot more.
  3. With the help of speech recognition, these assistants can not only detect our speech but can also make sense of it.
  4. According to recent research, a lot more advancements are expected in this field in the near future.

Natural Language Processing: Getting Started

Natural Language Processing is all about how machines try to understand and interpret human language and operate accordingly. But how can Natural Language Processing be used to solve the problems around us?

Revisiting the AI Project Cycle

Related article: AI Project Cycle

ChatBots

One of the most common applications of Natural Language Processing is a chatbot. Let us try some of the chatbots and see how they work.

  • Mitsuku Bot*
  • https://www.pandorabots.com/mitsuku/
  • Cleverbot*
  • https://www.cleverbot.com/
  • Jabberwacky*
  • http://www.jabberwacky.com/
  • Haptik*
  • https://haptik.ai/contact-us

*Image and brand rights belong to the respective owners.

Types of ChatBots

With the help of this experience, we can understand that there are 2 types of chatbots around us: Script-bot and Smart-bot. Let us understand what each of them means.

Script Bot

  1. Script bots are easy to make
  2. Script bots work around a script that is programmed in them
  3. Mostly they are free and are easy to integrate into a messaging platform
  4. No or little language processing skills
  5. Limited functionality
  6. Example: the bots which are deployed in the customer care section of various companies
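To make the idea of a "script" concrete, here is a minimal sketch of a script bot in Python. The keywords and replies are made up for illustration (they are not taken from any real customer care bot), and a real script bot may be built very differently.

```python
# Hypothetical script: each keyword maps to a fixed, pre-written reply.
SCRIPT = {
    "hello": "Hello! How can I help you today?",
    "refund": "To request a refund, please share your order number.",
    "bye": "Thank you for contacting us. Goodbye!",
}
FALLBACK = "Sorry, I did not understand that."


def script_bot_reply(message: str) -> str:
    """Return the scripted reply for the first keyword found in the message."""
    words = message.lower().split()
    for keyword, reply in SCRIPT.items():
        if keyword in words:
            return reply
    # Anything outside the script gets the same fallback reply.
    return FALLBACK


print(script_bot_reply("hello there"))
print(script_bot_reply("What is the weather like?"))
```

The second message is outside the script, so the bot can only give its fallback reply. This is what "limited functionality" means for a script bot.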

Smart Bot

  1. Smart bots are flexible and powerful
  2. Smart bots work on bigger databases and other resources directly
  3. Smart bots learn with more data
  4. Coding is required to take this up on board
  5. Wide functionality
  6. Example: Google Assistant, Alexa, Cortana, Siri, etc.

Human Language VS Computer Language

Human Language

  1. Our brain keeps on processing the sounds that it hears around itself and tries to make sense of them all the time.
    • Example: In the classroom, as the teacher delivers the session, our brain is continuously processing everything and storing it someplace. Also, while this is happening, when your friend whispers something, the focus of your brain automatically shifts from the teacher’s speech to your friend’s conversation.
    • So now, the brain is processing both the sounds but is prioritizing the one on which our interest lies.
  2. The sound reaches the brain through a long channel. As a person speaks, the sound travels from his mouth and goes to the listener’s eardrum. The sound striking the eardrum is converted into neuron impulses, gets transported to the brain, and then gets processed.
  3. After processing the signal, the brain gains an understanding of its meaning. If it is clear, the signal gets stored. Otherwise, the listener asks the speaker for clarity. This is how human languages are processed by humans.

Computer Language

  1. Computers understand the language of numbers. Everything that is sent to the machine has to be converted to numbers.
  2. While typing, if a single mistake is made, the computer throws an error and does not process that part. The communications made by the machines are very basic and simple.
  3. Now, if we want the machine to understand our language, how should this happen? What are the possible difficulties a machine would face in processing natural language? Let us take a look at some of them here:

Arrangement of the words and meaning

There are rules in human language. There are nouns, verbs, adverbs, and adjectives. A word can be a noun at one time and an adjective some other time. There are rules to provide structure to a language.

Syntax: Syntax refers to the grammatical structure of a sentence.

Besides the matter of arrangement, there’s also meaning behind the language we use. Human communication is complex. There are multiple characteristics of the human language that might be easy for a human to understand but extremely difficult for a computer to understand.

Semantics: Semantics refers to the meaning of the sentence.

Let’s understand syntax and semantics through an analogy with a programming language:

  1. Different syntax, same semantics: 2+3 = 3+2
    • Here the way these statements are written is different, but their meaning is the same, that is, 5.
  2. Different semantics, same syntax: 3/2 (Python 2.7) ≠ 3/2 (Python 3)
    • Here the statements have the same syntax but their meanings are different. In Python 2.7, this statement would result in 1, while in Python 3 it would give an output of 1.5.
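The two cases can be tried out in Python 3. This is a small sketch using the operands 3 and 2, for which integer division gives 1 and true division gives 1.5. Python 2.7 itself is not run here; the floor-division operator `//` is used to show the result that `3/2` gave in Python 2.7.

```python
# Different syntax, same semantics: both expressions evaluate to 5.
same_meaning = (2 + 3) == (3 + 2)
print(same_meaning)

# Same syntax, different semantics across Python versions:
# in Python 3, "/" is true division, so 3 / 2 gives 1.5.
true_division = 3 / 2

# "//" is floor division in Python 3. For these operands it gives 1,
# which is the value that 3 / 2 produced in Python 2.7.
floor_division = 3 // 2

print(true_division, floor_division)
```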

Multiple Meanings of a word

To understand this, let us look at the following three sentences:

  1. His face turned red after he found out that he had taken the wrong bag
    • What does this mean? Is he feeling ashamed because he took another person’s bag instead of his? Is he feeling angry because he did not manage to steal the bag that he has been targeting?
  2. The red car zoomed past his nose
    • Probably talking about the color of the car, that traveled close to him in a flash.
  3. His face turns red after consuming the medicine
    • Is he having an allergic reaction? Or is he not able to bear the taste of that medicine?

Here we can see that context is important. We understand a sentence almost intuitively, depending on our history of using the language, and the memories that have been built within.

In all three sentences, the word red has been used in three different ways, and the context of the statement changes its meaning completely. Thus, in natural language, it is important to understand that a word can have multiple meanings and that the meaning fits into the statement according to its context.

Perfect Syntax, no Meaning

Sometimes a statement has perfectly correct syntax but no meaning.

Example: Chickens feed extravagantly while the moon drinks tea

This statement is correct grammatically but does this make any sense? In human language, a perfect balance of syntax and semantics is important for better understanding.

Data Processing

As we have already gone through some of the complications in human languages above, now it is time to see how Natural Language Processing makes it possible for machines to understand and speak Natural Languages just like humans.

Since we all know that the language of computers is Numerical, the very first step that comes to our mind is to convert our language to numbers. This conversion takes a few steps to happen. The first step to it is Text Normalisation.

Text Normalisation helps in cleaning up the textual data in such a way that it comes down to a level where its complexity is lower than the actual data.

1. Text Normalisation

In Text Normalisation, we undergo several steps to normalise the text to a lower level. We will be working on text from multiple documents, and the term used for the whole textual data from all the documents together is corpus.

2. Sentence Segmentation

Under sentence segmentation, the whole corpus is divided into sentences. Each sentence is treated as a separate piece of data, so the whole corpus gets reduced to sentences.

Example:

Before Sentence Segmentation

“You want to see the dreams with close eyes and achieve them? They’ll remain dreams, look for AIMs and your eyes have to stay open for a change to be seen.”

After Sentence Segmentation

  1. You want to see the dreams with close eyes and achieve them?
  2. They’ll remain dreams, look for AIMs and your eyes have to stay open for a change to be seen.
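The same split can be sketched in Python with a simple rule: break the text after a full stop, question mark or exclamation mark that is followed by a space. This rule is an assumption made for illustration. It works for the example above but would wrongly split abbreviations such as "Dr." in other texts.

```python
import re

corpus = (
    "You want to see the dreams with close eyes and achieve them? "
    "They'll remain dreams, look for AIMs and your eyes have to stay "
    "open for a change to be seen."
)

# Split wherever whitespace follows '.', '?' or '!'.
# The lookbehind keeps the punctuation mark attached to its sentence.
sentences = re.split(r"(?<=[.?!])\s+", corpus.strip())

for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```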

3. Tokenisation

After segmenting the sentences, each sentence is then further divided into tokens. A “Token” is a term used for any word or number or special character occurring in a sentence.

Under Tokenisation, every word, number, and special character is considered separately and each of them is now a separate token.

Corpus: A corpus is a large and structured set of machine-readable texts that have been produced in a natural communicative setting.

OR

A corpus can be defined as a collection of text documents. It can be thought of as just a bunch of text files in a directory, often alongside many other directories of text files.

Example:

  1. You want to see the dreams with close eyes and achieve them?
    • Tokens: You | want | to | see | the | dreams | with | close | eyes | and | achieve | them | ?
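Tokenisation of this sentence can be sketched with a regular expression that treats every run of letters or digits as one token and every special character as its own token. The pattern below is one simple choice, not the only way to tokenise.

```python
import re

sentence = "You want to see the dreams with close eyes and achieve them?"

# \w+     matches a run of letters/digits (a word or a number)
# [^\w\s] matches a single special character such as '?'
tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(tokens)
print(len(tokens))
```

The question mark comes out as a separate token, so the sentence yields 13 tokens in total.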

4. Removal of Stopwords

In this step, the tokens which are not necessary are removed from the token list. To make it easier for the computer to focus on meaningful terms, these words are removed.

Along with these words, a lot of times our corpus might have special characters and/or numbers.

Removal of special characters and/or numbers depends on the type of corpus that we are working on and whether we should keep them in it or not.

For example: if you are working on a document containing email IDs, then you might not want to remove the special characters and numbers whereas in some other textual data if these characters do not make sense, then you can remove them along with the stopwords.

Stopwords: Stopwords are the words that occur very frequently in the corpus but do not add any value to it.

Examples: a, an, and, are, as, for, it, is, into, in, if, on, or, such, the, there, to.

Example

  1. You want to see the dreams with close eyes and achieve them?
    • the removed words would be
    • to, the, and, ?
  2. The outcome would be:
    • You want see dreams with close eyes achieve them
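A minimal sketch of this step in Python, using the stopword examples listed above as the stopword set. Dropping the special character `?` is a choice made for this sentence. As noted earlier, for a corpus such as email IDs you may want to keep such characters.

```python
# Stopword set taken from the examples given in this section.
STOPWORDS = {
    "a", "an", "and", "are", "as", "for", "it", "is", "into",
    "in", "if", "on", "or", "such", "the", "there", "to",
}

tokens = ["You", "want", "to", "see", "the", "dreams", "with",
          "close", "eyes", "and", "achieve", "them", "?"]

# Keep a token only if it is not a stopword (compared in lower case)
# and it is alphanumeric, which drops the special character '?'.
filtered = [t for t in tokens if t.lower() not in STOPWORDS and t.isalnum()]

print(" ".join(filtered))
```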

5. Converting text to a common case

As the name suggests, we convert the whole text into a common case, preferably lower case. This ensures that the machine, being case-sensitive, does not treat the same words as different just because of different cases.
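A small sketch of why this matters: to a computer, the four strings below (made-up examples) are all different until they are converted to lower case.

```python
# The same word written in different cases (illustrative tokens).
tokens = ["Hello", "HELLO", "hello", "HeLLo"]

distinct_before = set(tokens)                  # case-sensitive comparison
distinct_after = {t.lower() for t in tokens}   # after converting to lower case

print(len(distinct_before), len(distinct_after))
```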


6. Stemming

Definition: Stemming is a technique used to extract the base form of the words by removing affixes from them. It is just like cutting down the branches of a tree to its stems.

The stemmed words (words which we get after removing the affixes) might not be meaningful.

Example:

Word | Affix | Stem
healing | ing | heal
dreams | s | dream
studies | es | studi

7. Lemmatization

Definition: In lemmatization, the word we get after affix removal (also known as lemma) is a meaningful one and it takes a longer time to execute than stemming.

Lemmatization makes sure that a lemma is a word with meaning.

Example:

Word | Affix | Lemma
healing | ing | heal
dreams | s | dream
studies | es | study

Difference between stemming and lemmatization

Stemming
  1. The stemmed words might not be meaningful.
  2. Caring ➔ Car
Lemmatization
  1. The lemma word is a meaningful one.
  2. Caring ➔ Care
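The difference can be sketched with two toy functions. Both are simplified assumptions made for illustration: the stemmer just cuts off the first matching suffix (it is not a standard stemming algorithm), and the lemmatizer looks words up in a tiny hand-written dictionary, whereas real lemmatizers rely on a full vocabulary of the language.

```python
# Toy stemmer: strips the first matching suffix, with no check on meaning.
SUFFIXES = ("ing", "es", "s")


def naive_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three letters remain.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Toy lemmatizer: a small hand-written dictionary of word -> lemma.
LEMMAS = {"healing": "heal", "dreams": "dream", "studies": "study", "caring": "care"}


def naive_lemmatize(word: str) -> str:
    word = word.lower()
    return LEMMAS.get(word, word)


for w in ["healing", "dreams", "studies", "Caring"]:
    print(w, "| stem:", naive_stem(w), "| lemma:", naive_lemmatize(w))
```

The stemmer is fast but produces "studi" and "car". The lemmatizer needs a lookup, which is why lemmatization takes longer, but it always returns a meaningful word.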

Bag of Words Algorithm

Bag of Words is a Natural Language Processing model which helps in extracting features out of the text which can be helpful in machine learning algorithms. In a bag of words, we get the occurrences of each word and construct the vocabulary for the corpus.

Bag of Words just creates a set of vectors containing the count of word occurrences in each document (for example, reviews). Bag of Words vectors are easy to interpret.

The text on the left in this image is the normalized corpus that we have got after going through all the steps of text processing. Now, as we put this text into the bag of words algorithm, the algorithm returns to us the unique words out of the corpus and their occurrences in it. As you can see at the right, it shows us a list of words appearing in the corpus, and the numbers corresponding to it show how many times the word has occurred in the text body.

The bag of words gives us two things:

  • A vocabulary of words for the corpus
  • The frequency of these words (number of times it has occurred in the whole corpus).

Here calling this algorithm a “bag” of words symbolizes that the sequence of sentences or tokens does not matter in this case as all we need are the unique words and their frequency in it.
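As a rough sketch, Python's `collections.Counter` behaves like such a bag: it keeps the unique words and their counts, and ignores order. The token list below is a made-up normalised corpus used only for illustration.

```python
from collections import Counter

# Hypothetical normalised corpus, already tokenised and lower-cased.
normalised_tokens = ["aman", "anil", "stressed", "aman", "went", "therapist"]

bag = Counter(normalised_tokens)   # word -> number of occurrences
vocabulary = sorted(bag)           # the unique words

print(vocabulary)
print(bag["aman"])

# Order does not matter: the reversed token list gives the same bag.
same_bag = Counter(reversed(normalised_tokens)) == bag
print(same_bag)
```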

Steps of the bag of words algorithm

  1. Text Normalisation: Collecting data and pre-processing it
  2. Create Dictionary: Making a list of all the unique words occurring in the corpus. (Vocabulary)
  3. Create document vectors: For each document in the corpus, find out how many times the word from the unique list of words has occurred.
  4. Create document vectors for all the documents.

Example:

Step 1: Collecting data and pre-processing it.

Raw Data

  • Document 1: Aman and Anil are stressed
  • Document 2: Aman went to a therapist
  • Document 3: Anil went to download a health chatbot

Processed Data

  • Document 1: [aman, and, anil, are, stressed ]
  • Document 2: [aman, went, to, a, therapist]
  • Document 3: [anil, went, to, download, a, health, chatbot]

Note that no tokens have been removed in the stopwords removal step. It is because we have very little data and since the frequency of all the words is almost the same, no word can be said to have lesser value than the other.

Step 2: Create Dictionary

Definition of Dictionary:

Dictionary in NLP means a list of all the unique words occurring in the corpus. If some words are repeated in different documents, they are all written just once while creating the dictionary. (Source: CBSE)

Dictionary:

aman | and | anil | are | stressed | went
download | health | chatbot | therapist | a | to

Since some words are repeated in different documents, each of them is written just once while creating the dictionary, so that we get a list of unique words.
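Building the dictionary for the three processed documents can be sketched as follows. The words are kept in order of first appearance, which is a choice made here for readability. Any fixed order would work equally well.

```python
documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

vocabulary = []
for doc in documents:
    for token in doc:
        if token not in vocabulary:   # write each word just once
            vocabulary.append(token)

print(vocabulary)
print(len(vocabulary))
```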

Step 3: Create a document vector

Definition of Document Vector: The document Vector contains the frequency of each word of the vocabulary in a particular document.

How to make a document vector table?

In the document vector table, the vocabulary is written in the top row. Now, for each word in the document, if it matches the vocabulary, put a 1 under it. If the same word appears again, increment the previous value by 1. And if the word does not occur in that document, put a 0 under it.

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0

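Step 4: Creating a document vector table for all documents

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0
0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1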

In this table, the header row contains the vocabulary of the corpus and three rows correspond to three different documents. Take a look at this table and analyze the positioning of 0s and 1s in it.
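Steps 3 and 4 can be sketched in Python by counting, for each document, how often every vocabulary word occurs. The function name `document_vector` is made up for this example.

```python
from collections import Counter

vocabulary = ["aman", "and", "anil", "are", "stressed", "went",
              "to", "a", "therapist", "download", "health", "chatbot"]

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]


def document_vector(doc, vocab):
    """Count how many times each vocabulary word occurs in one document."""
    counts = Counter(doc)
    # Counter returns 0 for words that do not occur in the document.
    return [counts[word] for word in vocab]


vectors = [document_vector(doc, vocabulary) for doc in documents]
for vector in vectors:
    print(vector)
```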

Finally, this gives us the document vector table for our corpus. But the tokens have still not been converted to numbers. This leads us to the final step of our algorithm: TFIDF.

TFIDF

TFIDF stands for Term Frequency & Inverse Document Frequency.

Term Frequency

Term Frequency: Term frequency is the frequency of a word in one document.

Term frequency can easily be read from the document vector table, since that table records the frequency of each word of the vocabulary in each document.

Example of Term Frequency:

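aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0
0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1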

Here, we can see that the frequency of each word for each document has been recorded in the table. These numbers are nothing but the Term Frequencies!

Inverse Document Frequency

To understand IDF (Inverse Document Frequency) we should understand DF (Document Frequency) first.

DF (Document Frequency)

Definition of Document Frequency (DF): Document Frequency is the number of documents in which the word occurs irrespective of how many times it has occurred in those documents. (Source: CBSE)

Example of Document Frequency:

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
2 | 1 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 1

What we can observe from the table is:

  • Document frequency of ‘aman’, ‘anil’, ‘went’, ‘to’ and ‘a’ is 2 as they have occurred in two documents.
  • Rest of them occurred in just one document hence the document frequency for them is one.

IDF (Inverse Document Frequency)

Definition of Inverse Document Frequency (IDF): In the case of inverse document frequency, we need to put the document frequency in the denominator while the total number of documents is the numerator.

Example of Inverse Document Frequency:

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
3/2 | 3/1 | 3/2 | 3/1 | 3/1 | 3/2 | 3/2 | 3/2 | 3/1 | 3/1 | 3/1 | 3/1
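Document frequency and inverse document frequency for the example corpus can be sketched as below. Python's `Fraction` type is used so that the values print as fractions (3/2 rather than 1.5). That is only a display choice for this sketch.

```python
from fractions import Fraction

vocabulary = ["aman", "and", "anil", "are", "stressed", "went",
              "to", "a", "therapist", "download", "health", "chatbot"]

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

total_documents = len(documents)

# DF: number of documents that contain the word at least once.
document_frequency = {
    word: sum(1 for doc in documents if word in doc) for word in vocabulary
}

# IDF as defined here: total number of documents divided by DF.
inverse_document_frequency = {
    word: Fraction(total_documents, df) for word, df in document_frequency.items()
}

for word in vocabulary:
    print(word, document_frequency[word], inverse_document_frequency[word])
```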

Formula of TFIDF

The formula of TFIDF for any word W becomes:

TFIDF(W) = TF(W) * log( IDF(W) )

We don’t need to calculate the log values ourselves. We simply use the log function (base 10) on a calculator to find them.

Example of TFIDF:

aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
1*log(3/2) | 1*log(3) | 1*log(3/2) | 1*log(3) | 1*log(3) | 0*log(3/2) | 0*log(3/2) | 0*log(3/2) | 0*log(3) | 0*log(3) | 0*log(3) | 0*log(3)
1*log(3/2) | 0*log(3) | 0*log(3/2) | 0*log(3) | 0*log(3) | 1*log(3/2) | 1*log(3/2) | 1*log(3/2) | 1*log(3) | 0*log(3) | 0*log(3) | 0*log(3)
0*log(3/2) | 0*log(3) | 1*log(3/2) | 0*log(3) | 0*log(3) | 1*log(3/2) | 1*log(3/2) | 1*log(3/2) | 0*log(3) | 1*log(3) | 1*log(3) | 1*log(3)

Here, we can see that the IDF values for Aman in each row are the same and a similar pattern is followed for all the words in the vocabulary. After calculating all the values, we get:

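aman | and | anil | are | stressed | went | to | a | therapist | download | health | chatbot
0.176 | 0.477 | 0.176 | 0.477 | 0.477 | 0 | 0 | 0 | 0 | 0 | 0 | 0
0.176 | 0 | 0 | 0 | 0 | 0.176 | 0.176 | 0.176 | 0.477 | 0 | 0 | 0
0 | 0 | 0.176 | 0 | 0 | 0.176 | 0.176 | 0.176 | 0 | 0.477 | 0.477 | 0.477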
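The whole calculation, TFIDF(W) = TF(W) * log(IDF(W)), can be sketched in Python for the example corpus. The sketch assumes a base-10 logarithm, which matches the values 0.176 and 0.477 in the table, and rounds to three decimal places. The function name `tfidf_vector` is made up for this example.

```python
import math

vocabulary = ["aman", "and", "anil", "are", "stressed", "went",
              "to", "a", "therapist", "download", "health", "chatbot"]

documents = [
    ["aman", "and", "anil", "are", "stressed"],
    ["aman", "went", "to", "a", "therapist"],
    ["anil", "went", "to", "download", "a", "health", "chatbot"],
]

total_documents = len(documents)

# DF: number of documents in which each word occurs.
document_frequency = {
    word: sum(1 for doc in documents if word in doc) for word in vocabulary
}


def tfidf_vector(doc):
    """TFIDF(W) = TF(W) * log10(total documents / DF(W)) for every vocabulary word."""
    return [
        round(doc.count(word) * math.log10(total_documents / document_frequency[word]), 3)
        for word in vocabulary
    ]


tfidf = [tfidf_vector(doc) for doc in documents]
for row in tfidf:
    print(row)
```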

Finally, the words have been converted to numbers. These numbers are the values of each word for each document. Here, we can see that since we have very little data, words like ‘are’ and ‘and’ also have a high value. But as the document frequency of a word increases, its IDF moves towards 1 and the value of that word decreases. That is, for example:

  • Total Number of documents: 10
  • Number of documents in which ‘and’ occurs: 10

Therefore, IDF(and) = 10/10 = 1

Which means: log(1) = 0. Hence, the value of ‘and’ becomes 0.

On the other hand, the number of documents in which ‘pollution’ occurs: 3

IDF(pollution) = 10/3 = 3.3333…

This means log(3.3333) ≈ 0.523, which shows that the word ‘pollution’ has considerable value in the corpus.
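The two cases can be checked with a few lines of Python, again assuming a base-10 logarithm. The document counts are the ones from the example above.

```python
import math

total_documents = 10
documents_containing = {"and": 10, "pollution": 3}

weights = {}
for word, df in documents_containing.items():
    idf = total_documents / df          # IDF = total documents / DF
    weights[word] = math.log10(idf)     # log(IDF)
    print(word, round(idf, 4), round(weights[word], 3))
```

A word that occurs in every document gets a weight of exactly 0, while a word that occurs in only 3 of the 10 documents gets a clearly higher weight.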

Important concepts to remember:

  1. Words that occur in all the documents with high term frequencies have the least values and are considered to be the stopwords.
  2. For a word to have a high TFIDF value, the word needs to have a high term frequency but a low document frequency, which shows that the word is important for one document but is not a common word across all documents.
  3. These values help the computer understand which words are to be considered while processing the natural language. The higher the value, the more important the word is for a given corpus.

Applications of TFIDF

TFIDF is commonly used in the Natural Language Processing domain. Some of its applications are:

  1. Document Classification – Helps in classifying the type and genre of a document.
  2. Topic Modelling – It helps in predicting the topic for a corpus.
  3. Information Retrieval System – To extract the important information out of a corpus.
  4. Stop word filtering – Helps in removing the unnecessary words from a text body.

Related article: Important Questions on NLP

Frequently Asked Questions (FAQ’s)