Through a step-by-step process, calculate TFIDF for the given corpus and mention the word(s) having the highest value.
Document 1: We are going to Mumbai
Document 2: Mumbai is a famous place.
Document 3: We are going to a famous place.
Document 4: I am famous in Mumbai.
Topic | Natural Language Processing (AI Domain) |
Type | Long answer type |
Class | 10 |
Term Frequency:-
Term frequency is the number of times a word occurs in a single document. It can be read directly from the document vector table, which lists the frequency of each word of the vocabulary in each document. In the table below, each row corresponds to one document, in order from Document 1 to Document 4.
We | Are | Going | To | Mumbai | Is | a | famous | place | I | am | in |
1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
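The document vector table above can also be built programmatically. The following is a minimal sketch, not a prescribed method: the `tokenize` helper, the lowercasing, and the removal of full stops are assumptions made here for illustration.

```python
from collections import Counter

corpus = [
    "We are going to Mumbai",
    "Mumbai is a famous place.",
    "We are going to a famous place.",
    "I am famous in Mumbai.",
]

def tokenize(text):
    # Assumption: lowercase everything, drop full stops, split on whitespace.
    return text.lower().replace(".", "").split()

# Vocabulary in order of first appearance across the corpus.
vocabulary = []
for doc in corpus:
    for word in tokenize(doc):
        if word not in vocabulary:
            vocabulary.append(word)

# One row per document, one column per vocabulary word.
tf_table = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    tf_table.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in tf_table:
    print(row)
```

Each printed row should match the corresponding row of the table, with the words lowercased.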
Inverse Document Frequency:-
We first need the document frequency, which is the number of documents in which a word occurs, irrespective of how many times it occurs in each of them:
We | Are | Going | To | Mumbai | Is | a | famous | place | I | am | in |
2 | 2 | 2 | 2 | 3 | 1 | 2 | 3 | 2 | 1 | 1 | 1 |
For inverse document frequency, the total number of documents goes in the numerator and the document frequency goes in the denominator. Here, the total number of documents is 4, hence the inverse document frequency becomes:
We | Are | Going | To | Mumbai | Is | a | famous | place | I | am | in |
4/2 | 4/2 | 4/2 | 4/2 | 4/3 | 4/1 | 4/2 | 4/3 | 4/2 | 4/1 | 4/1 | 4/1 |
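The remaining step, combining term frequency with inverse document frequency, can be sketched as follows. This assumes the commonly used formulation TFIDF(W) = TF(W) × log(IDF(W)) with a base-10 logarithm; the formula and the base are assumptions here, since the worked answer above stops at the IDF ratios. The variable names are illustrative only.

```python
import math

# The four documents, already lowercased and split into words.
docs = [
    ["we", "are", "going", "to", "mumbai"],
    ["mumbai", "is", "a", "famous", "place"],
    ["we", "are", "going", "to", "a", "famous", "place"],
    ["i", "am", "famous", "in", "mumbai"],
]
n_docs = len(docs)
vocabulary = sorted({word for doc in docs for word in doc})

# Document frequency: number of documents that contain the word.
document_frequency = {w: sum(w in doc for doc in docs) for w in vocabulary}

# Inverse document frequency: total documents / document frequency.
idf = {w: n_docs / document_frequency[w] for w in vocabulary}

# Assumed formula: TFIDF(W) = TF(W) * log10(IDF(W)), computed per document.
tfidf = [
    {w: doc.count(w) * math.log10(idf[w]) for w in set(doc)}
    for doc in docs
]

highest = max(value for row in tfidf for value in row.values())
top_words = sorted(
    {w for row in tfidf for w, value in row.items() if math.isclose(value, highest)}
)
print(top_words, round(highest, 3))  # ['am', 'i', 'in', 'is'] 0.602
```

Under this assumption, the words that occur in only one document ("is" in Document 2, and "I", "am" and "in" in Document 4) share the highest value, log(4/1) ≈ 0.602, while "Mumbai" and "famous", which occur in three documents, get the lowest non-zero value, log(4/3) ≈ 0.125.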