# 9. Document 1: Amit and Amita are twins. Document 2: Amit lives with his grandparents in Shimla. Document 3: Amita lives with her parents in Delhi. Create a step-by-step approach to implement a bag of words algorithm.

Step-by-step approach to implement a bag of words algorithm

Step 1: Create document vectors for the given documents (Term Frequency Table)

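| Document | amit | and | amita | are | twins | lives | with | his | grandparents | in | shimla | her | parents | delhi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Document 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |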
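
Step 1 can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the helper names `tokenize`, `build_vocabulary`, and `document_vector` are hypothetical, and the tokenizer assumes that lowercasing and splitting on whitespace is enough for these three sentences.

```python
# The three documents from the question.
documents = [
    "Amit and Amita are twins",
    "Amit lives with his grandparents in Shimla",
    "Amita lives with her parents in Delhi",
]

def tokenize(text):
    # Simplifying assumption: lowercase, then split on whitespace only.
    return text.lower().split()

def build_vocabulary(docs):
    # Collect each distinct word once, in order of first appearance.
    vocabulary = []
    for doc in docs:
        for word in tokenize(doc):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def document_vector(doc, vocabulary):
    # Count how many times each vocabulary word occurs in this document.
    tokens = tokenize(doc)
    return [tokens.count(word) for word in vocabulary]

vocabulary = build_vocabulary(documents)
vectors = [document_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for number, vector in enumerate(vectors, start=1):
    print(f"Document {number}:", vector)
```

Each printed vector has one entry per vocabulary word, in the same column order as the table header.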

Step 2: Using the term frequency table, record which documents each word occurs in (Document Frequency Table)

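| Word | Document 1 | Document 2 | Document 3 |
|---|---|---|---|
| amit | 1 | 1 | 0 |
| and | 1 | 0 | 0 |
| amita | 1 | 0 | 1 |
| are | 1 | 0 | 0 |
| twins | 1 | 0 | 0 |
| lives | 0 | 1 | 1 |
| with | 0 | 1 | 1 |
| his | 0 | 1 | 0 |
| grandparents | 0 | 1 | 0 |
| in | 0 | 1 | 1 |
| shimla | 0 | 1 | 0 |
| her | 0 | 0 | 1 |
| parents | 0 | 0 | 1 |
| delhi | 0 | 0 | 1 |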
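
Summing each row of this table gives the document frequency of a word, that is, the number of documents it appears in. A rough Python sketch of that count is shown here; the function name `document_frequency` is hypothetical, and whitespace tokenization is assumed.

```python
documents = [
    "Amit and Amita are twins",
    "Amit lives with his grandparents in Shimla",
    "Amita lives with her parents in Delhi",
]

def document_frequency(docs):
    # Map each word to the number of documents that contain it.
    df = {}
    for doc in docs:
        # set() ensures a word repeated within one document counts once.
        for word in set(doc.lower().split()):
            df[word] = df.get(word, 0) + 1
    return df

df = document_frequency(documents)
print(df["amit"], df["lives"], df["twins"])  # prints: 2 2 1
```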

Step 3: Draw the inverse document frequency table

| Word | Document Frequency (df) | Inverse Document Frequency (IDF) |
|---|---|---|
| amit | 2 | log(3/2) |
| and | 1 | log(3/1) |
| amita | 2 | log(3/2) |
| are | 1 | log(3/1) |
| twins | 1 | log(3/1) |
| lives | 2 | log(3/2) |
| with | 2 | log(3/2) |
| his | 1 | log(3/1) |
| grandparents | 1 | log(3/1) |
| in | 2 | log(3/2) |
| shimla | 1 | log(3/1) |
| her | 1 | log(3/1) |
| parents | 1 | log(3/1) |
| delhi | 1 | log(3/1) |
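
The IDF column can be evaluated numerically with a short sketch like the one below. It assumes a base-10 logarithm (which matches the value 0.176 for log(3/2)) and hard-codes the document frequencies from the table; the variable names are illustrative only.

```python
import math

total_documents = 3

# Document frequencies copied from the table above.
document_frequency = {
    "amit": 2, "and": 1, "amita": 2, "are": 1, "twins": 1,
    "lives": 2, "with": 2, "his": 1, "grandparents": 1,
    "in": 2, "shimla": 1, "her": 1, "parents": 1, "delhi": 1,
}

# IDF(W) = log10(total documents / documents containing W)
idf = {
    word: math.log10(total_documents / df)
    for word, df in document_frequency.items()
}

for word, value in idf.items():
    print(f"{word}: {value:.3f}")
```

Words that occur in two documents get log(3/2) ≈ 0.176, and words that occur in only one get log(3/1) ≈ 0.477, so rarer words receive a higher weight.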

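Step 4: Compute TF-IDF. For any word W, the formula is: TFIDF(W) = TF(W) * IDF(W), where IDF(W) = log(N / df(W)), N is the total number of documents (3 here), and log is taken to base 10.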

For example, for the word "amit" in Document 2:

• TF(W) = 1 (occurs once in the document)
• IDF(W) = log(3/2) (occurs in 2 out of 3 documents)
• TFIDF(W) = 1 * log(3/2) = 0.176

Similarly, you can calculate the TF-IDF values for all the words in the documents.
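
One way to compute the remaining values is sketched below. It is a minimal example under stated assumptions: whitespace tokenization, raw counts as term frequency, a base-10 logarithm, and a hypothetical helper `tf_idf` that returns 0.0 for a word found in no document.

```python
import math

documents = [
    "Amit and Amita are twins",
    "Amit lives with his grandparents in Shimla",
    "Amita lives with her parents in Delhi",
]

# Assumption: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in documents]
n_docs = len(tokenized)

def tf_idf(word, tokens):
    # Term frequency: raw count of the word in this document.
    tf = tokens.count(word)
    # Document frequency: number of documents containing the word.
    df = sum(1 for doc_tokens in tokenized if word in doc_tokens)
    if df == 0:
        return 0.0  # word not in the corpus; avoids division by zero
    return tf * math.log10(n_docs / df)

# The worked example: "amit" in Document 2.
print(round(tf_idf("amit", tokenized[1]), 3))  # prints: 0.176

# TF-IDF for every word of every document.
for number, tokens in enumerate(tokenized, start=1):
    scores = {word: round(tf_idf(word, tokens), 3) for word in tokens}
    print(f"Document {number}:", scores)
```

A word that does not occur in a given document has TF = 0, so its TF-IDF for that document is 0 regardless of its IDF.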

Note that the term frequency table, document frequency table, and inverse document frequency table are all tables that are used