9. Document 1: Amit and Amita are twins. Document 2: Amit lives with his grandparents in Shimla. Document 3: Amita lives with her parents in Delhi. Create a step-by-step approach to implement a bag of words algorithm.

by Deepika Bansal (160 points) in Evaluation asked Dec 24, 2022 1.5k views

1 Answer

by Lalit Kumar (1.3k points) answered Jan 12, 2023

by Lalit Kumar selected Jan 12, 2023

Step-by-step approach to implement a bag of words algorithm

Step 1: Create document vectors for the given documents (Term Frequency Table)

Document	amit	and	amita	are	twins	lives	with	his	grandparents	in	shimla	her	parents	delhi
Document 1	1	1	1	1	1	0	0	0	0	0	0	0	0	0
Document 2	1	0	0	0	0	1	1	1	1	1	1	0	0	0
Document 3	0	0	1	0	0	1	1	0	0	1	0	1	1	1
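
As a rough illustration of Step 1, here is a minimal Python sketch that builds the vocabulary and the term frequency table for the three documents. The helper names (tokenize, build_vocabulary, term_frequency_table) are made up for this example, and the tokenizer is a deliberately simple assumption (lowercase, split on spaces, strip basic punctuation).

```python
from collections import Counter

documents = [
    "Amit and Amita are twins",
    "Amit lives with his grandparents in Shimla.",
    "Amita lives with her parents in Delhi.",
]

def tokenize(text):
    # Assumed simple tokenizer: split on whitespace, strip punctuation, lowercase.
    return [word.strip(".,!?").lower() for word in text.split()]

def build_vocabulary(tokenized_docs):
    # Keep words in order of first appearance so columns match the table above.
    vocabulary = []
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def term_frequency_table(tokenized_docs, vocabulary):
    # One row per document, one count per vocabulary word.
    table = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        table.append([counts[word] for word in vocabulary])
    return table

tokenized = [tokenize(doc) for doc in documents]
vocabulary = build_vocabulary(tokenized)
tf_table = term_frequency_table(tokenized, vocabulary)

for number, row in enumerate(tf_table, start=1):
    print(f"Document {number}", row)
```

Each printed row is the document vector for one document, in the same column order as the table header.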

Step 2: Record which documents each word occurs in, using the term frequency table (Document Frequency Table)

Word	Document 1	Document 2	Document 3
amit	1	1	0
and	1	0	0
amita	1	0	1
are	1	0	0
twins	1	0	0
lives	0	1	1
with	0	1	1
his	0	1	0
grandparents	0	1	0
in	0	1	1
shimla	0	1	0
her	0	0	1
parents	0	0	1
delhi	0	0	1
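
Step 2 could be sketched in Python roughly as follows. The function name document_frequency is hypothetical, and the documents are assumed to be already tokenized into lowercase words.

```python
tokenized_docs = [
    ["amit", "and", "amita", "are", "twins"],
    ["amit", "lives", "with", "his", "grandparents", "in", "shimla"],
    ["amita", "lives", "with", "her", "parents", "in", "delhi"],
]

def document_frequency(tokenized_docs):
    # Count in how many documents each word appears (not how many times).
    df = {}
    for tokens in tokenized_docs:
        for word in set(tokens):  # set() so a word counts once per document
            df[word] = df.get(word, 0) + 1
    return df

df = document_frequency(tokenized_docs)

for word in sorted(df):
    print(word, df[word])
```

Adding up each row of the table above gives the same number as df[word] for that word.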

Step 3: Draw the inverse document frequency table

Word	Document Frequency (df)	Inverse Document Frequency (IDF)
amit	2	log(3/2)
and	1	log(3/1)
amita	2	log(3/2)
are	1	log(3/1)
twins	1	log(3/1)
lives	2	log(3/2)
with	2	log(3/2)
his	1	log(3/1)
grandparents	1	log(3/1)
in	2	log(3/2)
shimla	1	log(3/1)
her	1	log(3/1)
parents	1	log(3/1)
delhi	1	log(3/1)
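
To get actual numbers for the IDF column, something like the sketch below could be used. It assumes base-10 logarithms (the answer does not state the base), and the function name inverse_document_frequency is only illustrative.

```python
import math

# Document frequencies taken from the table above.
df = {
    "amit": 2, "and": 1, "amita": 2, "are": 1, "twins": 1,
    "lives": 2, "with": 2, "his": 1, "grandparents": 1, "in": 2,
    "shimla": 1, "her": 1, "parents": 1, "delhi": 1,
}

def inverse_document_frequency(df, n_docs):
    # IDF(W) = log(total number of documents / document frequency of W).
    # Base 10 is an assumption; another base only rescales every value.
    return {word: math.log10(n_docs / count) for word, count in df.items()}

idf = inverse_document_frequency(df, n_docs=3)

for word, value in idf.items():
    print(word, round(value, 3))
```

Words that appear in fewer documents get a higher IDF, so "twins" (1 document) scores higher than "amit" (2 documents).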

Step 4: The formula of TFIDF for any word W becomes: TFIDF(W) = TF(W) * IDF(W), where IDF(W) = log(N / DF(W)) and N is the total number of documents (here N = 3)

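For example, for the word "amit" in Document 2: TF = 1 and the word appears in 2 of the 3 documents, so TFIDF(amit) = 1 * log(3/2) ≈ 0.176 (assuming base-10 logarithms).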

Similarly, you can calculate the TF-IDF values for all the words in the documents.
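
Putting the steps together, a minimal end-to-end sketch in Python might look like this. The function name tf_idf is hypothetical, raw counts are used as TF, and base-10 logarithms are assumed.

```python
import math
from collections import Counter

tokenized_docs = [
    ["amit", "and", "amita", "are", "twins"],
    ["amit", "lives", "with", "his", "grandparents", "in", "shimla"],
    ["amita", "lives", "with", "her", "parents", "in", "delhi"],
]

def tf_idf(tokenized_docs):
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for tokens in tokenized_docs for word in set(tokens))
    scores = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)  # raw term counts for this document
        scores.append({
            word: count * math.log10(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores

scores = tf_idf(tokenized_docs)

for number, doc_scores in enumerate(scores, start=1):
    print(f"Document {number}")
    for word, value in doc_scores.items():
        print("  ", word, round(value, 3))
```

Words that occur in only one document (such as "twins" or "shimla") end up with the highest scores, while words shared by two documents (such as "amit" or "lives") score lower.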

Note that the term frequency table, document frequency table, and inverse document frequency table are all intermediate tables used to compute the TF-IDF values.