Step-by-step approach to implement a bag of words algorithm
Step 1: Create document vectors for the given documents (Term Frequency Table)
Document | amit | and | amita | are | twins | lives | with | his | grandparents | in | shimla | her | parents | delhi |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Document 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
Document 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
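
As a rough illustration, the term frequency table can be built with a few lines of Python. This is only a sketch: the three sentences in `documents` are an assumption reconstructed from the table columns, and the names `documents`, `vocabulary` and `tf_table` are made up for this example.

```python
from collections import Counter

# Assumed corpus, reconstructed from the table columns (not quoted from the source).
documents = [
    "amit and amita are twins",
    "amit lives with his grandparents in shimla",
    "amita lives with her parents in delhi",
]

# Build the vocabulary in order of first appearance across the documents.
vocabulary = []
for doc in documents:
    for word in doc.split():
        if word not in vocabulary:
            vocabulary.append(word)

# One term-frequency vector per document, aligned with the vocabulary.
tf_table = []
for doc in documents:
    counts = Counter(doc.split())  # a missing word counts as 0
    tf_table.append([counts[word] for word in vocabulary])

for number, row in enumerate(tf_table, start=1):
    print(f"Document {number}:", row)
```

Each printed row is the document vector for one document, with one count per vocabulary word.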
Step 2: Using the term frequency table, record whether each word occurs in each document (Document Frequency Table)
Word | Document 1 | Document 2 | Document 3 |
---|---|---|---|
amit | 1 | 1 | 0 |
and | 1 | 0 | 0 |
amita | 1 | 0 | 1 |
are | 1 | 0 | 0 |
twins | 1 | 0 | 0 |
lives | 0 | 1 | 1 |
with | 0 | 1 | 1 |
his | 0 | 1 | 0 |
grandparents | 0 | 1 | 0 |
in | 0 | 1 | 1 |
shimla | 0 | 1 | 0 |
her | 0 | 0 | 1 |
parents | 0 | 0 | 1 |
delhi | 0 | 0 | 1 |
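
The occurrence table can be sketched in Python as follows. The corpus is again an assumption reconstructed from the tables, and `presence` and `df` are hypothetical names chosen for this example; `df` simply sums each row to give the number of documents that contain the word.

```python
# Assumed corpus, reconstructed from the tables (not quoted from the source).
documents = [
    "amit and amita are twins",
    "amit lives with his grandparents in shimla",
    "amita lives with her parents in delhi",
]

# Use sets so that each word is counted at most once per document.
token_sets = [set(doc.split()) for doc in documents]
vocabulary = sorted(set().union(*token_sets))

# presence[word] holds 1 if the word occurs in the document, else 0.
presence = {
    word: [int(word in tokens) for tokens in token_sets]
    for word in vocabulary
}

# Document frequency: the number of documents containing the word.
df = {word: sum(flags) for word, flags in presence.items()}

for word in vocabulary:
    print(word, presence[word], "df =", df[word])
```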
Step 3: Draw the inverse document frequency table
Word | Document Frequency (df) | Inverse Document Frequency (IDF) |
---|---|---|
amit | 2 | log(3/2) |
and | 1 | log(3/1) |
amita | 2 | log(3/2) |
are | 1 | log(3/1) |
twins | 1 | log(3/1) |
lives | 2 | log(3/2) |
with | 2 | log(3/2) |
his | 1 | log(3/1) |
grandparents | 1 | log(3/1) |
in | 2 | log(3/2) |
shimla | 1 | log(3/1) |
her | 1 | log(3/1) |
parents | 1 | log(3/1) |
delhi | 1 | log(3/1) |
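
To turn the entries of this table into numbers, a short sketch like the one below can be used. The base of the logarithm is not stated in the table, so base 10 is an assumption here; the `df` values are copied from the table and the variable names are illustrative only.

```python
import math

total_documents = 3

# Document frequencies copied from the table above.
df = {
    "amit": 2, "and": 1, "amita": 2, "are": 1, "twins": 1,
    "lives": 2, "with": 2, "his": 1, "grandparents": 1,
    "in": 2, "shimla": 1, "her": 1, "parents": 1, "delhi": 1,
}

# log(total documents / df), using base 10 (an assumption).
idf = {word: math.log10(total_documents / count) for word, count in df.items()}

for word, value in idf.items():
    print(f"{word}: {value:.3f}")
```

Words that occur in fewer documents get a larger value, so rarer words such as "twins" are weighted more heavily than words such as "lives".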
Step 4: The formula of TFIDF for any word W becomes: TFIDF(W) = TF(W) * IDF(W), where IDF(W) = log(N / df(W)) and N is the total number of documents (here N = 3)
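For example, for the word "amit" in Document 2: the term frequency is 1 and the word occurs in 2 of the 3 documents, so TFIDF(amit) = 1 * log(3/2) ≈ 0.176 (taking the logarithm to base 10).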
Similarly, you can calculate the TF-IDF values for all the words in the documents.
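
Putting the steps together, a minimal sketch of the whole calculation might look like this. It multiplies the term frequency of a word by log(N / df), with base 10 assumed for the logarithm; the corpus is reconstructed from the tables, and the helper `tfidf` is a hypothetical name, not a library function.

```python
import math
from collections import Counter

# Assumed corpus, reconstructed from the tables (not quoted from the source).
documents = [
    "amit and amita are twins",
    "amit lives with his grandparents in shimla",
    "amita lives with her parents in delhi",
]

tokenized = [doc.split() for doc in documents]
n_docs = len(tokenized)

# Document frequency: count each word once per document it appears in.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tfidf(word, doc_index):
    """TF-IDF of a vocabulary word in the document at doc_index (0-based)."""
    tf = tokenized[doc_index].count(word)
    # Only defined for words that occur somewhere in the corpus (df > 0).
    return tf * math.log10(n_docs / df[word])

# "amit" in Document 2 (index 1): TF = 1, df = 2, N = 3.
print(round(tfidf("amit", 1), 3))
```

A word that does not occur in a document has a term frequency of 0, so its TF-IDF value for that document is 0 as well.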
Note that the term frequency table, document frequency table, and inverse document frequency table are all intermediate tables used to compute the final TF-IDF values.