Step-by-step approach to implement a bag of words algorithm
Step 1: Create document vectors for the given documents (Term Frequency Table)
| Document | amit | and | amita | are | twins | lives | with | his | grandparents | in | shimla | her | parents | delhi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Document 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
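
As a rough illustration, count vectors of this kind could be built as sketched below. The three sentences in `documents` are an assumption: they are reconstructed from the table's columns and are not quoted from the original text. The names `documents`, `vocabulary`, and `tf_vectors` are likewise chosen only for this example.

```python
from collections import Counter

# Assumed reconstruction of the three documents from the table columns.
documents = [
    "amit and amita are twins",
    "amit lives with his grandparents in shimla",
    "amita lives with her parents in delhi",
]

# Lowercase and split on whitespace (a deliberately simple tokenizer).
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary in order of first appearance, mirroring the table's column order.
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One count vector per document, aligned with the vocabulary.
tf_vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    tf_vectors.append([counts[word] for word in vocabulary])

for i, vector in enumerate(tf_vectors, start=1):
    print(f"Document {i}: {vector}")
```

Each row printed by the loop corresponds to one row of a term frequency table, with one entry per vocabulary word.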
Step 2: Using the term frequency table, record which documents each word occurs in (Document Frequency Table)
| Word | Document 1 | Document 2 | Document 3 |
|---|---|---|---|
| amit | 1 | 1 | 0 |
| and | 1 | 0 | 0 |
| amita | 1 | 0 | 1 |
| are | 1 | 0 | 0 |
| twins | 1 | 0 | 0 |
| lives | 0 | 1 | 1 |
| with | 0 | 1 | 1 |
| his | 0 | 1 | 0 |
| grandparents | 0 | 1 | 0 |
| in | 0 | 1 | 1 |
| shimla | 0 | 1 | 0 |
| her | 0 | 0 | 1 |
| parents | 0 | 0 | 1 |
| delhi | 0 | 0 | 1 |
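
A minimal sketch of how a presence table like this one might be derived, and how the per-word document frequency follows from it. The sentences in `documents` are assumed (reconstructed from the vocabulary, not taken from the source), and `presence` and `document_frequency` are illustrative names.

```python
# Assumed reconstruction of the three documents.
documents = [
    "amit and amita are twins",
    "amit lives with his grandparents in shimla",
    "amita lives with her parents in delhi",
]

# The set of distinct words in each document.
token_sets = [set(doc.lower().split()) for doc in documents]

# Vocabulary in order of first appearance (dict keys preserve insertion order).
vocabulary = list(dict.fromkeys(w for doc in documents for w in doc.lower().split()))

# 1 if the word occurs in the document, 0 otherwise.
presence = {word: [int(word in s) for s in token_sets] for word in vocabulary}

# Document frequency: the number of documents containing the word.
document_frequency = {word: sum(flags) for word, flags in presence.items()}

for word in vocabulary:
    print(word, presence[word], document_frequency[word])
```

Summing each row of the presence table gives the document frequency used in the next step.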
Step 3: Draw the inverse document frequency table, where IDF = log(N / df) and N = 3 is the total number of documents
| Word | Document Frequency (df) | Inverse Document Frequency (IDF) |
|---|---|---|
| amit | 2 | log(3/2) |
| and | 1 | log(3/1) |
| amita | 2 | log(3/2) |
| are | 1 | log(3/1) |
| twins | 1 | log(3/1) |
| lives | 2 | log(3/2) |
| with | 2 | log(3/2) |
| his | 1 | log(3/1) |
| grandparents | 1 | log(3/1) |
| in | 2 | log(3/2) |
| shimla | 1 | log(3/1) |
| her | 1 | log(3/1) |
| parents | 1 | log(3/1) |
| delhi | 1 | log(3/1) |
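
The IDF column can be evaluated numerically along the following lines. This is only a sketch: the text does not fix the base of the logarithm, so base 10 is assumed here, and the `document_frequency` values are copied from the df column of the table.

```python
import math

# Total number of documents in the corpus.
N = 3

# Document frequencies copied from the table above.
document_frequency = {
    "amit": 2, "and": 1, "amita": 2, "are": 1, "twins": 1,
    "lives": 2, "with": 2, "his": 1, "grandparents": 1,
    "in": 2, "shimla": 1, "her": 1, "parents": 1, "delhi": 1,
}

# IDF = log(N / df); base 10 is an assumption, not stated in the text.
idf = {word: math.log10(N / df) for word, df in document_frequency.items()}

for word, value in idf.items():
    print(f"{word}: {value:.3f}")
```

Words that appear in fewer documents receive a larger IDF, which is why "twins" (df = 1) is weighted more heavily than "amit" (df = 2).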
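Step 4: The formula of TFIDF for any word W becomes: TFIDF(W) = TF(W) * log(N / df(W)), that is, the term frequency of W multiplied by its IDF value from Step 3
For example, for the word "amit" in Document 2: TF = 1 and IDF = log(3/2), so TFIDF = 1 * log(3/2) ≈ 0.176 (using base-10 logarithms).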
Similarly, you can calculate the TF-IDF values for all the words in the documents.
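
Putting the steps together, the TF-IDF values for every word could be computed roughly as follows. This is a sketch under stated assumptions rather than a reference implementation: the sentences in `documents` are reconstructed from the vocabulary, the logarithm is taken to base 10, and `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter

# Assumed reconstruction of the three documents.
documents = [
    "amit and amita are twins",
    "amit lives with his grandparents in shimla",
    "amita lives with her parents in delhi",
]


def tf_idf(docs):
    """Return one {word: score} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: count each word once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts within this document
        # Base-10 logarithm is an assumption; the text leaves the base open.
        scores.append({w: c * math.log10(n_docs / df[w]) for w, c in tf.items()})
    return scores


scores = tf_idf(documents)
print(round(scores[1]["amit"], 3))  # "amit" in Document 2
```

Only words that occur in a document get an entry in that document's dictionary; absent words have a term frequency of 0 and therefore a TF-IDF of 0.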
Note that the term frequency table, document frequency table, and inverse document frequency table are all tables that are used