Step-by-step approach to implement a bag of words algorithm
Step 1: Create document vectors for the given documents (Term Frequency Table)
| Document | amit | and | amita | are | twins | lives | with | his | grandparents | in | shimla | her | parents | delhi |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Document 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Document 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| Document 3 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 |
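
The table above can be reproduced with a few lines of Python. This is a minimal sketch, not the only way to do it: the three sentences in `documents` are an assumption reconstructed from the table columns, and the helper names `build_vocabulary` and `term_frequency_vectors` are made up for illustration.

```python
from collections import Counter

# Assumed corpus: reconstructed from the table columns, not quoted from the source.
documents = [
    "amit and amita are twins",
    "amit lives with his grandparents in shimla",
    "amita lives with her parents in delhi",
]

def build_vocabulary(docs):
    """Collect the unique words in order of first appearance."""
    vocab = []
    seen = set()
    for doc in docs:
        for word in doc.lower().split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab

def term_frequency_vectors(docs, vocab):
    """Return one count vector per document, aligned with vocab."""
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # Counter returns 0 for words that do not occur in the document.
        vectors.append([counts[word] for word in vocab])
    return vectors

vocabulary = build_vocabulary(documents)
tf_vectors = term_frequency_vectors(documents, vocabulary)

print(vocabulary)
for i, vector in enumerate(tf_vectors, start=1):
    print(f"Document {i}: {vector}")
```

Each printed vector corresponds to one row of the term frequency table, with the columns in the same order as `vocabulary`.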
Step 2: Record whether each word occurs in each document, using the term frequency table (Document Frequency Table)
| Word | Document 1 | Document 2 | Document 3 |
|---|---|---|---|
| amit | 1 | 1 | 0 |
| and | 1 | 0 | 0 |
| amita | 1 | 0 | 1 |
| are | 1 | 0 | 0 |
| twins | 1 | 0 | 0 |
| lives | 0 | 1 | 1 |
| with | 0 | 1 | 1 |
| his | 0 | 1 | 0 |
| grandparents | 0 | 1 | 0 |
| in | 0 | 1 | 1 |
| shimla | 0 | 1 | 0 |
| her | 0 | 0 | 1 |
| parents | 0 | 0 | 1 |
| delhi | 0 | 0 | 1 |
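
As a rough sketch of this step, the presence flags and the resulting document frequency of each word can be computed as follows. The `documents` list is an assumed corpus reconstructed from the table, and the names `presence` and `document_frequency` are illustrative only.

```python
# Assumed corpus: reconstructed from the table, not quoted from the source.
documents = [
    "amit and amita are twins",
    "amit lives with his grandparents in shimla",
    "amita lives with her parents in delhi",
]

# One set of words per document, so membership checks ignore repeat counts.
token_sets = [set(doc.lower().split()) for doc in documents]

# Vocabulary in order of first appearance.
vocabulary = []
for doc in documents:
    for word in doc.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

# presence[word] holds one 0/1 flag per document (a row of the table above).
presence = {
    word: [int(word in tokens) for tokens in token_sets]
    for word in vocabulary
}

# Document frequency: the number of documents that contain the word.
document_frequency = {word: sum(flags) for word, flags in presence.items()}

for word in vocabulary:
    print(word, presence[word], document_frequency[word])
```

Summing each row of flags gives the document frequency (df) that the next step uses.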
Step 3: Draw the inverse document frequency table
| Word | Document Frequency (df) | Inverse Document Frequency (IDF) |
|---|---|---|
| amit | 2 | log(3/2) |
| and | 1 | log(3/1) |
| amita | 2 | log(3/2) |
| are | 1 | log(3/1) |
| twins | 1 | log(3/1) |
| lives | 2 | log(3/2) |
| with | 2 | log(3/2) |
| his | 1 | log(3/1) |
| grandparents | 1 | log(3/1) |
| in | 2 | log(3/2) |
| shimla | 1 | log(3/1) |
| her | 1 | log(3/1) |
| parents | 1 | log(3/1) |
| delhi | 1 | log(3/1) |
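
To turn the symbolic entries into numbers, the IDF column can be evaluated as in the sketch below. The text does not state the base of the logarithm, so base 10 (`math.log10`) is an assumption here; the function name `inverse_document_frequency` is likewise illustrative.

```python
import math

# Document frequencies copied from the table above.
document_frequency = {
    "amit": 2, "and": 1, "amita": 2, "are": 1, "twins": 1,
    "lives": 2, "with": 2, "his": 1, "grandparents": 1,
    "in": 2, "shimla": 1, "her": 1, "parents": 1, "delhi": 1,
}
total_documents = 3

def inverse_document_frequency(df, n_docs):
    """IDF = log(N / df). Base 10 is an assumption, not stated in the text."""
    return math.log10(n_docs / df)

idf = {
    word: inverse_document_frequency(df, total_documents)
    for word, df in document_frequency.items()
}

for word, value in idf.items():
    print(f"{word}: {value:.3f}")
```

Words that appear in only one document get the larger value log(3/1), while words shared by two documents get the smaller value log(3/2).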
Step 4: Compute TF-IDF. For any word W, the formula is: TFIDF(W) = TF(W) * IDF(W), where IDF(W) = log(N / df(W)) and N is the total number of documents (here N = 3)
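For example, for the word "amit" in Document 2: TF(amit) = 1 and IDF(amit) = log(3/2), so TFIDF(amit) = 1 * log(3/2) ≈ 0.176 (taking the logarithm to base 10).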
Similarly, you can calculate the TF-IDF values for all the words in the documents.
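
The whole calculation can be sketched end to end as below. The corpus is an assumption reconstructed from the tables, base-10 logarithms are assumed, and `tf_idf` is a hypothetical helper name. Libraries such as scikit-learn use a smoothed variant of IDF by default, so their numbers will differ from this plain version.

```python
import math
from collections import Counter

# Assumed corpus: reconstructed from the tables, not quoted from the source.
documents = [
    "amit and amita are twins",
    "amit lives with his grandparents in shimla",
    "amita lives with her parents in delhi",
]

def tf_idf(docs):
    """Return one {word: TF * log10(N / df)} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: count each word once per document.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        scores.append({
            word: count * math.log10(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores

scores = tf_idf(documents)
print(round(scores[1]["amit"], 3))  # "amit" in Document 2 -> 0.176
```

Each dictionary only contains the words that occur in that document; every other word has TF = 0 and therefore a TF-IDF of 0.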
Note that the term frequency table, the document frequency table, and the inverse document frequency table are all intermediate tables used to compute the final TF-IDF values.