Text normalization involves transforming text into a standard (base) form that can be easily processed by NLP algorithms. Here are some common methods for text normalization:
- Tokenization: Breaking text into individual words or tokens.
- Case folding: Converting all words to lowercase so that matching is not case-sensitive.
- Stop word removal: Removing frequently occurring words that do not carry much meaning, such as "the," "and," and "a."
- Stemming or Lemmatization: Reducing inflected or derived words to their base or dictionary form, such as "playing" to "play" or "cars" to "car."
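The steps above can be sketched in a small pipeline. The sketch below uses a regex tokenizer, a hand-picked stop word set, and a toy suffix-stripping stemmer as a stand-in for a real stemmer such as Porter's; the `STOP_WORDS` set and `simple_stem` rules are illustrative assumptions, not a standard implementation.

```python
import re

# Illustrative stop word set; real lists (e.g. NLTK's) are much larger.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is", "are"}

def simple_stem(token: str) -> str:
    # Toy suffix stripping: a stand-in for a proper stemmer like Porter's.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize(text: str) -> list[str]:
    # Tokenize and lowercase in one pass.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Drop stop words, then stem what remains.
    return [simple_stem(t) for t in tokens if t not in STOP_WORDS]

print(normalize("The cars are playing in the garden"))
# → ['car', 'play', 'garden']
```

In practice you would swap `simple_stem` for a library stemmer or lemmatizer, since naive suffix stripping mangles words like "bus" or "ring".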
Here's how we can normalize the text in the given documents:
Document 1: