How to find the most frequent words before and after a given word in a given text in Python?


Uncovering Word Context: Finding Frequent Neighbors in Text with Python


Learn how to use Python and NLTK to identify the most frequent words appearing immediately before and after a specific target word in a given text corpus.

Analyzing the words that frequently appear around a target word can provide valuable insights into its usage, context, and semantic relationships within a given text. This technique is fundamental in Natural Language Processing (NLP) for tasks like understanding word associations, building recommendation systems, or even improving search relevance. This article will guide you through the process of extracting and counting these neighboring words using Python, focusing on efficiency and clarity.

Understanding the Core Concept: N-grams and Context Windows

At the heart of this problem lies the concept of n-grams. An n-gram is a contiguous sequence of 'n' items from a given sample of text or speech. For our purpose, we're interested in bigrams (n=2) that involve our target word: any word appearing immediately before or immediately after the target forms a bigram with it.

The process involves tokenizing the text into individual words, iterating through these words, and when the target word is encountered, checking its immediate predecessors and successors. We then count the occurrences of these neighboring words to determine their frequency.
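
To make the bigram idea concrete, here is a minimal illustration on a toy sentence: pairing each token with its successor yields all bigrams, and we keep only those containing the target word.

# Every bigram in a toy sentence that contains the target word "quick"
tokens = "the quick brown fox is quick".split()

for left, right in zip(tokens, tokens[1:]):  # consecutive word pairs (bigrams)
    if "quick" in (left, right):
        print((left, right))
# ('the', 'quick'), ('quick', 'brown'), ('is', 'quick')

Toy example of extracting bigrams that contain a target word.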

flowchart TD
    A[Start] --> B["Input Text & Target Word"]
    B --> C[Tokenize Text into Words]
    C --> D["Initialize Counters for Before/After Words"]
    D --> E{"More Words to Process?"}
    E -- Yes --> F{"Current Word == Target Word?"}
    F -- Yes --> G["Increment 'Before' Counter for Previous Word (if any)"]
    G --> H["Increment 'After' Counter for Next Word (if any)"]
    H --> E
    F -- No --> E
    E -- No --> I[Sort Counters by Frequency]
    I --> J[Output Most Frequent Words]
    J --> K[End]

Workflow for finding frequent neighboring words.

Step-by-Step Implementation in Python

We'll use Python's built-in capabilities and the collections.Counter class for efficient counting. While NLTK is a powerful library for NLP, for this specific task of finding immediate neighbors, a manual approach can be quite straightforward and efficient, especially if NLTK is not already a dependency in your project. However, we will also show how NLTK can be integrated for more advanced preprocessing.

from collections import Counter
import re

def find_frequent_neighbors(text, target_word, num_results=5):
    # Normalize target word to lowercase for case-insensitive matching
    target_word_lower = target_word.lower()

    # Tokenize text into words, converting to lowercase and removing punctuation
    # A simple regex for word tokenization
    words = re.findall(r'\b\w+\b', text.lower())

    before_words = Counter()
    after_words = Counter()

    for i, word in enumerate(words):
        if word == target_word_lower:
            # Check word before
            if i > 0:
                before_words[words[i-1]] += 1
            # Check word after
            if i < len(words) - 1:
                after_words[words[i+1]] += 1
    
    print(f"\n--- Analysis for '{target_word}' ---")
    print("Most frequent words BEFORE:")
    for word, count in before_words.most_common(num_results):
        print(f"  '{word}': {count}")

    print("\nMost frequent words AFTER:")
    for word, count in after_words.most_common(num_results):
        print(f"  '{word}': {count}")

# Example Usage:
corpus = (
    "The quick brown fox jumps over the lazy dog. "
    "The dog barks loudly. The fox is quick. "
    "The quick dog runs fast. The quick brown dog is happy. "
    "A quick glance reveals the quick fox. The quick quick quick brown fox." 
    "The quick brown fox is a quick animal."
)

find_frequent_neighbors(corpus, "quick")
find_frequent_neighbors(corpus, "dog")
find_frequent_neighbors(corpus, "fox")
find_frequent_neighbors(corpus, "the")

Python function to find and count frequent words before and after a target word.
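
If you need the results programmatically rather than printed, a small variant that returns the two Counter objects may be more convenient. The helper name find_neighbor_counts below is our own; it is a sketch of the same logic.

from collections import Counter
import re

def find_neighbor_counts(text, target_word):
    """Return (before, after) Counters of words adjacent to target_word."""
    target = target_word.lower()
    words = re.findall(r'\b\w+\b', text.lower())
    before, after = Counter(), Counter()
    for i, word in enumerate(words):
        if word == target:
            if i > 0:
                before[words[i - 1]] += 1
            if i < len(words) - 1:
                after[words[i + 1]] += 1
    return before, after

# e.g. the three most common predecessors of "fox" in the corpus above
before, after = find_neighbor_counts(corpus, "fox")
print(before.most_common(3))

Variant returning Counter objects for further processing.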

Integrating NLTK for Advanced Preprocessing

While the re.findall approach is simple, NLTK offers more sophisticated tokenization and preprocessing capabilities. Here's how you can adapt the function to use NLTK for tokenization, which can handle edge cases better.

from collections import Counter
import nltk
from nltk.tokenize import word_tokenize

# Ensure you have the 'punkt' tokenizer data downloaded
# nltk.download('punkt') 

def find_frequent_neighbors_nltk(text, target_word, num_results=5):
    target_word_lower = target_word.lower()

    # Use NLTK's word_tokenize for more robust tokenization
    words = [word.lower() for word in word_tokenize(text) if word.isalpha()] # Filter out punctuation

    before_words = Counter()
    after_words = Counter()

    for i, word in enumerate(words):
        if word == target_word_lower:
            if i > 0:
                before_words[words[i-1]] += 1
            if i < len(words) - 1:
                after_words[words[i+1]] += 1
    
    print(f"\n--- NLTK Analysis for '{target_word}' ---")
    print("Most frequent words BEFORE:")
    for word, count in before_words.most_common(num_results):
        print(f"  '{word}': {count}")

    print("\nMost frequent words AFTER:")
    for word, count in after_words.most_common(num_results):
        print(f"  '{word}': {count}")

# Example Usage with NLTK:
corpus = (
    "The quick brown fox jumps over the lazy dog. "
    "The dog barks loudly. The fox is quick. "
    "The quick dog runs fast. The quick brown dog is happy. "
    "A quick glance reveals the quick fox. The quick quick quick brown fox." 
    "The quick brown fox is a quick animal."
)

find_frequent_neighbors_nltk(corpus, "quick")
find_frequent_neighbors_nltk(corpus, "dog")

Python function using NLTK for tokenization to find frequent neighboring words.
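
One concrete difference worth seeing: the simple regex splits contractions into fragments at the apostrophe, while NLTK's Treebank-style tokenizer splits them into meaningful pieces. A small comparison (expected output shown in comments):

import re
from nltk.tokenize import word_tokenize

sample = "The fox didn't run."

# The regex breaks the contraction at the apostrophe
print(re.findall(r'\b\w+\b', sample.lower()))
# ['the', 'fox', 'didn', 't', 'run']

# NLTK keeps 'did' intact; "n't" and '.' are then dropped by isalpha()
print([w.lower() for w in word_tokenize(sample) if w.isalpha()])
# ['the', 'fox', 'did', 'run']

Comparing regex tokenization with NLTK's word_tokenize on a contraction.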

Further Enhancements and Considerations

This basic approach can be extended in several ways (a sketch combining the first three ideas follows the list):

  • Stop Word Removal: For many NLP tasks, common words like 'the', 'a', 'is' (stop words) are removed as they don't carry much semantic weight. NLTK provides a list of stop words that can be used for filtering.
  • Stemming/Lemmatization: To treat different forms of a word (e.g., 'run', 'running', 'ran') as the same, you can apply stemming or lemmatization. NLTK also offers tools for this.
  • Window Size: Instead of just immediate neighbors, you could expand the 'window' to include 2 or 3 words before and after the target word.
  • Performance for Large Corpora: For very large texts, consider libraries optimized for throughput, such as spaCy or Gensim, which handle large datasets more efficiently.
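
As a hedged sketch of the first three ideas combined, here is one possible find_neighbors_windowed helper (the name is ours). It assumes the NLTK stopwords and wordnet data have been downloaded. Note a design consequence: because stop words are removed before windowing, "neighbors" are the nearest content words rather than strictly adjacent tokens.

from collections import Counter
import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('stopwords') and nltk.download('wordnet')
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def find_neighbors_windowed(text, target_word, window=2, num_results=5):
    """Count lemmatized, stop-word-free neighbors within `window`
    positions of each occurrence of target_word."""
    target = LEMMATIZER.lemmatize(target_word.lower())
    # Tokenize, drop stop words, lemmatize what remains
    words = [LEMMATIZER.lemmatize(w)
             for w in re.findall(r'\b\w+\b', text.lower())
             if w not in STOP_WORDS]

    before, after = Counter(), Counter()
    for i, word in enumerate(words):
        if word == target:
            before.update(words[max(0, i - window):i])
            after.update(words[i + 1:i + 1 + window])

    return before.most_common(num_results), after.most_common(num_results)

# Note: a target that is itself a stop word (e.g. 'the') is filtered
# out above, so this variant only makes sense for content words.
print(find_neighbors_windowed(corpus, "fox", window=2))

Sketch combining stop-word removal, lemmatization, and a wider context window.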