Uncovering Word Context: Finding Frequent Neighbors in Text with Python

Learn how to use Python and NLTK to identify the most frequent words appearing immediately before and after a specific target word in a given text corpus.
Analyzing the words that frequently appear around a target word can provide valuable insights into its usage, context, and semantic relationships within a given text. This technique is fundamental in Natural Language Processing (NLP) for tasks like understanding word associations, building recommendation systems, or even improving search relevance. This article will guide you through the process of extracting and counting these neighboring words using Python, focusing on efficiency and clarity.
Understanding the Core Concept: N-grams and Context Windows
At the heart of this problem lies the concept of n-grams. An n-gram is a contiguous sequence of 'n' items from a given sample of text or speech. For our purpose, we're interested in bigrams (n=2) or trigrams (n=3) that involve our target word. Specifically, we'll be looking for words that form a bigram with the target word (one word before or one word after).
The process involves tokenizing the text into individual words, iterating through these words, and when the target word is encountered, checking its immediate predecessors and successors. We then count the occurrences of these neighboring words to determine their frequency.
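To make the idea concrete, here is a minimal sketch using only the standard library. It enumerates every bigram in a short sentence and keeps the ones that contain a target word; the sentence and the target word "fox" are illustrative choices, not part of the article's corpus.

words = "the quick brown fox jumps over the lazy dog the fox runs".split()
target = "fox"

# Pair each word with its successor to form the bigrams of the sentence
bigrams = list(zip(words, words[1:]))

# Bigrams where the target is the second word give its predecessors;
# bigrams where it is the first word give its successors
before = [w1 for w1, w2 in bigrams if w2 == target]
after = [w2 for w1, w2 in bigrams if w1 == target]

print(before)  # words appearing immediately before 'fox'
print(after)   # words appearing immediately after 'fox'

Counting the elements of these two lists is exactly what the full implementation below does, just wrapped in a reusable function.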
flowchart TD
    A[Start] --> B{Input Text & Target Word}
    B --> C[Tokenize Text into Words]
    C --> D{Initialize Counters for Before/After Words}
    D --> E{Iterate Through Words}
    E -- Current Word == Target Word? --> F{Yes}
    F --> G["Check Word Before (if exists)"]
    G --> H["Increment 'Before' Counter"]
    F --> I["Check Word After (if exists)"]
    I --> J["Increment 'After' Counter"]
    J --> E
    H --> E
    E -- No --> E
    E -- End of Words --> K[Sort Counters by Frequency]
    K --> L[Output Most Frequent Words]
    L --> M[End]
Workflow for finding frequent neighboring words.
Step-by-Step Implementation in Python
We'll use Python's built-in capabilities and the collections.Counter class for efficient counting. While NLTK is a powerful NLP library, a manual approach to finding immediate neighbors is straightforward and efficient, especially if NLTK is not already a dependency in your project. We will also show how NLTK can be integrated for more robust preprocessing.
from collections import Counter
import re

def find_frequent_neighbors(text, target_word, num_results=5):
    # Normalize target word to lowercase for case-insensitive matching
    target_word_lower = target_word.lower()

    # Tokenize text into lowercase words, dropping punctuation,
    # using a simple word-boundary regex
    words = re.findall(r'\b\w+\b', text.lower())

    before_words = Counter()
    after_words = Counter()

    for i, word in enumerate(words):
        if word == target_word_lower:
            # Check word before
            if i > 0:
                before_words[words[i - 1]] += 1
            # Check word after
            if i < len(words) - 1:
                after_words[words[i + 1]] += 1

    print(f"\n--- Analysis for '{target_word}' ---")
    print("Most frequent words BEFORE:")
    for word, count in before_words.most_common(num_results):
        print(f" '{word}': {count}")
    print("\nMost frequent words AFTER:")
    for word, count in after_words.most_common(num_results):
        print(f" '{word}': {count}")

# Example Usage:
corpus = (
    "The quick brown fox jumps over the lazy dog. "
    "The dog barks loudly. The fox is quick. "
    "The quick dog runs fast. The quick brown dog is happy. "
    "A quick glance reveals the quick fox. The quick quick quick brown fox. "
    "The quick brown fox is a quick animal."
)

find_frequent_neighbors(corpus, "quick")
find_frequent_neighbors(corpus, "dog")
find_frequent_neighbors(corpus, "fox")
find_frequent_neighbors(corpus, "the")
Python function to find and count frequent words before and after a target word.
If you prefer more robust tokenization, you can swap the regex for NLTK's word_tokenize function. Remember to download the 'punkt' tokenizer data the first time you use NLTK (nltk.download('punkt')).
Integrating NLTK for Advanced Preprocessing
While the re.findall approach is simple, NLTK offers more sophisticated tokenization and preprocessing capabilities. Here's how you can adapt the function to use NLTK's tokenizer, which handles edge cases such as contractions and attached punctuation more gracefully.
from collections import Counter
import nltk
from nltk.tokenize import word_tokenize

# Ensure you have the 'punkt' tokenizer data downloaded
# nltk.download('punkt')

def find_frequent_neighbors_nltk(text, target_word, num_results=5):
    target_word_lower = target_word.lower()

    # Use NLTK's word_tokenize for more robust tokenization,
    # keeping only alphabetic tokens to filter out punctuation
    words = [word.lower() for word in word_tokenize(text) if word.isalpha()]

    before_words = Counter()
    after_words = Counter()

    for i, word in enumerate(words):
        if word == target_word_lower:
            if i > 0:
                before_words[words[i - 1]] += 1
            if i < len(words) - 1:
                after_words[words[i + 1]] += 1

    print(f"\n--- NLTK Analysis for '{target_word}' ---")
    print("Most frequent words BEFORE:")
    for word, count in before_words.most_common(num_results):
        print(f" '{word}': {count}")
    print("\nMost frequent words AFTER:")
    for word, count in after_words.most_common(num_results):
        print(f" '{word}': {count}")

# Example Usage with NLTK:
corpus = (
    "The quick brown fox jumps over the lazy dog. "
    "The dog barks loudly. The fox is quick. "
    "The quick dog runs fast. The quick brown dog is happy. "
    "A quick glance reveals the quick fox. The quick quick quick brown fox. "
    "The quick brown fox is a quick animal."
)

find_frequent_neighbors_nltk(corpus, "quick")
find_frequent_neighbors_nltk(corpus, "dog")
Python function using NLTK for tokenization to find frequent neighboring words.
The word.isalpha() check in the NLTK example filters out the punctuation tokens that word_tokenize produces, ensuring only actual words are counted as neighbors. You might adjust this filtering to your needs (e.g., keeping numbers if they are relevant).
Further Enhancements and Considerations
This basic approach can be extended in several ways:
- Stop Word Removal: For many NLP tasks, very common words like 'the', 'a', and 'is' (stop words) are removed because they carry little semantic weight. NLTK provides a stop-word list you can use for filtering (see the sketch after this list).
- Stemming/Lemmatization: To treat different forms of a word (e.g., 'run', 'running', 'ran') as the same token, apply stemming or lemmatization; NLTK offers tools for both.
- Window Size: Instead of only immediate neighbors, you could expand the window to include 2 or 3 words before and after the target word (also shown in the sketch below).
- Performance for Large Corpora: For extremely large texts, consider libraries optimized for scale, such as spaCy or Gensim, which handle large datasets more efficiently.
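As a starting point for the first and third ideas, here is a hedged sketch that removes NLTK's English stop words and counts all neighbors within a configurable window. The function name find_frequent_neighbors_window and the window_size parameter are illustrative choices, not part of the functions above, and the sketch assumes the 'punkt' and 'stopwords' data have been downloaded.

from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads (uncomment on first run):
# nltk.download('punkt')
# nltk.download('stopwords')

def find_frequent_neighbors_window(text, target_word, window_size=2, num_results=5):
    # Illustrative sketch: count words within `window_size` positions of the
    # target, after removing English stop words.
    stop_words = set(stopwords.words('english'))
    target = target_word.lower()

    # Tokenize, lowercase, keep alphabetic tokens, and drop stop words
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop_words]

    neighbors = Counter()
    for i, word in enumerate(words):
        if word == target:
            # Look up to `window_size` words to the left and right
            start = max(0, i - window_size)
            end = min(len(words), i + window_size + 1)
            for j in range(start, end):
                if j != i:
                    neighbors[words[j]] += 1
    return neighbors.most_common(num_results)

# Example usage with a small toy corpus
corpus = "The quick brown fox jumps over the lazy dog. The quick dog runs fast."
print(find_frequent_neighbors_window(corpus, "quick", window_size=2))

Note the design choice: because stop words are removed before counting, "neighbors" here means the nearest remaining content words rather than the literally adjacent tokens. If you need strict positional adjacency, remove stop words from the counts after windowing instead.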