Getting the bounding box of the recognized words using python-tesseract

Learn getting the bounding box of the recognized words using python-tesseract with practical examples, diagrams, and best practices. Covers python, image-processing, ocr development techniques with...

Extracting Bounding Boxes for Words with Python-Tesseract

An illustration of an OCR process showing text recognition and bounding boxes around words on a document.

Learn how to use Python-Tesseract to not only recognize text but also retrieve the precise bounding box coordinates for each detected word, enabling advanced OCR applications.

Optical Character Recognition (OCR) is a technology that enables the conversion of different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into editable and searchable data. While basic OCR provides the recognized text, many advanced applications require more granular information, specifically the location of each recognized word or character on the original image. This article will guide you through using python-tesseract to achieve this, focusing on word-level bounding box extraction.

Understanding Tesseract's Output Formats

Tesseract, the OCR engine behind python-tesseract, can output recognized text in various formats. Beyond plain text, it offers structured data that includes bounding box information. This is crucial for tasks like highlighting text in an image, creating interactive PDFs, or performing layout analysis. The image_to_data() function in python-tesseract is specifically designed for this purpose, providing a detailed breakdown of recognized components.

flowchart TD
    A[Input Image] --> B{Tesseract OCR Engine}
    B --> C["image_to_data() function"]
    C --> D["Pandas DataFrame Output"]
    D --> E{"Extract Bounding Box Data"}
    E --> F["Word Coordinates (left, top, width, height)"]
    E --> G["Confidence Score"]
    E --> H["Recognized Text"]
    F --> I["Further Processing (e.g., Drawing Boxes)"]

Flowchart of Tesseract's data extraction process for bounding boxes.

Prerequisites and Setup

Before diving into the code, ensure you have Tesseract OCR installed on your system and the pytesseract library installed in your Python environment. You'll also need OpenCV (cv2) for image loading and manipulation, and Pillow (PIL) for image handling with pytesseract.

1. Install Tesseract OCR Engine

Download and install Tesseract from its official GitHub page or via your system's package manager. For Windows, consider installers like UB-Mannheim's. Remember to add Tesseract to your system's PATH environment variable.

2. Install Python Libraries

Use pip to install the necessary Python packages: pip install pytesseract opencv-python Pillow.

3. Verify Installation

Run tesseract --version in your terminal to confirm Tesseract is installed. In Python, try import pytesseract to ensure the library is accessible.

Extracting Word Bounding Boxes

The pytesseract.image_to_data() function is the key to obtaining detailed OCR results, including bounding box information. It returns a dictionary that can be easily converted into a Pandas DataFrame for better readability and manipulation. Each row in this DataFrame corresponds to a recognized component (page, block, paragraph, line, word), and includes its text, confidence score, and bounding box coordinates.

import cv2
import pytesseract
from PIL import Image

# Path to your Tesseract executable (change if necessary)
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def get_word_bounding_boxes(image_path):
    # Load the image using OpenCV
    img = cv2.imread(image_path)
    if img is None:
        print(f"Error: Could not load image from {image_path}")
        return

    # Convert the image to RGB (Tesseract expects RGB or grayscale)
    rgb_img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

    # Use image_to_data to get detailed OCR results
    # output_type=pytesseract.Output.DICT returns a dictionary
    data = pytesseract.image_to_data(rgb_img, output_type=pytesseract.Output.DICT)

    n_boxes = len(data['level'])
    for i in range(n_boxes):
        # Level 5 corresponds to words
        if data['level'][i] == 5:
            text = data['text'][i]
            conf = int(data['conf'][i])
            # Filter out empty text and low confidence results
            if text.strip() and conf > 60: # Adjust confidence threshold as needed
                x, y, w, h = data['left'][i], data['top'][i], data['width'][i], data['height'][i]
                
                # Draw bounding box on the image
                cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
                # Put text label (optional)
                cv2.putText(img, text, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 1)
                
                print(f"Word: '{text}', Confidence: {conf}, Bounding Box: (x={x}, y={y}, w={w}, h={h})")

    # Display the image with bounding boxes
    cv2.imshow('Detected Words', img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

# Example usage:
# Make sure you have an image file named 'sample.png' in the same directory
# or provide a full path to your image.
# get_word_bounding_boxes('sample.png')

Python code to extract word-level bounding boxes and draw them on an image.

💡

The data dictionary returned by image_to_data() contains keys like level, page_num, block_num, par_num, line_num, word_num, left, top, width, height, conf, and text. The level key indicates the hierarchy (1=page, 2=block, 3=paragraph, 4=line, 5=word). Filtering by level == 5 ensures you're processing individual words.

Interpreting the Output

The image_to_data() function provides coordinates relative to the top-left corner of the image. The left and top values represent the x and y coordinates of the top-left corner of the bounding box, while width and height define its dimensions. The conf (confidence) score indicates how certain Tesseract is about the recognition, ranging from 0 to 100. Filtering by confidence can help remove spurious detections.

An example image with recognized words highlighted by green bounding boxes and red text labels.

An example of an image with word-level bounding boxes drawn using the provided Python script.

Advanced Considerations and Best Practices

For more robust OCR, consider pre-processing your images. This can include resizing, binarization, noise reduction, and deskewing. Tesseract's performance is highly dependent on image quality. Additionally, for specific languages, ensure you have the corresponding language data packs installed and specify the language using the lang parameter in pytesseract.image_to_data().

⚠️

Tesseract's image_to_data() can sometimes return empty strings or very low confidence scores for non-text regions or poorly recognized words. Always include checks for text.strip() and conf values to filter out unreliable results, especially when drawing boxes or performing further analysis.