Tesseract OCR: Recognize complete dictionary words only

Learn how to configure Tesseract OCR to improve accuracy by restricting recognition to complete dictionary words, enhancing results for specific use cases.

Tesseract OCR is a powerful open-source optical character recognition engine. While highly versatile, its default configuration might sometimes produce fragmented or incorrect words, especially with noisy images or unusual fonts. For applications where only complete, valid dictionary words are expected (e.g., document processing, data entry validation), restricting Tesseract's output to known words can significantly improve accuracy and reduce post-processing efforts. This article will guide you through configuring Tesseract to achieve this specific recognition behavior.

Understanding Tesseract's Dictionary and Word Lists

Tesseract uses language-specific data files (traineddata), which bundle character sets and dictionaries, to perform recognition. By default, the dictionaries act as a bias: Tesseract prefers candidate words found in them, but it will still output any character sequence that the recognizer scores highly. You can tighten this behavior by supplying custom word lists and by controlling which built-in dictionaries are loaded, primarily through configuration variables and external files.

flowchart TD
    A[Input Image] --> B{Tesseract OCR Engine}
    B --> C{Default Recognition}
    C --> D[Output: All recognized characters/words]
    B --> E{"Custom Configuration (Word List/Dictionary Mode)"}
    E --> F[Output: Only dictionary words]
    style C fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#bbf,stroke:#333,stroke-width:2px

Comparison of Tesseract's default vs. dictionary-restricted recognition flow.

Configuring Tesseract for Dictionary-Only Recognition

There are several ways to guide Tesseract towards recognizing only complete dictionary words. The most common approach involves using the load_system_dawg and load_freq_dawg parameters, along with potentially providing a custom user-defined word list. A 'dawg' (Directed Acyclic Word Graph) is Tesseract's internal representation of a dictionary.
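To make the idea of a word graph concrete, here is a minimal sketch of a dictionary lookup that distinguishes a complete word from a mere prefix. It is a simplification under stated assumptions: it uses a plain trie rather than a minimized DAWG, and the class and method names (WordTrie, isCompleteWord, isPrefix) are invented for illustration and are not part of Tesseract.

```javascript
// Simplified illustration only: a plain trie, not Tesseract's minimized DAWG.
class WordTrie {
  constructor(words = []) {
    this.root = { children: new Map(), isWord: false };
    for (const word of words) this.add(word);
  }

  // Insert a word one character at a time and mark the final node as a word end.
  add(word) {
    let node = this.root;
    for (const ch of word.toLowerCase()) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), isWord: false });
      }
      node = node.children.get(ch);
    }
    node.isWord = true;
  }

  // Follow the characters of text; returns the node reached, or null if the path ends early.
  walk(text) {
    let node = this.root;
    for (const ch of text.toLowerCase()) {
      node = node.children.get(ch);
      if (!node) return null;
    }
    return node;
  }

  // True only when text is a complete entry, not merely the start of one.
  isCompleteWord(text) {
    const node = this.walk(text);
    return node !== null && node.isWord;
  }

  // True when text could still be extended into (or already is) some entry.
  isPrefix(text) {
    return this.walk(text) !== null;
  }
}

const trie = new WordTrie(['apple', 'banana', 'orange']);
console.log(trie.isCompleteWord('apple')); // true
console.log(trie.isCompleteWord('appl'));  // false
console.log(trie.isPrefix('appl'));        // true
```

A real DAWG additionally merges shared suffixes to save space, but the lookup idea, walking character by character and checking for a word-end marker, is the same.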

Method 1: Using Tesseract Configuration Parameters

Tesseract exposes many parameters that control its behavior. Setting load_system_dawg and load_freq_dawg to true (usually the default, but setting them explicitly documents the intent) makes sure the built-in dictionaries are loaded. Keep in mind that a loaded dictionary biases recognition towards its words rather than strictly forbidding everything else, and how strongly it does so depends on the engine in use. The wordrec_enable_assoc parameter, which belongs to the legacy word recognizer, can also be relevant.

tesseract image.png output -l eng --oem 3 --psm 3 \
  -c load_system_dawg=true \
  -c load_freq_dawg=true \
  -c wordrec_enable_assoc=true \
  -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

Example Tesseract command-line options for dictionary-focused recognition.

In the above command:

  • -l eng: Specifies the English language.
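  • --oem 3: Selects the default OCR engine mode, which is chosen based on what the installed language data supports.
  • --psm 3: Uses fully automatic page segmentation without orientation and script detection. This is the default page segmentation mode.
  • -c load_system_dawg=true: Ensures the system dictionary is loaded.
  • -c load_freq_dawg=true: Ensures the frequency dictionary is loaded.
  • -c wordrec_enable_assoc=true: Enables the associator in the legacy word recognizer, which combines character fragments while searching for the best segmentation. It is normally on by default.
  • -c tessedit_char_whitelist: Restricts output to the listed characters, which indirectly reduces non-word output. The list above contains no punctuation, so punctuation marks will not appear in the result.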
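When the command is assembled programmatically, building the arguments as an array avoids shell-quoting problems with values such as the whitelist. The following is a minimal sketch under stated assumptions: buildTesseractArgs and its option shape are hypothetical helpers written for this article, not part of Tesseract or any plugin.

```javascript
// Builds the argument array for the tesseract CLI from a plain options object.
// The function name and option shape are illustrative, not part of any library.
function buildTesseractArgs(
  imagePath,
  outputBase,
  { lang = 'eng', oem = 3, psm = 3, params = {} } = {}
) {
  const args = [
    imagePath, outputBase,
    '-l', lang,
    '--oem', String(oem),
    '--psm', String(psm),
  ];
  for (const [name, value] of Object.entries(params)) {
    // Each parameter becomes its own "-c name=value" pair.
    args.push('-c', `${name}=${value}`);
  }
  return args;
}

const cliArgs = buildTesseractArgs('image.png', 'output', {
  params: { load_system_dawg: true, load_freq_dawg: true },
});
console.log(['tesseract', ...cliArgs].join(' '));
```

In Node.js, an array like this could be passed to child_process.execFile('tesseract', cliArgs, callback), which runs the binary without going through a shell.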

Method 2: Providing a Custom User-Defined Word List

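For highly specific scenarios where the default dictionary isn't sufficient, or where you expect only a very limited set of words, you can provide Tesseract with a custom word list. This is done with the --user-words command-line option (the corresponding configuration variable is user_words_file), which points to a plain-text file containing one word per line. As with the built-in dictionaries, the list steers recognition towards these words; it is not a hard filter.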

apple
banana
orange
grape
kiwi

Save this content as my_words.txt. Then, you can instruct Tesseract to use this file:

tesseract image.png output -l eng --oem 3 --psm 3 \
  --user-words my_words.txt \
  -c load_system_dawg=false \
  -c load_freq_dawg=false

Using a custom user-defined word list with Tesseract.
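If the expected vocabulary comes from application data rather than a hand-written file, the word list can be generated. This is a minimal sketch, assuming the terms are available as an array of strings; toWordListFile is a hypothetical helper, and the trimming, de-duplication, and skipping of multi-word entries are choices made for this sketch rather than Tesseract requirements.

```javascript
// Turns an array of terms into word-list file content: one word per line.
// Hypothetical helper; the cleanup rules below are choices made for this sketch.
function toWordListFile(words) {
  const seen = new Set();
  const lines = [];
  for (const raw of words) {
    const word = String(raw).trim();
    // Skip blanks, entries containing whitespace, and repeats.
    if (word === '' || /\s/.test(word) || seen.has(word)) continue;
    seen.add(word);
    lines.push(word);
  }
  return lines.length > 0 ? lines.join('\n') + '\n' : '';
}

const wordListContent = toWordListFile(['apple', ' banana ', 'apple', '', 'kiwi']);
console.log(wordListContent);
```

The resulting string can then be written to my_words.txt with whatever file API the platform provides.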

Method 3: Combining Dictionary and User Words

Often, the best approach is to leverage Tesseract's robust internal dictionaries while supplementing them with your own domain-specific terms. This provides a good balance between general accuracy and specialized vocabulary.

tesseract image.png output -l eng --oem 3 --psm 3 \
  --user-words my_words.txt \
  -c load_system_dawg=true \
  -c load_freq_dawg=true

Combining system dictionaries with a custom word list.
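Because dictionaries and word lists influence recognition without guaranteeing it, applications that must emit only known words may still want a final check on the recognized text. The sketch below shows one possible post-processing step, assuming the OCR output is available as a string and the allowed vocabulary as an array; keepDictionaryWords is a hypothetical helper and runs outside Tesseract.

```javascript
// Keeps only tokens found in the allowed word list (case-insensitive).
// Hypothetical post-processing step; it runs on OCR output, not inside Tesseract.
function keepDictionaryWords(ocrText, allowedWords) {
  const allowed = new Set(allowedWords.map((w) => w.toLowerCase()));
  return ocrText
    .split(/\s+/)
    // Strip leading and trailing punctuation so "apple," still matches "apple".
    .map((token) => token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, ''))
    .filter((token) => token !== '' && allowed.has(token.toLowerCase()));
}

const kept = keepDictionaryWords(
  'Apple, banan kiwi. gr@pe orange',
  ['apple', 'banana', 'orange', 'grape', 'kiwi']
);
console.log(kept); // [ 'Apple', 'kiwi', 'orange' ]
```

Here "banan" and "gr@pe" are dropped because neither is a complete entry in the list, which is the strict behavior the recognition settings alone cannot promise.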

Practical Steps for Implementation (Cordova Example)

If you're integrating Tesseract into a Cordova application, the process involves passing these configuration parameters to the Tesseract library via its API. While the exact API might vary slightly depending on the Cordova plugin you use, the core principle remains the same: provide a configuration object or an array of parameters.

1. Prepare your Tesseract configuration file

Create a text file (e.g., config.txt) containing the Tesseract parameters, one per line, each written as the parameter name followed by its value. For example:

load_system_dawg true
load_freq_dawg true
wordrec_enable_assoc true

Or, if using a user word list:

load_system_dawg false
load_freq_dawg false
user_words_file /path/to/your/my_words.txt
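In an app, it can be convenient to generate this file from a settings object instead of maintaining it by hand. The following is a minimal sketch under that assumption; toTesseractConfig is a hypothetical helper, and the validation rules are a precaution chosen for this sketch.

```javascript
// Serializes parameters into config-file text: one "name value" pair per line.
// Hypothetical helper; the validation is a precaution added for this sketch.
function toTesseractConfig(params) {
  const lines = Object.entries(params).map(([name, value]) => {
    const text = String(value);
    // A name with whitespace or a value with a line break would corrupt the file.
    if (/\s/.test(name) || /[\r\n]/.test(text)) {
      throw new Error(`Invalid config entry: ${name}`);
    }
    return `${name} ${text}`;
  });
  return lines.join('\n') + '\n';
}

const configText = toTesseractConfig({
  load_system_dawg: false,
  load_freq_dawg: false,
  user_words_file: '/path/to/your/my_words.txt',
});
console.log(configText);
```

The output matches the second example above and can be saved as config.txt.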

2. Place custom word lists and config files

Ensure your my_words.txt and config.txt files are accessible to your Cordova application. This usually means placing them in a location that the plugin can read, often within the app's assets or documents directory.

3. Call the Tesseract OCR plugin with configuration

When invoking the Tesseract OCR plugin in your Cordova app, pass the path to your configuration file or the parameters directly. Here's a conceptual example (syntax may vary by plugin):

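// Conceptual only: TesseractPlugin, recognizeImage, and the option names below
// are placeholders. Check your plugin's documentation for the real API.
cordova.plugins.TesseractPlugin.recognizeImage(
  'path/to/image.jpg', // image to recognize
  'eng',               // language code
  {
    // Path to the config file prepared in step 1
    tessedit_config_file: 'path/to/config.txt',
    // Or pass individual parameters if the plugin supports it
    // load_system_dawg: 'true',
    // load_freq_dawg: 'true'
  },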
  function(result) {
    console.log('OCR Result:', result.text);
  },
  function(error) {
    console.error('OCR Error:', error);
  }
);