Tesseract OCR: Recognize complete dictionary words only
Learn how to configure Tesseract OCR to improve accuracy by restricting recognition to complete dictionary words, enhancing results for specific use cases.
Tesseract OCR is a powerful open-source optical character recognition engine. While highly versatile, its default configuration might sometimes produce fragmented or incorrect words, especially with noisy images or unusual fonts. For applications where only complete, valid dictionary words are expected (e.g., document processing, data entry validation), restricting Tesseract's output to known words can significantly improve accuracy and reduce post-processing efforts. This article will guide you through configuring Tesseract to achieve this specific recognition behavior.
Understanding Tesseract's Dictionary and Word Lists
Tesseract uses language-specific data files, which include dictionaries and character sets, to perform recognition. By default, it attempts to recognize any sequence of characters that form a plausible word. However, you can provide Tesseract with custom word lists or instruct it to use its internal dictionary more strictly. This is achieved primarily through configuration variables and external files.
flowchart TD
    A[Input Image] --> B{Tesseract OCR Engine}
    B --> C{Default Recognition}
    C --> D[Output: All recognized characters/words]
    B --> E{"Custom Configuration (Word List/Dictionary Mode)"}
    E --> F[Output: Only dictionary words]
    style C fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#bbf,stroke:#333,stroke-width:2px
Comparison of Tesseract's default vs. dictionary-restricted recognition flow.
Configuring Tesseract for Dictionary-Only Recognition
There are several ways to guide Tesseract towards recognizing only complete dictionary words. The most common approach involves using the load_system_dawg and load_freq_dawg parameters, along with potentially providing a custom user-defined word list. A 'dawg' (Directed Acyclic Word Graph) is Tesseract's internal representation of a dictionary.
Note: Make sure you have the language data file (.traineddata) for the language you are processing. These files contain the default system dictionaries.
Method 1: Using Tesseract Configuration Parameters
Tesseract allows you to set various parameters to control its behavior. To make recognition lean on the dictionaries, set load_system_dawg and load_freq_dawg to true. Both are enabled by default, so setting them explicitly mainly guards against a configuration that has switched them off. The wordrec_enable_assoc parameter, which controls the associator in the word recognizer, can also be relevant. Keep in mind that these settings bias Tesseract toward dictionary words; they do not strictly forbid other output.
tesseract image.png output -l eng --oem 3 --psm 3 \
  -c load_system_dawg=true \
  -c load_freq_dawg=true \
  -c wordrec_enable_assoc=true \
  -c tessedit_char_whitelist="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
Example Tesseract command-line options for dictionary-focused recognition.
In the above command:
-l eng: Specifies the English language.
--oem 3: Selects the default OCR engine mode, which picks an engine based on what the installed language data supports.
--psm 3: Uses fully automatic page segmentation without orientation and script detection. This is the default page segmentation mode.
-c load_system_dawg=true: Ensures the system dictionary is loaded.
-c load_freq_dawg=true: Ensures the frequent-word dictionary is loaded.
-c wordrec_enable_assoc=true: Enables the associator in the word recognizer, which groups character fragments while candidate words are built.
-c tessedit_char_whitelist: Restricts output to the listed characters, which indirectly reduces non-word output such as stray punctuation.
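When the command is launched from a script rather than typed by hand, the same options can be assembled programmatically. The following is a minimal Node.js sketch, not part of any Tesseract API: buildTesseractArgs is a hypothetical helper name, and the commented-out call assumes the tesseract binary is available on the PATH.

```javascript
// Hypothetical helper: turns an options object into the argument list
// used by the `tesseract` command shown above.
function buildTesseractArgs(image, outputBase, { lang = 'eng', oem = 3, psm = 3, params = {} } = {}) {
  const args = [image, outputBase, '-l', lang, '--oem', String(oem), '--psm', String(psm)];
  // Each configuration variable becomes its own `-c name=value` pair.
  for (const [name, value] of Object.entries(params)) {
    args.push('-c', `${name}=${value}`);
  }
  return args;
}

const args = buildTesseractArgs('image.png', 'output', {
  params: { load_system_dawg: true, load_freq_dawg: true },
});
console.log(['tesseract', ...args].join(' '));

// To actually run it (assumes tesseract is installed):
// const { execFile } = require('child_process');
// execFile('tesseract', args, (err) => { if (err) throw err; });
```

Passing the arguments as an array to execFile, instead of building one shell string, avoids quoting problems with values such as the character whitelist.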
Method 2: Providing a Custom User-Defined Word List
For highly specific scenarios where the default dictionary isn't sufficient or you have a very limited set of expected words, you can provide Tesseract with a custom word list. This is done with the --user-words option, which points to a plain-text file containing one word per line.
apple
banana
orange
grape
kiwi
Save this content as my_words.txt. Then, you can instruct Tesseract to use this file:
tesseract image.png output -l eng --oem 3 --psm 3 \
  --user-words my_words.txt \
  -c load_system_dawg=false \
  -c load_freq_dawg=false
Using a custom user-defined word list with Tesseract.
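If the expected vocabulary comes from application data rather than a hand-written file, the word list can be generated. This is a small Node.js sketch under stated assumptions: writeWordList is a hypothetical helper, and the temporary directory is used only to keep the example self-contained.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: writes one word per line, dropping blanks and duplicates.
function writeWordList(filePath, words) {
  const unique = [...new Set(words.map((w) => w.trim()).filter((w) => w.length > 0))];
  fs.writeFileSync(filePath, unique.join('\n') + '\n', 'utf8');
  return unique;
}

const wordFile = path.join(os.tmpdir(), 'my_words.txt');
const written = writeWordList(wordFile, ['apple', 'banana', ' orange ', 'grape', 'kiwi', 'apple', '']);
console.log(`${written.length} words written to ${wordFile}`);
```

The resulting path can then be passed to --user-words in place of my_words.txt.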
Note: When using --user-words, you might want to explicitly set load_system_dawg=false and load_freq_dawg=false if you only want words from your custom list. Otherwise, Tesseract will combine your list with its internal dictionaries.
Method 3: Combining Dictionary and User Words
Often, the best approach is to leverage Tesseract's robust internal dictionaries while supplementing them with your own domain-specific terms. This provides a good balance between general accuracy and specialized vocabulary.
tesseract image.png output -l eng --oem 3 --psm 3 \
  --user-words my_words.txt \
  -c load_system_dawg=true \
  -c load_freq_dawg=true
Combining system dictionaries with a custom word list.
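Because dictionaries and word lists influence how Tesseract scores candidates rather than acting as a hard filter, an out-of-vocabulary string can still reach the output. If the application must only ever see known words, a small post-filter on the recognized text is one option. The sketch below is an addition to the methods above, not something Tesseract does itself; keepKnownWords and the sample text are illustrative.

```javascript
// Hypothetical post-filter: keeps only tokens that appear in the allowed list.
function keepKnownWords(ocrText, allowedWords) {
  const allowed = new Set(allowedWords.map((w) => w.toLowerCase()));
  return ocrText
    .split(/\s+/)
    // Strip leading/trailing punctuation so "Kiwi," still matches "kiwi".
    .map((token) => token.replace(/^[^\p{L}\p{N}]+|[^\p{L}\p{N}]+$/gu, ''))
    .filter((token) => token.length > 0 && allowed.has(token.toLowerCase()));
}

const sample = 'apple banan4 Kiwi, gr@pe orange.';
console.log(keepKnownWords(sample, ['apple', 'banana', 'orange', 'grape', 'kiwi']));
```

Matching is case-insensitive here, while the returned tokens keep their original casing. Whether that is appropriate depends on the vocabulary.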
Practical Steps for Implementation (Cordova Example)
If you're integrating Tesseract into a Cordova application, the process involves passing these configuration parameters to the Tesseract library via its API. While the exact API might vary slightly depending on the Cordova plugin you use, the core principle remains the same: provide a configuration object or an array of parameters.
1. Prepare your Tesseract configuration file
Create a text file (e.g., config.txt) containing the Tesseract parameters, one per line. For example:
load_system_dawg true
load_freq_dawg true
wordrec_enable_assoc true
Or, if using a user word list:
load_system_dawg false
load_freq_dawg false
user_words_file /path/to/your/my_words.txt
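A config file in this "name value" format can also be rendered from an object, which helps when the app decides at runtime whether to use the system dictionaries. This is a minimal sketch; renderTesseractConfig is a hypothetical helper, and the path is the placeholder from the example above.

```javascript
// Hypothetical helper: renders parameters in the "name value" format shown above.
function renderTesseractConfig(params) {
  return Object.entries(params)
    .map(([name, value]) => `${name} ${value}`)
    .join('\n') + '\n';
}

const configText = renderTesseractConfig({
  load_system_dawg: false,
  load_freq_dawg: false,
  user_words_file: '/path/to/your/my_words.txt',
});
console.log(configText);
```

The returned string can be written to config.txt with whatever file API the app already uses.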
2. Place custom word lists and config files
Ensure your my_words.txt and config.txt files are accessible to your Cordova application. This usually means placing them in a location that the plugin can read, often within the app's assets or documents directory.
3. Call the Tesseract OCR plugin with configuration
When invoking the Tesseract OCR plugin in your Cordova app, pass the path to your configuration file or the parameters directly. Here's a conceptual example (syntax may vary by plugin):
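// Conceptual example: the plugin object, method name, and option keys are
// placeholders. Check your plugin's documentation for the real API.
cordova.plugins.TesseractPlugin.recognizeImage(
  'path/to/image.jpg', // image to recognize
  'eng',               // language
  {
    tessedit_config_file: 'path/to/config.txt',
    // Or pass individual parameters if the plugin supports it
    // load_system_dawg: 'true',
    // load_freq_dawg: 'true'
  },
  function (result) {
    console.log('OCR Result:', result.text);
  },
  function (error) {
    console.error('OCR Error:', error);
  }
);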
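Callback-style plugin calls like the one above are often easier to compose when wrapped in a Promise. The sketch below assumes only the five-argument call shape used in the conceptual example. The recognizer is passed in as a parameter, so no particular plugin is assumed, and fakeRecognize is a stand-in that exists purely to demonstrate the flow.

```javascript
// Wraps a callback-style recognizer (image, lang, options, onSuccess, onError)
// in a Promise. The recognizer is injected, so no specific plugin is assumed.
function recognizeAsync(recognize, imagePath, lang, options) {
  return new Promise((resolve, reject) => {
    recognize(imagePath, lang, options, resolve, reject);
  });
}

// Stand-in recognizer used only to demonstrate the call shape.
function fakeRecognize(imagePath, lang, options, onSuccess, onError) {
  if (!imagePath) {
    onError(new Error('no image path given'));
    return;
  }
  onSuccess({ text: 'apple banana' });
}

recognizeAsync(fakeRecognize, 'path/to/image.jpg', 'eng', { load_system_dawg: 'true' })
  .then((result) => console.log('OCR Result:', result.text))
  .catch((error) => console.error('OCR Error:', error));
```

In a real app, the first argument would be the plugin's own recognition function, bound to the plugin object if it relies on this.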