pyparsing - defining keywords - compare Literal, Word, Keyword and Combine
Mastering Keywords in pyparsing: Literal, Word, Keyword, and Combine

Explore the nuances of defining keywords and specific text patterns in pyparsing, comparing Literal, Word, Keyword, and Combine for robust parsing.
When building parsers with pyparsing, accurately identifying and handling keywords and specific text patterns is crucial. pyparsing offers several powerful primitives for this purpose: Literal, Word, Keyword, and Combine. While they might seem similar at first glance, each serves a distinct role and is optimized for different scenarios. Understanding their differences is key to writing efficient, correct, and maintainable parsers.
Literal: Exact String Matching
Literal is the most straightforward way to match an exact string. It is case-sensitive (CaselessLiteral is the case-insensitive counterpart) and matches only the precise sequence of characters provided. It's ideal for fixed tokens like operators, punctuation, or specific command names that must appear exactly as written.
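from pyparsing import Literal, ParseException

# Matches the exact string "SELECT"
select_keyword = Literal("SELECT")

# Test cases
print(select_keyword.parseString("SELECT"))
# Output: ['SELECT']

# Wrong case: Literal is case-sensitive, so this raises ParseException
try:
    print(select_keyword.parseString("select"))
except ParseException as e:
    print(f"Error: {e}")
# Output (exact wording varies by pyparsing version):
# Error: Expected "SELECT" (at char 0), (line:1, col:1)

# Embedded space: the input no longer contains the exact string
try:
    print(select_keyword.parseString("SELEC T"))
except ParseException as e:
    print(f"Error: {e}")
# Output (exact wording varies by pyparsing version):
# Error: Expected "SELECT" (at char 0), (line:1, col:1)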
Using Literal for exact string matching.
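Because Literal performs no boundary check, it also succeeds when the input merely begins with the literal text. The following is a minimal sketch of that behaviour; the input "SELECTED" is an illustrative choice, and parseAll=True is shown as one way to require that the whole input be consumed.

```python
from pyparsing import Literal, ParseException

select_literal = Literal("SELECT")

# By default parseString only has to match at the start of the input,
# so the literal is found as a prefix of the longer word "SELECTED".
prefix_tokens = list(select_literal.parseString("SELECTED"))
print(prefix_tokens)

# parseAll=True additionally requires the entire input to be consumed,
# so the trailing "ED" causes a ParseException.
try:
    select_literal.parseString("SELECTED", parseAll=True)
    consumed_all = True
except ParseException:
    consumed_all = False
print(consumed_all)
```

This prefix behaviour is harmless for operators and punctuation, but it is the reason a word-like token usually needs a boundary check.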
Word: Matching Character Sets
Word is designed to match sequences of characters from a defined set. It's commonly used for identifiers, numbers, or any token composed of specific character types. Unlike Literal, Word doesn't match a fixed string but rather a pattern of characters. It stops matching when it encounters a character not in its defined set.
from pyparsing import Word, alphas, nums
# Matches a sequence of alphabetic characters
identifier = Word(alphas)
# Matches a sequence of numeric characters
number = Word(nums)
# Matches a sequence of alphanumeric characters
alphanumeric_token = Word(alphas + nums)
print(identifier.parseString("myVariable"))
# Output: ['myVariable']
print(number.parseString("12345"))
# Output: ['12345']
print(alphanumeric_token.parseString("item_123"))
# Output: ['item'] - Note: `_` is not in alphas+nums by default
# To include underscore:
alphanumeric_with_underscore = Word(alphas + nums + '_')
print(alphanumeric_with_underscore.parseString("item_123"))
# Output: ['item_123']
Examples of Word for matching character sequences.
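Word also accepts a second character set for the characters that may follow the first one, which is handy for identifiers that must not start with a digit. A small sketch, assuming a typical programming-language identifier rule (the sample inputs are made up for illustration):

```python
from pyparsing import ParseException, Word, alphanums, alphas

# First argument: characters allowed at the start of the word.
# Second argument: characters allowed in the rest of the word.
identifier = Word(alphas + "_", alphanums + "_")

# Matching stops at the first character outside the body set (the space).
ident_tokens = list(identifier.parseString("item_123 = 5"))
print(ident_tokens)

# A leading digit is not in the initial set, so nothing matches at position 0.
try:
    identifier.parseString("123item")
    digit_start_ok = True
except ParseException:
    digit_start_ok = False
print(digit_start_ok)
```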
Keyword: Literal with Word Boundary Checks
Keyword works like Literal but adds an important feature: it ensures the matched string is a standalone word, not just part of a larger word. It does this by checking that the characters immediately before and after the match are not identifier characters (by default, alphanumerics plus "_" and "$"). This is crucial for distinguishing a keyword like "AND" from identifiers that begin with it, such as "ANDROID", or that merely contain it, such as "RANDOM".
from pyparsing import Keyword, Literal, ParseException, Word, alphas

# Define a keyword, the equivalent literal, and a generic identifier
and_keyword = Keyword("AND")
and_literal = Literal("AND")
identifier = Word(alphas)

# Test cases for Keyword
print(and_keyword.parseString("AND"))
# Output: ['AND']

# "RANDOM" does not start with "AND", so both Keyword and Literal reject it
try:
    print(and_keyword.parseString("RANDOM"))
except ParseException as e:
    print(f"Error: {e}")
# Output: an "Expected ..." error at char 0 (wording varies by pyparsing version)

# Compare with Literal: it matches "AND" as a prefix of a longer word
print(and_literal.parseString("ANDROID"))
# Output: ['AND']

# Keyword rejects the same input because "AND" is followed by a letter
try:
    print(and_keyword.parseString("ANDROID"))
except ParseException as e:
    print(f"Error: {e}")

# The real power of Keyword is when used within a larger grammar
grammar = and_keyword | identifier
print(grammar.parseString("AND"))
# Output: ['AND']
print(grammar.parseString("OR"))
# Output: ['OR']
print(grammar.parseString("ANDROID"))
# Output: ['ANDROID'] (matched by identifier, not by the keyword)
print(grammar.parseString("RANDOM"))
# Output: ['RANDOM']
Demonstrating Keyword for robust keyword matching with boundary checks.
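The boundary check matters most when a keyword is followed by another word-like token in a sequence. As a rough illustration (the keyword "IN" and the inputs are hypothetical), compare how a Literal-based and a Keyword-based rule treat the identifier "INDEX":

```python
from pyparsing import Keyword, Literal, ParseException, Word, alphas

literal_rule = Literal("IN") + Word(alphas)
keyword_rule = Keyword("IN") + Word(alphas)

# Literal happily splits the single word "INDEX" into "IN" + "DEX".
literal_tokens = list(literal_rule.parseString("INDEX"))
print(literal_tokens)

# Keyword refuses: "IN" is immediately followed by the letter "D".
try:
    keyword_rule.parseString("INDEX")
    keyword_rejected = False
except ParseException:
    keyword_rejected = True
print(keyword_rejected)

# With a real word boundary, the keyword rule matches as intended.
keyword_tokens = list(keyword_rule.parseString("IN stock"))
print(keyword_tokens)
```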
Combine: Merging Parsed Tokens
Combine is not a matching primitive itself, but a modifier that takes a parser expression and concatenates all the matched tokens into a single string. This is particularly useful when you have a sequence of Word or Literal expressions that logically form a single token, but pyparsing would otherwise return them as separate elements in a list. Combine ensures they are returned as one contiguous string.
from pyparsing import Word, nums, Literal, Combine
# Without Combine: date parts are separate
date_parts = Word(nums, exact=2) + Literal("/") + Word(nums, exact=2) + Literal("/") + Word(nums, exact=4)
print(date_parts.parseString("12/25/2023"))
# Output: ['12', '/', '25', '/', '2023']
# With Combine: date is a single string
combined_date = Combine(Word(nums, exact=2) + Literal("/") + Word(nums, exact=2) + Literal("/") + Word(nums, exact=4))
print(combined_date.parseString("12/25/2023"))
# Output: ['12/25/2023']
# Another example: IP address
ip_segment = Word(nums, min=1, max=3)
ip_address_uncombined = ip_segment + Literal(".") + ip_segment + Literal(".") + ip_segment + Literal(".") + ip_segment
print(ip_address_uncombined.parseString("192.168.1.1"))
# Output: ['192', '.', '168', '.', '1', '.', '1']
ip_address_combined = Combine(ip_segment + Literal(".") + ip_segment + Literal(".") + ip_segment + Literal(".") + ip_segment)
print(ip_address_combined.parseString("192.168.1.1"))
# Output: ['192.168.1.1']
Using Combine to merge multiple parsed tokens into a single string.
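One detail worth knowing is that Combine, by default, also requires the combined pieces to be adjacent in the input, with no whitespace between them. The sketch below illustrates this with a simple decimal number; the helper name decimal_parts is made up for this example, and adjacent=False is the Combine argument that relaxes the requirement.

```python
from pyparsing import Combine, Literal, ParseException, Word, nums


def decimal_parts():
    """Build a fresh 'digits . digits' expression (illustrative helper)."""
    return Word(nums) + Literal(".") + Word(nums)


# A plain sequence skips whitespace between its parts.
loose_tokens = list(decimal_parts().parseString("3 . 14"))
print(loose_tokens)

# Combine (adjacent=True by default) rejects the embedded spaces.
try:
    Combine(decimal_parts()).parseString("3 . 14")
    strict_matched = True
except ParseException:
    strict_matched = False
print(strict_matched)

# adjacent=False allows the whitespace and still joins the tokens.
joined_tokens = list(Combine(decimal_parts(), adjacent=False).parseString("3 . 14"))
print(joined_tokens)
```

The default is usually what you want for tokens such as dates or IP addresses, where internal whitespace would indicate malformed input.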
flowchart TD
    A[Start Parsing] --> B{Token Type?}
    B -->|Exact String| C[Literal("TOKEN")]
    C --> D{Word Boundary Needed?}
    D -->|Yes| E[Keyword("TOKEN")]
    D -->|No| F[Literal("TOKEN")]
    B -->|Character Set| G[Word(chars)]
    G --> H{Multiple Parts to One String?}
    H -->|Yes| I[Combine(expression)]
    H -->|No| J[Expression (e.g., Word)]
    I --> K[Result: Single String]
    J --> L[Result: List of Strings]
    E --> M[Result: Single String (with boundary check)]
    F --> N[Result: Single String (no boundary check)]
Decision flow for choosing between Literal, Word, Keyword, and Combine.
Prefer Keyword over Literal when defining actual keywords in a language (like if, else, SELECT, FROM) to prevent accidental partial matches within identifiers. Use Literal for fixed symbols or operators that are not expected to be part of larger words (e.g., "(", ")", "=", "+").
Summary and Best Practices
Choosing the right pyparsing primitive depends on the specific parsing requirement:
Literal: Use for exact, fixed strings that do not require word boundary checks, such as operators ("+", "-"), punctuation (",", ";"), or very specific, non-keyword tokens.
Word: Use for tokens composed of a sequence of characters from a defined set, like identifiers, numbers, or custom alphanumeric strings. It's flexible for matching patterns, not fixed text.
Keyword: Use for language keywords (e.g., SELECT, FROM, IF, ELSE). It behaves like a Literal that enforces word boundaries, ensuring the keyword is matched as a whole word and not as a substring of another token.
Combine: Use as a wrapper around other expressions when you want to concatenate their matched parts into a single string result, rather than a list of individual tokens. This is useful for composite tokens like dates, IP addresses, or file paths.
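To see the four pieces working together, here is a minimal sketch of a grammar for a made-up, SQL-like query. The query syntax, the variable names, and the sample input are all assumptions for illustration, not a complete SQL parser.

```python
from pyparsing import Combine, Keyword, Literal, Word, alphanums, alphas, nums

# Keyword: language keywords that must stand alone as whole words.
SELECT, FROM, WHERE = Keyword("SELECT"), Keyword("FROM"), Keyword("WHERE")

# Word: identifiers built from a character set.
identifier = Word(alphas + "_", alphanums + "_")

# Literal: fixed operator symbols (longer operators listed first).
comparison = Literal(">=") | Literal("<=") | Literal("=") | Literal(">") | Literal("<")

# Combine: digits, a dot, and more digits returned as one token.
decimal = Combine(Word(nums) + Literal(".") + Word(nums))

query = SELECT + identifier + FROM + identifier + WHERE + identifier + comparison + decimal

query_tokens = list(query.parseString("SELECT price FROM items WHERE price >= 10.50"))
print(query_tokens)
```

Each token in the result comes from the construct suited to it: keywords from Keyword, names from Word, the operator from Literal, and the number from Combine.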
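pyparsing expressions can be combined using operators like + (And: match in sequence), | (MatchFirst: the first alternative that matches), ^ (Or: the longest matching alternative), & (Each: all expressions, in any order), and * (repetition by a count, as in expr * 3) to build complex grammars from these basic building blocks. For zero-or-more or one-or-more repetition, use ZeroOrMore or OneOrMore.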
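The difference between these operators is easiest to see on small inputs. The following sketch uses illustrative expressions (the keywords "A" and "B" are placeholders) to contrast first-match with longest-match alternation and to show any-order matching and counted repetition.

```python
from pyparsing import Keyword, Literal, Word, nums

eq, eqeq = Literal("="), Literal("==")

# | (MatchFirst) stops at the first alternative that matches.
first_match = list((eq | eqeq).parseString("=="))
print(first_match)

# ^ (Or) tries every alternative and keeps the longest match.
longest_match = list((eq ^ eqeq).parseString("=="))
print(longest_match)

# & (Each) requires all expressions, in any order.
any_order = list((Keyword("A") & Keyword("B")).parseString("B A"))
print(any_order)

# * repeats an expression a fixed number of times.
three_numbers = list((Word(nums) * 3).parseString("1 2 3"))
print(three_numbers)
```

With |, listing the longer alternative first (eqeq | eq) gives the same result as ^ here while avoiding the extra work of trying every alternative.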