pyparsing - defining keywords - compare Literal, Word, Keyword and Combine


Mastering Keywords in pyparsing: Literal, Word, Keyword, and Combine


Explore the nuances of defining keywords and specific text patterns in pyparsing, comparing Literal, Word, Keyword, and Combine for robust parsing.

When building parsers with pyparsing, accurately identifying and handling keywords and specific text patterns is crucial. pyparsing offers several powerful primitives for this purpose: Literal, Word, Keyword, and Combine. While they might seem similar at first glance, each serves a distinct role and is optimized for different scenarios. Understanding their differences is key to writing efficient, correct, and maintainable parsers.

Literal: Exact String Matching

Literal is the most straightforward way to match an exact string. It is case-sensitive (CaselessLiteral is the case-insensitive variant) and matches only the precise sequence of characters provided. It does not look at what follows the match, so it is ideal for fixed tokens like operators and punctuation, and less suited to words that may also appear at the start of longer words.

from pyparsing import Literal, ParseException

# Matches the exact string "SELECT"
select_keyword = Literal("SELECT")

# Test cases
print(select_keyword.parseString("SELECT"))
# Output: ['SELECT']

# Wrong case: Literal is case-sensitive
try:
    print(select_keyword.parseString("select"))
except ParseException as e:
    print(f"Error: {e}")
# Output (exact wording varies by pyparsing version):
# Error: Expected "SELECT" (at char 0), (line:1, col:1)

# Embedded space: the characters must be contiguous
try:
    print(select_keyword.parseString("SELEC T"))
except ParseException as e:
    print(f"Error: {e}")
# Output (exact wording varies by pyparsing version):
# Error: Expected "SELECT" (at char 0), (line:1, col:1)

Using Literal for exact string matching.
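One consequence of exact matching is worth seeing directly: Literal does not care what comes after the matched text. The following is a minimal sketch (the variable names are illustrative) showing a prefix match and the case-insensitive CaselessLiteral variant.

```python
from pyparsing import CaselessLiteral, Literal

select_literal = Literal("SELECT")

# parseString only needs a match at the start of the input, so the
# Literal matches the first six characters of "SELECTED".
prefix_result = select_literal.parseString("SELECTED").asList()
print(prefix_result)  # ['SELECT']

# CaselessLiteral ignores case and returns the text as written in the
# definition, not as it appeared in the input.
caseless_select = CaselessLiteral("SELECT")
caseless_result = caseless_select.parseString("select").asList()
print(caseless_result)  # ['SELECT']
```

The prefix match on "SELECTED" is the behavior that Keyword, covered later, is designed to prevent.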

Word: Matching Character Sets

Word is designed to match sequences of characters from a defined set. It's commonly used for identifiers, numbers, or any token composed of specific character types. Unlike Literal, Word doesn't match a fixed string but rather a pattern of characters. It stops matching when it encounters a character not in its defined set.

from pyparsing import Word, alphas, nums

# Matches a sequence of alphabetic characters
identifier = Word(alphas)

# Matches a sequence of numeric characters
number = Word(nums)

# Matches a sequence of alphanumeric characters
alphanumeric_token = Word(alphas + nums)

print(identifier.parseString("myVariable"))
# Output: ['myVariable']

print(number.parseString("12345"))
# Output: ['12345']

print(alphanumeric_token.parseString("item_123"))
# Output: ['item'] - matching stops at `_`, which is not in alphas + nums

# To include underscore:
alphanumeric_with_underscore = Word(alphas + nums + '_')
print(alphanumeric_with_underscore.parseString("item_123"))
# Output: ['item_123']

Examples of Word for matching character sequences.
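Word also accepts a second character set, which is useful for identifiers that may not start with a digit. As a rough sketch (the identifier rules here are an assumption chosen for illustration, not a rule from any particular language), the first argument lists the allowed initial characters and the second lists the allowed body characters.

```python
from pyparsing import ParseException, Word, alphanums, alphas

# First argument: characters allowed in the first position.
# Second argument: characters allowed in the rest of the token.
identifier = Word(alphas + "_", alphanums + "_")

print(identifier.parseString("item_123").asList())  # ['item_123']
print(identifier.parseString("_private").asList())  # ['_private']

# A leading digit is rejected because digits are not in the initial set.
try:
    identifier.parseString("9lives")
except ParseException as err:
    print(f"Error: {err}")
```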

Keyword: Literal with Word Boundary Checks

Keyword is a specialized form of Literal that adds an important check: the matched string must be a standalone word, not part of a larger one. It verifies that the characters immediately before and after the match are not identifier characters (by default letters, digits, _ and $). This is crucial for distinguishing the keyword "AND" from identifiers that merely contain it, such as "ANDROID" or "RANDOM".

from pyparsing import Keyword, Literal, ParseException, Word, alphas

# Define a keyword
and_keyword = Keyword("AND")

# Define a generic identifier
identifier = Word(alphas)

# Test cases for Keyword
print(and_keyword.parseString("AND"))
# Output: ['AND']

# "ANDROID" starts with "AND", but the next character is a letter,
# so the word boundary check rejects the match
try:
    print(and_keyword.parseString("ANDROID"))
except ParseException as e:
    print(f"Error: {e}")
# Output: an error reporting that 'AND' was expected at char 0
# (exact wording varies by pyparsing version)

# Compare with Literal for the same input
and_literal = Literal("AND")
print(and_literal.parseString("ANDROID"))
# Output: ['AND'] - Literal matches the prefix and ignores what follows

# Example of Keyword preventing partial matches in a grammar
grammar = and_keyword | identifier
print(grammar.parseString("AND"))
# Output: ['AND']
print(grammar.parseString("OR"))
# Output: ['OR']
print(grammar.parseString("ANDROID"))
# Output: ['ANDROID'] - the keyword is rejected, so the identifier matches
# (with Literal("AND") in its place, the result would be ['AND'])

Demonstrating Keyword for robust keyword matching with boundary checks.
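The boundary check applies on both sides of the match, which becomes visible when scanning a longer string. The sketch below uses searchString on a made-up sample text to count how often each expression matches; CaselessKeyword is included as the case-insensitive counterpart.

```python
from pyparsing import CaselessKeyword, Keyword, Literal

text = "RANDOM AND ANDROID"

# Literal finds "AND" three times: inside RANDOM, on its own,
# and at the start of ANDROID.
literal_hits = Literal("AND").searchString(text).asList()
print(literal_hits)  # [['AND'], ['AND'], ['AND']]

# Keyword only accepts the standalone word in the middle.
keyword_hits = Keyword("AND").searchString(text).asList()
print(keyword_hits)  # [['AND']]

# CaselessKeyword applies the same boundary checks but ignores case.
caseless_hits = CaselessKeyword("AND").searchString("random and android").asList()
print(len(caseless_hits))  # 1
```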

Combine: Merging Parsed Tokens

Combine is not a matching primitive itself, but a wrapper that takes a parser expression and concatenates all the matched tokens into a single string. This is particularly useful when you have a sequence of Word or Literal expressions that logically form a single token, but pyparsing would otherwise return them as separate elements in a list. By default, Combine also requires the matched tokens to be adjacent in the input, with no whitespace between them, so the result is one contiguous string.

from pyparsing import Word, nums, Literal, Combine

# Without Combine: date parts are separate
date_parts = Word(nums, exact=2) + Literal("/") + Word(nums, exact=2) + Literal("/") + Word(nums, exact=4)
print(date_parts.parseString("12/25/2023"))
# Output: ['12', '/', '25', '/', '2023']

# With Combine: date is a single string
combined_date = Combine(Word(nums, exact=2) + Literal("/") + Word(nums, exact=2) + Literal("/") + Word(nums, exact=4))
print(combined_date.parseString("12/25/2023"))
# Output: ['12/25/2023']

# Another example: IP address
ip_segment = Word(nums, min=1, max=3)
ip_address_uncombined = ip_segment + Literal(".") + ip_segment + Literal(".") + ip_segment + Literal(".") + ip_segment
print(ip_address_uncombined.parseString("192.168.1.1"))
# Output: ['192', '.', '168', '.', '1', '.', '1']

ip_address_combined = Combine(ip_segment + Literal(".") + ip_segment + Literal(".") + ip_segment + Literal(".") + ip_segment)
print(ip_address_combined.parseString("192.168.1.1"))
# Output: ['192.168.1.1']

Using Combine to merge multiple parsed tokens into a single string.
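Combine changes more than the shape of the result: with its default adjacent=True it also stops skipping whitespace between the pieces. The following sketch illustrates this with a date containing spaces; make_date is a hypothetical helper introduced only so that each test gets a fresh expression.

```python
from pyparsing import Combine, Literal, ParseException, Word, nums


def make_date():
    """Build a fresh MM/DD/YYYY expression (hypothetical helper for this sketch)."""
    return (Word(nums, exact=2) + Literal("/") + Word(nums, exact=2)
            + Literal("/") + Word(nums, exact=4))


spaced = "12 / 25 / 2023"

# Without Combine, whitespace between tokens is skipped as usual.
loose = make_date().parseString(spaced).asList()
print(loose)  # ['12', '/', '25', '/', '2023']

# With Combine, the default adjacent=True rejects the embedded spaces.
try:
    Combine(make_date()).parseString(spaced)
    strict_failed = False
except ParseException:
    strict_failed = True
print(strict_failed)  # True

# adjacent=False allows the whitespace and still joins the pieces.
joined = Combine(make_date(), adjacent=False).parseString(spaced).asList()
print(joined)  # ['12/25/2023']
```

For tokens such as dates or IP addresses, the strict default is usually what you want, since "12 / 25 / 2023" is rarely meant to be read as a single token.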

flowchart TD
    A[Start Parsing] --> B{Token Type?}
    B -->|Exact String| C["Literal('TOKEN')"]
    C --> D{Word Boundary Needed?}
    D -->|Yes| E["Keyword('TOKEN')"]
    D -->|No| F["Literal('TOKEN')"]
    B -->|Character Set| G["Word(chars)"]
    G --> H{Multiple Parts to One String?}
    H -->|Yes| I["Combine(expression)"]
    H -->|No| J["Expression (e.g., Word)"]
    I --> K[Result: Single String]
    J --> L[Result: List of Strings]
    E --> M["Result: Single String (with boundary check)"]
    F --> N["Result: Single String (no boundary check)"]

Decision flow for choosing between Literal, Word, Keyword, and Combine.

💡

Always prefer Keyword over Literal when defining actual keywords in a language (like if, else, SELECT, FROM) to prevent accidental partial matches within identifiers. Use Literal for fixed symbols or operators that are not expected to be part of larger words (e.g., (, ), =, +).
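The same idea works in the other direction: once keywords are defined with Keyword, a negative lookahead can keep them from being accepted as identifiers. The sketch below assumes a tiny, made-up SQL-like grammar; the keyword list and the sample queries are illustrative only.

```python
from pyparsing import Keyword, MatchFirst, ParseException, Word, alphanums, alphas

# Reserved words for this toy grammar (an assumption for the example).
reserved = MatchFirst([Keyword(k) for k in ("SELECT", "FROM", "WHERE")])

# ~reserved is a negative lookahead: an identifier may not be a reserved word.
identifier = ~reserved + Word(alphas, alphanums + "_")

query = Keyword("SELECT") + identifier + Keyword("FROM") + identifier

# "FROMAGE" starts with "FROM" but is a valid identifier, because
# Keyword("FROM") does not match inside a longer word.
ok = query.parseString("SELECT FROMAGE FROM cheeses").asList()
print(ok)  # ['SELECT', 'FROMAGE', 'FROM', 'cheeses']

# A bare reserved word in identifier position is rejected.
try:
    query.parseString("SELECT FROM FROM cheeses")
    rejected = False
except ParseException:
    rejected = True
print(rejected)  # True
```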

Summary and Best Practices

Choosing the right pyparsing primitive depends on the specific parsing requirement:

Literal: Use for exact, fixed strings that do not require word boundary checks, such as operators (+, -), punctuation (,, ;), or very specific, non-keyword tokens.
Word: Use for tokens composed of a sequence of characters from a defined set, like identifiers, numbers, or custom alphanumeric strings. It's flexible for matching patterns, not fixed text.
Keyword: Use for language keywords (e.g., SELECT, FROM, IF, ELSE). It's a Literal that enforces word boundaries, ensuring the keyword is matched as a whole word and not as a substring of another token.
Combine: Use as a wrapper around other expressions when you want to concatenate their matched parts into a single string result, rather than a list of individual tokens. This is useful for composite tokens like dates, IP addresses, or file paths.

ℹ️

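Remember that pyparsing expressions can be combined using operators to build complex grammars from these basic building blocks: + (And, a sequence), | (MatchFirst, the first alternative that matches), ^ (Or, the longest matching alternative), & (Each, all expressions in any order), and * (repetition, e.g. expr * 3 matches exactly three times). Zero or more repetitions are written as ZeroOrMore(expr) or expr[...].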
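As a quick illustration of how these operators differ, here is a small sketch using throwaway expressions (the names and inputs are made up for the example). Note in particular the difference between | and ^ when one alternative is a prefix of another.

```python
from pyparsing import Literal, Word, ZeroOrMore, alphas, nums

integer = Word(nums)
name = Word(alphas)

# + builds a sequence: every part must match, in order.
assignment = (name + Literal("=") + integer).parseString("x = 42").asList()
print(assignment)  # ['x', '=', '42']

# | takes the first alternative that matches, even if a later one is longer.
first_match = (Literal("<") | Literal("<=")).parseString("<=").asList()
print(first_match)  # ['<']

# ^ tries every alternative and keeps the longest match.
longest_match = (Literal("<") ^ Literal("<=")).parseString("<=").asList()
print(longest_match)  # ['<=']

# & requires all parts, but accepts them in any order.
unordered = (name & integer).parseString("42 abc").asList()
print(unordered)  # ['42', 'abc']

# * n repeats an expression exactly n times.
triple = (integer * 3).parseString("1 2 3").asList()
print(triple)  # ['1', '2', '3']

# ZeroOrMore matches any number of repetitions, including none.
none_found = ZeroOrMore(integer).parseString("").asList()
print(none_found)  # []
```

When alternatives overlap, either list the longer one first with | or use ^ and accept the cost of trying every alternative.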