Use PLY to match a normal string

Mastering String Matching with PLY in Python 2.7

Learn how to effectively use PLY (Python Lex-Yacc) to define and match normal strings within a custom lexer for Python 2.7 applications.

PLY (Python Lex-Yacc) is a pure-Python implementation of the classic lex and yacc parsing tools. It's particularly useful for building compilers, interpreters, and domain-specific languages. This article focuses on a fundamental aspect of lexing: defining and matching 'normal' strings, that is, sequences of characters with no special meaning to the lexer, usually enclosed in delimiters such as quotes. We'll explore how to configure PLY to identify these strings correctly, specifically in a Python 2.7 environment.

Understanding Lexical Analysis with PLY

Lexical analysis, or lexing, is the first phase of a compiler or interpreter. It takes the source code as input and produces a sequence of tokens. Each token represents a meaningful unit in the language, such as a keyword, identifier, operator, or literal (like a number or string). PLY's lex module allows you to define these tokens using regular expressions. The challenge with strings usually lies in handling their delimiters and their content, which can include almost any character.

flowchart TD
    A[Source Code] --> B{"Lexer (PLY)"};
    B --> C{Token Stream};
    C --> D{"Parser (PLY)"};
    D --> E[Abstract Syntax Tree];
    E --> F[Further Processing];
    B -- Tokenizes --> G["String Token ('Hello World')"];
    B -- Tokenizes --> H["Identifier Token (variable_name)"];
    B -- Tokenizes --> I["Keyword Token (if)"];

Overview of the Lexical Analysis Process with PLY
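
To make the token stream concrete before we tackle strings, here is a minimal sketch of a lexer with just two illustrative token types, NUMBER and PLUS. Each call to token() returns the next LexToken object, or None once the input is exhausted.

import ply.lex as lex

# A deliberately tiny lexer: two token types and ignored whitespace
tokens = ('NUMBER', 'PLUS')

t_PLUS   = r'\+'
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("1 + 22 + 333")

# Prints: NUMBER 1, PLUS +, NUMBER 22, PLUS +, NUMBER 333
while True:
    tok = lexer.token()
    if not tok: break
    print tok.type, tok.value

Minimal PLY Lexer Illustrating the Token Stream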

Defining a String Token in PLY

To match a normal string in PLY, you need to define a regular expression that captures the string's delimiters and its content. A common pattern for strings is to enclose them in single or double quotes. The content inside the quotes can be any character, often excluding the quote itself unless it's escaped. For Python 2.7, it's important to remember that string literals are typically byte strings by default, though Unicode literals can be specified.

import ply.lex as lex

# List of token names (every t_NAME rule below must have an entry here,
# otherwise lex.lex() aborts with an 'unknown token' error)
tokens = (
    'STRING',
    'STRING_ADVANCED',
    'NUMBER',
    'ID',
    # Other tokens...
)

# Regular expression for a simple string (double quotes)
def t_STRING(t):
    r'"([^"\\]|\\.)*"'
    t.value = t.value[1:-1] # Remove quotes
    return t

# A more robust string definition (single or double quotes, handles escapes).
# In a real lexer you would keep only one STRING rule; both are shown here for
# comparison. Function rules are tried in definition order, so this rule only
# sees the single-quoted strings that t_STRING above does not match.
def t_STRING_ADVANCED(t):
    r"""("([^"\\]|\\.)*")|('([^'\\]|\\.)*')"""
    # Remove quotes from the value
    t.value = t.value[1:-1]
    return t

# Ignore spaces and tabs between tokens
t_ignore = ' \t'

# Error handling rule
def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

# Test it out (the doubled backslashes keep a literal \" in the data,
# so the lexer actually sees escaped quotes)
data = '"Hello World" "Another string with \\"escaped\\" quotes"'
lexer.input(data)

while True:
    tok = lexer.token()
    if not tok: break
    print tok

Basic PLY Lexer for Matching Quoted Strings
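
As noted earlier, string literals in Python 2.7 are byte strings by default. If your application needs unicode values, one option is to decode the matched value inside the rule. This is a minimal sketch assuming UTF-8 encoded input; the codec choice is illustrative and should match your source encoding.

import ply.lex as lex

tokens = ('STRING',)

def t_STRING(t):
    r'"([^"\\]|\\.)*"'
    # Strip the quotes, then decode the byte string to unicode.
    # Assumes the input is UTF-8 encoded.
    t.value = t.value[1:-1].decode('utf-8')
    return t

t_ignore = ' \t'

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('"caf\xc3\xa9"')  # the UTF-8 bytes for "café"

print repr(lexer.token().value)  # u'caf\xe9'

Decoding Matched Strings to Unicode in Python 2.7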

Handling Escaped Characters and Delimiters

A robust string lexer must correctly handle escaped characters within the string. For example, a double quote inside a double-quoted string must be escaped (\"). The regular expression ([^"\\]|\\.)* handles this by matching either any character that is not a quote or a backslash, OR a backslash followed by any character (which covers common escapes like \n, \t, \", \', etc.). If your language supports both single and double-quoted strings, you'll need a more complex regex or separate rules, as shown in the t_STRING_ADVANCED example.

import ply.lex as lex

tokens = (
    'STRING',
)

def t_STRING(t):
    # Matches strings enclosed in single or double quotes
    # Handles escaped quotes and backslashes within the string
    r"""("([^"\\]|\\.)*")|('([^'\\]|\\.)*')"""
    
    # Remove the enclosing quotes
    t.value = t.value[1:-1]
    
    # Process escape sequences (\", \', \n, \t, \\, ...) in one pass.
    # Chained .replace() calls mishandle overlapping sequences such as \\n
    # (an escaped backslash followed by 'n'), so use Python 2's
    # 'string_escape' codec, which decodes everything left to right.
    t.value = t.value.decode('string_escape')
    
    return t

# Ignore spaces and tabs between tokens
t_ignore = ' \t'

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()

data = "\"This is a string with \\"escaped quotes\\" and a newline\\n.\" 'Another string with \\'single quotes\\'' and a tab\\t.'"
lexer.input(data)

while True:
    tok = lexer.token()
    if not tok: break
    print tok.type, repr(tok.value)

PLY Lexer with Advanced String Handling (Escapes)
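
The rule above decodes escapes with Python 2's string_escape codec rather than a chain of .replace() calls because chained replacements mishandle overlapping sequences. In the literal text \\n (an escaped backslash followed by the letter n), a naive chain consumes the second backslash as part of a newline escape, while one-pass decoding gets it right:

# Python 2.7: compare chained replaces against one-pass decoding
raw = r'a\\n b\n c\t'  # the literal text a\\n b\n c\t, as a lexer would see it

# Naive chained replaces: the tail of \\n is mistaken for a newline escape
naive = raw.replace('\\n', '\n').replace('\\\\', '\\')
print repr(naive)  # 'a\\\n b\n c\\t' -- the \\n sequence is corrupted

# The 'string_escape' codec interprets \\, \n, and \t in a single pass
print repr(raw.decode('string_escape'))  # 'a\\n b\n c\t' -- correct

Why One-Pass Escape Decoding Beats Chained Replacements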

Putting It All Together: A Complete Example

Here's a more complete example demonstrating how to integrate string matching into a basic PLY lexer. This example includes other common tokens to show how the string rule coexists with other lexical rules. Remember how PLY orders its rules: token rules defined as functions are tried in the order they appear in the file, while rules defined as simple strings (such as t_PLUS = r'\+') are sorted by decreasing regular-expression length, so longer patterns are tried first.

import ply.lex as lex

# List of token names
tokens = (
    'STRING',
    'NUMBER',
    'ID',
    'PLUS',
    'MINUS',
    'TIMES',
    'DIVIDE',
    'LPAREN',
    'RPAREN',
    'EQUALS',
)

# Regular expressions for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'
t_EQUALS  = r'='

# Regular expression for numbers
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

# Regular expression for identifiers
def t_ID(t):
    r'[a-zA-Z_][a-zA-Z_0-9]*'
    return t

# Regular expression for strings (single or double quotes, with escapes)
def t_STRING(t):
    r"""("([^"\\]|\\.)*")|('([^'\\]|\\.)*')"""
    t.value = t.value[1:-1] # Remove quotes
    # One-pass escape processing (see the 'string_escape' note above)
    t.value = t.value.decode('string_escape')
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print "Illegal character '%s' at line %d" % (t.value[0], t.lexer.lineno)
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

# Test it out
data = "x = 10 + \"Hello World\\n\" / 2 - 'test string'"
lexer.input(data)

print "\n--- Lexer Output ---"
while True:
    tok = lexer.token()
    if not tok: break
    print tok.type, repr(tok.value), "(Line: %d)" % tok.lineno

Complete PLY Lexer with String, Number, and Identifier Tokens
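
Finally, to see the rule ordering described earlier in action, here is a small sketch using the illustrative tokens EQ and ASSIGN. Because simple string rules are sorted by decreasing regular-expression length, the two-character == pattern is always tried before the one-character =:

import ply.lex as lex

tokens = ('EQ', 'ASSIGN', 'NUMBER')

# String rules: PLY sorts these by decreasing regex length,
# so the longer '==' pattern wins over '='
t_EQ     = r'=='
t_ASSIGN = r'='
t_ignore = ' \t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_error(t):
    print "Illegal character '%s'" % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("1 == 2 = 3")

# Prints: NUMBER 1, EQ ==, NUMBER 2, ASSIGN =, NUMBER 3
while True:
    tok = lexer.token()
    if not tok: break
    print tok.type, tok.value

String Rule Ordering: Longer Patterns Match First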