How can I write a quick and dirty interpreter?


Building a Quick and Dirty Interpreter: A Practical Guide


Learn the fundamental concepts and practical steps to construct a basic interpreter for a simple programming language, from lexical analysis to execution.

Interpreters are fascinating pieces of software that execute code directly, line by line or statement by statement, without prior compilation into machine code. While building a production-ready interpreter is a complex task, creating a 'quick and dirty' one for a simple language can be an excellent way to understand core programming language concepts like lexical analysis, parsing, and execution. This article will guide you through the essential components and a practical approach to building such an interpreter.

The Interpreter's Core Components

Every interpreter, regardless of its complexity, typically follows a pipeline of distinct stages. Understanding these stages is crucial before diving into implementation. We'll focus on three primary phases: Lexical Analysis, Parsing, and Execution.

flowchart TD
    A[Source Code] --> B{Lexical Analysis}
    B --> C{Tokens}
    C --> D{Parsing}
    D --> E{"Abstract Syntax Tree (AST)"}
    E --> F{Execution}
    F --> G[Result/Output]

The typical pipeline of an interpreter

Phase 1: Lexical Analysis (Tokenization)

The first step is to break the raw source code down into a stream of meaningful units called 'tokens'. A lexer (or tokenizer) reads the input character by character and groups the characters into these tokens. For example, in the expression 10 + 20, the lexer would identify 10 as a NUMBER token, + as a PLUS token, and 20 as another NUMBER token. Each token typically has a type (e.g., NUMBER, IDENTIFIER, OPERATOR) and a value (the actual text, or 'lexeme').

import re

# Token patterns, tried in insertion order (guaranteed for dicts in Python 3.7+).
TOKEN_TYPES = {
    'NUMBER': r'\d+',
    'PLUS': r'\+',
    'MINUS': r'-',
    'MULTIPLY': r'\*',
    'DIVIDE': r'/',
    'LPAREN': r'\(',
    'RPAREN': r'\)',
    'WHITESPACE': r'\s+'
}

class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

    def __repr__(self):
        return f"Token({self.type}, {self.value})"

class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.tokens = []

    def get_next_token(self):
        # Signal end of input with a dedicated EOF token.
        if self.pos >= len(self.text):
            return Token('EOF', None)

        # Try each pattern at the current position; the first match wins.
        for token_type, pattern in TOKEN_TYPES.items():
            match = re.match(pattern, self.text[self.pos:])
            if match:
                value = match.group(0)
                self.pos += len(value)
                if token_type != 'WHITESPACE':
                    return Token(token_type, value)
                else:
                    return self.get_next_token() # Skip whitespace and try again

        raise Exception(f"Lexer error: Unexpected character '{self.text[self.pos]}' at position {self.pos}")

    def tokenize(self):
        token = self.get_next_token()
        while token.type != 'EOF':
            self.tokens.append(token)
            token = self.get_next_token()
        self.tokens.append(token) # Add EOF token
        return self.tokens

# Example Usage:
# lexer = Lexer("10 + (20 * 5) - 3")
# tokens = lexer.tokenize()
# print(tokens)

A simple Python lexer for basic arithmetic expressions.
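
Running the commented-out example on 10 + (20 * 5) - 3 would print a token stream along these lines (the formatting comes from the Token.__repr__ method above):

[Token(NUMBER, 10), Token(PLUS, +), Token(LPAREN, (), Token(NUMBER, 20), Token(MULTIPLY, *), Token(NUMBER, 5), Token(RPAREN, )), Token(MINUS, -), Token(NUMBER, 3), Token(EOF, None)]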

Phase 2: Parsing (Syntax Analysis)

Once you have a stream of tokens, the parser's job is to check if the sequence of tokens conforms to the language's grammar rules and, if so, to build an Abstract Syntax Tree (AST). The AST is a tree representation of the source code's structure, abstracting away the concrete syntax. For example, 10 + 20 might become an Addition node with 10 and 20 as its children. This tree structure is much easier for the interpreter to work with than a flat list of tokens.

class ASTNode:
    pass

class NumberNode(ASTNode):
    def __init__(self, token):
        self.token = token
        self.value = int(token.value)

class BinaryOpNode(ASTNode):
    def __init__(self, left, op, right):
        self.left = left
        self.op = op
        self.right = right

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.token_idx = 0
        self.current_token = self.tokens[self.token_idx]

    def advance(self):
        self.token_idx += 1
        if self.token_idx < len(self.tokens):
            self.current_token = self.tokens[self.token_idx]
        else:
            self.current_token = Token('EOF', None)

    def eat(self, token_type):
        if self.current_token.type == token_type:
            self.advance()
        else:
            raise Exception(f"Parser error: Expected {token_type}, got {self.current_token.type}")

    def factor(self):
        token = self.current_token
        if token.type == 'NUMBER':
            self.eat('NUMBER')
            return NumberNode(token)
        elif token.type == 'LPAREN':
            self.eat('LPAREN')
            node = self.expr()
            self.eat('RPAREN')
            return node
        else:
            raise Exception(f"Parser error: Unexpected token {token.type}")

    def term(self):
        node = self.factor()
        while self.current_token.type in ('MULTIPLY', 'DIVIDE'):
            op_token = self.current_token
            self.eat(op_token.type)
            node = BinaryOpNode(node, op_token, self.factor())
        return node

    def expr(self):
        node = self.term()
        while self.current_token.type in ('PLUS', 'MINUS'):
            op_token = self.current_token
            self.eat(op_token.type)
            node = BinaryOpNode(node, op_token, self.term())
        return node

    def parse(self):
        node = self.expr()
        # Reject trailing input, e.g. "1 + 2 3", instead of silently ignoring it.
        if self.current_token.type != 'EOF':
            raise Exception(f"Parser error: Unexpected token {self.current_token.type} after expression")
        return node

# Example Usage (requires Lexer from above):
# lexer = Lexer("10 + (20 * 5) - 3")
# tokens = lexer.tokenize()
# parser = Parser(tokens)
# ast = parser.parse()
# print(ast) # This would print the AST object structure

A recursive descent parser for arithmetic expressions, building an AST.
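
By default, print(ast) only shows object types and memory addresses. One quick-and-dirty option for readable output (an optional addition, not part of the classes above) is to attach __repr__ implementations after the fact:

# Optional: readable AST printing. Attaching __repr__ after the fact is
# hacky, but it fits a quick-and-dirty interpreter.
NumberNode.__repr__ = lambda self: str(self.value)
BinaryOpNode.__repr__ = lambda self: f"({self.left!r} {self.op.value} {self.right!r})"

# With these in place, parsing "10 + (20 * 5) - 3" and printing the AST
# yields: ((10 + (20 * 5)) - 3)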

Phase 3: Execution (Interpretation)

With the AST in hand, the final phase is to traverse this tree and perform the operations it describes. This is where the actual 'interpretation' happens. For an arithmetic expression, this means evaluating the numbers and performing the specified operations. For a language with variables, this phase would involve managing a symbol table (or environment) to store and retrieve variable values.

class Interpreter:
    def __init__(self, parser):
        self.parser = parser

    def visit(self, node):
        method_name = f'visit_{type(node).__name__}'
        visitor = getattr(self, method_name, self.generic_visit)
        return visitor(node)

    def generic_visit(self, node):
        raise Exception(f'No visit_{type(node).__name__} method')

    def visit_NumberNode(self, node):
        return node.value

    def visit_BinaryOpNode(self, node):
        left_val = self.visit(node.left)
        right_val = self.visit(node.right)

        if node.op.type == 'PLUS':
            return left_val + right_val
        elif node.op.type == 'MINUS':
            return left_val - right_val
        elif node.op.type == 'MULTIPLY':
            return left_val * right_val
        elif node.op.type == 'DIVIDE':
            # Basic error handling for division by zero
            if right_val == 0:
                raise Exception("Runtime error: Division by zero")
            return left_val / right_val
        else:
            # Fail loudly instead of silently returning None for unknown operators.
            raise Exception(f"Runtime error: Unknown operator {node.op.type}")

    def interpret(self):
        tree = self.parser.parse()
        return self.visit(tree)

# Full Example Usage:
# source_code = "10 + (20 * 5) - 3"
# lexer = Lexer(source_code)
# tokens = lexer.tokenize()
# parser = Parser(tokens)
# interpreter = Interpreter(parser)
# result = interpreter.interpret()
# print(f"Result: {result}") # Expected: 107.0

An interpreter that traverses the AST and evaluates arithmetic expressions.

1. Define Your Language's Grammar

Before writing any code, clearly define the syntax and semantics of the simple language you want to interpret. What operations will it support? What are its keywords? This will guide your token definitions and parsing rules.
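
For the arithmetic language used throughout this article, the grammar fits in three lines. This EBNF-style sketch mirrors the expr, term, and factor methods of the parser above, with operator precedence encoded by the rule nesting:

expr   : term ((PLUS | MINUS) term)*
term   : factor ((MULTIPLY | DIVIDE) factor)*
factor : NUMBER | LPAREN expr RPAREN

The grammar implemented by the recursive descent parser above.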

2. Implement the Lexer

Write a lexer that takes the raw input string and converts it into a stream of tokens. Use regular expressions for pattern matching. Ensure it handles whitespace and comments appropriately (usually by discarding them).
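
As a sketch of comment handling, assuming #-style line comments (the COMMENT pattern is an extension, not part of the TOKEN_TYPES defined earlier), the lexer above could discard them the same way it discards whitespace:

# Extension sketch: treat '#' line comments like whitespace.
TOKEN_TYPES = {**TOKEN_TYPES, 'COMMENT': r'#[^\n]*'}

# In Lexer.get_next_token, skip comments the same way whitespace is skipped:
#     if token_type not in ('WHITESPACE', 'COMMENT'):
#         return Token(token_type, value)
#     return self.get_next_token()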

3. Build the Parser and AST

Create a parser that consumes the token stream and constructs an Abstract Syntax Tree (AST). Recursive descent is a good starting point for simple grammars. Each node in your AST should represent a logical construct in your language.

4. Develop the Interpreter/Evaluator

Implement a component that traverses the AST. For each node type, define how it should be evaluated or executed. This is where the 'meaning' of your language's constructs comes to life.
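
Phase 3 mentioned that variables require a symbol table. As a sketch of what that could look like, the visitor above might grow an environment dict. VariableNode and AssignNode are hypothetical node types here; the lexer and parser would need name and assignment support to produce them:

# Sketch only: variable support on top of the Interpreter class above.
# VariableNode and AssignNode are hypothetical AST nodes for illustration.
class InterpreterWithVars(Interpreter):
    def __init__(self, parser):
        super().__init__(parser)
        self.env = {}  # symbol table: variable name -> value

    def visit_AssignNode(self, node):
        # e.g. "x = 1 + 2": evaluate the right-hand side, then bind the name.
        self.env[node.name] = self.visit(node.value)
        return self.env[node.name]

    def visit_VariableNode(self, node):
        if node.name not in self.env:
            raise Exception(f"Runtime error: Undefined variable '{node.name}'")
        return self.env[node.name]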

5. Test Iteratively

Start with the simplest possible expressions or statements and gradually add complexity. Test each component (lexer, parser, interpreter) independently before integrating them. This makes debugging much easier.
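
A minimal sketch of that workflow, reusing the classes from this article:

# Incremental checks: start tiny, add one feature per test.
def evaluate(source):
    tokens = Lexer(source).tokenize()
    return Interpreter(Parser(tokens)).interpret()

assert evaluate("1") == 1             # simplest case: a single number
assert evaluate("1 + 2") == 3         # one operator
assert evaluate("2 + 3 * 4") == 14    # precedence: term binds tighter
assert evaluate("(2 + 3) * 4") == 20  # parentheses override precedence
print("All tests passed")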