How can I write a quick and dirty interpreter?

Crafting a Quick & Dirty Interpreter: A Practical Guide

Learn the fundamental components and step-by-step process of building a simple interpreter for a custom programming language, from lexing to execution.

Interpreters are at the heart of many programming languages, allowing code to be executed directly without a compilation step. While building a production-ready interpreter is a complex task, creating a 'quick and dirty' one is an excellent way to grasp core concepts like lexing, parsing, and abstract syntax trees (ASTs). This article will guide you through the basics, focusing on a minimal language to illustrate the process.

Understanding the Interpreter Pipeline

Every interpreter, regardless of its complexity, follows a general pipeline. This pipeline transforms raw source code into executable actions. We'll break it down into three primary phases: Lexical Analysis (Lexing), Syntactic Analysis (Parsing), and Interpretation/Execution.

Source Code -> Lexer (Tokens) -> Parser (AST) -> Interpreter (Result/Execution)

The fundamental stages of an interpreter's pipeline.
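You can see this pipeline in miniature using Python's own standard library: the built-in `ast` module parses a source string into an AST, and `compile`/`eval` execute it. This is only an illustration of the stages; the rest of the article builds each one by hand.

```python
import ast

# Parse an expression into Python's own AST, then execute it.
tree = ast.parse("2 + 3", mode="eval")
print(ast.dump(tree))  # shows a BinOp node with two Constant children
print(eval(compile(tree, "<expr>", "eval")))  # 5
```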

1. Lexical Analysis (Lexing)

Lexing is the first step, where the raw input string (your source code) is converted into a stream of tokens. A token is a meaningful unit in the language, like keywords, identifiers, operators, numbers, or strings. The lexer (or scanner) reads character by character and groups them into these tokens. For example, the expression 2 + 3 would be tokenized into NUMBER(2), PLUS, NUMBER(3).

import re

# Token specifications, tried in order. If two patterns could match the
# same prefix, list the longer one first.
TOKEN_TYPES = [
    ('NUMBER', r'\d+'),
    ('PLUS', r'\+'),
    ('MINUS', r'-'),
    ('MULTIPLY', r'\*'),
    ('DIVIDE', r'/'),
    ('LPAREN', r'\('),
    ('RPAREN', r'\)'),
    ('WHITESPACE', r'\s+'),
]

class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

    def __repr__(self):
        return f"Token({self.type}, {self.value})"

def lex(code):
    """Convert a source string into a flat list of tokens."""
    # Compile each pattern once, up front, instead of on every position.
    compiled = [(t, re.compile(p)) for t, p in TOKEN_TYPES]
    tokens = []
    position = 0
    while position < len(code):
        for token_type, regex in compiled:
            m = regex.match(code, position)
            if m:
                value = m.group(0)
                if token_type != 'WHITESPACE':  # whitespace is discarded
                    tokens.append(Token(token_type, value))
                position = m.end()
                break
        else:  # no pattern matched at this position
            raise RuntimeError(
                f"Unexpected character {code[position]!r} at position {position}")
    return tokens

# Example usage:
# code = "(2 + 3) * 4"
# tokens = lex(code)
# print(tokens)

A simple Python lexer for basic arithmetic expressions.

2. Syntactic Analysis (Parsing)

After lexing, the stream of tokens is passed to the parser. The parser's job is to check if the sequence of tokens conforms to the language's grammar and, if so, to build an Abstract Syntax Tree (AST). An AST is a tree representation of the syntactic structure of the source code, abstracting away the concrete syntax. For 2 + 3, the AST might be a BinaryOp node with + as the operator, and Number(2) and Number(3) as its children.
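The parser below is a recursive descent parser: one method per grammar rule. Operator precedence falls out of the rule nesting, since factors bind tighter than terms, and terms tighter than expressions:

```
expression : term ((PLUS | MINUS) term)*
term       : factor ((MULTIPLY | DIVIDE) factor)*
factor     : NUMBER | LPAREN expression RPAREN
```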

class ASTNode:
    pass

class NumberNode(ASTNode):
    def __init__(self, value):
        self.value = int(value)

    def __repr__(self):
        return f"NumberNode({self.value})"

class BinaryOpNode(ASTNode):
    def __init__(self, left, op, right):
        self.left = left
        self.op = op
        self.right = right

    def __repr__(self):
        return f"BinaryOpNode({self.left}, {self.op.value}, {self.right})"

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.current_token_index = 0

    def current_token(self):
        if self.current_token_index < len(self.tokens):
            return self.tokens[self.current_token_index]
        return None

    def consume(self, expected_type):
        token = self.current_token()
        if token and token.type == expected_type:
            self.current_token_index += 1
            return token
        raise SyntaxError(f"Expected {expected_type}, got {token.type if token else 'EOF'}")

    def parse_number(self):
        token = self.consume('NUMBER')
        return NumberNode(token.value)

    def parse_factor(self):
        token = self.current_token()
        if token is None:
            raise SyntaxError("Unexpected end of input")
        if token.type == 'NUMBER':
            return self.parse_number()
        elif token.type == 'LPAREN':
            self.consume('LPAREN')
            expr = self.parse_expression()
            self.consume('RPAREN')
            return expr
        raise SyntaxError(f"Unexpected token: {token.type}")

    def parse_term(self):
        node = self.parse_factor()
        while self.current_token() and self.current_token().type in ('MULTIPLY', 'DIVIDE'):
            op = self.current_token()
            self.consume(op.type)
            node = BinaryOpNode(node, op, self.parse_factor())
        return node

    def parse_expression(self):
        node = self.parse_term()
        while self.current_token() and self.current_token().type in ('PLUS', 'MINUS'):
            op = self.current_token()
            self.consume(op.type)
            node = BinaryOpNode(node, op, self.parse_term())
        return node

    def parse(self):
        ast = self.parse_expression()
        # Reject input with leftover tokens, e.g. "2 + 3 4"
        if self.current_token() is not None:
            raise SyntaxError(f"Unexpected trailing token: {self.current_token().type}")
        return ast

# Example usage (assuming 'lex' from lexer.py is available):
# from lexer import lex, Token
# code = "(2 + 3) * 4"
# tokens = lex(code)
# parser = Parser(tokens)
# ast = parser.parse()
# print(ast)

A recursive descent parser for arithmetic expressions, building an AST.

3. Interpretation/Execution

Finally, the interpreter traverses the AST and performs the actions represented by its nodes. For an arithmetic expression, this means evaluating the numbers and applying the operations. Each type of AST node will have a corresponding interpretation logic. This phase is where the 'meaning' of the code is derived and acted upon.

class Interpreter:
    def visit(self, node):
        method_name = f'visit_{type(node).__name__}'
        visitor = getattr(self, method_name, self.generic_visit)
        return visitor(node)

    def generic_visit(self, node):
        raise Exception(f'No visit_{type(node).__name__} method')

    def visit_NumberNode(self, node):
        return node.value

    def visit_BinaryOpNode(self, node):
        left_val = self.visit(node.left)
        right_val = self.visit(node.right)

        if node.op.type == 'PLUS':
            return left_val + right_val
        elif node.op.type == 'MINUS':
            return left_val - right_val
        elif node.op.type == 'MULTIPLY':
            return left_val * right_val
        elif node.op.type == 'DIVIDE':
            # Basic error handling for division by zero
            if right_val == 0:
                raise ZeroDivisionError("Division by zero")
            return left_val / right_val
        raise RuntimeError(f"Unknown operator: {node.op.type}")

# Example usage (assuming 'lex' and 'Parser' are available):
# from lexer import lex
# from parser import Parser
# code = "(2 + 3) * 4"
# tokens = lex(code)
# parser = Parser(tokens)
# ast = parser.parse()
# interpreter = Interpreter()
# result = interpreter.visit(ast)
# print(f"Result: {result}") # Expected: 20

An interpreter that traverses the AST and evaluates arithmetic expressions.

Putting It All Together

To run our quick interpreter, we simply chain the components: lex the code, parse the tokens into an AST, and then interpret the AST. This modular design allows each component to focus on a single responsibility, making the interpreter easier to understand and extend.

from lexer import lex
from parser import Parser
from interpreter import Interpreter

def run_interpreter(code):
    print(f"Input code: {code}")
    tokens = lex(code)
    print(f"Tokens: {tokens}")
    parser = Parser(tokens)
    ast = parser.parse()
    print(f"AST: {ast}")
    interpreter = Interpreter()
    result = interpreter.visit(ast)
    print(f"Result: {result}")
    return result

# Test cases
run_interpreter("10 + 5 * 2")
# Expected: 20

run_interpreter("(10 + 5) * 2")
# Expected: 30

run_interpreter("20 / (4 - 2)")
# Expected: 10.0

The main script to run the interpreter with example code.

This basic structure can be extended to support more complex language features like variables, control flow (if/else, loops), function calls, and more. Each new feature typically requires additions to the token types, grammar rules (and thus parser logic), and AST node types with corresponding interpretation logic. This iterative process is how complex programming languages are built.
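As a taste of that process, here is a minimal sketch of what the interpreter side of a variables feature might look like. The `VarAssignNode` and `VarRefNode` names are illustrative, not part of the code above, and the `NumberNode` here is a standalone stand-in; the lexer and parser would need matching IDENTIFIER and ASSIGN tokens and rules.

```python
# Hypothetical AST nodes for a variables extension; the names are
# illustrative and the nodes are hand-built to stand in for parser output.
class NumberNode:
    def __init__(self, value):
        self.value = value

class VarAssignNode:
    def __init__(self, name, value_node):
        self.name = name
        self.value_node = value_node

class VarRefNode:
    def __init__(self, name):
        self.name = name

class Interpreter:
    def __init__(self):
        self.env = {}  # variable name -> current value

    def visit(self, node):
        # Same getattr-based dispatch as the interpreter above.
        method = getattr(self, f'visit_{type(node).__name__}')
        return method(node)

    def visit_NumberNode(self, node):
        return node.value

    def visit_VarAssignNode(self, node):
        # Evaluate the right-hand side, then bind it in the environment.
        self.env[node.name] = self.visit(node.value_node)
        return self.env[node.name]

    def visit_VarRefNode(self, node):
        if node.name not in self.env:
            raise NameError(f"Undefined variable: {node.name}")
        return self.env[node.name]

# Hand-built AST for: x = 42, then read x back.
interp = Interpreter()
interp.visit(VarAssignNode('x', NumberNode(42)))
print(interp.visit(VarRefNode('x')))  # 42
```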