Building a Quick and Dirty Interpreter: A Practical Guide

Learn the fundamental concepts and practical steps to construct a basic interpreter for a simple programming language, from lexical analysis to execution.
Interpreters are fascinating pieces of software that execute code directly, line by line or statement by statement, without prior compilation into machine code. While building a production-ready interpreter is a complex task, creating a 'quick and dirty' one for a simple language can be an excellent way to understand core programming language concepts like lexical analysis, parsing, and execution. This article will guide you through the essential components and a practical approach to building such an interpreter.
The Interpreter's Core Components
Every interpreter, regardless of its complexity, typically follows a pipeline of distinct stages. Understanding these stages is crucial before diving into implementation. We'll focus on three primary phases: Lexical Analysis, Parsing, and Execution.
flowchart TD
    A[Source Code] --> B{Lexical Analysis}
    B --> C{Tokens}
    C --> D{Parsing}
    D --> E{"Abstract Syntax Tree (AST)"}
    E --> F{Execution}
    F --> G[Result/Output]
The typical pipeline of an interpreter
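Concretely, the three stages compose into a single entry point. Here is a minimal sketch, assuming the Lexer, Parser, and Interpreter classes defined in the sections below:

def interpret_source(source_code):
    """Run all three stages on a source string."""
    tokens = Lexer(source_code).tokenize()   # Phase 1: characters -> tokens
    parser = Parser(tokens)                  # Phase 2 runs inside interpret()
    return Interpreter(parser).interpret()   # parse to an AST, then walk it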
Phase 1: Lexical Analysis (Tokenization)
The first step is to break the raw source code into a stream of meaningful units called 'tokens'. A lexer (or tokenizer) reads the input character by character and groups characters into these tokens. For example, in the expression 10 + 20, the lexer would identify 10 as a NUMBER token, + as a PLUS token, and 20 as another NUMBER token. Each token typically has a type (e.g., NUMBER, IDENTIFIER, OPERATOR) and a value (the actual text, or 'lexeme').
import re

# Token patterns, tried in insertion order; the first match wins
TOKEN_TYPES = {
    'NUMBER': r'\d+',
    'PLUS': r'\+',
    'MINUS': r'-',
    'MULTIPLY': r'\*',
    'DIVIDE': r'/',
    'LPAREN': r'\(',
    'RPAREN': r'\)',
    'WHITESPACE': r'\s+'
}

class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

    def __repr__(self):
        return f"Token({self.type}, {self.value})"

class Lexer:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.tokens = []

    def get_next_token(self):
        if self.pos >= len(self.text):
            return Token('EOF', None)
        for token_type, pattern in TOKEN_TYPES.items():
            match = re.match(pattern, self.text[self.pos:])
            if match:
                value = match.group(0)
                self.pos += len(value)
                if token_type != 'WHITESPACE':
                    return Token(token_type, value)
                else:
                    return self.get_next_token()  # Skip whitespace
        raise Exception(f"Lexer error: Unexpected character '{self.text[self.pos]}' at position {self.pos}")

    def tokenize(self):
        token = self.get_next_token()
        while token.type != 'EOF':
            self.tokens.append(token)
            token = self.get_next_token()
        self.tokens.append(token)  # Add EOF token
        return self.tokens

# Example Usage:
# lexer = Lexer("10 + (20 * 5) - 3")
# tokens = lexer.tokenize()
# print(tokens)
A simple Python lexer for basic arithmetic expressions.
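For instance, tokenizing a small expression produces the following stream (output formatted by the __repr__ method defined above):

lexer = Lexer("10 + 2")
print(lexer.tokenize())
# [Token(NUMBER, 10), Token(PLUS, +), Token(NUMBER, 2), Token(EOF, None)]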
This hand-rolled approach is fine for quick-and-dirty work; for more complex languages, consider a dedicated lexer generator such as flex or lex.
Phase 2: Parsing (Syntax Analysis)
Once you have a stream of tokens, the parser's job is to check whether the sequence of tokens conforms to the language's grammar rules and, if so, to build an Abstract Syntax Tree (AST). The AST is a tree representation of the source code's structure, abstracting away the concrete syntax. For example, 10 + 20 might become an Addition node with 10 and 20 as its children. This tree structure is much easier for the interpreter to work with than a flat list of tokens.
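The parser below is a recursive descent parser: each grammar rule becomes one method. In informal EBNF, the grammar it implements is:

expr   : term ((PLUS | MINUS) term)*
term   : factor ((MULTIPLY | DIVIDE) factor)*
factor : NUMBER | LPAREN expr RPAREN

Placing MULTIPLY and DIVIDE one level deeper than PLUS and MINUS is what gives multiplication and division higher precedence than addition and subtraction.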
class ASTNode:
    pass

class NumberNode(ASTNode):
    def __init__(self, token):
        self.token = token
        self.value = int(token.value)

class BinaryOpNode(ASTNode):
    def __init__(self, left, op, right):
        self.left = left
        self.op = op      # the operator token (e.g., PLUS)
        self.right = right

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.token_idx = 0
        self.current_token = self.tokens[self.token_idx]

    def advance(self):
        self.token_idx += 1
        if self.token_idx < len(self.tokens):
            self.current_token = self.tokens[self.token_idx]
        else:
            self.current_token = Token('EOF', None)

    def eat(self, token_type):
        if self.current_token.type == token_type:
            self.advance()
        else:
            raise Exception(f"Parser error: Expected {token_type}, got {self.current_token.type}")

    def factor(self):
        # factor : NUMBER | LPAREN expr RPAREN
        token = self.current_token
        if token.type == 'NUMBER':
            self.eat('NUMBER')
            return NumberNode(token)
        elif token.type == 'LPAREN':
            self.eat('LPAREN')
            node = self.expr()
            self.eat('RPAREN')
            return node
        else:
            raise Exception(f"Parser error: Unexpected token {token.type}")

    def term(self):
        # term : factor ((MULTIPLY | DIVIDE) factor)*
        node = self.factor()
        while self.current_token.type in ('MULTIPLY', 'DIVIDE'):
            op_token = self.current_token
            self.eat(op_token.type)
            node = BinaryOpNode(node, op_token, self.factor())
        return node

    def expr(self):
        # expr : term ((PLUS | MINUS) term)*
        node = self.term()
        while self.current_token.type in ('PLUS', 'MINUS'):
            op_token = self.current_token
            self.eat(op_token.type)
            node = BinaryOpNode(node, op_token, self.term())
        return node

    def parse(self):
        return self.expr()

# Example Usage (requires Lexer from above):
# lexer = Lexer("10 + (20 * 5) - 3")
# tokens = lexer.tokenize()
# parser = Parser(tokens)
# ast = parser.parse()
# print(ast)  # Prints the default object repr; add __repr__ methods for readable output
A recursive descent parser for arithmetic expressions, building an AST.
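To see precedence in action, inspect the tree for an expression that mixes operators (a quick sketch using the classes above; the attribute names match the node definitions):

tokens = Lexer("2 + 3 * 4").tokenize()
ast = Parser(tokens).parse()
# The root is the '+' operation; '*' binds tighter and ends up as its right child
print(ast.op.value)        # +
print(ast.left.value)      # 2
print(ast.right.op.value)  # *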
Phase 3: Execution (Interpretation)
With the AST in hand, the final phase is to traverse this tree and perform the operations it describes. This is where the actual 'interpretation' happens. For an arithmetic expression, this means evaluating the numbers and performing the specified operations. For a language with variables, this phase would involve managing a symbol table (or environment) to store and retrieve variable values.
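As a taste of that, a symbol table can be as simple as a dictionary on the interpreter. Here is a rough sketch that extends the Interpreter class defined below; VarNode is a hypothetical node class for variable references, not part of the parser above:

class InterpreterWithEnv(Interpreter):
    def __init__(self, parser):
        super().__init__(parser)
        self.env = {}  # symbol table: variable name -> current value

    def visit_VarNode(self, node):
        # Look the name up; fail loudly if it was never assigned
        if node.name not in self.env:
            raise Exception(f"Runtime error: Undefined variable '{node.name}'")
        return self.env[node.name]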
class Interpreter:
    def __init__(self, parser):
        self.parser = parser

    def visit(self, node):
        # Dispatch to visit_<NodeClassName>, e.g. visit_BinaryOpNode
        method_name = f'visit_{type(node).__name__}'
        visitor = getattr(self, method_name, self.generic_visit)
        return visitor(node)

    def generic_visit(self, node):
        raise Exception(f'No visit_{type(node).__name__} method')

    def visit_NumberNode(self, node):
        return node.value

    def visit_BinaryOpNode(self, node):
        left_val = self.visit(node.left)
        right_val = self.visit(node.right)
        if node.op.type == 'PLUS':
            return left_val + right_val
        elif node.op.type == 'MINUS':
            return left_val - right_val
        elif node.op.type == 'MULTIPLY':
            return left_val * right_val
        elif node.op.type == 'DIVIDE':
            # Basic error handling for division by zero
            if right_val == 0:
                raise Exception("Runtime error: Division by zero")
            return left_val / right_val

    def interpret(self):
        tree = self.parser.parse()
        return self.visit(tree)

# Full Example Usage:
# source_code = "10 + (20 * 5) - 3"
# lexer = Lexer(source_code)
# tokens = lexer.tokenize()
# parser = Parser(tokens)
# interpreter = Interpreter(parser)
# result = interpreter.interpret()
# print(f"Result: {result}")  # Expected: 107
An interpreter that traverses the AST and evaluates arithmetic expressions.
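Wiring the three components together gives you a tiny read-eval-print loop. A minimal sketch:

# A minimal REPL built from the pieces above
while True:
    try:
        line = input("calc> ")
        if not line.strip():
            continue  # ignore empty input
        tokens = Lexer(line).tokenize()
        print(Interpreter(Parser(tokens)).interpret())
    except EOFError:
        break  # Ctrl-D exits
    except Exception as error:
        print(error)  # lexer, parser, and runtime errors all surface here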
A Step-by-Step Recipe
1. Define Your Language's Grammar
Before writing any code, clearly define the syntax and semantics of the simple language you want to interpret. What operations will it support? What are its keywords? This will guide your token definitions and parsing rules.
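For example, a toy language with variables might extend the arithmetic grammar above with two rules (a hypothetical sketch, not something the parser in this article implements):

statement : IDENTIFIER ASSIGN expr    (e.g. x = 1 + 2)
          | expr
factor    : NUMBER | IDENTIFIER | LPAREN expr RPAREN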
2. Implement the Lexer
Write a lexer that takes the raw input string and converts it into a stream of tokens. Use regular expressions for pattern matching. Ensure it handles whitespace and comments appropriately (usually by discarding them).
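Comments are usually discarded the same way as whitespace. One way to do that with the regex table above, assuming '#'-to-end-of-line comments (a syntax choice this article does not prescribe):

# Hypothetical comment syntax: '#' to end of line
TOKEN_TYPES['COMMENT'] = r'#[^\n]*'

# In Lexer.get_next_token, discard comments exactly like whitespace
# by changing the whitespace check to:
#     if token_type not in ('WHITESPACE', 'COMMENT'):
#         return Token(token_type, value)
#     else:
#         return self.get_next_token()  # skip and keep scanning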
3. Build the Parser and AST
Create a parser that consumes the token stream and constructs an Abstract Syntax Tree (AST). Recursive descent is a good starting point for simple grammars. Each node in your AST should represent a logical construct in your language.
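Continuing the hypothetical variable extension, each new construct gets its own node class. An assignment node can be tiny:

class AssignNode(ASTNode):
    def __init__(self, name_token, value_node):
        self.name = name_token.value   # the IDENTIFIER's lexeme, e.g. 'x'
        self.value_node = value_node   # AST subtree for the right-hand side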
4. Develop the Interpreter/Evaluator
Implement a component that traverses the AST. For each node type, define how it should be evaluated or executed. This is where the 'meaning' of your language's constructs comes to life.
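Thanks to the visitor dispatch in the Interpreter class, supporting a new node type is one more method. For the hypothetical AssignNode above, using the env dictionary from the Phase 3 sketch:

def visit_AssignNode(self, node):
    # Evaluate the right-hand side, then record it in the symbol table
    value = self.visit(node.value_node)
    self.env[node.name] = value
    return value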
5. Test Iteratively
Start with the simplest possible expressions or statements and gradually add complexity. Test each component (lexer, parser, interpreter) independently before integrating them. This makes debugging much easier.
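A few assert statements make a serviceable quick-and-dirty test harness. A sketch reusing the interpret_source helper from the pipeline sketch at the top:

# Lexer-level check: token types only
assert [t.type for t in Lexer("1+2").tokenize()] == ['NUMBER', 'PLUS', 'NUMBER', 'EOF']

# End-to-end checks, simplest first
assert interpret_source("2") == 2
assert interpret_source("2 + 3 * 4") == 14
assert interpret_source("(2 + 3) * 4") == 20
print("All tests passed")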