Building a Lisp Interpreter (in Rust) - Part 1

1. Why Lisp

Implementing a compiler or interpreter may seem like a daunting task, but a Lisp interpreter is an easy starting point.

To write Lisp code, you write down the Lisp data structure itself. Evaluating a Lisp program involves two steps:

  1. read - Read the text into a Lisp data structure.
  2. eval - Evaluate that structure according to Lisp’s simple rules.

Some examples of Lisp data:

12 ;; An integer, 12
#true ;; A boolean

;; A list containing the identifier +, and the integers 1 and 2.
;; This expression specifically invokes the + function with arguments 1 and 2.
;; The result of evaluating this list is 3.
(+ 1 2)

2. Goal

In this post, we will use Rust to implement only the read step. We’ll evaluate expressions in Part 2 and implement custom functions in Part 3.

Our implementation will favor readability over performance. In some cases (such as list storage design), we’ll also diverge from traditional Lisp implementations for clarity. If you want a well-defined reference implementation, check out the Scheme R7RS specification.

3. Representing Lisp Data

Lisp is dynamically typed, like JavaScript or Python. Dynamic typing means that the type of a variable is determined at runtime. The type may also change.

(define x 100) ;; x is an integer
(set! x #true) ;; now x is a boolean

;; More code
...

;; At this point, x is probably a boolean, unless it was changed in "More code".
(+ x 100)

To implement dynamic typing, we will define a single type that can hold any supported value. In Rust, this will be an enum with one variant per supported type. We will start with integers, booleans, and symbols/identifiers.

#[derive(Clone, Debug, PartialEq)]
pub enum Val {
    Int(i64),
    Bool(bool),
    Symbol(String),
    // ... and any other types we want to support.
}

3.1. Lists

Lists in Lisp are typically linked lists. A linked list is a node containing an element and a reference to the rest of the list. A basic version might look like:

pub struct LinkedListNode<T> {
    item: T,
    rest: Option<Box<LinkedListNode<T>>>,
}

However, in the spirit of Lisp’s dynamic typing, lists are often represented as a generic pair. The first item of the pair points to the head element (car), and the second points to the rest (cdr), which is either another pair or null.

pub struct Pair {
    // To make things more proper, instead of `first` and `second`,
    // the traditional Lisp names `car` and `cdr` are used.
    //
    // Few remember that they stand for "Contents of the Address Register"
    // and "Contents of the Decrement Register", after registers on the
    // IBM 704 that hosted the first Lisp.
    car: Val,
    cdr: Val,
}

That said, we’ll take a simpler approach for our interpreter and represent lists as Vec<Val>, largely because Rust’s pattern matching works beautifully with slices.

3.2. Final Design

Our final data structure will have four variants. It’s easy to add new datatypes later; just add another enum variant!

#[derive(Clone, Debug, PartialEq)]
pub enum Val {
    Int(i64),
    Bool(bool),
    Symbol(String),
    List(Vec<Val>),
}

4. Parsing

Parsing text into a Val involves two steps:

  1. Tokenization - Split the text into meaningful tokens.
  2. Reading - Convert the sequence of tokens into Lisp values.

4.1. Tokenization

The tokenizer splits the text into parentheses and identifiers. We’ll split on whitespace, treating ( and ) as their own single-character tokens.

#[test]
fn test_tokenizer() {
    let mut tokenizer = Tokenizer::new("12 #true (+ 1 2)");
    assert_eq!(
        tokenizer.collect::<Vec<_>>(),
        &["12", "#true", "(", "+", "1", "2", ")"]
    );
}

To implement this, we will create a Tokenizer struct that holds the text being tokenized and a cursor marking our current position. When asked for the next token, we parse the token at position and advance position past it.

#[derive(Debug, Clone)]
pub struct Tokenizer<'a> {
    input: &'a str,
    position: usize,
}

impl<'a> Tokenizer<'a> {
    pub fn new(input: &'a str) -> Self {
        Self { input, position: 0 }
    }
}

To parse the next token, we will

  1. Skip any whitespace since whitespace is not meaningful.
  2. Peek at the next character.
    • If it is a parenthesis, then we will return that as a token and advance the position by just a single character.
    • If it is an identifier, then we will parse until the next whitespace or parenthesis. The parsed token will be returned and the position will be advanced.

impl<'a> Iterator for Tokenizer<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<Self::Item> {
        // 1. Skip whitespace. `find` returns a byte offset, and `?` bails
        //    out with `None` when only whitespace (or nothing) remains.
        let offset = self.input[self.position..]
            .find(|c: char| !c.is_whitespace())?;
        self.position += offset;

        // 2a. Handle parens.
        let rest = &self.input[self.position..];
        if rest.starts_with('(') || rest.starts_with(')') {
            self.position += 1;
            return Some(&rest[..1]);
        }
        // 2b. Handle identifiers: take everything up to the next
        //     whitespace character or parenthesis.
        let token = rest
            .split(|c: char| c.is_whitespace() || c == '(' || c == ')')
            .next()?;
        self.position += token.len();
        Some(token)
    }
}

This minimal implementation is enough for our current needs. At some point, though, you may want special handling for strings, which can contain whitespace. The tokenizer also returns only the token text and throws away its position; positional information is useful for attaching debug information in downstream processing.

4.2. Reading

The reader converts the token stream into Lisp values:

  • Identifiers are parsed as integers, booleans, or symbols.
  • A ( begins a list; elements are parsed until the matching ).
  • If there’s no closing ), we signal an error.

#[test]
fn test_read() {
    let mut tokenizer = Tokenizer::new("42 #true + (+ (* 2 3) 4)");
    assert_eq!(read_next(&mut tokenizer).unwrap(), Some(Val::Int(42)));
    assert_eq!(read_next(&mut tokenizer).unwrap(), Some(Val::Bool(true)));
    assert_eq!(read_next(&mut tokenizer).unwrap(), Some(Val::Symbol("+".to_string())));
    assert_eq!(
        read_next(&mut tokenizer).unwrap(),
        Some(Val::List(vec![
            Val::Symbol("+".to_string()),
            Val::List(vec![
                Val::Symbol("*".to_string()),
                Val::Int(2),
                Val::Int(3)
            ]),
            Val::Int(4)
        ]))
    );
    assert_eq!(read_next(&mut tokenizer).unwrap(), None);
}

#[test]
fn test_read_unmatched_paren() {
    let mut tokenizer = Tokenizer::new("(+ 1 2");
    assert!(matches!(
        read_next(&mut tokenizer),
        Err(TokenizerError::UnexpectedEndOfInput)
    ));
}

#[test]
fn test_read_extra_closing_paren() {
    let mut tokenizer = Tokenizer::new(")");
    assert!(matches!(
        read_next(&mut tokenizer),
        Err(TokenizerError::UnexpectedEndOfInput)
    ));
}

To accomplish this, we'll create a helper that parses the next item. The possible results are:

enum ReadItem {
    // The end of an expression was reached. The end is signaled with a closing
    // parenthesis.
    EndExpr,
    // A value was parsed.
    Val(Val),
    // There were no more values.
    None,
}

When parsing the next token, the options are:

  1. There are no more tokens, so ReadItem::None.
  2. An identifier was encountered, so ReadItem::Val.
    • The identifier could also be an integer. If not, then it is a "symbol" that our interpreter will use to bind variables and function names.
  3. A closing parenthesis was encountered, so ReadItem::EndExpr. In a normal context this is a syntax error.
  4. An opening parenthesis was encountered, so we recursively call read_next_impl until a ReadItem::EndExpr is encountered.

#[derive(Copy, Clone, Debug, PartialEq)]
pub enum TokenizerError {
    UnexpectedEndOfInput,
}

fn read_next_impl(tokenizer: &mut Tokenizer) -> Result<ReadItem, TokenizerError> {
    match tokenizer.next() {
        None => Ok(ReadItem::None),
        Some("#true") => Ok(ReadItem::Val(Val::Bool(true))),
        Some("#false") => Ok(ReadItem::Val(Val::Bool(false))),
        Some("(") => {
            let mut res = Vec::new();
            loop {
                match read_next_impl(tokenizer)? {
                    ReadItem::EndExpr => return Ok(ReadItem::Val(Val::List(res))),
                    ReadItem::Val(v) => res.push(v),
                    ReadItem::None => return Err(TokenizerError::UnexpectedEndOfInput),
                }
            }
        }
        Some(")") => Ok(ReadItem::EndExpr),
        Some(identifier) => {
            let val = if let Ok(int_val) = identifier.parse::<i64>() {
                Val::Int(int_val)
            } else {
                Val::Symbol(identifier.to_string())
            };
            Ok(ReadItem::Val(val))
        }
    }
}

To finish our implementation, we wrap our helper:

pub fn read_next(tokenizer: &mut Tokenizer) -> Result<Option<Val>, TokenizerError> {
    match read_next_impl(tokenizer)? {
        ReadItem::EndExpr => Err(TokenizerError::UnexpectedEndOfInput),
        ReadItem::Val(val) => Ok(Some(val)),
        ReadItem::None => Ok(None),
    }
}

Author: Will S. Medrano

Date: 2025-11-15