<!--
.. title: From Tree-Walker to Bytecode VM: Compiling Lattice
.. slug: from-tree-walker-to-bytecode-vm-compiling-lattice
.. date: 2026-02-17 12:00:00 UTC-06:00
.. tags: lattice, programming languages, bytecode, virtual machine, compilers, c, language design, interpreters
.. category: software-engineering
.. link:
.. description: How Lattice moved from a tree-walking interpreter to a bytecode compiler and stack-based VM — covering the architecture of a 62-opcode instruction set, upvalue-based closures, phase system compilation, and the subtle bug that taught me how deep-clone-on-read interacts with in-place mutation.
.. type: text
-->

<div class="audio-widget">
<div class="audio-widget-header">
<span class="audio-widget-icon">🎧</span>
<span class="audio-widget-label">Listen to this article</span>
</div>
<audio controls preload="metadata">
<source src="/from-tree-walker-to-bytecode-vm-compiling-lattice_tts.mp3" type="audio/mpeg">
</audio>
<div class="audio-widget-footer">16 min · AI-generated narration</div>
</div>

Lattice is a programming language built around a [crystallization-based phase system](/posts/introducing-lattice-a-crystallization-based-programming-language.html) — values start as mutable "flux" and can be frozen into immutable "fix," with the runtime enforcing the transition and providing [reactions, bonds, contracts, and temporal tracking](/posts/mutability-as-a-first-class-concept-the-lattice-phase-system.html) around it. It's implemented in C with no external dependencies.

When I started building [Lattice](https://baud.rs/q5yFwI), a tree-walking interpreter was the obvious first move. You parse source into an AST, walk the nodes recursively, and evaluate as you go. It's straightforward, easy to debug, and lets you iterate on language semantics quickly without worrying about a second representation. [*Crafting Interpreters*](https://baud.rs/crafting-interpreters) calls this approach "the simplest way to build an interpreter," and it's right.

But tree-walkers have well-known limitations. Every expression evaluation descends through function calls — `eval_expr` calling `eval_binary` calling `eval_expr` twice more. The overhead compounds. You're chasing pointers through heap-allocated AST nodes with poor cache locality. And the call stack of the host language (C, in Lattice's case) becomes tangled with the call stack of the guest language, making it harder to implement features like error recovery and coroutines cleanly.

Lattice v0.3.0 shipped a bytecode compiler and stack-based virtual machine alongside the tree-walker. In v0.3.1, the bytecode VM became the default for file execution, the interactive REPL, and the browser-based playground; the tree-walker is still available via `--tree-walk`. This post walks through the architecture of that VM, some design decisions that turned out to matter, and a mutation bug that only surfaces when you combine deep-clone-on-read semantics with in-place method dispatch.

## Architecture Overview

The bytecode pipeline has three stages: lexing and parsing (shared with the tree-walker), compilation from AST to bytecode chunks, and execution on a stack-based VM. The compiler and VM together add about 8,200 lines of C to the codebase, bringing the total to around 33,000 lines.

A `Chunk` is the compilation unit — a dynamic array of bytecode instructions, a constant pool, and debug metadata mapping instructions back to source line numbers. The compiler walks the AST and emits bytes into a chunk. The VM reads bytes from the chunk and executes them against a value stack.

```c
typedef struct {
    uint8_t  *code;          // bytecode array
    size_t    len, cap;
    LatValue *constants;     // constant pool
    size_t    const_len, const_cap;
    int      *lines;         // source line per instruction
    char    **local_names;   // slot → variable name (debug)
    size_t    local_name_cap;
} Chunk;
```

The VM itself is a `for(;;)` loop with a `switch` on the current opcode byte — the textbook approach. No computed gotos, no threaded dispatch, no JIT. Just a switch. On modern hardware with branch prediction, a well-organized switch over 62 opcodes is fast enough that the overhead is negligible compared to the cost of actual operations (string allocation, hash table lookups, deep cloning).

```c
for (;;) {
    uint8_t op = READ_BYTE();
    switch (op) {
        case OP_CONSTANT: { ... }
        case OP_ADD:      { ... }
        case OP_CALL:     { ... }
        // 59 more cases
    }
}
```

The value stack holds 4,096 slots. The call frame stack holds 256 frames. Each `CallFrame` tracks its own instruction pointer, a base pointer into the value stack for its local variables, and an array of captured upvalues for closures. When you call a function, the VM pushes a new frame pointing at the callee's chunk. When the function returns, the frame pops and execution resumes in the caller.

## The Instruction Set

Lattice's instruction set has 62 opcodes. Some are standard — `OP_ADD`, `OP_JUMP_IF_FALSE`, `OP_RETURN`. Others exist because of Lattice-specific semantics.

The phase system needs dedicated opcodes. `OP_FREEZE` pops a value, deep-clones it into a crystal region with `VTAG_CRYSTAL` tags, and pushes the frozen result. `OP_THAW` does the reverse. `OP_MARK_FLUID` sets the phase tag to `VTAG_FLUID` — this is what `flux` bindings emit after their initializer. `OP_FREEZE_VAR` and `OP_THAW_VAR` handle the case where `freeze(x)` targets a named variable and needs to write back the result, carrying extra operands to identify the variable's location (local slot, upvalue, or global name).

Phase reactions and bonds each have their own opcodes: `OP_REACT`, `OP_UNREACT`, `OP_BOND`, `OP_UNBOND`, `OP_SEED`, `OP_UNSEED`. These could theoretically be implemented as native function calls, but making them opcodes lets the compiler emit the variable name as a constant operand — the VM needs the name to look up the correct reaction/bond registration in its tracking tables, and encoding it in the bytecode avoids a runtime string lookup.

Structured concurrency uses an interesting hybrid. `OP_SCOPE` and `OP_SELECT` each carry a constant-pool index that stores a pointer to the original AST `Expr*` node. When the VM hits one of these opcodes, it invokes the tree-walking evaluator on that subtree. This is a deliberate design choice — the concurrency primitives involve spawning threads and managing channels, which requires the evaluator's full environment machinery. Rather than reimplement all of that in the VM, the bytecode compiler punts to the tree-walker for these specific constructs. The rest of the program runs on the VM; only `scope` and `select` blocks briefly drop into interpretation.

## Closures and Upvalues

Closures are where bytecode VMs get interesting, and Lattice follows the upvalue model that Lua pioneered and *Crafting Interpreters* popularized.

When a function is defined inside another function and references variables from the enclosing scope, those variables need to outlive their original stack frame. The solution is upvalues — indirection objects that start pointing into the stack and get "closed over" when the variable goes out of scope.

```c
typedef struct ObjUpvalue {
    LatValue *location;        // points to stack slot or &closed
    LatValue  closed;          // holds value after scope exit
    struct ObjUpvalue *next;   // linked list for open upvalues
} ObjUpvalue;
```

While the enclosing function is still executing, `location` points directly at the stack slot. When the enclosing function returns, `OP_CLOSE_UPVALUE` copies the stack value into the `closed` field and repoints `location` to `&closed`. The closure doesn't know or care about this transition — it always dereferences `location`. This is why upvalues work: they're a level of indirection that transparently survives stack frame destruction.

The compiler resolves variable references in three stages: first it checks local scope (`resolve_local`), then upvalues (`resolve_upvalue`, which walks the compiler chain recursively), then falls back to globals via `OP_GET_GLOBAL`. The `OP_CLOSURE` instruction is followed by a series of `(is_local, index)` byte pairs, one per upvalue, telling the VM whether to capture from the current frame's stack or from the parent frame's upvalue array.

A concrete example makes this clearer. Consider a counter factory:

```lattice
fn make_counter() {
    flux count = 0
    return |n| { count += n; count }
}

let c = make_counter()
print(c(5))   // 5
print(c(3))   // 8
```

When `make_counter` returns, its stack frame is destroyed — but `count` needs to survive, because the returned closure references it. During compilation, the compiler sees that the closure's body references `count`, which is local to the enclosing `make_counter`. It emits an `(is_local=true, index=1)` upvalue descriptor. At runtime, `OP_CLOSURE` calls `capture_upvalue()`, which either reuses an existing `ObjUpvalue` pointing at that stack slot or creates a new one. When `make_counter` returns, `OP_CLOSE_UPVALUE` copies the stack value of `count` into the upvalue's `closed` field and repoints `location`. The closure keeps working, oblivious to the frame being gone.

One implementation detail worth noting: Lattice stores the upvalue array by repurposing the closure's `captured_env` field (normally an `Env*` in the tree-walker) and the upvalue count in the `region_id` field. This avoids adding new fields to the `LatValue` union, which matters when values are deep-cloned frequently — every field adds to the clone cost.

## Compiling for the REPL

A REPL that runs on a bytecode VM needs to compile input differently than file execution does. The difference is small but important.

In file mode, `compile_module()` compiles a complete program and terminates with `OP_UNIT; OP_RETURN` — the module returns unit, and any expression results along the way are discarded with `OP_POP`. This is the right behavior for scripts: you don't want every intermediate expression to accumulate on the stack.

In REPL mode, `compile_repl()` needs the opposite behavior for the last expression. When you type `42` at the REPL prompt, you want to see `=> 42`. So if the last item in the compiled chunk is a bare expression statement, `compile_repl()` compiles the expression but *skips the `OP_POP`*, leaving the value on the stack. Then it emits `OP_RETURN`, and the VM receives the value as the chunk's return value.

```c
bool last_is_expr = (prog->item_count > 0 &&
    prog->items[prog->item_count - 1].tag == ITEM_STMT &&
    prog->items[prog->item_count - 1].as.stmt->tag == STMT_EXPR);

if (last_is_expr)
    emit_byte(OP_RETURN, 0);       // value already on stack
else {
    emit_byte(OP_UNIT, 0);         // no expression — return unit
    emit_byte(OP_RETURN, 0);
}
```

For function definitions, struct declarations, and enum definitions, the result is unit, and the REPL silently suppresses the `=> ` output. This matches user expectations — defining a function shouldn't print anything. The effect in practice:

```
lattice> 42
=> 42
lattice> "hello" + " world"
=> "hello world"
lattice> flux x = 10
lattice> x * 2
=> 20
lattice> fn square(n: Int) -> Int { n * n }
lattice> square(7)
=> 49
```

Each line is independently compiled and executed on the persistent VM. Globals defined in one line (`flux x = 10`) are visible in subsequent lines because they're stored in the VM's environment, which persists across iterations. The `Chunk` for each line is freed after execution — constants that matter (like global variable values) have already been deep-cloned into the environment.

The other critical difference is enum persistence. `compile_module()` frees its known-enum registry after compilation, because the compiler is done. `compile_repl()` must not, because enums defined in REPL iteration N need to be visible in iteration N+1. The REPL calls `compiler_free_known_enums()` only on exit. The same lifetime concern applies to parsed programs — struct and function declarations store `Expr*` pointers that compiled chunks reference at runtime. The REPL accumulates all parsed programs in a dynamic array and frees them only when the session ends.

## The Global Mutation Bug

This is the story I find most instructive, because it reveals a subtle interaction between two independently reasonable design decisions.

Lattice has **deep-clone-on-read** semantics. When you access a variable, the environment doesn't hand you a reference to the stored value — it hands you a fresh deep clone. This eliminates aliasing entirely: two variables never share underlying memory, passing a map to a function gives the function its own copy, and there's no way to create spooky action at a distance through shared mutable state.

```c
bool env_get(const Env *env, const char *name, LatValue *out) {
    for (size_t i = env->count; i > 0; i--) {
        LatValue *v = lat_map_get(&env->scopes[i-1], name);
        if (v) {
            *out = value_deep_clone(v);  // always a fresh copy
            return true;
        }
    }
    return false;
}
```

This is expensive but correct. It gives Lattice pure value semantics without needing a borrow checker or persistent data structures.

The tree-walking evaluator handles in-place mutation (like `array.push()`) with a separate `resolve_lvalue()` mechanism that obtains a direct mutable pointer into the environment's storage, bypassing the deep clone. Push, pop, index assignment — these all go through `resolve_lvalue` and mutate the stored value directly.

The bytecode VM needed the same distinction. For local variables, this is straightforward: locals live on the value stack, and the VM has a direct pointer to them via `frame->slots[slot]`. I added `OP_INVOKE_LOCAL`, which takes a stack slot index as an operand and passes a pointer to `vm_invoke_builtin()`:

```c
case OP_INVOKE_LOCAL: {
    uint8_t slot = READ_BYTE();
    uint8_t method_idx = READ_BYTE();
    uint8_t arg_count = READ_BYTE();
    const char *method_name = frame->chunk->constants[method_idx].as.str_val;
    const char *local_var_name = frame->chunk->local_names[slot];  // debug name for errors
    LatValue *obj = &frame->slots[slot];  // direct pointer

    if (vm_invoke_builtin(vm, obj, method_name, arg_count, local_var_name)) {
        // builtin mutated obj in-place — mutation persists
        break;
    }
    // ... fall through to closure/method dispatch
}
```

When `.push()` grows the array by reallocating `obj->as.array.elems` and incrementing `obj->as.array.len`, it's directly modifying the stack slot. The mutation persists because `obj` *is* the variable.

For globals, the situation is different. Globals live in the environment (a scope-chain of hash maps), and `env_get()` deep-clones. The generic `OP_INVOKE` opcode works by evaluating the receiver expression onto the stack — which, for a global variable, means emitting `OP_GET_GLOBAL`, which calls `env_get()`, which deep-clones — and then dispatching the method on the cloned value. After the builtin mutates the clone, `OP_INVOKE` pops and *frees* it. The mutation vanishes.

```lattice
flux nums = [1, 2, 3]
nums.push(4)
print(nums)  // still [1, 2, 3] — the push mutated a clone
```

This is the kind of bug that's obvious in retrospect but invisible when you're implementing things one piece at a time. `env_get()` deep-cloning is correct. `OP_INVOKE` popping the receiver after dispatch is correct. Each piece behaves correctly in isolation. The bug emerges from their composition.

The fix is `OP_INVOKE_GLOBAL` — a new opcode that knows the receiver is a global variable and writes back after mutation:

```c
case OP_INVOKE_GLOBAL: {
    uint8_t name_idx = READ_BYTE();
    uint8_t method_idx = READ_BYTE();
    uint8_t arg_count = READ_BYTE();
    const char *global_name = frame->chunk->constants[name_idx].as.str_val;
    const char *method_name = frame->chunk->constants[method_idx].as.str_val;

    LatValue obj_val;
    if (!env_get(vm->env, global_name, &obj_val)) {
        VM_ERROR("undefined variable '%s'", global_name); break;
    }

    if (vm_invoke_builtin(vm, &obj_val, method_name, arg_count, global_name)) {
        if (vm->error) { /* handle error */ }
        // Write back the mutated clone to the environment
        env_set(vm->env, global_name, obj_val);
        break;
    }
    // ... fall through for non-builtin methods
}
```

The compiler emits `OP_INVOKE_GLOBAL` when it sees a method call on an identifier that isn't a local variable or an upvalue:

```c
case EXPR_METHOD_CALL: {
    if (e->as.method_call.object->tag == EXPR_IDENT) {
        const char *name = e->as.method_call.object->as.str_val;
        int slot = resolve_local(current, name);
        if (slot >= 0) {
            // ... emit OP_INVOKE_LOCAL
            break;
        }
        int upvalue = resolve_upvalue(current, name);
        if (upvalue < 0) {
            // Not local, not upvalue — must be global
            // ... emit OP_INVOKE_GLOBAL
            break;
        }
    }
    // ... fall through to generic OP_INVOKE
}
```

This gives us three tiers of method dispatch: `OP_INVOKE_LOCAL` for locals (direct pointer, no clone), `OP_INVOKE_GLOBAL` for globals (clone + write-back), and `OP_INVOKE` for everything else (computed receivers like `get_array().push(x)`, where there's nothing to write back to). With the fix:

```lattice
flux nums = [1, 2, 3]
nums.push(4)
nums.push(5)
print(nums)  // [1, 2, 3, 4, 5] — mutations persist
```

All mutating builtins — `push`, `pop`, `set`, `remove`, `insert`, `remove_at` — now work correctly on global variables. The same pattern applies to maps, sets, and any other type with in-place methods.

The broader lesson is that deep-clone-on-read semantics create an impedance mismatch with in-place mutation. In a reference-based language, `obj.push(x)` just works — `obj` is a reference, and the mutation happens wherever the reference points. In a value-based language, you need to explicitly handle the write-back for every level of variable storage. The tree-walker's `resolve_lvalue` is one solution. The VM's tiered invoke opcodes are another. Both exist because of the same underlying tension.

## The WASM Playground

Lattice's browser-based [playground](https://baud.rs/odS816) compiles the entire VM to WebAssembly via Emscripten. The WASM API exposes four functions: `lat_init()`, `lat_run_line()`, `lat_is_complete()`, and `lat_destroy()`.

The playground runs the same bytecode VM as the native binary. Each line of input goes through the same pipeline: lex, parse, `compile_repl()`, `vm_run()`. The `lat_is_complete()` function checks bracket depth to determine whether the user is mid-expression, enabling multi-line input by waiting for balanced braces before compiling.

Previously the playground used the tree-walking evaluator, which meant code could behave differently in the browser than on the command line. Switching the WASM build to the bytecode VM eliminates that inconsistency — the playground, the REPL, and file execution all use the same compilation and execution path.

## What Didn't Change

It's worth noting what the bytecode VM *doesn't* change about Lattice.

The value representation is identical. A `LatValue` is still a tagged union with a type tag, phase tag, and payload. Phase transitions still deep-clone data across heap regions. The dual-heap architecture — mark-and-sweep for fluid data, arena-based regions for crystal data — is unchanged. Global variables still live in a scope-chain environment.

The parser and AST are completely shared. The compiler reads the same `Program` structure that the tree-walker reads. A single set of test programs validates both execution paths — all 771 tests pass on both.

The phase system compiles one-to-one. `freeze()` becomes `OP_FREEZE`. `thaw()` becomes `OP_THAW`. Bonds, reactions, seeds, pressure constraints — each has a corresponding opcode that does exactly what the tree-walker's evaluator function did, just driven by bytecode dispatch instead of recursive AST traversal.

## Performance

I haven't done rigorous benchmarking, and I'm deliberately not making performance claims. The motivation for the bytecode VM wasn't speed — it was consistency (one execution path everywhere) and architectural cleanliness (the VM is easier to extend than the tree-walker's deeply nested switch statements).

That said, bytecode VMs are generally faster than tree-walkers for the structural reasons mentioned earlier: better cache locality (sequential byte array vs. pointer-chasing through AST nodes), less call overhead (one switch dispatch vs. recursive function calls), and a compact representation that fits more of the program in cache. Whether this matters for Lattice programs depends on the workload. For a language whose core runtime cost is dominated by deep cloning, the dispatch overhead is rarely the bottleneck.

## Looking Forward

The VM is feature-complete but not optimized. There's no constant folding, no dead code elimination, no register allocation (it's a pure stack machine). The `OP_SCOPE` and `OP_SELECT` concurrency opcodes still delegate to the tree-walker. The dispatch loop is a plain switch rather than computed gotos.

These are all well-understood optimizations with clear implementation paths. The point of v0.3.1 is that the bytecode VM is now the default, passes all tests, and handles the full language surface including the phase system. Optimization is a separate project.

The source code is at [github.com/ajokela/lattice](https://baud.rs/fIe3gx), and you can try it in the browser at [lattice-lang.web.app](https://baud.rs/bwvnYT). The bytecode VM, compiler, REPL, and all 62 opcodes are in four files: `compiler.c`, `vm.c`, `chunk.c`, and `opcode.c`.

```
git clone https://github.com/ajokela/lattice.git
cd lattice && make
./clat
```

## Recommended Resources

- [*Crafting Interpreters*](https://baud.rs/crafting-interpreters) by Robert Nystrom - The definitive guide to building interpreters and bytecode VMs, and a major influence on Lattice's upvalue implementation
- [*Writing A Compiler In Go*](https://baud.rs/P6ofTE) by Thorsten Ball - Practical companion covering bytecode compilation and stack-based VMs
- [*Engineering a Compiler*](https://baud.rs/BSTqlt) by Cooper & Torczon - Comprehensive treatment of compiler internals from front-end to optimization
- [*Compilers: Principles, Techniques, and Tools*](https://baud.rs/JhMFPU) by Aho, Lam, Sethi, Ullman - The classic *Dragon Book* covering parsing, code generation, and optimization theory
