Lattice is a programming language built around a crystallization-based phase system — values start as mutable "flux" and can be frozen into immutable "fix," with the runtime enforcing the transition and providing reactions, bonds, contracts, and temporal tracking around it. It's implemented in C with no external dependencies.
When I started building Lattice, a tree-walking interpreter was the obvious first move. You parse source into an AST, walk the nodes recursively, and evaluate as you go. It's straightforward, easy to debug, and lets you iterate on language semantics quickly without worrying about a second representation. Crafting Interpreters calls this approach "the simplest way to build an interpreter," and it's right.
But tree-walkers have well-known limitations. Every expression evaluation descends through function calls — eval_expr calling eval_binary calling eval_expr twice more. The overhead compounds. You're chasing pointers through heap-allocated AST nodes with poor cache locality. And the call stack of the host language (C, in Lattice's case) becomes tangled with the call stack of the guest language, making it harder to implement features like error recovery and coroutines cleanly.
Lattice v0.3.0 shipped a bytecode compiler and stack-based virtual machine alongside the tree-walker. In v0.3.1, the bytecode VM became the default for file execution, the interactive REPL, and the browser-based playground. The tree-walker is still available via --tree-walk, but the VM now handles everything. This post walks through the architecture of that VM, some design decisions that turned out to matter, and a mutation bug that only surfaces when you combine deep-clone-on-read semantics with in-place method dispatch.
Architecture Overview
The bytecode pipeline has three stages: lexing and parsing (shared with the tree-walker), compilation from AST to bytecode chunks, and execution on a stack-based VM. The compiler and VM together add about 8,200 lines of C to the codebase, bringing the total to around 33,000 lines.
A Chunk is the compilation unit — a dynamic array of bytecode instructions, a constant pool, and debug metadata mapping instructions back to source line numbers. The compiler walks the AST and emits bytes into a chunk. The VM reads bytes from the chunk and executes them against a value stack.
typedef struct {
    uint8_t *code;            // bytecode array
    size_t len, cap;
    LatValue *constants;      // constant pool
    size_t const_len, const_cap;
    int *lines;               // source line per instruction
    char **local_names;       // slot → variable name (debug)
    size_t local_name_cap;
} Chunk;
The VM itself is a for(;;) loop with a switch on the current opcode byte — the textbook approach. No computed gotos, no threaded dispatch, no JIT. Just a switch. On modern hardware with branch prediction, a well-organized switch over 62 opcodes is fast enough that the overhead is negligible compared to the cost of actual operations (string allocation, hash table lookups, deep cloning).
for (;;) {
    uint8_t op = READ_BYTE();
    switch (op) {
        case OP_CONSTANT: { ... }
        case OP_ADD:      { ... }
        case OP_CALL:     { ... }
        // 59 more cases
    }
}
The value stack holds 4,096 slots. The call frame stack holds 256 frames. Each CallFrame tracks its own instruction pointer, a base pointer into the value stack for its local variables, and an array of captured upvalues for closures. When you call a function, the VM pushes a new frame pointing at the callee's chunk. When the function returns, the frame pops and execution resumes in the caller.
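For orientation, here is roughly the shape a CallFrame takes under this design. This is a sketch assembled from the description above; the actual field names in vm.c may differ.

typedef struct {
    Chunk *chunk;                 // the callee's compiled chunk
    uint8_t *ip;                  // per-frame instruction pointer into chunk->code
    LatValue *slots;              // base pointer into the value stack (locals live here)
    struct ObjUpvalue **upvalues; // captured upvalues for closures (ObjUpvalue shown below)
    size_t upvalue_count;
} CallFrame;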
The Instruction Set
Lattice's instruction set has 62 opcodes. Some are standard — OP_ADD, OP_JUMP_IF_FALSE, OP_RETURN. Others exist because of Lattice-specific semantics.
The phase system needs dedicated opcodes. OP_FREEZE pops a value, deep-clones it into a crystal region with VTAG_CRYSTAL tags, and pushes the frozen result. OP_THAW does the reverse. OP_MARK_FLUID sets the phase tag to VTAG_FLUID — this is what flux bindings emit after their initializer. OP_FREEZE_VAR and OP_THAW_VAR handle the case where freeze(x) targets a named variable and needs to write back the result, carrying extra operands to identify the variable's location (local slot, upvalue, or global name).
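To make that concrete, the simple value-on-stack freeze case boils down to a pop, a clone, and a push. A minimal sketch, assuming vm_push/vm_pop helpers and a value_freeze helper that performs the crystal-region deep clone (those names are illustrative, not from the source):

case OP_FREEZE: {
    LatValue v = vm_pop(vm);
    LatValue frozen = value_freeze(&v);  // deep-clone into a crystal region,
                                         // tagging the copy VTAG_CRYSTAL
    vm_push(vm, frozen);
    break;
}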
Phase reactions and bonds each have their own opcodes: OP_REACT, OP_UNREACT, OP_BOND, OP_UNBOND, OP_SEED, OP_UNSEED. These could theoretically be implemented as native function calls, but making them opcodes lets the compiler emit the variable name as a constant operand — the VM needs the name to look up the correct reaction/bond registration in its tracking tables, and encoding it in the bytecode avoids a runtime string lookup.
Structured concurrency uses an interesting hybrid. OP_SCOPE and OP_SELECT each carry a constant-pool index whose entry holds a pointer to the original AST Expr* node. When the VM hits one of these opcodes, it invokes the tree-walking evaluator on that subtree. This is a deliberate design choice — the concurrency primitives involve spawning threads and managing channels, which requires the evaluator's full environment machinery. Rather than reimplement all of that in the VM, the bytecode compiler punts to the tree-walker for these specific constructs. The rest of the program runs on the VM; only scope and select blocks briefly drop into interpretation.
Closures and Upvalues
Closures are where bytecode VMs get interesting, and Lattice follows the upvalue model that Lua pioneered and Crafting Interpreters popularized.
When a function is defined inside another function and references variables from the enclosing scope, those variables need to outlive their original stack frame. The solution is upvalues — indirection objects that start pointing into the stack and get "closed over" when the variable goes out of scope.
typedef struct ObjUpvalue {
    LatValue *location;       // points to stack slot or &closed
    LatValue closed;          // holds value after scope exit
    struct ObjUpvalue *next;  // linked list for open upvalues
} ObjUpvalue;
While the enclosing function is still executing, location points directly at the stack slot. When the enclosing function returns, OP_CLOSE_UPVALUE copies the stack value into the closed field and repoints location to &closed. The closure doesn't know or care about the switchover — it always dereferences location. This is why upvalues work: they're a level of indirection that transparently survives stack frame destruction.
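The close operation itself is small. A minimal sketch, assuming the VM tracks open upvalues in a list sorted by stack address (the arrangement Lua and Crafting Interpreters use); the helper name and VM field are illustrative:

static void close_upvalues(VM *vm, LatValue *last) {
    // Close every open upvalue at or above `last` on the value stack.
    while (vm->open_upvalues != NULL && vm->open_upvalues->location >= last) {
        ObjUpvalue *uv = vm->open_upvalues;
        uv->closed = *uv->location;   // copy the dying stack value into the heap
        uv->location = &uv->closed;   // repoint; closures keep dereferencing location
        vm->open_upvalues = uv->next;
    }
}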
The compiler resolves variable references in three stages: first it checks local scope (resolve_local), then upvalues (resolve_upvalue, which walks the compiler chain recursively), then falls back to globals via OP_GET_GLOBAL. The OP_CLOSURE instruction is followed by a series of (is_local, index) byte pairs, one per upvalue, telling the VM whether to capture from the current frame's stack or from the parent frame's upvalue array.
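The recursive middle stage mirrors Crafting Interpreters closely. A sketch, under the assumption that each Compiler links to its enclosing compiler and that an add_upvalue helper records (is_local, index) descriptors without duplicates:

static int resolve_upvalue(Compiler *c, const char *name) {
    if (c->enclosing == NULL) return -1;   // top level: fall back to globals

    int local = resolve_local(c->enclosing, name);
    if (local != -1)
        return add_upvalue(c, (uint8_t)local, true);    // capture the parent's local

    int upvalue = resolve_upvalue(c->enclosing, name);  // recurse outward
    if (upvalue != -1)
        return add_upvalue(c, (uint8_t)upvalue, false); // capture the parent's upvalue

    return -1;
}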
A concrete example makes this clearer. Consider a counter factory:
fn make_counter() {
flux count = 0
return |n| { count += n; count }
}
let c = make_counter()
print(c(5)) // 5
print(c(3)) // 8
When make_counter returns, its stack frame is destroyed — but count needs to survive, because the returned closure references it. During compilation, the compiler sees that the closure's body references count, which is local to the enclosing make_counter. It emits an (is_local=true, index=1) upvalue descriptor. At runtime, OP_CLOSURE calls capture_upvalue(), which either reuses an existing ObjUpvalue pointing at that stack slot or creates a new one. When make_counter returns, OP_CLOSE_UPVALUE copies the stack value of count into the upvalue's closed field and repoints location. The closure keeps working, oblivious to the frame being gone.
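capture_upvalue is where the reuse happens, and it matters: two closures capturing the same variable must share one ObjUpvalue, or their views of the variable would diverge once it closes. A sketch over the same sorted open-upvalue list (allocation details are assumptions):

static ObjUpvalue *capture_upvalue(VM *vm, LatValue *slot) {
    ObjUpvalue *prev = NULL;
    ObjUpvalue *uv = vm->open_upvalues;
    while (uv != NULL && uv->location > slot) {  // list sorted by stack address
        prev = uv;
        uv = uv->next;
    }
    if (uv != NULL && uv->location == slot)
        return uv;                               // reuse: closures must share

    ObjUpvalue *created = calloc(1, sizeof(ObjUpvalue));
    created->location = slot;                    // still open: points at the stack
    created->next = uv;
    if (prev == NULL) vm->open_upvalues = created;
    else prev->next = created;
    return created;
}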
One implementation detail worth noting: Lattice stores the upvalue array by repurposing the closure's captured_env field (normally an Env* in the tree-walker) and the upvalue count in the region_id field. This avoids adding new fields to the LatValue union, which matters when values are deep-cloned frequently — every field adds to the clone cost.
Compiling for the REPL
A REPL that runs on a bytecode VM needs different compilation from file execution. The difference is small but important.
In file mode, compile_module() compiles a complete program and terminates with OP_UNIT; OP_RETURN — the module returns unit, and any expression results along the way are discarded with OP_POP. This is the right behavior for scripts: you don't want every intermediate expression to accumulate on the stack.
In REPL mode, compile_repl() needs the opposite behavior for the last expression. When you type 42 at the REPL prompt, you want to see => 42. So if the last item in the parsed program is a bare expression statement, compile_repl() compiles the expression but skips the OP_POP, leaving the value on the stack. Then it emits OP_RETURN, and the VM receives the value as the chunk's return value.
bool last_is_expr =
    (prog->item_count > 0 &&
     prog->items[prog->item_count - 1].tag == ITEM_STMT &&
     prog->items[prog->item_count - 1].as.stmt->tag == STMT_EXPR);
if (last_is_expr)
    emit_byte(OP_RETURN, 0);   // value already on stack
else {
    emit_byte(OP_UNIT, 0);     // no expression — return unit
    emit_byte(OP_RETURN, 0);
}
For function definitions, struct declarations, and enum definitions, the result is unit, and the REPL silently suppresses the => output. This matches user expectations — defining a function shouldn't print anything. The effect in practice:
lattice> 42
=> 42
lattice> "hello" + " world"
=> "hello world"
lattice> flux x = 10
lattice> x * 2
=> 20
lattice> fn square(n: Int) -> Int { n * n }
lattice> square(7)
=> 49
Each line is independently compiled and executed on the persistent VM. Globals defined in one line (flux x = 10) are visible in subsequent lines because they're stored in the VM's environment, which persists across iterations. The Chunk for each line is freed after execution — constants that matter (like global variable values) have already been deep-cloned into the environment.
The other critical difference is enum persistence. compile_module() frees its known-enum registry after compilation, because the compiler is done. compile_repl() must not, because enums defined in REPL iteration N need to be visible in iteration N+1. The REPL calls compiler_free_known_enums() only on exit. The same lifetime concern applies to parsed programs — struct and function declarations store Expr* pointers that compiled chunks reference at runtime. The REPL accumulates all parsed programs in a dynamic array and frees them only when the session ends.
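Putting those lifetime rules together, the driver loop comes out shaped roughly like this. Everything here other than the functions named in the text (compile_repl, vm_run, compiler_free_known_enums) is a hypothetical helper name, and the signatures are guesses:

for (;;) {
    char *line = read_line("lattice> ");
    if (!line) break;
    Program *prog = parse_program(line);
    programs_append(&session_programs, prog);  // Expr* nodes must outlive the chunk
    Chunk chunk;
    chunk_init(&chunk);
    compile_repl(prog, &chunk);
    LatValue result = vm_run(vm, &chunk);      // persistent VM: globals survive
    if (!value_is_unit(&result))
        print_result(&result);                 // the "=> ..." line, suppressed for unit
    chunk_free(&chunk);                        // safe: constants already cloned into env
}
compiler_free_known_enums();                   // enums persist until the session ends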
The Global Mutation Bug
This is the story I find most instructive, because it reveals a subtle interaction between two independently reasonable design decisions.
Lattice has deep-clone-on-read semantics. When you access a variable, the environment doesn't hand you a reference to the stored value — it hands you a fresh deep clone. This eliminates aliasing entirely: two variables never share underlying memory, passing a map to a function gives the function its own copy, and there's no way to create spooky action at a distance through shared mutable state.
bool env_get(const Env *env, const char *name, LatValue *out) {
    for (size_t i = env->count; i > 0; i--) {
        LatValue *v = lat_map_get(&env->scopes[i - 1], name);
        if (v) {
            *out = value_deep_clone(v);  // always a fresh copy
            return true;
        }
    }
    return false;
}
This is expensive but correct. It gives Lattice pure value semantics without needing a borrow checker or persistent data structures.
The tree-walking evaluator handles in-place mutation (like array.push()) with a separate resolve_lvalue() mechanism that obtains a direct mutable pointer into the environment's storage, bypassing the deep clone. Push, pop, index assignment — these all go through resolve_lvalue and mutate the stored value directly.
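In spirit, resolve_lvalue is env_get without the clone. A sketch under the same scope-chain layout as the env_get shown above (the real signature may differ):

LatValue *resolve_lvalue(Env *env, const char *name) {
    for (size_t i = env->count; i > 0; i--) {
        LatValue *v = lat_map_get(&env->scopes[i - 1], name);
        if (v) return v;  // direct mutable pointer into storage: no deep clone
    }
    return NULL;
}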
The bytecode VM needed the same distinction. For local variables, this is straightforward: locals live on the value stack, and the VM has a direct pointer to them via frame->slots[slot]. I added OP_INVOKE_LOCAL, which takes a stack slot index as an operand and passes a pointer to vm_invoke_builtin():
case OP_INVOKE_LOCAL: {
    uint8_t slot = READ_BYTE();
    uint8_t method_idx = READ_BYTE();
    uint8_t arg_count = READ_BYTE();
    const char *method_name = frame->chunk->constants[method_idx].as.str_val;
    LatValue *obj = &frame->slots[slot];  // direct pointer
    if (vm_invoke_builtin(vm, obj, method_name, arg_count, local_var_name)) {
        // builtin mutated obj in-place — mutation persists
        break;
    }
    // ... fall through to closure/method dispatch
}
When .push() grows the array by reallocating obj->as.array.elems and incrementing obj->as.array.len, it's directly modifying the stack slot. The mutation persists because obj is the variable.
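For illustration, an in-place push over that pointer could look like the sketch below; everything beyond the as.array.elems and as.array.len fields named above (the cap field, the helper name) is an assumption:

static void builtin_array_push(LatValue *obj, LatValue elem) {
    if (obj->as.array.len == obj->as.array.cap) {
        size_t new_cap = obj->as.array.cap < 8 ? 8 : obj->as.array.cap * 2;
        obj->as.array.elems = realloc(obj->as.array.elems, new_cap * sizeof(LatValue));
        obj->as.array.cap = new_cap;
    }
    obj->as.array.elems[obj->as.array.len++] = elem;
    // When obj is &frame->slots[slot], this writes straight into the stack slot.
}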
For globals, the situation is different. Globals live in the environment (a scope-chain of hash maps), and env_get() deep-clones. The generic OP_INVOKE opcode works by evaluating the receiver expression onto the stack — which, for a global variable, means emitting OP_GET_GLOBAL, which calls env_get(), which deep-clones — and then dispatching the method on the cloned value. After the builtin mutates the clone, OP_INVOKE pops and frees it. The mutation vanishes.
flux nums = [1, 2, 3]
nums.push(4)
print(nums)  // still [1, 2, 3] — the push mutated a clone
This is the kind of bug that's obvious in retrospect but invisible when you're implementing things one piece at a time. env_get() deep-cloning is correct. OP_INVOKE popping the receiver after dispatch is correct. Each piece behaves correctly in isolation. The bug emerges from their composition.
The fix is OP_INVOKE_GLOBAL — a new opcode that knows the receiver is a global variable and writes back after mutation:
case OP_INVOKE_GLOBAL: {
    uint8_t name_idx = READ_BYTE();
    uint8_t method_idx = READ_BYTE();
    uint8_t arg_count = READ_BYTE();
    const char *global_name = frame->chunk->constants[name_idx].as.str_val;
    const char *method_name = frame->chunk->constants[method_idx].as.str_val;
    LatValue obj_val;
    if (!env_get(vm->env, global_name, &obj_val)) {
        VM_ERROR("undefined variable '%s'", global_name);
        break;
    }
    if (vm_invoke_builtin(vm, &obj_val, method_name, arg_count, global_name)) {
        if (vm->error) { /* handle error */ }
        // Write back the mutated clone to the environment
        env_set(vm->env, global_name, obj_val);
        break;
    }
    // ... fall through for non-builtin methods
}
The compiler emits OP_INVOKE_GLOBAL when it sees a method call on an identifier that isn't a local variable or an upvalue:
case EXPR_METHOD_CALL: {
    if (e->as.method_call.object->tag == EXPR_IDENT) {
        int slot = resolve_local(current, e->as.method_call.object->as.str_val);
        if (slot >= 0) {
            // ... emit OP_INVOKE_LOCAL
        }
        int upvalue = resolve_upvalue(current, e->as.method_call.object->as.str_val);
        if (upvalue < 0) {
            // Not local, not upvalue — must be global
            // ... emit OP_INVOKE_GLOBAL
        }
    }
    // ... fall through to generic OP_INVOKE
}
This gives us three tiers of method dispatch: OP_INVOKE_LOCAL for locals (direct pointer, no clone), OP_INVOKE_GLOBAL for globals (clone + write-back), and OP_INVOKE for everything else (computed receivers like get_array().push(x), where there's nothing to write back to). With the fix:
flux nums = [1, 2, 3]
nums.push(4)
nums.push(5)
print(nums)  // [1, 2, 3, 4, 5] — mutations persist
All mutating builtins — push, pop, set, remove, insert, remove_at — now work correctly on global variables. The same pattern applies to maps, sets, and any other type with in-place methods.
The broader lesson is that deep-clone-on-read semantics create an impedance mismatch with in-place mutation. In a reference-based language, obj.push(x) just works — obj is a reference, and the mutation happens wherever the reference points. In a value-based language, you need to explicitly handle the write-back for every level of variable storage. The tree-walker's resolve_lvalue is one solution. The VM's tiered invoke opcodes are another. Both exist because of the same underlying tension.
The WASM Playground
Lattice's browser-based playground compiles the entire VM to WebAssembly via Emscripten. The WASM API exposes four functions: lat_init(), lat_run_line(), lat_is_complete(), and lat_destroy().
The playground runs the same bytecode VM as the native binary. Each line of input goes through the same pipeline: lex, parse, compile_repl(), vm_run(). The lat_is_complete() function checks bracket depth to determine whether the user is mid-expression, enabling multi-line input by waiting for balanced braces before compiling.
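That check is simple enough to sketch in full, assuming it only tracks bracket nesting (string-literal handling elided; the body is illustrative, only the function name comes from the API above):

int lat_is_complete(const char *src) {
    int depth = 0;
    for (const char *p = src; *p; p++) {
        if (*p == '(' || *p == '[' || *p == '{') depth++;
        else if (*p == ')' || *p == ']' || *p == '}') depth--;
    }
    return depth <= 0;  // unbalanced openers mean the user is mid-expression
}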
Previously the playground used the tree-walking evaluator, which meant code could behave differently in the browser than on the command line. Switching the WASM build to the bytecode VM eliminates that inconsistency — the playground, the REPL, and file execution all use the same compilation and execution path.
What Didn't Change
It's worth noting what the bytecode VM doesn't change about Lattice.
The value representation is identical. A LatValue is still a tagged union with a type tag, phase tag, and payload. Phase transitions still deep-clone data across heap regions. The dual-heap architecture — mark-and-sweep for fluid data, arena-based regions for crystal data — is unchanged. Global variables still live in a scope-chain environment.
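Pulling together the fields mentioned across this post, the value shape looks roughly like this. Only the two tags, region_id, captured_env, str_val, and the array fields come from the text; the layout and everything else is a sketch:

typedef struct LatValue {
    uint8_t type_tag;    // Int, Str, Array, Map, Closure, ...
    uint8_t phase_tag;   // VTAG_FLUID or VTAG_CRYSTAL
    uint32_t region_id;  // crystal region; reused as the upvalue count for VM closures
    union {
        int64_t int_val;
        char *str_val;
        struct { struct LatValue *elems; size_t len, cap; } array;
        struct Env *captured_env;  // tree-walker env; reused as the upvalue array in the VM
    } as;
} LatValue;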
The parser and AST are completely shared. The compiler reads the same Program structure that the tree-walker reads. A single set of test programs validates both execution paths — all 771 tests pass on both.
The phase system compiles one-to-one. freeze() becomes OP_FREEZE. thaw() becomes OP_THAW. Bonds, reactions, seeds, pressure constraints — each has a corresponding opcode that does exactly what the tree-walker's evaluator function did, just driven by bytecode dispatch instead of recursive AST traversal.
Performance
I haven't done rigorous benchmarking, and I'm deliberately not making performance claims. The motivation for the bytecode VM wasn't speed — it was consistency (one execution path everywhere) and architectural cleanliness (the VM is easier to extend than the tree-walker, with its deeply nested switch statements).
That said, bytecode VMs are generally faster than tree-walkers for the structural reasons mentioned earlier: better cache locality (sequential byte array vs. pointer-chasing through AST nodes), less call overhead (one switch dispatch vs. recursive function calls), and a compact representation that fits more of the program in cache. Whether this matters for Lattice programs depends on the workload. For a language whose core runtime cost is dominated by deep cloning, the dispatch overhead is rarely the bottleneck.
Looking Forward
The VM is feature-complete but not optimized. There's no constant folding, no dead code elimination, no register allocation (it's a pure stack machine). The OP_SCOPE and OP_SELECT concurrency opcodes still delegate to the tree-walker. The dispatch loop is a plain switch rather than computed gotos.
These are all well-understood optimizations with clear implementation paths. The point of v0.3.1 is that the bytecode VM is now the default, passes all tests, and handles the full language surface including the phase system. Optimization is a separate project.
The source code is at github.com/ajokela/lattice, and you can try it in the browser at lattice-lang.web.app. The bytecode VM, compiler, REPL, and all 62 opcodes are in four files: compiler.c, vm.c, chunk.c, and opcode.c.
git clone https://github.com/ajokela/lattice.git
cd lattice && make
./clat
Recommended Resources
- Crafting Interpreters by Robert Nystrom - The definitive guide to building interpreters and bytecode VMs, and a major influence on Lattice's upvalue implementation
- Writing A Compiler In Go by Thorsten Ball - Practical companion covering bytecode compilation and stack-based VMs
- Engineering a Compiler by Cooper & Torczon - Comprehensive treatment of compiler internals from front-end to optimization
- Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, Ullman - The classic "Dragon Book" covering parsing, code generation, and optimization theory