When I first wrote about Lattice's move from a tree-walking interpreter to a bytecode VM, the instruction set had 62 opcodes, concurrency primitives still delegated to the tree-walker, and programs couldn't be serialized. The VM was a foundation — correct and complete enough to become the default, but clearly a starting point.
That was ten versions ago. The bytecode VM now has 100 opcodes, compiles concurrency primitives into standalone sub-chunks with zero AST dependency at runtime, ships a binary serialization format for ahead-of-time compilation, includes an ephemeral bump arena for short-lived string temporaries, and — perhaps most satisfyingly — has a self-hosted compiler written entirely in Lattice that produces the same .latc bytecode files as the C implementation.
This post walks through what changed and why. The full technical treatment is available as a research paper; this is the practitioner's version.
Why Keep Going
The original bytecode VM solved the immediate problems: it eliminated recursive AST dispatch overhead and gave Lattice a single execution path for file execution, the REPL, and the WASM playground. But three issues remained.
First, OP_SCOPE and OP_SELECT — Lattice's structured concurrency opcodes — still stored AST node pointers in the constant pool and dropped into the tree-walking evaluator at runtime. This meant the AST had to stay alive during concurrent execution, which defeated one of the main motivations for having a bytecode VM in the first place.
Second, the AST dependency made serialization impossible. You can serialize bytecode to a file, but you can't easily serialize an arbitrary C pointer to an AST node. Programs had to be parsed and compiled on every run.
Third, the dispatch loop used a plain switch statement. Not a crisis, but computed goto dispatch is a well-known improvement for bytecode interpreters, and leaving it on the table felt unnecessary.
All three problems are solved now. Let me start with the instruction set, since everything else builds on it.
100 Opcodes
The instruction set grew from 62 to 100 opcodes, organized into 16 functional categories:
| Category | Representative opcodes | Count |
|---|---|---|
| Stack manipulation | CONSTANT, NIL, TRUE, FALSE, UNIT, POP, DUP, SWAP | 8 |
| Arithmetic/logical | ADD, SUB, MUL, DIV, MOD, NEG, NOT | 7 |
| Bitwise | BIT_AND, BIT_OR, BIT_XOR, BIT_NOT, LSHIFT, RSHIFT | 6 |
| Comparison | EQ, NEQ, LT, GT, LTEQ, GTEQ | 6 |
| String | CONCAT | 1 |
| Variables | GET/SET_LOCAL, GET/SET/DEFINE_GLOBAL, GET/SET_UPVALUE, CLOSE_UPVALUE | 8 |
| Control flow | JUMP, JUMP_IF_FALSE, JUMP_IF_TRUE, JUMP_IF_NOT_NIL, LOOP | 5 |
| Functions | CALL, CLOSURE, RETURN | 3 |
| Iterators | ITER_INIT, ITER_NEXT | 2 |
| Data structures | BUILD_ARRAY, INDEX, SET_INDEX, GET_FIELD, INVOKE, etc. | 15 |
| Exceptions/defer | PUSH_EXCEPTION_HANDLER, THROW, DEFER_PUSH, DEFER_RUN, etc. | 6 |
| Phase system | FREEZE, THAW, CLONE, MARK_FLUID, REACT, BOND, SEED, etc. | 14 |
| Builtins/modules | PRINT, IMPORT | 2 |
| Concurrency | SCOPE, SELECT | 2 |
| Integer fast paths | INC_LOCAL, DEC_LOCAL, ADD_INT, SUB_INT, LOAD_INT8, etc. | 8 |
| Wide variants | CONSTANT_16, GET_GLOBAL_16, SET_GLOBAL_16, DEFINE_GLOBAL_16, CLOSURE_16 | 5 |
| Special | RESET_EPHEMERAL, HALT | 2 |
| Total | | 100 |
The growth came from three directions: the integer fast-path opcodes (8 new), the wide constant variants (5 new), and the concurrency/arena opcodes. Let me explain each.
Integer Fast Paths
Tight loops like for i in 0..1000 spend most of their time incrementing a counter and comparing it to a bound. The generic OP_ADD has to check whether its operands are integers, floats, or strings (for concatenation), which adds branching overhead on every iteration.
The integer fast-path opcodes — OP_ADD_INT, OP_SUB_INT, OP_MUL_INT, OP_LT_INT, OP_LTEQ_INT — skip the type check entirely and operate directly on int64_t values. OP_INC_LOCAL and OP_DEC_LOCAL handle the i += 1 and i -= 1 patterns as single-byte instructions that modify the stack slot in place, no push or pop required.
OP_LOAD_INT8 encodes a signed byte directly in the instruction stream. The integer 42 becomes two bytes (OP_LOAD_INT8, 0x2A) instead of a three-byte OP_CONSTANT plus an eight-byte constant pool entry. Any integer in [-128, 127] gets this treatment.
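To make the payoff concrete, here is a minimal, self-contained sketch of the difference between the generic and specialized handlers. The Value layout and every name in it are illustrative, not Lattice's actual types.

```c
#include <stdint.h>

/* Illustrative tagged value -- not Lattice's real representation. */
typedef enum { T_INT, T_FLOAT, T_STRING } Tag;
typedef struct { Tag tag; union { int64_t i; double f; } as; } Value;

/* Generic OP_ADD: must branch on the operand types on every iteration. */
static Value add_generic(Value a, Value b) {
    if (a.tag == T_INT && b.tag == T_INT)
        return (Value){ T_INT, { .i = a.as.i + b.as.i } };
    /* ... float addition, string concatenation, type-error paths ... */
    return a;
}

/* OP_ADD_INT: the compiler already knows both operands are integers, so
 * the handler reduces to one machine add on the top two stack slots. */
static inline void add_int_fast(Value *stack_top) {
    stack_top[-2].as.i += stack_top[-1].as.i;  /* result replaces slot -2;
                                                  the VM then drops slot -1 */
}
```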
Wide Constant Variants
The original instruction set used a single byte for constant pool indices, limiting each chunk to 256 constants. This is fine for most functions, but the self-hosted compiler — a 2,000-line Lattice program compiled as a single top-level script — blows past that limit easily.
OP_CONSTANT_16, OP_GET_GLOBAL_16, OP_SET_GLOBAL_16, OP_DEFINE_GLOBAL_16, and OP_CLOSURE_16 use two-byte big-endian indices, supporting up to 65,536 constants per chunk. The compiler automatically switches to wide variants when an index exceeds 255.
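The selection logic is mechanical. Here is a self-contained sketch of how an emitter might choose between the two forms; the opcode values and the Chunk type are stand-ins, not the real compiler's.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative opcode values and chunk layout -- not the real ones. */
enum { OP_CONSTANT = 1, OP_CONSTANT_16 = 90 };

typedef struct { uint8_t code[4096]; size_t len; } Chunk;

static void emit_byte(Chunk *c, uint8_t b) { c->code[c->len++] = b; }

/* Pick the one-byte or two-byte (big-endian) index form automatically. */
static void emit_constant(Chunk *c, uint16_t index) {
    if (index <= 255) {
        emit_byte(c, OP_CONSTANT);
        emit_byte(c, (uint8_t)index);
    } else {
        emit_byte(c, OP_CONSTANT_16);
        emit_byte(c, (uint8_t)(index >> 8));    /* high byte first */
        emit_byte(c, (uint8_t)(index & 0xFF));
    }
}
```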
The Compiler
The bytecode compiler performs a single-pass walk over the AST. It maintains a chain of Compiler structs linked via enclosing pointers — one per function being compiled. Variable references resolve through three tiers: local (scan the current compiler's locals array), upvalue (recursively check enclosing compilers), and global (fall through to OP_GET_GLOBAL).
Three compilation modes handle different use cases. compile() is the standard file mode — it compiles all declarations and emits an implicit call to main() if one is defined. compile_module() is for imports — identical to compile() but skips the auto-call. compile_repl() preserves the last expression on the stack as the iteration's return value (displayed with => prefix) and keeps the known-enum table alive across REPL iterations so enum declarations persist.
The compiler implements several optimizations during code generation. Binary operations on literal operands are folded at compile time — 3 + 4 emits a single OP_LOAD_INT8 7 rather than two loads and an OP_ADD. The pattern x += 1 is detected and emitted as the single-byte OP_INC_LOCAL, which modifies the stack slot in place. And every statement is wrapped by compile_stmt_reset(), which appends OP_RESET_EPHEMERAL to trigger the ephemeral arena cleanup.
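Here is a self-contained sketch of the folding check for integer addition; the Expr shape and helper name are illustrative, not the real compiler's.

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of compile-time folding for integer literals.  The Expr shape
 * and helper name are illustrative, not taken from the C compiler. */
typedef struct { bool is_int_literal; int64_t value; } Expr;

static bool try_fold_add(const Expr *lhs, const Expr *rhs, int64_t *out) {
    if (lhs->is_int_literal && rhs->is_int_literal) {
        *out = lhs->value + rhs->value;   /* 3 + 4 folds to 7 at compile time */
        return true;                      /* caller emits OP_LOAD_INT8 7      */
    }
    return false;                         /* caller emits two loads + OP_ADD  */
}
```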
Computed Goto Dispatch
The dispatch loop now uses GCC/Clang's labels-as-values extension for computed goto:
```c
#ifdef VM_USE_COMPUTED_GOTO
static void *dispatch_table[] = {
    [OP_CONSTANT] = &&lbl_OP_CONSTANT,
    [OP_NIL]      = &&lbl_OP_NIL,
    // ... all 100 entries
};
#endif

for (;;) {
    uint8_t op = READ_BYTE();
#ifdef VM_USE_COMPUTED_GOTO
    goto *dispatch_table[op];
#endif
    switch (op) {
        // ...
    }
}
```
Each opcode handler ends with a goto *dispatch_table[READ_BYTE()] rather than breaking back to the top of the loop. This eliminates the switch statement's bounds check and branch table indirection, replacing it with a single indirect jump. The CPU's branch predictor sees different jump sites for different opcodes, which improves prediction accuracy compared to a single switch that all opcodes funnel through.
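A common way to express that handler ending, while keeping a portable fallback, is a small dispatch macro. The macro name and label scheme below are assumptions, not the actual vm.c.

```c
/* Sketch only: DISPATCH() and the lbl_ naming are assumed, not verbatim. */
#ifdef VM_USE_COMPUTED_GOTO
  #define DISPATCH() goto *dispatch_table[READ_BYTE()]
#else
  #define DISPATCH() break   /* return to the top of the for(;;)/switch loop */
#endif

lbl_OP_ADD:
    /* ... pop operands, push result ... */
    DISPATCH();
```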
On platforms without the extension, it falls back to a standard switch. The VM works correctly either way.
Pre-Compiled Concurrency
This is the change I'm most pleased with, because it solves the problem cleanly.
Lattice has three concurrency primitives: scope defines a concurrent region, spawn launches a task within that region, and select multiplexes over channels. In the tree-walker, these work by passing AST node pointers to spawned threads, which then evaluate the subtrees independently. The bytecode VM's original implementation did the same thing — OP_SCOPE stored an Expr* pointer in the constant pool and called the tree-walking evaluator at runtime.
The solution is to compile each concurrent body into a standalone Chunk at compile time. The compiler provides two helpers: compile_sub_body() for statement blocks and compile_sub_expr() for expressions. Each creates a fresh Compiler, compiles the code into a new chunk, emits OP_HALT, and stores the resulting chunk in the parent's constant pool as a VAL_CLOSURE constant.
OP_SCOPE uses variable-length encoding: a spawn count, a sync body chunk index, and one chunk index per spawn body. At runtime, the VM:
- Exports locals to the global environment using the local_names debug table, so sub-chunks can access parent variables via OP_GET_GLOBAL
- Runs the sync body (if present) via a recursive vm_run() call
- Spawns threads for each spawn body, each running on a cloned VM
- Joins all threads and propagates errors
OP_SELECT similarly encodes per-arm metadata: flags, channel expression chunk index, body chunk index, and binding name index. The VM evaluates channel expressions, polls for readiness, and executes the winning arm.
The key insight is that sub-chunks run as FUNC_SCRIPT without lexical access to the parent's locals. Since they can't use upvalues to reach into the parent frame, the VM exports the parent's live locals into the global environment before running any sub-chunk, using a pushed scope that gets popped after all sub-chunks complete. This is slightly more expensive than true lexical capture, but it keeps the sub-chunks completely self-contained — no AST, no parent frame dependency, fully serializable.
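Putting the encoding and the locals-export step together, a rough sketch of the decode side might look like the following. The operand widths, the no-sync-body sentinel, and every helper name here are assumptions; only the operand order and the export/join steps come from the description above.

```c
/* Sketch only: operand widths, the 0xFF sentinel, and all helper names
 * are assumptions; the operand order follows the text above. */
case OP_SCOPE: {
    uint8_t spawn_count = READ_BYTE();
    uint8_t sync_index  = READ_BYTE();        /* constant-pool slot of sync body */

    export_locals_to_globals(vm, frame);      /* parent locals -> globals scope  */

    if (sync_index != 0xFF)                   /* assumed "no sync body" sentinel */
        run_sub_chunk(vm, constant_chunk(frame, sync_index));

    for (int i = 0; i < spawn_count; i++) {
        uint8_t body_index = READ_BYTE();     /* one chunk index per spawn body  */
        spawn_on_cloned_vm(vm, constant_chunk(frame, body_index));
    }

    join_all_and_propagate(vm);               /* join threads, surface errors    */
    pop_exported_scope(vm);                   /* drop the pushed globals scope   */
    break;
}
```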
Bytecode Serialization
With the AST dependency eliminated, serialization becomes straightforward. The .latc binary format starts with an 8-byte header:
```
[4C 41 54 43]   magic: "LATC"
[01 00]         format version: 1
[00 00]         reserved
```
The rest is a recursive chunk encoding: code length + bytecode bytes, line numbers for source mapping, typed constants (with a one-byte type tag for each), and local name debug info. Constants use seven type tags:
| Tag | Type | Encoding |
|---|---|---|
| 0 | Int | 8-byte signed LE |
| 1 | Float | 8-byte IEEE 754 |
| 2 | Bool | 1 byte |
| 3 | String | length-prefixed (u32 + bytes) |
| 4 | Nil | no payload |
| 5 | Unit | no payload |
| 6 | Closure | param count + variadic flag + recursive sub-chunk |
The Closure tag is what makes this recursive: a function constant contains its parameter metadata followed by a complete serialized sub-chunk. Nested functions serialize naturally to arbitrary depth.
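A sketch of how the recursive constant writer might look; the writer API and field names are assumptions, and only the tag values follow the table above.

```c
/* Sketch only: ByteWriter and every helper/field name are assumptions;
 * the tag values match the format table above. */
static void write_constant(ByteWriter *w, const LatValue *v) {
    switch (v->type) {
    case VAL_INT:
        write_u8(w, 0);
        write_i64_le(w, v->as.integer);
        break;
    case VAL_STRING:
        write_u8(w, 3);
        write_u32(w, (uint32_t)v->as.str.len);         /* length prefix */
        write_bytes(w, v->as.str.data, v->as.str.len);
        break;
    case VAL_CLOSURE:
        write_u8(w, 6);
        write_u8(w, v->as.closure.param_count);
        write_u8(w, v->as.closure.is_variadic);
        write_chunk(w, closure_chunk(v));              /* recurse into sub-chunk */
        break;
    /* ... Float (1), Bool (2), Nil (4), Unit (5) ... */
    }
}
```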
The CLI integrates this cleanly:
```
# Compile to .latc
clat compile input.lat -o output.latc

# Run pre-compiled bytecode (auto-detects .latc suffix)
clat output.latc

# Or compile and run in one step (the default)
clat input.lat
```
Loading validates magic bytes, checks the format version, and uses a bounds-checking ByteReader that produces descriptive error messages for truncated or malformed inputs.
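The reader side is easy to keep safe with a cursor that refuses to run off the end. Here is a self-contained sketch in that spirit; the field names and the u32 endianness are assumptions, not the actual loader.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Self-contained sketch of a bounds-checking reader; names are assumed. */
typedef struct {
    const uint8_t *data;
    size_t len, pos;
    bool error;            /* sticky flag: once set, all reads fail */
} ByteReader;

static bool reader_u8(ByteReader *r, uint8_t *out) {
    if (r->error || r->pos + 1 > r->len) { r->error = true; return false; }
    *out = r->data[r->pos++];
    return true;
}

static bool reader_u32(ByteReader *r, uint32_t *out) {
    if (r->error || r->pos + 4 > r->len) { r->error = true; return false; }
    /* little-endian assumed here, matching the 8-byte LE integer constants */
    *out = (uint32_t)r->data[r->pos]
         | (uint32_t)r->data[r->pos + 1] << 8
         | (uint32_t)r->data[r->pos + 2] << 16
         | (uint32_t)r->data[r->pos + 3] << 24;
    r->pos += 4;
    return true;
}
```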
The Ephemeral Bump Arena
String concatenation is a common source of short-lived allocations. An expression like "hello " + name + "!" creates intermediate strings that are immediately consumed and discarded. In a language with deep-clone-on-read semantics, these temporaries add up.
The ephemeral bump arena is a simple optimization: string concatenation in OP_ADD and OP_CONCAT allocates into a bump arena (vm->ephemeral) instead of the general-purpose heap. These allocations are tagged with REGION_EPHEMERAL, and OP_RESET_EPHEMERAL — emitted by the compiler at every statement boundary — resets the arena in O(1), reclaiming all temporary strings at once.
The tricky part is escape analysis. If a temporary string gets assigned to a global variable, stored in an array, or passed to a compiled closure, it needs to be promoted out of the ephemeral arena before the arena is reset. The VM handles this at specific escape points: OP_DEFINE_GLOBAL, OP_CALL (for compiled closures), array.push, and OP_SET_INDEX_LOCAL. Each of these calls vm_promote_value(), which deep-clones the string to the regular heap if its region is ephemeral.
The arena uses a page-based allocator with 4 KB pages. Resetting doesn't free pages — it just moves the bump pointer back to zero, so subsequent allocations reuse the same memory without any malloc/free overhead. The full design and safety proof are covered in a companion paper.
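Here is a self-contained sketch of a page-based bump arena with an O(1) reset; the names and layout are illustrative, not the VM's actual allocator.

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative page-based bump arena -- not the VM's real allocator. */
#define ARENA_PAGE_SIZE 4096

typedef struct Page {
    struct Page *next;
    size_t used;
    uint8_t bytes[ARENA_PAGE_SIZE];
} Page;

typedef struct { Page *head; Page *current; } Arena;

static void *arena_alloc(Arena *a, size_t size) {
    if (size > ARENA_PAGE_SIZE) return NULL;       /* oversized: use regular heap */
    if (!a->current || a->current->used + size > ARENA_PAGE_SIZE) {
        Page *next = a->current ? a->current->next : a->head;
        if (!next) {                               /* no spare page: grab one */
            next = calloc(1, sizeof(Page));
            if (!next) return NULL;
            if (a->current) a->current->next = next; else a->head = next;
        }
        next->used = 0;                            /* reuse pages kept from before */
        a->current = next;
    }
    void *ptr = a->current->bytes + a->current->used;
    a->current->used += size;
    return ptr;
}

/* OP_RESET_EPHEMERAL: rewind to the first page; pages are never freed,
 * so later statements reuse the same memory with no malloc/free traffic. */
static void arena_reset(Arena *a) {
    a->current = a->head;
    if (a->head) a->head->used = 0;
}
```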
Closures and the Storage Hack
The upvalue system hasn't changed architecturally since the first VM post — it's still the Lua-inspired open/closed model where ObjUpvalue structs start pointing into the stack and get closed (deep-cloned to the heap) when variables go out of scope. But the encoding grew to accommodate the wider instruction set.
OP_CLOSURE uses variable-length encoding: a constant pool index for the function's compiled chunk, an upvalue count, and then [is_local, index] byte pairs for each captured variable. OP_CLOSURE_16 uses a two-byte big-endian function index for chunks with more than 256 constants.
The storage hack — repurposing closure.body (NULL), closure.native_fn (Chunk pointer), closure.captured_env (ObjUpvalue** cast), and region_id (upvalue count) — remains unchanged. A sentinel value VM_NATIVE_MARKER distinguishes C-native functions from compiled closures:
```c
#define VM_NATIVE_MARKER ((struct Expr **)(uintptr_t)0x1)
```
A closure with body == NULL and native_fn != NULL is either a C native (if default_values == VM_NATIVE_MARKER) or a compiled bytecode function (otherwise). This avoids adding VM-specific fields to the LatValue union, which matters when values are deep-cloned frequently.
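Spelled out as code, the discrimination and the field repurposing look roughly like this; the struct and helper names are assumed, while the field names and VM_NATIVE_MARKER are the ones described above.

```c
/* Sketch only: the LatClosure type and helper names are assumed; the
 * field names and VM_NATIVE_MARKER are the ones described above. */
static bool is_c_native(const LatClosure *c) {
    return c->body == NULL && c->native_fn != NULL
        && c->default_values == VM_NATIVE_MARKER;
}

static bool is_vm_compiled(const LatClosure *c) {
    return c->body == NULL && c->native_fn != NULL
        && c->default_values != VM_NATIVE_MARKER;
}

/* For a compiled closure, the repurposed fields decode as: */
static Chunk       *closure_chunk(const LatClosure *c)  { return (Chunk *)c->native_fn; }
static ObjUpvalue **closure_upvals(const LatClosure *c) { return (ObjUpvalue **)c->captured_env; }
static size_t       closure_upval_count(const LatClosure *c) { return (size_t)c->region_id; }
```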
The Self-Hosted Compiler
The file compiler/latc.lat is a bytecode compiler written entirely in Lattice — approximately 2,060 lines that read .lat source, produce bytecode, and write .latc files using the same binary format as the C implementation:
```
# Use the self-hosted compiler
clat compiler/latc.lat input.lat output.latc

# Run the result
clat output.latc
```
The architecture mirrors the C compiler: lexing via the built-in tokenize() function, a recursive-descent parser, single-pass code emission, and scope management with upvalue resolution. But Lattice's value semantics required some creative workarounds.
The biggest constraint is that structs and maps are pass-by-value. In C, the compiler uses a Compiler struct with mutable fields — local arrays, scope depth, a chunk pointer. In Lattice, passing a struct to a function creates a copy, so mutations in the callee don't propagate back. The self-hosted compiler works around this with parallel global arrays: code, constants, c_lines, local_names, local_depths, local_captured. Since array mutations via .push() and index assignment are in-place (via resolve_lvalue), global arrays work where structs don't.
Nested function compilation uses explicit save_compiler() / restore_compiler() functions that copy all global arrays to local temporaries and back. It's verbose but correct. The Buffer type (used for serialization output) is also pass-by-value, so a global ser_buf accumulates serialized bytes across function calls.
Other language constraints: no else if (requires else { if ... } or match), mandatory type annotations on function parameters (fn foo(a: any)), and test is a keyword so you can't use it as an identifier.
The self-hosted compiler currently handles expressions, variables, functions with closures, control flow (if/else, while, loop, for, break, continue, match), structs, enums, exceptions, defer, string interpolation, and imports. Not yet implemented: concurrency primitives and advanced phase operations (react, bond, seed). The bootstrapping chain is:
```
latc.lat → [C VM interprets] → output.latc → [C VM executes]
```
Full self-hosting — where latc.lat compiles itself — requires adding concurrency support and closing the remaining feature gaps.
The VM Execution Engine
The VM maintains a 4,096-slot value stack, a 256-frame call stack, an exception handler stack (64 entries), a defer stack (256 entries), a global environment, the open upvalue linked list, the ephemeral arena, and a module cache. A pre-allocated fast_args[16] buffer avoids heap allocation for most native function calls.
The OP_CALL instruction discriminates three callee types. Native C functions (marked with VM_NATIVE_MARKER) get the fast path — arguments are popped into fast_args, the C function pointer is invoked, and the return value is pushed. No call frame allocated. Compiled closures get the full treatment: the VM promotes ephemeral values in the current frame (so the callee's OP_RESET_EPHEMERAL doesn't invalidate the caller's temporaries), then pushes a new CallFrame with the instruction pointer at byte 0 of the callee's chunk. Callable structs look up a constructor-named field and dispatch accordingly.
Exception handling uses a handler stack. OP_PUSH_EXCEPTION_HANDLER records the current IP, chunk, call frame index, and stack top. When OP_THROW executes, the nearest handler is popped, the call frame and value stacks are unwound, the error value is pushed, and execution resumes at the handler's saved IP. Deferred blocks interact correctly — OP_DEFER_RUN executes all defer entries registered at or above the current frame before the frame is popped by OP_RETURN.
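The record each handler-stack entry keeps is small. Here is a sketch of its shape based on the four items listed above; the type and field names are assumed.

```c
#include <stddef.h>

typedef struct Chunk Chunk;    /* opaque here; the real chunk type lives in the VM */

/* Sketch of a handler-stack entry; field names are assumed, the contents
 * are the four items described above. */
typedef struct {
    const Chunk *chunk;        /* chunk that was executing when the handler was pushed */
    size_t       ip;           /* saved instruction pointer: where the catch resumes   */
    int          frame_index;  /* unwind the call stack down to this frame             */
    size_t       stack_top;    /* unwind the value stack down to this height           */
} ExceptionHandler;
```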
Iterators avoid closure allocation entirely. OP_ITER_INIT converts a range or array into an internal iterator occupying two stack slots (collection + cursor index). OP_ITER_NEXT advances the cursor, pushes the next element, or jumps to a specified offset when exhausted. The tree-walker used closure-based iterators for for loops — the bytecode version is simpler and avoids the allocation.
Ref<T>: The Escape Hatch from Value Semantics
Everything described so far operates in a world where values are deep-cloned on every read. Maps are pass-by-value. Structs are pass-by-value. Pass a collection to a function and the function gets its own copy — mutations don't propagate back. This is correct and eliminates aliasing bugs, but it creates a real problem: how do you share mutable state when you actually need to?
Ref<T> is the answer. It's a reference-counted shared mutable wrapper — the one type in Lattice that deliberately breaks value semantics:
```c
struct LatRef {
    LatValue value;    // the wrapped inner value
    size_t   refcount; // reference count
};
```
When a Ref is cloned (which happens on every variable read, like everything else), the VM bumps the refcount and copies the pointer. It does not deep-clone the inner value. Multiple copies of a Ref share the same underlying LatRef, so mutations through one are visible through all others. This is the explicit opt-in to reference semantics that the rest of the language avoids.
```
let r = Ref::new([1, 2, 3])
let r2 = r            // shallow copy — same LatRef
r.push(4)
print(r2.get())       // [1, 2, 3, 4] — shared state
```
The VM provides transparent proxying: OP_INDEX, OP_SET_INDEX, and OP_INVOKE all check for VAL_REF and delegate to the inner value. Indexing into a Ref<Array> indexes the inner array. Calling .push() on a Ref<Array> mutates the inner array directly. At the language level, a Ref mostly behaves like the value it wraps — you just get shared mutation instead of isolated copies.
Ref has its own methods — get()/deref() to clone the inner value out, set(v) to replace it, inner_type() to inspect the wrapped type — plus proxied methods for whatever the inner value supports (map set/get/keys, array push/pop, etc.).
The phase system applies to Refs too. Freezing a Ref blocks all mutation: set(), push(), index assignment all check obj->phase == VTAG_CRYSTAL and error with "cannot set on a frozen Ref." This makes frozen Refs safe to share across concurrent boundaries — they're immutable handles to immutable data.
This introduces a third memory management strategy alongside the dual-heap (mark-and-sweep for fluid values, arenas for crystal values) and the ephemeral bump arena. Refs use reference counting: ref_retain() on clone, ref_release() on free, with the inner value freed when the count hits zero. It's a deliberate trade-off — reference counting is simple and deterministic, and since Refs are the uncommon case (most Lattice code uses value semantics), the lack of cycle collection hasn't been an issue in practice.
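The retain/release pair is as simple as it sounds. A self-contained sketch follows; the LatValue stub and the inner free are stand-ins, and only the retain/release shape mirrors the description above.

```c
#include <stdlib.h>

/* Self-contained sketch of the Ref lifecycle; LatValue is stubbed out and
 * free_value() is a stand-in for recursively freeing the inner payload. */
typedef struct { int dummy; } LatValue;           /* stand-in for the real value type */
typedef struct { LatValue value; size_t refcount; } LatRef;

static void free_value(LatValue *v) { (void)v; /* free the inner payload */ }

static LatRef *ref_retain(LatRef *r) {
    r->refcount++;                    /* cloning a Ref = bump count + copy pointer */
    return r;
}

static void ref_release(LatRef *r) {
    if (--r->refcount == 0) {         /* last handle gone: free the inner value */
        free_value(&r->value);
        free(r);
    }
}
```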
Validation
The VM is validated by 815 tests covering every feature: arithmetic, closures, upvalues, phase transitions, exception handling, defer, iterators, data structures, concurrency, modules, bytecode serialization, and the self-hosted compiler.
All 815 tests pass under both normal compilation and AddressSanitizer builds (make asan), which dynamically checks for heap buffer overflows, use-after-free, stack buffer overflows, and memory leaks. For a VM with manual memory management, upvalue lifetime tracking, and an ephemeral arena that reclaims memory at statement boundaries, ASan validation is essential.
Both execution modes — bytecode VM (default) and tree-walker (--tree-walk) — share the same test suite and produce identical results:
```
make test               # bytecode VM: 815 passed
make test TREE_WALK=1   # tree-walker: 815 passed
```
Feature parity is complete:
| Feature | Tree-walker | Bytecode VM |
|---|---|---|
| Phase system (freeze/thaw/clone/forge) | ✓ | ✓ |
| Closures with upvalues | ✓ | ✓ |
| Exception handling (try/catch/throw) | ✓ | ✓ |
| Defer blocks | ✓ | ✓ |
| Pattern matching | ✓ | ✓ |
| Structs with methods | ✓ | ✓ |
| Enums with payloads | ✓ | ✓ |
| Arrays, maps, tuples, sets, buffers | ✓ | ✓ |
| Iterators (for-in, ranges) | ✓ | ✓ |
| Module imports | ✓ | ✓ |
| Concurrency (scope/spawn/select) | ✓ | ✓ |
| Channels | ✓ | ✓ |
| Phase reactions/bonds/seeds | ✓ | ✓ |
| Contracts (require/ensure) | ✓ | ✓ |
| Variable tracking (history) | ✓ | ✓ |
| Bytecode serialization (.latc) | --- | ✓ |
| Computed goto dispatch | --- | ✓ |
| Ephemeral bump arena | --- | ✓ |
| Specialized integer ops | --- | ✓ |
The last four rows are VM-only features that have no tree-walker equivalent.
What's Next
The VM is feature-complete but not performance-optimized. The obvious next steps are register allocation to reduce stack traffic, type-specialized dispatch paths guided by runtime profiling, tail call optimization for recursive patterns, and constant pool deduplication across compilation units. Further out, the bytecode provides a natural intermediate representation for JIT compilation.
On the self-hosting front, adding concurrency primitives to latc.lat would close the gap to full self-compilation — where the Lattice compiler compiles itself, producing a .latc file that can then compile other programs without the C implementation in the loop.
The full technical details — including encoding diagrams, the complete opcode listing, compilation walkthroughs, and references to related work in Lua, CPython, YARV, and WebAssembly — are in the research paper.
The source code is at github.com/ajokela/lattice, and the project site is at lattice-lang.org.
```
git clone https://github.com/ajokela/lattice.git
cd lattice && make
./clat
```