This is the story of attempting something probably inadvisable: compiling Rust for the Zilog Z80, an 8-bit processor from 1976. It's also a story about using AI as a genuine collaborator on deep systems programming work, and what happens when modern software abstractions collide with hardware constraints from an era when 64 kilobytes was considered generous.
Transparency: Claude Code as Collaborator
I want to be upfront about something: significant portions of this compiler backend were developed in collaboration with Claude Code, Anthropic's AI coding assistant. This isn't a case of "AI wrote the code and I took credit" — it's more nuanced than that. Claude served as an unusually patient pair programmer who happens to have read every LLVM tutorial ever written.
Here's what that collaboration actually looked like:
I would describe a problem: "The instruction selector is failing with cannot select: G_SADDO for signed addition with overflow detection." Claude would analyze the GlobalISel pipeline, identify that the Z80's ADC instruction sets the P/V flag for signed overflow, and propose an implementation. I would review, test, discover edge cases, and we'd iterate.
The debugging sessions were particularly valuable. When compilation hung for seven hours on what should have been a two-minute build, Claude helped trace the issue to an accidental infinite recursion — a replace_all refactoring had changed RBI.constrainGenericRegister(...) to constrainOrSetRegClass(...) inside the constrainOrSetRegClass helper function itself. The function was calling itself forever. Finding that bug manually would have taken hours of printf debugging; with Claude analyzing the code structure, we found it in minutes.
This is what AI-assisted development actually looks like in 2025: not magic code generation, but accelerated iteration with a collaborator who never gets frustrated when you ask "wait, explain register allocation to me again."
Why Z80? Why Rust?
The Z80 powered the TRS-80, ZX Spectrum, MSX computers, and countless embedded systems. It's still manufactured today — you can buy new Z80 chips. I actually did just that and bought a handful of vintage ceramic Z80 chips off eBay. There's something appealing about running modern language constructs on hardware designed when ABBA topped the charts.
More practically, I've been building Z80-based projects on the RetroShield platform, which lets you run vintage processors on Arduino-compatible hardware. Having a modern compiler toolchain opens possibilities that hand-written assembly doesn't.
But Rust specifically? Rust's ownership model and zero-cost abstractions are theoretically perfect for resource-constrained systems. The language was designed for systems programming. The question is whether "systems" can stretch back 50 years.
Building LLVM for the Z80
The first step was getting LLVM itself to build with Z80 support. This meant:
Adding Z80 to the list of supported targets in the build system
Creating the target description files (registers, instruction formats, calling conventions)
Implementing the GlobalISel pipeline components
Wiring everything together so llc -mtriple=z80-unknown-unknown actually works
The target description files alone span thousands of lines. Here's what defining just the basic registers looks like:
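In sketch form, the TableGen register definitions look something like the following. This is a minimal illustration with conventional names, not the backend's actual .td file:

// Illustrative TableGen only; the real Z80 register file also needs subregister indices.
class Z80Reg<string n> : Register<n> {
  let Namespace = "Z80";
}

// The seven 8-bit registers
def A : Z80Reg<"a">;
def B : Z80Reg<"b">;
def C : Z80Reg<"c">;
def D : Z80Reg<"d">;
def E : Z80Reg<"e">;
def H : Z80Reg<"h">;
def L : Z80Reg<"l">;

// 16-bit pairs composed of the 8-bit halves
def BC : RegisterWithSubRegs<"bc", [B, C]> { let Namespace = "Z80"; }
def DE : RegisterWithSubRegs<"de", [D, E]> { let Namespace = "Z80"; }
def HL : RegisterWithSubRegs<"hl", [H, L]> { let Namespace = "Z80"; }

// Register classes that instruction patterns refer to
def GR8  : RegisterClass<"Z80", [i8],  8, (add A, B, C, D, E, H, L)>;
def GR16 : RegisterClass<"Z80", [i16], 8, (add HL, DE, BC)>;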
Every instruction needs similar treatment. The Z80 has over 700 documented instruction variants when you count all the addressing modes. Not all are needed for a basic backend, but getting basic arithmetic, loads, stores, branches, and calls working required implementing dozens of instruction patterns.
The build process itself was surprisingly manageable — LLVM's build system is well-designed. A complete build with the Z80 target takes about 20 minutes on modern hardware. The iteration cycle during development was typically: change a few files, rebuild (30 seconds to 2 minutes depending on what changed), test with llc, fix, repeat.
The LLVM Approach
LLVM provides a framework for building compiler backends. You describe your target's registers, instruction set, and calling conventions; LLVM handles optimization, instruction selection, and register allocation. In theory, adding a new target is "just" filling in these descriptions.
In practice, LLVM assumes certain things about targets. It assumes you have a reasonable number of general-purpose registers. It assumes arithmetic operations work on values that fit in registers. It assumes function calls follow conventions that modern ABIs have standardized.
The Z80 violates all of these assumptions.
The Register Poverty Problem
The Z80 has seven 8-bit registers: A, B, C, D, E, H, and L. Some can be paired into 16-bit registers: BC, DE, HL. That's it. Modern architectures have 16 or 32 general-purpose registers; the Z80 has seven that aren't even all general-purpose — A is the accumulator with special arithmetic privileges, HL is the primary memory pointer.
LLVM's register allocator expects to juggle many virtual registers across many physical registers. When you have more virtual registers than physical registers, it spills values to memory. On the Z80, you're spilling constantly. Every 32-bit operation requires careful choreography of the few registers available.
Here's what a simple 16-bit addition looks like in our backend:
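Under the calling convention described later (first argument in HL, second in DE, result returned in HL), the emitted code amounts to something like this sketch rather than a verbatim dump of the backend's output:

add16:
    ADD HL, DE      ; HL = HL + DE; both arguments already arrive in place
    RET             ; 16-bit result returned in HL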
That's clean because we designed the calling convention to pass arguments in HL and DE. The backend recognizes that the inputs are already where they need to be and emits just the ADD instruction.
But 32-bit addition? That becomes a multi-instruction sequence juggling values through the stack because we can't hold four 16-bit values in registers simultaneously.
The Width Problem
The Z80 is fundamentally an 8-bit processor with 16-bit addressing. Rust's standard library uses usize for indexing, which on most platforms is 32 or 64 bits. The Z80 cannot directly perform 32-bit arithmetic. Every u32 operation expands into multiple 8-bit or 16-bit operations.
Consider multiplication. The Z80 has no multiply instruction at all. To multiply two 16-bit numbers, we emit a call to a runtime library function (__mulhi3) that implements multiplication through shifts and adds. 32-bit multiplication requires calling a function that orchestrates four 16-bit multiplications with proper carry handling.
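The library routine itself isn't reproduced here, but the technique is the textbook shift-and-add loop. An illustrative Z80 version (not the actual __mulhi3 implementation):

; HL = BC * DE (low 16 bits), the classic shift-and-add loop
mul16:
    LD HL, 0            ; partial product
    LD A, B             ; multiplier will be shifted out of A:C
    LD B, 16            ; 16 bits to process
mul16_loop:
    ADD HL, HL          ; shift the partial product left
    SLA C               ; shift multiplier left: top bit of C into carry...
    RLA                 ; ...and on through A, so the multiplier's MSB lands in carry
    JR NC, mul16_skip
    ADD HL, DE          ; that multiplier bit was 1: add the multiplicand
mul16_skip:
    DJNZ mul16_loop
    RET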
Division is worse. Iterative division algorithms on 8-bit hardware are slow. Floating-point arithmetic doesn't exist in hardware — every floating-point operation becomes a library call to software implementations.
GlobalISel: The Modern Approach
We're using LLVM's GlobalISel framework rather than the older SelectionDAG. GlobalISel provides finer control over instruction selection through explicit lowering steps:
IRTranslator: Converts LLVM IR to generic machine instructions (G_ADD, G_LOAD, etc.)
Legalizer: Transforms operations the target can't handle into sequences it can
RegBankSelect: Assigns register banks (8-bit vs 16-bit on Z80)
InstructionSelector: Converts generic instructions to target-specific instructions
Each step presented challenges. The Legalizer needed custom rules to break 32-bit operations into 16-bit pieces. RegBankSelect needed to understand that some Z80 instructions only work with specific register pairs. The InstructionSelector needed patterns for every Z80 instruction variant.
One particularly tricky issue: LLVM's overflow-detecting arithmetic. Instructions like G_SADDO (signed add with overflow) return both a result and an overflow flag. The Z80's ADC instruction sets the P/V flag on signed overflow, but capturing that flag to a register requires careful instruction sequencing — you can't just read the flag register arbitrarily.
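Here's the kind of sequence involved (a sketch, not the backend's literal output): the flag has to be consumed by a conditional jump immediately after the arithmetic, before anything else clobbers it.

    ADD A, B            ; 8-bit add; ADD and ADC both set P/V on signed overflow
    LD C, 1             ; tentatively record "overflow" (LD does not touch flags)
    JP PE, ovf_done     ; PE means P/V is set, so the overflow really happened
    LD C, 0             ; no overflow after all
ovf_done:
                        ; A holds the sum, C holds the boolean overflow result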
The Bug That Cost Seven Hours
During development, we hit a bug that perfectly illustrates the challenges of compiler work. After implementing a helper function to handle register class assignment, compilation started hanging. Not crashing — hanging. A simple three-function test file that should compile in milliseconds ran for over seven hours before I killed it.
The issue? During a refactoring pass, we used a global search-and-replace to change all calls from RBI.constrainGenericRegister(...) to our new constrainOrSetRegClass(...) helper. But the helper function itself contained a call to RBI.constrainGenericRegister() as its fallback case. The replace-all changed that too:
// Before (correct):
bool constrainOrSetRegClass(Register Reg, ...) {
    if (!MRI.getRegClassOrNull(Reg)) {
        MRI.setRegClass(Reg, &RC);
        return true;
    }
    return RBI.constrainGenericRegister(Reg, RC, MRI);  // Fallback
}

// After (infinite recursion):
bool constrainOrSetRegClass(Register Reg, ...) {
    if (!MRI.getRegClassOrNull(Reg)) {
        MRI.setRegClass(Reg, &RC);
        return true;
    }
    return constrainOrSetRegClass(Reg, RC, MRI);  // Calls itself forever!
}
The function was calling itself instead of the underlying LLVM function. Every attempt to compile anything would recurse until the stack overflowed or the heat death of the universe, whichever came first.
This is the kind of bug that's obvious in hindsight but insidious during development. There were no compiler errors, no warnings, no crashes with helpful stack traces. Just silence as the process spun forever.
Finding it required adding debug output at each step of the instruction selector, rebuilding, and watching where the output stopped. Claude helped immensely here — recognizing the pattern of "output stops here" and immediately checking what that code path did.
The Calling Convention
We designed a Z80-specific calling convention optimized for the hardware's constraints:
First 16-bit argument: HL register pair
Second 16-bit argument: DE register pair
Return value: HL register pair
Additional arguments: Stack
Caller-saved: All registers (callee can clobber anything)
Callee-saved: None
This convention minimizes register shuffling for simple functions. A function taking two 16-bit values and returning one doesn't need any register setup at all — the arguments arrive exactly where the ADD instruction expects them.
For 8-bit arguments, values arrive in the low byte of HL (L register) or DE (E register). This wastes the high byte but simplifies the calling convention.
This is radically different from typical calling conventions. Modern ABIs specify precise preservation rules, stack alignment requirements, and argument passing in specific registers. On the Z80, with so few registers, we had to make pragmatic choices. Every function saves and restores what it needs; there's no concept of "preserved across calls."
A Working Example
Here's LLVM IR that our backend compiles successfully:
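It was a small test file along these lines (a representative sketch, not the exact three functions from the test):

define i16 @add16(i16 %a, i16 %b) {
entry:
  %sum = add i16 %a, %b
  ret i16 %sum
}

define i16 @sub16(i16 %a, i16 %b) {
entry:
  %diff = sub i16 %a, %b
  ret i16 %diff
}

define i8 @add8(i8 %a, i8 %b) {
entry:
  %sum = add i8 %a, %b
  ret i8 %sum
}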
The 16-bit operations are efficient. The 8-bit addition shows the register shuffling required when values aren't in the accumulator — we have to move values through available registers to get them where the ADD instruction expects.
Compilation time for these three functions: 0.01 seconds. The backend works.
Where We Are Now
The backend compiles simple LLVM IR to working Z80 assembly. Integer arithmetic, control flow, function calls, memory access — the fundamentals work. We've implemented handlers for dozens of generic machine instructions and their various edge cases.
Attempting to compile Rust's core library has been... educational. The core library is massive. It includes:
All the formatting infrastructure (Display, Debug, write! macros)
Iterator implementations and adaptors
Option, Result, and their many combinator methods
Slice operations, sorting algorithms
Panic handling infrastructure
Unicode handling
Each of these generates significant code. The formatting system alone probably exceeds the entire memory capacity of a typical Z80 system.
Current status: compilation of core starts, processes thousands of functions, but eventually hits edge cases we haven't handled yet. The most recent error involves register class assignment in the floating-point decimal formatting code — ironic since the Z80 has no floating-point hardware.
Connecting Rust to the Z80 Backend
Getting Rust to use our LLVM backend required modifying the Rust compiler itself. This involved:
Adding a target specification: Defining z80-unknown-none-elf in Rust's target database with the appropriate data layout, pointer width, and feature flags.
Pointing Rust at our LLVM: Rust can use an external LLVM rather than its bundled version. We configured the build to use our Z80-enabled LLVM.
Disabling C compiler-builtins: Rust's standard library includes some C code from compiler-rt for low-level operations. There's no Z80 C compiler readily available, so we had to disable these and rely on pure Rust implementations.
Setting panic=abort: The Z80 can't reasonably support stack unwinding for panic handling.
The Rust target specification looks like this:
Target {
    arch: Arch::Z80,
    data_layout: "e-m:e-p:16:8-i16:8-i32:8-i64:8-n8:16".into(),
    llvm_target: "z80-unknown-unknown".into(),
    pointer_width: 16,
    options: TargetOptions {
        c_int_width: 16,
        panic_strategy: PanicStrategy::Abort,
        max_atomic_width: Some(0),  // No atomics
        atomic_cas: false,
        singlethread: true,
        no_builtins: true,          // No C runtime
        ..TargetOptions::default()
    },
}
The pointer_width: 16 is crucial — this is a 16-bit architecture. The max_atomic_width: Some(0) tells Rust that atomic operations aren't available at all, since the Z80 has no atomic instructions.
When Rust tries to compile core, it invokes rustc, which invokes LLVM, which invokes our Z80 backend. Each function in core goes through this pipeline. The sheer volume is staggering — core contains thousands of generic functions that get monomorphized for every type they're used with.
The Honest Assessment
Will Rust's standard library ever practically run on a Z80? Almost certainly not. The core library alone, compiled for Z80, would likely exceed a megabyte — far beyond the 64KB address space. Even if you could page-swap the code, the runtime overhead of software floating-point, 32-bit arithmetic emulation, and iterator abstractions would make execution glacially slow.
What might actually work:
#![no_std] / #![no_core] programs: Bare-metal Rust with a tiny custom runtime, no standard library, hand-optimized for the hardware. A few kilobytes of carefully written Rust that compiles to tight Z80 assembly.
Code generation experiments: Using the LLVM backend to study how modern language constructs map to constrained hardware, even if the results aren't practical to run.
Educational purposes: Understanding compiler internals by working with hardware simple enough to reason about completely.
The value isn't in running production Rust on Z80s. It's in the journey — understanding LLVM's internals, grappling with register allocation on a machine that predates the concept (and me, albeit by only a few years), and seeing how far modern tooling can stretch.
Conclusion
Compiling Rust for the Z80 is somewhere between ambitious and absurd. The hardware constraints are genuinely incompatible with modern language expectations. But the attempt has been valuable — understanding LLVM deeply, exploring what "resource-constrained" really means, and discovering that AI collaboration can work effectively on low-level systems programming.
The Z80 was designed for a world where programmers counted bytes. Rust was designed for a world where programmers trust the compiler to manage complexity. Making them meet is an exercise in translation across decades of computing evolution.
There's something deeply satisfying about watching a nearly 50-year-old CPU execute code you just compiled. The Z80 processor, introduced by Zilog in 1976, powered everything from the TRS-80 to the ZX Spectrum to countless CP/M machines. With roughly 8,500 transistors, it's almost incomprehensibly simple by modern standards; a high-end Intel i9 has around 17 billion. Today, thanks to projects like the RetroShield, you can plug one of these vintage processors into an Arduino and run real 8-bit code.
But here's the thing: actually writing Z80 programs is painful. Traditional approaches involve either hand-assembling hex codes, wrestling with decades-old assemblers that barely run on modern systems, or writing raw bytes into binary files. I wanted something better. What if I could write Z80 programs in Rust, using a fluent API that generates correct machine code without the mental overhead of remembering opcode encodings?
The result is the retroshield-z80-workbench, a Rust crate that powers three substantial retro applications: a dBASE II database clone, a WordStar-compatible text editor, and a VisiCalc-style spreadsheet. The workbench emerged from patterns I discovered while building earlier projects like a C compiler and LISP interpreter. This post explains how it works and what it's enabled.
The Problem with Traditional Z80 Development
I first encountered Z80 assembly in the 1990s, writing programs on a TI-85 graphing calculator. The process was painfully tedious: hand-assemble each instruction to hex using a reference card, type the bytes into the calculator's memory editor, run it, watch it crash, and start over. There was no debugger, no error messages, just a frozen screen or a memory clear if you were unlucky. I spent more time looking up opcodes than thinking about algorithms.
Writing Z80 assembly by hand means memorizing hundreds of opcodes. LD A, B is 0x78. JP NZ, addr is 0xC2 followed by a 16-bit address in little-endian format. Conditional returns, indexed addressing, and the various Z80-specific instructions like LDIR and DJNZ all have their own encodings. One wrong byte and your program jumps into garbage.
Traditional assemblers solve this, but they come with their own problems. Many only run under CP/M or DOS. Modern cross-assemblers exist, but they're another tool to install, another syntax to learn, another build step to manage. And when you're generating code programmatically, like when building a compiler that targets Z80, an external assembler becomes a significant complication.
There are also modern C compilers for the Z80, most notably SDCC (Small Device C Compiler), which is actively maintained and produces decent code. But when your goal is to generate Z80 machine code from Rust, perhaps as the backend of a compiler or code generator, you want something that integrates directly into your Rust toolchain.
What I wanted was the ability to write something like this in Rust:
rom.ld_a(0x42);          // LD A, 0x42
rom.call("print_hex");   // CALL print_hex
rom.ret();               // RET
And have it emit the correct bytes: 0x3E 0x42 0xCD xx xx 0xC9.
The Workbench Architecture
The retroshield-z80-workbench crate is built around three core concepts: emit, label, and fixup.
Emit: The Foundation
At the lowest level, everything is just bytes being appended to a buffer:
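A condensed sketch of that core (the field names here are illustrative; the published crate may differ in detail):

use std::collections::HashMap;

pub struct CodeGen {
    rom: Vec<u8>,                      // the bytes of the ROM image being built
    labels: HashMap<String, u16>,      // name -> address of defined labels
    fixups: Vec<(usize, String)>,      // ROM offsets still waiting for an address
}

impl CodeGen {
    pub fn emit(&mut self, bytes: &[u8]) {
        self.rom.extend_from_slice(bytes);   // every instruction bottoms out here
    }

    pub fn emit_word(&mut self, w: u16) {
        self.emit(&[(w & 0xFF) as u8, (w >> 8) as u8]);  // Z80 addresses are little-endian
    }
}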
Every Z80 instruction ultimately calls emit(). The ld_a() method is just:
pub fn ld_a(&mut self, n: u8) {
    self.emit(&[0x3E, n]);  // Opcode 0x3E is LD A, n
}
This pattern scales to cover the entire Z80 instruction set. The crate provides over 80 instruction helpers, from simple register loads to complex block transfer instructions.
Labels: Named Positions
Labels mark positions in the code that can be referenced by jumps and calls:
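Continuing the sketch from the previous section:

pub fn label(&mut self, name: &str) {
    // Assuming the ROM image is loaded at address 0x0000, the current buffer
    // length is also the Z80 address of the next byte to be emitted.
    let addr = self.rom.len() as u16;
    self.labels.insert(name.to_string(), addr);
}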
When you write rom.label("main"), the current position gets recorded. Later, when you write rom.jp("main"), the crate knows exactly where to jump.
Fixups: Forward References
The clever part is handling forward references. When you write rom.call("print_string") before print_string is defined, the crate can't know the address yet. Instead, it records a fixup:
pub fn call(&mut self, label: &str) {
    self.emit(&[0xCD]);   // CALL opcode
    self.fixup(label);    // Record that we need to fill in this address
}

pub fn fixup(&mut self, label: &str) {
    self.fixups.push((self.rom.len(), label.to_string()));
    self.emit_word(0x0000);  // Placeholder
}
At the end, resolve_fixups() walks through all recorded fixups and patches in the correct addresses:
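In sketch form, the resolution pass is just a loop over the recorded offsets:

pub fn resolve_fixups(&mut self) {
    for (offset, name) in &self.fixups {
        let addr = self.labels[name];             // a missing entry means an undefined label
        self.rom[*offset] = (addr & 0xFF) as u8;  // patch low byte (little-endian)
        self.rom[*offset + 1] = (addr >> 8) as u8;
    }
}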
This simple mechanism enables natural code organization where you can reference routines before defining them.
Building Blocks: The Standard Library
Raw instruction emission is powerful but verbose. The workbench includes pre-built routines for common tasks that any Z80 program needs.
Serial I/O
Our modified RetroShield firmware emulates an MC6850 ACIA for serial communication (the official RetroShield uses an Intel 8251). The standard library provides blocking read/write routines:
pub fn emit_getchar(&mut self) {
    self.label("getchar");
    self.in_a(0x80);           // Read status register
    self.and_a(0x01);          // Test RX ready bit
    self.emit(&[0x28, 0xFA]);  // JR Z, -6 (loop until ready)
    self.in_a(0x81);           // Read data register
    self.ret();
}
This generates a 10-byte routine that any program can call with rom.call("getchar"). The character comes back in the A register, exactly as you'd expect from a standard library function.
Similar routines handle putchar, print_string (for null-terminated strings), and newline (CR+LF).
VT100 Terminal Control
Every program I've written needs cursor positioning, screen clearing, and other terminal operations. The standard library includes VT100 escape sequences:
pub fn emit_clear_screen(&mut self) {
    self.label("clear_screen");
    self.ld_hl_label("_cls_seq");
    self.call("print_string");
    self.ret();
}

// Later, in data section:
rom.label("_cls_seq");
rom.emit_string("\x1B[2J\x1B[H");  // ESC[2J ESC[H
The cursor_pos routine is more complex, converting binary row/column values to the ASCII digits that VT100 expects. It's about 50 bytes of Z80 code that no one wants to write more than once.
Math Routines
The Z80 has limited math capabilities, especially for 16-bit operations. The standard library provides:
print_byte_dec: Convert and print A register as decimal (000-255)
div16: 16-bit division with remainder
negate_hl: Two's complement negation
These become critical building blocks for anything involving numbers.
Pseudo-Assembly as Building Blocks
The real power emerges when you combine these primitives into higher-level constructs. Instead of thinking in individual Z80 instructions, you start thinking in chunks of functionality.
Consider implementing a text editor. You need a routine to insert a character at the cursor position. In pseudo-assembly, this is:
Get the current line pointer
Shift all bytes from cursor to end of buffer right by one
Insert the new character
Update cursor position
Redraw
Each of these steps becomes a Rust method that emits a sequence of Z80 instructions:
fn emit_insert_char(&mut self) {
    self.label("insert_char");

    // Save the character to insert
    self.ld_addr_a(TEMP_A);

    // Get current line pointer
    self.ld_a_addr(CURSOR_ROW);
    self.call("get_line_ptr");   // HL = line start

    // Add cursor column offset
    self.ld_de_addr(CURSOR_COL);
    self.add_hl_de();            // HL = insert position

    // Calculate bytes to shift...
    // (many more instructions)

    // Use LDDR for the actual shift
    self.emit(&[0xED, 0xB8]);    // LDDR

    // Insert the character
    self.ld_a_addr(TEMP_A);
    self.ld_hl_ind_a();

    // Update counters and redraw
    self.call("increment_cursor");
    self.call("draw_current_line");
    self.ret();
}
This method generates about 80 bytes of Z80 machine code. By building up from primitives to routines to complete functions, complex programs become manageable.
Programs Built with the Workbench
The real test of any framework is what you can build with it. Here's what's running on the RetroShield today.
kz80_db: A dBASE II Clone
dBASE II was the database that launched a thousand businesses in the early 1980s. Before SQL became dominant, dBASE gave microcomputer users their first taste of structured data management. My clone implements the authentic 1981 file format: 8-byte headers, 16-byte field descriptors, fixed-length records with delete flags.
CREATE to define new database structures with up to 16 fields
USE to open existing .DBF files from the SD card
APPEND to add records interactively
LIST to display all records in columnar format
EDIT to modify existing records with field-by-field prompts
DELETE and PACK for soft-delete and physical removal
GO TOP/BOTTOM and GO n for record navigation
DISPLAY STRUCTURE to show field definitions
The generated ROM is about 4KB, fitting comfortably in the RetroShield's 8KB ROM space. It reads and writes real .DBF files that you can open in modern database tools like LibreOffice Calc or even current versions of dBASE.
Building this required implementing a command parser that handles the dot-prompt interface, string comparison routines for command matching, file I/O through the SD card interface with seek operations, and the full dBASE command set. Each command is a Rust method that emits the appropriate Z80 code:
fn emit_list_command(&mut self) {
    self.label("cmd_list");

    // Check if database is open
    self.ld_a_addr(DB_OPEN);
    self.or_a_a();
    self.jp_z("no_db_open");

    // Print column headers from field descriptors
    self.call("print_headers");

    // Loop through all records
    self.ld_hl(1);
    self.ld_addr_hl(CURRENT_REC);
    self.label("list_loop");
    self.call("read_record");
    self.call("print_record");

    // Increment and check against record count
    self.ld_hl_addr(CURRENT_REC);
    self.inc_hl();
    self.ld_addr_hl(CURRENT_REC);
    // ... 150+ more lines
}
The SD card interface deserves special mention. The RetroShield includes an SD card reader accessible through I/O ports. Commands like open, read, write, seek, and close are sent through a command register, with data transferred byte-by-byte through a data register. The workbench makes this tolerable by wrapping the low-level port operations in reusable routines.
kz80_ws: A WordStar Clone
WordStar defined text editing for a generation of writers. George R.R. Martin famously still uses it. The diamond cursor movement (^E ^S ^D ^X arranged like arrow keys on the keyboard), the block operations (^KB ^KK ^KC), the search functions, the word wrap, the careful attention to 80-column displays: all of this became muscle memory for millions of users.
The clone implements:
Full cursor movement with ^E/^S/^D/^X and ^A/^F for word movement
Insert and overwrite modes with ^V toggle
Block operations: mark begin (^KB), mark end (^KK), copy (^KC), delete (^KY)
File operations: save (^KS), save and exit (^KD), quit without saving (^KQ)
Search (^QF), word wrap at configurable right margins
Line operations: delete line (^Y), insert line break (^N)
Quick movement: top of file (^QR), end of file (^QC), line start/end (^QS/^QD)
VT100 terminal output with proper status line showing line/column/mode
The memory layout is carefully designed around the 8KB RAM constraint, with the text buffer, cursor state, and block markers all packed into fixed regions of memory.
The word wrap implementation is particularly satisfying. When the cursor passes the right margin (default column 65), the editor scans backward to find the last space, then uses the Z80's LDDR instruction to shift the buffer and insert a CR/LF pair. The cursor repositions on the new line at exactly the right column to continue typing the wrapped word. All of this happens fast enough that the user just sees smooth text flow.
The screen update strategy matters on a 4MHz processor. Rather than redrawing the entire screen on each keystroke, the editor tracks what changed and only redraws the affected line. The VT100 "clear to end of line" escape sequence handles trailing garbage. This keeps the interface responsive despite the hardware limitations.
kz80_calc: A VisiCalc-Style Spreadsheet
VisiCalc was the "killer app" that made personal computers business tools. Dan Bricklin and Bob Frankston's 1979 creation turned the Apple II from a hobbyist toy into something accountants would buy. My version brings that experience to the Z80:
1024 cells (16 columns A-P by 64 rows) in 6KB of RAM
8-digit packed BCD arithmetic for accurate decimal math
Formula support with cell references (A1+B2*C3)
Operator precedence (* and / before + and -)
Range functions: @SUM, @AVG, @MIN, @MAX, @COUNT
Automatic recalculation when cells change
Arrow key navigation and GOTO command for jumping to cells
Cell types: numbers, labels, formulas, and repeating characters
The BCD math was the hardest part. Binary floating-point would give wrong answers for financial calculations (the classic 0.1 + 0.2 != 0.3 problem). Packed BCD stores two decimal digits per byte, and the Z80's DAA (Decimal Adjust Accumulator) instruction handles single-byte addition correctly. But building 32-bit multiplication and division from 8-bit DAA takes hundreds of carefully sequenced instructions.
The formula parser handles expressions like =A1+B2*C3-@SUM(D1:D10). This required implementing recursive descent parsing in Z80 machine code, which the workbench made tractable by letting me focus on the algorithm rather than opcode encodings. The parser breaks formulas into tokens, builds a simple AST in memory, and evaluates it with proper operator precedence.
Beyond the Workbench
The workbench proved its value for these three substantial applications. But I've also built other Z80 projects that predate the workbench or use their own code generation approaches:
kz80_c: A C compiler with its own emit infrastructure, developed before the workbench was extracted as a reusable crate
kz80_lisp: A LISP interpreter with mark-and-sweep garbage collection
kz80_prolog: Logic programming with unification and backtracking
kz80_ml: An ML compiler with Hindley-Milner type inference
kz80_fortran: FORTRAN77 subset for scientific computing nostalgia
kz80_lua, kz80_smalltalk, kz80_chip8: Various interpreters and emulators
The experience building these earlier projects is what led to extracting the common patterns into the workbench. The emit/label/fixup pattern appeared independently in several codebases before I recognized it as a reusable abstraction.
Looking back at kz80_c, for instance, I can see the proto-workbench emerging. There's a CodeGen struct with an emit() method, a labels hashmap, and fixup resolution. The same pattern appears in kz80_lisp. Eventually it became clear that this infrastructure should be its own crate, tested once and reused everywhere.
The workbench also benefited from hindsight. Early projects had ad-hoc solutions for things like unique label generation (essential for compiling nested control structures) and relative jump calculation. The workbench handles these correctly from the start, saving debugging time on every subsequent project.
The Hardware: RetroShield Z80
For those unfamiliar with the RetroShield project, it's worth a brief explanation. The RetroShield is an Arduino shield designed by 8BitForce that lets you run real vintage CPUs. You plug an actual Z80 (or 6502, or 6809, or 8085) into a socket on the shield. The Arduino provides clock, reset, and memory by intercepting the CPU's bus signals.
I added an SD card interface at ports 0x10-0x15, which isn't part of the standard RetroShield but integrates cleanly with the Arduino firmware. This gives the dBASE and WordStar clones persistent file storage.
This constrained environment is actually liberating. You can't reach for a 100MB framework or spawn threads. Every byte matters. The programs you write are complete, self-contained, and comprehensible. The entire WordStar clone is about 4KB of machine code. You can read a hex dump of the ROM and, with patience, trace exactly what every byte does.
The RetroShield connects to an Arduino Mega via two rows of 18 pins, or alternatively to a Teensy 4.1 using a special carrier board. Either way, you interact with your Z80 programs through a terminal emulator over USB serial. The VT100 and VT220 escape sequences that the workbench's terminal routines emit work perfectly in modern terminals like iTerm2 or the venerable screen command, connecting 1970s display protocols to 2020s software.
Why Rust?
Rust brings several advantages to this domain:
Type Safety: The compiler catches mistakes like passing a label where an address is expected, or using the wrong register size. This matters when generating machine code where a single wrong byte corrupts everything.
Zero Runtime: The generated ROMs contain only Z80 code, no runtime, no garbage collector. Rust's abstractions compile away completely.
Excellent Tooling: Cargo handles dependencies, testing, and publishing. The workbench is on crates.io; adding it to a project is one line in Cargo.toml.
Performance: Code generation is fast. Even the complex projects compile in under a second.
Expressiveness: Rust's type system lets me encode Z80 concepts cleanly. A label is a String, an address is a u16, and the compiler keeps them straight.
Lessons Learned
Building the workbench and using it for real projects taught me several things:
Start with the primitives right: The emit/label/fixup core hasn't changed since the first version. Getting the foundation solid paid dividends.
Standard library matters: Having I/O and terminal routines ready to call eliminated boilerplate from every project. I probably use call("print_string") a hundred times across all the projects.
Let the host do the work: Complex string manipulation, parsing, and data structure management happen in Rust on the host computer. The Z80 code just handles the runtime behavior. This split makes everything easier.
Readability over brevity: A Z80 program written in the workbench is longer than the equivalent hand-assembled hex, but it's readable and maintainable. When I need to fix a bug in the WordStar word wrap routine, I can read the Rust code and understand it.
Getting Started
The workbench is available on crates.io:
[dependencies]
retroshield-z80-workbench = "0.1"
A minimal program:
use retroshield_z80_workbench::prelude::*;

fn main() {
    let mut rom = CodeGen::new();
    rom.emit_startup(0x3FFF);

    rom.call("clear_screen");
    rom.ld_hl_label("msg");
    rom.call("print_string");
    rom.halt();

    rom.label("msg");
    rom.emit_string("Hello from Z80!\r\n");

    rom.include_stdlib();
    rom.resolve_fixups();
    rom.write_bin("hello.bin").unwrap();
}
Load hello.bin onto a RetroShield (or run it in a Z80 emulator), and you'll see the greeting on your terminal.
Conclusion
The Z80 is nearly 50 years old, but it's still fun to program. The retroshield-z80-workbench brings modern development practices to vintage hardware: type-safe code generation, proper dependency management, fast iteration, and readable source.
Whether you want to build a clone of classic software, implement your own programming language for 8-bit hardware, or just understand how computers work at the machine code level, having the right tools makes all the difference. And there's still nothing quite like watching your code run on a chip that predates most programmers alive today.
The code for the workbench and all the kz80_* projects is available on GitHub under BSD-3-Clause licenses. PRs welcome.
Over the past year, I have been building a collection of programming language compilers and interpreters targeting the venerable Zilog Z80 microprocessor. What started as an experiment in retrocomputing has grown into a comprehensive suite of tools spanning multiple programming paradigms: from the functional elegance of LISP to the object-oriented messaging of Smalltalk, from the structured programming of Pascal and Fortran to the low-level control of C. This anthology documents the common architectural patterns, the unique challenges of targeting an 8-bit processor, and the unexpected joys of bringing modern language implementations to 1970s hardware.
My fascination with the Z80 began in the mid-1990s when I got my first TI-85 graphing calculator. That unassuming device, marketed for algebra and calculus homework, contained a Z80 running at 6 MHz with 28KB of RAM. Discovering that I could write programs in Z80 assembly and run them on this pocket computer was revelatory. I accumulated a small library of Z80 assembly books and spent countless hours learning the instruction set, writing simple games, and understanding how software meets hardware at the most fundamental level. Three decades later, this project represents a return to that formative obsession, now armed with modern tools and a deeper understanding of language implementation.
The RetroShield Platform
The RetroShield is a family of hardware adapters that bridge vintage microprocessors to modern Arduino development boards. The product line covers a remarkable range of classic CPUs: the MOS 6502 (powering the Apple II and Commodore 64), the Motorola 6809 (used in the TRS-80 Color Computer), the Intel 8085, the SC/MP, and the Zilog Z80. Each variant allows the original processor to execute real machine code while the Arduino emulates memory, peripherals, and I/O.
For this project, I focused exclusively on the RetroShield Z80. The Z80's rich instruction set, hardware BCD support via the DAA instruction, and historical significance as the CPU behind CP/M made it an ideal target for language implementation experiments. The RetroShield Z80 connects the actual Z80 chip to an Arduino Mega (or Teensy adapter for projects requiring more RAM), which emulates the memory and peripheral chips. This arrangement provides the authenticity of running on actual Z80 silicon while offering the convenience of modern development workflows.
The standard memory map provides 8KB of ROM at addresses 0x0000-0x1FFF and 6KB of RAM at 0x2000-0x37FF, though the Teensy adapter expands this significantly to 256KB. Serial I/O is handled through an emulated MC6850 ACIA chip at ports 0x80 and 0x81, providing the familiar RS-232 interface that connects these vintage programs to modern terminals.
One note: if you have a Z80 RetroShield and want to run the binaries produced by this compiler collection on actual hardware, you will need a couple of things: 1) bin2c, a program that takes a Z80 binary and turns it into a PROGMEM statement you can drop into an Arduino sketch, and 2) an Arduino sketch that includes code for emulating the MC6850 ACIA.
Common Compiler Architecture: Lexer, Parser, AST, Codegen
Every compiler in this collection follows a similar multi-stage architecture, a pattern that has proven itself across decades of compiler construction. Understanding this common structure reveals how the same fundamental approach can target vastly different source languages while producing efficient Z80 machine code.
The Lexer: Breaking Text into Tokens
The lexer (or tokenizer) is the first stage of compilation, responsible for transforming raw source code into a stream of tokens. Each language has its own lexical grammar: LISP recognizes parentheses and symbols, C identifies keywords and operators, Smalltalk distinguishes between message selectors and literals. Despite these differences, every lexer performs the same fundamental task of categorizing input characters into meaningful units.
In our Rust implementations, the lexer typically maintains a position in the source string and provides a next_token() method that advances through the input. This produces tokens like Token::Integer(42), Token::Plus, or Token::Identifier("factorial"). The lexer handles the tedious work of skipping whitespace, recognizing multi-character operators, and converting digit sequences into numbers.
The Parser: Building the Abstract Syntax Tree
The parser consumes the token stream and constructs an Abstract Syntax Tree (AST) that represents the hierarchical structure of the program. Most of our compilers use recursive descent parsing, a technique where each grammar rule becomes a function that may call other rule functions. This approach is intuitive, produces readable code, and handles the grammars of most programming languages effectively.
For example, parsing an arithmetic expression like 3 + 4 * 5 requires understanding operator precedence. The parser might have functions like parse_expression(), parse_term(), and parse_factor(), each handling operators at different precedence levels. The result is an AST where the multiplication is grouped as a subtree, correctly representing that it should be evaluated before the addition.
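Here is a condensed Rust sketch of that structure (illustrative, not code from any particular kz80 crate); precedence falls out of which function calls which:

// A minimal recursive descent sketch: expression -> term -> factor.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Token { Num(i32), Plus, Minus, Star, Slash }

#[derive(Debug)]
enum Expr { Num(i32), Bin(char, Box<Expr>, Box<Expr>) }

struct Parser<'a> { toks: &'a [Token], pos: usize }

impl<'a> Parser<'a> {
    fn peek(&self) -> Option<Token> { self.toks.get(self.pos).copied() }
    fn bump(&mut self) -> Token { let t = self.toks[self.pos]; self.pos += 1; t }

    // expression := term (('+' | '-') term)*
    fn parse_expression(&mut self) -> Expr {
        let mut lhs = self.parse_term();
        while matches!(self.peek(), Some(Token::Plus) | Some(Token::Minus)) {
            let op = if self.bump() == Token::Plus { '+' } else { '-' };
            lhs = Expr::Bin(op, Box::new(lhs), Box::new(self.parse_term()));
        }
        lhs
    }

    // term := factor (('*' | '/') factor)*  -- so * and / bind tighter than + and -
    fn parse_term(&mut self) -> Expr {
        let mut lhs = self.parse_factor();
        while matches!(self.peek(), Some(Token::Star) | Some(Token::Slash)) {
            let op = if self.bump() == Token::Star { '*' } else { '/' };
            lhs = Expr::Bin(op, Box::new(lhs), Box::new(self.parse_factor()));
        }
        lhs
    }

    // factor := number (parentheses, unary minus, calls, etc. would hang off here)
    fn parse_factor(&mut self) -> Expr {
        match self.bump() {
            Token::Num(n) => Expr::Num(n),
            other => panic!("expected a number, got {:?}", other),
        }
    }
}

Parsing 3 + 4 * 5 with this sketch yields Bin('+', 3, Bin('*', 4, 5)), with the multiplication grouped as a subtree exactly as described above.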
Code Generation: Emitting Z80 Machine Code
The code generator walks the AST and emits Z80 machine code. This is where the rubber meets the road: abstract operations like "add two numbers" become concrete sequences of Z80 instructions like LD A,(HL), ADD A,E, and LD (DE),A.
Most of our compilers generate code directly into a byte buffer, manually encoding each instruction's opcode and operands. This approach, while requiring intimate knowledge of the Z80 instruction set, gives us precise control over the generated code and avoids the complexity of an intermediate representation or separate assembler pass.
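As an illustration, reusing the Expr type from the parser sketch above (the opcode choices here are mine, not any particular kz80 generator's), emitting Z80 code for a binary expression can look like this:

// Generates code that leaves the 16-bit result in HL, spilling the left operand
// to the stack around the right-hand side -- the usual trick on a register-starved CPU.
fn gen_expr(rom: &mut Vec<u8>, e: &Expr) {
    match e {
        Expr::Num(n) => {
            let v = *n as u16;
            rom.extend_from_slice(&[0x21, (v & 0xFF) as u8, (v >> 8) as u8]); // LD HL, nn
        }
        Expr::Bin(op, lhs, rhs) => {
            gen_expr(rom, lhs);
            rom.push(0xE5);                           // PUSH HL   (save left operand)
            gen_expr(rom, rhs);
            rom.push(0xD1);                           // POP DE    (left operand into DE)
            match op {
                '+' => rom.push(0x19),                // ADD HL, DE (HL = right + left)
                '-' => {
                    rom.push(0xEB);                   // EX DE, HL  (HL = left, DE = right)
                    rom.push(0xB7);                   // OR A       (clear carry for SBC)
                    rom.extend_from_slice(&[0xED, 0x52]); // SBC HL, DE (HL = left - right)
                }
                _ => unimplemented!("* and / call runtime routines instead"),
            }
        }
    }
}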
The DAA Instruction and BCD Arithmetic
One of the most fascinating aspects of Z80 programming is the DAA (Decimal Adjust Accumulator) instruction, opcode 0x27. This single instruction makes the Z80 surprisingly capable at decimal arithmetic, which proves essential for implementing numeric types on an 8-bit processor.
What is BCD?
Binary Coded Decimal (BCD) is a numeric representation where each decimal digit is stored in 4 bits (a nibble). Rather than storing the number 42 as binary 00101010 (its true binary representation), BCD stores it as 0100 0010, with the first nibble representing 4 and the second representing 2. This "packed BCD" format stores two decimal digits per byte.
While BCD is less space-efficient than pure binary (you can only represent 0-99 in a byte rather than 0-255), it has a crucial advantage: decimal arithmetic produces exact decimal results without rounding errors. This is why BCD was the standard for financial calculations on mainframes and why pocket calculators (including the famous TI series) used BCD internally.
How DAA Works
When you perform binary addition on two BCD digits, the result may not be valid BCD. Adding 0x09 and 0x01 gives 0x0A, but 0x0A is not a valid BCD digit. The DAA instruction corrects this: it examines the result and the half-carry flag (which indicates a carry from bit 3 to bit 4, i.e., from the low nibble to the high nibble) and adds 0x06 to any nibble that exceeds 9. After DAA, that 0x0A becomes 0x10, correctly representing decimal 10 in BCD.
This process works for both addition (after ADD or ADC instructions) and subtraction (after SUB or SBC instructions, where DAA subtracts 0x06 instead of adding it). The Z80 remembers whether the previous operation was addition or subtraction through its N flag.
BCD in Our Compilers
Several of our compilers use 4-byte packed BCD integers, supporting numbers up to 99,999,999 (8 decimal digits). The addition routine loads bytes from both operands starting from the least significant byte, adds them with ADC (add with carry) to propagate carries between bytes, applies DAA to correct each byte, and stores the result. The entire operation takes perhaps 20 bytes of code but provides exact decimal arithmetic on an 8-bit processor.
Here is a simplified version of our BCD addition loop:
bcd_add:
LD B, 4 ; 4 bytes to process
OR A ; Clear carry flag
bcd_add_loop:
LD A, (DE) ; Load byte from first operand
ADC A, (HL) ; Add byte from second operand with carry
DAA ; Decimal adjust
LD (DE), A ; Store result
DEC HL ; Move to next byte
DEC DE
DJNZ bcd_add_loop
RET
This pattern appears in kz80_c, kz80_fortran, kz80_smalltalk, and kz80_lisp, demonstrating how a hardware feature designed in 1976 still provides practical benefits for language implementation.
The Evolution: From Assembly to C to Rust
The journey of implementing these compilers taught us valuable lessons about choosing the right tool for the job, and our approach evolved significantly over time.
First Attempt: Pascal in Z80 Assembly
Our first language implementation was kz80_pascal, a Pascal interpreter written entirely in Z80 assembly language. This approach seemed natural: if you are targeting the Z80, why not write directly in its native language?
The reality proved challenging. Z80 assembly, while powerful, is unforgiving. Building a recursive descent parser in assembly requires manually managing the call stack, carefully preserving registers across function calls, and debugging through hex dumps of memory. The resulting interpreter works and provides an interactive REPL for Pascal expressions, but extending it requires significant effort. Every new feature means more assembly, more potential for subtle bugs, and more time spent on implementation details rather than language design.
Second Attempt: Fortran 77 in C with SDCC
For kz80_fortran, we tried a different approach: writing the interpreter in C and cross-compiling with SDCC (Small Device C Compiler). This was dramatically more productive. C provided structured control flow, automatic stack management, and the ability to organize code into manageable modules.
The result is a comprehensive Fortran 77 subset with floating-point arithmetic (via BCD), subroutines and functions, arrays, and block IF statements. The C source compiles to approximately 19KB of Z80 code, fitting comfortably in ROM with room for program storage in RAM.
However, this approach has limitations. SDCC produces functional but not always optimal code, and debugging requires understanding both the C source and the generated assembly. The interpreter also requires the Teensy adapter with 256KB RAM, as the Arduino Mega's 4KB is insufficient for the runtime data structures.
The Rust Workbench: Our Final Form
Our breakthrough came with the realization that we did not need the compiler itself to run on the Z80, only the generated code. This insight led to what we call the "Rust workbench" approach: write the compiler in Rust, running on a modern development machine, and have it emit Z80 binary images.
This architecture provides enormous advantages:
Modern tooling: Cargo manages dependencies and builds, rustc catches bugs at compile time, and we have access to the entire Rust ecosystem for testing and development.
Fast iteration: Compiling a Rust program takes seconds; testing the generated Z80 code in our emulator takes milliseconds. Compare this to the multi-minute flash cycles required when the compiler runs on the target.
Comprehensive testing: Each compiler includes both Rust unit tests (testing the lexer, parser, and code generator individually) and integration tests that compile source programs and verify their output in the emulator.
Zero-dependency output: Despite being written in Rust, the generated Z80 binaries have no runtime dependencies. They are pure machine code that runs directly on the hardware.
This approach now powers kz80_lisp, kz80_c, kz80_lua, kz80_smalltalk, kz80_chip8, and retrolang. Each is a standalone Rust binary that reads source code and produces a 32KB ROM image.
The Z80 Emulator
None of this would be practical without a way to test generated code quickly. Our RetroShield Z80 Emulator provides exactly this: a cycle-accurate Z80 emulation with the same memory map and I/O ports as the real hardware.
The emulator comes in two versions: a simple passthrough mode (retroshield) that connects stdin/stdout directly to the emulated serial port, and a full TUI debugger (retroshield_nc) with register displays, disassembly views, memory inspection, and single-step execution. The passthrough mode enables scripted testing, piping test inputs through the emulator and comparing outputs against expected results. The TUI debugger proves invaluable when tracking down code generation bugs.
The emulator uses the superzazu/z80 library for CPU emulation, which provides accurate flag behavior and correct cycle counts. Combined with our MC6850 ACIA emulation, it provides a faithful recreation of the RetroShield environment without requiring physical hardware.
Self-Hosting Compilers: LISP and C
Two of our compilers achieve something remarkable: they can compile themselves and run on the target hardware. This property, called "self-hosting," is a significant milestone in compiler development.
What Does Self-Hosting Mean?
A self-hosting compiler is one written in the language it compiles. The classic example is the C compiler: most C compilers are themselves written in C. But this creates a chicken-and-egg problem: how do you compile a C compiler if you need a C compiler to compile it?
The solution is bootstrapping. You start with a minimal compiler written in some other language (or in machine code), use it to compile a slightly better compiler written in the target language, and iterate until you have a full-featured compiler that can compile its own source code. Once bootstrapped, the compiler becomes self-sustaining: future versions compile themselves.
kz80_lisp: A Self-Hosted LISP Compiler
kz80_lisp (crates.io) includes a LISP-to-Z80 compiler written in LISP itself. The compiler.lisp file defines functions that traverse LISP expressions and emit Z80 machine code bytes directly into memory. When you call (COMPILE '(+ 1 2)), it generates the actual Z80 instructions to load 1 and 2 and add them.
The self-hosted compiler supports arithmetic expressions, nested function calls, and can generate code that interfaces with the runtime's I/O primitives. While not a full replacement for the Rust-based code generator, it demonstrates that LISP is expressive enough to describe its own compilation to machine code.
kz80_c: A Self-Hosted C Compiler
kz80_c (crates.io) goes further: its self/cc.c file is a complete C compiler written in the C subset it compiles. This compiler reads C source from stdin and outputs Z80 binary to stdout, making it usable in shell pipelines.
The self-hosted C compiler supports all arithmetic operators, pointers, arrays, global variables, control flow statements, and recursive functions. Its main limitation is memory: the compiler source is approximately 66KB, exceeding the 8KB input buffer available on the Z80. This is a fundamental hardware constraint, not a compiler bug. In theory, a "stage 0" minimal compiler could bootstrap larger compilers.
Why Self-Hosting Matters
Self-hosting is more than a technical achievement; it validates the language implementation. If the compiler can compile itself correctly, it demonstrates that the language is expressive enough for real programs and that the code generator produces working code under complex conditions. For our Z80 compilers, self-hosting also connects us to the history of computing: the original Small-C compiler by Ron Cain in 1980 was similarly self-hosted on Z80 CP/M systems.
The Language Implementations
kz80_lisp
A minimal LISP interpreter and compiler featuring the full suite of list operations (CAR, CDR, CONS), special forms (QUOTE, IF, COND, LAMBDA, DEFINE), and recursive function support. The implementation includes a pure-LISP floating-point library and the self-hosted compiler mentioned above.
kz80_c
A C compiler supporting char (8-bit), int (16-bit), float (BCD), pointers, arrays, structs, and a preprocessor with #define and #include. The runtime library provides serial I/O and comprehensive BCD arithmetic functions. The self-hosted variant can compile and run C programs entirely on the Z80.
# cat fibonacci.c
int fib(int n) {
if (n <= 1) return n;
return fib(n-1) + fib(n-2);
}
int main() {
puts("Fibonacci:");
for (int i = 0; i < 10; i = i + 1)
print_num(fib(i));
return 0;
}
# kz80_c fibonacci.c -o fib.bin
# retroshield -l fib.bin
Fibonacci:
0 1 1 2 3 5 8 13 21 34
kz80_smalltalk
A Smalltalk subset compiler implementing the language's distinctive message-passing syntax with left-to-right operator evaluation. Expressions like 1 + 2 * 3 evaluate to 9 (not 7), matching Smalltalk's uniform treatment of binary messages. All arithmetic uses BCD with the DAA instruction.
kz80_lua
A Lua compiler producing standalone ROM images with an embedded virtual machine. Supports tables (Lua's associative arrays), first-class functions, closures, and familiar control structures. The generated VM interprets Lua bytecode, with frequently-used operations implemented in native Z80 code for performance.
# cat factorial.lua
function factorial(n)
if n <= 1 then return 1 end
return n * factorial(n - 1)
end
print("5! =", factorial(5))
# kz80_lua factorial.lua -o fact.bin
# retroshield -l fact.bin
Tiny Lua v0.1
5! = 120
kz80_fortran
A Fortran 77 interpreter with free-format input, REAL numbers via BCD floating point, block IF/THEN/ELSE/ENDIF, DO loops, subroutines, and functions. Requires the Teensy adapter for sufficient RAM. Written in C and cross-compiled with SDCC.
FORTRAN-77 Interpreter v0.3
RetroShield Z80
Ready.
> INTEGER X, Y
> X = 7
> Y = X * 6
> WRITE(*,*) 'Answer:', Y
Answer: 42
kz80_pascal
A Pascal interpreter implemented in pure Z80 assembly. Provides an interactive REPL for expression evaluation with integer arithmetic, boolean operations, and comparison operators. A testament to the challenges of assembly language programming.
retrolang
A custom systems programming language with Pascal/C-like syntax, featuring 16-bit integers, 8-bit bytes, pointers, arrays, inline assembly, and full function support with recursion. Compiles to readable Z80 assembly before assembling to binary.
# cat squares.rl
proc main()
var i: int;
print("Squares: ");
for i := 1 to 5 do
printi(i * i);
printc(32);
end;
println();
end;
# retrolang squares.rl --binary -o squares.bin
# retroshield -l squares.bin
Squares: 1 4 9 16 25
kz80_chip8
A static recompiler that transforms CHIP-8 programs into native Z80 code. Rather than interpreting CHIP-8 bytecode at runtime, the compiler analyzes each instruction and generates equivalent Z80 sequences. Classic games like Space Invaders and Tetris run directly on the hardware.
Why Rust?
The choice of Rust for our compiler workbench was not accidental. Several features make it exceptionally well-suited for this work.
Strong typing catches bugs early. When you're generating machine code, off-by-one errors or type mismatches can produce binaries that crash or compute wrong results. Rust's type system prevents many such errors at compile time.
Pattern matching excels at AST manipulation. Walking a syntax tree involves matching on node types and recursively processing children. Rust's match expressions with destructuring make this natural and exhaustive (the compiler warns if you forget a case).
Zero-cost abstractions. We can use high-level constructs like iterators, enums with data, and trait objects without runtime overhead. The generated compiler code is as efficient as hand-written C.
Excellent tooling. Cargo's test framework made it easy to build comprehensive test suites. Each compiler has dozens to hundreds of tests that run in seconds, providing confidence when making changes.
Memory safety without garbage collection. This matters less for the compilers themselves (which are desktop tools) but more for our mental model: thinking about ownership and lifetimes transfers naturally to thinking about Z80 register allocation and stack management.
Conclusion
Building these compilers has been a journey through computing history, from the Z80's 1976 architecture to modern Rust tooling, from the fundamentals of lexing and parsing to the intricacies of self-hosting. The BCD arithmetic that seemed like a curiosity became a practical necessity; the emulator that started as a debugging aid became essential infrastructure; the Rust workbench that felt like an optimization became the key to productivity.
The Z80 remains a remarkable teaching platform. Its simple instruction set is comprehensible in an afternoon, yet implementing real languages for it requires genuine compiler engineering. Every language in this collection forced us to think carefully about representation, evaluation, and code generation in ways that higher-level targets often obscure.
All of these projects are open source under BSD-3-Clause licenses. The compilers are available on both GitHub and crates.io, ready to install with cargo install. Whether you are interested in retrocomputing, compiler construction, or just curious how programming languages work at the metal level, I hope these tools and their source code prove useful.
The Z80 may be nearly 50 years old, but it still has lessons to teach.
There's something deeply satisfying about running code on vintage hardware. The blinking cursor, the deliberate pace of execution, the direct connection between your keystrokes and the machine's response. The RetroShield by Erturk Kocalar brings this experience to modern makers by allowing real vintage CPUs like the Zilog Z80 to run on Arduino boards. But what if you could experience that same feeling directly in your web browser?
That's exactly what I set out to build: a complete Z80 emulator that runs RetroShield firmware in WebAssembly, complete with authentic CRT visual effects and support for multiple programming language interpreters.
Try It Now
Select a ROM below and click "Load ROM" to start. Click on the terminal to focus it, then type to interact with the interpreter.
Tip: Click on the terminal to focus it, then type to send input. Try loading Fortran 77 and entering: INTEGER X then X = 42 then WRITE(*,*) X
The RetroShield Platform
Before diving into the emulator, it's worth understanding what makes the RetroShield special. Unlike software emulators that simulate a CPU in code, the RetroShield uses a real vintage microprocessor. The Z80 variant features an actual Zilog Z80 chip running at its native speed, connected to an Arduino Mega or Teensy that provides:
Memory emulation: The Arduino's SRAM serves as the Z80's RAM, while program code is stored in the Arduino's flash memory
I/O peripherals: Serial communication, typically through an emulated MC6850 ACIA or Intel 8251 USART
Clock generation: The Arduino provides the clock signal to the Z80
This hybrid approach means you get authentic Z80 behavior - every timing quirk, every undocumented opcode - while still having the convenience of USB connectivity and easy program loading.
Here's my RetroShield Z80 setup with the Teensy adapter:
The Zilog Z80 CPU sits in the 40-pin DIP socket, with the Teensy 4.1 providing memory emulation and I/O handling beneath.
The physical hardware runs identically to the browser emulator above - the same ROMs, the same interpreters, the same authentic Z80 execution.
Why Build a Browser Emulator?
Having built several interpreters and tools for the RetroShield, I found myself constantly cycling through the development loop: edit code, compile, flash to Arduino, test, repeat. A software emulator would speed this up significantly, but I also wanted something I could share with others who might not have the hardware.
WebAssembly seemed like the perfect solution. It runs at near-native speed in any modern browser, requires no installation, and can be embedded directly in a web page. Someone curious about retro computing could try out a Fortran 77 interpreter or Forth environment without buying any hardware.
Building the Emulator in Rust
I chose Rust for the emulator implementation for several reasons:
Excellent WASM support: Rust's wasm-bindgen and wasm-pack tools make compiling to WebAssembly straightforward
Performance: Rust compiles to efficient code, important for cycle-accurate emulation
The rz80 crate: Andre Weissflog's rz80 provides a battle-tested Z80 core
One challenge was supporting ROMs that use different serial chips. The RetroShield ecosystem has two common configurations:
MC6850 ACIA (ports $80/$81): Used by many homebrew projects including MINT, Firth Forth, and my own Fortran and Pascal interpreters. The ACIA has four registers (control, status, transmit data, receive data) mapped to two ports, with separate read/write functions per port.
Intel 8251 USART (ports $00/$01): Used by Grant Searle's popular BASIC port and the EFEX monitor. The 8251 is simpler with just two ports - one for data and one for control/status.
The emulator detects which chip to use based on ROM metadata and configures the I/O handlers accordingly.
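To make the difference concrete, here is a small Python sketch of the idea (the emulator itself is written in Rust); the port maps follow the descriptions above, but the metadata key and handler names are assumptions for illustration, not the emulator's actual API.

# Illustrative only: map each serial chip to its I/O ports and register roles.
MC6850_ACIA_PORTS = {
    0x80: ("read_status", "write_control"),   # one port, separate read/write registers
    0x81: ("read_rx_data", "write_tx_data"),
}
INTEL_8251_PORTS = {
    0x00: ("read_data", "write_data"),
    0x01: ("read_status", "write_control"),
}

def select_port_map(rom_metadata: dict) -> dict:
    """Pick the port map based on which serial chip the ROM expects."""
    chip = rom_metadata.get("serial_chip", "mc6850")  # metadata key is assumed
    return MC6850_ACIA_PORTS if chip == "mc6850" else INTEL_8251_PORTS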
Memory Layout
The standard RetroShield memory map looks like this:
| Address Range | Size | Description |
|---------------|------|-------------|
| $0000-$7FFF   | 32KB | ROM/RAM (program dependent) |
| $8000-$FFFF   | 32KB | Extended RAM (Teensy adapter) |
Most of my interpreters use a layout where code occupies the lower addresses and data/stack occupy higher memory. The Fortran interpreter, for example, places its program text storage at $6700 and variable storage at $7200, with the stack growing down from $8000.
The CRT Effect
No retro computing experience would be complete without the warm glow of a CRT monitor. I implemented several visual effects using pure CSS:
Scanlines: A repeating gradient overlay creates the horizontal line pattern characteristic of CRT displays:
Chromatic aberration: CRT displays have slight color fringing due to the electron beam hitting phosphors at angles. I simulate this with animated text shadows that shift red and blue components:
Flicker: Real CRTs had subtle brightness variations. A randomized opacity animation creates this effect without being distracting.
Vignette: The edges of CRT screens were typically darker than the center, simulated with a radial gradient.
The font: I'm using the Glass TTY VT220 font, a faithful recreation of the DEC VT220 terminal font from the 1980s. It's public domain and adds significant authenticity to the experience.
The Language Interpreters
The emulator comes pre-loaded with several language interpreters, each running as native Z80 code:
Fortran 77 Interpreter
This is my most ambitious RetroShield project: a subset of Fortran 77 running interpretively on an 8-bit CPU. It supports:
REAL numbers via BCD (Binary Coded Decimal) floating point with 8 significant digits
INTEGER and REAL variables with implicit typing (I-N are integers)
The interpreter is written in C and cross-compiled with SDCC. At roughly 21KB of code, it pushes the limits of what's practical on the base RetroShield, which is why it requires the Teensy adapter.
MINT (Minimal Interpreter)
MINT is a wonderfully compact stack-based language. Each command is a single character, making it incredibly memory-efficient:
> 1 2 + .
3
> : SQ D * ;
> 5 SQ .
25
Firth Forth
A full Forth implementation by John Hardy. Forth's stack-based paradigm and extensibility made it popular on memory-constrained systems:
> : FACTORIAL ( n -- n! ) 1 SWAP 1+ 1 DO I * LOOP ;
> 7 FACTORIAL .
5040
Grant Searle's BASIC
A port of Microsoft BASIC that provides the classic BASIC experience:
Z80 BASIC Ver 4.7b
Ok
> 10 FOR I = 1 TO 10
> 20 PRINT I * I
> 30 NEXT I
> RUN
1
4
9
...
Technical Challenges
Building this project involved solving several interesting problems:
Memory Layout Debugging
The Fortran interpreter crashed mysteriously when entering lines with statement labels. After much investigation, I discovered the CODE section had grown to overlap with the DATA section. The linker was told to place data at $5000, but code had grown past that point. The fix was updating the memory layout to move the data section to a higher address, giving the code section more room.
This kind of bug is particularly insidious because it works fine until the code grows past a certain threshold.
BCD Floating Point
Implementing floating-point math on a Z80 without hardware support is challenging. I chose BCD (Binary Coded Decimal) representation because:
Exact decimal representation: No binary floating-point surprises like 0.1 + 0.2 != 0.3
Simpler conversion: Reading and printing decimal numbers is straightforward
Reasonable precision: 8 BCD digits give adequate precision for an educational interpreter
Each BCD number uses 6 bytes: 1 for sign, 1 for exponent, and 4 bytes holding 8 packed decimal digits.
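As a rough illustration of that layout, here is a Python sketch that packs and unpacks such a value; the byte order and the sign/exponent encodings are assumptions for the example, not the interpreter's exact format.

def pack_bcd(sign: int, exponent: int, digits: str) -> bytes:
    """Pack an 8-digit decimal mantissa into the assumed 6-byte layout:
    1 sign byte, 1 exponent byte, 4 bytes of packed BCD (2 digits per byte)."""
    assert len(digits) == 8 and digits.isdigit()
    mantissa = bytes((int(digits[i]) << 4) | int(digits[i + 1]) for i in range(0, 8, 2))
    return bytes([sign & 0xFF, exponent & 0xFF]) + mantissa

def unpack_bcd(raw: bytes) -> tuple[int, int, str]:
    """Reverse of pack_bcd: returns (sign, exponent, digit string)."""
    sign, exponent, mantissa = raw[0], raw[1], raw[2:]
    digits = "".join(f"{b >> 4}{b & 0x0F}" for b in mantissa)
    return sign, exponent, digits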
Cross-Compilation with SDCC
The Small Device C Compiler (SDCC) targets Z80 and other 8-bit processors. While it's an impressive project, there are quirks:
No standard library functions that assume an OS
Limited optimization compared to modern compilers
Memory model constraints require careful attention to data placement
I wrote a custom crt0.s startup file that initializes the stack, sets up the serial port, and calls main().
Running the Emulator
The emulator runs at roughly 3-4 MHz equivalent speed, depending on your browser and hardware. That puts it in the same range as the original Z80's typical 4 MHz clock, and any difference isn't noticeable for interactive use.
John Hardy - Author of Firth Forth, MINT, and Monty
There's something magical about running 49-year-old CPU architectures in a modern web browser. The Z80 powered countless home computers, embedded systems, and arcade games. With this emulator, that legacy is just a click away.
Gilbreths, Vorticism and the Echoes of Artificial Intelligence in the Twenty-First-Century Knowledge Economy
Introduction
The first decades of the twentieth century were a crucible of technological, scientific and cultural transformation. The steam-driven factory floor, the internal-combustion automobile, the telegraph-to-telephone network, and the nascent film industry all collapsed distance and accelerated the rhythm of everyday life. In that moment of accelerated modernity two seemingly unrelated phenomena emerged on opposite sides of the Atlantic: the Gilbreths' scientific-management laboratory in the United States, and the Vorticist avant-garde in Britain.
Both were responses to a shared "milieu"—a world in which the machine was no longer a peripheral tool but the central fact of existence. The Gilbreths turned the machine into a system of human motion, dissecting work into its smallest elements (the "therbligs") and re-engineering tasks for efficiency, ergonomics and profit. Vorticists, led by Wyndham Lewis and allied with figures such as Ezra Pound and Henri Gaudier-Brzeska, seized upon the same mechanical dynamism in a visual language of sharp angles, fractured planes and kinetic abstraction.
A century later, the rise of artificial intelligence is reshaping the same terrain, but this time the target is not manual labor on the factory floor; it is knowledge work, the very act of thinking, deciding and creating. Yet the cultural logic that animated the Gilbreths and the Vorticists resurfaces in the AI era: a faith in rationalization, an obsession with breaking complex processes into analyzable units, a belief that design—whether of a workflow, a painting, or an algorithm—can impose order on the chaos of modern life.
This essay weaves together three strands. First, it sketches the broader historical and intellectual atmosphere that nurtured both the Gilbreths and Vorticism. Second, it juxtaposes their concrete practices and aesthetic strategies, drawing out the convergences in their conceptualization of motion, fragmentation, control and progress. Third, it maps these early-twentieth-century dynamics onto the present AI-driven re-organization of knowledge labor, arguing that the same cultural grammar underlies both epochs, even as the material substrates have shifted from bricklaying to neural networks.
1. The Early-Twentieth-Century Milieu
Technological Acceleration
Between 1900 and 1920 the world witnessed a multiplication of speed. The internal-combustion engine made automobiles and aircraft possible; the electric motor powered factories and household appliances; the telephone and radio collapsed geographic distance; the cinema rendered motion visible and repeatable. Historian David Edgerton has shown that these "new machines" were not simply tools but actors that reshaped social relations (Edgerton, The Shock of the Old, 2006). The very perception of time became quantifiable: a stopwatch could now register the fraction of a second it took a worker to raise a hammer, a clerk to type a word, or a runner to cross a track.
Scientific Management and the Quest for Rational Order
Frederick Winslow Taylor published The Principles of Scientific Management (1911), arguing that work could be transformed into a science through measurement, standardization and hierarchical control. Taylor's ideas traveled swiftly across the Atlantic, finding eager audiences in American industry and, later, in British engineering firms. The core premise was that human labor could be rendered as predictable, repeatable data, amenable to optimization.
The Gilbreths—Frank B. Gilbreth Sr. (a mechanical engineer) and Lillian M. Gilbreth (a psychologist)—expanded Taylor's blueprint. They introduced motion-study photography, a method of capturing workers' movements on film, then dissecting each frame to isolate "therbligs," the elementary units of motion (the word itself a reversal of "Gilbreth"). Their work was both scientific and humane: they claimed that eliminating unnecessary motions would reduce fatigue, increase safety and, paradoxically, improve the worker's quality of life. Their 1915 book Motion Study blended engineering diagrams with psychological insight, making the Gilbreths the archetype of industrial ergonomics.
The Cultural Avant-Garde
Concurrently, a wave of artistic experimentation was erupting across Europe. Cubism (Picasso, Braque) deconstructed visual reality into geometric facets; Futurism (Marinetti, Balla) glorified speed, noise and the machine; Constructivism (Tatlin, Rodchenko) championed functional design as a social weapon. In London, a small cadre of writers and painters, disillusioned with the lingering Victorian aesthetic, coalesced around the journal BLAST (1914-1915).
The manifesto of Vorticism, authored chiefly by Wyndham Lewis, declared a desire to capture the "vortex"—the point where energy, motion and form converge. Vorticist works are characterized by hard-edged angularity, stark color contrasts and a sense of centrifugal force. They rejected the lyrical softness of the Pre-Raphaelite tradition and the pastoral nostalgia of the Edwardian era, instead embracing the "hard, machine-like precision" of the new industrial world.
Overlapping Intellectual Currents
Both the Gilbreths and the Vorticists were embedded in a broader intellectual climate that prized measurement, abstraction and the re-creation of reality. The rise of psychophysics, behaviorism, and physiological psychology introduced the notion that human perception and action could be quantified. In parallel, philosophers such as Henri Bergson were wrestling with the concept of duration and the mechanization of time, while sociologists like Georg Simmel explored the "blasé" effect of urban modernity. The shared vocabulary of "efficiency," "speed," "fragmentation" and "design" became the lingua franca of both engineers and artists.
2. Parallel Strategies: From Motion Study to Vortex
The Machine as Central Fact
Both movements privileged the machine not as a peripheral tool but as a defining lens through which to understand humanity. The Gilbreths approached human labor as a component of a larger production system, treating the body like a mechanical part. Their methods of representation—motion-study film frames, thermographic charts, time-and-motion diagrams—reduced the worker to analyzable data. Their ontological stance held that reality could be reduced to measurable motions, with the machine serving as the baseline condition of life.
The Vorticists operated from a parallel framework but expressed it through aesthetic means. They rendered the human figure and urban landscape as networks of intersecting mechanical forms, employing sharp angular compositions, overlapping planes, and stylized gears and dynamized lines. For them, reality was a flux of forces, and the "vortex" captured the dynamic, mechanized energy of modern existence.
In both cases, the human body was subordinated to, or fused with, a system of motion. For the Gilbreths, a worker's hand was a lever; for the Vorticists, a dancer's limb could be a blade of light cutting through the air.
Fragmentation and Reassembly
The Gilbreths' therbligs (e.g., "reach," "grasp," "move") represent a conceptual atomization of work. By isolating each atomic action, they could re-assemble a sequence that minimized waste and maximized output. This analytical practice mirrors the visual fragmentation employed by Vorticist painters, who broke down objects into geometric primitives before re-constituting them on canvas.
Consider a typical Gilbreth motion-study photograph of a bricklayer: the image is a series of still frames, each showing the worker's arm at a distinct angle. The analyst's task is to trace the trajectory, identify redundant motions, and propose a smoother path. In a Vorticist painting such as Wyndham Lewis's The Crowd (1914-15), the same crowd is depicted as a constellation of overlapping triangles and intersecting lines, each fragment suggesting a movement, a direction, a force. The similarity lies not in content but in methodology: a belief that complex reality becomes intelligible when decomposed into simpler parts.
Control, Order and Design
Both camps produced manifestos that served as design blueprints for their respective domains.
The Gilbreths published practical handbooks—Motion Study (1915), Applied Motion Study (1922)—that provided step-by-step protocols for reorganizing factories, hospitals and even homes. Their famous household experiment, depicted in Cheaper by the Dozen (1948), turned family life into a laboratory of efficiency.
The Vorticists issued the BLAST manifesto (1914), a terse proclamation that called for "a new art that will cut away the old, the sentimental, the decorative". It demanded clarity, precision, and a rejection of "softness"—values that echo the Gilbreths' insistence on eliminating "soft" motions that do not contribute to productive output.
Both therefore exerted cultural authority by prescribing how the world should be organized—whether through a Gantt chart or a bold, angular composition.
Ambivalent Faith in Progress
The Gilbreths believed that scientific optimization would lead to a more humane workplace. Yet their work also laid the groundwork for later Taylorist dehumanization, where workers became interchangeable cogs. Their optimism was tempered by the reality that efficiency could be weaponized for profit, not for worker welfare.
Vorticists, especially Lewis, celebrated the "machine aesthetic" but also expressed an undercurrent of skepticism. Lewis's later writings (e.g., The Apes of God, 1930) reveal a cynical view of mass culture and the mechanization of society. The vortex, while a source of energy, can also become a whirlpool of alienation.
Thus, both movements embody a dual vision of modernity: a promise of liberation through order, paired with a fear of loss of individuality.
3. The AI Turn: Re-Engineering Knowledge Work
From Bricklaying to Algorithms
If the Gilbreths turned the physical act of building into a set of measurable motions, today's AI researchers turn the cognitive act of reasoning into data. Machine-learning pipelines ingest millions of text fragments, label them, and train neural networks that can generate, summarize, and evaluate human language. The "therblig" of a knowledge worker—reading, analyzing, drafting—can now be instrumented by click-stream data, eye-tracking, and keystroke dynamics.
Just as a motion-study camera captured the kinematics of a worker, modern digital platforms capture the logistics of a mind at work. The "process mining" tools used in enterprise software map the sequence of digital actions much as Gilbreth charts mapped the sequence of physical actions.
Fragmentation of Cognitive Tasks
AI development follows the same atomization logic that underpinned both the Gilbreths and the Vorticists. Large language models (LLMs) are trained on tokenized text, where each token—often a sub-word fragment—is a basic unit of meaning. The model learns statistical relationships between tokens, then re-assembles them into sentences, paragraphs, or code.
Similarly, the micro-task platforms (e.g., Amazon Mechanical Turk) break down complex knowledge work (data labeling, content moderation) into tiny, repeatable units that can be distributed across a crowd. The "crowd" becomes a modern analog of the bricklayer's workshop, and the platform's algorithmic workflow is the contemporary "assembly line".
Design, Control and the Algorithmic Order
Just as the Gilbreths produced process charts and Vorticists drafted manifestos, AI researchers issue model cards, datasheets for datasets, and ethical guidelines. These documents codify how the system should behave, what data it may use, and how it ought to be evaluated—mirroring the design-by-specification ethos of early scientific management.
The rise of "prompt engineering"—the craft of phrasing inputs to LLMs to obtain desired outputs—can be read as a new form of motion study. Prompt engineers dissect the model's internal "motion" (attention patterns, token probabilities) and rearrange the prompt to optimize the "efficiency" of the model's response.
Ambivalence and Ethical Dilemmas
The Gilbreths' optimism about worker welfare was later undercut by automation-induced job loss and the rise of "scientific" surveillance of labor. Vorticism's celebration of the machine later seemed naïve in the face of the World Wars and the totalitarian use of technology.
AI today reproduces this ambivalence. Proponents hail it as a tool that will free humanity from routine cognition, allowing us to focus on creativity and empathy. Critics warn of algorithmic bias, disinformation, and the erosion of skilled labor. The "vortex" of AI can either be a centrifugal force that propels society forward or a black-hole that absorbs human agency.
4. Comparative Synthesis: Themes Across the Century
The Machine as Ontological Baseline
Across all three movements, the machine serves not merely as a tool but as a fundamental framework for understanding human existence. The Gilbreths treated the human body as a component of a larger mechanical system. The Vorticists rendered human figures as geometric, machine-like forms on canvas. Today's AI researchers model human cognition as data pipelines and neural "circuits." Each epoch finds its own way to subordinate organic complexity to mechanical logic.
Fragmentation and Reassembly
The pattern of breaking down complex wholes into analyzable parts, then reconstituting them in optimized form, appears consistently across all three contexts. The Gilbreths isolated "therbligs" from continuous motion. Vorticist artists broke visual reality into planes and reassembled them into the vortex. Modern AI systems tokenize text, distribute cognitive tasks across micro-work platforms, and build modular model components. The underlying faith remains the same: that decomposition reveals the essence of things and enables their improvement.
Design as Control
Each movement produced its own form of prescriptive documentation. The Gilbreths created process charts, standardized tools, and ergonomic workstation designs. The Vorticists issued manifestos prescribing aesthetic order and "hard edges." AI practitioners develop model cards, governance frameworks, and prompt engineering guides. All represent attempts to codify and control complex systems through explicit design principles.
Faith in Progress Tempered by Anxiety
The Gilbreths promised that efficiency would bring both productivity and worker welfare, yet their methods also enabled dehumanization. The Vorticists celebrated speed and mechanical energy while hinting at alienation in their fractured compositions. AI promises cognitive augmentation while raising concerns about surveillance and the erosion of human expertise. Each technological moment carries this dual character: the hope of liberation alongside the fear of submission.
The Shifting Cultural Milieu
The Gilbreths operated within a milieu shaped by Taylorism, psychophysics, mass media, and rapid urbanization. The Vorticists emerged amid Futurism, Cubism, Constructivism, and the upheaval of the First World War. Today's AI revolution unfolds against the backdrop of big data, ubiquitous connectivity, platform capitalism, and post-pandemic remote work. Though the specific historical conditions differ, the structural logic linking these moments remains remarkably stable. What changes is the material substrate—bricks, paint, or bits—and the scale of impact—factory floors, galleries, or global digital ecosystems.
5. The "New Vortex": AI as Contemporary Avant-Garde
Just as Vorticism attempted to visualize the invisible forces of industrial modernity, AI functions as a conceptual vortex that reshapes how we see knowledge. The latent space of a language model can be visualized as a high-dimensional field of probabilities, a kind of abstract energy landscape. Artists and designers now employ AI to generate images (e.g., DALL-E, Midjourney) that echo Vorticist aesthetics: sharp, kinetic, synthetic. The algorithmic brushstroke replaces the painter's line, yet the visual language still speaks of speed, fragmentation, and mechanized beauty.
Moreover, the cultural discourse around AI mirrors the manifestos of early avant-garde movements. Papers such as "The Ethics of Artificial Intelligence" (Bostrom & Yudkowsky, 2014) and corporate statements like Google's AI Principles (2018) function as modern manifestos, setting out a vision of a rational, humane future while warning against the dark vortex of misuse.
6. Implications for the Future of Work and Culture
Re-thinking Efficiency
The Gilbreths taught that efficiency is not merely speed, but the minimization of wasteful motion. In the AI era, efficiency must be re-conceptualized as cognitive economy: reducing unnecessary mental load, automating routine reasoning, and presenting information in ways that align with human attention patterns. However, a purely quantitative approach—optimizing click-through rates or model loss functions—runs the risk of reducing the richness of human judgment, just as early Taylorism reduced workers to data points.
Agency and the "Human-Machine" Hybrid
Both Vorticism and the Gilbreths celebrated the integration of human and machine, yet they also highlighted a tension: the loss of the organic in favor of the mechanical. Today, human-AI collaboration (often called "centaur" models) seeks a synthesis where humans guide, correct, and imbue AI with values, while AI handles scale and pattern detection. The artistic "vortex" becomes a collaborative vortex—a shared space where the algorithm's output is a raw material that the human refines.
Ethical Governance as Modern Manifesto
Just as Vorticist manifestos set out a normative framework for artistic production, AI governance documents aim to define norms for algorithmic behavior. The challenge is to avoid the pitfalls of technocratic paternalism—the belief that a small elite can dictate the shape of society through scientific design, a stance implicit in early scientific management. Democratic participation, interdisciplinary oversight, and transparent "process charts" (e.g., model interpretability dashboards) can help ensure that the AI vortex does not become a black-hole of control.
Conclusion
The Gilbreths and the Vorticists were, in their own ways, architects of the modern machine age. The former turned the human body into a calibrated component of industrial systems, while the latter rendered human experience as a kinetic, geometric abstraction. Both operated within a cultural environment that prized measurement, fragmentation, and the belief that design could impose order on a rapidly changing world.
A century later, artificial intelligence stands at a comparable crossroads. The same grammar of fragmentation, reassembly, and control underlies the transformation of knowledge work. Motion-study films have been supplanted by digital telemetry; therbligs have given way to token embeddings; Vorticist canvases now coexist with AI-generated visualizations of latent spaces.
Yet, as history shows, each wave of technological rationalization brings both liberation and alienation. The Gilbreths' optimism about a more humane workplace was later tempered by concerns over mechanistic dehumanization; Vorticism's celebration of the machine was later haunted by the specter of war and totalitarian control. In the AI epoch, we must likewise balance the promise of cognitive augmentation with vigilance against algorithmic opacity, bias, and the erosion of skilled judgment.
The lesson from the early twentieth century is not that the machine should be rejected, but that human agency must remain the central design parameter. If we can learn to treat AI not as a new "vortex" that swallows us, but as a collaborative partner that can be shaped through transparent, ethically grounded processes, we may fulfill the Gilbreths' original hope—more efficient work without sacrificing humanity—and realize a Vorticist vision of a world where form, function, and freedom converge in the bright, kinetic heart of the modern age.
Modern AI assistants like ChatGPT have fundamentally changed user expectations around conversational interfaces. Users now expect to have coherent, multi-turn conversations where the AI remembers what was said earlier in the discussion. However, when building AI-powered bots on top of messaging platforms like Signal, Telegram, or SMS, developers face a fundamental architectural challenge: these platforms are inherently stateless. Each message arrives as an independent event with no built-in mechanism for maintaining conversational context.
This paper examines a production implementation that bridges this gap, enabling persistent multi-turn AI conversations over Signal's stateless messaging protocol. We explore the database schema design, the command parsing architecture, and a novel inline image reference system that allows users to incorporate visual context into ongoing conversations.
1. Introduction
1.1 The Statefulness Problem
Large Language Models (LLMs) like GPT-4 and GPT-5 are stateless by design. Each API call is independent—the model has no memory of previous interactions unless the developer explicitly includes conversation history in each request. Services like ChatGPT create the illusion of memory by maintaining conversation state server-side and replaying the full message history with each new user input.
When building a bot on a messaging platform, developers must solve this same problem, but with additional constraints:
Message Independence: Each incoming message from Signal (or similar platforms) arrives as a discrete event with no connection to previous messages.
Multi-User Environments: In group chats, multiple users may be conducting separate conversations with the bot simultaneously.
Asynchronous Delivery: Messages may arrive out of order or with significant delays.
Platform Limitations: Most messaging APIs provide no native support for threading or conversation tracking.
Resource Constraints: Storing complete conversation histories for every interaction can become expensive, both in terms of storage and API costs (since longer histories mean more tokens per request).
1.2 Design Goals
Our implementation targets the following objectives:
Conversation Continuity: Users should be able to continue previous conversations by referencing a conversation ID.
New Conversation Simplicity: Starting a fresh conversation should require no special syntax—just send a message.
Multi-Modal Support: Users should be able to reference images stored in the system within their conversational context.
Cost Transparency: Each response should report the API cost and attribute it correctly for multi-user billing scenarios.
Thread Safety: The system must handle concurrent conversations from multiple users without data corruption.
2. Database Schema Design
2.1 Conversation Tables
The persistence layer uses SQLite with a straightforward two-table design:
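The DDL itself isn't reproduced here, but a minimal sketch consistent with the columns used later (created_at, role, content, and a foreign key back to the conversation) might look like the following; column names beyond those mentioned in the text, and the database path, are assumptions.

import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS gpt_conversations (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT NOT NULL                      -- ISO 8601 string
);

CREATE TABLE IF NOT EXISTS gpt_conversation_messages (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    conversation_id INTEGER NOT NULL REFERENCES gpt_conversations(id) ON DELETE CASCADE,
    created_at      TEXT NOT NULL,
    role            TEXT NOT NULL,                -- "user", "assistant", or "system"
    content         TEXT NOT NULL
);
"""

def init_schema(db_path: str = "signal_archive.db") -> None:
    with sqlite3.connect(db_path) as conn:
        conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
        conn.executescript(SCHEMA)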
The gpt_conversations table serves as a lightweight header, storing only the conversation ID and creation timestamp. The actual message content lives in gpt_conversation_messages, which maintains the full history of each conversation.
2.2 Schema Rationale
Several design decisions merit explanation:
Minimal Conversation Metadata: The gpt_conversations table intentionally stores minimal information. We considered adding fields like user_id, title, or summary, but found these complicated the implementation without providing sufficient value. The conversation ID alone is enough to retrieve and continue any conversation.
Text Storage for Timestamps: Rather than using SQLite's native datetime types, we store ISO 8601 formatted strings. This provides timezone awareness (critical for a system serving users across time zones) and human readability when debugging.
Content as Plain Text: The content field stores the raw message text, not a structured format. This keeps the schema simple and avoids premature optimization. When multi-modal content (like inline images) is needed, we resolve references at query time rather than storing binary data in the conversation history.
Foreign Key Constraints: The foreign key relationship between messages and conversations ensures referential integrity and enables cascading deletes if conversation cleanup is needed.
3. Conversation Management API
3.1 Core Operations
The database abstraction layer exposes three primary operations:
def create_gpt_conversation(first_message: GPTMessage) -> int:
    """Create a new conversation and return its ID."""
    with get_db_connection() as conn:
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO gpt_conversations (created_at) VALUES (?)",
            (pendulum.now("America/Chicago").isoformat(),),
        )
        new_id = cur.lastrowid
        conn.commit()
    add_message_to_conversation(new_id, first_message)
    return new_id
The create_gpt_conversation function atomically creates both the conversation record and its first message. This ensures that no conversation exists without at least one message, maintaining data consistency.
def add_message_to_conversation(conversation_id: int, message: GPTMessage):
    """Append a message to an existing conversation."""
    with get_db_connection() as conn:
        cur = conn.cursor()
        cur.execute(
            """INSERT INTO gpt_conversation_messages
               (conversation_id, created_at, role, content)
               VALUES (?, ?, ?, ?)""",
            (conversation_id, pendulum.now().isoformat(), message.role, message.content),
        )
        conn.commit()
def get_messages_for_conversation(conversation_id: int) -> List[GPTMessage]:
    """Retrieve all messages in chronological order."""
    with get_db_connection() as conn:
        cur = conn.cursor()
        cur.execute(
            """SELECT created_at, role, content
               FROM gpt_conversation_messages
               WHERE conversation_id = ?
               ORDER BY created_at ASC""",
            (conversation_id,),
        )
        rows = cur.fetchall()
    return [GPTMessage(role=row[1], content=row[2]) for row in rows]
3.2 The GPTMessage Data Class
Messages are represented using a simple data class that mirrors the OpenAI API's message format:
@dataclass
class GPTMessage:
    role: str      # "user", "assistant", or "system"
    content: str   # The message text (or structured content for multi-modal)
This alignment with the OpenAI API structure means messages can be retrieved from the database and passed directly to the API without transformation, reducing complexity and potential for bugs.
4. Command Parsing and Conversation Flow
4.1 Command Syntax
The bot supports an optional conversation ID in its command syntax:
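Usage looks roughly like this; the trigger word matches the later examples in this paper, while the prompts themselves are illustrative:

gpt What's the melting point of copper?
gpt 42 And in Fahrenheit?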
This is implemented via a regex pattern that makes the conversation ID optional:
def _process_gpt_command(text: str, command: str, model: GPTModel) -> bool:
    pat = rf"^{command} (\d+ )?\s?(.*)"
    m = re.search(pat, text, flags=re.IGNORECASE | re.DOTALL)
    if not m:
        return False
    conversation_id = m.groups()[0]  # None if not provided
    prompt = m.groups()[1]
4.2 Conversation Branching Logic
The command handler implements distinct paths for new versus continued conversations:
if conversation_id:
    # Continue existing conversation
    signal_archive_db.add_message_to_conversation(
        conversation_id, GPTMessage(role="user", content=prompt)
    )
    messages = signal_archive_db.get_messages_for_conversation(conversation_id)
    conv_id = conversation_id
else:
    # Start new conversation
    first_message = GPTMessage(role="user", content=prompt)
    conv_id = signal_archive_db.create_gpt_conversation(first_message)
    messages = signal_archive_db.get_messages_for_conversation(conv_id)
For continued conversations, we first persist the new user message, then retrieve the complete history. For new conversations, we create the conversation record (which automatically adds the first message), then retrieve it back. This ensures consistency—what we send to the API exactly matches what's stored in the database.
4.3 Response Handling and Storage
After receiving the AI's response, we store it as an assistant message:
gpt_response = gpt_api.gpt_completion(api_messages, model=model)
response_text = gpt_response.get("text", "Error: No text in response")
bot_message = GPTMessage(role="assistant", content=response_text)
signal_archive_db.add_message_to_conversation(conv_id, bot_message)
send_message(
    f"[conversation {conv_id}] {response_text}\n"
    f"cost: ${cost:.4f}, payer: {payer}"
)
The response always includes the conversation ID, making it easy for users to continue the conversation later. Including cost and payer information provides transparency in multi-user environments where API expenses are shared or attributed.
5. Inline Image References
5.1 Motivation
Signal allows sending images as attachments, but these are ephemeral—they arrive with the message and aren't easily referenced later. For AI conversations, users often want to ask follow-up questions about an image discussed earlier, or reference images from the bot's archive in new conversations.
5.2 The imageid= Syntax
We implemented a lightweight markup syntax that lets users embed image references in their prompts:
gpt imageid=123 What's happening in this image?
gpt 42 imageid=123 imageid=456 Compare these two images
The syntax is intentionally simple—imageid= followed by a numeric ID. Multiple images can be included in a single prompt.
5.3 Implementation
Image references are resolved at request time through a two-stage process:
IMAGE_ID_REGEX = re.compile(r"imageid=(\d+)", re.IGNORECASE)

def _build_inline_image_content(prompt: str) -> tuple[list | str, list[int]]:
    """Convert imageid= references to OpenAI API image payloads."""
    image_ids = IMAGE_ID_REGEX.findall(prompt)
    if not image_ids:
        return prompt, []

    contents: list[dict] = []
    cleaned_prompt = IMAGE_ID_REGEX.sub("", prompt).strip()
    contents.append({"type": "text", "text": cleaned_prompt})

    embedded_ids: list[int] = []
    for raw_id in image_ids:
        image_id = int(raw_id)
        image_result = image_manager.get_image_by_id(image_id)
        if not image_result:
            raise ValueError(f"Image ID {image_id} not found")
        _, image_path = image_result
        image_bytes = image_manager.read_image_bytes(image_path)
        image_b64 = base64.b64encode(image_bytes).decode("utf-8")
        contents.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
        })
        embedded_ids.append(image_id)

    return contents, embedded_ids
The function extracts image IDs from the prompt, removes the imageid= markers from the text, loads each referenced image from disk, base64-encodes it, and constructs the multi-modal content structure expected by the OpenAI API.
5.4 Applying to Full Conversations
Since conversations may span multiple messages with image references, we apply this transformation to the entire message history:
def _prepare_messages_with_inline_images(
    messages: list[GPTMessage],
) -> tuple[list[GPTMessage], list[int]]:
    """Transform all messages, resolving image references."""
    prepared: list[GPTMessage] = []
    referenced_image_ids: list[int] = []
    for message in messages:
        content = message.content
        if message.role == "user" and isinstance(content, str):
            content, ids = _build_inline_image_content(content)
            referenced_image_ids.extend(ids)
        prepared.append(GPTMessage(role=message.role, content=content))
    return prepared, referenced_image_ids
This approach means the database stores the original imageid= references as plain text, while the actual image data is resolved fresh for each API call. This has several advantages:
Storage Efficiency: We don't duplicate image data in conversation history.
Image Updates: If an image is re-processed or corrected, subsequent conversation continuations automatically use the updated version.
Auditability: The stored conversation clearly shows which images were referenced.
6. Concurrency and Thread Safety
6.1 Threading Model
Each command runs in its own daemon thread to avoid blocking the main message processing loop:
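The spawning code isn't reproduced here; a minimal sketch of the pattern, reusing the my_func worker shown in the next snippet, would be:

import threading

worker = threading.Thread(target=my_func, daemon=True)
worker.start()  # returns immediately; the main loop keeps processing incoming messages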
6.2 User Context and Cost Attribution
The system tracks which user initiated each request for cost attribution. Since this context is stored in thread-local storage, we must capture it before spawning the worker thread and restore it inside the thread:
current_user_context = gpt_api.get_user_context()

def my_func():
    try:
        gpt_api.set_user_context(current_user_context)
        # ... API calls use this context for billing ...
    finally:
        gpt_api.clear_user_context()
6.3 Database Connection Safety
SQLite connections are managed via context managers, ensuring proper cleanup even if exceptions occur:
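The helper itself isn't shown in the excerpts above; a plausible sketch of get_db_connection, assuming SQLite and a module-level database path, is:

import sqlite3
from contextlib import contextmanager

DB_PATH = "signal_archive.db"  # assumed path for illustration

@contextmanager
def get_db_connection():
    conn = sqlite3.connect(DB_PATH)
    try:
        yield conn
    finally:
        conn.close()  # runs even if the caller raises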
Each database operation acquires its own connection, avoiding issues with SQLite's threading limitations while maintaining data consistency.
7. Practical Considerations
7.1 Conversation Length and Token Limits
As conversations grow, they consume more tokens per API call. The current implementation sends the complete history with each request, which can become expensive for long conversations. Production deployments might consider:
Summarization: Periodically summarizing older messages to reduce token count.
Windowing: Only sending the N most recent messages (a minimal sketch of this option follows the list).
Smart Truncation: Using the model to identify and retain the most relevant context.
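A minimal sketch of the windowing option, assuming the GPTMessage class shown earlier, illustrates how little code it requires:

def window_messages(messages: list[GPTMessage], max_messages: int = 20) -> list[GPTMessage]:
    """Keep any leading system message plus the most recent turns (sketch only)."""
    if not messages:
        return messages
    head = [messages[0]] if messages[0].role == "system" else []
    return head + messages[len(head):][-max_messages:]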
7.2 Error Handling
The implementation includes robust error handling for common failure modes:
try:
    api_messages, embedded_images = _prepare_messages_with_inline_images(messages)
except ValueError as e:
    logger.error(f"Failed to attach images for GPT request: {e}")
    send_message(str(e))
    return
Invalid image references fail fast with clear error messages rather than sending malformed requests to the API.
7.3 User Experience
The response format provides all information users need to continue conversations:
[conversation 42] Here's my analysis of the image...
cost: $0.0234, payer: jon
Users can immediately reference conversation 42 in their next message to continue the discussion.
8. Conclusion
Building persistent conversational AI over stateless messaging platforms requires careful consideration of data modeling, state management, and user experience. Our implementation demonstrates that a relatively simple database schema combined with thoughtful command parsing can provide a seamless multi-turn conversation experience.
The inline image reference system shows how platform limitations can be overcome through creative syntax design, allowing users to build rich multi-modal conversations without the messaging platform's native support.
This architecture has proven robust in production, handling concurrent users, long-running conversations, and multi-modal content while maintaining data consistency and providing transparency into API costs. The patterns described here are applicable beyond Signal to any stateless messaging platform where persistent AI conversations are desired.
Building a Nightly AI Code Scanner with vLLM, ROCm, and JIRA Integration
I've been developing a ballistics calculation engine — a Rust physics library with several companion components: a Flask app wrapper with machine-learning capabilities, Python bindings, and a Ruby gem. There are Android and iOS apps, too. The codebase has grown to about 15,000 lines of Rust and another 10,000 lines of Python. At this scale, bugs hide in edge cases: division by zero, floating-point precision issues in transonic drag calculations, unwrap() panics on unexpected input.
What if I could run an AI code reviewer every night while I sleep? Not a cloud API with per-token billing that could run up a $500 bill scanning 50 files, but a local model running on my own hardware, grinding through the codebase and filing JIRA tickets for anything suspicious.
This is the story of building that system.
The Hardware: AMD Strix Halo on ROCm 7.0
I'm running this on a server with an AMD Radeon 8060S (Strix Halo APU) — specifically the gfx1151 architecture. This isn't a data center GPU. It's essentially an integrated GPU with 128GB of shared memory, configured to give 96GB to VRAM and the rest to system RAM. That's not the 80GB of HBM3 you'd get on an H100, but it's enough to run a 32B-parameter model comfortably.
The key insight: for batch processing where latency doesn't matter, you don't need bleeding-edge hardware. A nightly scan can take hours. I'm not serving production traffic; I'm analyzing code files one at a time with a 30-second cooldown between requests. The APU handles this fine.
The HSA_OVERRIDE_GFX_VERSION environment variable is critical. Without it, ROCm doesn't recognize the Strix Halo architecture. This is the kind of sharp edge you hit running ML on AMD consumer hardware.
Model Selection: Qwen2.5-Coder-7B-Instruct
I tested several models:
| Model | Parameters | Context | Quality | Notes |
|-------|------------|---------|---------|-------|
| DeepSeek-Coder-V2-Lite | 16B | 32k | Good | Requires flash_attn (ROCm issues) |
| Qwen3-Coder-30B | 30B | 32k | Excellent | Too slow on APU |
| Qwen2.5-Coder-7B-Instruct | 7B | 16k | Good | Sweet spot |
| TinyLlama-1.1B | 1.1B | 4k | Poor | Too small for code review |
Qwen2.5-Coder-7B-Instruct hits the sweet spot. It understands Rust and Python well enough to spot real issues, runs fast enough to process 50 files per night, and doesn't require flash attention (which has ROCm compatibility issues on consumer hardware).
vLLM Setup
vLLM provides an OpenAI-compatible API server that makes integration trivial. Here's the startup command:
The --max-model-len 16384 limits context to 16k tokens. My code files rarely exceed 500 lines, and longer files are truncated before analysis anyway, so this is plenty. The --gpu-memory-utilization 0.85 leaves headroom for the system.
I run this in a Python venv rather than Docker because ROCm device passthrough with Docker on Strix Halo is finicky. Sometimes you have to choose pragmatism over elegance.
Docker Configuration (When It Works)
For reference, here's the Docker Compose configuration I initially built. It works on dedicated AMD GPUs but has issues on integrated APUs:
The ipc: host and seccomp:unconfined are necessary for ROCm to function properly. The depends_on with service_healthy ensures the scanner waits for vLLM to be fully loaded before starting — important since model loading can take 2-3 minutes.
The confidence_threshold: 0.75 is crucial. Without it, the model reports every minor style issue. At 75%, it focuses on things it's genuinely concerned about.
The review_threshold: 5 triggers a different behavior: if the model finds more than 5 issues, it creates a single summary ticket for manual review rather than flooding JIRA with individual tickets. This is a safety valve for when the model goes haywire.
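Taken together, the two thresholds behave roughly like this sketch; the function names are placeholders, not the scanner's real API:

CONFIDENCE_THRESHOLD = 0.75
REVIEW_THRESHOLD = 5

def triage(findings: list) -> None:
    """Drop low-confidence findings, then decide between individual and summary tickets."""
    confident = [f for f in findings if f.confidence >= CONFIDENCE_THRESHOLD]
    if len(confident) > REVIEW_THRESHOLD:
        create_summary_ticket(confident)      # one ticket flagged for manual review
    else:
        create_individual_tickets(confident)  # one JIRA ticket per finding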
Structured Outputs with Pydantic
LLMs are great at finding issues but terrible at formatting output consistently. Left to their own devices, they'll return findings as markdown, prose, JSON with missing fields, or creative combinations thereof.
The solution is structured outputs. I define Pydantic models for exactly what I expect:
class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"

class FindingType(str, Enum):
    BUG = "bug"
    PERFORMANCE = "performance"
    SECURITY = "security"
    CODE_QUALITY = "code_quality"
    POTENTIAL_ISSUE = "potential_issue"

class CodeFinding(BaseModel):
    file_path: str = Field(description="Path to the file")
    line_start: int = Field(description="Starting line number")
    line_end: Optional[int] = Field(default=None)
    finding_type: FindingType
    severity: Severity
    title: str = Field(max_length=100)
    description: str
    suggestion: Optional[str] = None
    confidence: float = Field(ge=0.0, le=1.0)
    code_snippet: Optional[str] = None
The confidence field is a float between 0 and 1. The model learns to be honest about uncertainty — "I think this might be a bug (0.6)" versus "This is definitely division by zero (0.95)."
In a perfect world, I'd use vLLM's Outlines integration for guided JSON generation. In practice, I found that prompting Qwen for JSON and parsing the response works reliably:
def _analyze_code(self, file_path: str, content: str) -> List[CodeFinding]:
    messages = [
        {"role": "system", "content": self.system_prompt},
        {"role": "user", "content": f"""Analyze this code for bugs and issues.
File: {file_path}

{content}

Return a JSON array of findings. Each finding must have:
- file_path: string
- line_start: number
- finding_type: "bug" | "performance" | "security" | "code_quality"
- severity: "critical" | "high" | "medium" | "low" | "info"
- title: string (max 100 chars)
- description: string
- suggestion: string or null
- confidence: number 0-1
If no issues found, return an empty array: []"""},
    ]

    response = self._call_llm(messages)

    # Parse JSON from response (handles markdown code blocks too)
    if response.strip().startswith('['):
        findings_data = json.loads(response)
    elif '```json' in response:
        json_str = response.split('```json')[1].split('```')[0]
        findings_data = json.loads(json_str)
    elif '[' in response:
        start = response.index('[')
        end = response.rindex(']') + 1
        findings_data = json.loads(response[start:end])
    else:
        return []

    # Validate each finding with Pydantic
    findings = []
    for item in findings_data:
        try:
            finding = CodeFinding(**item)
            findings.append(finding)
        except ValidationError:
            pass  # Skip malformed findings
    return findings
The System Prompt
The system prompt is where you teach the model what you care about. Here's mine:
You are an expert code reviewer specializing in Rust and Python.
Your job is to find bugs, performance issues, security vulnerabilities,
and code quality problems.
You are analyzing code from a ballistics calculation project that includes:
- A Rust physics engine for trajectory calculations
- Python Flask API with ML models
- PyO3 bindings between Rust and Python
Key areas to focus on:
1. Numerical precision issues (floating point errors, rounding)
2. Edge cases in physics calculations (division by zero, negative values)
3. Memory safety in Rust code
4. Error handling (silent failures, unwrap panics)
5. Performance bottlenecks (unnecessary allocations, redundant calculations)
6. Security issues (input validation, injection vulnerabilities)
Be conservative with findings - only report issues you are confident about.
Avoid false positives.
The phrase "Be conservative with findings" is doing heavy lifting. Without it, the model reports everything that looks slightly unusual. With it, it focuses on actual problems.
Timeout Handling
Large files (500+ lines) can take a while to analyze. My initial 120-second timeout caused failures on complex files. I bumped it to 600 seconds (10 minutes):
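The client code isn't shown above, but the shape of the call, assuming vLLM's OpenAI-compatible chat completions endpoint and with the longer timeout applied, is roughly:

import requests

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed local vLLM endpoint

def _call_llm(self, messages: list[dict]) -> str:
    """Method on the scanner class (sketch); payload details are assumptions."""
    resp = requests.post(
        VLLM_URL,
        json={"model": self.model_name, "messages": messages},
        timeout=600,  # was 120 seconds; long files need the extra headroom
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]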
I also truncate files to 300 lines. For longer files, the model only sees the first 300 lines. This is a trade-off — I might miss bugs in the back half of long files — but it keeps scans predictable and prevents timeout cascades. I plan to revisit this in future iterations.
lines = content.split('\n')
if len(lines) > 300:
    content = '\n'.join(lines[:300])
    logger.info("Truncated to 300 lines for analysis")
JIRA Integration
When the scanner finds issues, it creates JIRA tickets automatically. The API is straightforward:
def create_jira_tickets(self, findings: List[CodeFinding]):
    jira_base_url = f"https://{jira_domain}/rest/api/3"
    for finding in findings:
        # Map severity to JIRA priority
        priority_map = {
            Severity.CRITICAL: "Highest",
            Severity.HIGH: "High",
            Severity.MEDIUM: "Medium",
            Severity.LOW: "Low",
            Severity.INFO: "Lowest",
        }
        payload = {
            "fields": {
                "project": {"key": "MBA"},
                "summary": f"[AI] {finding.title}",
                "description": {
                    "type": "doc",
                    "version": 1,
                    "content": [{
                        "type": "paragraph",
                        "content": [{"type": "text", "text": build_description(finding)}],
                    }],
                },
                "issuetype": {"name": "Bug" if finding.finding_type == FindingType.BUG else "Task"},
                "priority": {"name": priority_map[finding.severity]},
                "labels": ["ai-detected", "code-scanner"],
            }
        }
        response = requests.post(
            f"{jira_base_url}/issue",
            json=payload,
            auth=(jira_email, jira_api_key),
            headers={"Content-Type": "application/json"},
        )
The [AI] prefix in the summary makes it obvious these tickets came from the scanner. The ai-detected label allows filtering.
I add a 2-second delay between ticket creation to avoid rate limiting:
time.sleep(2)  # Rate limit protection
Systemd Scheduling
The scanner runs nightly via systemd timer:
# /etc/systemd/system/code-scanner.timer
[Unit]
Description=Run Code Scanner nightly at 11pm

[Timer]
OnCalendar=*-*-* 23:00:00
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target
The RandomizedDelaySec=300 adds up to 5 minutes of random delay. This prevents the scanner from always starting at exactly 11:00:00, which helps if multiple services share the same schedule.
The service unit is a oneshot that runs the scanner script:
The TimeoutStartSec=25200 (7 hours) gives the scanner enough time to complete even if it scans every file.
Sample Findings
Here's what the scanner actually finds. From a recent run:
{
  "file_path": "/home/alex/projects/ballistics-engine/src/fast_trajectory.rs",
  "line_start": 115,
  "finding_type": "bug",
  "severity": "high",
  "title": "Division by zero in fast_integrate when velocity approaches zero",
  "description": "The division dt / velocity_magnitude could result in division by zero if the projectile stalls (velocity_magnitude = 0). This can happen at the apex of a high-angle shot.",
  "suggestion": "Add a check for velocity_magnitude < epsilon before division, or clamp to a minimum value.",
  "confidence": 0.85
}
This is a real issue. In ballistics calculations, a projectile fired at a high angle momentarily has zero horizontal velocity at the apex. Without a guard, this causes a panic.
Not every finding is valid. The model occasionally flags intentional design decisions as "issues." But at a 75% confidence threshold, the false positive rate is manageable — maybe 1 in 10 findings needs to be closed as "not a bug."
Trade-offs and Lessons
What works well:
- Finding numerical edge cases (division by zero, overflow)
- Spotting unwrap() calls on Options that might be None
- Identifying missing error handling
- Flagging dead code and unreachable branches
What doesn't work as well:
- Understanding business logic (the model doesn't know physics)
- Spotting subtle race conditions in concurrent code
- False positives on intentional patterns
Operational lessons:
- Start with a low iteration limit (10-20 files) to test the pipeline
- Monitor the first few runs manually before trusting it
- Keep credentials in .env files excluded from rsync
- The 300-line truncation is aggressive; consider chunking for long files
Handling JSON Parse Failures
Despite asking for JSON, LLMs sometimes produce malformed output. I see two failure modes:
Truncated JSON: The model runs out of tokens mid-response, leaving an unterminated string or missing closing brackets.
Wrapped JSON: The model adds explanatory text around the JSON, like "Here are the findings:" before the array.
My parser handles both:
def parse_findings_response(response: str) -> list:
    """Extract JSON from potentially messy LLM output."""
    response = response.strip()

    # Best case: raw JSON array
    if response.startswith('['):
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            pass  # Fall through to extraction

    # Common case: JSON in markdown code block
    if '```json' in response:
        try:
            json_str = response.split('```json')[1].split('```')[0]
            return json.loads(json_str)
        except (IndexError, json.JSONDecodeError):
            pass

    # Fallback: extract JSON array from surrounding text
    if '[' in response and ']' in response:
        try:
            start = response.index('[')
            end = response.rindex(']') + 1
            return json.loads(response[start:end])
        except json.JSONDecodeError:
            pass

    # Give up
    logger.warning("Could not extract JSON from response")
    return []
When parsing fails, I log the error and skip that file rather than crashing the entire scan. In a typical 50-file run, I see 2-3 parse failures — annoying but acceptable.
Testing the Pipeline
Before trusting the scanner with JIRA ticket creation, I ran it in "dry run" mode:
# Set max iterations low and disable JIRA
export MAX_ITERATIONS=5
# In config: jira.enabled: false

python run_scanner_direct.py
This scans just 5 files and prints findings without creating tickets. I manually reviewed each finding:
True positive: Division by zero in trajectory calculation — good catch
False positive: Flagged intentional unwrap() on a guaranteed-Some Option — needs better context
True positive: Dead code path never executed — valid cleanup suggestion
Marginal: Style suggestion about variable naming — below my quality threshold
After tuning the confidence threshold and system prompt, the true positive rate improved to roughly 90%.
Monitoring and Observability
The scanner writes detailed logs to stdout and a JSON results file. Sample log output:
I keep the last 30 result files (configurable) for historical comparison. Eventually I'll build a dashboard showing finding trends over time.
What's Next
The current system is batch-oriented: run once per night, file tickets, done. Future improvements I'm considering:
Pre-commit integration: Run on changed files only, fast enough for CI
Retrieval-augmented context: Include related files when analyzing (e.g., when scanning a function, include its callers)
Learning from feedback: Track which tickets get closed as "not a bug" and use that to tune prompts
Multi-model ensemble: Run the same code through two models, only file tickets when both agree
For now, though, the simple approach works. Every morning I check JIRA, triage the overnight findings, and fix the real bugs. The model isn't perfect, but it finds things I miss. And unlike a human reviewer, it never gets tired, never skips files, and never has a bad day.
Get the Code
I've open-sourced the complete scanner implementation on GitHub: llm-code-scanner
The project includes:
Dual scanning modes: Fast nightly scans via vLLM and comprehensive weekly analyses through Ollama
Smart deduplication: SQLite database prevents redundant issue tracking across runs
JIRA integration: Automatically creates tickets for findings above your confidence threshold
Email reports: SendGrid integration for daily/weekly summaries
Multi-language support: Python, Rust, TypeScript, Kotlin, Swift, Go, and more
To get started, clone the repo, configure your scanner_config.yaml with your vLLM/Ollama server details, and run python -m agent.scanner. The README has full setup instructions including environment variables for JIRA and SendGrid integration.
When developing ballistics-engine, a high-performance ballistics calculation library written in Rust, I faced a challenge: how do I efficiently build and distribute binaries for multiple operating systems and architectures? The answer led to the creation of an automated build orchestration system that leverages diverse hardware—from single-board computers to powerful x86_64 servers—to build native binaries for macOS, Linux, FreeBSD, NetBSD, and OpenBSD across both ARM64 and x86_64 architectures. Now, you are probably wondering why I am bothering to show love for the BSD Trilogy; the answer is simple: because I want to. Sure they are a bit esoteric, but I ran FreeBSD for years as my mail server. I still like the BSDs.
This article explores the architecture, implementation, and lessons learned from building a production-grade multi-platform build system that powers https://ballistics.zip, where users can download pre-built binaries for their platform with a simple curl command.
curl --proto '=https' --tlsv1.2 -sSf https://ballistics.zip/install.sh | sh
The Problem: Cross-Platform Distribution
Rust's cross-compilation capabilities are impressive, but they have limitations:
Cross-compilation complexity: While Rust supports cross-compilation, getting it working reliably for BSD systems (especially with system dependencies) is challenging
Native testing: You need to test on actual hardware to ensure binaries work correctly
Binary compatibility: Different BSD versions and configurations require native builds
Performance verification: Emulated builds may behave differently than native ones
The solution? Build natively on each target platform using actual hardware or high-performance emulation.
Architecture Overview
The build orchestration system consists of three main components:
1. Build Nodes (Physical and Virtual Machines)
macOS systems (x86_64 and aarch64) - Local builds
Linux x86_64 server - Remote build via SSH
FreeBSD ARM64 - Single-board computer (Raspberry Pi 4)
OpenBSD ARM64 - QEMU VM emulated on x86_64 (rig.localnet)
NetBSD x86_64 and ARM64 - QEMU VMs
2. Orchestrator (Python-based coordinator)
Reads build node configuration from build-nodes.yaml
Executes builds in parallel across all nodes
Collects artifacts via SSH/SCP
Generates SHA256 checksums
Uploads to Google Cloud Storage
Updates version metadata
3. Distribution (ballistics.zip website)
Serves install script at https://ballistics.zip
Hosts binaries in GCS bucket (gs://ballistics-releases/)
Provides version detection and automatic downloads
Supports version fallback for platforms with delayed releases
Primary x86_64 Build Server
Role: Linux builds, BSD VM host, emulated ARM64 builds
CPU: Intel i9
RAM: 96GB
IP: 10.1.1.27 (Linux host), 10.1.1.17 (KVM host)
VMs Hosted:
FreeBSD x86_64: 10.1.1.21
OpenBSD x86_64: 10.1.1.20
OpenBSD ARM64 (emulated): 10.1.1.23
NetBSD x86_64: 10.1.1.19
Local macOS Development Machine
Role: macOS binary builds (both architectures)
Build Method: Local cargo builds with target flags
Architectures:
aarch64-apple-darwin (Apple Silicon)
x86_64-apple-darwin (Intel Macs)
A Surprising Discovery: Emulated ARM64 Performance
One of the most interesting findings during development was that emulated ARM64 builds on a powerful x86_64 host are significantly faster than ARM64 builds running on native ARM64 single-board computers.
Performance Comparison
Emulated ARM64 on ARM64: ~99+ minutes per build
Emulated ARM64 on x86_64: 15m 37s ⚡
The emulated build on rig.localnet (running QEMU with KVM acceleration) completed in roughly one-sixth the time of the build on the ARM64 single-board computer. This is because:
The x86_64 server has significantly more powerful CPU cores
QEMU with KVM provides near-native performance for many workloads
Rust compilation is primarily CPU-bound and benefits from faster single-core performance
The x86_64 server has faster storage (NVMe vs eMMC/SD card)
As a result, the native OpenBSD ARM64 node on the Orange Pi is now disabled in favor of the emulated version.
Prerequisites
SSH Key-Based Authentication
Critical: The orchestration system requires passwordless SSH access to all remote build nodes. Here's how to set it up:
Generate SSH key (if you don't have one):
ssh-keygen -t ed25519 -C "build-orchestrator"
Copy public key to each build node:
# For each build node
ssh-copy-id user@build-node-ip

# Examples:
ssh-copy-id alex@10.1.1.27      # Linux x86_64
ssh-copy-id freebsd@10.1.1.7    # FreeBSD ARM64
ssh-copy-id root@10.1.1.20      # OpenBSD x86_64
ssh-copy-id root@10.1.1.23      # OpenBSD ARM64 emulated
ssh-copy-id root@10.1.1.19      # NetBSD x86_64
ssh-copy-id root@10.1.1.15      # NetBSD ARM64
Test SSH access:
ssh user@build-node-ip "uname -a"
Software Requirements
On Build Orchestrator Machine:
Python 3.8+
pyyaml (pip install pyyaml)
Google Cloud SDK (gcloud command) for GCS uploads
SSH client
On Each Build Node:
Rust toolchain (cargo, rustc)
Build essentials (compiler, linker)
curl, wget, or ftp (for downloading source)
Sufficient disk space (~2GB for build artifacts)
BSD-Specific Requirements
NetBSD: Install curl via pkgsrc (native ftp doesn't support HTTPS)
The orchestrator.py script coordinates the entire build process:
Step 1: Parallel Build Execution
import subprocess

def build_on_node(node, version):
    build_command = node['build_command']  # placeholders already expanded for this version
    if node['host'] == 'local':
        # Local build
        subprocess.run(build_command, shell=True, check=True)
    else:
        # Remote build via SSH
        ssh_command = f"ssh {node['host']} '{build_command}'"
        subprocess.run(ssh_command, shell=True, check=True)
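The fan-out across nodes is just a thread pool; each worker blocks on its local or SSH build until it finishes. A sketch of roughly how that could look, not the exact orchestrator code:

from concurrent.futures import ThreadPoolExecutor

def build_all(nodes, version):
    """Run every enabled node's build in parallel and surface any failure."""
    enabled = [n for n in nodes if n.get('enabled', True)]
    with ThreadPoolExecutor(max_workers=len(enabled)) as pool:
        futures = {pool.submit(build_on_node, node, version): node['name'] for node in enabled}
        for future, name in futures.items():
            future.result()  # blocks until that node finishes; re-raises on failure
            print(f"{name}: build complete")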
Step 2: Artifact Collection
import shutil
import subprocess

def collect_artifacts(node, version):
    binary_name = f"ballistics-{version}-{node['name']}"
    if node['host'] == 'local':
        shutil.copy(node['binary_path'], f"./{binary_name}")
    else:
        # Download via SCP
        scp_command = f"scp {node['host']}:{node['binary_path']} ./{binary_name}"
        subprocess.run(scp_command, shell=True, check=True)
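Step 3: Checksums and Upload
The final step, checksums and upload, is not shown above, but it amounts to hashlib plus a gcloud storage cp call. A sketch, where the per-version prefix inside the bucket is my assumption:

import hashlib
import subprocess

def checksum_and_upload(binary_name, version):
    """Write <binary>.sha256 and push both files to the release bucket."""
    with open(binary_name, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    checksum_file = f"{binary_name}.sha256"
    with open(checksum_file, 'w') as f:
        f.write(f"{digest}  {binary_name}\n")
    # gs://ballistics-releases/ is the bucket from this article; the path layout is a guess.
    dest = f"gs://ballistics-releases/{version}/"
    subprocess.run(["gcloud", "storage", "cp", binary_name, checksum_file, dest], check=True)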
Adding a New Build Node
The easiest way to add a new node is using the interactive script:
cd build-orchestrator
./add-node.sh
This will prompt you for:
- Node name (e.g., openbsd-aarch64-emulated)
- SSH host (e.g., root@10.1.1.23 or local)
- Rust target triple (e.g., aarch64-unknown-openbsd)
- Build commands (how to download and build)
- Binary location (where the compiled binary is located)
Manual Configuration
Alternatively, edit build-nodes.yaml directly:
- name: your-new-platform
  host: user@ip-address          # or 'local' for local builds
  target: rust-target-triple
  build_command: |
    # Commands to download source and build
    cd /tmp && rm -rf ballistics-engine-{version}
    curl -L -o v{version}.tar.gz https://github.com/...
    tar xzf v{version}.tar.gz
    cd ballistics-engine-{version}
    cargo build --release
  binary_path: /path/to/compiled/binary
  enabled: true
Variables:
- {version}: Replaced with target version (e.g., 0.13.4)
- {target}: Replaced with Rust target triple
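Expanding those placeholders is plain string substitution over each entry loaded from build-nodes.yaml. A minimal sketch of how that could be done:

import yaml

def load_nodes(version, path="build-nodes.yaml"):
    """Load node definitions and expand {version}/{target} placeholders."""
    with open(path) as f:
        nodes = yaml.safe_load(f)
    for node in nodes:
        node['build_command'] = (node['build_command']
                                 .replace('{version}', version)
                                 .replace('{target}', node['target']))
        # binary_path may or may not contain placeholders; replace() is a no-op if it doesn't.
        node['binary_path'] = node['binary_path'].replace('{version}', version)
    return nodes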
To keep the emulated OpenBSD ARM64 VM running on the x86_64 host, I manage it as a systemd service (note: tee is used because a plain sudo cat > redirect would run as the unprivileged shell):

sudo tee /etc/systemd/system/openbsd-arm64-emulated-vm.service > /dev/null << 'EOF'
[Unit]
Description=OpenBSD ARM64 VM (Emulated on x86_64)
After=network.target

[Service]
Type=simple
User=alex
WorkingDirectory=/opt/bsd-vms/openbsd-arm64-emulated
ExecStart=/opt/bsd-vms/openbsd-arm64-emulated/boot.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable openbsd-arm64-emulated-vm.service
sudo systemctl start openbsd-arm64-emulated-vm.service
Configure networking (assign static IP 10.1.1.23)
Install build tools inside VM:
ssh root@10.1.1.23
pkg_add rust git
Test SSH access:
ssh root@10.1.1.23 "cargo --version"
Add to build-nodes.yaml and test:
./build.sh --version 0.13.3 --dry-run
GitHub Webhook Integration (Optional)
For fully automated builds triggered by GitHub releases, a small webhook listener on the orchestrator machine can invoke build.sh whenever a new release is published.
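A minimal listener needs nothing beyond the Python standard library. This is a sketch of the idea, not my production handler; the port is arbitrary and the skipped signature verification is something you would add for real use:

import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

class ReleaseHook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        event = json.loads(self.rfile.read(length) or b'{}')
        # GitHub sends X-GitHub-Event: release with action == "published" for a new release.
        # A real deployment should also verify the X-Hub-Signature-256 header.
        if (self.headers.get('X-GitHub-Event') == 'release'
                and event.get('action') == 'published'):
            version = event['release']['tag_name'].lstrip('v')
            subprocess.Popen(['./build.sh', '--version', version])
        self.send_response(202)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), ReleaseHook).serve_forever()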
Lessons Learned
SBCs are viable: Raspberry Pi and Orange Pi work well for native builds, though they are noticeably slower
Parallel execution: Running all 7 builds in parallel takes only ~16 minutes (longest pole is FreeBSD ARM64)
Conclusion
Building a custom multi-platform build orchestration system may seem daunting, but the benefits are substantial:
→ Full control: Own your build infrastructure
→ Native builds: Real hardware ensures compatibility
→ Cost-effective: Low operational costs after initial hardware investment
→ Fast iteration: Parallel builds complete in ~16 minutes
→ Flexibility: Easy to add new platforms
→ Learning: Deep understanding of cross-platform development
The surprising discovery that emulated ARM64 on powerful x86_64 hardware outperforms native ARM64 single-board computers has practical implications: you don't always need native hardware for every architecture. Strategic use of emulation can provide better performance while maintaining compatibility.
For projects requiring broad platform support (especially BSD systems not well-served by traditional CI/CD), this approach offers a reliable, maintainable, and cost-effective solution.
The Rockchip RK3588 has emerged as one of the most compelling ARM System-on-Chips (SoCs) for edge AI applications in 2024-2025, featuring a dedicated 6 TOPS Neural Processing Unit (NPU) integrated alongside powerful Cortex-A76/A55 CPU cores. This SoC powers a growing ecosystem of single-board computers and system-on-modules from manufacturers worldwide, including Orange Pi, Radxa, FriendlyElec, Banana Pi, and numerous industrial board makers.
But how does the RK3588's NPU perform in real-world scenarios? In this comprehensive deep dive, I'll share detailed benchmarks of the RK3588 NPU testing both Large Language Models (LLMs) and computer vision workloads, with primary testing on the Orange Pi 5 Max and comparative analysis against the closely-related RK3576 found in the Banana Pi CM5-Pro.
The RK3588 Ecosystem: Devices and Availability
The Rockchip RK3588 powers a diverse range of single-board computers (SBCs) and system-on-modules (SoMs) from multiple manufacturers in 2024-2025:
Consumer SBCs:
Orange Pi 5 Max - Full-featured SBC with up to 16GB RAM, M.2 NVMe, WiFi 6
Radxa ROCK 5B/5B+ - Available with up to 32GB RAM, PCIe 3.0, 8K video output
Boardcon Idea3588 - Compute module with up to 16GB RAM and 256GB eMMC
Theobroma Systems TIGER/JAGUAR - High-reliability modules for robotics and industrial automation
Recent Developments:
RK3588S2 (2024-2025) - Updated variant with modernized memory controllers and platform I/O while maintaining the same 6 TOPS NPU performance
The RK3576, found in devices like the Banana Pi CM5-Pro, shares the same 6 TOPS NPU architecture as the RK3588 but features different CPU cores (Cortex-A72/A53 vs. A76/A55), making it an interesting comparison point for NPU-focused workloads.
Hardware Overview
RK3588 SoC Specifications
Built on an 8nm process, the Rockchip RK3588 integrates:
CPU:
4x ARM Cortex-A76 @ 2.4 GHz (high-performance cores)
4x ARM Cortex-A55 @ 1.8 GHz (efficiency cores)
To put the NPU's TinyLlama 1.1B generation rate of 10-15 tokens/second into reading-speed terms:
LLM tokens ≈ 0.75 words on average (1.33 tokens per word)
10-15 tokens/sec = ~7.5-11.25 words/second
Performance Analysis:
✅ 2-4x faster than reading aloud (2.5-2.7 words/sec)
✅ 2-3x faster than comfortable silent reading (3.3-5 words/sec)
✅ Comparable to speed reading (6.7-11.7 words/sec)
Verdict: The RK3588 NPU running TinyLlama 1.1B generates text significantly faster than most humans can comfortably read, making it well-suited for real-time conversational AI, chatbots, and interactive applications at the edge.
This is particularly impressive for a $180 device consuming only 5-6W of power. Users won't be waiting for the AI to "catch up" - instead, the limiting factor is human reading speed, not the NPU's generation capability.
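For anyone who wants to plug in their own numbers, the conversion behind those figures is just a multiply:

WORDS_PER_TOKEN = 0.75   # ~1.33 tokens per word, per the rule of thumb above

for tokens_per_sec in (10, 15):
    words_per_sec = tokens_per_sec * WORDS_PER_TOKEN
    print(f"{tokens_per_sec} tok/s ≈ {words_per_sec:.2f} words/s ≈ {words_per_sec * 60:.0f} words/min")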
Output Quality Verification
To verify the model produces meaningful, coherent responses, I tested it with several prompts:
Test 1: Factual Question
Prompt: "What is the capital of France?"
Response: "The capital of France is Paris."
✅ Result: Correct and concise answer.
Test 2: Simple Math
Prompt: "What is 2 plus 2?"
Response: "2 + 2 = 4"
✅ Result: Correct mathematical calculation.
Test 3: List Generation
Prompt:"List 3 colors: red,"Response:"Here are three different color options for your text:1. Red2. Orange3.Yellow"
✅ Result: Logical completion with proper formatting.
Observations:
Responses are coherent and grammatically correct
Factual accuracy is maintained after W8A8 quantization
The model understands context and provides relevant answers
Text generation is fluent and natural
No obvious degradation from quantization
Note: The interactive demo tends to continue generating after the initial response, sometimes repeating patterns. This appears to be a demo interface issue rather than a model quality problem - the initial responses to each prompt are consistently accurate and useful.
LLM Findings
Strengths:
Fast model conversion (~1.3 minutes for 1.1B model)
Text generation faster than human reading speed (7.5-11.25 words/sec)
All 3 NPU cores actively utilized
No noticeable quality degradation from W8A8 quantization
Limitations:
RK3588 only supports W8A8 quantization (no W4A16 for better compression)
1.14 GB model size may be limiting for memory-constrained deployments
Max context length: 2048 tokens
RK3588 vs RK3576: NPU Performance Comparison
The RK3576, found in the Banana Pi CM5-Pro, shares the same 6 TOPS NPU architecture as the RK3588 but differs in CPU configuration (Cortex-A72/A53 vs. A76/A55). This provides an interesting comparison for understanding NPU-specific performance versus overall platform capabilities.
Key observations from the comparison:
W4A16 models are smaller (645MB vs 1.14GB for TinyLlama) but may run slower on some models
The NPU architecture is fundamentally the same (6 TOPS, 3 cores), but software stack differences affect performance
For 0.5B models, RK3588 shows ~20% better performance
Larger models benefit from W4A16's memory efficiency on RK3576
Computer Vision Performance:
Both RK3588 and RK3576 share the same NPU architecture for computer vision workloads:
MobileNet V1 on RK3576 (Banana Pi CM5-Pro): ~161.8ms per image (~6.2 FPS)
ResNet18 on RK3588 (Orange Pi 5 Max): 4.09ms per image (244 FPS)
The dramatic performance difference here is primarily due to model complexity (ResNet18 is better optimized for NPU execution than older MobileNet V1) rather than NPU hardware differences.
Practical Implications:
For NPU-focused workloads, both the RK3588 and RK3576 deliver similar AI acceleration capabilities. The choice between platforms should be based on:
CPU performance needs: RK3588's A76 cores are significantly faster
Quantization requirements: RK3576 offers W4A16 for LLMs, RK3588 only W8A8
Model size constraints: W4A16 (RK3576) produces smaller models
Cost considerations: RK3576 platforms (like CM5-Pro at $103) vs RK3588 platforms ($150-180)
PyTorch/TensorFlow models cannot execute directly on the NPU. They must be converted through an AOT (Ahead-of-Time) compilation process. However, this conversion is fast and straightforward.
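For reference, a conversion script with rknn-toolkit2 is only a handful of calls. This is a sketch of the standard workflow rather than my exact script; the mean/std values are the usual ImageNet preprocessing constants and the file names are placeholders:

from rknn.api import RKNN

rknn = RKNN()
# Preprocessing constants: adjust for your model.
rknn.config(mean_values=[[123.675, 116.28, 103.53]],
            std_values=[[58.395, 57.12, 57.375]],
            target_platform='rk3588')
rknn.load_onnx(model='resnet18.onnx')          # exported from PyTorch via torch.onnx.export
rknn.build(do_quantization=True, dataset='./calibration_images.txt')  # INT8 needs calibration images
rknn.export_rknn('resnet18.rknn')
rknn.release()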
Conversion Performance (x86_64)
Converting PyTorch ResNet18 to RKNN format:
Phase | Time | Size | Details
PyTorch → ONNX | 0.25s | 44.6 MB | Fixed batch size, opset 11
ONNX → RKNN | 1.11s | - | INT8 quantization, operator fusion
Export | 0.00s | 11.4 MB | Final .rknn file
Total | 1.37s | 11.4 MB | 25.7% of ONNX size
Model Optimizations:
INT8 quantization (weights and activations)
Automatic operator fusion
Layout optimization for NPU
Target: 3 NPU cores on RK3588
Memory Usage:
Internal memory: 1.1 MB
Weight memory: 11.5 MB
Total model size: 11.4 MB
NPU Inference Performance
Running ResNet18 inference on Orange Pi 5 Max (10 iterations after 2 warmup runs):
Results:
Average Inference Time: 4.09 ms
Min Inference Time: 4.02 ms
Max Inference Time: 4.43 ms
Standard Deviation: ±0.11 ms
Throughput: 244.36 FPS
Initialization Overhead:
NPU initialization: 0.350s (one-time)
Model load: 0.008s (one-time)
Input/Output:
Input: 224×224×3 images (INT8)
Output: 1000 classes (Float32)
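The benchmark loop on the board is little more than the RKNN Lite runtime plus a timer. A sketch of what such a loop looks like; a random tensor stands in for a real preprocessed image:

import time
import numpy as np
from rknnlite.api import RKNNLite

rknn = RKNNLite()
rknn.load_rknn('resnet18.rknn')
rknn.init_runtime(core_mask=RKNNLite.NPU_CORE_0_1_2)   # schedule across all three NPU cores

img = np.random.randint(0, 256, (1, 224, 224, 3), dtype=np.uint8)  # stand-in for a real image

for _ in range(2):                                      # warmup runs
    rknn.inference(inputs=[img])

times = []
for _ in range(10):
    start = time.perf_counter()
    rknn.inference(inputs=[img])
    times.append((time.perf_counter() - start) * 1000)

print(f"avg {sum(times)/len(times):.2f} ms  min {min(times):.2f} ms  max {max(times):.2f} ms")
rknn.release()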
Performance Comparison
Platform | Inference Time | Throughput | Notes
RK3588 NPU | 4.09 ms | 244 FPS | 3 NPU cores, INT8
ARM A76 CPU (est.) | ~50 ms | ~20 FPS | Single core
Desktop RTX 3080 | ~2-3 ms | ~400 FPS | Reference
NPU Speedup | 12x faster than CPU | - | Same hardware
Computer Vision Findings
Strengths:
Extremely fast conversion (<2 seconds)
Excellent inference performance (4.09ms, 244 FPS)
Very consistent latency (±0.11ms)
Efficient quantization (74% size reduction)
12x speedup vs CPU cores on same SoC
Simple Python API for inference
Trade-offs:
INT8 quantization may reduce accuracy slightly
AOT conversion required (no dynamic model execution)
Fixed input shapes required
Technical Deep Dive
NPU Architecture
The RK3588 NPU uses a 3-core design with 6 TOPS of total compute; models can be scheduled onto one, two, or all three cores through the RKNN runtime.
Conclusion
The RK3588 NPU on the Orange Pi 5 Max delivers impressive performance for edge AI applications. With 244 FPS for ResNet18 (4.09ms latency) and 10-15 tokens/second for 1.1B LLMs, it's well-positioned for real-time computer vision and small language model inference.
✅ Good LLM support: 1B-class models run at usable speeds
✅ Outstanding value: $180 for 6 TOPS of NPU performance
✅ Easy to use: Simple Python API, automatic NPU detection
✅ Power efficient: ~5-6W under AI load, 39x better than desktop GPU
✅ PyTorch compatible: Via conversion workflow
⚠️ Conversion required: Cannot run PyTorch/TensorFlow directly
⚠️ Quantization needed: INT8 for best performance
⚠️ Memory constrained: Large models (>2GB) challenging
The RK3588 NPU is an excellent choice for edge AI applications where power efficiency and cost matter. It's not going to replace high-end GPUs for training or large-scale inference, but for deploying computer vision models and small LLMs at the edge, it's one of the best options available today.
The Banana Pi CM5-Pro (also sold as the ArmSoM-CM5) represents Banana Pi's entry into the Raspberry Pi Compute Module 4 form factor market, powered by Rockchip's RK3576 SoC. Released in 2024, this compute module targets developers seeking a CM4-compatible solution with enhanced specifications: up to 16GB of RAM, 128GB of storage, WiFi 6 connectivity, and a 6 TOPS Neural Processing Unit for AI acceleration. With a price point of approximately $103 for the 8GB/64GB configuration and a guaranteed production life until at least August 2034, Banana Pi positions the CM5-Pro as a long-term alternative to Raspberry Pi's official offerings.
After extensive testing, benchmarking, and comparison against contemporary single-board computers including the Orange Pi 5 Max, Raspberry Pi 5, and LattePanda IOTA, the Banana Pi CM5-Pro emerges as a competent but not exceptional offering. It delivers solid performance, useful features including AI acceleration, and good expandability, but falls short of being a clear winner in any specific category. This review examines where the CM5-Pro excels, where it disappoints, and who should consider it for their projects.
Banana Pi CM5-Pro showing the dual 100-pin connectors and CM4-compatible form factor
Hardware Architecture: The Rockchip RK3576
At the heart of the Banana Pi CM5-Pro lies the Rockchip RK3576, a second-generation 8nm SoC featuring a big.LITTLE ARM architecture:
4x ARM Cortex-A72 cores @ 2.2 GHz (high performance)
4x ARM Cortex-A53 cores @ 1.8 GHz (power efficiency)
The Cortex-A72, originally released by ARM in 2015, represents a significant step up from the ancient Cortex-A53 (2012) but still trails the more modern Cortex-A76 (2018) found in Raspberry Pi 5 and Orange Pi 5 Max. The A72 offers approximately 1.8-2x the performance per clock compared to the A53, with better branch prediction, wider execution units, and more sophisticated memory prefetching. However, it lacks the A76's more advanced microarchitecture improvements and typically runs at lower clock speeds (2.2 GHz vs. 2.4 GHz for the A76 in the Pi 5).
The inclusion of four Cortex-A53 efficiency cores alongside the A72 performance cores gives the RK3576 a total of eight cores, allowing it to balance power consumption and performance. In practice, this means the system can handle background tasks and light workloads on the A53 cores while reserving the A72 cores for demanding applications. The big.LITTLE scheduler in the Linux kernel attempts to make intelligent decisions about which cores to use for which tasks, though the effectiveness varies depending on workload characteristics.
Beyond the SoC itself, the hardware we tested includes:
M.2 NVMe SSD support (our unit had a 932GB NVMe drive installed)
WiFi 6 (802.11ax) and Bluetooth 5.3
Gigabit Ethernet
HDMI 2.0 output supporting 4K@60fps
Multiple MIPI CSI camera interfaces
USB 3.0 and USB 2.0 interfaces via the 100-pin connectors
The LPDDR5 memory is a notable upgrade over the LPDDR4 found in many competing boards, offering higher bandwidth and better power efficiency. In our testing, memory bandwidth didn't appear to be a significant bottleneck for CPU-bound workloads, though applications that heavily stress memory subsystems (large dataset processing, video encoding, etc.) may benefit from the faster RAM.
The inclusion of both eMMC storage and M.2 NVMe support provides excellent flexibility. The eMMC serves as a reliable boot medium with consistent performance, while the NVMe slot allows for high-capacity, high-speed storage expansion. This dual-storage approach is superior to SD card-only solutions, which suffer from reliability issues and inconsistent performance.
WiFi 6 and Bluetooth 5.3 represent current-generation wireless standards, providing better performance and lower latency than the WiFi 5 found in older boards. For robotics applications, low-latency wireless communication can be crucial for remote control and telemetry, making this a meaningful upgrade.
The NPU: 6 TOPS of AI Potential
The RK3576's integrated 6 TOPS Neural Processing Unit is the CM5-Pro's headline AI feature, designed to accelerate machine learning inference workloads. The NPU supports multiple quantization formats (INT4/INT8/INT16/BF16/TF32) and can interface with mainstream frameworks including TensorFlow, PyTorch, MXNet, and Caffe through Rockchip's RKNN toolkit.
In our testing, we confirmed the presence of the NPU hardware at /sys/kernel/iommu_groups/0/devices/27700000.npu and verified that the RKNN runtime library (librknnrt.so) and server (rknn_server) were installed and accessible. To validate real-world NPU performance, we ran MobileNet V1 image classification inference tests using the pre-installed RKNN model.
NPU Inference Benchmarks - MobileNet V1:
Running 10 inference iterations on a 224x224 RGB image (bell.jpg), we measured consistent performance:
Average inference time: 161.8ms per image
Min/Max: 146ms to 172ms
Standard deviation: ~7.2ms
Throughput: ~6.2 frames per second
The model successfully classified test images with appropriate confidence scores across 1,001 ImageNet classes. The inference pipeline includes:
JPEG decoding and preprocessing
Image resizing and color space conversion
INT8 quantized inference on the NPU
FP16 output tensor postprocessing
This demonstrates that the NPU is fully functional and provides practical acceleration for computer vision workloads. The ~160ms inference time for MobileNet V1 is reasonable for edge AI applications, though more demanding models like YOLOv8 or larger classification networks would benefit from the full 6 TOPS capacity.
Rockchip's RKNN toolkit provides a development workflow that converts trained models into RKNN format for efficient execution on the NPU. The process involves:
Training a model using a standard framework (TensorFlow, PyTorch, etc.)
Exporting the model to ONNX or framework-specific format
Converting the model using rknn-toolkit2 on a PC
Quantizing the model to INT8 or other supported formats
Deploying the RKNN model file to the board
Running inference using RKNN C/C++ or Python APIs
This workflow is more complex than simply running a PyTorch or TensorFlow model directly, but the trade-off is significantly improved inference performance and lower power consumption compared to CPU-only execution. For applications like real-time object detection, the 6 TOPS NPU can deliver:
Face recognition: 240fps @ 1080p
Object detection (YOLO-based models): 50fps @ 4K
Semantic segmentation: 30fps @ 2K
These performance figures represent substantial improvements over CPU-based inference, making the NPU genuinely useful for edge AI applications. However, they also require investment in learning the RKNN toolchain, optimizing models for the specific NPU architecture, and managing the conversion pipeline as part of your development workflow.
RKLLM and Large Language Model Support:
To thoroughly test LLM capabilities, we performed end-to-end testing: model conversion on an x86_64 platform (LattePanda IOTA), transfer to the CM5-Pro, and NPU inference validation. RKLLM (Rockchip Large Language Model) toolkit enables running quantized LLMs on the RK3576's 6 TOPS NPU, supporting models including Qwen, Llama, ChatGLM, Phi, Gemma, InternLM, MiniCPM, and others.
LLM Model Conversion Benchmark:
We converted TinyLLAMA 1.1B Chat from Hugging Face format to RKLLM format using an Intel N150-powered LattePanda IOTA:
Total Conversion Time: 264.83 seconds (4.41 minutes)
Output File Size: 644.75 MB (increased from 505 MB due to RKNN format overhead)
The cross-platform requirement is important: RKLLM-Toolkit is distributed as x86_64-only Python wheels, so model conversion must be performed on an x86 PC or VM, not on the ARM-based CM5-Pro itself. Conversion time scales with model size and CPU performance - larger models on slower CPUs will take proportionally longer.
NPU LLM Inference Testing:
After transferring the converted model to the CM5-Pro, we successfully:
✓ Loaded the TinyLLAMA 1.1B model (645 MB) into RKLLM runtime
✓ Initialized NPU with 2-core configuration for W4A16 inference
✓ Verified token generation and text output
✓ Confirmed the model runs on NPU cores (not CPU fallback)
The RKLLM runtime v1.2.2 correctly identified the model configuration (W4A16, max_context=2048, 2 NPU cores) and enabled the Cortex-A72 cores [4,5,6,7] for host processing while the NPU handled inference.
Actual RK3576 LLM Performance (Official Rockchip Benchmarks):
Based on Rockchip's published benchmarks for the RK3576, small language models perform as follows:
For context, the RK3588 (with more powerful NPU) achieves 42.58 tokens/second for Qwen2 0.5B - about 1.85x faster than the RK3576.
Practical Assessment:
The 30-35 tokens/second achieved with 0.5B models is usable for offline chatbots, text classification, and simple Q&A applications, but would feel noticeably slow compared to cloud LLM APIs or GPU-accelerated solutions. Humans typically read at 200-300 words per minute (~50 tokens/second), so 35 tokens/second is borderline for comfortable real-time conversation. Larger models (1.8B+) drop to 13 tokens/second or less, which feels sluggish for interactive use.
The complete workflow (download model → convert on x86 → transfer to ARM → run inference) works as designed but requires infrastructure: an x86 machine or VM for conversion, network transfer for large model files (645 MB), and familiarity with Python environments and RKLLM APIs. For embedded deployments, this is acceptable; for rapid prototyping, it adds friction compared to cloud-based LLM solutions.
Compared to Google's Coral TPU (4 TOPS), the RK3576's 6 TOPS provides 1.5x more computational power, though the Coral benefits from more mature tooling and broader community support. Against the Horizon X3's 5 TOPS, the RK3576 offers 20% more capability with far better CPU performance backing it up. For serious AI workloads, NVIDIA's Jetson platforms (40+ TOPS) remain in a different performance class, but at significantly higher price points and power requirements.
To assess the Banana Pi CM5-Pro's CPU performance, we ran our standard Rust compilation benchmark: building a complex ballistics simulation engine with numerous dependencies from a clean state, three times, and averaging the results. This real-world workload stresses CPU cores, memory bandwidth, compiler performance, and I/O subsystems.
Banana Pi CM5-Pro Compilation Times:
Run 1: 173.16 seconds (2 minutes 53 seconds)
Run 2: 162.29 seconds (2 minutes 42 seconds)
Run 3: 165.99 seconds (2 minutes 46 seconds)
Average: 167.15 seconds (2 minutes 47 seconds)
For context, here's how the CM5-Pro compares to other contemporary single-board computers:
System | CPU | Cores | Average Time | vs. CM5-Pro
Orange Pi 5 Max | Cortex-A55/A76 | 8 (4+4) | 62.31s | 2.68x faster
Raspberry Pi CM5 | Cortex-A76 | 4 | 71.04s | 2.35x faster
LattePanda IOTA | Intel N150 | 4 | 72.21s | 2.31x faster
Raspberry Pi 5 | Cortex-A76 | 4 | 76.65s | 2.18x faster
Banana Pi CM5-Pro | Cortex-A53/A72 | 8 (4+4) | 167.15s | 1.00x (baseline)
The results reveal the CM5-Pro's positioning: it's significantly slower than top-tier ARM and x86 single-board computers, but respectable within its price and power class. The 2.68x performance deficit versus the Orange Pi 5 Max is substantial, explained by the RK3588's newer Cortex-A76 cores running at higher clock speeds (2.4 GHz) with more advanced microarchitecture.
More telling is the comparison to the Raspberry Pi 5 and Raspberry Pi CM5, both featuring four Cortex-A76 cores at 2.4 GHz. Despite having eight cores to the Pi's four, the CM5-Pro is approximately 2.2x slower. This performance gap illustrates the generational advantage of the A76 architecture - the Pi 5's four newer cores outperform the CM5-Pro's four A72 cores plus four A53 cores combined for this workload.
The LattePanda IOTA's Intel N150, despite having only four cores, also outperforms the CM5-Pro by 2.3x. Intel's Alder Lake-N architecture, even in its low-power form, delivers superior single-threaded performance and more effective multi-threading than the RK3576.
However, context matters. The CM5-Pro's 167-second compilation time is still quite usable for development workflows. A project that takes 77 seconds to compile on a Raspberry Pi 5 will take 167 seconds on the CM5-Pro - an additional 90 seconds. For most developers, this difference is noticeable but not crippling. Compile times remain in the "get a coffee" range rather than the "go to lunch" range.
More importantly, the CM5-Pro vastly outperforms older ARM platforms. Compared to boards using only Cortex-A53 cores (like the Horizon X3 CM at 379 seconds), the CM5-Pro is 2.27x faster, demonstrating the value of the Cortex-A72 performance cores.
Geekbench 6 CPU Performance
To provide standardized synthetic benchmarks, we ran Geekbench 6.5.0 on the Banana Pi CM5-Pro:
Geekbench 6 Scores:
Single-Core Score: 328
Multi-Core Score: 1337
These scores reflect the RK3576's positioning as a mid-range ARM platform. The single-core score of 328 indicates modest per-core performance from the Cortex-A72 cores, while the multi-core score of 1337 demonstrates reasonable scaling across all eight cores (4x A72 + 4x A53). For context, the Raspberry Pi 5 with Cortex-A76 cores typically scores around 550-600 single-core and 1700-1900 multi-core, showing the generational advantage of the newer ARM architecture.
Notable individual benchmark results include:
PDF Renderer: 542 single-core, 2904 multi-core
Ray Tracer: 2763 multi-core
Asset Compression: 2756 multi-core
Horizon Detection: 540 single-core
HTML5 Browser: 455 single-core
The relatively strong performance on PDF rendering and asset compression tasks suggests the RK3576 handles real-world productivity workloads reasonably well, though the lower single-core scores indicate that latency-sensitive interactive applications may feel less responsive than on platforms with faster per-core performance.
Full Geekbench results: https://browser.geekbench.com/v6/cpu/14853854
Comparative Analysis: CM5-Pro vs. the Competition
vs. Orange Pi 5 Max
The Orange Pi 5 Max represents the performance leader in our testing, powered by Rockchip's flagship RK3588 SoC with four Cortex-A76 + four Cortex-A55 cores. The 5 Max compiled our benchmark in 62.31 seconds - 2.68x faster than the CM5-Pro's 167.15 seconds.
Key differences:
Performance: The 5 Max's Cortex-A76 cores deliver substantially better single-threaded and multi-threaded performance. For CPU-intensive development work, the performance gap is significant.
NPU: The RK3588 includes a 6 TOPS NPU, matching the RK3576's AI capabilities. Both boards can run similar RKNN-optimized models with comparable inference performance.
Form Factor: The 5 Max is a full-sized single-board computer with on-board ports and connectors, while the CM5-Pro is a compute module requiring a carrier board. This makes the 5 Max more suitable for standalone projects and the CM5-Pro better for embedded integration.
Price: The Orange Pi 5 Max sells for approximately $150-180 with 8GB RAM, compared to $103 for the CM5-Pro. The 5 Max's superior performance comes at a premium, but the cost-per-performance ratio remains competitive.
Memory: Both support up to 16GB RAM, though the 5 Max typically ships with higher-capacity configurations.
Verdict: If raw CPU performance is your priority and you can accommodate a full-sized SBC, the Orange Pi 5 Max is the clear choice. The CM5-Pro makes sense if you need the compute module form factor, want to minimize cost, or have thermal/power constraints that favor the slightly more efficient RK3576.
vs. Raspberry Pi 5
The Raspberry Pi 5, with its Broadcom BCM2712 SoC featuring four Cortex-A76 cores at 2.4 GHz, compiled our benchmark in 76.65 seconds - 2.18x faster than the CM5-Pro.
Key differences:
Performance: The Pi 5's four A76 cores outperform the CM5-Pro's 4+4 big.LITTLE configuration for most workloads. Single-threaded performance heavily favors the Pi 5, while multi-threaded performance depends on whether the workload can effectively utilize the CM5-Pro's additional A53 cores.
NPU: The Pi 5 lacks integrated AI acceleration, while the CM5-Pro includes a 6 TOPS NPU. For AI-heavy applications, this is a significant advantage for the CM5-Pro.
Ecosystem: The Raspberry Pi ecosystem is vastly more mature, with extensive documentation, massive community support, and guaranteed long-term software maintenance. While Banana Pi has committed to supporting the CM5-Pro until 2034, the Pi Foundation's track record inspires more confidence.
Software: Raspberry Pi OS is polished and actively maintained, with hardware-specific optimizations. The CM5-Pro runs generic ARM Linux distributions (Debian, Ubuntu) which work well but lack Pi-specific refinements.
Price: The Raspberry Pi 5 (8GB model) retails for $80, significantly cheaper than the CM5-Pro's $103. The Pi 5 offers better performance for less money - a compelling value proposition.
Expansion: The Pi 5's standard SBC form factor provides easier access to GPIO, HDMI, USB, and other interfaces. The CM5-Pro requires a carrier board, adding cost and complexity but enabling more customized designs.
Verdict: For general-purpose computing, development, and hobbyist projects, the Raspberry Pi 5 is the better choice: faster, cheaper, and better supported. The CM5-Pro makes sense if you specifically need AI acceleration, prefer the compute module form factor, or want more RAM/storage capacity than the Pi 5 offers.
vs. LattePanda IOTA
The LattePanda IOTA, powered by Intel's N150 Alder Lake-N processor with four cores, compiled our benchmark in 72.21 seconds - 2.31x faster than the CM5-Pro.
Key differences:
Architecture: The IOTA uses x86_64 architecture, providing compatibility with a wider range of software that may not be well-optimized for ARM. The CM5-Pro's ARM architecture benefits from lower power consumption and better mobile/embedded software support.
Performance: Intel's N150, despite having only four cores, delivers superior single-threaded performance and competitive multi-threaded performance against the CM5-Pro's eight cores. Intel's microarchitecture and higher sustained frequencies provide an edge for CPU-bound tasks.
NPU: The IOTA lacks dedicated AI acceleration, relying on CPU or external accelerators for machine learning workloads. The CM5-Pro's integrated 6 TOPS NPU is a clear advantage for AI applications.
Power Consumption: The N150 is a low-power x86 chip, but still consumes more power than ARM solutions under typical workloads. The CM5-Pro's big.LITTLE configuration can achieve better power efficiency for mixed workloads.
Form Factor: The IOTA is a small x86 board with Arduino co-processor integration, targeting maker/IoT applications. The CM5-Pro's compute module format serves different use cases, primarily embedded systems and custom carrier board designs.
Price: The LattePanda IOTA sells for approximately $149, more expensive than the CM5-Pro. However, it includes unique features like the Arduino co-processor and x86 compatibility that may justify the premium for specific applications.
Software Ecosystem: x86 enjoys broader commercial software support, while ARM excels in embedded and mobile-focused applications. Choose based on your software requirements.
Verdict: If you need x86 compatibility or want a compact standalone board with Arduino integration, the LattePanda IOTA makes sense despite its higher price. If you're working in ARM-native embedded Linux, need AI acceleration, or want the compute module form factor, the CM5-Pro is the better choice at a lower price point.
vs. Raspberry Pi CM5
The Raspberry Pi Compute Module 5 is the most direct competitor to the Banana Pi CM5-Pro, offering the same CM4-compatible form factor with different specifications. The Pi CM5 compiled our benchmark in 71.04 seconds - 2.35x faster than the CM5-Pro.
Key differences:
Performance: The Pi CM5's four Cortex-A76 cores at 2.4 GHz significantly outperform the CM5-Pro's 4x A72 + 4x A53 configuration. The architectural advantage of the A76 over the A72 translates to approximately 2.35x better performance in our testing.
NPU: The CM5-Pro's 6 TOPS NPU provides integrated AI acceleration, while the Pi CM5 requires external solutions (Hailo-8, Coral TPU) for hardware-accelerated inference. If AI is central to your application, the CM5-Pro's integrated NPU is more elegant.
Memory Options: The CM5-Pro supports up to 16GB LPDDR5, while the Pi CM5 offers up to 8GB LPDDR4X. For memory-intensive applications, the CM5-Pro's higher capacity could be decisive.
Storage: Both offer eMMC options, with the CM5-Pro available up to 128GB and the Pi CM5 up to 64GB. Both support additional storage via carrier board interfaces.
Price: The Raspberry Pi CM5 (8GB/32GB eMMC) sells for approximately $95, slightly cheaper than the CM5-Pro's $103. The CM5-Pro's extra features (more RAM/storage options, integrated NPU) justify the small price premium for those who need them.
Ecosystem: The Pi CM5 benefits from Raspberry Pi's ecosystem, tooling, and community. The CM5-Pro has decent support but can't match the Pi's extensive resources.
Carrier Boards: Both are CM4-compatible, meaning they can use the same carrier boards. However, some boards may not fully support CM5-Pro-specific features, and subtle electrical differences could cause issues in rare cases.
Verdict: For maximum CPU performance in the CM4 form factor, choose the Pi CM5. Its 2.35x performance advantage is significant for compute-intensive applications. Choose the CM5-Pro if you need integrated AI acceleration, more than 8GB of RAM, more than 64GB of eMMC storage, or prefer the better wireless connectivity (WiFi 6 vs. WiFi 5).
Use Cases and Recommendations
Based on our testing and analysis, here are scenarios where the Banana Pi CM5-Pro excels and where alternatives might be better:
Choose the Banana Pi CM5-Pro if you:
Need AI acceleration in a compute module: The integrated 6 TOPS NPU eliminates the need for external AI accelerators, simplifying hardware design and reducing BOM costs. For robotics, smart cameras, or IoT devices with AI workloads, this is a compelling advantage.
Require more than 8GB of RAM: The CM5-Pro supports up to 16GB LPDDR5, double the Pi CM5's maximum. If your application processes large datasets, runs multiple VMs, or needs extensive buffering, the extra RAM headroom matters.
Want high-capacity built-in storage: With up to 128GB eMMC options, the CM5-Pro can store large datasets, models, or applications without requiring external storage. This simplifies deployment and improves reliability compared to SD cards or network storage.
Prefer WiFi 6 and Bluetooth 5.3: Current-generation wireless standards provide better performance and lower latency than WiFi 5. For wireless robotics control or IoT applications with many connected devices, WiFi 6's improvements are meaningful.
Value long production lifetime: Banana Pi's commitment to produce the CM5-Pro until August 2034 provides assurance for commercial products with multi-year lifecycles. You can design around this module without fear of it being discontinued in 2-3 years.
Have thermal or power constraints: The RK3576's 8nm process and big.LITTLE architecture can deliver better power efficiency than always-on high-performance cores, extending battery life or reducing cooling requirements for fanless designs.
Choose alternatives if you:
Prioritize raw CPU performance: The Raspberry Pi 5, Pi CM5, Orange Pi 5 Max, and LattePanda IOTA all deliver significantly faster CPU performance. If your application is CPU-bound and doesn't benefit from the NPU, these platforms are better choices.
Want the simplest development experience: The Raspberry Pi ecosystem's polish, documentation, and community support make it the easiest platform for beginners and rapid prototyping. The Pi 5 or Pi CM5 will get you running faster with fewer obstacles.
Need maximum AI performance: NVIDIA Jetson platforms provide 40+ TOPS of AI performance with mature CUDA/TensorRT tooling. If AI is your primary workload, the investment in a Jetson module is worthwhile despite higher costs.
Require x86 compatibility: The LattePanda IOTA or other x86 platforms provide better software compatibility for commercial applications that depend on x86-specific libraries or software.
Work with standard SBC form factors: If you don't need a compute module and prefer the convenience of a full-sized SBC with onboard ports, the Orange Pi 5 Max or Raspberry Pi 5 are better choices.
The NPU in Practice: RKNN Toolkit and Ecosystem
While we didn't perform exhaustive AI benchmarking, our exploration of the RKNN ecosystem reveals both promise and challenges. The infrastructure exists: the NPU hardware is present and accessible, the runtime libraries are installed, and documentation is available from both Rockchip and Banana Pi. The RKNN toolkit can convert mainstream frameworks to NPU-optimized models, and community examples demonstrate YOLO11n object detection running successfully on the CM5-Pro.
However, the RKNN development experience is not as streamlined as more mature ecosystems. Converting and optimizing models requires learning Rockchip-specific tools and workflows. Debugging performance issues or accuracy degradation during quantization demands patience and experimentation. The documentation is improving but remains fragmented across Rockchip's official site, Banana Pi's docs, and community forums.
For developers already familiar with embedded AI deployment, the RKNN workflow will feel familiar - it follows similar patterns to TensorFlow Lite, ONNX Runtime, or other edge inference frameworks. For developers new to edge AI, the learning curve is steeper than cloud-based solutions but gentler than some alternatives (looking at you, Hailo's toolchain).
The 6 TOPS performance figure is real and achievable for properly optimized models. INT8 quantized YOLO models can indeed run at 50fps @ 4K, and simpler models scale accordingly. The NPU's support for INT4 and BF16 formats provides flexibility for trading off accuracy versus performance. For many robotics and IoT applications, the 6 TOPS NPU hits a sweet spot: enough performance for useful AI workloads, integrated into the SoC to minimize complexity and cost, and accessible through reasonable (if not perfect) tooling.
Build Quality and Physical Characteristics
The Banana Pi CM5-Pro adheres to the Raspberry Pi CM4 mechanical specification, featuring dual 100-pin high-density connectors arranged in the standard layout. Physical dimensions match the CM4, allowing drop-in replacement in compatible carrier boards. Our sample unit appeared well-manufactured with clean solder joints, proper component placement, and no obvious defects.
The module includes an on-board WiFi/Bluetooth antenna connector (U.FL/IPEX), power management IC, and all necessary supporting components. Unlike some compute modules that require extensive external components on the carrier board, the CM5-Pro is relatively self-contained, simplifying carrier board design.
Thermal performance is adequate but not exceptional. Under sustained load during our compilation benchmarks, the SoC reached temperatures requiring thermal management. For applications running continuous AI inference or heavy CPU workloads, active cooling (fan) or substantial passive cooling (heatsink and airflow) is recommended. The carrier board design should account for thermal dissipation, especially if the module will be enclosed in a case.
Software and Ecosystem
The CM5-Pro ships with Banana Pi's custom Debian-based Linux distribution, featuring a 6.1.75 kernel with Rockchip-specific patches and drivers. In our testing, the system worked well out of the box: networking functioned, sudo worked (refreshingly, after the Horizon X3 CM disaster), and package management operated normally.
The distribution includes pre-installed RKNN libraries and tools, enabling NPU development without additional setup. Python 3 and essential development packages are available, and standard Debian repositories provide access to thousands of additional packages. For developers comfortable with Debian/Ubuntu, the environment feels familiar and capable.
However, the software ecosystem lags behind Raspberry Pi's. Raspberry Pi OS includes countless optimizations, hardware-specific integrations, and utilities that simply don't exist for Rockchip platforms. Camera support, GPIO access, and peripheral interfaces work, but often require more manual configuration or programming compared to the Pi's plug-and-play experience.
Third-party software support varies. Popular frameworks like ROS2, OpenCV, and TensorFlow compile and run without issues. Hardware-specific accelerators (GPU, NPU) may require additional configuration or custom builds. Overall, the software situation is "good enough" for experienced developers but not as polished as the Raspberry Pi ecosystem.
Banana Pi's documentation has improved significantly over the years, with reasonably comprehensive guides covering basic setup, GPIO usage, and RKNN deployment. Community support exists through forums and GitHub, though it's smaller and less active than Raspberry Pi's communities. Expect to do more troubleshooting independently and rely less on finding someone who's already solved your exact problem.
Conclusion: A Capable Platform for Specific Niches
The Banana Pi CM5-Pro is a solid, if unspectacular, compute module that serves specific niches well while falling short of being a universal recommendation. Its combination of integrated 6 TOPS NPU, up to 16GB RAM, WiFi 6 connectivity, and CM4-compatible form factor creates a unique offering that competes effectively against alternatives when your requirements align with its strengths.
For projects needing AI acceleration in a compute module format, the CM5-Pro is arguably the best choice currently available. The integrated NPU eliminates the complexity and cost of external AI accelerators while delivering genuine performance improvements for inference workloads. The RKNN toolkit, while imperfect, provides a workable path to deploying optimized models. If your robotics platform, smart camera, or IoT device depends on local AI processing, the CM5-Pro deserves serious consideration.
For projects requiring more than 8GB of RAM or more than 64GB of storage in a compute module, the CM5-Pro is the only game in town among CM4-compatible options. This makes it the default choice for memory-intensive applications that need the compute module form factor.
For general-purpose computing, development, or applications where AI is not central, the Raspberry Pi CM5 is the better choice. Its 2.35x performance advantage is substantial and directly translates to faster build times, quicker application responsiveness, and better user experience. The Pi's ecosystem advantages further tip the scales for most users.
Our compilation benchmark results - 167 seconds for the CM5-Pro versus 71-77 seconds for Pi5/CM5 - illustrate the performance gap clearly. For development workflows, this difference is noticeable but workable. Most developers can tolerate the CM5-Pro's slower compilation times if other factors (AI acceleration, RAM capacity, price) favor it. But if maximum CPU performance is your priority, look elsewhere.
The comparison to the Orange Pi 5 Max reveals a significant performance gap (62 vs. 167 seconds), but also highlights different market positions. The 5 Max is a full-featured SBC designed for standalone use, while the CM5-Pro is a compute module designed for embedded integration. They serve different purposes and target different applications.
Against the LattePanda IOTA's x86 architecture, the CM5-Pro trades x86 compatibility for better power efficiency, integrated AI, and lower cost. The choice between them depends entirely on software requirements - x86-specific applications favor the IOTA, while ARM-native embedded applications favor the CM5-Pro.
The Banana Pi CM5-Pro earns a qualified recommendation: excellent for AI-focused embedded projects, good for high-RAM compute module applications, acceptable for general embedded Linux development, and not recommended if raw CPU performance or ecosystem maturity are priorities. At $103 for the 8GB/64GB configuration, it offers reasonable value for applications that leverage its strengths, though it won't excite buyers seeking the fastest or cheapest option.
If your project needs:
AI acceleration integrated into a compute module
More than 8GB RAM in CM4 form factor
WiFi 6 and current wireless standards
Guaranteed long production life (until 2034)
Then the Banana Pi CM5-Pro is a solid choice that delivers on its promises.
If your project needs:
Maximum CPU performance
The most polished software ecosystem
The easiest development experience
The lowest cost
Then the Raspberry Pi CM5 or Pi 5 remains the better option.
The CM5-Pro occupies a middle ground: not the fastest, not the cheapest, not the easiest, but uniquely capable in specific areas. For the right application, it's exactly what you need. For others, it's a compromise that doesn't quite satisfy. Choose accordingly.
Recommendation: Good choice for AI-focused embedded projects requiring compute module form factor; not recommended if raw CPU performance is the priority.
Review Date: November 3, 2025
Hardware Tested: Banana Pi CM5-Pro (ArmSoM-CM5) with 4GB RAM, 29GB eMMC, 932GB NVMe SSD
OS Tested: Banana Pi Debian (based on Debian GNU/Linux), kernel 6.1.75
Conclusion: Solid middle-ground option with integrated AI acceleration; best for specific niches rather than general-purpose use.