The Rise and Fall of Infocom: How Zork Changed Gaming Forever

Before there were photorealistic graphics and ray tracing, before polygons and sprites, there was something far more powerful: your imagination. In the late 1970s, a group of MIT hackers created a game that would define an entire genre and influence gaming for decades to come. This is the story of Infocom and their masterpiece, Zork.

Zork - the text adventure that launched an empire

The Birth of Interactive Fiction

The story of Zork begins not with Infocom, but with a cave. In 1975, Will Crowther, a programmer at Bolt, Beranek and Newman (the company that built the ARPANET), created a text-based game called Adventure (also known as Colossal Cave Adventure). Crowther was an avid caver, and he combined his knowledge of Kentucky's Mammoth Cave system with fantasy elements inspired by Dungeons & Dragons. The game spread across ARPANET like wildfire, captivating programmers at universities and research institutions worldwide.

Among those captivated were several students and staff at MIT's Laboratory for Computer Science, working on the Dynamic Modeling Group's PDP-10 mainframe. In 1977, Marc Blank, Dave Lebling, Tim Anderson, and Bruce Daniels decided they could do better. They set out to create a more sophisticated adventure game with a richer world, better puzzles, and a far more capable text parser that could understand complex English sentences.

They called their creation Zork - a nonsense word used as a placeholder during MIT projects. The name stuck.

In this 1985 interview from Computer Chronicles, Dave Lebling discusses Infocom's approach to interactive fiction and what made their games unique:

Zork: The Great Underground Empire

The original mainframe Zork was massive by the standards of the day. Written in MDL (a LISP-like language), it occupied over a megabyte of memory - an absurd amount when most personal computers had 16KB or 32KB at most. The game featured hundreds of locations, dozens of objects, and a parser that could understand sentences like "put the jeweled egg in the trophy case" or "attack the troll with the elvish sword."

The premise was simple yet irresistible: you are an adventurer exploring the ruins of the Great Underground Empire, a vast subterranean realm filled with treasures, dangers, and puzzles. The white house. The mailbox. The brass lantern. The thief. The grues lurking in the darkness. These elements became iconic, instantly recognizable to anyone who played.

What set Zork apart from Adventure was its sophistication. The parser understood prepositions, adjectives, and complex commands. The game responded intelligently to absurd inputs, often with dry wit. The puzzles were clever and interconnected. And the writing - sparse, evocative, occasionally hilarious - created a world more vivid than any graphics could render.

"It is pitch black. You are likely to be eaten by a grue."

That single sentence has become one of gaming's most famous lines.

From Mainframe to Microcomputer: The Founding of Infocom

By 1979, the Zork creators faced a problem and an opportunity. The game was hugely popular among the relatively small community of people with access to mainframes, but personal computers were beginning to proliferate. The Apple II, TRS-80, and Commodore PET were bringing computing into homes. Could Zork be brought to these machines?

The technical challenges were formidable. Zork was written in MDL for a PDP-10 with vast resources. Personal computers had neither the memory nor the processing power to run it directly. The solution came from Joel Berez, a recent MIT graduate, and Marc Blank: create a virtual machine.

They called it the Z-machine. Instead of porting Zork directly to each platform, they would write a compact virtual machine interpreter for each computer and compile Zork into a bytecode format (called a "story file") that any Z-machine could run. This was brilliant engineering - write once, run anywhere, more than a decade before Java made the concept mainstream.

The Z-machine's influence extended beyond Infocom. When Lucasfilm Games faced the same cross-platform challenge in 1987, Ron Gilbert and Aric Wilmunder created SCUMM (Script Creation Utility for Maniac Mansion) - a scripting engine and virtual machine that powered Maniac Mansion, Monkey Island, and dozens of other LucasArts adventures. While there's no documented direct lineage, the architectural similarity is unmistakable: both systems abstracted game logic from hardware, both used bytecode interpreters, and both solved the 1980s problem of incompatible platforms with elegant virtualization. Today, both live on through modern reimplementations - Frotz and other interpreters run original Infocom story files, while ScummVM preserves the LucasArts catalog.

In June 1979, the founders incorporated Infocom in Cambridge, Massachusetts. The company's mission was straightforward: bring Zork to personal computers and see if people would pay for it.

They split the massive mainframe Zork into three parts, each substantial enough to be a full game. Zork I: The Great Underground Empire shipped in 1980 for the PDP-11 and appeared on the TRS-80 in 1981. It was an immediate hit.

The Golden Age of Infocom

The early 1980s were Infocom's golden years. The Z-machine proved its worth, allowing the company to release games simultaneously across multiple platforms with minimal porting effort. After completing the Zork trilogy (Zork I, Zork II: The Wizard of Frobozz, and Zork III: The Dungeon Master), Infocom expanded into other genres.

Deadline (1982) pioneered the interactive mystery, dropping players into a murder investigation with real-time elements and multiple endings. Suspended (1983) put players in control of an entire planetary complex through six robots, each with different capabilities. Planetfall (1983) introduced humor and emotional depth through Floyd, a childlike robot companion whose fate remains one of gaming's most affecting moments.

The games were uniformly excellent, but Infocom distinguished itself in another way: packaging. Each game came in an elaborate box filled with "feelies" - physical props that enhanced the experience and served as copy protection. A Deadline box contained a police interview transcript, pills from the crime scene, and a coroner's report. The Hitchhiker's Guide to the Galaxy included pocket fluff, a button proclaiming "Don't Panic," and a pair of Peril Sensitive Sunglasses (opaque black cardboard glasses).

This was premium entertainment at premium prices, and customers loved it.

The Hitchhiker's Guide and Literary Ambitions

In 1984, Infocom achieved something remarkable: a collaboration with Douglas Adams, author of The Hitchhiker's Guide to the Galaxy. Adams worked directly with designer Steve Meretzky to create an interactive version of his beloved absurdist science fiction comedy.

The game was deliberately, gloriously unfair. It featured puzzles that couldn't be solved without items collected hours earlier, locked players in unwinnable states without warning, and regularly killed the player for seemingly innocuous actions. It was also brilliant, capturing Adams's wit and subversive humor perfectly. The game sold roughly 350,000 copies, making it one of Infocom's biggest hits.

But Hitchhiker's also revealed Infocom's split personality. The company saw itself not just as a game developer but as a publisher of "interactive fiction" - a literary form deserving respect alongside traditional novels. This artistic ambition was genuine and produced remarkable work like A Mind Forever Voyaging (1985), which explored a dystopian future through time-jumping vignettes, and Trinity (1986), a meditation on nuclear weapons that remains one of the most ambitious games ever made.

The problem was that artistic ambition doesn't always translate to commercial success.

The Business That Broke Them

Infocom's downfall didn't come from their games - it came from a database product called Cornerstone.

By 1984, Infocom was generating significant revenue from games, but investors and management believed the future lay in business software. They developed Cornerstone, a relational database for business users that was technically impressive and received strong reviews. It was also expensive to develop and entered a market dominated by Ashton-Tate's dBase.

The Cornerstone disaster was comprehensive. Development costs ballooned. Sales were disappointing. By 1985, Infocom was losing money rapidly despite strong game sales. The company that had bootstrapped itself on games now needed outside investment to survive.

Enter Activision.

Acquisition by Activision

In 1986, Activision acquired Infocom for approximately $7.5 million. On paper, the deal made sense: Activision got a prestigious studio with a loyal customer base, and Infocom got capital to continue operations.

In practice, the acquisition was contentious from the start. Activision was a very different company - focused on graphics-heavy console and computer games, not text adventures. Cultural clashes were immediate and severe. The Cornerstone division was shut down, but the damage was done: Infocom had lost money, lost momentum, and lost the confidence of its new parent.

Activision pressed Infocom to add graphics to their games, believing text-only adventures were becoming obsolete. Games like Beyond Zork (1987) and Zork Zero (1988) featured graphical elements, attempting to bridge old and new paradigms. The results were mixed - loyal fans felt betrayed while new audiences weren't convinced.

The market was shifting away from text adventures. The NES was ascendant. Computer games were becoming increasingly graphical. Infocom's sales declined. Staff left or were laid off. Marc Blank had departed by 1986. Dave Lebling stayed until 1989.

In 1989, Activision finally closed the Infocom offices in Cambridge. Shogun (1989) was among the last games released under the original Infocom banner. The company that had defined a genre was effectively dead, absorbed into its acquirer after just three years.

The Intellectual Property Journey

But Zork didn't die with Infocom. The intellectual property continued its journey through corporate America's M&A machinery.

Activision attempted to revive the brand periodically. Return to Zork (1993) was a graphical adventure game that traded text for full-motion video. Zork Nemesis (1996) and Zork: Grand Inquisitor (1997) followed in the Myst-inspired point-and-click adventure style. These games sold reasonably well but felt distant from Infocom's text-based origins.

Activision merged with Vivendi Games, Blizzard's parent company, in 2008, forming Activision Blizzard. The Infocom and Zork properties came along for the ride, sitting in a vast intellectual property portfolio that included Call of Duty, World of Warcraft, and Candy Crush (after the King acquisition in 2016).

Then came the biggest acquisition in gaming history. In January 2022, Microsoft announced it would acquire Activision Blizzard for $68.7 billion. After regulatory battles across multiple continents, the deal closed in October 2023. Microsoft now owned not just Call of Duty and World of Warcraft, but also Zork.

The game that began as a project by MIT hackers in 1977 now belongs to one of the world's largest technology companies. Somewhere in Microsoft's intellectual property databases, alongside Minecraft, Halo, and Doom, sits the rights to a brass lantern, a rubber raft, and a maze of twisty little passages, all alike.

The Legacy of Infocom

Infocom's direct output was relatively small: about 35 games over ten years. But their influence far exceeds their catalog.

The Z-machine was ahead of its time. Its philosophy of platform-independent bytecode influenced later virtual machines and interpreters. When fans reverse-engineered the format in the 1990s, they created the Inform programming language, which remains the dominant tool for creating interactive fiction today. New Z-machine interpreters exist for every platform imaginable, from modern web browsers to vintage calculators.

The writing standards Infocom established remain the benchmark. Their games proved that text could create experiences as immersive as any visual medium - that a skilled writer could make a player feel genuine fear, joy, grief, or triumph through words alone.

The "Great Underground Empire" and its associated lore have become part of gaming's shared cultural heritage. Grues are referenced in everything from World of Warcraft to Minecraft. The concept of "interactive fiction" that Infocom championed has evolved into a thriving indie scene, producing works that push the boundaries of what games can be.

Most importantly, Infocom proved that games could aspire to be art. Not every game needs to chase that goal, but the possibility exists because companies like Infocom demonstrated it was achievable.

Playing Zork Today

The original Zork games remain playable and are freely available online. Various interpreters run the original story files on any modern system. Fire up a Z-machine interpreter, load ZORK1.DAT, and you'll see:

ZORK I: The Great Underground Empire
Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded front door.
There is a small mailbox here.

>

Or better yet - play it right here. The embedded player below uses Encrusted, a Z-machine interpreter written in Rust and compiled to WebAssembly. Zork I loads automatically - just start typing commands.

Click to play Zork

Forty-five years later, the magic remains intact. No graphics. No sound beyond your imagination. Just words on a screen and a world waiting to be explored.

Open the mailbox.

CP/M 2.2 on Physical Hardware: RetroShield Z80 with 1MB DRAM and SD Card Storage

There's a profound difference between emulation and the real thing. While my previous post covered running CP/M on a software-based Z80 emulator, this post documents the journey of bringing CP/M 2.2 to life on actual Z80 silicon - a real Zilog Z80 CPU executing real machine code, with 1MB of DRAM and SD card storage for disk images.

The result? A fully functional CP/M system running Zork, all on an Arduino Mega 2560 acting as the glue between vintage and modern technology.

The Hardware Stack

Building a working CP/M system requires three essential components: a CPU, memory, and storage. Here's what I used:

The complete hardware stack: Arduino Mega 2560, KDRAM2560 DRAM shield, RetroShield Z80, and SD card module

The RetroShield Z80

The RetroShield from 8bitforce is a clever piece of engineering. It's a shield that holds a real Z80 CPU and lets an Arduino Mega control it cycle-by-cycle. The Arduino provides the clock, handles bus transactions, and emulates peripherals - but the Z80 is doing the actual computation.

Close-up of the RetroShield Z80 with a Zilog Z84C0004PSC CPU - the real silicon that runs CP/M

The RetroShield uses nearly every pin on the Arduino Mega:

Function               Arduino Pins
Address Bus (A0-A15)   Pins 22-37
Data Bus (D0-D7)       Pins 42-49
Control Signals        Pins 38-41, 50-53

This pin-hungry design means we need to be creative about adding peripherals.

KDRAM2560: 1MB of Dynamic RAM

The KDRAM2560 is another 8bitforce product - a DRAM shield that provides a full megabyte of memory to the Arduino Mega. It uses the analog pins (A0-A15) for its interface, leaving digital pins available for other uses.

Why DRAM instead of SRAM? Cost and density. A megabyte of SRAM would be expensive and physically large. DRAM is cheap but requires periodic refresh to maintain data integrity. The KDRAM2560 library handles this automatically using one of the Arduino's hardware timers.

For CP/M, we only need 64KB of the available 1MB, but having extra memory opens possibilities for RAM disks or bank switching in future projects.

#define DRAM_REFRESH_USE_TIMER_1
#include <kdram2560.h>

void setup() {
    // Initialize DRAM - this also starts the refresh interrupt
    if (DRAM.begin(&Serial)) {
        Serial.println("KDRAM2560: OK (1MB DRAM)");
    } else {
        Serial.println("KDRAM2560: FAILED!");
        while (1) {}  // Halt
    }
}

The API is beautifully simple:

// Read a byte from any address in the 1MB space
byte data = DRAM.read8(address);

// Write a byte
DRAM.write8(address, data);

Internally, the library handles the complex multiplexed addressing that DRAM requires - splitting the 20-bit address into row and column components, managing RAS/CAS timing, and ensuring refresh cycles happen frequently enough to prevent data loss.
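To make the split concrete, here's the address arithmetic on its own (a sketch that assumes a 10-bit row and 10-bit column, which is typical for 1M-deep parts; the library also drives the RAS/CAS strobes and timing, which aren't shown here):

#include <stdio.h>

int main(void) {
    unsigned long addr = 0x2F5A3;             // an arbitrary 20-bit address in the 1MB space
    unsigned row = (addr >> 10) & 0x3FF;      // upper 10 bits, latched with RAS
    unsigned col = addr & 0x3FF;              // lower 10 bits, latched with CAS
    printf("addr=0x%05lX -> row=0x%03X col=0x%03X\n", addr, row, col);
    return 0;
}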

Software SPI SD Card

Here's where things get interesting. The obvious choice for SD card storage would be the Arduino's hardware SPI on pins 50-53. But look back at that pin table - the RetroShield uses pins 50-53 for Z80 control signals!

The solution is software SPI - bit-banging the SPI protocol on different pins. I chose pins 4-7, safely away from both the RetroShield and KDRAM2560:

SD Card Pin   Arduino Pin
MISO          Pin 4
MOSI          Pin 5
SCK           Pin 6
CS            Pin 7

The SD card module connected via rainbow ribbon cable to the KDRAM2560's prototyping area

The SdFat library supports software SPI through its SoftSpiDriver template class. One important note: you must set SPI_DRIVER_SELECT to 2 in SdFatConfig.h to enable this:

// In SdFat/src/SdFatConfig.h
#define SPI_DRIVER_SELECT 2  // Enable software SPI

Then in your sketch:

#include "SdFat.h"

const uint8_t SOFT_MISO_PIN = 4;
const uint8_t SOFT_MOSI_PIN = 5;
const uint8_t SOFT_SCK_PIN  = 6;
const uint8_t SD_CS_PIN     = 7;

SoftSpiDriver<SOFT_MISO_PIN, SOFT_MOSI_PIN, SOFT_SCK_PIN> softSpi;
#define SD_CONFIG SdSpiConfig(SD_CS_PIN, DEDICATED_SPI, SD_SCK_MHZ(0), &softSpi)

SdFs sd;

void setup() {
    if (sd.begin(SD_CONFIG)) {
        Serial.println("SD Card: OK (Software SPI)");
    }
}

Software SPI is slower than hardware SPI - roughly 20-50 KB/s compared to 1-2 MB/s. For loading programs at boot and occasional disk access, this is perfectly acceptable. You won't notice the difference playing Zork.
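If you're curious what bit-banging actually looks like, transferring one SPI byte in mode 0 is roughly this (an illustrative sketch using the pin constants defined above - the real transfers go through SdFat's SoftSpiDriver, which is considerably more optimized):

uint8_t softSpiTransfer(uint8_t out) {
    uint8_t in = 0;
    for (uint8_t bit = 0; bit < 8; bit++) {
        digitalWrite(SOFT_MOSI_PIN, (out & 0x80) ? HIGH : LOW);  // present MSB first
        out <<= 1;
        digitalWrite(SOFT_SCK_PIN, HIGH);                        // rising edge: both sides sample
        in = (in << 1) | digitalRead(SOFT_MISO_PIN);             // read the card's bit
        digitalWrite(SOFT_SCK_PIN, LOW);                         // falling edge: shift next bit
    }
    return in;
}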

The Complete System Architecture

Here's how all the pieces fit together:

┌─────────────────────────────────────────────────────────────┐
                      Arduino Mega 2560                      
├─────────────────────────────────────────────────────────────┤
  Z80 RetroShield (top layer)                                
    Real Zilog Z80 CPU @ ~100kHz                             
    Address: pins 22-37 (directly mapped)                    
    Data: pins 42-49 (directly mapped)                       
    Control: pins 38-41, 50-53                               
├─────────────────────────────────────────────────────────────┤
  KDRAM2560 (middle layer)                                   
    1MB DRAM via analog pins A0-A15                          
    Timer 1 interrupt for automatic refresh                  
    Only 64KB used for Z80 address space                     
├─────────────────────────────────────────────────────────────┤
  MicroSD Card (external, via jumper wires)                  
    Software SPI on pins 4-7                                 
    FAT32 formatted, 32GB                                   
    Contains: boot.bin, CPM.SYS, A.DSK, B.DSK                
└─────────────────────────────────────────────────────────────┘

Peripheral Emulation: The Arduino's Role

While the Z80 executes code, the Arduino handles peripheral I/O. When the Z80 performs an IN or OUT instruction, the Arduino intercepts it and provides the appropriate response.

MC6850 ACIA (Serial Console)

The console uses a virtual MC6850 ACIA on I/O ports 0x80 (control/status) and 0x81 (data). This connects to the Arduino's Serial interface, which in turn connects to your terminal:

#define ADDR_6850_CONTROL     0x80
#define ADDR_6850_DATA        0x81

// In the I/O read handler:
if (ADDR_L == ADDR_6850_DATA) {
    // Z80 is reading from serial
    prevDATA = Serial.read();
}
else if (ADDR_L == ADDR_6850_CONTROL) {
    // Z80 is checking status
    // Bit 0: Receive data ready
    // Bit 1: Transmit buffer empty
    prevDATA = reg6850_STATUS;
}

// In the I/O write handler:
if (ADDR_L == ADDR_6850_DATA) {
    // Z80 is writing to serial
    Serial.write(DATA_IN);
}

SD Card Interface

The SD card interface uses ports 0x10-0x19, providing commands for file operations and DMA block transfers:

Port               Function
0x10               Command register
0x11               Status register
0x12               Data byte (single-byte I/O)
0x13               Filename character input
0x14-0x15, 0x19    Seek position (24-bit)
0x16-0x17          DMA address (16-bit)
0x18               Block command (0=read, 1=write)
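On the Arduino side, the write half of this port map comes down to a small dispatcher, along the lines of the sketch below. The helper and variable names here (sd_run_command, sd_filename_push, sdSeekPos) are assumptions; sdDmaAddr and the block-transfer routines match the DMA code shown next.

void sd_write_port(uint8_t port, uint8_t value) {
    switch (port) {
        case 0x10: sd_run_command(value);         break;  // command register (helper assumed)
        case 0x13: sd_filename_push((char)value); break;  // accumulate filename characters (helper assumed)
        case 0x14: sdSeekPos = (sdSeekPos & 0xFFFF00UL) | value;                   break;  // seek bits 0-7
        case 0x15: sdSeekPos = (sdSeekPos & 0xFF00FFUL) | ((uint32_t)value << 8);  break;  // seek bits 8-15
        case 0x19: sdSeekPos = (sdSeekPos & 0x00FFFFUL) | ((uint32_t)value << 16); break;  // seek bits 16-23
        case 0x16: sdDmaAddr = (sdDmaAddr & 0xFF00) | value;                       break;  // DMA address low
        case 0x17: sdDmaAddr = (sdDmaAddr & 0x00FF) | ((uint16_t)value << 8);      break;  // DMA address high
        case 0x18:                                                                         // block command
            if (value == 0) sd_do_block_read();
            else            sd_do_block_write();
            break;
    }
}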

The key innovation is the DMA block transfer. Instead of the Z80 reading 128 bytes one at a time through port 0x12 (which would require 128 IN instructions), it sets a DMA address and issues a single block command. The Arduino then copies 128 bytes directly between the SD card and DRAM:

void sd_do_block_read() {
    uint8_t buffer[128];
    sdFile.read(buffer, 128);

    // Copy directly to DRAM
    for (int i = 0; i < 128; i++) {
        DRAM.write8((unsigned long)(sdDmaAddr + i), buffer[i]);
    }
    sdBlockStatus = 0;  // Success
}

This makes disk operations reasonably fast despite the software SPI limitation.
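The write path is the mirror image: pull the sector out of DRAM at the DMA address, then push it to the open file. Roughly, using the same assumed variable names as the read routine:

void sd_do_block_write() {
    uint8_t buffer[128];

    // Copy the sector out of DRAM at the DMA address...
    for (int i = 0; i < 128; i++) {
        buffer[i] = DRAM.read8((unsigned long)(sdDmaAddr + i));
    }

    // ...and push it to the open file on the SD card.
    sdFile.write(buffer, 128);
    sdFile.flush();     // make sure the sector actually reaches the card
    sdBlockStatus = 0;  // success
}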

The Boot Process

When the Arduino powers up, here's what happens:

  1. Arduino Setup
     • Initialize Serial at 115200 baud
     • Initialize KDRAM2560 (starts refresh interrupt)
     • Initialize SD card via software SPI
     • Load boot.bin from SD card into DRAM at address 0x0000 (sketched below)
     • Release Z80 from reset

  2. Z80 Boot Loader (boot.bin)
     • Initialize the MC6850 ACIA
     • Print boot banner
     • Open CPM.SYS from SD card
     • Load it into DRAM at 0xE000 (53 sectors = 6,784 bytes)
     • Jump to BIOS cold start at 0xF600

  3. CP/M BIOS Cold Start
     • Initialize disk variables
     • Set up page zero jump vectors
     • Print the welcome message
     • Jump to CCP (Console Command Processor)

  4. You see the A> prompt!
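The Arduino-side step of loading boot.bin into DRAM is just a read-and-copy loop over the SD file. A minimal sketch using the sd and DRAM objects from earlier (the function name and messages are illustrative; the real sketch has more error handling):

void load_boot_image() {
    FsFile boot = sd.open("boot.bin", O_RDONLY);
    if (!boot) {
        Serial.println("boot.bin not found!");
        while (1) {}  // nothing sensible to do without a boot loader
    }

    unsigned long addr = 0x0000;
    uint8_t buffer[128];
    int n;
    while ((n = boot.read(buffer, sizeof(buffer))) > 0) {
        for (int i = 0; i < n; i++) {
            DRAM.write8(addr++, buffer[i]);  // copy into the Z80's address space
        }
    }
    boot.close();

    Serial.print("Loaded ");
    Serial.print(addr);
    Serial.println(" bytes to DRAM at 0x0000");
}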

The boot loader is about 330 bytes of Z80 assembly:

;========================================================================
; CP/M Boot Loader for RetroShield Z80
;========================================================================

CCP_BASE:       equ     0xE000
BIOS_BASE:      equ     0xF600
LOAD_SIZE:      equ     53              ; Sectors to load

                org     0x0000

BOOT:
                di
                ld      sp, 0x0400

                ; Print boot message
                ld      hl, MSG_BOOT
                call    PRINT_STR

                ; Open CPM.SYS
                ld      hl, FILENAME
                call    SD_SEND_NAME
                ld      a, CMD_OPEN_READ
                out     (SD_CMD), a

                ; Load CP/M system to memory
                ld      hl, CCP_BASE
                ld      b, LOAD_SIZE

LOAD_LOOP:
                push    bc
                push    hl

                ; Set DMA address
                ld      a, l
                out     (SD_DMA_LO), a
                ld      a, h
                out     (SD_DMA_HI), a

                ; Read 128-byte block via DMA
                xor     a
                out     (SD_BLOCK), a

                ; Print progress dot
                ld      a, '.'
                call    PRINT_CHAR

                pop     hl
                ld      de, 128
                add     hl, de
                pop     bc
                djnz    LOAD_LOOP

                ; Jump to BIOS
                jp      BIOS_BASE

CP/M Disk Images

CP/M uses a specific disk format based on the 8-inch floppy standard:

  • 77 tracks
  • 26 sectors per track
  • 128 bytes per sector
  • 256KB total capacity

The first two tracks are reserved for the system (though we load from CPM.SYS instead). The directory starts at track 2, sector 0 (byte offset 6,656 or 0x1A00).
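Both of those numbers fall straight out of the geometry - a quick host-side check:

#include <stdio.h>

int main(void) {
    const long tracks = 77, sectors_per_track = 26, sector_size = 128;
    const long reserved_tracks = 2;

    long capacity   = tracks * sectors_per_track * sector_size;
    long dir_offset = reserved_tracks * sectors_per_track * sector_size;

    printf("capacity:         %ld bytes\n", capacity);                  // 256256
    printf("directory offset: %ld (0x%lX)\n", dir_offset, dir_offset);  // 6656 (0x1A00)
    return 0;
}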

I wrote a Python tool to create and manage these disk images:

# Create an empty disk image
python3 cpm_disk.py create A.DSK

# Add a file
python3 cpm_disk.py add A.DSK ZORK1.COM

# List files
python3 cpm_disk.py list A.DSK

The directory entry format is straightforward - 32 bytes per entry:

Offset   Size   Description
0        1      User number (0xE5 = empty/deleted)
1-8      8      Filename (space-padded)
9-11     3      Extension (space-padded)
12-15    4      Extent info and record count
16-31    16     Block allocation map
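In C terms, one entry looks roughly like this (field names are mine; the layout is the standard CP/M 2.2 format that cpm_disk.py writes):

#include <stdint.h>

struct CpmDirEntry {
    uint8_t user;        // 0x00-0x0F = user number, 0xE5 = empty/deleted entry
    char    name[8];     // filename, space-padded, no dot
    char    ext[3];      // extension, space-padded
    uint8_t extent;      // extent number within the file
    uint8_t s1, s2;      // reserved / extent high bits
    uint8_t records;     // 128-byte records used in this extent
    uint8_t blocks[16];  // allocation map: 1K block numbers (0 = unused)
};
// sizeof(struct CpmDirEntry) == 32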

One gotcha: empty directory entries must be marked with 0xE5, not 0x00. A disk full of zeros will confuse CP/M into thinking it has files with blank names!

Loading Classic Software: Zork on Real Hardware

With the infrastructure in place, loading classic software is straightforward. I grabbed Zork I, II, and III from the cpm-dist repository:

# Add Zork to the A: drive
python3 cpm_disk.py add A.DSK ZORK1.COM
python3 cpm_disk.py add A.DSK ZORK1.DAT
python3 cpm_disk.py add A.DSK ZORK2.COM
python3 cpm_disk.py add A.DSK ZORK2.DAT

# Hitchhiker's Guide goes on B:
python3 cpm_disk.py add B.DSK HITCH.COM
python3 cpm_disk.py add B.DSK HITCHHIK.DAT

Copy the disk images to the SD card, insert it into the module, and reset the Arduino:

Arduino IDE showing Zork I running on CP/M on the RetroShield Z80

======================================
RetroShield Z80 CP/M 2.2
======================================

KDRAM2560:  OK (1MB DRAM)
SD Card:    OK (Software SPI)
Loading boot.bin...
Loaded 331 bytes to DRAM at 0x0000
Starting Z80...

RetroShield Z80 Boot Loader
Copyright (c) 2025 Alex Jokela, tinycomputers.io

Loading CPM.SYS.....................................................
Boot complete.


RetroShield CP/M 2.2
56K TPA

A>DIR
A: ZORK1    COM : ZORK1    DAT : ZORK2    COM : ZORK2    DAT
A>ZORK1

ZORK I: The Great Underground Empire
Copyright (c) 1981, 1982, 1983 Infocom, Inc. All rights reserved.
ZORK is a registered trademark of Infocom, Inc.
Revision 88 / Serial number 840726

West of House
You are standing in an open field west of a white house, with a boarded
front door.
There is a small mailbox here.

>

There's something deeply satisfying about this. The Z80 CPU running this game is the same architecture that ran it in 1981. The actual opcodes being executed are identical. We've just swapped floppy drives for SD cards and CRT terminals for USB serial.

The CPU Tick Loop: Where It All Comes Together

The heart of the system is the cpu_tick() function - called continuously in loop(), it handles one Z80 clock cycle:

inline __attribute__((always_inline))
void cpu_tick() {
    // Check for serial input
    if (Serial.available()) {
        reg6850_STATUS |= 0x01;  // Set RDRF
    }

    CLK_HIGH;  // Rising clock edge
    uP_ADDR = ADDR;  // Capture address bus

    // Memory access?
    if (!STATE_MREQ_N) {
        if (!STATE_RD_N) {
            // Memory read - get byte from DRAM
            DATA_DIR = DIR_OUT;
            DATA_OUT = DRAM.read8((unsigned long)uP_ADDR);
        }
        else if (!STATE_WR_N) {
            // Memory write - store byte to DRAM
            DRAM.write8((unsigned long)uP_ADDR, DATA_IN);
        }
    }
    // I/O access?
    else if (!STATE_IORQ_N) {
        if (!STATE_RD_N && prevIORQ) {
            // I/O read
            DATA_DIR = DIR_OUT;
            if (sd_handles_port(ADDR_L)) {
                prevDATA = sd_read_port(ADDR_L);
            }
            else if (ADDR_L == ADDR_6850_DATA) {
                prevDATA = Serial.read();
            }
            else if (ADDR_L == ADDR_6850_CONTROL) {
                prevDATA = reg6850_STATUS;
            }
            DATA_OUT = prevDATA;
        }
        else if (!STATE_WR_N && prevIORQ) {
            // I/O write
            DATA_DIR = DIR_IN;
            if (sd_handles_port(ADDR_L)) {
                sd_write_port(ADDR_L, DATA_IN);
            }
            else if (ADDR_L == ADDR_6850_DATA) {
                Serial.write(DATA_IN);
            }
        }
    }

    prevIORQ = STATE_IORQ_N;
    CLK_LOW;  // Falling clock edge
    DATA_DIR = DIR_IN;
}

This runs at roughly 100kHz - slow by modern standards, but plenty fast for interactive programs. The Z80 was designed for clock speeds of 2-4MHz, so we're running at about 3-5% of original speed. Text adventures don't mind.

The BIOS: Hardware Abstraction in 1KB

The CP/M BIOS is where the magic happens. It's the only part of CP/M that needs to be written for each new hardware platform. The BDOS and CCP are universal - they work on any machine with a conforming BIOS.

Our BIOS implements 17 entry points:

                org     BIOS_BASE       ; 0xF600

                jp      BOOT            ; 00 - Cold boot
WBOOTE:         jp      WBOOT           ; 03 - Warm boot
                jp      CONST           ; 06 - Console status
                jp      CONIN           ; 09 - Console input
                jp      CONOUT          ; 0C - Console output
                jp      LIST            ; 0F - List output
                jp      PUNCH           ; 12 - Punch output
                jp      READER          ; 15 - Reader input
                jp      HOME            ; 18 - Home disk
                jp      SELDSK          ; 1B - Select disk
                jp      SETTRK          ; 1E - Set track
                jp      SETSEC          ; 21 - Set sector
                jp      SETDMA          ; 24 - Set DMA address
                jp      READ            ; 27 - Read sector
                jp      WRITE           ; 2A - Write sector
                jp      LISTST          ; 2D - List status
                jp      SECTRAN         ; 30 - Sector translate

The most complex routines are the disk operations. READ and WRITE must calculate the byte offset within the disk image from track and sector numbers:

CALC_OFFSET:
        ; offset = (track * 26 + sector) * 128
        ld      hl, (TRACK)
        ld      de, 26          ; Sectors per track
        call    MULT16          ; HL = track * 26
        ld      de, (SECTOR)
        add     hl, de          ; HL = track * 26 + sector

        ; Multiply by 128 (shift left 7 times)
        xor     a               ; Clear carry byte
        add     hl, hl          ; *2
        adc     a, 0
        add     hl, hl          ; *4
        adc     a, a
        ; ... continue shifting ...

        ld      (SEEKPOS), hl
        ld      (SEEKPOS+2), a  ; 24-bit result
        ret

The Disk Parameter Block (DPB) tells CP/M about our disk geometry:

DPB:
        defw    26              ; SPT - sectors per track
        defb    3               ; BSH - block shift (1K blocks)
        defb    7               ; BLM - block mask
        defb    0               ; EXM - extent mask
        defw    242             ; DSM - total blocks - 1
        defw    63              ; DRM - directory entries - 1
        defb    0xC0            ; AL0 - allocation bitmap
        defb    0x00            ; AL1
        defw    16              ; CKS - checksum size
        defw    2               ; OFF - reserved tracks

These parameters define a standard 256KB 8-inch floppy format - the same format used by countless CP/M machines in the late 1970s.
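Several of those fields are derived values, so they're easy to sanity-check against the geometry (a quick host-side calculation; field meanings per the CP/M 2.2 Alteration Guide):

#include <stdio.h>

int main(void) {
    const int SPT = 26, BSH = 3, BLM = 7, DSM = 242, DRM = 63, OFF = 2;
    const int sector_size = 128;

    int  block_size = sector_size << BSH;             // 128 << 3 = 1024
    long data_area  = (long)(DSM + 1) * block_size;   // 243 blocks = 248832 bytes
    long sys_tracks = (long)OFF * SPT * sector_size;  // 2 reserved tracks = 6656 bytes

    printf("block size: %d bytes, BLM = %d (expect %d)\n", block_size, BLM, (1 << BSH) - 1);
    printf("data area:  %ld bytes (plus %ld bytes of reserved tracks)\n", data_area, sys_tracks);
    printf("directory:  %d entries\n", DRM + 1);
    return 0;
}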

Understanding CP/M's Memory Model

CP/M's memory layout is elegantly simple. The entire operating system fits in the top 8KB of the 64KB address space:

┌───────────────────────────────────────┐ 0xFFFF
              BIOS                      ~1KB - Hardware abstraction
├───────────────────────────────────────┤ 0xF600
              BDOS                      ~3.5KB - File system, I/O
├───────────────────────────────────────┤ 0xE800
              CCP                       ~2KB - Command processor
├───────────────────────────────────────┤ 0xE000
                                       
                                       
              TPA                       ~56KB - Your programs!
     (Transient Program Area)          
                                       
                                       
├───────────────────────────────────────┤ 0x0100
          Page Zero                     256 bytes - System variables
└───────────────────────────────────────┘ 0x0000

Page Zero contains crucial system information:

  • 0x0000-0x0002: Jump to warm boot
  • 0x0005-0x0007: Jump to BDOS entry
  • 0x005C: Default FCB (File Control Block)
  • 0x0080: Default DMA buffer / command tail

When you type ZORK1 at the command prompt, CP/M loads ZORK1.COM at address 0x0100 and jumps there. The program has nearly 56KB to work with - a luxurious amount of memory for 1970s software.

Debugging Tips

Getting CP/M running required extensive debugging. Here are some tips if you're attempting something similar:

Enable Debug Output

Set outputDEBUG to 1 in the Arduino sketch to see every I/O operation:

#define outputDEBUG     1

This prints every port read/write, which is invaluable for tracking down why the BIOS isn't finding files or why sectors are being read from wrong locations.

Check Your Directory Format

The most common issue I encountered was improperly formatted disk images. Use a hex editor to verify:

  • Directory starts at offset 0x1A00 (6656 bytes)
  • Empty entries have 0xE5 in byte 0, not 0x00
  • Filenames are space-padded to 8 characters, extensions to 3
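If you'd rather not eyeball hex dumps, a small host-side checker does the same job (a sketch - it only looks at the first byte and name of each of the 64 entries):

#include <stdio.h>

int main(int argc, char **argv) {
    const char *path = (argc > 1) ? argv[1] : "A.DSK";
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }
    if (fseek(f, 0x1A00, SEEK_SET) != 0) { perror("seek"); fclose(f); return 1; }

    unsigned char entry[32];
    for (int i = 0; i < 64 && fread(entry, 1, 32, f) == 32; i++) {
        if (entry[0] == 0xE5) continue;     // empty/deleted entry, as it should be
        if (entry[0] > 0x0F) {              // not a valid user number either
            printf("entry %d: suspicious first byte 0x%02X\n", i, entry[0]);
            continue;
        }
        printf("entry %d: user %d, name '%.8s.%.3s'\n",
               i, entry[0], (char *)&entry[1], (char *)&entry[9]);
    }
    fclose(f);
    return 0;
}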

Verify DMA Addresses

If programs load but crash immediately, check that the DMA address is being set correctly. The BIOS must output both low and high bytes:

ld      a, l
out     (SD_DMA_LO), a
ld      a, h
out     (SD_DMA_HI), a

Watch for Register Clobbering

Z80 subroutine calls don't preserve registers by default. If your SELDSK routine returns garbage, check whether the OPENDISK helper is destroying HL before the return.

Performance Considerations

The system runs at approximately 100kHz - about 3% of the Z80's original 4MHz speed. This is limited by:

  1. Arduino loop overhead: Each cpu_tick() call has function call overhead
  2. DRAM access time: Software-controlled DRAM is slower than dedicated hardware
  3. Software SPI: Bit-banging SPI adds latency to disk operations

For interactive programs like text adventures, this is imperceptible. For computation-heavy tasks, you'd notice the slowdown. WordStar feels sluggish but usable; compiling code would test your patience.

Future optimizations could include:

  • Assembly-optimized cpu_tick() routine
  • Hardware SPI with a different pin arrangement
  • Overclocking the Arduino (at your own risk)

Challenges and Solutions

Challenge 1: Pin Conflicts

The RetroShield claims the hardware SPI pins (50-53). Solution: software SPI on alternate pins. The SdFat library's SoftSpiDriver template makes this painless.

Challenge 2: Memory Refresh

DRAM needs refresh every few milliseconds or it loses data. Solution: the KDRAM2560 library uses Timer 1 interrupts to handle this transparently. The refresh happens in the background - you never need to think about it.

Challenge 3: Disk Image Format

CP/M expects 0xE5 (not 0x00) for empty directory entries. A disk image initialized to all zeros will confuse CP/M into displaying phantom files. Solution: the cpm_disk.py tool properly initializes the directory.

Challenge 4: 24-bit Seek Positions

Disk images are 256KB, requiring 18 bits to fully address. My initial 16-bit seek implementation couldn't access sectors past track 51. Solution: added a third seek port (0x19) for bits 16-23.

Challenge 5: SELDSK Return Value Bug

CP/M's BDOS expects SELDSK to return a pointer to the Disk Parameter Header in HL. My initial code calculated this pointer, then called OPENDISK which clobbered HL. Solution: push/pop HL around the OPENDISK call.

Getting Started: Bill of Materials

To build your own CP/M machine, you'll need:

Item                          Approximate Cost
Arduino Mega 2560             $15-40
Z80 RetroShield               $35
KDRAM2560                     $20
MicroSD Card Module           $5
MicroSD Card (≤32GB FAT32)    $5-10
Jumper wires                  $5
Total                         ~$100-130

Files and Resources

The complete Arduino sketch and supporting files are available on GitHub:

Required libraries:

  • KDRAM2560 - 1MB DRAM library
  • SdFat - SD card with software SPI support

Conclusion

Building this system was a journey through computing history. CP/M's clean architecture - the separation of BIOS, BDOS, and CCP - made it possible to port a 45-year-old operating system to completely alien hardware in a matter of days. For a fascinating look at CP/M in its heyday, check out the Computer Chronicles episode on CP/M from 1984.

The Z80 doesn't know it's being fed clock pulses by an Arduino, that its memory is dynamic RAM on a shield, or that its "floppy drives" are files on an SD card. It just executes its opcodes, one after another, exactly as it did in 1978.

And somewhere in that stream of opcodes, a small mailbox waits west of a white house, just as it has for over four and a half decades.

>open mailbox
Opening the small mailbox reveals a leaflet.

>read leaflet
"WELCOME TO ZORK!

ZORK is a game of adventure, danger, and low cunning. In it you will
explore some of the most amazing territory ever seen by mortals..."

Welcome to the underground empire. The password is nostalgia, and the treasure is understanding how elegantly simple these early systems really were.

The Paradox of Cheap Compute

Chart showing cost per GFLOP declining from $18 billion in 1961 to $0.02 in 2023

In 1961, if you wanted to perform one billion floating-point calculations per second—one gigaflop—you would have needed to spend approximately $18.7 billion. Today, that same computational power costs about two cents. That's not a typo. The cost of compute has fallen by a factor of nearly one trillion over sixty years.

A floating-point operation is simply a mathematical calculation involving numbers with decimal points—the kind of math that powers everything from spreadsheets to video games to weather simulations. When your computer renders a 3D scene, calculates a mortgage payment, or trains an AI model, it's performing millions or billions of these operations every second. The "FLOP" has become the standard yardstick for measuring computational power, and tracking its cost over time reveals one of the most dramatic price collapses in economic history.

You might expect that as something becomes a trillion times cheaper, we'd use less of it. After all, we don't need as much anymore, right? But that's not what happened. Not even close.

Instead, humanity's consumption of computational power has exploded beyond anything the pioneers of computing could have imagined. We went from a world where only governments and the largest corporations could afford to compute, to a world where the phone in your pocket contains more processing power than all the computers that existed in 1960 combined—and we still want more.

This phenomenon has a name: Jevons' Paradox.

The Coal Question

In 1865, the English economist William Stanley Jevons published The Coal Question, in which he made a counterintuitive observation. As steam engines became more efficient at converting coal to useful work, coal consumption didn't decrease—it increased. Dramatically.

Jevons' reasoning was elegant: when something becomes more efficient, it becomes more economical. When it becomes more economical, people use it for more things. New applications emerge that weren't feasible before. The efficiency gains are swamped by the explosion in demand.

"It is wholly a confusion of ideas to suppose that the economical use of fuel is equivalent to a diminished consumption," Jevons wrote. "The very contrary is the truth."

What Jevons observed about coal in the 19th century applies with uncanny precision to computation in the 20th and 21st centuries. Every order of magnitude drop in the cost of compute has triggered an explosion in what we use it for—from calculating artillery trajectories to streaming video, from running payroll to training artificial intelligence.

Let's trace this paradox through the history of computing.

The Era of Scarcity: 1960s-1970s

The first electronic computers were, quite literally, priceless—not because they were invaluable, but because there was no market for them. ENIAC, completed in 1945, cost about $400,000 (roughly $7 million in today's dollars) and consumed 150 kilowatts of power. It was built to calculate artillery firing tables for the U.S. Army, and it remained in military hands.

By the early 1960s, commercial computing had arrived, but it remained extraordinarily expensive. The IBM System/360, announced in 1964, represented IBM's famous "$5 billion gamble"—the largest privately funded commercial project in history at that time. The smallest System/360 Model 30 rented for $2,700 to $20,000 per month; more powerful configurations could cost $115,000 monthly.

At these prices, computing was rationed. Universities developed time-sharing systems that allowed dozens of users to share a single machine, each getting small slices of processor time. Programmers submitted jobs on punch cards and waited hours—sometimes days—for results. Computing cycles were tracked and allocated like a precious resource.

The machines themselves were monuments to computational scarcity. A computer room required raised floors for cooling ducts, dedicated electrical systems, and climate control. The IBM 7090, the fastest computer of the early 1960s, performed about 100,000 floating-point operations per second and cost millions of dollars. To achieve one gigaflop of performance in 1961, you would have needed to operate roughly 10,000 such machines simultaneously—an impossibility even for governments.

The CDC 6600, designed by Seymour Cray and released in 1964, claimed the title of world's fastest computer with performance of up to 3 megaflops—three million floating-point operations per second. It cost $9 million, roughly equivalent to $90 million today. At that price, a single gigaflop of sustained performance would have cost $3 billion. Only a handful of institutions could afford one: national laboratories, major research universities, and aerospace companies working on space programs and defense contracts.

Yet beneath the surface, miniaturization was beginning its relentless march. The transistor, invented at Bell Labs in 1947, had replaced vacuum tubes. In 1958 and 1959, Jack Kilby at Texas Instruments and Robert Noyce at Fairchild Semiconductor independently invented the integrated circuit, putting multiple transistors on a single chip. In 1965, Gordon Moore made his famous observation: the number of transistors on an integrated circuit was doubling roughly every year, a pace he later revised to every two years.

The implications were staggering. If transistor density determined computing power, and density was doubling every two years while costs remained roughly constant, then computing was about to become very, very cheap.

The Microcomputer Revolution: Late 1970s

The microprocessor changed everything. Intel's 4004, released in 1971, put an entire CPU on a single chip. Its successors—the 8008, 8080, and eventually the 8086—brought enough processing power to enable a new category of machine: the personal computer.

In 1977, the trinity of the Apple II, Commodore PET, and TRS-80 brought computing to the home. These machines cost between $600 and $1,300—expensive, but within reach of middle-class families. For the first time, ordinary people could own a computer.

The Apple II is a useful marker of the era's economics. Priced at $1,298 for a base configuration, it offered roughly 0.5 MIPS (million instructions per second). By our gigaflop metric, that works out to roughly $100 million per GFLOP—still astronomical, but already roughly two orders of magnitude cheaper than the mainframe era.

This first Jevons moment transformed computing. When you had to rent time on a shared mainframe, you used it for serious business: scientific calculations, financial modeling, database management. But when you owned the machine outright, you could use it for anything.

VisiCalc, released in 1979 for the Apple II, demonstrated the power of cheap ownership. Dan Bricklin and Bob Frankston created the first spreadsheet program—a concept that simply didn't exist before. Accountants and business planners who would never have rented mainframe time bought Apple IIs specifically to run VisiCalc. The software created its own demand, and that demand consumed the newly affordable compute.

This pattern—the "killer app" that justifies hardware purchases and consumes available resources—would repeat throughout computing history. Games appeared immediately. Atari and Commodore built empires on entertainment software. Educational programs promised to teach children everything from typing to calculus. Each application justified the purchase of hardware, and each new user created demand for more software.

New applications emerged that no one had anticipated. Hobbyists wrote software for fun. Children learned programming. Bulletin board systems connected users over phone lines, creating the first online communities—the proto-internet. Word processing moved from dedicated Wang machines to general-purpose computers, democratizing document creation.

The machines were also shrinking. The Apple II contained about 14,000 transistors and fit on a desktop. The mainframes it aspired to replace filled rooms. This miniaturization wasn't just about convenience—it was about cost. Smaller meant cheaper to manufacture, cheaper to ship, cheaper to operate.

The IBM PC and the Clone Wars: 1980s

When IBM entered the personal computer market in 1981, it conferred legitimacy on an industry that many had dismissed as a hobbyist toy. The IBM PC wasn't technically superior to its competitors, but it carried the IBM name—and in corporate America, "nobody ever got fired for buying IBM."

More importantly, IBM made a fateful decision: it published the technical specifications of the PC and used off-the-shelf components. This openness, unusual for IBM, enabled an explosion of compatible clones. Compaq, Dell, and dozens of others built machines that ran the same software as the IBM PC, and competition drove prices down relentlessly.

The result was the second Jevons moment. When only IBM made PCs, prices stayed high. When everyone could make PCs, prices collapsed. By the end of the decade, a capable PC could be had for under $1,000.

The Intel 80386, released in 1985, brought 32-bit computing to the mass-market PC. It could address 4 gigabytes of memory—far more than anyone could afford to install at the time—and ran at up to 33 MHz. By 1990, a 386-based PC cost roughly $2,000-3,000 and delivered performance that would have cost millions a decade earlier. The cost per GFLOP had fallen to roughly $1 million.

Corporations responded to falling prices by putting a computer on every desk. In 1980, the concept of "one computer per worker" was absurd. By 1990, it was becoming standard practice in white-collar industries. Personal computing had become business computing.

The software industry exploded to match. Microsoft, founded in 1975, grew from a startup to a behemoth. Desktop publishing, computer-aided design, databases, and networking all found mass markets. Each application consumed the cheaper compute and demanded more.

The Internet Era: 1990s-2000s

The 1990s brought the "megahertz wars" between Intel and AMD. Clock speeds climbed from 25 MHz to 100 MHz to 500 MHz and beyond. Each generation brought not just speed but integration—math coprocessors, cache memory, and eventually multiple cores all moved onto the main processor die.

In 1997, Intel's ASCI Red at Sandia National Laboratories became the first computer to achieve one teraflop—one trillion floating-point operations per second. It cost $55 million and consumed 850 kilowatts of power. The cost per GFLOP had fallen to roughly $30,000.

But the real story of the 1990s wasn't the supercomputers—it was the network connecting all the ordinary ones. The World Wide Web, invented by Tim Berners-Lee in 1989, went mainstream in the mid-1990s. Suddenly every connected computer could communicate with every other.

This was the third Jevons moment. The internet didn't just use computing resources—it multiplied the uses for them. Email replaced letters. Websites replaced catalogs. Search engines made information universally accessible. E-commerce created entirely new markets.

Each of these applications consumed compute at both ends of the connection. A web server needed processing power. So did the browser displaying the page. Multiply by millions of users, and the aggregate demand for computing grew exponentially even as the cost per unit plunged.

By 2000, a capable desktop PC cost around $1,000 and delivered roughly 1 GFLOP of performance. The cost per GFLOP had crossed below $1,000. The entire computing power of ASCI Red—the world's fastest supercomputer just three years earlier—could now be matched for a few million dollars in commodity hardware.

By 2010, the cost had fallen to roughly $1 per GFLOP. Smartphones had arrived, putting general-purpose computers in billions of pockets worldwide. Each one streamed video, played games, ran apps, and synced to cloud services—all consuming compute at unprecedented scales.

GPUs and the AI Explosion: 2010s-Present

While CPUs followed Moore's Law in a measured march, graphics processing units took a different path. GPUs were originally designed for one task: rendering pixels for video games. This required performing the same mathematical operations on thousands of data points simultaneously—massively parallel computation.

In 2007, NVIDIA released CUDA (Compute Unified Device Architecture), which allowed programmers to use GPUs for general-purpose computing. What had been a gaming component became a scientific instrument. Tasks that were computationally intractable on CPUs became feasible on GPUs.

The cost per GFLOP for GPU computation fell to roughly $48 in 2007. By 2013, it was $0.12. By 2017, it was $0.03. Today, it hovers around $0.02.

This unleashed the fourth—and most dramatic—Jevons moment: artificial intelligence.

Neural networks had existed since the 1950s. The theory was understood. What was missing was compute. Training a neural network requires performing billions or trillions of mathematical operations. At 1990s prices, training a modern large language model would have cost more than the entire GDP of most countries.

But at 2020s prices, it became merely expensive rather than impossible. OpenAI's GPT-3, released in 2020, was trained using approximately 3,640 petaflop-days of compute—roughly the equivalent of running 10,000 high-end GPUs for 14 days straight. By one estimate, the compute for training cost around $4.6 million.

That sounds like a lot, but consider what that buys: a system that can write essays, answer questions, generate code, and engage in conversation. Just thirty years earlier, the same computation would have cost trillions of dollars—more than the entire world economy.

The AI industry responded to cheap compute exactly as Jevons would have predicted: by consuming vastly more of it. GPT-4 reportedly used 10-100 times more compute than GPT-3. Each generation of models grows larger. Each company trains more models. Each application uses more inference.

Training is only half the story. Every time someone asks ChatGPT a question or generates an image with Midjourney, that's "inference"—running the trained model to produce output. A single trained model might serve millions of users, each query consuming GPU cycles. The aggregate inference compute now exceeds training compute by orders of magnitude.

NVIDIA, the primary supplier of AI training hardware, saw its market capitalization rise from roughly $360 billion at the start of 2023 to over $3 trillion by late 2024. The company couldn't manufacture GPUs fast enough to meet demand. Datacenters expanded. Power grids strained under the load. Microsoft, Google, and Amazon raced to build facilities that consume as much electricity as small cities—all to serve the insatiable demand for AI computation.

The physical infrastructure tells the story. A modern AI datacenter requires megawatts of power and sophisticated cooling systems. Server racks packed with GPUs generate heat densities that would have been unimaginable a decade ago. Companies are exploring nuclear power plants, offshore platforms, and even orbital datacenters to feed the demand.

The cost of compute had fallen by twelve orders of magnitude, and humanity's total spending on compute had never been higher.

The Paradox in Numbers

Let's put some numbers to this paradox:

Year   Cost per GFLOP    Approximate Global Computing Capacity
1961   $18,672,000,000   ~10 GFLOPS (total)
1984   $18,750,000       ~100 GFLOPS
1997   $30,000           ~100 TFLOPS
2007   $48               ~1 PFLOPS
2017   $0.03             ~1 EFLOPS
2023   $0.02             ~10+ EFLOPS

The cost fell by a factor of nearly one trillion. The total capacity grew by a factor of at least one trillion. We didn't save any money—we spent it all on more computation.

This is Jevons' Paradox in its purest form. Efficiency gains don't reduce consumption; they enable it. The cheaper compute becomes, the more uses we find for it, until we've consumed every efficiency gain and then some.

Miniaturization: The Engine of the Paradox

Underlying this entire history is miniaturization—the relentless shrinking of transistors that drives both efficiency gains and cost reductions.

In 1971, Intel's 4004 contained 2,300 transistors on a chip fabricated with a 10-micrometer process. Today, Apple's M-series chips contain over 100 billion transistors fabricated at 3 nanometers—more than 3,000 times smaller. Each generation of shrinkage brings more transistors per dollar, more operations per watt, and more capability per cubic centimeter.

This shrinkage is why your smartphone is more powerful than the supercomputers of the 1990s. It's why a $500 graphics card can train machine learning models that would have required national laboratories thirty years ago. And it's why the economics of computing have followed Jevons' prediction so precisely: smaller transistors mean cheaper computation, and cheaper computation means more computation.

The industry euphemistically calls the end of Moore's Law—the point where further shrinkage becomes physically or economically impractical—"the wall." Various experts have predicted its arrival for decades. Yet the wall keeps receding. New techniques—multi-chip packages, 3D stacking, specialized accelerators—continue to deliver more compute per dollar even as individual transistor shrinkage slows.

What Comes Next?

If history is any guide, the future holds more of the same: continued cost reduction, continued demand growth, and continued surprise at what becomes possible.

Quantum computing looms on the horizon, promising exponential speedups for certain problems. If quantum computers become practical and affordable, they will trigger another Jevons moment. Problems that are currently intractable—drug discovery, materials science, cryptography—will become computable. New applications will emerge. Demand will explode.

Some argue that AI itself represents a new kind of computing, one that produces not calculations but intelligence. If artificial general intelligence arrives, it may consume computational resources at scales we can barely imagine—each AI agent requiring the equivalent of human-brain-level compute, running continuously, at massive scale.

The pattern is remarkably consistent. In 1965, computing was so expensive that only mission-critical calculations justified the cost. In 1985, it was cheap enough for word processing and spreadsheets. In 2005, it was cheap enough for social media and video streaming. In 2025, it's cheap enough to generate human-like text and photorealistic images on demand.

At each stage, we found new uses for the cheaper compute. At each stage, we consumed more total computation than before. At each stage, we spent more money on computing even as the cost per unit plummeted.

This is not a failure of planning or a lack of conservation. It is the predictable outcome of making something useful cheaper. The more valuable computation becomes per dollar, the more dollars we are willing to spend on it.

Jevons would not be surprised. "It is the very economy of its use," he wrote of coal, "which leads to its extensive consumption."

The same is true of compute. We have made it cheap beyond the wildest dreams of the 1960s pioneers, and we have consumed every bit of savings in an insatiable hunger for more.

The paradox endures.


Data sources: AI Impacts, Human Progress, Epoch AI, and historical hardware records.

Further Reading

If you'd like to explore these topics further:

  • The Coal Question by William Stanley Jevons — The 1865 original that introduced the paradox. Dense Victorian prose, but historically fascinating.

  • The Innovators by Walter Isaacson — A sweeping history of the digital revolution, from Ada Lovelace to the modern internet.

  • Hackers: Heroes of the Computer Revolution by Steven Levy — The definitive account of the microcomputer era and the culture that built it.

  • The Dream Machine by M. Mitchell Waldrop — The story of J.C.R. Licklider and the vision that became the internet.

  • Chip War by Chris Miller — How semiconductors became the world's most critical technology and reshaped geopolitics.

Rust on Z80: From LLVM Backend to Hello World

In my previous post, I documented building an LLVM backend for the Z80 processor. The backend worked — simple LLVM IR compiled to valid Z80 assembly. But that post ended with a sobering admission: Rust's core library remained out of reach, its abstractions overwhelming the constraints of 1976 hardware.

This post picks up where that one left off. The question nagging at me was simple: can we actually compile real Rust code into Z80 assembly? Not just hand-crafted LLVM IR, but genuine Rust source files with functions and variables and all the conveniences we expect from a modern language?

The answer is yes. But getting there required more RAM than any Z80 system ever had, a creative workaround that sidesteps Rust's build system entirely, and a willingness to accept that sometimes the elegant solution isn't the one that works.

The Hardware Reality Check

Before diving into the technical details, I need to address something that caught me off guard: the sheer computational resources required to compile code for an 8-bit processor.

My first attempt was on my M3 Max MacBook Pro. The machine is no slouch — 64GB of unified memory, fast SSD, Apple's impressive silicon. Building LLVM with the Z80 backend worked fine. Building stage 1 of the Rust compiler worked, albeit slowly. But when I tried to build Rust's core library for the Z80 target, the process crawled. After watching it churn for hours with no end in sight, I gave up.

The next attempt used a Linux workstation with 32GB of RAM. This seemed reasonable — surely 32GB is enough to compile code for a processor with a 64KB address space? It wasn't. The build process hit out-of-memory errors during the compilation of compiler_builtins, a Rust crate that provides low-level runtime functions.

To understand why, you need to know what compiler_builtins actually does. When you write code like let x: u64 = a * b;, and your target processor doesn't have native 64-bit multiplication (the Z80 doesn't even have 8-bit multiplication), something has to implement that operation in software. That something is compiler_builtins. It contains hundreds of functions: software implementations of multiplication, division, floating-point operations, and various other primitives that high-level languages take for granted. Each of these functions gets compiled, optimized, and linked into your final binary.

For the Z80, every one of these functions presents a challenge. 64-bit division on an 8-bit processor expands into an enormous sequence of instructions. The LLVM optimizer works hard to improve this code, and that optimization process consumes memory — lots of it.
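
To make that concrete, here is a rough Rust sketch of the kind of routine compiler_builtins supplies when a target has no hardware multiplier: a classic shift-and-add multiply. The real intrinsics (__muldi3 and friends) are considerably more optimized; treat this as illustrative only.

// Illustrative only: a shift-and-add 64-bit multiply of the sort
// compiler_builtins provides in software when the CPU has no multiply
// instruction. The real implementations are more sophisticated.
fn soft_mul_u64(mut a: u64, mut b: u64) -> u64 {
    let mut result: u64 = 0;
    while b != 0 {
        if b & 1 != 0 {
            result = result.wrapping_add(a); // add the shifted multiplicand
        }
        a = a.wrapping_shl(1); // shift the multiplicand left
        b >>= 1;               // consume one bit of the multiplier
    }
    result
}

fn main() {
    assert_eq!(soft_mul_u64(123_456_789, 987_654),
               123_456_789u64.wrapping_mul(987_654));
}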

The machine that finally worked was a dedicated build server:

OS: Ubuntu 24.04.3 LTS x86_64
Host: Gigabyte G250-G51 Server
CPU: Intel Xeon E5-2697A v4 (64 cores) @ 3.600GHz
Memory: 252GB DDR4
GPU: 4x NVIDIA Tesla P40 (unused for compilation)

With 252GB of RAM and 64 cores, the build finally had room to breathe. LLVM with Z80 support built in about 45 minutes. The Rust stage 1 compiler built in 11 minutes. And when we attempted to build compiler_builtins for Z80, memory usage peaked at 169GB.

Let that sink in: compiling runtime support code for a processor with 64KB of addressable memory required 169GB of RAM. The ratio is absurd — we needed 2.6 million times more memory to compile the code than the target system could ever access. This is what happens when modern software toolchains, designed for 64-bit systems with gigabytes of RAM, encounter hardware from an era when 16KB was a luxury.

The Naive Approach and Why It Fails

With our beefy build server ready, the obvious approach was to build Rust's core library for the Z80 target. The core library is Rust's foundation — it provides basic types like Option and Result, fundamental traits like Copy and Clone, and essential operations like memory manipulation and panicking. Unlike std, which requires an operating system, core is designed for bare-metal embedded systems. If anything could work on a Z80, surely core could.

The first obstacle was unexpected. Rust's build system uses a crate called cc to compile C code and detect target properties. When we ran the build, it immediately failed:

error occurred in cc-rs: target `z80-unknown-none-elf` had an unknown architecture

The cc crate maintains a list of known CPU architectures, and Z80 wasn't on it. The fix was simple — a one-line patch to add "z80" => "z80" to the architecture matching code — but we had to apply it to every version of cc in the cargo registry cache. Not elegant, but effective.
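
The shape of the change is easy to show in isolation. The exact file and match statement differ between cc versions, so the standalone sketch below is illustrative rather than a literal diff:

// Illustrative sketch of the architecture-name mapping the patch extends.
// The real code lives inside the cc crate and differs across versions.
fn map_arch(triple_arch: &str) -> Option<&'static str> {
    match triple_arch {
        "x86_64" => Some("x86_64"),
        "arm" => Some("arm"),
        "z80" => Some("z80"), // the one-line addition
        _ => None,
    }
}

fn main() {
    assert_eq!(map_arch("z80"), Some("z80"));
}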

With that patched, the build progressed further before hitting a more fundamental problem:

rustc-LLVM ERROR: unable to legalize instruction: %35:_(s16) = nneg G_UITOFP %10:_(s64)

This error comes from LLVM's GlobalISel pipeline, specifically the Legalizer. To understand it, I need to explain how LLVM actually turns high-level code into machine instructions.

What is GlobalISel and Why Does It Matter?

When you compile code with LLVM, there's a critical step called "instruction selection" — the process of converting LLVM's abstract intermediate representation (IR) into concrete machine instructions for your target CPU. This is harder than it sounds. LLVM IR might say "add these two 32-bit integers," but your CPU might only have 8-bit addition, or it might have three different add instructions depending on whether the operands are in registers or memory.

Historically, LLVM used a framework called SelectionDAG for this task. SelectionDAG works, but it operates on individual basic blocks (straight-line code between branches) and makes decisions that are hard to undo later. For well-established targets like x86 and ARM, SelectionDAG is mature and produces excellent code. But for new or unusual targets, it's difficult to work with.

GlobalISel (Global Instruction Selection) is LLVM's modern replacement. The "Global" in the name refers to its ability to see across basic block boundaries, making better optimization decisions. More importantly for our purposes, GlobalISel breaks instruction selection into distinct, understandable phases:

  1. IRTranslator: Converts LLVM IR into generic machine instructions. These instructions have names like G_ADD (generic add), G_LOAD (generic load), and G_UITOFP (generic unsigned integer to floating-point conversion). At this stage, the code is still target-independent — G_ADD doesn't know if it'll become an x86 ADD, an ARM add, or a Z80 ADD A,B.

  2. Legalizer: This is where target constraints enter the picture. The Legalizer transforms operations that the target can't handle into sequences it can. If your target doesn't support 64-bit addition directly, the Legalizer breaks it into multiple 32-bit or 16-bit additions. If your target lacks a multiply instruction (hello, Z80), the Legalizer replaces multiplication with a function call to a software implementation.

  3. RegBankSelect: Assigns each value to a register bank. For the Z80, this means deciding whether something lives in 8-bit registers (A, B, C, D, E, H, L) or 16-bit register pairs (BC, DE, HL). This phase is crucial for the Z80 because using the wrong register bank means extra move instructions.

  4. InstructionSelector: Finally converts the now-legal, register-bank-assigned generic instructions into actual target-specific instructions. G_ADD becomes ADD A,B or ADD HL,DE depending on the operand types.

For the Z80 backend, GlobalISel was the right choice. It gave us fine-grained control over how operations get lowered on extremely constrained hardware. The downside is that every operation needs explicit handling — if the Legalizer doesn't know how to transform a particular instruction for Z80, compilation fails.
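
To make the Legalizer's job concrete, here is what "narrowing" looks like in spirit, written as plain Rust rather than LLVM's C++: a 32-bit add expressed as two 16-bit adds with an explicit carry, which is roughly the rewrite the Legalizer performs for a target whose widest native add is 16 bits.

// Conceptual sketch (Rust, not LLVM internals): narrowing a 32-bit add
// into two 16-bit adds plus a carry, the way the Legalizer rewrites wide
// operations for a target with only 8/16-bit arithmetic.
fn add32_via_16(a: u32, b: u32) -> u32 {
    let (lo, carry) = (a as u16).overflowing_add(b as u16); // low halves
    let hi = ((a >> 16) as u16)
        .wrapping_add((b >> 16) as u16)
        .wrapping_add(carry as u16); // high halves plus the carry
    ((hi as u32) << 16) | lo as u32
}

fn main() {
    assert_eq!(add32_via_16(0x0001_FFFF, 0x0000_0001), 0x0002_0000);
}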

The error we hit was in the Legalizer. The G_UITOFP instruction converts an unsigned integer to floating-point. In this case, it was trying to convert a 64-bit integer to a 16-bit half-precision float. This operation appears deep in Rust's core library, in the decimal number parsing code used for floating-point literals.

The Z80 has no floating-point hardware whatsoever. It can't even do integer multiplication in a single instruction. Teaching LLVM to "legalize" 64-bit-to-float conversions on such constrained hardware would require implementing software floating-point operations — a significant undertaking that would generate hundreds of Z80 instructions for a single high-level operation.

Even setting aside the floating-point issue, we encountered another class of failures: LLVM assertion errors in the GlobalISel pipeline when handling complex operations. These manifested as crashes with messages about register operand sizes not matching expectations. The Z80 backend is experimental, and its GlobalISel support doesn't cover every edge case that Rust's core library exercises.

The fundamental problem became clear: Rust's core library, while designed for embedded systems, assumes a level of hardware capability that the Z80 simply doesn't have. It assumes 32-bit integers work efficiently. It assumes floating-point parsing is reasonable. It assumes the register allocator can handle moderately complex control flow.

The Workaround: Cross-Compile and Retarget

When the direct path is blocked, you find another way around.

The key insight is that LLVM IR (Intermediate Representation) is largely target-agnostic. When Rust compiles your code, it first generates LLVM IR, and then LLVM transforms that IR into target-specific assembly. The IR describes your program's logic — additions, function calls, memory accesses — without committing to a specific instruction set.

This suggests a workaround: compile Rust code to LLVM IR using a different target that Rust fully supports, then manually retarget that IR to Z80 and run it through our Z80 LLVM backend.

For the donor target, I chose thumbv6m-none-eabi — the ARM Cortex-M0, a 32-bit embedded processor. This target is well-supported in Rust's ecosystem, and crucially, it's a no_std target designed for resource-constrained embedded systems. The generated IR would be reasonably close to what we'd want for Z80, minus the data layout differences.

The workflow looks like this:

  1. Write Rust code with #![no_std] and #![no_main]
  2. Compile for ARM: cargo +nightly build --target thumbv6m-none-eabi -Zbuild-std=core
  3. Extract the LLVM IR from the build artifacts (the .ll files)
  4. Modify the IR's target triple and data layout for Z80
  5. Compile to Z80 assembly: llc -march=z80 -O2 input.ll -o output.s

The data layout change is important. ARM uses 32-bit pointers; Z80 uses 16-bit pointers. The Z80 data layout string is:

e-m:e-p:16:8-i16:8-i32:8-i64:8-n8:16

This tells LLVM: little-endian, ELF mangling, 16-bit pointers with 8-bit alignment, native types are 8-bit and 16-bit. When we retarget the IR, we need to update this layout and the target triple to z80-unknown-unknown.
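
Step 4 is mechanical enough to script. Here is a minimal Rust sketch of that rewrite, assuming only the target datalayout and target triple lines need replacing (in practice some target-specific function attributes may also need cleaning up); the input filename is a placeholder.

// Minimal sketch of the retargeting step: swap the `target datalayout` and
// `target triple` lines in a .ll file for the Z80 values. Real IR may need
// additional cleanup (e.g. ARM-specific attributes).
use std::fs;

fn retarget_ir(path: &str) -> std::io::Result<()> {
    let ir = fs::read_to_string(path)?;
    let rewritten: Vec<String> = ir
        .lines()
        .map(|line| {
            if line.starts_with("target datalayout") {
                r#"target datalayout = "e-m:e-p:16:8-i16:8-i32:8-i64:8-n8:16""#.to_string()
            } else if line.starts_with("target triple") {
                r#"target triple = "z80-unknown-unknown""#.to_string()
            } else {
                line.to_string()
            }
        })
        .collect();
    fs::write(path, rewritten.join("\n"))
}

fn main() -> std::io::Result<()> {
    retarget_ir("hello.ll") // placeholder filename
}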

Is this elegant? No. It's a hack that bypasses Rust's proper build system. But it works, and sometimes working beats elegant.

Hello Z80 World

Let's put this into practice with the classic first program.

Here's the Rust source code:

#![no_std]
#![no_main]

use core::panic::PanicInfo;

// Memory-mapped serial output at address 0x8000
const SERIAL_OUT: *mut u8 = 0x8000 as *mut u8;

#[inline(never)]
#[no_mangle]
pub extern "C" fn putchar(c: u8) {
    unsafe {
        core::ptr::write_volatile(SERIAL_OUT, c);
    }
}

#[no_mangle]
pub extern "C" fn hello_z80() {
    putchar(b'H'); putchar(b'e'); putchar(b'l'); putchar(b'l'); putchar(b'o');
    putchar(b' '); putchar(b'Z'); putchar(b'8'); putchar(b'0'); putchar(b'!');
    putchar(b'\r'); putchar(b'\n');
}

#[panic_handler]
fn panic(_: &PanicInfo) -> ! { loop {} }

This is genuine Rust code. We're using core::ptr::write_volatile for memory-mapped I/O, the extern "C" calling convention for predictable symbol names, and #[no_mangle] to preserve function names in the output. The #[inline(never)] on putchar ensures it remains a separate function rather than being inlined into the caller.

After compiling to ARM IR and retargeting to Z80, we run it through llc. The output is real Z80 assembly:

    .globl  putchar
putchar:
    ld   de,32768        ; Load address 0x8000
    push de
    pop  hl              ; DE -> HL (address now in HL)
    ld   (hl),a          ; Store A register to memory
    ret

    .globl  hello_z80
hello_z80:
    push ix              ; Save frame pointer
    ld   ix,0
    add  ix,sp           ; Set up stack frame
    dec  sp              ; Allocate 1 byte on stack
    ld   a,72            ; 'H'
    call putchar
    ld   a,101           ; 'e'
    call putchar
    ld   a,108           ; 'l'
    ld   (ix+-1),a       ; Save 'l' to stack (optimization!)
    call putchar
    ld   a,(ix+-1)       ; Reload 'l' for second use
    call putchar
    ld   a,111           ; 'o'
    call putchar
    ld   a,32            ; ' '
    call putchar
    ld   a,90            ; 'Z'
    call putchar
    ld   a,56            ; '8'
    call putchar
    ld   a,48            ; '0'
    call putchar
    ld   a,33            ; '!'
    call putchar
    ld   a,13            ; '\r'
    call putchar
    ld   a,10            ; '\n'
    call putchar
    ld   sp,ix           ; Restore stack
    pop  ix              ; Restore frame pointer
    ret

This is valid Z80 assembly that would run on real hardware. The putchar function loads the serial port address into the HL register pair and stores the character from the A register. The hello_z80 function calls putchar twelve times, once for each character in "Hello Z80!\r\n".

Notice something interesting: the compiler optimized the duplicate 'l' character. Instead of loading 108 into the A register twice, it saves the value to the stack after the first use and reloads it for the second. This is LLVM's register allocator at work, recognizing that reusing a value from the stack is cheaper than reloading an immediate. The Z80 backend is generating genuinely optimized code.

Running on (Emulated) Hardware

Generating assembly is satisfying, but seeing it actually execute closes the loop. I have a Rust-based Z80 emulator that I use for testing RetroShield firmware. It emulates the Z80 CPU along with common peripheral chips, including the MC6850 ACIA serial chip that my physical hardware uses.

To run our Hello World, we need to adapt the memory-mapped I/O to use the ACIA's port-based I/O instead. The MC6850 uses port $80 for status and port $81 for data. A proper implementation waits for the Transmit Data Register Empty (TDRE) bit before sending each character:

; Hello Z80 World - Compiled from Rust via LLVM
; Adapted for MC6850 ACIA serial output

ACIA_STATUS:    equ     $80
ACIA_DATA:      equ     $81

        org     $0000

_start:
        ld      hl, MESSAGE
        ld      b, MESSAGE_END - MESSAGE

print_loop:
wait_ready:
        in      a, (ACIA_STATUS)
        and     $02                ; Check TDRE bit
        jr      z, wait_ready

        ld      a, (hl)
        out     (ACIA_DATA), a
        inc     hl
        djnz    print_loop

halt_loop:
        halt
        jr      halt_loop

MESSAGE:
        defb    "Hello, Z80 World!", $0D, $0A
MESSAGE_END:

This is the essence of what our Rust code does, translated to the actual hardware interface. The infinite loop at the end mirrors Rust's loop {} — on bare metal, there's nowhere to return to.

Assembling with z80asm produces a 39-byte binary. Running it in the emulator:

Hello Z80 World running in the TUI debugger

$ ./retroshield -d -c 10000 hello_rust.bin
Loaded 39 bytes from hello_rust.bin
Starting Z80 emulation...
Hello, Z80 World!

CPU halted at PC=0011 after 1194 cycles

The program executes in 1,194 Z80 cycles — roughly 300 microseconds at a 4 MHz clock. The complete pipeline works:

  1. Rust source code → compiled to LLVM IR via rustc
  2. LLVM IR → retargeted to Z80 and compiled to assembly
  3. Z80 assembly → assembled to binary with z80asm
  4. Binary → executed in the Z80 emulator

The 39-byte binary breaks down to about 20 bytes of executable code and 19 bytes for the message string. This is exactly what bare-metal #![no_std] Rust should produce — tight, efficient code with zero runtime overhead.

What Works and What Doesn't

Through experimentation, we've mapped out the boundaries of what the Z80 backend handles well.

Works reliably:

  • 8-bit arithmetic: addition, subtraction, bitwise operations. These map directly to Z80 instructions like ADD A,B and AND B.
  • 16-bit arithmetic: addition and subtraction use the Z80's 16-bit register pairs (HL, DE, BC) efficiently.
  • Memory operations: loads and stores generate clean LD (HL),A and LD A,(HL) sequences.
  • Function calls: the calling convention uses registers efficiently, avoiding unnecessary stack operations for simple cases.
  • Simple control flow: conditional branches and unconditional jumps work as expected.

Works but generates bulky code:

  • 32-bit arithmetic: every 32-bit operation expands into multiple 16-bit operations with careful carry flag handling. A 32-bit addition becomes a sequence that would make a Z80 programmer wince.
  • Multiplication: even 8-bit multiplication requires a library call to __mulhi3 since the Z80 lacks a multiply instruction.

Breaks the register allocator:

  • Loops with phi nodes: in LLVM IR, loops use phi nodes to represent values that differ depending on which path entered the loop. Complex phi nodes exhaust the Z80's seven registers, causing "ran out of registers" errors (a concrete example follows these lists).
  • Functions with many live variables: if you need more than a handful of values alive simultaneously, the backend can't handle it.

Not supported:

  • Floating-point operations: the Z80 has no FPU, and no legalization rules exist yet to lower floating-point operations into software routines.
  • Complex core library features: iterators, formatters, and most of the standard library infrastructure trigger unsupported operations.
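
To make the phi-node limitation concrete, here is the kind of ordinary-looking Rust loop that stresses it. The loop-carried values (the running sum and the index) become phi nodes in LLVM IR; whether a given loop survives the Z80 register allocator depends on how many such values stay live at once.

// A simple loop whose loop-carried values (sum, i) become phi nodes in
// LLVM IR. Loops like this are exactly where the experimental Z80 backend
// can run out of registers.
#[no_mangle]
pub extern "C" fn sum_bytes(data: *const u8, len: u16) -> u16 {
    let mut sum: u16 = 0;
    let mut i: u16 = 0;
    while i < len {
        sum = sum.wrapping_add(unsafe { *data.add(i as usize) } as u16);
        i += 1;
    }
    sum
}

fn main() {
    let data = [1u8, 2, 3, 4];
    assert_eq!(sum_bytes(data.as_ptr(), data.len() as u16), 10);
}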

The Calling Convention

Through testing, we've empirically determined how our Z80 backend passes arguments and returns values:

Type        First Argument      Second Argument     Return Value
u8 / i8     A register          L register          A register
u16 / i16   HL register pair    DE register pair    HL register pair

Additional arguments go on the stack. The stack frame uses the IX register as a frame pointer when needed. This convention minimizes register shuffling for common cases — a function taking two 16-bit arguments and returning one uses HL and DE for input and HL for output, requiring no setup at all.

This differs from traditional Z80 calling conventions used by C compilers, which typically pass all arguments on the stack. Our approach is more register-heavy, which suits the short functions typical of embedded code.
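
As a sanity check of the convention, a function like the one below should compile down to little more than an ADD HL,DE followed by RET; that expectation is my reading of the table above, not something verified against every backend revision.

// Two u16 arguments arrive in HL and DE; the u16 result returns in HL.
// Under the convention above this should need essentially ADD HL,DE / RET.
#[no_mangle]
pub extern "C" fn add16(a: u16, b: u16) -> u16 {
    a.wrapping_add(b)
}

fn main() {
    assert_eq!(add16(0x1234, 0x0001), 0x1235);
}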

Practical Implications

Let me be clear about what we've achieved and what remains out of reach.

What you can realistically build:

  • Simple embedded routines: LED patterns, sensor reading, basic I/O handling
  • Mathematical functions: integer arithmetic, lookup tables, state machines
  • Protocol handlers: parsing simple data formats, generating responses
  • Anything that would fit in a few kilobytes of hand-written assembly

What you cannot build:

  • Anything requiring heap allocation: no Vec, no String, no dynamic data structures
  • Code using iterators or closures: these generate complex LLVM IR that overwhelms the register allocator
  • Formatted output: Rust's write! macro and formatting infrastructure are far too heavy
  • Floating-point calculations: not without significant backend work

The path to making this more capable is visible but non-trivial. A custom minimal core implementation that avoids floating-point entirely would help. Improving the register allocator's handling of phi nodes would enable loops. Adding software floating-point legalization would unlock numerical code. Each of these is a substantial project.

Reflections

Building a compiler backend for a 50-year-old processor using a 21st-century language toolchain is an exercise in contrasts. Modern software assumes abundant resources. The Z80 was designed when resources were precious. Making them meet requires translation across decades of computing evolution.

The fact that we needed 252GB of RAM to compile code for a processor with a 64KB address space is almost poetic. It captures something essential about how far computing has come and how much we've traded simplicity for capability.

But here's what satisfies me: the generated Z80 code is good. It's not bloated or obviously inefficient. When we compile a simple function, we get a simple result. The LLVM optimization passes do their job, and our backend translates the result into idiomatic Z80 assembly. The 'l' character optimization in our Hello World example isn't something I would have thought to do by hand, but the compiler found it automatically.

Rust on Z80 isn't practical for production use. The core library is too heavy, the workarounds are too fragile, and the resulting code size would exceed most Z80 systems' capacity. But as a demonstration that modern toolchains can target ancient hardware? As an exploration of what compilers actually do? As an answer to "I wonder if this is possible?"

Yes. It's possible. And the journey to get here taught me more about LLVM, register allocation, and instruction selection than any tutorial ever could.

What's Next

With emulation working, the obvious next step is running this code on actual hardware. My RetroShield Z80 sits waiting on my workbench, ready to execute whatever binary we load into it. The emulator uses the same ACIA interface as the physical hardware, so the transition should be straightforward — load the binary, connect a terminal, and watch "Hello, Z80 World!" appear on genuine 8-bit silicon.

Beyond hardware validation, the Z80 backend needs work on loop handling. Phi nodes are the enemy. There may be ways to lower them earlier in the pipeline, before they reach the register-hungry instruction selector. That's a project for another day, another blog post, and probably another round of pair programming with Claude.

The projects are available on GitHub for anyone curious enough to try them.

Be warned: you'll need more RAM than seems reasonable. But if you've read this far, you probably already suspected that.

Resources

If you want to dive deeper into any of the topics covered here, these resources might help:

Books:

  • Programming the Z80 by Rodnay Zaks — The definitive Z80 reference, covering every instruction and addressing mode in detail
  • The Rust Programming Language by Klabnik and Nichols — The official Rust book, essential for understanding no_std embedded development
  • Engineering a Compiler by Cooper and Torczon — Comprehensive compiler textbook covering instruction selection, register allocation, and code generation
  • Crafting Interpreters by Robert Nystrom — Excellent practical guide to building language implementations


Running CP/M 2.2 on the RetroShield Z80 Emulator

There's something magical about watching a 45-year-old operating system boot on modern hardware. CP/M 2.2, the operating system that launched a thousand microcomputers and paved the way for MS-DOS, still has lessons to teach us about elegant system design.

This post documents my journey getting CP/M 2.2 running on the RetroShield Z80 emulator, a Rust-based Z80 emulator I've been developing. The result is a fully functional CP/M system that can run classic software like Zork and WordStar.

What is CP/M?

CP/M (Control Program for Microcomputers) was created by Gary Kildall at Digital Research in 1974. It became the dominant operating system for 8-bit microcomputers in the late 1970s and early 1980s, running on machines like the Altair 8800, IMSAI 8080, Osborne 1, and Kaypro.

CP/M's genius was its portability. The system separated into three layers:

  • CCP (Console Command Processor) - The command line interface
  • BDOS (Basic Disk Operating System) - File and I/O services
  • BIOS (Basic Input/Output System) - Hardware abstraction

Only the BIOS needed to be rewritten for each machine. This architecture directly influenced MS-DOS and, by extension, every PC operating system that followed.

The RetroShield Z80 Emulator

The RetroShield is a hardware shield that lets you run vintage CPUs on modern microcontrollers. My emulator takes this concept further by providing a complete software simulation of the Z80 and its peripherals.

The emulator includes:

  • Full Z80 CPU emulation (via the rz80 crate)
  • MC6850 ACIA serial port (console I/O)
  • SD card emulation with DMA block transfers
  • TUI debugger with memory viewer, disassembly, and single-stepping

The Challenge: Disk I/O

Getting CP/M's console I/O working was straightforward. The real challenge was disk I/O. CP/M expects to read and write 128-byte sectors from floppy disks. I needed to emulate this using files on the host system.

The standard 8" single-sided, single-density floppy format that CP/M uses:

  • 77 tracks
  • 26 sectors per track
  • 128 bytes per sector
  • 256KB total capacity

DMA Block Transfers

Rather than transferring bytes one at a time through I/O ports (which would be painfully slow), I implemented DMA block transfers. The BIOS sets up a DMA address and issues a single command to transfer an entire 128-byte sector:

; Set DMA address
ld      hl, (DMAADR)
ld      a, l
out     (SD_DMA_LO), a
ld      a, h
out     (SD_DMA_HI), a

; Issue block read
xor     a
out     (SD_BLOCK), a

; Check status
in      a, (SD_BLOCK)
ret                     ; A = 0 if OK

On the emulator side, this triggers a direct memory copy from the disk image file into emulated RAM.
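
Concretely, that handler is a seek plus a 128-byte copy. The sketch below is simplified (the emulator's actual structures and field names differ, and the disk image path is a placeholder), but it shows the shape of the operation: compute the sector's offset in the disk image, read it, and copy it into emulated RAM at the DMA address.

// Simplified sketch of the emulator-side block read: seek to the sector's
// offset in the disk image and copy 128 bytes into emulated RAM at the DMA
// address. The real emulator's types and names differ.
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

const SECTOR_SIZE: usize = 128;
const SECTORS_PER_TRACK: u64 = 26;

fn dma_read_sector(
    image: &mut File,
    track: u64,
    sector: u64, // 1-based, per CP/M convention
    dma_addr: u16,
    ram: &mut [u8; 0x10000],
) -> std::io::Result<()> {
    let offset = (track * SECTORS_PER_TRACK + (sector - 1)) * SECTOR_SIZE as u64;
    image.seek(SeekFrom::Start(offset))?;
    let mut sector_buf = [0u8; SECTOR_SIZE];
    image.read_exact(&mut sector_buf)?;
    let start = dma_addr as usize;
    ram[start..start + SECTOR_SIZE].copy_from_slice(&sector_buf);
    Ok(())
}

fn main() -> std::io::Result<()> {
    let mut ram = [0u8; 0x10000];
    let mut image = File::open("drive_a.img")?; // placeholder disk image
    dma_read_sector(&mut image, 2, 1, 0x0080, &mut ram)
}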

The Bug That Almost Defeated Me

After implementing everything, CP/M would boot and print its banner, but then hang or show garbage. The debug output revealed the BDOS was requesting insane track numbers like 0x0083 instead of track 2.

The culprit? A classic use-after-move bug in the BIOS:

SELDSK:
    ; Calculate DPH address in HL
    ld      l, c
    ld      h, 0
    add     hl, hl          ; *16
    add     hl, hl
    add     hl, hl
    add     hl, hl
    ld      de, DPH0
    add     hl, de

    call    OPENDISK        ; BUG: This overwrites HL!
    ret                     ; Returns garbage instead of DPH

The OPENDISK subroutine was using HL internally, destroying the Disk Parameter Header address that SELDSK was supposed to return. The BDOS would then read garbage from the wrong memory location for its disk parameters.

The fix was simple:

    push    hl
    call    OPENDISK
    pop     hl              ; Restore DPH address
    ret

24-bit Seek Positions

Another issue: the disk images are 256KB, but I initially only supported 16-bit seek positions (64KB max). I added an extended seek port for the high byte:

pub const SD_SEEK_LO: u8 = 0x14;    // Bits 0-7
pub const SD_SEEK_HI: u8 = 0x15;    // Bits 8-15
pub const SD_SEEK_EX: u8 = 0x19;    // Bits 16-23
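
On the emulator side, the three port writes simply assemble into one 24-bit offset. A minimal sketch (the helper name is mine, not the emulator's actual API):

// Assemble the 24-bit seek offset from the three seek ports.
// The helper name is illustrative, not the emulator's actual API.
fn seek_offset(lo: u8, hi: u8, ex: u8) -> u64 {
    (lo as u64) | ((hi as u64) << 8) | ((ex as u64) << 16)
}

fn main() {
    // Track 2, sector 1 on the 8" format: 2 * 26 * 128 = 6656 = 0x001A00
    assert_eq!(seek_offset(0x00, 0x1A, 0x00), 0x001A00);
}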

The Memory Map

CP/M's memory layout for a 56KB TPA (Transient Program Area):

0000-00FF   Page Zero (jump vectors, FCB, command buffer)
0100-DFFF   TPA - User programs load here (56KB)
E000-E7FF   CCP - Console Command Processor
E800-F5FF   BDOS - Basic Disk Operating System
F600-FFFF   BIOS - Hardware abstraction layer

The BIOS is only about 1KB of Z80 assembly, handling:

  • Console I/O via the MC6850 ACIA
  • Disk I/O via SD card emulation
  • Drive selection and track/sector positioning

Running Classic Software

With CP/M booting successfully, I could run classic software:

Zork I - Infocom's legendary text adventure runs perfectly:

Zork I running on CP/M in the RetroShield TUI emulator

WordStar 3.3 and SuperCalc also run, though they need terminal escape codes configured properly (the Kaypro version uses ADM-3A codes).

Try It Yourself

The code is available on GitHub.

To run:

cd emulator/rust
cargo build --release
./target/release/retroshield_tui -s storage path/to/boot.bin

Press F5 to run, then type zork1 at the A> prompt.

Lessons Learned

Building this system reinforced some timeless principles:

  1. Abstraction layers work. CP/M's BIOS/BDOS/CCP split made porting trivial. Only 1KB of code needed to be written for a completely new "hardware" platform.

  2. Debug output is essential. Adding hex dumps of track/sector values immediately revealed the SELDSK bug.

  3. Read the documentation. The CP/M 2.2 System Alteration Guide is remarkably well-written and explained exactly what the BIOS functions needed to do.

  4. Old code still runs. With the right emulation layer, 45-year-old binaries execute flawlessly. The Z80 instruction set is eternal.

There's a certain satisfaction in seeing that A> prompt appear. It's the same prompt that greeted users in 1977, now running on code I wrote in 2025. The machines change, but the software endures.

Sampo: Designing a 16-bit RISC CPU from Scratch - Part 1: Theory and Architecture

In Finnish mythology, the Sampo is a magical artifact from the epic poem Kalevala, compiled by Elias Lönnrot in 1835. According to legend, the Sampo was forged by Ilmarinen, a legendary blacksmith and sky god, from a swan's feather, a grain of barley, a ball of wool, a drop of milk, and a shaft of a distaff. The resulting creation took the form of a magical mill that could produce flour, salt, and gold endlessly—bringing riches and good fortune to its holder.

The exact nature of the Sampo has been debated by scholars since 1818, with over 30 theories proposed—ranging from a world pillar to an astrolabe to a decorated shield. This mystery makes it a fitting namesake for a CPU architecture: something that transforms simple inputs into useful outputs, whose inner workings invite exploration and understanding.

This is the first part of a two-part series exploring the Sampo CPU architecture. In this article, we'll dive deep into the theory, design philosophy, and architectural decisions that shaped Sampo. In Part 2, we'll get our hands dirty with an actual FPGA implementation using Amaranth HDL, bringing this processor to life on real silicon.

The Problem Space: Why Another CPU?

Before diving into Sampo's architecture, it's worth asking: why design a new CPU at all? The retrocomputing community has no shortage of classic processors to explore—the Z80, 6502, 68000—and modern RISC architectures like RISC-V offer clean, well-documented designs for educational purposes.

The answer lies in a specific niche that existing architectures don't quite fill. Consider the typical workloads of classic 8-bit systems: interpreters for languages like BASIC and Forth, operating systems like CP/M, text editors, and simple games. These workloads have distinct characteristics:

  1. Heavy use of memory operations: Block copies, string manipulation, memory fills
  2. Port-based I/O: Serial terminals, disk controllers, sound chips accessed via dedicated I/O instructions
  3. Context switching: Interrupt handlers that need to save and restore register state quickly
  4. BCD arithmetic: Calculator applications, financial software

The Z80 excels at these tasks through specialized instructions (LDIR, LDDR, IN, OUT) and its alternate register set. But the Z80 is an 8-bit CISC processor with irregular encoding, complex addressing modes, and over 300 instruction variants. This makes it challenging to implement efficiently in modern hardware or to target with optimizing compilers.

Modern RISC architectures like RISC-V take the opposite approach: clean, orthogonal instruction sets optimized for pipelining and compiler code generation. But they typically use memory-mapped I/O (no dedicated I/O instructions), lack block operations, and provide no alternate register sets for fast context switching.

Sampo occupies the middle ground—a "Z80 programmer's RISC" that combines the regularity and simplicity of RISC design with the specialized capabilities that made the Z80 so effective for its target workloads.

Design Goals

Sampo was designed with five primary goals:

  1. RISC-inspired instruction set: Clean, orthogonal design with predictable encoding
  2. 16-bit native word size: Registers, ALU, and memory addressing all 16-bit
  3. Efficient for interpreters and compilers: Stack operations, indirect addressing, hardware multiply/divide
  4. Simple to implement: Suitable for FPGA synthesis or software emulation
  5. Z80-workload compatible: Port-based I/O, BCD support, block operations, alternate registers

These goals create natural tensions. RISC purity would eliminate block operations and port-based I/O. Maximum Z80 compatibility would preserve its irregular encoding. Sampo resolves these tensions by borrowing selectively from multiple architectural traditions.

Architectural Lineage

Sampo's design draws from four distinct sources, each contributing specific elements:

From RISC-V

RISC-V's influence is most visible in Sampo's register conventions:

  • Zero register (R0): A register that always reads as zero and ignores writes. This eliminates the need for separate "clear" or "load zero" instructions—ADD R4, R0, R0 clears R4, ADD R4, R5, R0 copies R5 to R4.
  • Register naming conventions: Return address (RA), stack pointer (SP), global pointer (GP), argument registers (A0-A3), temporaries (T0-T3), and saved registers (S0-S3).
  • Load/store architecture: Only load and store instructions access memory; all computation occurs between registers.

From MIPS

MIPS contributed Sampo's approach to instruction encoding:

  • Simple, orthogonal formats: A small number of instruction formats (R, I, S, B, J) with consistent field positions
  • 4-bit primary opcode: Sixteen instruction categories, each with function codes for variants
  • PC-relative branching: Branch targets specified as signed offsets from the program counter

From ARM Thumb/Thumb-2

ARM's Thumb instruction set inspired Sampo's hybrid encoding strategy:

  • 16-bit base instruction width: Most common operations fit in 16 bits for improved code density
  • 32-bit extended forms: Operations requiring larger immediates use a two-word format
  • Prefix-based extension: The 0xF opcode prefix indicates a 32-bit instruction, simplifying decode

From the Z80

The Z80 provides Sampo's "personality"—the features that make it feel familiar to retrocomputing enthusiasts:

  • Port-based I/O: IN and OUT instructions with 8-bit port addresses, separate from the memory address space
  • Alternate register set: The EXX instruction swaps working registers with shadow copies for fast interrupt handling
  • Block operations: LDIR, LDDR, FILL, and CPIR for efficient memory manipulation
  • BCD support: The DAA (Decimal Adjust Accumulator) instruction for binary-coded decimal arithmetic
  • 64KB address space: 16-bit addresses, matching the Z80's memory model

The Register File

Sampo provides 16 general-purpose 16-bit registers, organized with RISC-V-style conventions:

Register   Alias   Convention
R0         ZERO    Always reads as 0, writes ignored
R1         RA      Return address (saved by caller)
R2         SP      Stack pointer
R3         GP      Global pointer (optional)
R4-R7      A0-A3   Arguments / Return values
R8-R11     T0-T3   Temporaries (caller-saved)
R12-R15    S0-S3   Saved registers (callee-saved)

The zero register deserves special attention. Having a register that always contains zero eliminates entire classes of instructions found in other architectures:

  • MOV Rd, Rs becomes ADD Rd, Rs, R0
  • CLR Rd becomes ADD Rd, R0, R0
  • NEG Rd, Rs can use R0 as the implicit minuend
  • CMP Rs, #0 becomes SUB R0, Rs, R0 (result discarded, flags set)

This technique, pioneered by MIPS and refined by RISC-V, dramatically simplifies the instruction set while maintaining expressiveness.

Alternate Registers

Unlike the Z80, which swaps all main registers with EXX, Sampo is selective. Only registers R4-R11 (the arguments and temporaries) have shadow copies. The critical system registers—R0 (zero), R1 (return address), R2 (stack pointer), R3 (global pointer), and R12-R15 (saved registers)—are never swapped.

This design decision serves interrupt handling. When an interrupt occurs, the handler can execute EXX to gain a fresh set of working registers without corrupting the interrupted code's arguments or temporaries. The stack pointer remains valid (no need to establish a new stack), and the return address register can be used to save the interrupted PC.

irq_handler:
    EXX                     ; Swap to alternate R4-R11
    ; ... handle interrupt using R4'-R11' ...
    ; Primary registers preserved automatically
    EXX                     ; Swap back
    RETI                    ; Return from interrupt

The Flags Register

Sampo uses an 8-bit flags register with six defined flags:

Bit   Flag   Name         Description
7     N      Negative     Sign bit of result (bit 15)
6     Z      Zero         Result is zero
5     C      Carry        Unsigned overflow / borrow
4     V      Overflow     Signed overflow
3     H      Half-carry   Carry from bit 3 to 4 (for BCD)
2     I      Interrupt    Interrupt enable

The N, Z, C, and V flags follow standard conventions and support the full range of conditional branches. The H (half-carry) flag exists specifically for the DAA instruction, enabling correct BCD arithmetic. The I flag controls interrupt recognition.

Notably, Sampo provides explicit GETF and SETF instructions to read and write the flags register, unlike many RISC architectures that treat flags as implicit state. This supports context switching and debugging.

Memory Model

Sampo uses a straightforward memory model:

  • Address space: 64KB (16-bit addresses)
  • Byte-addressable: Individual bytes can be loaded and stored
  • Little-endian: Multi-byte values stored with LSB at lower address
  • Word alignment: 16-bit words should be aligned on even addresses (optional enforcement)

A suggested memory map divides the 64KB space:

0x0000-0x00FF   Interrupt vectors / Reset
0x0100-0x7FFF   Program ROM (~32KB)
0x8000-0xFEFF   RAM (~32KB)
0xFF00-0xFFFF   Memory-mapped I/O (256 bytes)

This layout provides a clean separation between code, data, and I/O while leaving room for customization. The interrupt vector area at the bottom of memory follows Z80 conventions, with the reset vector at 0x0000 and interrupt vector at 0x0004.

Port-Based I/O

In addition to memory, Sampo provides a separate 256-port I/O address space accessed via IN and OUT instructions. This design directly mirrors the Z80 and enables straightforward porting of code that interacts with serial ports, disk controllers, sound chips, and other peripherals.

The I/O instructions come in two forms:

INI  R4, 0x80       ; Read from port 0x80 (immediate port number)
IN   R4, (R5)       ; Read from port specified in R5
OUTI 0x81, R4       ; Write R4 to port 0x81 (immediate)
OUT  (R5), R4       ; Write R4 to port specified in R5

Extended 32-bit forms (INX, OUTX) allow the full 8-bit port range to be specified in immediate form.

Instruction Encoding

Sampo uses a clean, regular encoding scheme with 16-bit base instructions and 32-bit extended forms. The 4-bit primary opcode in bits 15:12 determines the instruction category:

Opcode   Category   Description
0x0      ADD        Register addition
0x1      SUB        Register subtraction
0x2      AND        Bitwise AND
0x3      OR         Bitwise OR
0x4      XOR        Bitwise XOR
0x5      ADDI       Add immediate
0x6      LOAD       Load from memory
0x7      STORE      Store to memory
0x8      BRANCH     Conditional branch
0x9      JUMP       Unconditional jump/call
0xA      SHIFT      Shift and rotate
0xB      MULDIV     Multiply/divide/BCD
0xC      MISC       Stack, block ops, compare
0xD      I/O        Port input/output
0xE      SYSTEM     NOP, HALT, interrupts
0xF      EXTENDED   32-bit instructions

Instruction Formats

Six formats cover all instruction types:

Format R (Register-Register):

15       12 11     8 7      4 3      0
+----------+--------+--------+--------+
|  opcode  |   Rd   |  Rs1   |  Rs2   |
+----------+--------+--------+--------+

Used for three-register operations like ADD R4, R5, R6.

Format I (Immediate):

15       12 11     8 7                0
+----------+--------+------------------+
|  opcode  |   Rd   |      imm8        |
+----------+--------+------------------+

Used for operations with 8-bit immediates like ADDI R4, 42.

Format S (Store):

15       12 11     8 7      4 3      0
+----------+--------+--------+--------+
|  opcode  |  imm4  |  Rs1   |  Rs2   |
+----------+--------+--------+--------+

Used for stores where the destination register field holds an offset.

Format B (Branch):

15       12 11     8 7                0
+----------+--------+------------------+
|  opcode  |  cond  |     offset8      |
+----------+--------+------------------+

Used for conditional branches with PC-relative offsets.

Format J (Jump):

15       12 11                       0
+----------+--------------------------+
|  opcode  |        offset12          |
+----------+--------------------------+

Used for unconditional jumps with 12-bit PC-relative offsets.

Format X (Extended):

Word 0:
15       12 11     8 7      4 3      0
+----------+--------+--------+--------+
|   0xF    |   Rd   |  Rs1   |  sub   |
+----------+--------+--------+--------+

Word 1:
15                                   0
+-------------------------------------+
|              imm16                  |
+-------------------------------------+

Used for operations requiring 16-bit immediates or absolute addresses.

Encoding Examples

To illustrate the encoding scheme, let's examine several instructions:

ADD R4, R5, R6 (R4 = R5 + R6):

Opcode = 0x0, Rd = 4, Rs1 = 5, Rs2 = 6
Binary: 0000 0100 0101 0110 = 0x0456

ADDI R4, 10 (R4 = R4 + 10):

Opcode = 0x5, Rd = 4, imm8 = 10
Binary: 0101 0100 0000 1010 = 0x540A

BEQ +8 (branch forward 8 bytes if equal):

Opcode = 0x8, cond = 0 (BEQ), offset = 4 words
Binary: 1000 0000 0000 0100 = 0x8004

LIX R4, 0x1234 (load 16-bit immediate):

Word 0: 0xF (extended), Rd = 4, Rs = 0, sub = 7 (LIX)
Word 1: 0x1234
Binary: 1111 0100 0000 0111 0001 0010 0011 0100 = 0xF407 0x1234

The regularity of this encoding makes instruction decode straightforward—the first nibble determines the instruction category, and subsequent fields are in consistent positions across formats.
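
As a quick illustration of how mechanical the decode and encode really are, here is a small Rust sketch of an R-format encoder. The field packing follows the diagram above; the helper itself is mine, not taken from the sasm assembler.

// Pack an R-format instruction: opcode[15:12], Rd[11:8], Rs1[7:4], Rs2[3:0].
// Illustrative helper, not code from the sasm assembler.
fn encode_r(opcode: u8, rd: u8, rs1: u8, rs2: u8) -> u16 {
    ((opcode as u16 & 0xF) << 12)
        | ((rd as u16 & 0xF) << 8)
        | ((rs1 as u16 & 0xF) << 4)
        | (rs2 as u16 & 0xF)
}

fn main() {
    // ADD R4, R5, R6 from the worked example above
    assert_eq!(encode_r(0x0, 4, 5, 6), 0x0456);
}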

The Instruction Set

Sampo provides approximately 66 distinct instructions, organized into ten categories.

Arithmetic (15 instructions)

The arithmetic category includes standard operations (ADD, SUB, ADDI) plus multiply/divide support:

  • MUL: 16×16 multiplication, low 16 bits of result
  • MULH/MULHU: High 16 bits of 32-bit product (signed/unsigned)
  • DIV/DIVU: Integer division (signed/unsigned)
  • REM/REMU: Remainder (signed/unsigned)
  • DAA: Decimal adjust for BCD arithmetic
  • NEG: Two's complement negation
  • CMP: Compare (subtract without storing result)

Hardware multiply and divide are essential for interpreter performance—dividing a 32-bit value by 10 for number formatting would be prohibitively slow without hardware support.

Logic (6 instructions)

Standard bitwise operations: AND, OR, XOR, NOT, plus immediate forms ANDI and ORI.

Shift and Rotate (16 variants)

Sampo provides an unusually rich set of shift operations:

  • SLL/SRL/SRA: Shift left/right logical/arithmetic
  • ROL/ROR: Rotate left/right
  • RCL/RCR: Rotate through carry (17-bit rotation)
  • SWAP: Swap high and low bytes

Each shift type comes in three shift amounts: 1, 4, and 8 bits. The 4-bit shift is particularly useful for hexadecimal digit extraction and insertion. Variable shifts use the extended format with the shift amount in the second register or immediate field.
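
The hex-digit case is easy to show. In Rust terms, what a Sampo routine would do with a shift-by-4 and a mask looks like this (stand-in code, not Sampo assembly):

// Stand-in for the Sampo idiom: extract hex digit n (0..=3) of a 16-bit
// word with a 4-bit shift and a mask (SRL-by-4 plus ANDI in Sampo terms).
fn hex_digit(value: u16, n: u16) -> u8 {
    ((value >> (4 * n)) & 0xF) as u8
}

fn main() {
    assert_eq!(hex_digit(0x1234, 2), 0x2);
    assert_eq!(hex_digit(0x1234, 0), 0x4);
}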

Load/Store (6 instructions)

Memory access instructions include word and byte loads (with sign or zero extension), word and byte stores, and LUI (Load Upper Immediate) for constructing 16-bit constants:

LUI  R4, 0x12       ; R4 = 0x1200
ORI  R4, R4, 0x34   ; R4 = 0x1234

Branch (16 conditions)

Sampo supports a comprehensive set of branch conditions:

  • BEQ/BNE: Equal/not equal
  • BLT/BGE/BGT/BLE: Signed comparisons
  • BLTU/BGEU/BHI/BLS: Unsigned comparisons
  • BMI/BPL: Negative/positive
  • BVS/BVC: Overflow set/clear
  • BCS/BCC: Carry set/clear

This covers all reasonable comparison outcomes for both signed and unsigned arithmetic.

Jump/Call (4 instructions)

  • J: PC-relative unconditional jump
  • JAL: Jump and link (save return address in RA)
  • JR: Jump to address in register
  • JALR: Jump and link to register address

Block Operations (6 instructions)

The block operations use a fixed register convention (R4=count, R5=source, R6=destination):

  • LDI/LDD: Load single byte, increment/decrement pointers and count
  • LDIR/LDDR: Repeat until count reaches zero
  • FILL: Fill memory region with value
  • CPIR: Compare and search forward

These instructions are decidedly un-RISC—they're multi-cycle operations that modify multiple registers. But they're implemented with predictable behavior (always the same registers, always the same algorithm) and provide enormous speedups for common memory operations.

Stack (4 instructions)

  • PUSH/POP: Single register push/pop
  • PUSHM/POPM: Push/pop multiple registers (via bitmask)

I/O (4 instructions)

  • INI/OUTI: Immediate port address
  • IN/OUT: Register port address

System (9 instructions)

  • NOP: No operation
  • HALT: Stop processor
  • DI/EI: Disable/enable interrupts
  • EXX: Exchange alternate registers
  • RETI: Return from interrupt
  • SWI: Software interrupt
  • SCF/CCF: Set/complement carry flag
  • GETF/SETF: Read/write flags register

Comparison with Other Architectures

To put Sampo in context, consider how it compares with related processors:

Aspect              Z80              MIPS            RISC-V          Sampo
Word size           8-bit            32-bit          32/64-bit       16-bit
Instruction width   1-4 bytes        4 bytes         2/4 bytes       2/4 bytes
Registers           8 + alternates   32              32              16 + alternates
Zero register       No               $zero           x0              R0
I/O model           Port-based       Memory-mapped   Memory-mapped   Port-based
Block operations    Yes              No              No              Yes
Instruction count   ~300+            ~60             ~50 base        ~66

Sampo sits in an interesting position: more regular than the Z80 but with Z80-friendly features, smaller and simpler than 32-bit RISC but still cleanly orthogonal.

Code Examples

To demonstrate how Sampo assembly looks in practice, here's a "Hello World" program that outputs text via a serial port:

        .org 0x0100

.equ    ACIA_STATUS 0x80
.equ    ACIA_DATA   0x81
.equ    TX_READY    0x02

start:
        LIX  R4, message        ; Load address of string

loop:
        LBU  R5, (R4)           ; Load byte from string
        CMP  R5, R0             ; Compare with zero
        BEQ  done               ; If null terminator, done

wait_tx:
        INI  R6, ACIA_STATUS    ; Read serial status port
        ANDI R6, R6, TX_READY   ; Check transmit ready bit
        BEQ  wait_tx            ; Wait if not ready

        OUTI ACIA_DATA, R5      ; Write character to data port
        ADDI R4, 1              ; Next character
        J    loop
done:
        HALT

message:
        .asciz "Hello, Sampo!\n"

And here's a Fibonacci function demonstrating the calling convention:

; fib(n) - compute nth Fibonacci number
; Input: R4 (A0) = n
; Output: R4 (A0) = fib(n)

fib:
        ADDI R5, R0, 0      ; a = 0
        ADDI R6, R0, 1      ; b = 1
        CMP  R4, R0
        BEQ  fib_done

fib_loop:
        ADD  R7, R5, R6     ; temp = a + b
        ADD  R5, R6, R0     ; a = b
        ADD  R6, R7, R0     ; b = temp
        ADDI R4, R4, -1     ; n--
        BNE  fib_loop

fib_done:
        ADD  R4, R5, R0     ; return a
        JR   RA

The code reads naturally to anyone familiar with RISC assembly, while the I/O instructions and register conventions provide the Z80-like feel that makes porting classic software straightforward.

Looking Ahead: FPGA Implementation

With the architecture defined, the next step is implementation. In Part 2 of this series, we'll build a working Sampo processor using Amaranth HDL, a modern Python-based hardware description language. We'll cover:

  • The ALU module: Implementing all arithmetic and logic operations
  • The register file: Including the alternate register set and zero register
  • The instruction decoder: Parsing the various instruction formats
  • The control unit: Managing the fetch-decode-execute cycle
  • The memory interface: Connecting to block RAM
  • The I/O subsystem: Implementing the port-based I/O model
  • Integration: Putting it all together into a working system-on-chip

We'll synthesize the design for an affordable FPGA board and run actual Sampo programs, demonstrating that this architecture isn't just a paper exercise but a real, working processor.

The Sampo project on GitHub includes a complete Rust-based assembler (sasm) and emulator (semu) with a TUI debugger, so you can start writing and testing Sampo programs today. The FPGA implementation will let you run those same programs on real hardware, completing the journey from mythological artifact to silicon reality.

Stay tuned for Part 2, where we'll forge our own Sampo—not from swan feathers and barley, but from lookup tables and flip-flops.

What VisiCalc Teaches Us About AI: The 45-Year Pattern of "This Time It's Different"

I was born in the back half of 1980, which means I missed the revolution.

By the time I sat down at an Apple IIe in 1986, with its green phosphor glow and chunky 5.25-inch floppies, the war was already over. The Altair 8800 was a museum piece. CP/M was fading into obscurity. The TRS-80 and Commodore PET were yesterday's news. I arrived just in time for the Apple II's twilight years, blissfully unaware that the machine in front of me represented the victory lap of a decade-long transformation.

I never experienced CP/M. I never loaded WordStar from an 8-inch floppy. I never watched VisiCalc recalculate a spreadsheet and felt the shock of a machine doing in seconds what had taken hours by hand. These were foundational moments in computing history, and I missed them entirely.

Now, decades later, I find myself building Z80 emulators and writing compilers for a processor that had already ceded the PC spotlight by the time I could read — though it quietly lived on in the TI graphing calculators that would later get me through high school math, and still powers them today. It's a form of technological archaeology — reconstructing a world I never lived in, trying to understand the texture of an era I only know through documentation and nostalgia. And from this vantage point, watching the current panic over artificial intelligence, I can't help but notice something: we've been here before.

As 2025 quickly fades, ChatGPT writes code. Midjourney creates art. Claude analyzes documents. The headlines scream that knowledge workers are doomed, that white-collar jobs will evaporate, that "this time it's different."

But it's not different. It's never different. And VisiCalc can prove it.

The World Before the Spreadsheet

To understand why VisiCalc mattered, you need to understand what "spreadsheet" meant in 1978. It wasn't software. It was paper.

Accountants, analysts, and financial planners worked with literal sheets of paper, ruled into rows and columns. They called them spreadsheets because the worksheets spread across multiple pages, sometimes taped together into unwieldy grids that covered entire desks. Every number was written by hand. Every calculation was performed with a mechanical adding machine or, if you were modern, an electronic calculator.

Here's what financial planning looked like: You'd spend hours, maybe days, building a projection. Revenue assumptions in one column, cost structures in another, profit margins calculated cell by cell. Then your boss would ask a simple question: "What if we increase prices by 5%?"

And you'd start over.

Not from the pricing cell — from every cell that pricing touched. The cascade of recalculations could take hours. A complex model might require a full day to revise. And if you made an error somewhere in the middle? Good luck finding it in a forest of pencil marks and eraser smudges.

Word processing was no better. Before WordStar and its competitors, documents were produced on typewriters. The IBM Selectric was the gold standard — a marvel of engineering that let you swap font balls and correct single characters with lift-off tape. But if you found a typo on page 47 of a 60-page contract, you had options: live with it, or retype pages 47 through 60.

Typing was a specialized profession. Companies maintained typing pools — rooms full of secretaries whose primary job was converting handwritten drafts and dictation into finished documents. A skilled typist was a valuable employee precisely because the work was so labor-intensive.

And if you needed computing power for serious analysis, you went to the mainframe. You submitted your job to the MIS department, waited in a queue, and paid by the CPU-minute. Time-sharing systems charged hundreds of dollars per hour. Computing was a scarce resource, rationed by bureaucracy.

This was knowledge work in the mid-1970s: manual, slow, expensive, and error-prone.

The Revolution No One Expected

Dan Bricklin was a Harvard MBA student in 1978 when he had the insight that would change everything. Sitting in a classroom, he watched a professor work through a financial model on a blackboard. The professor would write numbers, perform calculations, and fill in cells. Then he'd change an assumption, and the recalculation cascade would begin — erasing, recomputing, rewriting, sometimes running out of blackboard space.

Bricklin's thought was simple: what if the blackboard could recalculate itself?

Working with programmer Bob Frankston, Bricklin built VisiCalc — the "visible calculator." It ran on the Apple II, which was itself a hobbyist curiosity, a machine that enthusiasts bought to tinker with BASIC programs and play primitive games. VisiCalc transformed it into a business tool.

The software shipped in 1979, priced at $100. Within a year, it was selling 12,000 copies per month. More importantly, it was selling Apple IIs. The $2,000 computer became justifiable as a business expense because VisiCalc made it productive.

Consider the economics. A financial analyst in 1980 earned perhaps $25,000 per year. A secretary earned $12,000 to $15,000. The Apple II plus VisiCalc cost roughly $2,500. If the software saved a few weeks of analyst time, or let one analyst do the work that had previously required two, it paid for itself almost immediately.

But the real magic wasn't cost savings — it was capability. Suddenly you could ask "what if?" as many times as you wanted. Change an assumption, watch the spreadsheet ripple with recalculations, and see the answer in seconds. Financial modeling went from a laborious exercise in arithmetic to an exploratory conversation with your data.

WordStar, released a year earlier in 1978, performed the same transformation for documents. Write, edit, revise, move paragraphs, fix typos — all before committing anything to paper. The document existed as a malleable thing, not a fixed artifact produced through irreversible mechanical action.

Together, these applications (and others like dBASE for databases and SuperCalc as a VisiCalc competitor) created the productivity software category. They didn't sell computers to hobbyists; they sold computers to businesses. And they did it by solving mundane problems: arithmetic and typing.

The pundits of the era made predictions. Accountants would become obsolete. Secretaries would be eliminated. The typing pool would vanish. Knowledge work itself was being automated.

What Actually Happened

The predictions were wrong. Or rather, they were right about the transformation but wrong about the outcome.

Typing pools did shrink. The specialized profession of "typist" largely disappeared as word processing became a universal skill. But administrative assistants didn't vanish — their job changed. Instead of spending hours producing documents, they spent hours managing calendars, coordinating logistics, and handling communication. The mechanical work evaporated; the judgment work remained.

Bookkeepers declined as a profession. The person whose job was to maintain ledgers and perform routine calculations found that job automated. But accountants — the people who interpreted the numbers, made recommendations, and exercised judgment — grew in number. The Bureau of Labor Statistics shows steady growth in accounting employment through the 1980s and 1990s, even as the basic arithmetic of accounting was completely automated.

Financial analysts became more valuable, not less. The spreadsheet didn't replace them; it amplified them. An analyst who could build sophisticated models in VisiCalc or Lotus 1-2-3 was worth more than one limited to paper. The ceiling rose.

And here's the crucial point: the total amount of analysis, documentation, and financial modeling exploded. When something becomes cheaper and faster to produce, you produce more of it. Companies that had operated with crude annual budgets started building detailed monthly projections. Reports that had been quarterly became weekly. The volume of knowledge work grew to fill the new capacity.

This pattern — automation making workers more productive, which increases demand for the work, which maintains or increases employment — has a name in economics. It's called the Jevons paradox, originally observed in coal consumption: as steam engines became more efficient, total coal usage increased rather than decreased, because efficiency made steam power economical for more applications.

The same paradox applies to labor. Make an accountant 10x more productive, and you don't need 1/10th as many accountants. You do 10x as much accounting.

The Pattern Repeats

VisiCalc wasn't the first technology to trigger predictions of labor displacement, and it certainly wasn't the last. The pattern repeats with remarkable consistency:

ATMs (1970s-present): Automated Teller Machines were supposed to eliminate bank tellers. The math seemed obvious — why pay a human to dispense cash when a machine could do it? Yet U.S. bank teller employment roughly doubled between 1970 and 2010. The explanation: ATMs made bank branches cheaper to operate, so banks opened more branches, each requiring fewer but still some tellers. And the tellers' jobs shifted from cash handling to sales, complex transactions, and customer relationships.

CAD Software (1980s): Computer-aided design was going to eliminate draftsmen. Instead, it eliminated hand drafting while increasing demand for designers. The ability to iterate quickly, produce more alternatives, and handle more complex designs meant more design work overall.

Desktop Publishing (1980s): PageMaker and QuarkXPress would kill graphic designers by letting anyone create professional documents. Instead, the volume of designed materials exploded, and graphic design became a larger profession. The average quality rose because the floor rose.

Legal Research Databases (1990s): LexisNexis and Westlaw would eliminate paralegals by automating case research. Instead, faster research enabled more litigation, more thorough preparation, and more legal work overall.

Electronic Trading (1990s-2000s): Algorithmic trading would eliminate floor traders and financial professionals. It did eliminate floor traders, but the financial sector's employment grew as new roles emerged: quants, algorithm developers, risk managers, compliance officers.

In every case, the predictions followed the same logic: Technology X automates task Y, therefore workers who do Y are obsolete. And in every case, the predictions missed the second-order effects: automation makes the overall activity more valuable, demand increases, and workers shift to higher-judgment versions of the same work.

The AI Moment

Which brings us to now.

ChatGPT was released in November 2022. Within two months, it had 100 million users. Within a year, AI assistants were embedded in products from Microsoft to Google to Adobe. Large language models could write essays, generate code, summarize documents, answer questions, and produce content that was — on first glance — indistinguishable from human output.

The predictions arrived immediately. Programmers would become obsolete. Writers were doomed. Customer service, legal research, financial analysis, medical diagnosis — all would be automated. Goldman Sachs estimated 300 million jobs would be affected. The World Economic Forum issued reports. Thought leaders proclaimed that "this time it's different."

But is it?

Let's apply the VisiCalc framework. What exactly does AI automate?

First drafts, not final judgment. AI can produce a draft document, a code snippet, an analysis outline. What it cannot do is determine whether that draft serves the actual goal, handles the edge cases that matter, or fits the political context of the organization. The human reviews, revises, and takes responsibility.

Pattern matching, not pattern breaking. Large language models are, at their core, sophisticated pattern matchers trained on existing text. They excel at producing outputs that look like their training data. They struggle with genuine novelty — situations unlike anything in the training corpus, problems that require inventing new approaches rather than recombining old ones.

The middle of the distribution, not the edges. AI handles routine cases well. It struggles with outliers. The customer service bot can resolve common issues; the unusual complaint needs a human. The coding assistant can generate boilerplate; the architectural decision requires judgment.

Production, not accountability. AI can produce outputs, but it cannot be held accountable for them. When the document goes to the client, someone signs it. When the code ships to production, someone owns it. When the decision has consequences, someone faces them. That someone is human, because accountability requires agency, and agency requires humanity.

This is exactly the pattern we saw with spreadsheets. VisiCalc automated arithmetic, not judgment. It automated production, not accountability. It handled the routine middle, not the novel edges. And the humans who learned to use it became more valuable, not less.

The Irreducible Human

Why do humans remain in the loop? Not for sentimental reasons. Not because we want to preserve jobs. But because certain functions cannot be automated, regardless of how sophisticated the technology.

Accountability requires agency. When something goes wrong, someone must be responsible. Legal systems, regulatory frameworks, and social structures all assume a responsible party. AI systems can produce outputs, but they cannot be sued, fired, jailed, or shamed. The human who relies on AI output remains accountable for that output. This isn't a bug; it's a feature of how human society functions.

Context is infinite and local. AI models are trained on general patterns. Your specific situation — your company's politics, your client's unspoken concerns, your industry's unwritten rules — is not in the training data. The model knows what words typically follow other words. It doesn't know that your CFO hates bullet points, that your customer is going through a divorce, or that mentioning the competitor's product is forbidden in this meeting. The human provides context.

Trust requires relationship. Business transactions ultimately rest on trust between humans. You hire the lawyer, not the legal database. You trust your doctor, not the diagnostic algorithm. You buy from salespeople, not recommendation engines. AI can support these relationships, but it cannot replace them, because trust is a human phenomenon.

The feedback loop requires humans. Here's a subtle but critical point: AI systems are trained on human-generated data. If humans stop producing original work, the training data stops improving. The model learns to produce outputs that look like human outputs because it was trained on human outputs. Remove the humans, and you get a system trained on its own outputs — a recursive degradation. We are the curriculum.

Novel situations require genuine understanding. AI excels at interpolation — finding patterns within the space of its training data. It struggles with extrapolation — handling situations outside that space. Genuine novelty, by definition, lies outside the training distribution. The unprecedented situation, the black swan event, the "we've never seen this before" moment — these require human judgment, because no pattern matching can help when there's no pattern to match.

The Reskilling Reality

None of this means AI changes nothing. It changes a lot. The question is what kind of change.

When spreadsheets arrived, certain skills became less valuable. Manual arithmetic, once essential for financial work, became irrelevant. The ability to maintain error-free ledgers through careful penmanship mattered less. Slide rule proficiency joined buggy whip maintenance in the museum of obsolete competencies.

But new skills became essential. Building spreadsheet models, understanding the logic of cell references, knowing how to structure data for analysis — these became core professional competencies. "Computer literacy" emerged as a job requirement. People who learned the new tools thrived; people who refused to adapt struggled.

AI is triggering the same shift. Consider what becomes less valuable:

Writing first drafts from scratch. When AI can produce a competent first draft in seconds, the ability to stare at a blank page and produce prose is less differentiating. The value shifts to editing, directing, and refining.

Routine research and compilation. When AI can summarize documents, extract key points, and synthesize information, the human who only does that work has a problem. The value shifts to evaluating sources, asking the right questions, and interpreting results.

Basic code production. When AI can generate boilerplate, implement standard patterns, and translate requirements into code, the programmer whose main skill is typing syntax is in trouble. The value shifts to architecture, debugging, code review, and understanding what the system should do.

And consider what becomes more valuable:

Judgment and curation. AI produces. Humans evaluate. The ability to look at AI output and quickly determine what's useful, what's wrong, and what's missing becomes essential. This is editing in the broadest sense — not just fixing typos, but directing the creative process.

Domain expertise plus AI fluency. The accountant who understands both accounting and how to leverage AI tools is more valuable than either an accountant who ignores AI or an AI operator who doesn't understand accounting. The combination is the new competency.

Handling exceptions and edge cases. As AI handles the routine middle, humans focus on the exceptions. The unusual customer complaint, the novel legal situation, the unprecedented technical problem — these become the human domain. Expertise in handling weirdness becomes more valuable.

Relationship and trust building. As transactional work becomes automated, relationship work becomes relatively more important. The human who can build trust, navigate politics, and close deals face-to-face has a durable advantage.

This is exactly what happened with spreadsheets. The value shifted from arithmetic to analysis, from production to judgment, from routine to exception. The workers who adapted thrived. The workers who clung to obsolete methods struggled.

The Transition Is Never Painless

I don't want to minimize the disruption. Real people, with real skills, face real challenges when technology shifts beneath them.

The typing pool secretary in 1985 had spent years developing speed and accuracy on the Selectric. She could type 80 words per minute with minimal errors. She knew the quirks of carbon paper, the rhythm of the carriage return, the muscle memory of the key layout. These skills, honed over a decade, became worthless in the span of a few years.

Some of those secretaries learned WordPerfect and became administrative assistants. Some moved into other roles entirely. Some struggled, unable or unwilling to adapt, and found themselves squeezed out of the workforce. The aggregate statistics — employment levels, productivity growth, economic expansion — hide individual stories of dislocation and difficulty.

The same will be true of AI. Some knowledge workers will adapt smoothly, integrating AI tools into their workflow and becoming more productive. Some will resist, clinging to methods that worked in 2020 but feel increasingly obsolete by 2030. Some will find themselves displaced, their particular bundle of skills suddenly less valuable in a market that's moved on.

The historical pattern tells us that the net outcome is positive — that technological transitions create more opportunity than they destroy, that the economy adjusts, that new roles emerge. But history is cold comfort to the individual caught in the transition. The typewriter repairman didn't care that computer technicians were a growing field. He cared that his skills were worthless.

This is why the reskilling conversation matters. Not because AI will eliminate all jobs — it won't — but because the specific jobs, the specific skills, the specific ways of working will change. And navigating that change requires awareness, adaptability, and often institutional support.

The workers who thrived through the spreadsheet revolution weren't necessarily the most skilled at the old methods. They were the ones who recognized the shift and moved with it. The accountant who embraced Lotus 1-2-3, even if she was mediocre at mental arithmetic, outcompeted the brilliant human calculator who refused to touch a keyboard.

The same pattern is emerging now. The programmer who integrates AI assistance, even if she's not the fastest typist, will outcompete the keyboard wizard who insists on writing every character manually. The writer who uses AI for drafts and focuses on editing and judgment will outcompete the prose stylist who spends hours on first drafts. The analyst who lets AI handle data compilation and focuses on interpretation will outcompete the Excel jockey who takes pride in manual formula construction.

Adaptation isn't optional. It wasn't optional in 1980, and it isn't optional now.

The Long View

I spend my weekends building emulators for 50-year-old processors. I write compilers that target the Z80, a chip that was designed when Gerald Ford was president. I run BASIC and FORTH on simulated hardware, watching instructions execute that were first written when disco was young.

From this perspective, the current AI moment looks familiar. Technology extends human capability. It always has. The accountant with VisiCalc wasn't replaced; she was amplified. The writer with WordStar wasn't obsolete; he was leveraged. The analyst with a spreadsheet could do in hours what had taken days, and that made analysis more valuable, not less.

When I run my Z80 emulator — JavaScript interpreting WebAssembly interpreting 1976 machine code — I'm witnessing layers of abstraction that would have seemed like science fiction to the engineers who designed the original chip. But the fundamental relationship remains: humans using tools to extend their capabilities.

The nature of work changes. It always changes. The bookkeeper becomes the accountant. The typist becomes the administrative assistant. The draftsman becomes the designer. The job titles shift, the tools evolve, the skills required transform. But the need for human judgment, human accountability, human creativity, and human relationships remains.

This isn't optimism. It's pattern recognition. The 45-year pattern from VisiCalc to ChatGPT is consistent: technology that automates tasks changes the nature of work without eliminating the need for workers. The "this time it's different" predictions have been wrong every time, not because technology isn't powerful, but because the predictions misunderstand the relationship between automation and human labor.

The spreadsheet didn't eliminate the need for human intelligence. It made human intelligence more valuable by freeing it from arithmetic. AI won't eliminate the need for human judgment. It will make human judgment more valuable by freeing it from production.

We've been here before. And we'll be here again, decades from now, when some new technology triggers the same predictions, and historians look back at our AI panic the way we look back at the VisiCalc panic — as an understandable overreaction that missed the larger pattern.

The work changes. The workers adapt. The need for humans persists.

It's not different this time. It never is.

Rust on Z80: An LLVM Backend Odyssey

This is the story of attempting something probably inadvisable: compiling Rust for the Zilog Z80, an 8-bit processor from 1976. It's also a story about using AI as a genuine collaborator on deep systems programming work, and what happens when modern software abstractions collide with hardware constraints from an era when 64 kilobytes was considered generous.

Transparency: Claude Code as Collaborator

I want to be upfront about something: significant portions of this compiler backend were developed in collaboration with Claude Code, Anthropic's AI coding assistant. This isn't a case of "AI wrote the code and I took credit" — it's more nuanced than that. Claude served as an unusually patient pair programmer who happens to have read every LLVM tutorial ever written.

Here's what that collaboration actually looked like:

I would describe a problem: "The instruction selector is failing with cannot select: G_SADDO for signed addition with overflow detection." Claude would analyze the GlobalISel pipeline, identify that the Z80's ADC instruction sets the P/V flag for signed overflow, and propose an implementation. I would review, test, discover edge cases, and we'd iterate.

The debugging sessions were particularly valuable. When compilation hung for seven hours on what should have been a two-minute build, Claude helped trace the issue to an accidental infinite recursion — a replace_all refactoring had changed RBI.constrainGenericRegister(...) to constrainOrSetRegClass(...) inside the constrainOrSetRegClass helper function itself. The function was calling itself forever. Finding that bug manually would have taken hours of printf debugging; with Claude analyzing the code structure, we found it in minutes.

This is what AI-assisted development actually looks like in 2025: not magic code generation, but accelerated iteration with a collaborator who never gets frustrated when you ask "wait, explain register allocation to me again."

Why Z80? Why Rust?

The Z80 powered the TRS-80, ZX Spectrum, MSX computers, and countless embedded systems. It's still manufactured today — you can buy new Z80 chips. I went the other way, though, and picked up a handful of vintage ceramic Z80s on eBay. There's something appealing about running modern language constructs on hardware designed when ABBA topped the charts.

More practically, I've been building Z80-based projects on the RetroShield platform, which lets you run vintage processors on Arduino-compatible hardware. Having a modern compiler toolchain opens possibilities that hand-written assembly doesn't.

But Rust specifically? Rust's ownership model and zero-cost abstractions are theoretically perfect for resource-constrained systems. The language was designed for systems programming. The question is whether "systems" can stretch back 50 years.

Building LLVM for the Z80

The first step was getting LLVM itself to build with Z80 support. This meant:

  1. Adding Z80 to the list of supported targets in the build system
  2. Creating the target description files (registers, instruction formats, calling conventions)
  3. Implementing the GlobalISel pipeline components
  4. Wiring everything together so llc -mtriple=z80-unknown-unknown actually works

The target description files alone span thousands of lines. Here's what defining just the basic registers looks like:

def A : Z80Reg<0, "a">;
def B : Z80Reg<1, "b">;
def C : Z80Reg<2, "c">;
def D : Z80Reg<3, "d">;
def E : Z80Reg<4, "e">;
def H : Z80Reg<5, "h">;
def L : Z80Reg<6, "l">;

// 16-bit register pairs
def BC : Z80RegWithSub<7, "bc", [B, C]>;
def DE : Z80RegWithSub<8, "de", [D, E]>;
def HL : Z80RegWithSub<9, "hl", [H, L]>;

Every instruction needs similar treatment. The Z80 has over 700 documented instruction variants when you count all the addressing modes. Not all are needed for a basic backend, but getting basic arithmetic, loads, stores, branches, and calls working required implementing dozens of instruction patterns.

The build process itself was surprisingly manageable — LLVM's build system is well-designed. A complete build with the Z80 target takes about 20 minutes on modern hardware. The iteration cycle during development was typically: change a few files, rebuild (30 seconds to 2 minutes depending on what changed), test with llc, fix, repeat.

The LLVM Approach

LLVM provides a framework for building compiler backends. You describe your target's registers, instruction set, and calling conventions; LLVM handles optimization, instruction selection, and register allocation. In theory, adding a new target is "just" filling in these descriptions.

In practice, LLVM assumes certain things about targets. It assumes you have a reasonable number of general-purpose registers. It assumes arithmetic operations work on values that fit in registers. It assumes function calls follow conventions that modern ABIs have standardized.

The Z80 violates all of these assumptions.

The Register Poverty Problem

The Z80 has seven 8-bit registers: A, B, C, D, E, H, and L. Some can be paired into 16-bit registers: BC, DE, HL. That's it. Modern architectures have 16 or 32 general-purpose registers; the Z80 has seven that aren't even all general-purpose — A is the accumulator with special arithmetic privileges, HL is the primary memory pointer.

LLVM's register allocator expects to juggle many virtual registers across many physical registers. When you have more virtual registers than physical registers, it spills values to memory. On the Z80, you're spilling constantly. Every 32-bit operation requires careful choreography of the few registers available.

Here's what a simple 16-bit addition looks like in our backend:

define i16 @add16(i16 %a, i16 %b) {
  %result = add i16 %a, %b
  ret i16 %result
}

This compiles to:

add16:
    add hl,de
    ret

That's clean because we designed the calling convention to pass arguments in HL and DE. The backend recognizes that the inputs are already where they need to be and emits just the ADD instruction.

But 32-bit addition? That becomes a multi-instruction sequence juggling values through the stack because we can't hold four 16-bit values in registers simultaneously.

The Width Problem

The Z80 is fundamentally an 8-bit processor with 16-bit addressing. Rust's standard library uses usize for indexing, which on most platforms is 32 or 64 bits. The Z80 cannot directly perform 32-bit arithmetic. Every u32 operation expands into multiple 8-bit or 16-bit operations.

Consider multiplication. The Z80 has no multiply instruction at all. To multiply two 16-bit numbers, we emit a call to a runtime library function (__mulhi3) that implements multiplication through shifts and adds. 32-bit multiplication requires calling a function that orchestrates four 16-bit multiplications with proper carry handling.
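
To make "shifts and adds" concrete, this is the algorithm such a helper implements, sketched here in Rust for readability; the actual __mulhi3 is emitted as Z80 code, and its structure may differ.

// Conceptual sketch of the shift-and-add multiplication behind a __mulhi3-style
// helper. Written in Rust for clarity; the real routine is Z80 code, not Rust.
fn mul16(mut a: u16, mut b: u16) -> u16 {
    let mut result: u16 = 0;
    while b != 0 {
        if b & 1 != 0 {
            result = result.wrapping_add(a); // add the current shifted multiplicand
        }
        a <<= 1; // shift the multiplicand left by one bit
        b >>= 1; // move to the next multiplier bit
    }
    result
}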

Division is worse. Iterative division algorithms on 8-bit hardware are slow. Floating-point arithmetic doesn't exist in hardware — every floating-point operation becomes a library call to software implementations.

GlobalISel: The Modern Approach

We're using LLVM's GlobalISel framework rather than the older SelectionDAG. GlobalISel provides finer control over instruction selection through explicit lowering steps:

  1. IRTranslator: Converts LLVM IR to generic machine instructions (G_ADD, G_LOAD, etc.)
  2. Legalizer: Transforms operations the target can't handle into sequences it can
  3. RegBankSelect: Assigns register banks (8-bit vs 16-bit on Z80)
  4. InstructionSelector: Converts generic instructions to target-specific instructions

Each step presented challenges. The Legalizer needed custom rules to break 32-bit operations into 16-bit pieces. RegBankSelect needed to understand that some Z80 instructions only work with specific register pairs. The InstructionSelector needed patterns for every Z80 instruction variant.

One particularly tricky issue: LLVM's overflow-detecting arithmetic. Instructions like G_SADDO (signed add with overflow) return both a result and an overflow flag. The Z80's ADC instruction sets the P/V flag on signed overflow, but capturing that flag to a register requires careful instruction sequencing — you can't just read the flag register arbitrarily.

The Bug That Cost Seven Hours

During development, we hit a bug that perfectly illustrates the challenges of compiler work. After implementing a helper function to handle register class assignment, compilation started hanging. Not crashing — hanging. A simple three-function test file that should compile in milliseconds ran for over seven hours before I killed it.

The issue? During a refactoring pass, we used a global search-and-replace to change all calls from RBI.constrainGenericRegister(...) to our new constrainOrSetRegClass(...) helper. But the helper function itself contained a call to RBI.constrainGenericRegister() as its fallback case. The replace-all changed that too:

// Before (correct):
bool constrainOrSetRegClass(Register Reg, ...) {
  if (!MRI.getRegClassOrNull(Reg)) {
    MRI.setRegClass(Reg, &RC);
    return true;
  }
  return RBI.constrainGenericRegister(Reg, RC, MRI);  // Fallback
}

// After (infinite recursion):
bool constrainOrSetRegClass(Register Reg, ...) {
  if (!MRI.getRegClassOrNull(Reg)) {
    MRI.setRegClass(Reg, &RC);
    return true;
  }
  return constrainOrSetRegClass(Reg, RC, MRI);  // Calls itself forever!
}

The function was calling itself instead of the underlying LLVM function. Every attempt to compile anything would recurse until the stack overflowed or the heat death of the universe, whichever came first.

This is the kind of bug that's obvious in hindsight but insidious during development. There were no compiler errors, no warnings, no crashes with helpful stack traces. Just silence as the process spun forever.

Finding it required adding debug output at each step of the instruction selector, rebuilding, and watching where the output stopped. Claude helped immensely here — recognizing the pattern of "output stops here" and immediately checking what that code path did.

The Calling Convention

We designed a Z80-specific calling convention optimized for the hardware's constraints:

  • First 16-bit argument: HL register pair
  • Second 16-bit argument: DE register pair
  • Return value: HL register pair
  • Additional arguments: Stack
  • Caller-saved: All registers (callee can clobber anything)
  • Callee-saved: None

This convention minimizes register shuffling for simple functions. A function taking two 16-bit values and returning one doesn't need any register setup at all — the arguments arrive exactly where the ADD instruction expects them.

For 8-bit arguments, values arrive in the low byte of HL (L register) or DE (E register). This wastes the high byte but simplifies the calling convention.

This is radically different from typical calling conventions. Modern ABIs specify precise preservation rules, stack alignment requirements, and argument passing in specific registers. On the Z80, with so few registers, we had to make pragmatic choices. Every function saves and restores what it needs; there's no concept of "preserved across calls."

A Working Example

Here's LLVM IR that our backend compiles successfully:

target datalayout = "e-m:e-p:16:8-i16:8-i32:8-i64:8-n8:16"
target triple = "z80-unknown-unknown"

define i16 @add16(i16 %a, i16 %b) {
  %result = add i16 %a, %b
  ret i16 %result
}

define i16 @sub16(i16 %a, i16 %b) {
  %result = sub i16 %a, %b
  ret i16 %result
}

define i8 @add8(i8 %a, i8 %b) {
  %result = add i8 %a, %b
  ret i8 %result
}

Compiled output:

    .text
    .globl  add16
add16:
    add hl,de
    ret

    .globl  sub16
sub16:
    and a           ; clear carry
    sbc hl,de
    ret

    .globl  add8
add8:
    ld  c,l
    ld  b,c
    add a,b
    ret

The 16-bit operations are efficient. The 8-bit addition shows the register shuffling required when values aren't in the accumulator — we have to move values through available registers to get them where the ADD instruction expects.

Compilation time for these three functions: 0.01 seconds. The backend works.

Where We Are Now

The backend compiles simple LLVM IR to working Z80 assembly. Integer arithmetic, control flow, function calls, memory access — the fundamentals work. We've implemented handlers for dozens of generic machine instructions and their various edge cases.

Attempting to compile Rust's core library has been... educational. The core library is massive. It includes:

  • All the formatting infrastructure (Display, Debug, write! macros)
  • Iterator implementations and adaptors
  • Option, Result, and their many combinator methods
  • Slice operations, sorting algorithms
  • Panic handling infrastructure
  • Unicode handling

Each of these generates significant code. The formatting system alone probably exceeds the entire memory capacity of a typical Z80 system.

Current status: compilation of core starts, processes thousands of functions, but eventually hits edge cases we haven't handled yet. The most recent error involves register class assignment in the floating-point decimal formatting code — ironic since the Z80 has no floating-point hardware.

Connecting Rust to the Z80 Backend

Getting Rust to use our LLVM backend required modifying the Rust compiler itself. This involved:

  1. Adding a target specification: Defining z80-unknown-none-elf in Rust's target database with the appropriate data layout, pointer width, and feature flags.

  2. Pointing Rust at our LLVM: Rust can use an external LLVM rather than its bundled version. We configured the build to use our Z80-enabled LLVM.

  3. Disabling C compiler-builtins: Rust's standard library includes some C code from compiler-rt for low-level operations. There's no Z80 C compiler readily available, so we had to disable these and rely on pure Rust implementations.

  4. Setting panic=abort: The Z80 can't reasonably support stack unwinding for panic handling.

The Rust target specification looks like this:

Target {
    arch: Arch::Z80,
    data_layout: "e-m:e-p:16:8-i16:8-i32:8-i64:8-n8:16".into(),
    llvm_target: "z80-unknown-unknown".into(),
    pointer_width: 16,
    options: TargetOptions {
        c_int_width: 16,
        panic_strategy: PanicStrategy::Abort,
        max_atomic_width: Some(0),  // No atomics
        atomic_cas: false,
        singlethread: true,
        no_builtins: true,  // No C runtime
        ..TargetOptions::default()
    },
}

The pointer_width: 16 is crucial — this is a 16-bit architecture. The max_atomic_width: Some(0) tells Rust that atomic operations aren't available at all, since the Z80 has no atomic instructions.

When Rust tries to compile core, it invokes rustc, which invokes LLVM, which invokes our Z80 backend. Each function in core goes through this pipeline. The sheer volume is staggering — core contains thousands of generic functions that get monomorphized for every type they're used with.

The Honest Assessment

Will Rust's standard library ever practically run on a Z80? Almost certainly not. The core library alone, compiled for Z80, would likely exceed a megabyte — far beyond the 64KB address space. Even if you could page-swap the code, the runtime overhead of software floating-point, 32-bit arithmetic emulation, and iterator abstractions would make execution glacially slow.

What might actually work:

  • #![no_std] #![no_core] programs: Bare-metal Rust with a tiny custom runtime, no standard library, hand-optimized for the hardware. A few kilobytes of carefully written Rust that compiles to tight Z80 assembly (see the sketch after this list).

  • Code generation experiments: Using the LLVM backend to study how modern language constructs map to constrained hardware, even if the results aren't practical to run.

  • Educational purposes: Understanding compiler internals by working with hardware simple enough to reason about completely.
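
For the first item above, here is a minimal sketch of the kind of bare-metal program that might realistically target the backend. It uses #![no_std] with panic=abort, matching the target specification described earlier; the entry symbol and the absence of any I/O are simplifications for illustration, not the project's actual runtime (and the #![no_core] variant is more involved than shown here).

// A minimal no_std sketch; the entry symbol and lack of I/O are illustrative only.
#![no_std]
#![no_main]

use core::panic::PanicInfo;

#[panic_handler]
fn panic(_info: &PanicInfo) -> ! {
    loop {} // panic=abort target: nothing to unwind, just stop
}

#[no_mangle]
pub extern "C" fn _start() -> ! {
    // Pure 16-bit arithmetic: the kind of code the backend already handles well.
    let mut total: u16 = 0;
    let mut i: u16 = 1;
    while i <= 10 {
        total = total.wrapping_add(i);
        i += 1;
    }
    // A real program would write the result to the ACIA data port; here we just spin.
    let _ = total;
    loop {}
}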

The value isn't in running production Rust on Z80s. It's in the journey — understanding LLVM's internals, grappling with register allocation on a machine that predates the concept (and me, albeit by only a few years), and seeing how far modern tooling can stretch.

Conclusion

Compiling Rust for the Z80 is somewhere between ambitious and absurd. The hardware constraints are genuinely incompatible with modern language expectations. But the attempt has been valuable — understanding LLVM deeply, exploring what "resource-constrained" really means, and discovering that AI collaboration can work effectively on low-level systems programming.

The Z80 was designed for a world where programmers counted bytes. Rust was designed for a world where programmers trust the compiler to manage complexity. Making them meet is an exercise in translation across decades of computing evolution.

Building Z80 ROMs with Rust: A Modern Approach to Retro Computing

There's something deeply satisfying about watching a nearly 50-year-old CPU execute code you just compiled. The Z80 processor, introduced by Zilog in 1976, powered everything from the TRS-80 to the ZX Spectrum to countless CP/M machines. With roughly 8,500 transistors, it's almost incomprehensibly simple by modern standards; a high-end Intel i9 has around 17 billion. Today, thanks to projects like the RetroShield, you can plug one of these vintage processors into an Arduino and run real 8-bit code.

But here's the thing: actually writing Z80 programs is painful. Traditional approaches involve either hand-assembling hex codes, wrestling with decades-old assemblers that barely run on modern systems, or writing raw bytes into binary files. I wanted something better. What if I could write Z80 programs in Rust, using a fluent API that generates correct machine code without the mental overhead of remembering opcode encodings?

The result is the retroshield-z80-workbench, a Rust crate that powers three substantial retro applications: a dBASE II database clone, a WordStar-compatible text editor, and a VisiCalc-style spreadsheet. The workbench emerged from patterns I discovered while building earlier projects like a C compiler and LISP interpreter. This post explains how it works and what it's enabled.

The Problem with Traditional Z80 Development

I first encountered Z80 assembly in the 1990s, writing programs on a TI-85 graphing calculator. The process was painfully tedious: hand-assemble each instruction to hex using a reference card, type the bytes into the calculator's memory editor, run it, watch it crash, and start over. There was no debugger, no error messages, just a frozen screen or a memory clear if you were unlucky. I spent more time looking up opcodes than thinking about algorithms.

Writing Z80 assembly by hand means memorizing hundreds of opcodes. LD A, B is 0x78. JP NZ, addr is 0xC2 followed by a 16-bit address in little-endian format. Conditional returns, indexed addressing, and the various Z80-specific instructions like LDIR and DJNZ all have their own encodings. One wrong byte and your program jumps into garbage.

Traditional assemblers solve this, but they come with their own problems. Many only run under CP/M or DOS. Modern cross-assemblers exist, but they're another tool to install, another syntax to learn, another build step to manage. And when you're generating code programmatically, like when building a compiler that targets Z80, an external assembler becomes a significant complication.

There are also modern C compilers for the Z80, most notably SDCC (Small Device C Compiler), which is actively maintained and produces decent code. But when your goal is to generate Z80 machine code from Rust, perhaps as the backend of a compiler or code generator, you want something that integrates directly into your Rust toolchain.

What I wanted was the ability to write something like this in Rust:

rom.ld_a(0x42);        // LD A, 0x42
rom.call("print_hex"); // CALL print_hex
rom.ret();             // RET

And have it emit the correct bytes: 0x3E 0x42 0xCD xx xx 0xC9.

The Workbench Architecture

The retroshield-z80-workbench crate is built around three core concepts: emit, label, and fixup.

Emit: The Foundation

At the lowest level, everything is just bytes being appended to a buffer:

pub struct CodeGen {
    rom: Vec<u8>,
    labels: HashMap<String, u16>,
    fixups: Vec<(usize, String)>,
    config: RomConfig,
}

impl CodeGen {
    pub fn emit(&mut self, bytes: &[u8]) {
        self.rom.extend_from_slice(bytes);
    }
}

Every Z80 instruction ultimately calls emit(). The ld_a() method is just:

pub fn ld_a(&mut self, n: u8) {
    self.emit(&[0x3E, n]);  // Opcode 0x3E is LD A, n
}

This pattern scales to cover the entire Z80 instruction set. The crate provides over 80 instruction helpers, from simple register loads to complex block transfer instructions.
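
For a sense of how little machinery each helper needs, here are two more in the same style. The opcodes are standard Z80; the exact method bodies are a sketch rather than the crate's actual source.

// Two more helpers in the same style (a sketch; the crate's real bodies may differ):
pub fn ld_hl(&mut self, nn: u16) {
    // LD HL, nn - opcode 0x21 followed by the 16-bit immediate, little-endian
    self.emit(&[0x21, nn as u8, (nn >> 8) as u8]);
}

pub fn ret(&mut self) {
    self.emit(&[0xC9]); // RET
}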

Labels: Named Positions

Labels mark positions in the code that can be referenced by jumps and calls:

pub fn label(&mut self, name: &str) {
    let addr = self.config.org + self.rom.len() as u16;
    self.labels.insert(name.to_string(), addr);
}

When you write rom.label("main"), the current position gets recorded. Later, when you write rom.jp("main"), the crate knows exactly where to jump.

Fixups: Forward References

The clever part is handling forward references. When you write rom.call("print_string") before print_string is defined, the crate can't know the address yet. Instead, it records a fixup:

pub fn call(&mut self, label: &str) {
    self.emit(&[0xCD]);  // CALL opcode
    self.fixup(label);   // Record that we need to fill in this address
}

pub fn fixup(&mut self, label: &str) {
    self.fixups.push((self.rom.len(), label.to_string()));
    self.emit_word(0x0000);  // Placeholder
}

At the end, resolve_fixups() walks through all recorded fixups and patches in the correct addresses:

pub fn resolve_fixups(&mut self) {
    for (pos, label) in &self.fixups {
        let addr = self.labels.get(label)
            .expect(&format!("Undefined label: {}", label));
        self.rom[*pos] = *addr as u8;
        self.rom[*pos + 1] = (*addr >> 8) as u8;
    }
}

This simple mechanism enables natural code organization where you can reference routines before defining them.
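
In practice, forward references read naturally. Here is a small sketch using the helpers that appear throughout this post; the label names are illustrative.

rom.call("print_banner");   // forward reference: recorded as a fixup
rom.halt();

rom.label("print_banner");  // defined later; resolve_fixups() patches the CALL above
rom.ld_hl_label("banner");
rom.call("print_string");
rom.ret();

rom.label("banner");
rom.emit_string("Ready.\r\n");

rom.resolve_fixups();       // every recorded fixup now points at a known address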

Building Blocks: The Standard Library

Raw instruction emission is powerful but verbose. The workbench includes pre-built routines for common tasks that any Z80 program needs.

Serial I/O

Our modified RetroShield firmware emulates an MC6850 ACIA for serial communication (the official RetroShield uses an Intel 8251). The standard library provides blocking read/write routines:

pub fn emit_getchar(&mut self) {
    self.label("getchar");
    self.in_a(0x80);           // Read status register
    self.and_a(0x01);          // Test RX ready bit
    self.emit(&[0x28, 0xFA]);  // JR Z, -6 (loop until ready)
    self.in_a(0x81);           // Read data register
    self.ret();
}

This generates a nine-byte routine that any program can call with rom.call("getchar"). The character comes back in the A register, exactly as you'd expect from a standard library function.

Similar routines handle putchar, print_string (for null-terminated strings), and newline (CR+LF).
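
For comparison, a transmit routine in the same style might look like the sketch below; the crate's actual putchar may differ. It assumes the character to send arrives in the A register (mirroring getchar's return convention), and it tests the MC6850's transmit-empty flag, which is bit 1 of the status register. The PUSH, POP, and OUT bytes are emitted raw because no stack or output helpers are shown above.

pub fn emit_putchar(&mut self) {
    self.label("putchar");
    self.emit(&[0xF5]);        // PUSH AF (preserve the character while polling)
    self.in_a(0x80);           // read ACIA status register
    self.and_a(0x02);          // test TX-empty bit
    self.emit(&[0x28, 0xFA]);  // JR Z, -6 (loop until the transmitter is ready)
    self.emit(&[0xF1]);        // POP AF (restore the character)
    self.emit(&[0xD3, 0x81]);  // OUT (0x81), A - write data register
    self.ret();
}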

VT100 Terminal Control

Every program I've written needs cursor positioning, screen clearing, and other terminal operations. The standard library includes VT100 escape sequences:

pub fn emit_clear_screen(&mut self) {
    self.label("clear_screen");
    self.ld_hl_label("_cls_seq");
    self.call("print_string");
    self.ret();
}

// Later, in data section:
rom.label("_cls_seq");
rom.emit_string("\x1B[2J\x1B[H");  // ESC[2J ESC[H

The cursor_pos routine is more complex, converting binary row/column values to the ASCII digits that VT100 expects. It's about 50 bytes of Z80 code that no one wants to write more than once.

Math Routines

The Z80 has limited math capabilities, especially for 16-bit operations. The standard library provides:

  • print_byte_dec: Convert and print A register as decimal (000-255)
  • div16: 16-bit division with remainder
  • negate_hl: Two's complement negation

These become critical building blocks for anything involving numbers.
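
A typical call site is tiny; the value here is just an example.

rom.ld_a(42);                // value to print
rom.call("print_byte_dec");  // prints "042" in the three-digit format
rom.call("newline");         // CR+LF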

Pseudo-Assembly as Building Blocks

The real power emerges when you combine these primitives into higher-level constructs. Instead of thinking in individual Z80 instructions, you start thinking in chunks of functionality.

Consider implementing a text editor. You need a routine to insert a character at the cursor position. In pseudo-assembly, this is:

  1. Get the current line pointer
  2. Shift all bytes from cursor to end of buffer right by one
  3. Insert the new character
  4. Update cursor position
  5. Redraw

Each of these steps becomes a Rust method that emits a sequence of Z80 instructions:

fn emit_insert_char(&mut self) {
    self.label("insert_char");

    // Save the character to insert
    self.ld_addr_a(TEMP_A);

    // Get current line pointer
    self.ld_a_addr(CURSOR_ROW);
    self.call("get_line_ptr");  // HL = line start

    // Add cursor column offset
    self.ld_de_addr(CURSOR_COL);
    self.add_hl_de();           // HL = insert position

    // Calculate bytes to shift...
    // (many more instructions)

    // Use LDDR for the actual shift
    self.emit(&[0xED, 0xB8]);   // LDDR

    // Insert the character
    self.ld_a_addr(TEMP_A);
    self.ld_hl_ind_a();

    // Update counters and redraw
    self.call("increment_cursor");
    self.call("draw_current_line");
    self.ret();
}

This method generates about 80 bytes of Z80 machine code. By building up from primitives to routines to complete functions, complex programs become manageable.

Programs Built with the Workbench

The real test of any framework is what you can build with it. Here's what's running on the RetroShield today.

kz80_db: A dBASE II Clone

dBASE II was the database that launched a thousand businesses in the early 1980s. Before SQL became dominant, dBASE gave microcomputer users their first taste of structured data management. My clone implements the authentic 1981 file format: 8-byte headers, 16-byte field descriptors, fixed-length records with delete flags.

The file format is documented in the code itself:

DBF Header (8 bytes):
  Byte 0:    Version (0x02 for dBASE II)
  Bytes 1-2: Number of records (16-bit little-endian)
  Bytes 3-4: Month, Day of last update
  Bytes 5-6: Year of last update
  Byte 7:    Record length (including delete flag)

Field Descriptors (16 bytes each, terminated by 0x0D):
  Bytes 0-10:  Field name (11 bytes, null-padded)
  Byte 11:     Field type (C=Character, N=Numeric, L=Logical)
  Byte 12:     Field length
  Byte 13:     Decimal places (for N type)
  Bytes 14-15: Reserved

The implementation includes:

  • CREATE to define new database structures with up to 16 fields
  • USE to open existing .DBF files from the SD card
  • APPEND to add records interactively
  • LIST to display all records in columnar format
  • EDIT to modify existing records with field-by-field prompts
  • DELETE and PACK for soft-delete and physical removal
  • GO TOP/BOTTOM and GO n for record navigation
  • DISPLAY STRUCTURE to show field definitions

The generated ROM is about 4KB, fitting comfortably in the RetroShield's 8KB ROM space. It reads and writes real .DBF files that you can open in modern database tools like LibreOffice Calc or even current versions of dBASE.

Building this required implementing a command parser that handles the dot-prompt interface, string comparison routines for command matching, file I/O through the SD card interface with seek operations, and the full dBASE command set. Each command is a Rust method that emits the appropriate Z80 code:

fn emit_list_command(&mut self) {
    self.label("cmd_list");

    // Check if database is open
    self.ld_a_addr(DB_OPEN);
    self.or_a_a();
    self.jp_z("no_db_open");

    // Print column headers from field descriptors
    self.call("print_headers");

    // Loop through all records
    self.ld_hl(1);
    self.ld_addr_hl(CURRENT_REC);

    self.label("list_loop");
    self.call("read_record");
    self.call("print_record");

    // Increment and check against record count
    self.ld_hl_addr(CURRENT_REC);
    self.inc_hl();
    self.ld_addr_hl(CURRENT_REC);
    // ... 150+ more lines
}

The SD card interface deserves special mention. The RetroShield includes an SD card reader accessible through I/O ports. Commands like open, read, write, seek, and close are sent through a command register, with data transferred byte-by-byte through a data register. The workbench makes this tolerable by wrapping the low-level port operations in reusable routines.
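
As a flavor of what such a wrapper looks like, here is a hypothetical routine in the workbench style. The port assignments within the 0x10-0x15 range and the command code are invented for illustration; they are not the firmware's actual protocol.

// Hypothetical SD wrapper; port numbers and command byte are placeholders only.
pub fn emit_sd_read_byte(&mut self) {
    self.label("sd_read_byte");
    self.ld_a(0x03);           // placeholder "read byte" command code
    self.emit(&[0xD3, 0x10]);  // OUT (0x10), A - command register (illustrative port)
    self.in_a(0x11);           // IN A, (0x11) - data register (illustrative port)
    self.ret();
}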

kz80_ws: A WordStar Clone

WordStar defined text editing for a generation of writers. George R.R. Martin famously still uses it. The diamond cursor movement (^E ^S ^D ^X arranged like arrow keys on the keyboard), the block operations (^KB ^KK ^KC), the search functions, the word wrap, the careful attention to 80-column displays: all of this became muscle memory for millions of users.

The clone implements:

  • Full cursor movement with ^E/^S/^D/^X and ^A/^F for word movement
  • Insert and overwrite modes with ^V toggle
  • Block operations: mark begin (^KB), mark end (^KK), copy (^KC), delete (^KY)
  • File operations: save (^KS), save and exit (^KD), quit without saving (^KQ)
  • Search (^QF), word wrap at configurable right margins
  • Line operations: delete line (^Y), insert line break (^N)
  • Quick movement: top of file (^QR), end of file (^QC), line start/end (^QS/^QD)
  • VT100 terminal output with proper status line showing line/column/mode

The memory layout is carefully designed for the 8KB RAM constraint:

RAM (8KB):
  0x2000-0x201F  State variables (cursor, view, margins)
  0x2100-0x21FF  Input buffer
  0x2200-0x22FF  Filename buffer
  0x2800-0x3BFF  Text buffer (5KB)
  0x3C00-0x3DFF  Line index table
  0x3E00-0x3FFF  Stack

The word wrap implementation is particularly satisfying. When the cursor passes the right margin (default column 65), the editor scans backward to find the last space, then uses the Z80's LDDR instruction to shift the buffer and insert a CR/LF pair. The cursor repositions on the new line at exactly the right column to continue typing the wrapped word. All of this happens fast enough that the user just sees smooth text flow.

The screen update strategy matters on a 4MHz processor. Rather than redrawing the entire screen on each keystroke, the editor tracks what changed and only redraws the affected line. The VT100 "clear to end of line" escape sequence handles trailing garbage. This keeps the interface responsive despite the hardware limitations.

kz80_calc: A VisiCalc-Style Spreadsheet

VisiCalc was the "killer app" that made personal computers business tools. Dan Bricklin and Bob Frankston's 1979 creation turned the Apple II from a hobbyist toy into something accountants would buy. My version brings that experience to the Z80:

  • 1024 cells (16 columns A-P by 64 rows) in 6KB of RAM
  • 8-digit packed BCD arithmetic for accurate decimal math
  • Formula support with cell references (A1+B2*C3)
  • Operator precedence (* and / before + and -)
  • Range functions: @SUM, @AVG, @MIN, @MAX, @COUNT
  • Automatic recalculation when cells change
  • Arrow key navigation and GOTO command for jumping to cells
  • Cell types: numbers, labels, formulas, and repeating characters

The cell storage format uses 6 bytes per cell:

Cell format (6 bytes):
  byte 0:    type (0=empty, 1=number, 2=formula, 3=error, 4=repeat, 5=label)
  byte 1:    sign (0x00=positive, 0x80=negative)
  bytes 2-5: 8-digit packed BCD (d7d6 d5d4 d3d2 d1d0)

The BCD math was the hardest part. Binary floating-point would give wrong answers for financial calculations (the classic 0.1 + 0.2 != 0.3 problem). Packed BCD stores two decimal digits per byte, and the Z80's DAA (Decimal Adjust Accumulator) instruction handles single-byte addition correctly. But building 32-bit multiplication and division from 8-bit DAA takes hundreds of carefully sequenced instructions.
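
To make the DAA pattern concrete, here is a hedged sketch of a multi-byte BCD addition emitter in the workbench style. It assumes HL and DE point at two 4-byte operands stored least-significant byte first, which may not match kz80_calc's actual cell layout, and it handles magnitudes only (the sign byte is stored separately).

fn emit_bcd_add32(&mut self) {
    self.label("bcd_add32");
    // HL -> destination operand, DE -> source operand, 4 packed-BCD bytes each.
    self.emit(&[0xAF]);        // XOR A (clears the carry flag)
    self.emit(&[0x06, 0x04]);  // LD B, 4 (byte counter)
    // loop:
    self.emit(&[0x1A]);        // LD A, (DE)
    self.emit(&[0x8E]);        // ADC A, (HL) - add with carry from the previous byte
    self.emit(&[0x27]);        // DAA - adjust the binary sum back to packed BCD
    self.emit(&[0x77]);        // LD (HL), A - store the result byte in place
    self.inc_hl();             // INC HL (does not disturb the carry flag)
    self.emit(&[0x13]);        // INC DE
    self.emit(&[0x10, 0xF8]);  // DJNZ loop (-8; DJNZ leaves the flags alone)
    self.ret();
}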

The formula parser handles expressions like =A1+B2*C3-@SUM(D1:D10). This required implementing recursive descent parsing in Z80 machine code, which the workbench made tractable by letting me focus on the algorithm rather than opcode encodings. The parser breaks formulas into tokens, builds a simple AST in memory, and evaluates it with proper operator precedence.

Beyond the Workbench

The workbench proved its value for these three substantial applications. But I've also built other Z80 projects that predate the workbench or use their own code generation approaches:

  • kz80_c: A C compiler with its own emit infrastructure, developed before the workbench was extracted as a reusable crate
  • kz80_lisp: A LISP interpreter with mark-and-sweep garbage collection
  • kz80_prolog: Logic programming with unification and backtracking
  • kz80_ml: An ML compiler with Hindley-Milner type inference
  • kz80_fortran: FORTRAN77 subset for scientific computing nostalgia
  • kz80_lua, kz80_smalltalk, kz80_chip8: Various interpreters and emulators

The experience building these earlier projects is what led to extracting the common patterns into the workbench. The emit/label/fixup pattern appeared independently in several codebases before I recognized it as a reusable abstraction.

Looking back at kz80_c, for instance, I can see the proto-workbench emerging. There's a CodeGen struct with an emit() method, a labels hashmap, and fixup resolution. The same pattern appears in kz80_lisp. Eventually it became clear that this infrastructure should be its own crate, tested once and reused everywhere.

The workbench also benefited from hindsight. Early projects had ad-hoc solutions for things like unique label generation (essential for compiling nested control structures) and relative jump calculation. The workbench handles these correctly from the start, saving debugging time on every subsequent project.

The Hardware: RetroShield Z80

For those unfamiliar with the RetroShield project, it's worth a brief explanation. The RetroShield is an Arduino shield designed by 8BitForce that lets you run real vintage CPUs. You plug an actual Z80 (or 6502, or 6809, or 8085) into a socket on the shield. The Arduino provides clock, reset, and memory by intercepting the CPU's bus signals.

The Z80 variant gives you:

  • ROM at 0x0000 (size depends on your binary)
  • 6KB RAM at 0x2000-0x37FF
  • MC6850 ACIA for serial I/O at ports 0x80-0x81

The original RetroShield Z80 emulated the Intel 8251 USART for serial communication. In 2023, with help from RetroShield creator Erturk Kocalar, I added MC6850 ACIA emulation to run John Hardy's Forth interpreter. The MC6850 is what most CP/M software expects, making it the better choice for running vintage software. The Arduino sketch with MC6850 emulation is available in my RetroShield firmware collection on GitLab.

I added an SD card interface at ports 0x10-0x15, which isn't part of the standard RetroShield but integrates cleanly with the Arduino firmware. This gives the dBASE and WordStar clones persistent file storage.

This constrained environment is actually liberating. You can't reach for a 100MB framework or spawn threads. Every byte matters. The programs you write are complete, self-contained, and comprehensible. The entire WordStar clone is about 4KB of machine code. You can read a hex dump of the ROM and, with patience, trace exactly what every byte does.

The RetroShield connects to an Arduino Mega via two rows of 18 pins, or alternatively to a Teensy 4.1 using a special carrier board. Either way, you interact with your Z80 programs through a terminal emulator over USB serial. The VT100 and VT220 escape sequences that the workbench's terminal routines emit work perfectly in modern terminals like iTerm2 or the venerable screen command, connecting 1970s display protocols to 2020s software.

Why Rust?

Rust brings several advantages to this domain:

Type Safety: The compiler catches mistakes like passing a label where an address is expected, or using the wrong register size. This matters when generating machine code where a single wrong byte corrupts everything.

Zero Runtime: The generated ROMs contain only Z80 code, no runtime, no garbage collector. Rust's abstractions compile away completely.

Excellent Tooling: Cargo handles dependencies, testing, and publishing. The workbench is on crates.io; adding it to a project is one line in Cargo.toml.

Performance: Code generation is fast. Even the complex projects compile in under a second.

Expressiveness: Rust's type system lets me encode Z80 concepts cleanly. A label is a String, an address is a u16, and the compiler keeps them straight.

Lessons Learned

Building the workbench and using it for real projects taught me several things:

Start with the primitives right: The emit/label/fixup core hasn't changed since the first version. Getting the foundation solid paid dividends.

Standard library matters: Having I/O and terminal routines ready to call eliminated boilerplate from every project. I probably use call("print_string") a hundred times across all the projects.

Let the host do the work: Complex string manipulation, parsing, and data structure management happen in Rust on the host computer. The Z80 code just handles the runtime behavior. This split makes everything easier.

Readability over brevity: A Z80 program written in the workbench is longer than the equivalent hand-assembled hex, but it's readable and maintainable. When I need to fix a bug in the WordStar word wrap routine, I can read the Rust code and understand it.

Getting Started

The workbench is available on crates.io:

[dependencies]
retroshield-z80-workbench = "0.1"

A minimal program:

use retroshield_z80_workbench::prelude::*;

fn main() {
    let mut rom = CodeGen::new();

    rom.emit_startup(0x3FFF);
    rom.call("clear_screen");
    rom.ld_hl_label("msg");
    rom.call("print_string");
    rom.halt();

    rom.label("msg");
    rom.emit_string("Hello from Z80!\r\n");

    rom.include_stdlib();
    rom.resolve_fixups();
    rom.write_bin("hello.bin").unwrap();
}

Load hello.bin onto a RetroShield (or run it in a Z80 emulator), and you'll see the greeting on your terminal.

Conclusion

The Z80 is nearly 50 years old, but it's still fun to program. The retroshield-z80-workbench brings modern development practices to vintage hardware: type-safe code generation, proper dependency management, fast iteration, and readable source.

Whether you want to build a clone of classic software, implement your own programming language for 8-bit hardware, or just understand how computers work at the machine code level, having the right tools makes all the difference. And there's still nothing quite like watching your code run on a chip that predates most programmers alive today.

The code for the workbench and all the kz80_* projects is available on GitHub under BSD-3-Clause licenses. PRs welcome.

Building Language Compilers for the Z80: An Anthology of Retrocomputing Languages

Over the past year, I have been building a collection of programming language compilers and interpreters targeting the venerable Zilog Z80 microprocessor. What started as an experiment in retrocomputing has grown into a comprehensive suite of tools spanning multiple programming paradigms: from the functional elegance of LISP to the object-oriented messaging of Smalltalk, from the structured programming of Pascal and Fortran to the low-level control of C. This anthology documents the common architectural patterns, the unique challenges of targeting an 8-bit processor, and the unexpected joys of bringing modern language implementations to 1970s hardware.

My fascination with the Z80 began in the mid-1990s when I got my first TI-85 graphing calculator. That unassuming device, marketed for algebra and calculus homework, contained a Z80 running at 6 MHz with 28KB of RAM. Discovering that I could write programs in Z80 assembly and run them on this pocket computer was revelatory. I accumulated a small library of Z80 assembly books and spent countless hours learning the instruction set, writing simple games, and understanding how software meets hardware at the most fundamental level. Three decades later, this project represents a return to that formative obsession, now armed with modern tools and a deeper understanding of language implementation.

The RetroShield Platform

The RetroShield is a family of hardware adapters that bridge vintage microprocessors to modern Arduino development boards. The product line covers a remarkable range of classic CPUs: the MOS 6502 (powering the Apple II and Commodore 64), the Motorola 6809 (used in the TRS-80 Color Computer), the Intel 8085, the SC/MP, and the Zilog Z80. Each variant allows the original processor to execute real machine code while the Arduino emulates memory, peripherals, and I/O.

For this project, I focused exclusively on the RetroShield Z80. The Z80's rich instruction set, hardware BCD support via the DAA instruction, and historical significance as the CPU behind CP/M made it an ideal target for language implementation experiments. The RetroShield Z80 connects the actual Z80 chip to an Arduino Mega (or Teensy adapter for projects requiring more RAM), which emulates the memory and peripheral chips. This arrangement provides the authenticity of running on actual Z80 silicon while offering the convenience of modern development workflows.

The standard memory map provides 8KB of ROM at addresses 0x0000-0x1FFF and 6KB of RAM at 0x2000-0x37FF, though the Teensy adapter expands this significantly to 256KB. Serial I/O is handled through an emulated MC6850 ACIA chip at ports 0x80 and 0x81, providing the familiar RS-232 interface that connects these vintage programs to modern terminals.
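
Expressed as code, that layout looks roughly like the constants below. The split of the two ACIA ports (status/control versus data) is an assumption in this sketch; the addresses themselves are the ones described above.

// Memory map and I/O ports of the RetroShield Z80 environment, as constants
// a compiler or emulator might share. Which MC6850 register sits at which
// port is assumed here for illustration.
const ROM_START: u16 = 0x0000;
const ROM_END: u16   = 0x1FFF; // 8KB of ROM
const RAM_START: u16 = 0x2000;
const RAM_END: u16   = 0x37FF; // 6KB of RAM (expanded greatly by the Teensy adapter)
const ACIA_STATUS_PORT: u8 = 0x80; // MC6850 control/status (assumed)
const ACIA_DATA_PORT: u8   = 0x81; // MC6850 data register (assumed)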

One practical note: if you have a Z80 RetroShield and want to run the binaries produced by these compilers on actual hardware, you will need two things: 1) bin2c, a program that converts a Z80 binary into a PROGMEM statement you can paste into an Arduino sketch, and 2) the example sketch, which includes the code for emulating the MC6850 ACIA.
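
For the curious, here is a minimal sketch of what a bin2c-style conversion does: read the binary and print a C array wrapped in PROGMEM. The array name and formatting are illustrative, not the actual tool's output.

use std::{env, fs};

// Minimal sketch of a bin2c-style converter: reads a Z80 binary and prints
// a PROGMEM byte array suitable for pasting into an Arduino sketch.
fn main() -> std::io::Result<()> {
    let path = env::args().nth(1).expect("usage: bin2c <file.bin>");
    let bytes = fs::read(&path)?;

    println!("const unsigned char rom_bin[] PROGMEM = {{");
    for chunk in bytes.chunks(12) {
        let row: Vec<String> = chunk.iter().map(|b| format!("0x{:02X}", b)).collect();
        println!("  {},", row.join(", "));
    }
    println!("}};");
    println!("const unsigned int rom_bin_len = {};", bytes.len());
    Ok(())
}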

Common Compiler Architecture: Lexer, Parser, AST, Codegen

Every compiler in this collection follows a similar multi-stage architecture, a pattern that has proven itself across decades of compiler construction. Understanding this common structure reveals how the same fundamental approach can target vastly different source languages while producing efficient Z80 machine code.

The Lexer: Breaking Text into Tokens

The lexer (or tokenizer) is the first stage of compilation, responsible for transforming raw source code into a stream of tokens. Each language has its own lexical grammar: LISP recognizes parentheses and symbols, C identifies keywords and operators, Smalltalk distinguishes between message selectors and literals. Despite these differences, every lexer performs the same fundamental task of categorizing input characters into meaningful units.

In our Rust implementations, the lexer typically maintains a position in the source string and provides a next_token() method that advances through the input. This produces tokens like Token::Integer(42), Token::Plus, or Token::Identifier("factorial"). The lexer handles the tedious work of skipping whitespace, recognizing multi-character operators, and converting digit sequences into numbers.
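
As a concrete illustration, here is a minimal lexer skeleton in that style. The token names and structure are simplified and are not the exact types used in the kz80_* crates.

// Illustrative token type and lexer skeleton.
#[derive(Debug, PartialEq)]
enum Token {
    Integer(i64),
    Identifier(String),
    Plus,
    Star,
    Eof,
}

struct Lexer<'a> {
    src: &'a [u8],
    pos: usize,
}

impl<'a> Lexer<'a> {
    fn new(src: &'a str) -> Self {
        Lexer { src: src.as_bytes(), pos: 0 }
    }

    // Return the next token, skipping whitespace. Unknown characters are
    // silently skipped in this sketch; a real lexer would report an error.
    fn next_token(&mut self) -> Token {
        while self.pos < self.src.len() {
            let c = self.src[self.pos];
            self.pos += 1;
            match c {
                b' ' | b'\t' | b'\r' | b'\n' => continue,
                b'+' => return Token::Plus,
                b'*' => return Token::Star,
                b'0'..=b'9' => {
                    let mut value = (c - b'0') as i64;
                    while self.pos < self.src.len() && self.src[self.pos].is_ascii_digit() {
                        value = value * 10 + (self.src[self.pos] - b'0') as i64;
                        self.pos += 1;
                    }
                    return Token::Integer(value);
                }
                c if c.is_ascii_alphabetic() || c == b'_' => {
                    let start = self.pos - 1;
                    while self.pos < self.src.len()
                        && (self.src[self.pos].is_ascii_alphanumeric() || self.src[self.pos] == b'_')
                    {
                        self.pos += 1;
                    }
                    let text = std::str::from_utf8(&self.src[start..self.pos]).unwrap();
                    return Token::Identifier(text.to_string());
                }
                _ => continue,
            }
        }
        Token::Eof
    }
}

fn main() {
    let mut lexer = Lexer::new("3 + 4 * 5");
    loop {
        let tok = lexer.next_token();
        println!("{:?}", tok);
        if tok == Token::Eof { break; }
    }
}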

The Parser: Building the Abstract Syntax Tree

The parser consumes the token stream and constructs an Abstract Syntax Tree (AST) that represents the hierarchical structure of the program. Most of our compilers use recursive descent parsing, a technique where each grammar rule becomes a function that may call other rule functions. This approach is intuitive, produces readable code, and handles the grammars of most programming languages effectively.

For example, parsing an arithmetic expression like 3 + 4 * 5 requires understanding operator precedence. The parser might have functions like parse_expression(), parse_term(), and parse_factor(), each handling operators at different precedence levels. The result is an AST where the multiplication is grouped as a subtree, correctly representing that it should be evaluated before the addition.
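
A simplified sketch of that structure is shown below, with its own minimal token and AST types (illustrative, not the crate code). Parsing 3 + 4 * 5 with these functions yields Add(3, Mul(4, 5)), so the multiplication subtree is evaluated first.

// Illustrative recursive-descent parser for + and * with correct precedence.
#[derive(Debug, PartialEq)]
enum Token { Integer(i64), Plus, Star, Eof }

#[derive(Debug)]
enum Expr {
    Number(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

struct Parser {
    tokens: Vec<Token>,
    pos: usize,
}

impl Parser {
    fn peek(&self) -> &Token {
        self.tokens.get(self.pos).unwrap_or(&Token::Eof)
    }

    // expression := term ('+' term)*
    fn parse_expression(&mut self) -> Expr {
        let mut left = self.parse_term();
        while self.peek() == &Token::Plus {
            self.pos += 1;
            let right = self.parse_term();
            left = Expr::Add(Box::new(left), Box::new(right));
        }
        left
    }

    // term := factor ('*' factor)*
    fn parse_term(&mut self) -> Expr {
        let mut left = self.parse_factor();
        while self.peek() == &Token::Star {
            self.pos += 1;
            let right = self.parse_factor();
            left = Expr::Mul(Box::new(left), Box::new(right));
        }
        left
    }

    // factor := integer
    fn parse_factor(&mut self) -> Expr {
        match self.tokens.get(self.pos) {
            Some(Token::Integer(n)) => {
                self.pos += 1;
                Expr::Number(*n)
            }
            other => panic!("expected a number, found {:?}", other),
        }
    }
}

fn main() {
    // 3 + 4 * 5
    let tokens = vec![
        Token::Integer(3), Token::Plus,
        Token::Integer(4), Token::Star, Token::Integer(5),
    ];
    let mut p = Parser { tokens, pos: 0 };
    // Prints Add(Number(3), Mul(Number(4), Number(5))).
    println!("{:?}", p.parse_expression());
}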

Code Generation: Emitting Z80 Machine Code

The code generator walks the AST and emits Z80 machine code. This is where the rubber meets the road: abstract operations like "add two numbers" become concrete sequences of Z80 instructions like LD A,(HL), ADD A,E, and LD (DE),A.

Most of our compilers generate code directly into a byte buffer, manually encoding each instruction's opcode and operands. This approach, while requiring intimate knowledge of the Z80 instruction set, gives us precise control over the generated code and avoids the complexity of an intermediate representation or separate assembler pass.
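
In miniature, the pattern looks like this: each helper appends the hand-encoded bytes for one Z80 instruction to a growing buffer. The method names are illustrative; the opcodes are the standard Z80 encodings.

// Illustrative code buffer with hand-encoded Z80 opcodes.
struct CodeBuffer {
    bytes: Vec<u8>,
}

impl CodeBuffer {
    fn new() -> Self {
        CodeBuffer { bytes: Vec::new() }
    }

    // LD A, n  (opcode 0x3E nn)
    fn ld_a_imm(&mut self, n: u8) {
        self.bytes.extend_from_slice(&[0x3E, n]);
    }

    // ADD A, n  (opcode 0xC6 nn)
    fn add_a_imm(&mut self, n: u8) {
        self.bytes.extend_from_slice(&[0xC6, n]);
    }

    // RET  (opcode 0xC9)
    fn ret(&mut self) {
        self.bytes.push(0xC9);
    }
}

// Compiling the expression 1 + 2 might emit: LD A,1 / ADD A,2 / RET.
fn compile_add_one_two() -> Vec<u8> {
    let mut buf = CodeBuffer::new();
    buf.ld_a_imm(1);
    buf.add_a_imm(2);
    buf.ret();
    buf.bytes
}

fn main() {
    // Prints [3E, 01, C6, 02, C9].
    println!("{:02X?}", compile_add_one_two());
}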

The DAA Instruction and BCD Arithmetic

One of the most fascinating aspects of Z80 programming is the DAA (Decimal Adjust Accumulator) instruction, opcode 0x27. This single instruction makes the Z80 surprisingly capable at decimal arithmetic, which proves essential for implementing numeric types on an 8-bit processor.

What is BCD?

Binary Coded Decimal (BCD) is a numeric representation where each decimal digit is stored in 4 bits (a nibble). Rather than storing the number 42 as binary 00101010 (its true binary representation), BCD stores it as 0100 0010, with the first nibble representing 4 and the second representing 2. This "packed BCD" format stores two decimal digits per byte.
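
A small Rust sketch of the packing step follows. The byte order (least significant digit pair first) is an assumption here; the kz80 runtimes may store the bytes in the opposite order.

// Pack a decimal number into 4 bytes of packed BCD, two digits per byte.
fn to_packed_bcd(mut n: u32) -> [u8; 4] {
    assert!(n <= 99_999_999, "8 BCD digits max");
    let mut out = [0u8; 4];
    for byte in out.iter_mut() {
        let low = (n % 10) as u8;   // low nibble: ones digit of this pair
        n /= 10;
        let high = (n % 10) as u8;  // high nibble: tens digit of this pair
        n /= 10;
        *byte = (high << 4) | low;
    }
    out
}

fn main() {
    // 42 packs into 0x42: high nibble 4, low nibble 2.
    assert_eq!(to_packed_bcd(42), [0x42, 0x00, 0x00, 0x00]);
    // 1234 packs as 0x34, 0x12.
    assert_eq!(to_packed_bcd(1234), [0x34, 0x12, 0x00, 0x00]);
}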

While BCD is less space-efficient than pure binary (you can only represent 0-99 in a byte rather than 0-255), it has a crucial advantage: decimal arithmetic produces exact decimal results without rounding errors. This is why BCD was the standard for financial calculations on mainframes and why pocket calculators (including the famous TI series) used BCD internally.

How DAA Works

When you perform binary addition on two BCD digits, the result may not be valid BCD. Adding 0x09 and 0x01 gives 0x0A, but 0x0A is not a valid BCD digit. The DAA instruction corrects this: it examines the result and the half-carry flag (which indicates a carry from bit 3 to bit 4, i.e., from the low nibble to the high nibble) and adds 0x06 to any nibble that exceeds 9. After DAA, that 0x0A becomes 0x10, correctly representing decimal 10 in BCD.

This process works for both addition (after ADD or ADC instructions) and subtraction (after SUB or SBC instructions, where DAA subtracts 0x06 instead of adding it). The Z80 remembers whether the previous operation was addition or subtraction through its N flag.
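
To make the correction concrete, here is a Rust model of single-byte packed-BCD addition: add in binary, then apply the same +0x06/+0x60 fixups that DAA performs. It models only the addition case and ignores the real flag-register subtleties (N, half-carry after subtraction).

// Model of one-byte packed-BCD addition with a DAA-style correction.
fn bcd_add_byte(a: u8, b: u8, carry_in: bool) -> (u8, bool) {
    let mut sum = a as u16 + b as u16 + carry_in as u16;
    // Low nibble produced a non-decimal digit or a half-carry: add 0x06.
    if (a & 0x0F) + (b & 0x0F) + carry_in as u8 > 9 {
        sum += 0x06;
    }
    // High nibble exceeds 9 after the low-nibble fixup: add 0x60.
    if sum > 0x99 {
        sum += 0x60;
    }
    ((sum & 0xFF) as u8, sum > 0xFF)
}

fn main() {
    // 0x09 + 0x01 = 0x0A in binary; the correction turns it into BCD 0x10.
    assert_eq!(bcd_add_byte(0x09, 0x01, false), (0x10, false));
    // BCD 58 + 47 = 105: result byte 0x05 with a carry out.
    assert_eq!(bcd_add_byte(0x58, 0x47, false), (0x05, true));
}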

BCD in Our Compilers

Several of our compilers use 4-byte packed BCD integers, supporting numbers up to 99,999,999 (8 decimal digits). The addition routine loads bytes from both operands starting from the least significant byte, adds them with ADC (add with carry) to propagate carries between bytes, applies DAA to correct each byte, and stores the result. The entire operation takes perhaps 20 bytes of code but provides exact decimal arithmetic on an 8-bit processor.

Here is a simplified version of our BCD addition loop:

bcd_add:
    LD   B, 4            ; 4 bytes to process
    OR   A               ; Clear carry flag
bcd_add_loop:
    LD   A, (DE)         ; Load byte from first operand
    ADC  A, (HL)         ; Add byte from second operand with carry
    DAA                  ; Decimal adjust
    LD   (DE), A         ; Store result
    DEC  HL              ; Move to next byte
    DEC  DE
    DJNZ bcd_add_loop
    RET

This pattern appears in kz80_c, kz80_fortran, kz80_smalltalk, and kz80_lisp, demonstrating how a hardware feature designed in 1976 still provides practical benefits for language implementation.

The Evolution: From Assembly to C to Rust

The journey of implementing these compilers taught us valuable lessons about choosing the right tool for the job, and our approach evolved significantly over time.

First Attempt: Pascal in Z80 Assembly

Our first language implementation was kz80_pascal, a Pascal interpreter written entirely in Z80 assembly language. This approach seemed natural: if you are targeting the Z80, why not write directly in its native language?

The reality proved challenging. Z80 assembly, while powerful, is unforgiving. Building a recursive descent parser in assembly requires manually managing the call stack, carefully preserving registers across function calls, and debugging through hex dumps of memory. The resulting interpreter works and provides an interactive REPL for Pascal expressions, but extending it requires significant effort. Every new feature means more assembly, more potential for subtle bugs, and more time spent on implementation details rather than language design.

Second Attempt: Fortran 77 in C with SDCC

For kz80_fortran, we tried a different approach: writing the interpreter in C and cross-compiling with SDCC (Small Device C Compiler). This was dramatically more productive. C provided structured control flow, automatic stack management, and the ability to organize code into manageable modules.

The result is a comprehensive Fortran 77 subset with floating-point arithmetic (via BCD), subroutines and functions, arrays, and block IF statements. The C source compiles to approximately 19KB of Z80 code, fitting comfortably in ROM with room for program storage in RAM.

However, this approach has limitations. SDCC produces functional but not always optimal code, and debugging requires understanding both the C source and the generated assembly. The interpreter also requires the Teensy adapter with 256KB of RAM; the 6KB of RAM in the standard Arduino Mega memory map is insufficient for the runtime data structures.

The Rust Workbench: Our Final Form

Our breakthrough came with the realization that we did not need the compiler itself to run on the Z80, only the generated code. This insight led to what we call the "Rust workbench" approach: write the compiler in Rust, running on a modern development machine, and have it emit Z80 binary images.

This architecture provides enormous advantages:

Modern tooling: Cargo manages dependencies and builds, rustc catches bugs at compile time, and we have access to the entire Rust ecosystem for testing and development.

Fast iteration: Compiling a Rust program takes seconds; testing the generated Z80 code in our emulator takes milliseconds. Compare this to the multi-minute flash cycles required when the compiler runs on the target.

Comprehensive testing: Each compiler includes both Rust unit tests (testing the lexer, parser, and code generator individually) and integration tests that compile source programs and verify their output in the emulator.

Zero-dependency output: Despite being written in Rust, the generated Z80 binaries have no runtime dependencies. They are pure machine code that runs directly on the hardware.

This approach now powers kz80_lisp, kz80_c, kz80_lua, kz80_smalltalk, kz80_chip8, and retrolang. Each is a standalone Rust binary that reads source code and produces a 32KB ROM image.

The Z80 Emulator

None of this would be practical without a way to test generated code quickly. Our RetroShield Z80 Emulator provides exactly this: a cycle-accurate Z80 emulation with the same memory map and I/O ports as the real hardware.

The emulator comes in two versions: a simple passthrough mode (retroshield) that connects stdin/stdout directly to the emulated serial port, and a full TUI debugger (retroshield_nc) with register displays, disassembly views, memory inspection, and single-step execution. The passthrough mode enables scripted testing, piping test inputs through the emulator and comparing outputs against expected results. The TUI debugger proves invaluable when tracking down code generation bugs.
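
A scripted test can be as simple as spawning the passthrough emulator, feeding it input, and checking the transcript. The sketch below assumes a retroshield binary on the PATH and a ROM that finishes after processing its input; the helper and assertions are illustrative, not part of the emulator's API.

use std::io::Write;
use std::process::{Command, Stdio};

// Run a ROM in the passthrough emulator, pipe `input` to the emulated serial
// port, and return everything the program printed.
fn run_rom_with_input(rom: &str, input: &str) -> String {
    let mut child = Command::new("retroshield")
        .arg(rom)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()
        .expect("failed to start the retroshield emulator");

    child
        .stdin
        .as_mut()
        .expect("no stdin handle")
        .write_all(input.as_bytes())
        .expect("failed to write test input");

    let output = child.wait_with_output().expect("emulator did not exit");
    String::from_utf8_lossy(&output.stdout).into_owned()
}

fn main() {
    let transcript = run_rom_with_input("lisp.bin", "(+ 21 21)\r\n");
    assert!(transcript.contains("42"));
}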

The emulator uses the superzazu/z80 library for CPU emulation, which provides accurate flag behavior and correct cycle counts. Combined with our MC6850 ACIA emulation, it provides a faithful recreation of the RetroShield environment without requiring physical hardware.

Self-Hosting Compilers: LISP and C

Two of our compilers achieve something remarkable: they can compile themselves and run on the target hardware. This property, called "self-hosting," is a significant milestone in compiler development.

What Does Self-Hosting Mean?

A self-hosting compiler is one written in the language it compiles. The classic example is the C compiler: most C compilers are themselves written in C. But this creates a chicken-and-egg problem: how do you compile a C compiler if you need a C compiler to compile it?

The solution is bootstrapping. You start with a minimal compiler written in some other language (or in machine code), use it to compile a slightly better compiler written in the target language, and iterate until you have a full-featured compiler that can compile its own source code. Once bootstrapped, the compiler becomes self-sustaining: future versions compile themselves.

kz80_lisp: A Self-Hosted LISP Compiler

kz80_lisp (crates.io) includes a LISP-to-Z80 compiler written in LISP itself. The compiler.lisp file defines functions that traverse LISP expressions and emit Z80 machine code bytes directly into memory. When you call (COMPILE '(+ 1 2)), it generates the actual Z80 instructions to load 1 and 2 and add them.

The self-hosted compiler supports arithmetic expressions, nested function calls, and can generate code that interfaces with the runtime's I/O primitives. While not a full replacement for the Rust-based code generator, it demonstrates that LISP is expressive enough to describe its own compilation to machine code.

kz80_c: A Self-Hosted C Compiler

kz80_c (crates.io) goes further: its self/cc.c file is a complete C compiler written in the C subset it compiles. This compiler reads C source from stdin and outputs Z80 binary to stdout, making it usable in shell pipelines:

# printf 'int main() { puts("Hello!"); return 0; }\x00' | \
    retroshield self/cc.bin > hello.bin
# retroshield hello.bin
Hello!

The self-hosted C compiler supports all arithmetic operators, pointers, arrays, global variables, control flow statements, and recursive functions. Its main limitation is memory: the compiler source is approximately 66KB, exceeding the 8KB input buffer available on the Z80. This is a fundamental hardware constraint, not a compiler bug. In theory, a "stage 0" minimal compiler could bootstrap larger compilers.

Why Self-Hosting Matters

Self-hosting is more than a technical achievement; it validates the language implementation. If the compiler can compile itself correctly, it demonstrates that the language is expressive enough for real programs and that the code generator produces working code under complex conditions. For our Z80 compilers, self-hosting also connects us to the history of computing: Ron Cain's original Small-C compiler from 1980 was similarly self-hosted on Z80 CP/M systems.

The Language Implementations

kz80_lisp

A minimal LISP interpreter and compiler featuring the full suite of list operations (CAR, CDR, CONS), special forms (QUOTE, IF, COND, LAMBDA, DEFINE), and recursive function support. The implementation includes a pure-LISP floating-point library and the self-hosted compiler mentioned above.

kz80_lisp v0.1
> (+ 21 21)
42
> (DEFINE (SQUARE X) (* X X))
SQUARE
> (SQUARE 7)
49

kz80_c

A C compiler supporting char (8-bit), int (16-bit), float (BCD), pointers, arrays, structs, and a preprocessor with #define and #include. The runtime library provides serial I/O and comprehensive BCD arithmetic functions. The self-hosted variant can compile and run C programs entirely on the Z80.

# cat fibonacci.c
int fib(int n) {
    if (n <= 1) return n;
    return fib(n-1) + fib(n-2);
}

int main() {
    puts("Fibonacci:");
    for (int i = 0; i < 10; i = i + 1)
        print_num(fib(i));
    return 0;
}
# kz80_c fibonacci.c -o fib.bin
# retroshield -l fib.bin
Fibonacci:
0 1 1 2 3 5 8 13 21 34

kz80_smalltalk

A Smalltalk subset compiler implementing the language's distinctive message-passing syntax with left-to-right operator evaluation. Expressions like 1 + 2 * 3 evaluate to 9 (not 7), matching Smalltalk's uniform treatment of binary messages. All arithmetic uses BCD with the DAA instruction.

# echo "6 * 7" | kz80_smalltalk /dev/stdin -o answer.bin
# retroshield -l answer.bin
Tiny Smalltalk on Z80
42
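
The left-to-right rule itself is simple to model: fold each binary message into an accumulator in the order it appears, with no precedence climbing. A toy Rust sketch of that evaluation order (not the actual compiler's evaluator):

// Left-to-right evaluation of a chain of binary messages, with no operator
// precedence: exactly why 1 + 2 * 3 yields 9 in Smalltalk.
fn eval_left_to_right(first: i32, messages: &[(char, i32)]) -> i32 {
    let mut acc = first;
    for &(selector, arg) in messages {
        acc = match selector {
            '+' => acc + arg,
            '-' => acc - arg,
            '*' => acc * arg,
            _ => panic!("unknown binary selector {selector}"),
        };
    }
    acc
}

fn main() {
    // 1 + 2 * 3  ==>  (1 + 2) * 3  ==>  9
    assert_eq!(eval_left_to_right(1, &[('+', 2), ('*', 3)]), 9);
}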

kz80_lua

A Lua compiler producing standalone ROM images with an embedded virtual machine. Supports tables (Lua's associative arrays), first-class functions, closures, and familiar control structures. The generated VM interprets Lua bytecode, with frequently-used operations implemented in native Z80 code for performance.

# cat factorial.lua
function factorial(n)
    if n <= 1 then return 1 end
    return n * factorial(n - 1)
end
print("5! =", factorial(5))
# kz80_lua factorial.lua -o fact.bin
# retroshield -l fact.bin
Tiny Lua v0.1
5! = 120

kz80_fortran

A Fortran 77 interpreter with free-format input, REAL numbers via BCD floating point, block IF/THEN/ELSE/ENDIF, DO loops, subroutines, and functions. Requires the Teensy adapter for sufficient RAM. Written in C and cross-compiled with SDCC.

FORTRAN-77 Interpreter v0.3
RetroShield Z80
Ready.
> INTEGER X, Y
> X = 7
> Y = X * 6
> WRITE(*,*) 'Answer:', Y
Answer: 42

kz80_pascal

A Pascal interpreter implemented in pure Z80 assembly. Provides an interactive REPL for expression evaluation with integer arithmetic, boolean operations, and comparison operators. A testament to the challenges of assembly language programming.

Tiny Pascal v0.1
For RetroShield Z80
(Expression Eval Mode)
> 2 + 3 * 4
= 00014
> TRUE AND (5 > 3)
= TRUE

retrolang

A custom systems programming language with Pascal/C-like syntax, featuring 16-bit integers, 8-bit bytes, pointers, arrays, inline assembly, and full function support with recursion. Compiles to readable Z80 assembly before assembling to binary.

# cat squares.rl
proc main()
    var i: int;
    print("Squares: ");
    for i := 1 to 5 do
        printi(i * i);
        printc(32);
    end;
    println();
end;
# retrolang squares.rl --binary -o squares.bin
# retroshield -l squares.bin
Squares: 1 4 9 16 25

kz80_chip8

A static recompiler that transforms CHIP-8 programs into native Z80 code. Rather than interpreting CHIP-8 bytecode at runtime, the compiler analyzes each instruction and generates equivalent Z80 sequences. Classic games like Space Invaders and Tetris run directly on the hardware.

# kz80_chip8 -d ibm_logo.ch8
200: 00E0  CLS
202: A22A  LD I, 22A
204: 600C  LD V0, 0C
206: 6108  LD V1, 08
208: D01F  DRW V0, V1, 15
20A: 7009  ADD V0, 09
20C: A239  LD I, 239
20E: D01F  DRW V0, V1, 15
...
# kz80_chip8 ibm_logo.ch8 -o ibm.bin
# retroshield -l ibm.bin
CHIP-8 on Z80
[displays IBM logo]
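
To give a flavor of the technique, here is a sketch of translating one CHIP-8 opcode (6XNN, "load immediate into VX") into Z80 bytes. The register-table address and the single handled opcode are illustrative, not the real recompiler's layout.

// Sketch of static recompilation for the CHIP-8 opcode 6XNN (VX = NN).
// CHIP-8 registers V0-VF are assumed to live in a RAM table at a
// hypothetical address; the real recompiler covers the full instruction set.
const V_REGS_BASE: u16 = 0x3000;

fn recompile_op(op: u16, out: &mut Vec<u8>) {
    match op & 0xF000 {
        0x6000 => {
            let x = (op >> 8) & 0x0F;       // destination register VX
            let nn = (op & 0x00FF) as u8;   // immediate value
            let addr = V_REGS_BASE + x;
            out.push(0x3E);                 // LD A, NN
            out.push(nn);
            out.push(0x32);                 // LD (addr), A
            out.push((addr & 0xFF) as u8);  // operand low byte
            out.push((addr >> 8) as u8);    // operand high byte
        }
        _ => unimplemented!("only 6XNN is shown in this sketch"),
    }
}

fn main() {
    let mut rom = Vec::new();
    recompile_op(0x600C, &mut rom); // "LD V0, 0C" from the listing above
    assert_eq!(rom, vec![0x3E, 0x0C, 0x32, 0x00, 0x30]);
}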

Why Rust for Compiler Development?

The choice of Rust for our compiler workbench was not accidental. Several features make it exceptionally well-suited for this work.

Strong typing catches bugs early. When you're generating machine code, off-by-one errors or type mismatches can produce binaries that crash or compute wrong results. Rust's type system prevents many such errors at compile time.

Pattern matching excels at AST manipulation. Walking a syntax tree involves matching on node types and recursively processing children. Rust's match expressions with destructuring make this natural and exhaustive (the compiler warns if you forget a case).

Zero-cost abstractions. We can use high-level constructs like iterators, enums with data, and trait objects without runtime overhead. The generated compiler code is as efficient as hand-written C.

Excellent tooling. Cargo's test framework made it easy to build comprehensive test suites. Each compiler has dozens to hundreds of tests that run in seconds, providing confidence when making changes.

Memory safety without garbage collection. This matters less for the compilers themselves (which are desktop tools) but more for our mental model: thinking about ownership and lifetimes transfers naturally to thinking about Z80 register allocation and stack management.

Conclusion

Building these compilers has been a journey through computing history, from the Z80's 1976 architecture to modern Rust tooling, from the fundamentals of lexing and parsing to the intricacies of self-hosting. The BCD arithmetic that seemed like a curiosity became a practical necessity; the emulator that started as a debugging aid became essential infrastructure; the Rust workbench that felt like an optimization became the key to productivity.

The Z80 remains a remarkable teaching platform. Its simple instruction set is comprehensible in an afternoon, yet implementing real languages for it requires genuine compiler engineering. Every language in this collection forced us to think carefully about representation, evaluation, and code generation in ways that higher-level targets often obscure.

All of these projects are open source under BSD-3-Clause licenses. The compilers are available on both GitHub and crates.io, ready to install with cargo install. Whether you are interested in retrocomputing, compiler construction, or just curious how programming languages work at the metal level, I hope these tools and their source code prove useful.

The Z80 may be nearly 50 years old, but it still has lessons to teach.