In Finnish mythology, the Sampo is a magical artifact from the epic poem Kalevala, compiled by Elias Lönnrot in 1835. According to legend, the Sampo was forged by Ilmarinen, a legendary blacksmith and sky god, from a swan's feather, a grain of barley, a ball of wool, a drop of milk, and a shaft of a distaff. The resulting creation took the form of a magical mill that could produce flour, salt, and gold endlessly—bringing riches and good fortune to its holder.
The exact nature of the Sampo has been debated by scholars since 1818, with over 30 theories proposed—ranging from a world pillar to an astrolabe to a decorated shield. This mystery makes it a fitting namesake for a CPU architecture: something that transforms simple inputs into useful outputs, whose inner workings invite exploration and understanding.
This is the first part of a two-part series exploring the Sampo CPU architecture. In this article, we'll dive deep into the theory, design philosophy, and architectural decisions that shaped Sampo. In Part 2, we'll get our hands dirty with an actual FPGA implementation using Amaranth HDL, bringing this processor to life on real silicon.
The Problem Space: Why Another CPU?
Before diving into Sampo's architecture, it's worth asking: why design a new CPU at all? The retrocomputing community has no shortage of classic processors to explore—the Z80, 6502, 68000—and modern RISC architectures like RISC-V offer clean, well-documented designs for educational purposes.
The answer lies in a specific niche that existing architectures don't quite fill. Consider the typical workloads of classic 8-bit systems: interpreters for languages like BASIC and Forth, operating systems like CP/M, text editors, and simple games. These workloads have distinct characteristics:
- Heavy use of memory operations: Block copies, string manipulation, memory fills
- Port-based I/O: Serial terminals, disk controllers, sound chips accessed via dedicated I/O instructions
- Context switching: Interrupt handlers that need to save and restore register state quickly
- BCD arithmetic: Calculator applications, financial software
The Z80 excels at these tasks through specialized instructions (LDIR, LDDR, IN, OUT) and its alternate register set. But the Z80 is an 8-bit CISC processor with irregular encoding, complex addressing modes, and over 300 instruction variants. This makes it challenging to implement efficiently in modern hardware or to target with optimizing compilers.
Modern RISC architectures like RISC-V take the opposite approach: clean, orthogonal instruction sets optimized for pipelining and compiler code generation. But they typically use memory-mapped I/O (no dedicated I/O instructions), lack block operations, and provide no alternate register sets for fast context switching.
Sampo occupies the middle ground—a "Z80 programmer's RISC" that combines the regularity and simplicity of RISC design with the specialized capabilities that made the Z80 so effective for its target workloads.
Design Goals
Sampo was designed with five primary goals:
- RISC-inspired instruction set: Clean, orthogonal design with predictable encoding
- 16-bit native word size: Registers, ALU, and memory addressing all 16-bit
- Efficient for interpreters and compilers: Stack operations, indirect addressing, hardware multiply/divide
- Simple to implement: Suitable for FPGA synthesis or software emulation
- Z80-workload compatible: Port-based I/O, BCD support, block operations, alternate registers
These goals create natural tensions. RISC purity would eliminate block operations and port-based I/O. Maximum Z80 compatibility would preserve its irregular encoding. Sampo resolves these tensions by borrowing selectively from multiple architectural traditions.
Architectural Lineage
Sampo's design draws from four distinct sources, each contributing specific elements:
From RISC-V
RISC-V's influence is most visible in Sampo's register conventions:
- Zero register (R0): A register that always reads as zero and ignores writes. This eliminates the need for separate "clear" or "load zero" instructions: ADD R4, R0, R0 clears R4, and ADD R4, R5, R0 copies R5 to R4.
- Register naming conventions: Return address (RA), stack pointer (SP), global pointer (GP), argument registers (A0-A3), temporaries (T0-T3), and saved registers (S0-S3).
- Load/store architecture: Only load and store instructions access memory; all computation occurs between registers.
From MIPS
MIPS contributed Sampo's approach to instruction encoding:
- Simple, orthogonal formats: A small number of instruction formats (R, I, S, B, J) with consistent field positions
- 4-bit primary opcode: Sixteen instruction categories, each with function codes for variants
- PC-relative branching: Branch targets specified as signed offsets from the program counter
From ARM Thumb/Thumb-2
ARM's Thumb instruction set inspired Sampo's hybrid encoding strategy:
- 16-bit base instruction width: Most common operations fit in 16 bits for improved code density
- 32-bit extended forms: Operations requiring larger immediates use a two-word format
- Prefix-based extension: The 0xF opcode prefix indicates a 32-bit instruction, simplifying decode
From the Z80
The Z80 provides Sampo's "personality"—the features that make it feel familiar to retrocomputing enthusiasts:
- Port-based I/O: IN and OUT instructions with 8-bit port addresses, separate from the memory address space
- Alternate register set: The EXX instruction swaps working registers with shadow copies for fast interrupt handling
- Block operations: LDIR, LDDR, FILL, and CPIR for efficient memory manipulation
- BCD support: The DAA (Decimal Adjust Accumulator) instruction for binary-coded decimal arithmetic
- 64KB address space: 16-bit addresses, matching the Z80's memory model
The Register File
Sampo provides 16 general-purpose 16-bit registers, organized with RISC-V-style conventions:
| Register | Alias | Convention |
|---|---|---|
| R0 | ZERO | Always reads as 0, writes ignored |
| R1 | RA | Return address (saved by caller) |
| R2 | SP | Stack pointer |
| R3 | GP | Global pointer (optional) |
| R4-R7 | A0-A3 | Arguments / Return values |
| R8-R11 | T0-T3 | Temporaries (caller-saved) |
| R12-R15 | S0-S3 | Saved registers (callee-saved) |
The zero register deserves special attention. Having a register that always contains zero eliminates entire classes of instructions found in other architectures:
- MOV Rd, Rs becomes ADD Rd, Rs, R0
- CLR Rd becomes ADD Rd, R0, R0
- NEG Rd, Rs can use R0 as the implicit minuend
- CMP Rs, #0 becomes SUB R0, Rs, R0 (result discarded, flags set)
This technique, popularized by MIPS and carried forward by RISC-V, trims the instruction set without sacrificing expressiveness.
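The zero-register idea is easy to model in software. The following is a minimal Python sketch, not code from the Sampo emulator: the `RegisterFile` class and the `add` helper are hypothetical names, and flag updates are left out for brevity.

```python
class RegisterFile:
    """Sixteen 16-bit registers; R0 is hardwired to zero (illustrative model)."""

    def __init__(self):
        self._regs = [0] * 16

    def read(self, index):
        # R0 always reads as zero, whatever was "written" to it.
        return 0 if index == 0 else self._regs[index]

    def write(self, index, value):
        # Writes to R0 are silently dropped; other registers wrap to 16 bits.
        if index != 0:
            self._regs[index] = value & 0xFFFF


def add(rf, rd, rs1, rs2):
    """ADD Rd, Rs1, Rs2 (flag updates omitted in this sketch)."""
    rf.write(rd, rf.read(rs1) + rf.read(rs2))


rf = RegisterFile()
rf.write(5, 0x1234)
add(rf, 4, 5, 0)   # MOV R4, R5  expressed as ADD R4, R5, R0
add(rf, 5, 0, 0)   # CLR R5      expressed as ADD R5, R0, R0
add(rf, 0, 4, 4)   # result discarded: the destination is R0
```

After these three instructions R4 holds 0x1234, R5 holds zero, and R0 still reads as zero.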
Alternate Registers
Unlike the Z80, whose EXX swaps the BC, DE, and HL register pairs wholesale, Sampo is selective. Only registers R4-R11 (the arguments and temporaries) have shadow copies. The critical system registers—R0 (zero), R1 (return address), R2 (stack pointer), R3 (global pointer), and R12-R15 (saved registers)—are never swapped.
This design decision serves interrupt handling. When an interrupt occurs, the handler can execute EXX to gain a fresh set of working registers without corrupting the interrupted code's arguments or temporaries. The stack pointer remains valid (no need to establish a new stack), and the return address register can be used to save the interrupted PC.
    irq_handler:
        EXX             ; Swap to alternate R4-R11
        ; ... handle interrupt using R4'-R11' ...
        ; Primary registers preserved automatically
        EXX             ; Swap back
        RETI            ; Return from interrupt
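To make the selective swap concrete, here is a small Python sketch of a banked register file. The `BankedRegisters` class is a hypothetical illustration (it ignores the hardwired-zero behaviour of R0 for brevity) and is not taken from the Sampo sources.

```python
class BankedRegisters:
    """Register file where only R4-R11 have shadow copies (illustrative)."""

    SWAPPED = range(4, 12)  # R4..R11 inclusive

    def __init__(self):
        self.regs = [0] * 16    # the currently visible bank
        self.shadow = [0] * 8   # alternate copies of R4-R11

    def exx(self):
        # Exchange each swappable register with its shadow copy.
        for slot, index in enumerate(self.SWAPPED):
            self.regs[index], self.shadow[slot] = (
                self.shadow[slot],
                self.regs[index],
            )


cpu = BankedRegisters()
cpu.regs[2] = 0xFE00     # SP, never swapped
cpu.regs[4] = 0x1111     # A0 of the interrupted code
cpu.exx()                # handler entry: fresh R4-R11
cpu.regs[4] = 0x2222     # handler scribbles on its own A0
sp_inside_handler = cpu.regs[2]
cpu.exx()                # handler exit: original values return
```

Two calls to `exx` restore the interrupted code's R4, while SP stays valid throughout.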
The Flags Register
Sampo uses an 8-bit flags register with six defined flags:
| Bit | Flag | Name | Description |
|---|---|---|---|
| 7 | N | Negative | Sign bit of result (bit 15) |
| 6 | Z | Zero | Result is zero |
| 5 | C | Carry | Unsigned overflow / borrow |
| 4 | V | Overflow | Signed overflow |
| 3 | H | Half-carry | Carry from bit 3 to 4 (for BCD) |
| 2 | I | Interrupt | Interrupt enable |
The N, Z, C, and V flags follow standard conventions and support the full range of conditional branches. The H (half-carry) flag exists specifically for the DAA instruction, enabling correct BCD arithmetic. The I flag controls interrupt recognition.
Notably, Sampo provides explicit GETF and SETF instructions to read and write the flags register, unlike many RISC architectures that treat flags as implicit state. This supports context switching and debugging.
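As an illustration of the flag layout in the table above, the Python sketch below computes the flags for a 16-bit addition and packs them into the bit positions listed. The function name `add16_flags` is made up for this example, the half-carry is taken from the low nibble as the table describes, and the I bit is simply left clear.

```python
# Bit masks taken from the flags table (bit 7 = N ... bit 2 = I).
N, Z, C, V, H, I = 0x80, 0x40, 0x20, 0x10, 0x08, 0x04


def add16_flags(a, b):
    """Return (result, flags) for a 16-bit ADD; the I bit is left clear."""
    a &= 0xFFFF
    b &= 0xFFFF
    total = a + b
    result = total & 0xFFFF
    flags = 0
    if result & 0x8000:
        flags |= N                      # sign bit of the result
    if result == 0:
        flags |= Z
    if total > 0xFFFF:
        flags |= C                      # unsigned overflow
    if ~(a ^ b) & (a ^ result) & 0x8000:
        flags |= V                      # same-sign inputs, different-sign result
    if (a & 0xF) + (b & 0xF) > 0xF:
        flags |= H                      # carry out of bit 3
    return result, flags
```

For instance, adding 1 to 0x7FFF yields 0x8000 with N, V, and H set, while adding 1 to 0xFFFF yields zero with Z, C, and H set.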
Memory Model
Sampo uses a straightforward memory model:
- Address space: 64KB (16-bit addresses)
- Byte-addressable: Individual bytes can be loaded and stored
- Little-endian: Multi-byte values stored with LSB at lower address
- Word alignment: 16-bit words should be aligned on even addresses (optional enforcement)
A suggested memory map divides the 64KB space:
    0x0000-0x00FF   Interrupt vectors / Reset
    0x0100-0x7FFF   Program ROM (~32KB)
    0x8000-0xFEFF   RAM (~32KB)
    0xFF00-0xFFFF   Memory-mapped I/O (256 bytes)
This layout provides a clean separation between code, data, and I/O while leaving room for customization. The interrupt vector area at the bottom of memory follows Z80 conventions, with the reset vector at 0x0000 and interrupt vector at 0x0004.
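The memory model can be sketched in a few lines of Python. The region table mirrors the suggested map, and the two helpers show little-endian word access; the names `region_of`, `store_word`, and `load_word` are illustrative assumptions, and alignment is not enforced here.

```python
# Suggested memory map, as (first address, last address, label).
REGIONS = [
    (0x0000, 0x00FF, "vectors"),
    (0x0100, 0x7FFF, "rom"),
    (0x8000, 0xFEFF, "ram"),
    (0xFF00, 0xFFFF, "mmio"),
]


def region_of(addr):
    """Return the label of the region that contains a 16-bit address."""
    for start, end, name in REGIONS:
        if start <= addr <= end:
            return name
    raise ValueError(f"address out of range: {addr:#x}")


def store_word(mem, addr, value):
    """Little-endian store: low byte at addr, high byte at addr + 1."""
    mem[addr] = value & 0xFF
    mem[(addr + 1) & 0xFFFF] = (value >> 8) & 0xFF


def load_word(mem, addr):
    """Little-endian load of a 16-bit word."""
    return mem[addr] | (mem[(addr + 1) & 0xFFFF] << 8)


mem = bytearray(0x10000)          # the full 64KB address space
store_word(mem, 0x8000, 0x1234)   # lands in RAM as 0x34, 0x12
```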
Port-Based I/O
In addition to memory, Sampo provides a separate 256-port I/O address space accessed via IN and OUT instructions. This design directly mirrors the Z80 and enables straightforward porting of code that interacts with serial ports, disk controllers, sound chips, and other peripherals.
The I/O instructions come in two forms:
    INI  R4, 0x80     ; Read from port 0x80 (immediate port number)
    IN   R4, (R5)     ; Read from port specified in R5
    OUTI 0x81, R4     ; Write R4 to port 0x81 (immediate)
    OUT  (R5), R4     ; Write R4 to port specified in R5
Extended 32-bit forms (INX, OUTX) allow the full 8-bit port range to be specified in immediate form.
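One way to picture a separate port space is as a table of device callbacks that lives alongside memory rather than inside it. The Python sketch below is only a model: `PortBus` is a hypothetical name, and both the byte-wide transfers and the 0xFF value returned by unmapped ports are assumptions, not documented Sampo behaviour.

```python
class PortBus:
    """256-entry I/O port space, separate from memory (illustrative model)."""

    def __init__(self):
        self.readers = {}   # port number -> callable returning a byte
        self.writers = {}   # port number -> callable taking a byte

    def port_in(self, port):
        handler = self.readers.get(port & 0xFF)
        # Assumption: an unmapped port reads as 0xFF.
        return handler() & 0xFF if handler else 0xFF

    def port_out(self, port, value):
        handler = self.writers.get(port & 0xFF)
        if handler:
            handler(value & 0xFF)


bus = PortBus()
sent = []
bus.readers[0x80] = lambda: 0x02      # status port: transmit-ready bit set
bus.writers[0x81] = sent.append       # data port: collect written bytes

if bus.port_in(0x80) & 0x02:          # INI  R6, 0x80 ; ANDI ..., 0x02
    bus.port_out(0x81, ord("H"))      # OUTI 0x81, R5
```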
Instruction Encoding
Sampo uses a clean, regular encoding scheme with 16-bit base instructions and 32-bit extended forms. The 4-bit primary opcode in bits 15:12 determines the instruction category:
| Opcode | Category | Description |
|---|---|---|
| 0x0 | ADD | Register addition |
| 0x1 | SUB | Register subtraction |
| 0x2 | AND | Bitwise AND |
| 0x3 | OR | Bitwise OR |
| 0x4 | XOR | Bitwise XOR |
| 0x5 | ADDI | Add immediate |
| 0x6 | LOAD | Load from memory |
| 0x7 | STORE | Store to memory |
| 0x8 | BRANCH | Conditional branch |
| 0x9 | JUMP | Unconditional jump/call |
| 0xA | SHIFT | Shift and rotate |
| 0xB | MULDIV | Multiply/divide/BCD |
| 0xC | MISC | Stack, block ops, compare |
| 0xD | I/O | Port input/output |
| 0xE | SYSTEM | NOP, HALT, interrupts |
| 0xF | EXTENDED | 32-bit instructions |
Instruction Formats
Six formats cover all instruction types:
Format R (Register-Register):
     15      12 11     8 7      4 3      0
    +----------+--------+--------+--------+
    |  opcode  |   Rd   |  Rs1   |  Rs2   |
    +----------+--------+--------+--------+
Used for three-register operations like ADD R4, R5, R6.
Format I (Immediate):
     15      12 11     8 7                0
    +----------+--------+------------------+
    |  opcode  |   Rd   |       imm8       |
    +----------+--------+------------------+
Used for operations with 8-bit immediates like ADDI R4, 42.
Format S (Store):
     15      12 11     8 7      4 3      0
    +----------+--------+--------+--------+
    |  opcode  |  imm4  |  Rs1   |  Rs2   |
    +----------+--------+--------+--------+
Used for stores where the destination register field holds an offset.
Format B (Branch):
     15      12 11     8 7                0
    +----------+--------+------------------+
    |  opcode  |  cond  |     offset8      |
    +----------+--------+------------------+
Used for conditional branches with PC-relative offsets.
Format J (Jump):
     15      12 11                       0
    +----------+--------------------------+
    |  opcode  |         offset12         |
    +----------+--------------------------+
Used for unconditional jumps with 12-bit PC-relative offsets.
Format X (Extended):
    Word 0:
     15      12 11     8 7      4 3      0
    +----------+--------+--------+--------+
    |   0xF    |   Rd   |  Rs1   |  sub   |
    +----------+--------+--------+--------+
    Word 1:
     15                                  0
    +-------------------------------------+
    |                imm16                |
    +-------------------------------------+
Used for operations requiring 16-bit immediates or absolute addresses.
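Because the fields sit at fixed bit positions, a decoder can slice every candidate field out of the first word and let the opcode decide which ones matter. The Python sketch below illustrates that idea; `decode` and `sign_extend` are hypothetical helpers, not the project's actual decoder.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode(word):
    """Slice a 16-bit instruction word into every field the formats define."""
    return {
        "opcode": (word >> 12) & 0xF,
        "rd": (word >> 8) & 0xF,        # doubles as imm4 (S) or cond (B)
        "rs1": (word >> 4) & 0xF,
        "rs2": word & 0xF,              # doubles as sub (X)
        "imm8": word & 0xFF,
        "offset8": sign_extend(word & 0xFF, 8),
        "offset12": sign_extend(word & 0xFFF, 12),
        "is_extended": (word >> 12) == 0xF,   # a second word follows
    }


fields = decode(0x0456)   # opcode 0x0 with Rd=4, Rs1=5, Rs2=6
```

A real decoder would then pick only the fields relevant to the decoded opcode, and fetch a second word whenever `is_extended` is true.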
Encoding Examples
To illustrate the encoding scheme, let's examine several instructions:
ADD R4, R5, R6 (R4 = R5 + R6):
    Opcode = 0x0, Rd = 4, Rs1 = 5, Rs2 = 6
    Binary: 0000 0100 0101 0110 = 0x0456
ADDI R4, 10 (R4 = R4 + 10):
    Opcode = 0x5, Rd = 4, imm8 = 10
    Binary: 0101 0100 0000 1010 = 0x540A
BEQ +8 (branch forward 8 bytes if equal):
    Opcode = 0x8, cond = 0 (BEQ), offset = 4 words
    Binary: 1000 0000 0000 0100 = 0x8004
LIX R4, 0x1234 (load 16-bit immediate):
    Word 0: 0xF (extended), Rd = 4, Rs = 0, sub = 7 (LIX)
    Word 1: 0x1234
    Binary: 1111 0100 0000 0111 0001 0010 0011 0100 = 0xF407 0x1234
The regularity of this encoding makes instruction decode straightforward—the first nibble determines the instruction category, and subsequent fields are in consistent positions across formats.
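The four worked examples can be reproduced with a handful of shift-and-OR helpers. This is a sketch with made-up function names (`enc_r`, `enc_i`, `enc_b`, `enc_x`); it assumes, following the BEQ example, that branch offsets are stored as word counts, and it does not model which address the offset is relative to.

```python
def enc_r(opcode, rd, rs1, rs2):
    """Format R: opcode | Rd | Rs1 | Rs2."""
    return (opcode << 12) | (rd << 8) | (rs1 << 4) | rs2


def enc_i(opcode, rd, imm8):
    """Format I: opcode | Rd | imm8."""
    return (opcode << 12) | (rd << 8) | (imm8 & 0xFF)


def enc_b(opcode, cond, byte_offset):
    """Format B: opcode | cond | offset8, with the offset stored in words."""
    if byte_offset % 2:
        raise ValueError("branch targets must be word-aligned")
    return (opcode << 12) | (cond << 8) | ((byte_offset // 2) & 0xFF)


def enc_x(rd, rs1, sub, imm16):
    """Format X: prefix word with opcode 0xF, followed by a 16-bit immediate."""
    return [(0xF << 12) | (rd << 8) | (rs1 << 4) | sub, imm16 & 0xFFFF]


words = [
    enc_r(0x0, 4, 5, 6),          # ADD  R4, R5, R6
    enc_i(0x5, 4, 10),            # ADDI R4, 10
    enc_b(0x8, 0, 8),             # BEQ  +8
    *enc_x(4, 0, 7, 0x1234),      # LIX  R4, 0x1234
]
```

The resulting list matches the hand-assembled values above: 0x0456, 0x540A, 0x8004, 0xF407, 0x1234.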
The Instruction Set
Sampo provides approximately 66 distinct instructions, organized into ten categories.
Arithmetic (15 instructions)
The arithmetic category includes standard operations (ADD, SUB, ADDI) plus multiply/divide support:
- MUL: 16×16 multiplication, low 16 bits of result
- MULH/MULHU: High 16 bits of 32-bit product (signed/unsigned)
- DIV/DIVU: Integer division (signed/unsigned)
- REM/REMU: Remainder (signed/unsigned)
- DAA: Decimal adjust for BCD arithmetic
- NEG: Two's complement negation
- CMP: Compare (subtract without storing result)
Hardware multiply and divide are essential for interpreter performance—dividing a 32-bit value by 10 for number formatting would be prohibitively slow without hardware support.
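The split between MUL and MULH/MULHU, and the number-formatting use case, can be illustrated with a short Python model. The function names mirror the mnemonics but the code is only a sketch of the arithmetic; division by zero is left out because its behaviour is not described here.

```python
def to_signed(x):
    """Reinterpret a 16-bit value as a signed integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def mul(a, b):
    """MUL: low 16 bits of the product (same for signed and unsigned)."""
    return (a * b) & 0xFFFF


def mulhu(a, b):
    """MULHU: high 16 bits of the unsigned 32-bit product."""
    return ((a & 0xFFFF) * (b & 0xFFFF)) >> 16


def mulh(a, b):
    """MULH: high 16 bits of the signed 32-bit product."""
    return ((to_signed(a) * to_signed(b)) >> 16) & 0xFFFF


def to_decimal(value):
    """Format a 16-bit value using only divide/remainder by 10 (DIVU/REMU)."""
    value &= 0xFFFF
    digits = []
    while True:
        digits.append(chr(ord("0") + value % 10))   # REMU value, 10
        value //= 10                                # DIVU value, 10
        if value == 0:
            break
    return "".join(reversed(digits))
```

Each decimal digit costs one divide and one remainder, which is why an interpreter that prints many numbers benefits from having both in hardware.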
Logic (6 instructions)
Standard bitwise operations: AND, OR, XOR, NOT, plus immediate forms ANDI and ORI.
Shift and Rotate (16 variants)
Sampo provides an unusually rich set of shift operations:
- SLL/SRL/SRA: Shift left/right logical/arithmetic
- ROL/ROR: Rotate left/right
- RCL/RCR: Rotate through carry (17-bit rotation)
- SWAP: Swap high and low bytes
Each shift type comes in three shift amounts: 1, 4, and 8 bits. The 4-bit shift is particularly useful for hexadecimal digit extraction and insertion. Variable shifts use the extended format with the shift amount in the second register or immediate field.
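Two of these operations are worth a quick illustration: the 4-bit shift for hexadecimal digits and the 17-bit rotate through carry. The Python sketch below models both; `srl`, `hex_digits`, and `rcl1` are hypothetical helper names rather than anything from the Sampo toolchain.

```python
HEX = "0123456789ABCDEF"


def srl(value, amount):
    """Logical shift right of a 16-bit value."""
    return (value & 0xFFFF) >> amount


def hex_digits(value):
    """Peel off four hex digits with a mask and a 4-bit shift per digit."""
    digits = []
    for _ in range(4):
        digits.append(HEX[value & 0xF])   # mask the low nibble
        value = srl(value, 4)             # 4-bit shift to the next digit
    return "".join(reversed(digits))


def rcl1(value, carry):
    """Rotate left by one through carry: 16 data bits plus C form 17 bits."""
    wide = ((value & 0xFFFF) << 1) | (carry & 1)
    return wide & 0xFFFF, (wide >> 16) & 1


# Seventeen single-bit rotations bring value and carry back to the start.
state = (0xBEEF, 1)
for _ in range(17):
    state = rcl1(*state)
```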
Load/Store (6 instructions)
Memory access instructions include word and byte loads (with sign or zero extension), word and byte stores, and LUI (Load Upper Immediate) for constructing 16-bit constants:
    LUI R4, 0x12        ; R4 = 0x1200
    ORI R4, R4, 0x34    ; R4 = 0x1234
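An assembler could expand any 16-bit constant into this LUI/ORI pair by splitting it into bytes. The Python sketch below shows the arithmetic; the helper names are illustrative, and it assumes ORI zero-extends its 8-bit immediate (a sign-extending ORI would need a different split for low bytes of 0x80 and above).

```python
def lui(imm8):
    """LUI: place an 8-bit immediate in the upper byte, clearing the lower."""
    return (imm8 & 0xFF) << 8


def ori(value, imm8):
    """ORI: OR in an 8-bit immediate (assumed zero-extended)."""
    return (value | (imm8 & 0xFF)) & 0xFFFF


def split_constant(value):
    """Return the (upper, lower) bytes an assembler would feed to LUI and ORI."""
    return (value >> 8) & 0xFF, value & 0xFF


upper, lower = split_constant(0xBEEF)
rebuilt = ori(lui(upper), lower)
```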
Branch (16 conditions)
Sampo supports a comprehensive set of branch conditions:
- BEQ/BNE: Equal/not equal
- BLT/BGE/BGT/BLE: Signed comparisons
- BLTU/BGEU/BHI/BLS: Unsigned comparisons
- BMI/BPL: Negative/positive
- BVS/BVC: Overflow set/clear
- BCS/BCC: Carry set/clear
This covers all reasonable comparison outcomes for both signed and unsigned arithmetic.
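To show how sixteen conditions fall out of four flags, the Python sketch below evaluates each mnemonic against the flags produced by a compare. It assumes CMP computes a minus b and that C is set on borrow, as the flags table suggests; under that assumption BLTU/BCS and BGEU/BCC test the same thing. The names `cmp_flags`, `CONDITIONS`, and `taken` are made up for this example.

```python
def to_signed(x):
    """Reinterpret a 16-bit value as a signed integer."""
    return x - 0x10000 if x & 0x8000 else x


def cmp_flags(a, b):
    """Flags after CMP a, b (a - b), assuming C means 'borrow occurred'."""
    a &= 0xFFFF
    b &= 0xFFFF
    diff = (a - b) & 0xFFFF
    return {
        "N": bool(diff & 0x8000),
        "Z": diff == 0,
        "C": a < b,
        "V": bool((a ^ b) & (a ^ diff) & 0x8000),
    }


CONDITIONS = {
    "BEQ": lambda f: f["Z"],
    "BNE": lambda f: not f["Z"],
    "BLT": lambda f: f["N"] != f["V"],
    "BGE": lambda f: f["N"] == f["V"],
    "BGT": lambda f: not f["Z"] and f["N"] == f["V"],
    "BLE": lambda f: f["Z"] or f["N"] != f["V"],
    "BLTU": lambda f: f["C"],
    "BGEU": lambda f: not f["C"],
    "BHI": lambda f: not f["C"] and not f["Z"],
    "BLS": lambda f: f["C"] or f["Z"],
    "BMI": lambda f: f["N"],
    "BPL": lambda f: not f["N"],
    "BVS": lambda f: f["V"],
    "BVC": lambda f: not f["V"],
    "BCS": lambda f: f["C"],
    "BCC": lambda f: not f["C"],
}


def taken(mnemonic, flags):
    """Return True if the named branch would be taken for these flags."""
    return bool(CONDITIONS[mnemonic](flags))
```

Comparing 0xFFFF with 1, for example, takes BLT (minus one is less than one when signed) and also BHI (65535 is greater than one when unsigned).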
Jump/Call (4 instructions)
- J: PC-relative unconditional jump
- JAL: Jump and link (save return address in RA)
- JR: Jump to address in register
- JALR: Jump and link to register address
Block Operations (6 instructions)
The block operations use a fixed register convention (R4=count, R5=source, R6=destination):
- LDI/LDD: Copy a single byte, then increment/decrement the pointers and decrement the count
- LDIR/LDDR: Repeat until count reaches zero
- FILL: Fill memory region with value
- CPIR: Compare and search forward
These instructions are decidedly un-RISC—they're multi-cycle operations that modify multiple registers. But they're implemented with predictable behavior (always the same registers, always the same algorithm) and provide enormous speedups for common memory operations.
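The fixed register convention makes the block copy easy to model. The Python sketch below implements LDI and LDIR over a flat byte array using R4, R5, and R6 as described; it assumes a count of zero copies nothing, which is a guess rather than documented behaviour.

```python
def ldi(mem, regs):
    """One LDI step: copy a byte, advance both pointers, decrement the count."""
    mem[regs[6]] = mem[regs[5]]            # (R6) <- (R5)
    regs[5] = (regs[5] + 1) & 0xFFFF       # source pointer
    regs[6] = (regs[6] + 1) & 0xFFFF       # destination pointer
    regs[4] = (regs[4] - 1) & 0xFFFF       # remaining count


def ldir(mem, regs):
    """LDIR: repeat LDI until the count in R4 reaches zero."""
    # Assumption: an initial count of zero performs no copies.
    while regs[4] != 0:
        ldi(mem, regs)


mem = bytearray(0x10000)
mem[0x8000:0x8005] = b"Hello"
regs = [0] * 16
regs[4], regs[5], regs[6] = 5, 0x8000, 0x9000   # count, source, destination
ldir(mem, regs)
```

After the call, the five bytes appear at 0x9000, R4 is zero, and both pointers sit just past the copied region.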
Stack (4 instructions)
- PUSH/POP: Single register push/pop
- PUSHM/POPM: Push/pop multiple registers (via bitmask)
I/O (4 instructions)
- INI/OUTI: Immediate port address
- IN/OUT: Register port address
System (11 instructions)
- NOP: No operation
- HALT: Stop processor
- DI/EI: Disable/enable interrupts
- EXX: Exchange alternate registers
- RETI: Return from interrupt
- SWI: Software interrupt
- SCF/CCF: Set/complement carry flag
- GETF/SETF: Read/write flags register
Comparison with Other Architectures
To put Sampo in context, consider how it compares with related processors:
| Aspect | Z80 | MIPS | RISC-V | Sampo |
|---|---|---|---|---|
| Word size | 8-bit | 32-bit | 32/64-bit | 16-bit |
| Instruction width | 1-4 bytes | 4 bytes | 2/4 bytes | 2/4 bytes |
| Registers | 8 + alternates | 32 | 32 | 16 + alternates |
| Zero register | No | $zero | x0 | R0 |
| I/O model | Port-based | Memory-mapped | Memory-mapped | Port-based |
| Block operations | Yes | No | No | Yes |
| Instruction count | ~300+ | ~60 | ~50 base | ~66 |
Sampo sits in an interesting position: more regular than the Z80 but with Z80-friendly features, smaller and simpler than 32-bit RISC but still cleanly orthogonal.
Code Examples
To demonstrate how Sampo assembly looks in practice, here's a "Hello World" program that outputs text via a serial port:
    .org 0x0100

    .equ ACIA_STATUS 0x80
    .equ ACIA_DATA   0x81
    .equ TX_READY    0x02

    start:
        LIX  R4, message        ; Load address of string
    loop:
        LBU  R5, (R4)           ; Load byte from string
        CMP  R5, R0             ; Compare with zero
        BEQ  done               ; If null terminator, done
    wait_tx:
        INI  R6, ACIA_STATUS    ; Read serial status port
        ANDI R6, R6, TX_READY   ; Check transmit ready bit
        BEQ  wait_tx            ; Wait if not ready
        OUTI ACIA_DATA, R5      ; Write character to data port
        ADDI R4, 1              ; Next character
        J    loop
    done:
        HALT

    message:
        .asciz "Hello, Sampo!\n"
And here's a Fibonacci function demonstrating the calling convention:
    ; fib(n) - compute nth Fibonacci number
    ; Input:  R4 (A0) = n
    ; Output: R4 (A0) = fib(n)
    fib:
        ADDI R5, R0, 0      ; a = 0
        ADDI R6, R0, 1      ; b = 1
        CMP  R4, R0
        BEQ  fib_done
    fib_loop:
        ADD  R7, R5, R6     ; temp = a + b
        ADD  R5, R6, R0     ; a = b
        ADD  R6, R7, R0     ; b = temp
        ADDI R4, R4, -1     ; n--
        BNE  fib_loop
    fib_done:
        ADD  R4, R5, R0     ; return a
        JR   RA
The code reads naturally to anyone familiar with RISC assembly, while the I/O instructions and register conventions provide the Z80-like feel that makes porting classic software straightforward.
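For anyone who wants to cross-check the Fibonacci routine against an emulator run, a line-for-line Python mirror is handy. The sketch below follows the same loop structure and wraps results to 16 bits the way the registers would; `fib16` is a name invented for this example.

```python
def fib16(n):
    """Mirror of the assembly loop, with 16-bit wraparound on the sum."""
    a, b = 0, 1                          # ADDI R5, R0, 0 / ADDI R6, R0, 1
    while n != 0:                        # CMP R4, R0 / BEQ fib_done
        a, b = b, (a + b) & 0xFFFF       # temp = a + b; a = b; b = temp
        n -= 1                           # ADDI R4, R4, -1 / BNE fib_loop
    return a                             # ADD R4, R5, R0
```

Values up to fib(24) = 46368 fit in a 16-bit register; from fib(25) onward the result wraps modulo 65536.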
Looking Ahead: FPGA Implementation
With the architecture defined, the next step is implementation. In Part 2 of this series, we'll build a working Sampo processor using Amaranth HDL, a modern Python-based hardware description language. We'll cover:
- The ALU module: Implementing all arithmetic and logic operations
- The register file: Including the alternate register set and zero register
- The instruction decoder: Parsing the various instruction formats
- The control unit: Managing the fetch-decode-execute cycle
- The memory interface: Connecting to block RAM
- The I/O subsystem: Implementing the port-based I/O model
- Integration: Putting it all together into a working system-on-chip
We'll synthesize the design for an affordable FPGA board and run actual Sampo programs, demonstrating that this architecture isn't just a paper exercise but a real, working processor.
The Sampo project on GitHub includes a complete Rust-based assembler (sasm) and emulator (semu) with a TUI debugger, so you can start writing and testing Sampo programs today. The FPGA implementation will let you run those same programs on real hardware, completing the journey from mythological artifact to silicon reality.
Stay tuned for Part 2, where we'll forge our own Sampo—not from swan feathers and barley, but from lookup tables and flip-flops.
