Sampo: Designing a 16-bit RISC CPU from Scratch - Part 1: Theory and Architecture

A.C. Jokela

2025-12-30

In Finnish mythology, the Sampo is a magical artifact from the epic poem Kalevala, compiled by Elias Lönnrot in 1835. According to legend, the Sampo was forged by Ilmarinen, a legendary blacksmith and sky god, from a swan's feather, a grain of barley, a ball of wool, a drop of milk, and a shaft of a distaff. The resulting creation took the form of a magical mill that could produce flour, salt, and gold endlessly—bringing riches and good fortune to its holder.

The exact nature of the Sampo has been debated by scholars since 1818, with over 30 theories proposed—ranging from a world pillar to an astrolabe to a decorated shield. This mystery makes it a fitting namesake for a CPU architecture: something that transforms simple inputs into useful outputs, whose inner workings invite exploration and understanding.

This is the first part of a two-part series exploring the Sampo CPU architecture. In this article, we'll dive deep into the theory, design philosophy, and architectural decisions that shaped Sampo. In Part 2, we'll get our hands dirty with an actual FPGA implementation using Amaranth HDL, bringing this processor to life on real silicon.

The Problem Space: Why Another CPU?

Before diving into Sampo's architecture, it's worth asking: why design a new CPU at all? The retrocomputing community has no shortage of classic processors to explore—the Z80, 6502, 68000—and modern RISC architectures like RISC-V offer clean, well-documented designs for educational purposes.

The answer lies in a specific niche that existing architectures don't quite fill. Consider the typical workloads of classic 8-bit systems: interpreters for languages like BASIC and Forth, operating systems like CP/M, text editors, and simple games. These workloads have distinct characteristics:

Heavy use of memory operations: Block copies, string manipulation, memory fills
Port-based I/O: Serial terminals, disk controllers, sound chips accessed via dedicated I/O instructions
Context switching: Interrupt handlers that need to save and restore register state quickly
BCD arithmetic: Calculator applications, financial software

The Z80 excels at these tasks through specialized instructions (LDIR, LDDR, IN, OUT) and its alternate register set. But the Z80 is an 8-bit CISC processor with irregular encoding, complex addressing modes, and over 300 instruction variants. This makes it challenging to implement efficiently in modern hardware or to target with optimizing compilers.

Modern RISC architectures like RISC-V take the opposite approach: clean, orthogonal instruction sets optimized for pipelining and compiler code generation. But they typically use memory-mapped I/O (no dedicated I/O instructions), lack block operations, and provide no alternate register sets for fast context switching.

Sampo occupies the middle ground—a "Z80 programmer's RISC" that combines the regularity and simplicity of RISC design with the specialized capabilities that made the Z80 so effective for its target workloads.

Design Goals

Sampo was designed with five primary goals:

RISC-inspired instruction set: Clean, orthogonal design with predictable encoding
16-bit native word size: Registers, ALU, and memory addressing all 16-bit
Efficient for interpreters and compilers: Stack operations, indirect addressing, hardware multiply/divide
Simple to implement: Suitable for FPGA synthesis or software emulation
Z80-workload compatible: Port-based I/O, BCD support, block operations, alternate registers

These goals create natural tensions. RISC purity would eliminate block operations and port-based I/O. Maximum Z80 compatibility would preserve its irregular encoding. Sampo resolves these tensions by borrowing selectively from multiple architectural traditions.

Architectural Lineage

Sampo's design draws from four distinct sources, each contributing specific elements:

From RISC-V

RISC-V's influence is most visible in Sampo's register conventions:

Zero register (R0): A register that always reads as zero and ignores writes. This eliminates the need for separate "clear" or "load zero" instructions—ADD R4, R0, R0 clears R4, ADD R4, R5, R0 copies R5 to R4.
Register naming conventions: Return address (RA), stack pointer (SP), global pointer (GP), argument registers (A0-A3), temporaries (T0-T3), and saved registers (S0-S3).
Load/store architecture: Only load and store instructions access memory; all computation occurs between registers.

From MIPS

MIPS contributed Sampo's approach to instruction encoding:

Simple, orthogonal formats: A small number of instruction formats (R, I, S, B, J) with consistent field positions
4-bit primary opcode: Sixteen instruction categories, each with function codes for variants
PC-relative branching: Branch targets specified as signed offsets from the program counter

From ARM Thumb/Thumb-2

ARM's Thumb instruction set inspired Sampo's hybrid encoding strategy:

16-bit base instruction width: Most common operations fit in 16 bits for improved code density
32-bit extended forms: Operations requiring larger immediates use a two-word format
Prefix-based extension: The 0xF opcode prefix indicates a 32-bit instruction, simplifying decode

From the Z80

The Z80 provides Sampo's "personality"—the features that make it feel familiar to retrocomputing enthusiasts:

Port-based I/O: IN and OUT instructions with 8-bit port addresses, separate from the memory address space
Alternate register set: The EXX instruction swaps working registers with shadow copies for fast interrupt handling
Block operations: LDIR, LDDR, and CPIR modeled on the Z80, plus a FILL instruction the Z80 lacks, for efficient memory manipulation
BCD support: The DAA (Decimal Adjust Accumulator) instruction for binary-coded decimal arithmetic
64KB address space: 16-bit addresses, matching the Z80's memory model

The Register File

Sampo provides 16 general-purpose 16-bit registers, organized with RISC-V-style conventions:

Register	Alias	Convention
R0	ZERO	Always reads as 0, writes ignored
R1	RA	Return address (saved by caller)
R2	SP	Stack pointer
R3	GP	Global pointer (optional)
R4-R7	A0-A3	Arguments / Return values
R8-R11	T0-T3	Temporaries (caller-saved)
R12-R15	S0-S3	Saved registers (callee-saved)

The zero register deserves special attention. Having a register that always contains zero eliminates entire classes of instructions found in other architectures:

MOV Rd, Rs becomes ADD Rd, Rs, R0
CLR Rd becomes ADD Rd, R0, R0
NEG Rd, Rs can use R0 as the implicit minuend
CMP Rs, #0 becomes SUB R0, Rs, R0 (result discarded, flags set)

This technique, popularized by MIPS and carried forward by RISC-V, simplifies the instruction set while maintaining expressiveness.
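
To make the zero-register idioms concrete, here is a minimal Python sketch of a register file in which R0 is hardwired to zero. The class and function names (RegisterFile, add) are illustrative assumptions, not taken from the Sampo emulator, and flag updates are left out.

```python
class RegisterFile:
    """Sixteen 16-bit registers; R0 is hardwired to zero (illustrative model)."""

    def __init__(self):
        self._regs = [0] * 16

    def read(self, index):
        # R0 always reads as zero, whatever was "written" to it.
        return 0 if index == 0 else self._regs[index]

    def write(self, index, value):
        # Writes to R0 are silently dropped; other registers keep 16 bits.
        if index != 0:
            self._regs[index] = value & 0xFFFF


def add(rf, rd, rs1, rs2):
    """ADD Rd, Rs1, Rs2 on the register file (flags omitted for brevity)."""
    rf.write(rd, rf.read(rs1) + rf.read(rs2))


rf = RegisterFile()
rf.write(5, 0x1234)
add(rf, 4, 5, 0)      # ADD R4, R5, R0 behaves as MOV R4, R5
moved = rf.read(4)
add(rf, 4, 0, 0)      # ADD R4, R0, R0 behaves as CLR R4
cleared = rf.read(4)
add(rf, 0, 5, 5)      # ADD R0, R5, R5 computes a result and discards it
zero = rf.read(0)
```

One ADD datapath covers move, clear, and discard-the-result, which is why none of them needs its own opcode.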

Alternate Registers

On the Z80, EXX swaps the BC, DE, and HL register pairs with their shadow copies, while the accumulator and flags have their own exchange instruction (EX AF,AF'). Sampo's EXX is similarly selective. Only registers R4-R11 (the arguments and temporaries) have shadow copies. The critical system registers are never swapped: R0 (zero), R1 (return address), R2 (stack pointer), R3 (global pointer), and R12-R15 (saved registers).

This design decision serves interrupt handling. When an interrupt occurs, the handler can execute EXX to gain a fresh set of working registers without corrupting the interrupted code's arguments or temporaries. The stack pointer remains valid (no need to establish a new stack), and the return address register can be used to save the interrupted PC.

irq_handler:
    EXX                     ; Swap to alternate R4-R11
    ; ... handle interrupt using R4'-R11' ...
    ; Primary registers preserved automatically
    EXX                     ; Swap back
    RETI                    ; Return from interrupt
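
The effect of EXX on the register file can be sketched in a few lines of Python. This is an illustrative model only: BankedRegisters and its fields are assumed names, and real hardware would more likely flip a bank-select bit than physically copy eight registers.

```python
class BankedRegisters:
    """R4-R11 have shadow copies; every other register is shared."""

    SWAPPED = range(4, 12)  # R4..R11 inclusive

    def __init__(self):
        self.regs = [0] * 16
        self.shadow = [0] * 8  # shadow copies of R4..R11

    def exx(self):
        # Exchange the live R4-R11 with their shadow copies.
        for slot, index in enumerate(self.SWAPPED):
            self.regs[index], self.shadow[slot] = (
                self.shadow[slot],
                self.regs[index],
            )


bank = BankedRegisters()
bank.regs[2] = 0xFE00      # SP, shared between both banks
bank.regs[4] = 0x1111      # A0 of the interrupted code
bank.exx()                 # handler entry
bank.regs[4] = 0xBEEF      # handler freely overwrites R4'
sp_in_handler = bank.regs[2]
bank.exx()                 # handler exit
```

After the second EXX the interrupted code sees its original R4 again, and the stack pointer was valid throughout the handler.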

The Flags Register

Sampo uses an 8-bit flags register with six defined flags:

Bit	Flag	Name	Description
7	N	Negative	Sign bit of result (bit 15)
6	Z	Zero	Result is zero
5	C	Carry	Unsigned overflow / borrow
4	V	Overflow	Signed overflow
3	H	Half-carry	Carry from bit 3 to 4 (for BCD)
2	I	Interrupt	Interrupt enable

The N, Z, C, and V flags follow standard conventions and support the full range of conditional branches. The H (half-carry) flag exists specifically for the DAA instruction, enabling correct BCD arithmetic. The I flag controls interrupt recognition.

Notably, Sampo provides explicit GETF and SETF instructions to read and write the flags register, unlike many RISC architectures that treat flags as implicit state. This supports context switching and debugging.
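
As a rough illustration of how these flags could be derived for a 16-bit ADD, here is a small Python sketch. The bit masks come from the table above; the function name add16_flags is hypothetical, and the exact rules the real core applies (for example, which instructions update H) are assumptions.

```python
N, Z, C, V, H = 0x80, 0x40, 0x20, 0x10, 0x08  # bit masks from the flags table


def add16_flags(a, b):
    """Return (result, flags) for a 16-bit ADD. The I flag is not modelled."""
    a &= 0xFFFF
    b &= 0xFFFF
    total = a + b
    result = total & 0xFFFF
    flags = 0
    if result & 0x8000:
        flags |= N                       # sign bit (bit 15) of the result
    if result == 0:
        flags |= Z
    if total > 0xFFFF:
        flags |= C                       # unsigned overflow out of bit 15
    if (~(a ^ b) & (a ^ result)) & 0x8000:
        flags |= V                       # operands share a sign the result lacks
    if (a & 0xF) + (b & 0xF) > 0xF:
        flags |= H                       # carry from bit 3 into bit 4
    return result, flags
```

For example, adding 1 to 0x7FFF gives 0x8000 with N, V, and H set, while adding 1 to 0xFFFF wraps to zero with Z, C, and H set.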

Memory Model

Sampo uses a straightforward memory model:

Address space: 64KB (16-bit addresses)
Byte-addressable: Individual bytes can be loaded and stored
Little-endian: Multi-byte values stored with LSB at lower address
Word alignment: 16-bit words should be aligned on even addresses (optional enforcement)
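
The little-endian rule is easy to see in a short Python model of memory. The helper names store_word and load_word are illustrative; address wrap-around at 0xFFFF is an assumption, and alignment is not enforced in this sketch.

```python
memory = bytearray(0x10000)  # 64KB, byte-addressable


def store_word(addr, value):
    """Store a 16-bit word with the LSB at the lower address."""
    memory[addr & 0xFFFF] = value & 0xFF
    memory[(addr + 1) & 0xFFFF] = (value >> 8) & 0xFF


def load_word(addr):
    """Load a 16-bit little-endian word."""
    return memory[addr & 0xFFFF] | (memory[(addr + 1) & 0xFFFF] << 8)


store_word(0x8000, 0x1234)
low_byte = memory[0x8000]    # 0x34
high_byte = memory[0x8001]   # 0x12
```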

A suggested memory map divides the 64KB space:

0x0000-0x00FF   Interrupt vectors / Reset
0x0100-0x7FFF   Program ROM (~32KB)
0x8000-0xFEFF   RAM (~32KB)
0xFF00-0xFFFF   Memory-mapped I/O (256 bytes)

This layout provides a clean separation between code, data, and I/O while leaving room for customization. Keeping the reset and interrupt entry points at the bottom of memory echoes the Z80, which also resets to 0x0000; Sampo places its reset vector at 0x0000 and its interrupt vector at 0x0004.
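
An emulator or bus decoder could express the suggested map as a small table. The sketch below is one way to do it in Python; the region names and the region_of helper are illustrative, not part of the Sampo tooling.

```python
REGIONS = [
    (0x0000, 0x00FF, "vectors"),
    (0x0100, 0x7FFF, "rom"),
    (0x8000, 0xFEFF, "ram"),
    (0xFF00, 0xFFFF, "mmio"),
]


def region_of(addr):
    """Return the name of the region that contains a 16-bit address."""
    addr &= 0xFFFF
    for start, end, name in REGIONS:
        if start <= addr <= end:
            return name
    raise ValueError("unmapped address")  # unreachable: the map has no gaps


# Region sizes in bytes, computed from the inclusive bounds.
sizes = {name: end - start + 1 for start, end, name in REGIONS}
```

The computed sizes show why the ROM and RAM regions are labelled "~32KB": each is 0x7F00 (32,512) bytes, because the vector page and the I/O page each take 256 bytes.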

Port-Based I/O

In addition to memory, Sampo provides a separate 256-port I/O address space accessed via IN and OUT instructions. This design directly mirrors the Z80 and enables straightforward porting of code that interacts with serial ports, disk controllers, sound chips, and other peripherals.

The I/O instructions come in two forms:

INI  R4, 0x80       ; Read from port 0x80 (immediate port number)
IN   R4, (R5)       ; Read from port specified in R5
OUTI 0x81, R4       ; Write R4 to port 0x81 (immediate)
OUT  (R5), R4       ; Write R4 to port specified in R5

Extended 32-bit forms (INX, OUTX) allow the full 8-bit port range to be specified in immediate form.
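
To show what "separate from the memory address space" means in practice, here is a hedged Python sketch of a port bus with a toy serial device on ports 0x80 and 0x81. Everything here is an assumption made for illustration: the PortBus class, 8-bit port data, unmapped ports reading as zero, and only the low 8 bits of a register-supplied port number being used.

```python
class PortBus:
    """256-entry I/O space, kept separate from the 64KB memory array."""

    def __init__(self):
        self._readers = {}   # port number -> callable returning a byte
        self._writers = {}   # port number -> callable taking a byte

    def attach(self, port, reader=None, writer=None):
        if reader is not None:
            self._readers[port & 0xFF] = reader
        if writer is not None:
            self._writers[port & 0xFF] = writer

    def port_in(self, port):
        # Assumption: unmapped ports read as 0x00.
        reader = self._readers.get(port & 0xFF)
        return (reader() & 0xFF) if reader else 0x00

    def port_out(self, port, value):
        writer = self._writers.get(port & 0xFF)
        if writer:
            writer(value & 0xFF)


# Toy serial device: status on port 0x80, data on port 0x81.
sent = []
bus = PortBus()
bus.attach(0x80, reader=lambda: 0x02)    # always reports "transmit ready"
bus.attach(0x81, writer=sent.append)

if bus.port_in(0x80) & 0x02:             # like INI R6, 0x80 then a ready test
    bus.port_out(0x81, ord("S"))         # like OUTI 0x81, R5
```

Because ports live in their own dictionary rather than in the memory array, a peripheral never steals addresses from ROM or RAM.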

Instruction Encoding

Sampo uses a clean, regular encoding scheme with 16-bit base instructions and 32-bit extended forms. The 4-bit primary opcode in bits 15:12 determines the instruction category:

Opcode	Category	Description
0x0	ADD	Register addition
0x1	SUB	Register subtraction
0x2	AND	Bitwise AND
0x3	OR	Bitwise OR
0x4	XOR	Bitwise XOR
0x5	ADDI	Add immediate
0x6	LOAD	Load from memory
0x7	STORE	Store to memory
0x8	BRANCH	Conditional branch
0x9	JUMP	Unconditional jump/call
0xA	SHIFT	Shift and rotate
0xB	MULDIV	Multiply/divide/BCD
0xC	MISC	Stack, block ops, compare
0xD	I/O	Port input/output
0xE	SYSTEM	NOP, HALT, interrupts
0xF	EXTENDED	32-bit instructions

Instruction Formats

Six formats cover all instruction types:

Format R (Register-Register):

15       12 11     8 7      4 3      0
+----------+--------+--------+--------+
|  opcode  |   Rd   |  Rs1   |  Rs2   |
+----------+--------+--------+--------+

Used for three-register operations like ADD R4, R5, R6.

Format I (Immediate):

15       12 11     8 7                0
+----------+--------+------------------+
|  opcode  |   Rd   |      imm8        |
+----------+--------+------------------+

Used for operations with 8-bit immediates like ADDI R4, 42.

Format S (Store):

15       12 11     8 7      4 3      0
+----------+--------+--------+--------+
|  opcode  |  imm4  |  Rs1   |  Rs2   |
+----------+--------+--------+--------+

Used for stores where the destination register field holds an offset.

Format B (Branch):

15       12 11     8 7                0
+----------+--------+------------------+
|  opcode  |  cond  |     offset8      |
+----------+--------+------------------+

Used for conditional branches with PC-relative offsets.

Format J (Jump):

15       12 11                       0
+----------+--------------------------+
|  opcode  |        offset12          |
+----------+--------------------------+

Used for unconditional jumps with 12-bit PC-relative offsets.

Format X (Extended):

Word 0:
15       12 11     8 7      4 3      0
+----------+--------+--------+--------+
|   0xF    |   Rd   |  Rs1   |  sub   |
+----------+--------+--------+--------+

Word 1:
15                                   0
+-------------------------------------+
|              imm16                  |
+-------------------------------------+

Used for operations requiring 16-bit immediates or absolute addresses.
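
Because the fields sit at fixed positions, a decoder can slice every 16-bit word the same way and let the opcode decide which slices matter. The Python sketch below illustrates that idea; decode_fields, sign_extend, and is_extended are hypothetical helper names, and treating imm8 and the offsets as two's-complement values is an assumption based on the "signed offsets" wording above.

```python
def decode_fields(word):
    """Split a 16-bit instruction word into the fields the formats share."""
    return {
        "opcode": (word >> 12) & 0xF,
        "rd": (word >> 8) & 0xF,      # Rd, imm4, or cond, depending on format
        "rs1": (word >> 4) & 0xF,
        "rs2": word & 0xF,            # Rs2, or sub-opcode in Format X
        "imm8": word & 0xFF,          # Format I immediate / Format B offset
        "offset12": word & 0xFFF,     # Format J offset
    }


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return ((value & ((1 << bits) - 1)) ^ sign) - sign


def is_extended(word):
    """True when the word is the first half of a 32-bit instruction."""
    return (word >> 12) == 0xF


fields = decode_fields(0x0456)   # opcode 0x0, Rd=4, Rs1=5, Rs2=6
```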

Encoding Examples

To illustrate the encoding scheme, let's examine several instructions:

ADD R4, R5, R6 (R4 = R5 + R6):

Opcode = 0x0, Rd = 4, Rs1 = 5, Rs2 = 6
Binary: 0000 0100 0101 0110 = 0x0456

ADDI R4, 10 (R4 = R4 + 10):

Opcode = 0x5, Rd = 4, imm8 = 10
Binary: 0101 0100 0000 1010 = 0x540A

BEQ +8 (branch forward 8 bytes if equal):

Opcode = 0x8, cond = 0 (BEQ), offset = 4 words
Binary: 1000 0000 0000 0100 = 0x8004

LIX R4, 0x1234 (load 16-bit immediate):

Word 0: 0xF (extended), Rd = 4, Rs = 0, sub = 7 (LIX)
Word 1: 0x1234
Binary: 1111 0100 0000 0111 0001 0010 0011 0100 = 0xF407 0x1234

The regularity of this encoding makes instruction decode straightforward—the first nibble determines the instruction category, and subsequent fields are in consistent positions across formats.
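
The four worked examples can be reproduced with a few shift-and-OR helpers. This Python sketch is not the sasm assembler; the encode_* function names are made up for illustration, and only the field layouts shown in the format diagrams are assumed.

```python
def encode_r(opcode, rd, rs1, rs2):
    """Format R: opcode | Rd | Rs1 | Rs2."""
    return (opcode << 12) | (rd << 8) | (rs1 << 4) | rs2


def encode_i(opcode, rd, imm8):
    """Format I: opcode | Rd | imm8."""
    return (opcode << 12) | (rd << 8) | (imm8 & 0xFF)


def encode_b(opcode, cond, offset_words):
    """Format B: opcode | cond | offset8 (offset counted in words)."""
    return (opcode << 12) | (cond << 8) | (offset_words & 0xFF)


def encode_x(rd, rs1, sub, imm16):
    """Format X: two words, 0xF prefix word followed by imm16."""
    return [(0xF << 12) | (rd << 8) | (rs1 << 4) | sub, imm16 & 0xFFFF]


add_word = encode_r(0x0, 4, 5, 6)        # ADD R4, R5, R6
addi_word = encode_i(0x5, 4, 10)         # ADDI R4, 10
beq_word = encode_b(0x8, 0, 8 // 2)      # BEQ +8 bytes = 4 words
lix_words = encode_x(4, 0, 7, 0x1234)    # LIX R4, 0x1234
```

Each helper yields the same hexadecimal value as the corresponding hand-encoded example: 0x0456, 0x540A, 0x8004, and the pair 0xF407 0x1234.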

The Instruction Set

Sampo provides approximately 66 distinct instructions, organized into ten categories.

Arithmetic (15 instructions)

The arithmetic category includes standard operations (ADD, SUB, ADDI) plus multiply/divide support:

MUL: 16×16 multiplication, low 16 bits of result
MULH/MULHU: High 16 bits of 32-bit product (signed/unsigned)
DIV/DIVU: Integer division (signed/unsigned)
REM/REMU: Remainder (signed/unsigned)
DAA: Decimal adjust for BCD arithmetic
NEG: Two's complement negation
CMP: Compare (subtract without storing result)

Hardware multiply and divide matter for interpreter performance: formatting a 32-bit value in decimal means dividing by 10 once per digit, and without a divide instruction every one of those divisions becomes a software shift-and-subtract loop.
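
The split between MUL and MULH/MULHU, and the divide-by-10 pattern used in number formatting, can be sketched in Python as follows. The function names mirror the mnemonics but are illustrative only; the sketch formats a single 16-bit value and does not model division by zero or signed division, whose behavior the article does not specify.

```python
def to_signed16(x):
    """Reinterpret a 16-bit pattern as a signed value."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x


def mul(a, b):
    """MUL: low 16 bits of the product (same for signed and unsigned)."""
    return ((a & 0xFFFF) * (b & 0xFFFF)) & 0xFFFF


def mulhu(a, b):
    """MULHU: high 16 bits of the unsigned 32-bit product."""
    return ((a & 0xFFFF) * (b & 0xFFFF)) >> 16


def mulh(a, b):
    """MULH: high 16 bits of the signed 32-bit product."""
    return ((to_signed16(a) * to_signed16(b)) >> 16) & 0xFFFF


def format_decimal(value):
    """Format a 16-bit unsigned value with one DIVU/REMU pair per digit."""
    value &= 0xFFFF
    digits = []
    while True:
        value, rem = divmod(value, 10)   # quotient like DIVU, remainder like REMU
        digits.append(chr(ord("0") + rem))
        if value == 0:
            break
    return "".join(reversed(digits))
```

Note how 0xFFFF times 0xFFFF has a high word of 0xFFFE when unsigned but 0x0000 when signed (since it is -1 times -1), which is why both MULH and MULHU exist.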

Logic (6 instructions)

Standard bitwise operations: AND, OR, XOR, NOT, plus immediate forms ANDI and ORI.

Shift and Rotate (16 variants)

Sampo provides an unusually rich set of shift operations:

SLL/SRL/SRA: Shift left/right logical/arithmetic
ROL/ROR: Rotate left/right
RCL/RCR: Rotate through carry (17-bit rotation)
SWAP: Swap high and low bytes

Each shift type comes in three shift amounts: 1, 4, and 8 bits. The 4-bit shift is particularly useful for hexadecimal digit extraction and insertion. Variable shifts use the extended format with the shift amount in the second register or immediate field.
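
A few of these operations are sketched below in Python to show the intended bit movement: SWAP, a rotate left, a one-bit rotate through carry, and hex digit extraction built from 4-bit rotates. The function names are illustrative, and the exact flag effects of the real instructions are not modelled.

```python
def swap(x):
    """SWAP: exchange the high and low bytes of a 16-bit value."""
    x &= 0xFFFF
    return ((x << 8) | (x >> 8)) & 0xFFFF


def rol(x, n):
    """ROL: rotate a 16-bit value left by n bits."""
    x &= 0xFFFF
    n %= 16
    return ((x << n) | (x >> (16 - n))) & 0xFFFF


def rcl1(x, carry):
    """RCL by one bit: 17-bit rotation through carry. Returns (result, carry)."""
    x &= 0xFFFF
    return ((x << 1) | (carry & 1)) & 0xFFFF, (x >> 15) & 1


def hex_digits(x):
    """Extract four hex digits, most significant first, using 4-bit rotates."""
    out = []
    for _ in range(4):
        x = rol(x, 4)                       # top nibble moves to the bottom
        out.append("0123456789ABCDEF"[x & 0xF])
    return "".join(out)
```

Applying rcl1 seventeen times returns the original value and carry, which is what "17-bit rotation" means: the carry flag acts as a seventeenth bit in the ring.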

Load/Store (6 instructions)

Memory access instructions include word and byte loads (with sign or zero extension), word and byte stores, and LUI (Load Upper Immediate) for constructing 16-bit constants:

LUI  R4, 0x12       ; R4 = 0x1200
ORI  R4, R4, 0x34   ; R4 = 0x1234

Branch (16 conditions)

Sampo supports a comprehensive set of branch conditions:

BEQ/BNE: Equal/not equal
BLT/BGE/BGT/BLE: Signed comparisons
BLTU/BGEU/BHI/BLS: Unsigned comparisons
BMI/BPL: Negative/positive
BVS/BVC: Overflow set/clear
BCS/BCC: Carry set/clear

This covers all reasonable comparison outcomes for both signed and unsigned arithmetic.
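
How the sixteen conditions map onto the N, Z, C, and V flags is not spelled out here, so the Python sketch below uses the conventional definitions under one stated assumption: after a compare, C means "borrow" (as the flags table suggests), so unsigned "less than" is C set. The names cmp_flags and taken are hypothetical.

```python
N, Z, C, V = 0x80, 0x40, 0x20, 0x10  # flag masks from the flags table


def cmp_flags(a, b):
    """Flags after CMP a, b (computes a - b), with C meaning borrow."""
    a &= 0xFFFF
    b &= 0xFFFF
    result = (a - b) & 0xFFFF
    flags = 0
    if result & 0x8000:
        flags |= N
    if result == 0:
        flags |= Z
    if a < b:
        flags |= C                           # unsigned borrow
    if (a ^ b) & (a ^ result) & 0x8000:
        flags |= V                           # signed overflow
    return flags


def taken(cond, flags):
    """Return True if the named branch would be taken for these flags."""
    n, z, c, v = (bool(flags & mask) for mask in (N, Z, C, V))
    table = {
        "BEQ": z, "BNE": not z,
        "BLT": n != v, "BGE": n == v,
        "BGT": (not z) and n == v, "BLE": z or n != v,
        "BLTU": c, "BGEU": not c,
        "BHI": (not c) and (not z), "BLS": c or z,
        "BMI": n, "BPL": not n,
        "BVS": v, "BVC": not v,
        "BCS": c, "BCC": not c,
    }
    return table[cond]
```

Comparing 0x8000 with 1 shows why both families are needed: BLT is taken (as signed values, -32768 is less than 1) while BHI is also taken (as unsigned values, 32768 is greater than 1).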

Jump/Call (4 instructions)

J: PC-relative unconditional jump
JAL: Jump and link (save return address in RA)
JR: Jump to address in register
JALR: Jump and link to register address

Block Operations (6 instructions)

The block operations use a fixed register convention (R4=count, R5=source, R6=destination):

LDI/LDD: Copy a single byte from source to destination, then increment (LDI) or decrement (LDD) both pointers and decrement the count
LDIR/LDDR: Repeat until count reaches zero
FILL: Fill memory region with value
CPIR: Compare and search forward

These instructions are decidedly un-RISC—they're multi-cycle operations that modify multiple registers. But they're implemented with predictable behavior (always the same registers, always the same algorithm) and provide enormous speedups for common memory operations.
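
The "same registers, same algorithm" behavior of LDIR can be written out as a short loop. The Python sketch below follows the register convention stated above (R4=count, R5=source, R6=destination); the function name and the choice to test the count before each copy (so a count of zero copies nothing) are assumptions, not confirmed Sampo behavior.

```python
def ldir(regs, memory):
    """Copy R4 bytes from [R5] to [R6], ascending. Mutates regs and memory."""
    while regs[4] != 0:
        memory[regs[6]] = memory[regs[5]]
        regs[5] = (regs[5] + 1) & 0xFFFF     # source pointer
        regs[6] = (regs[6] + 1) & 0xFFFF     # destination pointer
        regs[4] = (regs[4] - 1) & 0xFFFF     # remaining count


memory = bytearray(0x10000)
memory[0x8000:0x8005] = b"Sampo"

regs = [0] * 16
regs[4], regs[5], regs[6] = 5, 0x8000, 0x9000
ldir(regs, memory)
copied = bytes(memory[0x9000:0x9005])
```

After the instruction completes, the count is zero and both pointers sit one past the last byte copied, so a following LDIR can continue where this one stopped.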

Stack (4 instructions)

PUSH/POP: Single register push/pop
PUSHM/POPM: Push/pop multiple registers (via bitmask)

I/O (4 instructions)

INI/OUTI: Immediate port address
IN/OUT: Register port address

System (9 instructions)

NOP: No operation
HALT: Stop processor
DI/EI: Disable/enable interrupts
EXX: Exchange alternate registers
RETI: Return from interrupt
SWI: Software interrupt
SCF/CCF: Set/complement carry flag
GETF/SETF: Read/write flags register

Comparison with Other Architectures

To put Sampo in context, consider how it compares with related processors:

Aspect	Z80	MIPS	RISC-V	Sampo
Word size	8-bit	32-bit	32/64-bit	16-bit
Instruction width	1-4 bytes	4 bytes	2/4 bytes	2/4 bytes
Registers	8 + alternates	32	32	16 + alternates
Zero register	No	$zero	x0	R0
I/O model	Port-based	Memory-mapped	Memory-mapped	Port-based
Block operations	Yes	No	No	Yes
Instruction count	~300+	~60	~50 base	~66

Sampo sits in an interesting position: more regular than the Z80 but with Z80-friendly features, smaller and simpler than 32-bit RISC but still cleanly orthogonal.

Code Examples

To demonstrate how Sampo assembly looks in practice, here's a "Hello World" program that outputs text via a serial port:

        .org 0x0100

.equ    ACIA_STATUS 0x80
.equ    ACIA_DATA   0x81
.equ    TX_READY    0x02

start:
        LIX  R4, message        ; Load address of string

loop:
        LBU  R5, (R4)           ; Load byte from string
        CMP  R5, R0             ; Compare with zero
        BEQ  done               ; If null terminator, done

wait_tx:
        INI  R6, ACIA_STATUS    ; Read serial status port
        ANDI R6, R6, TX_READY   ; Check transmit ready bit
        BEQ  wait_tx            ; Wait if not ready

        OUTI ACIA_DATA, R5      ; Write character to data port
        ADDI R4, 1              ; Next character
        J    loop
done:
        HALT

message:
        .asciz "Hello, Sampo!\n"

And here's a Fibonacci function demonstrating the calling convention:

; fib(n) - compute nth Fibonacci number
; Input: R4 (A0) = n
; Output: R4 (A0) = fib(n)

fib:
        ADD  R5, R0, R0     ; a = 0 (zero-register clear idiom)
        LIX  R6, 1          ; b = 1
        CMP  R4, R0
        BEQ  fib_done

fib_loop:
        ADD  R7, R5, R6     ; temp = a + b
        ADD  R5, R6, R0     ; a = b
        ADD  R6, R7, R0     ; b = temp
        ADDI R4, -1         ; n-- (BNE below relies on ADDI updating Z)
        BNE  fib_loop

fib_done:
        ADD  R4, R5, R0     ; return a
        JR   RA

The code reads naturally to anyone familiar with RISC assembly, while the I/O instructions and register conventions provide the Z80-like feel that makes porting classic software straightforward.
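
As a cross-check on the Fibonacci routine, here is the same loop in Python with 16-bit wrap-around applied where the hardware would truncate. The function name fib16 is illustrative; it mirrors the assembly's structure (test n, then iterate) rather than any code from the Sampo repository.

```python
def fib16(n):
    """Iterative Fibonacci with results truncated to 16 bits."""
    a, b = 0, 1                        # R5 = a, R6 = b
    n &= 0xFFFF
    while n != 0:                      # CMP/BEQ on entry, BNE at loop end
        a, b = b, (a + b) & 0xFFFF     # temp = a + b; a = b; b = temp
        n -= 1
    return a                           # returned in R4
```

The largest Fibonacci number that fits in a 16-bit register is fib(24) = 46368; from fib(25) = 75025 onward the routine returns the value modulo 65536.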

Looking Ahead: FPGA Implementation

With the architecture defined, the next step is implementation. In Part 2 of this series, we'll build a working Sampo processor using Amaranth HDL, a modern Python-based hardware description language. We'll cover:

The ALU module: Implementing all arithmetic and logic operations
The register file: Including the alternate register set and zero register
The instruction decoder: Parsing the various instruction formats
The control unit: Managing the fetch-decode-execute cycle
The memory interface: Connecting to block RAM
The I/O subsystem: Implementing the port-based I/O model
Integration: Putting it all together into a working system-on-chip

We'll synthesize the design for an affordable FPGA board and run actual Sampo programs, demonstrating that this architecture isn't just a paper exercise but a real, working processor.

The Sampo project on GitHub includes a complete Rust-based assembler (sasm) and emulator (semu) with a TUI debugger, so you can start writing and testing Sampo programs today. The FPGA implementation will let you run those same programs on real hardware, completing the journey from mythological artifact to silicon reality.

Stay tuned for Part 2, where we'll forge our own Sampo—not from swan feathers and barley, but from lookup tables and flip-flops.