Part 2: Implementing Sampo on the ULX3S FPGA

A.C. Jokela

2026-02-02

After designing the Sampo RISC architecture on paper—complete with a working assembler and emulator—it's time to bring it to life in silicon. Or at least, in programmable logic. This post documents the hardware selection and implementation planning for synthesizing Sampo on an FPGA.

The Story So Far

If you haven't read Part 1 of this series, here's the quick version: Sampo is a 16-bit RISC CPU designed to bridge the gap between clean RISC design principles and Z80-friendly features. It has 16 general-purpose registers, ~66 instructions, port-based I/O, block operations (LDIR, LDDR), alternate registers for fast interrupt handling, and hardware multiply/divide.

The project already includes working tools written in Rust:

sasm - A full assembler
semu - An emulator with TUI debugger (step, breakpoints, memory inspection)

And for hardware implementation, we now have two complete RTL implementations:

Amaranth HDL (/rtl/):

cpu.py, alu.py, decode.py, regfile.py, soc.py
Python-based, excellent for rapid iteration
Generates Verilog for synthesis

AI-assisted hand-written Verilog (/verilog/rtl/):

cpu.v, alu.v, decode.v, regfile.v, shifter.v, uart.v, ram.v, soc.v
Readable, portable, works with any toolchain
Includes testbenches for Icarus Verilog and Verilator

Now it's time to synthesize it to real hardware.

Choosing an FPGA Platform

The FPGA world is split between proprietary toolchains (Xilinx Vivado, Intel Quartus) and the growing open source ecosystem. For a project like Sampo, where understanding every layer of the stack matters, open source tooling is the clear choice.

Open Source FPGA Options

FPGA Family	Capacity	Toolchain	Maturity
Gowin GW1N/GW2A	1K-55K LUTs	Project Apicula	Good
Lattice iCE40	1K-8K LUTs	Project IceStorm	Excellent
Lattice ECP5	12K-84K LUTs	Project Trellis	Excellent
Xilinx 7-series	10K-200K+ LUTs	Project X-Ray (partial)	Experimental

For Sampo, which is estimated at ~1,500-2,500 LUTs for the basic CPU, even the smaller FPGAs have more than enough capacity. But if we want room to grow (caches, more peripherals, maybe even multi-core experiments), a larger device makes sense.

The ULX3S Board

The ULX3S is an open hardware development board built around the ECP5 FPGA. It's designed by Radiona.org and has become one of the most widely used boards for open source FPGA development.

Specifications

Component	Specification
FPGA	Lattice ECP5 (LFE5U-12F/45F/85F-6BG381C)
LUTs	12K / 44K / 84K (depending on variant)
USB	FTDI FT231XS (500 kbit/s JTAG, 3 Mbit/s serial)
GPIO	56 pins (28 differential pairs), PMOD-compatible
RAM	32 MB SDRAM @ 166 MHz
Flash	4-16 MB Quad-SPI
Storage	microSD slot
LEDs	11 total (8 user, 2 USB, 1 WiFi)
Buttons	7 (4 direction, 2 fire, 1 power)
Audio	3.5mm jack (stereo + digital/composite)
Video	GPDI (HDMI-compatible) with level shifter
Display	Header for 0.96" SPI OLED (SSD1331)
Wireless	ESP32-WROOM-32 (WiFi/Bluetooth, standalone JTAG)
ADC	8 channels, 12-bit, 1 MS/s (MAX11125)
Clock	25 MHz onboard, differential input available
Power	3 switching regulators (1.1V, 2.5V, 3.3V)
Sleep	5 µA standby, RTC wake-up with battery backup
Dimensions	94mm × 51mm

Why ULX3S for Sampo

The ULX3S isn't just an FPGA breakout board—it's a complete system:

32MB SDRAM: Real memory, not just block RAM. Essential for running actual programs.
HDMI output: Video terminal without external hardware.
microSD slot: Load programs, implement a filesystem.
ESP32 co-processor: WiFi-based JTAG debugging from any device.
Buttons and LEDs: Instant I/O for testing without wiring anything.
Audio output: Even supports composite video through the audio jack.

Budget Alternative: Tang Nano 9K

Before we dive into the ULX3S, it's worth mentioning a much cheaper option. The Tang Nano 9K (~$15 on AliExpress) uses a Gowin GW1NR-9 FPGA with 8,640 LUTs—more than enough for Sampo:

8,640 LUTs
64 Mbit (8 MB) PSRAM (covers the full 64KB address space with room to spare)
HDMI output for a video terminal
USB-C programming
Fully supported by open-source toolchain (Yosys + nextpnr-gowin)

For initial development and testing, the Tang Nano 9K is hard to beat on price. But the ULX3S offers more I/O, more RAM, and a richer peripheral set—making it the better choice for a more complete Sampo system.
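
As a quick sanity check on the PSRAM figure, the arithmetic can be worked through in a few lines of Python. This is only an illustrative calculation using the numbers quoted above; the constant names are made up for the example.

```python
# Capacity check: 64 Mbit PSRAM versus Sampo's 64 KB address space.
# Constant names are illustrative, not taken from the Sampo sources.
PSRAM_BITS = 64 * 1024 * 1024        # 64 Mbit, as quoted for the Tang Nano 9K
ADDRESS_SPACE_BYTES = 64 * 1024      # the 64 KB address space from the text

psram_bytes = PSRAM_BITS // 8        # convert bits to bytes
banks = psram_bytes // ADDRESS_SPACE_BYTES

print(f"PSRAM: {psram_bytes // (1024 * 1024)} MiB")
print(f"Holds {banks} copies of the 64 KB address space")
```

So the PSRAM could in principle hold the whole address space many times over, which leaves room for banking experiments later.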

LUT Budget Planning

The Sampo RTL implementation is designed to be compact. Here's the estimated resource breakdown:

Component	Estimated LUTs (unless noted)
16 × 16-bit registers	~256 FFs
ALU (16-bit)	200 - 400
Control logic	500 - 1,000
Instruction decode	300 - 500
Sampo CPU core	~1,500 - 2,500
UART (115200 baud)	200 - 300
SPI controller (SD card)	300 - 500
GPIO controller	200 - 400
Basic system	~2,500 - 4,000
SDRAM controller	500 - 1,000
Instruction cache	1,000 - 2,000
Data cache	1,000 - 2,000
Full system	~6,000 - 10,000

These estimates are based on typical RISC CPU implementations. The actual numbers will depend on optimization choices and synthesis settings.

Variant Recommendations

12K LUTs (ULX3S-12F): Plenty for basic Sampo + peripherals, tight for caches.
44K LUTs (ULX3S-45F): Comfortable. Full CPU with cache, room for experiments.
84K LUTs (ULX3S-85F): Luxurious. Multi-core experiments, extensive peripherals.
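
To put rough numbers on these recommendations, here is a small sketch that divides the system estimates from the budget table by the LUT counts in the specifications table. The dictionary and function names are invented for the example, and the inputs are the same rough estimates as above, so the percentages are only indicative.

```python
# Rough LUT utilization per ULX3S variant, using the estimate ranges
# and LUT counts quoted in this post. All names are illustrative.
VARIANT_LUTS = {"12F": 12_000, "45F": 44_000, "85F": 84_000}
SYSTEM_ESTIMATES = {
    "basic system": (2_500, 4_000),
    "full system": (6_000, 10_000),
}

def utilization(luts_needed: int, luts_available: int) -> float:
    """Return utilization as a percentage of the available LUTs."""
    return 100.0 * luts_needed / luts_available

for variant, available in VARIANT_LUTS.items():
    for name, (low, high) in SYSTEM_ESTIMATES.items():
        low_pct = utilization(low, available)
        high_pct = utilization(high, available)
        print(f"{variant}: {name:12s} {low_pct:5.1f}% - {high_pct:5.1f}%")
```

On these estimates the full system would land somewhere between half and most of the 12F, which matches the "tight for caches" verdict, while the two larger parts stay well under a quarter full.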

Toolchain Setup

The ECP5 toolchain is fully open source:

# macOS (Homebrew)
brew install yosys nextpnr-ecp5 ecpprog fujprog

# Ubuntu/Debian
sudo apt install yosys nextpnr-ecp5 ecpprog

# Amaranth HDL (for our existing RTL)
pip install amaranth amaranth-boards

# Or build FPGA tools from source for latest features
git clone https://github.com/YosysHQ/yosys
git clone https://github.com/YosysHQ/nextpnr
git clone https://github.com/YosysHQ/prjtrellis

Tool Roles

Tool	Purpose
Amaranth	Python-based HDL (generates Verilog)
Yosys	Verilog synthesis (RTL → netlist)
nextpnr-ecp5	Place and route (netlist → textual config)
Project Trellis	ECP5 bitstream documentation and tools (ecppack: config → bitstream)
ecpprog/fujprog	Upload bitstream to board

Amaranth Build Flow

For the Amaranth version of Sampo's RTL, the build flow starts with Python:

# Generate Verilog from Amaranth
cd rtl/
python -m amaranth generate soc.py > sampo.v

# Then synthesize with standard tools
yosys -p "synth_ecp5 -top sampo_soc -json sampo.json" sampo.v
nextpnr-ecp5 --85k --package CABGA381 \
    --lpf ulx3s.lpf --json sampo.json --textcfg sampo.config
ecppack sampo.config sampo.bit

# Program the board
fujprog sampo.bit
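
The synthesis, place-and-route, and pack steps above can also be wrapped in a small Python driver, which is convenient when switching between board variants. The sketch below is hypothetical: build_commands() and run_build() are made-up helpers, and the device flags other than --85k are assumptions that should be checked against nextpnr-ecp5 --help.

```python
# Sketch of a build driver that assembles the same commands as above.
# build_commands() and run_build() are hypothetical helpers.
import subprocess

# Device flags per ULX3S variant; only --85k appears in the flow above,
# the other two are assumptions to verify with `nextpnr-ecp5 --help`.
DEVICE_FLAGS = {"12F": "--12k", "45F": "--45k", "85F": "--85k"}

def build_commands(variant="85F", top="sampo_soc", name="sampo",
                   lpf="ulx3s.lpf"):
    """Return the synth, place-and-route, and pack commands as argv lists."""
    return [
        ["yosys", "-p", f"synth_ecp5 -top {top} -json {name}.json",
         f"{name}.v"],
        ["nextpnr-ecp5", DEVICE_FLAGS[variant], "--package", "CABGA381",
         "--lpf", lpf, "--json", f"{name}.json",
         "--textcfg", f"{name}.config"],
        ["ecppack", f"{name}.config", f"{name}.bit"],
    ]

def run_build(variant="85F"):
    """Run each step in order, stopping at the first failure."""
    for cmd in build_commands(variant):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # Dry run: print the commands without invoking the tools.
    for cmd in build_commands():
        print(" ".join(cmd))
```

Running the script as-is only prints the commands; calling run_build() would execute them and requires the toolchain to be installed.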

Hand-Written Verilog Implementation

In addition to the Amaranth RTL, we now have a complete AI-assisted hand-written Verilog implementation at /verilog/. While Amaranth can generate Verilog, the auto-generated output isn't particularly readable. The hand-written version is designed for clarity and portability:

verilog/
├── rtl/
│   ├── sampo_pkg.vh   # Opcodes, constants, state definitions
│   ├── alu.v          # 16-bit ALU with all operations
│   ├── shifter.v      # Barrel shifter (1/4/8-bit shifts, rotates)
│   ├── regfile.v      # 16 registers + alternate set (EXX)
│   ├── decode.v       # Instruction decoder
│   ├── cpu.v          # FSM-based CPU core (8 states)
│   ├── ram.v          # 64KB synchronous RAM
│   ├── uart.v         # Simple UART for serial I/O
│   └── soc.v          # Top-level SoC integration
├── tb/
│   ├── alu_tb.v       # ALU unit tests
│   ├── regfile_tb.v   # Register file tests
│   └── sampo_tb.v     # Full system testbench
├── programs/
│   └── hello.hex      # Test program in Verilog hex format
├── Makefile           # Build automation
└── bin2hex.py         # Convert sasm output to Verilog $readmemh format
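
For readers unfamiliar with the $readmemh format, the following is a minimal sketch of what a binary-to-hex converter of this kind might look like. It is not the project's bin2hex.py: the function name, the byte-wide memory layout, and the use of an address marker are all assumptions made for illustration.

```python
# Hypothetical sketch of a binary-to-$readmemh converter. The real
# bin2hex.py may differ; a byte-wide memory array is assumed here.
def bin_to_readmemh(data: bytes, origin: int = 0) -> str:
    """Format raw bytes as $readmemh text, one two-digit hex byte per line.

    The leading '@' line sets the hex start address for the data after it.
    """
    lines = [f"@{origin:04X}"]
    lines.extend(f"{byte:02X}" for byte in data)
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    image = bytes([0x12, 0x34, 0xAB])   # stand-in for assembler output
    print(bin_to_readmemh(image, origin=0x0100), end="")
```

A memory that is 16 bits wide would instead need two bytes combined per line, in whichever byte order the RAM module expects.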

The Verilog implementation uses an 8-state FSM for the CPU: RESET → FETCH → FETCH_EXT → DECODE → EXECUTE → MEMORY → WRITEBACK → HALTED. Stepping every instruction through explicit, named states makes timing predictable and debugging straightforward.
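
To illustrate why an explicit FSM gives predictable timing, here is a toy Python model of such a state machine. Only the state names come from the design; every transition condition (has_ext_word, is_mem_op, is_halt) is an assumption made for this sketch and may not match what cpu.v actually does.

```python
# Illustrative model of an eight-state CPU FSM. State names come from the
# post; every transition condition below is an assumption for the sketch.
from enum import Enum, auto

class State(Enum):
    RESET = auto()
    FETCH = auto()
    FETCH_EXT = auto()
    DECODE = auto()
    EXECUTE = auto()
    MEMORY = auto()
    WRITEBACK = auto()
    HALTED = auto()

def next_state(state, *, has_ext_word=False, is_mem_op=False, is_halt=False):
    """Return the next state for one clock edge (hypothetical transitions)."""
    if state is State.RESET:
        return State.FETCH
    if state is State.FETCH:
        return State.FETCH_EXT if has_ext_word else State.DECODE
    if state is State.FETCH_EXT:
        return State.DECODE
    if state is State.DECODE:
        return State.HALTED if is_halt else State.EXECUTE
    if state is State.EXECUTE:
        return State.MEMORY if is_mem_op else State.WRITEBACK
    if state is State.MEMORY:
        return State.WRITEBACK
    if state is State.WRITEBACK:
        return State.FETCH
    return State.HALTED  # HALTED holds until reset

def cycles_per_instruction(**flags):
    """Count clock edges from FETCH until the FSM reaches FETCH or HALTED."""
    state, cycles = State.FETCH, 0
    while True:
        state = next_state(state, **flags)
        cycles += 1
        if state in (State.FETCH, State.HALTED):
            return cycles
```

Under these assumed transitions the cycle count depends only on the instruction's flags, for example cycles_per_instruction(is_mem_op=True), which is the kind of predictability a fixed state sequence provides. The real cycle counts depend on the actual RTL.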

Simulation with Icarus Verilog

The Verilog implementation includes a complete Makefile for testing:

cd verilog/

# Run the main simulation (hello world)
make test

# Run ALU unit tests
make test-alu

# Run register file tests
make test-regfile

# Build with Verilator (faster simulation)
make verilate

# View waveforms in GTKWave
make wave

Sample output from make test:

=== Sampo CPU Testbench ===
RAM init file: ../programs/hello.hex

CPU started at PC=0x0100
UART output:
----------------------------------------
Hello, Sampo!
----------------------------------------

Simulation complete:
  Final PC:    0x011E
  Cycles:      847
  UART chars:  14
  Status:      HALTED

The Verilog version is portable to any FPGA toolchain—Xilinx, Intel, Lattice, Gowin—without requiring Amaranth or Python in the build chain.

Implementation Roadmap

With both Amaranth and Verilog implementations complete and tested in simulation, the roadmap is now about bringing them up on hardware.

Phase 1: Core Bring-up ✓ (Complete)

✓ Instruction fetch and decode
✓ ALU operations (all 16 operations)
✓ Barrel shifter (1/4/8-bit shifts, rotates, RCL/RCR)
✓ Register file with alternate set (EXX)
✓ FSM-based CPU core (8 states)
✓ RAM interface (64KB)
✓ UART for serial I/O
✓ SoC integration
✓ Testbenches passing (ALU, regfile, full system)
✓ Hello World runs in simulation

Phase 1.5: FPGA Bring-up (Current)

○ ULX3S pin constraints (.lpf file)
○ Clock setup (PLL from 25MHz)
○ Map UART to FTDI
○ LED heartbeat / debug outputs
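
Two of these bring-up items come down to simple clock arithmetic. The sketch below works out a UART bit divisor and an LED heartbeat rate, assuming the design runs directly from the 25 MHz oscillator at 115200 baud; a PLL output would change the clock figure. The helper names are illustrative.

```python
# Clock arithmetic for bring-up, using the 25 MHz oscillator and the
# 115200 baud rate mentioned in this post. Helper names are illustrative.
CLOCK_HZ = 25_000_000
BAUD = 115_200

def uart_divisor(clock_hz: int, baud: int) -> int:
    """Clock cycles per UART bit, rounded to the nearest integer."""
    return round(clock_hz / baud)

def baud_error_percent(clock_hz: int, baud: int) -> float:
    """Relative error of the achieved baud rate, in percent."""
    actual = clock_hz / uart_divisor(clock_hz, baud)
    return 100.0 * (actual - baud) / baud

def heartbeat_hz(clock_hz: int, counter_bit: int) -> float:
    """Blink rate (full on/off cycles per second) of one counter bit."""
    return clock_hz / 2 ** (counter_bit + 1)

print("UART divisor:", uart_divisor(CLOCK_HZ, BAUD))
print(f"Baud error:   {baud_error_percent(CLOCK_HZ, BAUD):.4f}%")
print(f"Heartbeat:    {heartbeat_hz(CLOCK_HZ, 24):.3f} Hz on counter bit 24")
```

At 25 MHz the divisor comes out to 217 with an error well under 0.01%, and bit 24 of a free-running counter blinks at a little under once per second, which is a comfortable rate for a visible heartbeat.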

Phase 2: Memory System

SDRAM controller for 32MB RAM
Instruction cache (optional but helps timing)
Basic interrupt handling

Phase 3: Peripherals

SPI controller for SD card boot
GPIO controller (buttons, LEDs)
Timer/counter module

Phase 4: Advanced Features

Data cache
MMU for memory protection
HDMI text console (VGA timing → GPDI)
ESP32 WiFi integration for wireless debugging

Recommended Tools & Books

Hardware

Tang Nano 9K FPGA - Budget-friendly FPGA board (~$25 on Amazon, ~$15 on AliExpress)
USB Logic Analyzer - Essential for debugging signals (24MHz, 8 channels)

Books

If you're new to Verilog or FPGA development, these are excellent starting points:

Getting Started with FPGAs by Russell Merrick - Beginner-friendly with Verilog and VHDL examples
Programming FPGAs: Getting Started with Verilog by Simon Monk - Practical hands-on guide
Verilog by Example by Blaine Readler - Concise reference for working engineers

Resources

Sampo on GitHub - Full source including assembler, emulator, and RTL
ULX3S GitHub - Schematics, examples, documentation
Project Trellis - ECP5 bitstream documentation
Amaranth HDL - Python-based hardware description
nextpnr - Place and route tool
Yosys - Verilog synthesis

Where to Buy

ULX3S:

AliExpress - ~$100-150 depending on variant
Mouser - Official distribution
Crowd Supply - Original campaign page

Tang Nano 9K (budget alternative):

Amazon - ~$25, faster shipping
AliExpress - ~$15, slower shipping

Next up: Getting our first instructions executing on real hardware. Both the Amaranth and Verilog implementations are ready and tested—Hello World runs in simulation and the testbenches pass. Now it's a matter of pin constraints, clock domains, and debugging the inevitable timing issues.