After designing the Sampo RISC architecture on paper—complete with a working assembler and emulator—it's time to bring it to life in silicon. Or at least, in programmable logic. This post documents the hardware selection and implementation planning for synthesizing Sampo on an FPGA.

The Story So Far

If you haven't read Part 1 of this series, here's the quick version: Sampo is a 16-bit RISC CPU designed to bridge the gap between clean RISC design principles and Z80-friendly features. It has 16 general-purpose registers, ~66 instructions, port-based I/O, block operations (LDIR, LDDR), alternate registers for fast interrupt handling, and hardware multiply/divide.

The project already includes working tools written in Rust:

  • sasm - A full assembler
  • semu - An emulator with TUI debugger (step, breakpoints, memory inspection)

For hardware, there are now two complete RTL implementations:

Amaranth HDL (/rtl/):

  • cpu.py, alu.py, decode.py, regfile.py, soc.py
  • Python-based, excellent for rapid iteration
  • Generates Verilog for synthesis

AI-assisted hand-written Verilog (/verilog/rtl/):

  • cpu.v, alu.v, decode.v, regfile.v, shifter.v, uart.v, ram.v, soc.v
  • Readable, portable, works with any toolchain
  • Includes testbenches for Icarus Verilog and Verilator

Now it's time to synthesize it to real hardware.

Choosing an FPGA Platform

The FPGA world is split between proprietary toolchains (Xilinx Vivado, Intel Quartus) and the growing open source ecosystem. For a project like Sampo, where understanding every layer of the stack matters, open source tooling is the clear choice.

Open Source FPGA Options

FPGA Family      | Capacity       | Toolchain               | Maturity
Gowin GW1N/GW2A  | 1K-55K LUTs    | Project Apicula         | Good
Lattice iCE40    | 1K-8K LUTs     | Project IceStorm        | Excellent
Lattice ECP5     | 12K-85K LUTs   | Project Trellis         | Excellent
Xilinx 7-series  | 10K-200K+ LUTs | Project X-Ray (partial) | Experimental

The basic Sampo CPU is estimated at ~1,500-2,500 LUTs, so even the smaller FPGAs have more than enough capacity. But for room to grow (caches, more peripherals, maybe even multi-core experiments), a larger device makes sense.

The ULX3S Board

The ULX3S is an open hardware development board built around the ECP5 FPGA. It's designed by Radiona.org and has become a popular reference board for open source FPGA development.

Specifications

Component  | Specification
FPGA       | Lattice ECP5 (LFE5U-85F/45F/12F-6BG381C)
LUTs       | 12K / 44K / 84K (depending on variant)
USB        | FTDI FT231XS (500 kbit/s JTAG, 3 Mbit/s serial)
GPIO       | 56 pins (28 differential pairs), PMOD-compatible
RAM        | 32 MB SDRAM @ 166 MHz
Flash      | 4-16 MB Quad-SPI
Storage    | microSD slot
LEDs       | 11 total (8 user, 2 USB, 1 WiFi)
Buttons    | 7 (4 direction, 2 fire, 1 power)
Audio      | 3.5mm jack (stereo + digital/composite)
Video      | GPDI (HDMI-compatible) with level shifter
Display    | Header for 0.96" SPI OLED (SSD1331)
Wireless   | ESP32-WROOM-32 (WiFi/Bluetooth, standalone JTAG)
ADC        | 8 channels, 12-bit, 1 MS/s (MAX11125)
Clock      | 25 MHz onboard, differential input available
Power      | 3 switching regulators (1.1V, 2.5V, 3.3V)
Sleep      | 5 µA standby, RTC wake-up with battery backup
Dimensions | 94mm × 51mm

Why ULX3S for Sampo

The ULX3S isn't just an FPGA breakout board—it's a complete system:

  • 32MB SDRAM: Real memory, not just block RAM. Essential for running actual programs.
  • HDMI output: Video terminal without external hardware.
  • microSD slot: Load programs, implement a filesystem.
  • ESP32 co-processor: WiFi-based JTAG debugging from any device.
  • Buttons and LEDs: Instant I/O for testing without wiring anything.
  • Audio output: Even supports composite video through the audio jack.

Budget Alternative: Tang Nano 9K

Before we dive into the ULX3S, it's worth mentioning a much cheaper option. The Tang Nano 9K (~$15 on AliExpress) uses a Gowin GW1NR-9 FPGA with 8,640 LUTs—more than enough for Sampo:

  • 8,640 LUTs
  • 64 Mbit (8 MB) PSRAM (enough for the full 64KB address space and then some)
  • HDMI output for a video terminal
  • USB-C programming
  • Fully supported by open-source toolchain (Yosys + nextpnr-gowin)

For initial development and testing, the Tang Nano 9K is hard to beat on price. But the ULX3S offers more I/O, more RAM, and a richer peripheral set—making it the better choice for a more complete Sampo system.

LUT Budget Planning

The Sampo RTL implementation is designed to be compact. Here's the resource breakdown:

Component                | Estimated LUTs
16 × 16-bit registers    | ~256 FFs
ALU (16-bit)             | 200 - 400
Control logic            | 500 - 1,000
Instruction decode       | 300 - 500
Sampo CPU core           | ~1,500 - 2,500
UART (115200 baud)       | 200 - 300
SPI controller (SD card) | 300 - 500
GPIO controller          | 200 - 400
Basic system             | ~2,500 - 4,000
SDRAM controller         | 500 - 1,000
Instruction cache        | 1,000 - 2,000
Data cache               | 1,000 - 2,000
Full system              | ~6,000 - 10,000

These estimates are based on typical RISC CPU implementations. The actual numbers will depend on optimization choices and synthesis settings.

Variant Recommendations

  • 12K LUTs (ULX3S-12F): Plenty for basic Sampo + peripherals, tight for caches.
  • 44K LUTs (ULX3S-45F): Comfortable. Full CPU with cache, room for experiments.
  • 84K LUTs (ULX3S-85F): Luxurious. Multi-core experiments, extensive peripherals.
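
To put the recommendations above in numbers, here is a minimal sketch that divides the subtotal rows of the budget table by each variant's LUT count. The capacities are the rounded 12K / 44K / 84K figures from the spec table, the budgets are the estimates quoted above, and the function names are made up for illustration. Real utilization will only be known after synthesis.

```python
# Rough utilization check: estimated LUT budgets vs. ULX3S variant capacity.
# Capacities are the rounded figures from the board spec table.
DEVICE_LUTS = {"12F": 12_000, "45F": 44_000, "85F": 84_000}

# (low, high) estimates copied from the subtotal rows of the budget table.
BUDGETS = {
    "CPU core": (1_500, 2_500),
    "Basic system": (2_500, 4_000),
    "Full system": (6_000, 10_000),
}


def utilization(luts: int, capacity: int) -> float:
    """Return LUT usage as a percentage of device capacity."""
    return 100.0 * luts / capacity


def report() -> list[str]:
    """Build one line per (budget, variant) pair with the low-high range."""
    lines = []
    for name, (low, high) in BUDGETS.items():
        for variant, capacity in DEVICE_LUTS.items():
            lo_pct = utilization(low, capacity)
            hi_pct = utilization(high, capacity)
            lines.append(f"{name:<12} on {variant}: {lo_pct:4.1f}% - {hi_pct:4.1f}%")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```

The high end of the full-system estimate takes up most of a 12F, which is why that variant is described as tight for caches.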

Toolchain Setup

The ECP5 toolchain is fully open source:

# macOS (Homebrew)
brew install yosys nextpnr-ecp5 ecpprog fujprog

# Ubuntu/Debian
apt install yosys nextpnr-ecp5 ecpprog

# Amaranth HDL (for our existing RTL)
pip install amaranth amaranth-boards

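# Or build FPGA tools from source for latest features
# (--recursive pulls in submodules such as the Project Trellis device database)
git clone --recursive https://github.com/YosysHQ/yosys
git clone --recursive https://github.com/YosysHQ/nextpnr
git clone --recursive https://github.com/YosysHQ/prjtrellis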

Tool Roles

Tool            | Purpose
Amaranth        | Python-based HDL (generates Verilog)
Yosys           | Verilog synthesis (RTL → netlist)
nextpnr-ecp5    | Place and route (netlist → textual config)
Project Trellis | ECP5 bitstream documentation; provides ecppack (config → bitstream)
ecpprog/fujprog | Upload bitstream to board

Amaranth Build Flow

For the Amaranth version of the RTL, the build flow starts with Python:

# Generate Verilog from Amaranth
cd rtl/
python -m amaranth generate soc.py > sampo.v

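# Then synthesize with standard tools
yosys -p "synth_ecp5 -top sampo_soc -json sampo.json" sampo.v
# --85k targets the 85F board; use --45k or --12k for the smaller variants
nextpnr-ecp5 --85k --package CABGA381 \
    --lpf ulx3s.lpf --json sampo.json --textcfg sampo.config
# Pack the textual config into a bitstream
ecppack sampo.config sampo.bit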

# Program the board
fujprog sampo.bit

Hand-Written Verilog Implementation

In addition to the Amaranth RTL, there is now a complete AI-assisted, hand-written Verilog implementation at /verilog/. Amaranth can generate Verilog, but the auto-generated output isn't particularly readable. The hand-written version is designed for clarity and portability:

verilog/
├── rtl/
│   ├── sampo_pkg.vh   # Opcodes, constants, state definitions
│   ├── alu.v          # 16-bit ALU with all operations
│   ├── shifter.v      # Barrel shifter (1/4/8-bit shifts, rotates)
│   ├── regfile.v      # 16 registers + alternate set (EXX)
│   ├── decode.v       # Instruction decoder
│   ├── cpu.v          # FSM-based CPU core (8 states)
│   ├── ram.v          # 64KB synchronous RAM
│   ├── uart.v         # Simple UART for serial I/O
│   └── soc.v          # Top-level SoC integration
├── tb/
│   ├── alu_tb.v       # ALU unit tests
│   ├── regfile_tb.v   # Register file tests
│   └── sampo_tb.v     # Full system testbench
├── programs/
│   └── hello.hex      # Test program in Verilog hex format
├── Makefile           # Build automation
└── bin2hex.py         # Convert sasm output to Verilog $readmemh format
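
The repository's bin2hex.py is not reproduced here, but the conversion it performs can be sketched roughly as follows. This is an illustration under assumptions: it assumes sasm emits a flat binary image, that ram.v is organised as 16-bit words, and that bytes are stored little-endian. The function name and parameters are hypothetical. $readmemh itself only needs whitespace-separated hex values, one memory word each.

```python
import sys
from pathlib import Path


def bin_to_readmemh(data: bytes, word_bytes: int = 2, little_endian: bool = True) -> str:
    """Format raw bytes as one hex word per line, as $readmemh expects.

    word_bytes and little_endian are assumptions about the RAM layout.
    """
    if len(data) % word_bytes:
        # Pad the final partial word with zeros so every line is a full word.
        data += bytes(word_bytes - len(data) % word_bytes)
    order = "little" if little_endian else "big"
    lines = []
    for offset in range(0, len(data), word_bytes):
        word = int.from_bytes(data[offset:offset + word_bytes], order)
        lines.append(f"{word:0{word_bytes * 2}x}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical file names): python bin2hex_sketch.py hello.bin hello.hex
    src, dst = Path(sys.argv[1]), Path(sys.argv[2])
    dst.write_text(bin_to_readmemh(src.read_bytes()))
```

If the real RAM turns out to be byte-wide or big-endian, only the two keyword arguments need to change.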

The Verilog implementation uses an 8-state FSM for the CPU: RESET → FETCH → FETCH_EXT → DECODE → EXECUTE → MEMORY → WRITEBACK → HALTED. This makes timing predictable and debugging straightforward.
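
The state list reads as a straight line, but in practice the machine loops back to FETCH after each instruction and skips states an instruction doesn't need. The Python model below is a sketch of one plausible set of transitions, not a transcription of cpu.v. The state names come from the article, while the `Instr` flags, the skip rules, and the idea that FETCH already knows whether an extension word follows are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto


class State(Enum):
    RESET = auto()
    FETCH = auto()
    FETCH_EXT = auto()
    DECODE = auto()
    EXECUTE = auto()
    MEMORY = auto()
    WRITEBACK = auto()
    HALTED = auto()


@dataclass
class Instr:
    """Hypothetical per-instruction flags; the real decoder signals may differ."""
    has_ext_word: bool = False    # needs a second 16-bit word (e.g. an immediate)
    touches_memory: bool = False  # load/store style instruction
    is_halt: bool = False


def next_state(state: State, instr: Instr) -> State:
    """Assumed transition function for the 8-state FSM."""
    if state is State.RESET:
        return State.FETCH
    if state is State.FETCH:
        return State.FETCH_EXT if instr.has_ext_word else State.DECODE
    if state is State.FETCH_EXT:
        return State.DECODE
    if state is State.DECODE:
        return State.EXECUTE
    if state is State.EXECUTE:
        if instr.is_halt:
            return State.HALTED
        return State.MEMORY if instr.touches_memory else State.WRITEBACK
    if state is State.MEMORY:
        return State.WRITEBACK
    if state is State.WRITEBACK:
        return State.FETCH  # loop back for the next instruction
    return State.HALTED  # HALTED is terminal until reset


def states_visited(instr: Instr) -> int:
    """Count the states one instruction passes through, starting at FETCH."""
    state, count = State.FETCH, 0
    while True:
        count += 1
        state = next_state(state, instr)
        if state in (State.FETCH, State.HALTED):
            return count
```

Under these assumptions a register-only instruction visits four states, and a memory instruction with an extension word visits six. If each state takes one clock, that fixed per-instruction count is what makes the timing predictable.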

Simulation with Icarus Verilog

The Verilog implementation includes a complete Makefile for testing:

cd verilog/

# Run the main simulation (hello world)
make test

# Run ALU unit tests
make test-alu

# Run register file tests
make test-regfile

# Build with Verilator (faster simulation)
make verilate

# View waveforms in GTKWave
make wave

Sample output from make test:

=== Sampo CPU Testbench ===
RAM init file: ../programs/hello.hex

CPU started at PC=0x0100
UART output:
----------------------------------------
Hello, Sampo!
----------------------------------------

Simulation complete:
  Final PC:    0x011E
  Cycles:      847
  UART chars:  14
  Status:      HALTED

The Verilog version is portable to any FPGA toolchain—Xilinx, Intel, Lattice, Gowin—without requiring Amaranth or Python in the build chain.

Implementation Roadmap

With both Amaranth and Verilog implementations complete and tested in simulation, the roadmap is now about bringing them up on hardware.

Phase 1: Core Bring-up ✓ (Complete)

  • ✓ Instruction fetch and decode
  • ✓ ALU operations (all 16 operations)
  • ✓ Barrel shifter (1/4/8-bit shifts, rotates, RCL/RCR)
  • ✓ Register file with alternate set (EXX)
  • ✓ FSM-based CPU core (8 states)
  • ✓ RAM interface (64KB)
  • ✓ UART for serial I/O
  • ✓ SoC integration
  • ✓ Testbenches passing (ALU, regfile, full system)
  • ✓ Hello World runs in simulation

Phase 1.5: FPGA Bring-up (Current)

  • ○ ULX3S pin constraints (.lpf file)
  • ○ Clock setup (PLL from 25MHz)
  • ○ Map UART to FTDI
  • ○ LED heartbeat / debug outputs
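
One piece of arithmetic worth doing before wiring the UART to the FTDI is the baud-rate divider. The sketch below assumes the UART counts system clocks per bit with an integer divisor, which is how simple UARTs are often built; uart.v may do it differently. It uses the board's 25 MHz oscillator and the 115200 baud rate from the budget table. The 50 and 100 MHz entries are hypothetical PLL outputs, included only for comparison.

```python
def baud_divisor(clk_hz: int, baud: int) -> int:
    """Nearest integer number of system clocks per UART bit."""
    return (clk_hz + baud // 2) // baud


def baud_error_percent(clk_hz: int, baud: int) -> float:
    """Relative error of the achievable baud rate versus the requested one."""
    actual = clk_hz / baud_divisor(clk_hz, baud)
    return 100.0 * (actual - baud) / baud


if __name__ == "__main__":
    # 25 MHz is the onboard oscillator; the other clocks are hypothetical PLL outputs.
    for clk in (25_000_000, 50_000_000, 100_000_000):
        div = baud_divisor(clk, 115_200)
        err = baud_error_percent(clk, 115_200)
        print(f"{clk / 1e6:.0f} MHz: divisor {div}, error {err:+.4f}%")
```

At 25 MHz the divisor comes out as 217 with an error well under 0.01%, so the UART should be able to run straight off the onboard oscillator before any PLL is configured.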

Phase 2: Memory System

  • SDRAM controller for 32MB RAM
  • Instruction cache (optional but helps timing)
  • Basic interrupt handling

Phase 3: Peripherals

  • SPI controller for SD card boot
  • GPIO controller (buttons, LEDs)
  • Timer/counter module

Phase 4: Advanced Features

  • Data cache
  • MMU for memory protection
  • HDMI text console (VGA timing → GPDI)
  • ESP32 WiFi integration for wireless debugging

Recommended Tools & Books

Where to Buy

ULX3S:

  • AliExpress - ~$100-150 depending on variant
  • Mouser - Official distribution
  • CrowdSupply - Original campaign page

Tang Nano 9K (budget alternative):

  • Amazon - ~$25, faster shipping
  • AliExpress - ~$15, slower shipping


Next up: Getting our first instructions executing on real hardware. Both the Amaranth and Verilog implementations are ready and tested—Hello World runs in simulation and the testbenches pass. Now it's a matter of pin constraints, clock domains, and debugging the inevitable timing issues.
