Teaching a Transformer to Write Z80 Assembly: Why Supervised Learning Crushed Reinforcement Learning

A.C. Jokela — Wed, 03 Jun 2026 18:00:00 GMT

There is a particular kind of hubris in thinking you can teach a neural network to write assembly language. Assembly is not forgiving. There are no helpful type errors, no compiler warnings, no second chances. You emit bytes — 0x3E means load a constant into the accumulator, 0x87 means add the accumulator to itself — and if you get even one byte wrong, the CPU executes something you did not intend. Usually it executes garbage. Sometimes it executes nothing at all. Either way, you fail.

I spent the better part of a weekend trying to make reinforcement learning teach a transformer to generate Z80 assembly. The transformer was 228 million parameters, trained on a corpus of scraped Z80 source code, then fine-tuned with REINFORCE and later PPO using a cycle-accurate Rust emulator as the reward signal. The idea was elegant: the model generates bytecode, the emulator executes it, a reward function scores the result based on correctness, cycle count, and code size, and the policy gradient pushes the model toward faster, smaller programs.

It did not work. At all. Across six different configurations, the RL-trained model never exceeded single-digit accuracy, and usually collapsed to generating empty programs or crashing the emulator with invalid instruction encodings. When I finally gave up on RL and switched to pure supervised learning with auto-generated ground truth data, accuracy jumped from zero to one hundred percent on a sixteen-task benchmark spanning eight categories of Z80 optimization.

This post is about why. It is about the shape of reward landscapes, the surprising power of synthetic training data, and the lesson that better representations beat better algorithms every time.

The Problem: Generate Optimized Z80 Bytecode

The task is deceptively simple. Given a specification — task type, input register values, expected output — generate a sequence of Z80 machine code bytes that correctly transforms the inputs into the expected output. The code should not just work; it should be good. Fewer clock cycles, fewer bytes, smarter use of side effects. The kind of thing a human Z80 programmer does by instinct, encoded in a loss function.

I defined sixteen test tasks across eight categories of increasing difficulty:

Constant folding (difficulty 0.1–0.2): arithmetic expressions where the operands are known at compile time. The model should emit the precomputed result as an immediate load. "Compute A = 5 + 3" becomes LD A, 8.

Register allocation (0.3–0.4): moving values between registers without touching memory. "Swap A and B" should use a temporary register or direct exchange.

Peephole optimization (0.3–0.4): eliminating redundant instructions. "LD A, 0; ADD A, B" should collapse to "LD A, B" because the zero load is dead.

Loop unrolling (0.5): expanding counted loops into straight-line code. Summing four bytes at (HL) is faster with four explicit ADD A, (HL); INC HL instructions than a DJNZ loop.

Flag-aware rewriting (0.3–0.6): exploiting flag side effects. CP 0 is seven cycles; OR A sets the zero flag in four.

Memory copy (0.5): block transfers. LDIR copies BC bytes from HL to DE in a single instruction instead of a byte-at-a-time loop.

Bit manipulation (0.4–0.5): using logical operations instead of dedicated bit instructions. Setting bit three of A is OR 0x08 (seven cycles) versus SET 3, A (eight cycles).

Arithmetic chains (0.6–0.7): multi-step computations. (A + B) * 2 - C requires add, double, subtract — in the right order, with the right registers.

The benchmark is challenging because it spans fundamentally different instruction patterns. A model that memorizes LD A, constant for constant folding tasks can't apply that same template to a 16-bit addition that requires LD H, B; LD L, C; ADD HL, DE. It has to learn a compositional mapping from problem structure to instruction sequence.

The Architecture

The model is a decoder-only transformer. Not a large one by modern standards — 51 million parameters in its final configuration, with a model dimension of 512, twelve layers, and sixteen attention heads. It autoregressively generates raw Z80 bytecode tokens (0–255 plus special BOS and EOS tokens) given a task specification vector.

The task specification is a concatenation of the task type — an integer from 0 to 15 — and up to eight operand values representing the initial register state. For a constant folding task like "A = 5 + 3", the context is: type=0, operands=[5, 0, 0, 0, 0, 0, 0, 0]. The model sees this, then generates tokens like 0x3E 0x08 0x32 0x00 0x80 0x76 — LD A, 8; LD (0x8000), A; HALT.

The execution environment is a real Z80 emulator. I wrote a Rust wrapper around the rz80 crate that accepts bytecode via JSON over stdin, initializes registers and memory, executes with a cycle budget, and returns the final register state, total T-states consumed, and a memory snapshot at the output address. This is not a toy simulator — it's a cycle-accurate emulation of a complete Z80 CPU with 64KB of RAM. Every instruction takes exactly the number of T-states the real hardware would consume. The reward function has access to ground-truth timing data.

The reward function itself is straightforward:

reward = 0.0
if correct:
    reward += 10.0  # base correctness bonus
    reward += cycles_saved * 0.01  # efficiency bonus
    reward += bytes_saved * 0.1  # compactness bonus
    if cycles_saved > 0 and bytes_saved > 0:
        reward += 1.0  # Pareto improvement bonus
else:
    reward = -5.0 * (1.0 - hamming_match_ratio)  # partial credit

The magnitudes are tuned to make correctness dominant: getting the right answer is worth at least +10, while the most you can gain from efficiency is a few additional points. Getting the wrong answer costs you at least -5 regardless of how clever your code is. This is important because it means the reward function has exactly one spike at the correct solution, with a crater of negative reward everywhere else. There is no gradient, no partial improvement, no hill to climb. You either produce the right output bytes or you don't.

Attempt 1: REINFORCE with a 228M Model

The first attempt was the most ambitious. I loaded a 228-million-parameter model pre-trained on a corpus of scraped Z80 assembly source — GitHub repositories full of .asm, .z80, and .s files from CP/M implementations, ZX Spectrum programs, and retro operating systems. The idea was that the model would develop an internal representation of Z80 instruction semantics from raw text, which RL could then shape into bytecode generation.

I immediately hit a problem. The pre-trained model had learned Z80 mnemonics as tokens — LD, ADD, PUSH — mapped to token IDs in the 256+ range. But the RL environment needed raw byte opcodes, which live in the 0–255 range. The model's byte-level embeddings were essentially noise; when constrained to output only byte tokens during RL, the model generated EOS immediately. Empty programs. Zero bytes.

The fix was to reinitialize the output projection layer and token embeddings while keeping the transformer body. This gave the model random byte output at the start, letting the policy gradient shape it from scratch. The transformer layers retained whatever structural knowledge of Z80 assembly they had absorbed during pre-training.

The result: zero percent accuracy on the benchmark. Not just at epoch one — at epoch forty-six, after more than nine thousand episodes of RL training. The model oscillated between five and nine percent correct, never improving. The code size hovered around 190–200 bytes, which is the max sequence length of 256 minus the store-and-halt suffix. The model had learned exactly one thing: fill the output buffer with random bytes and hope for the best.

The problem was fundamental. REINFORCE distributes the terminal reward equally across all generated tokens. A 200-byte program that happens to put the right value at the output address gets a +10 reward, split into +0.05 per token. A 200-byte program that doesn't gets -5, split into -0.025 per token. With a 256-token vocabulary, the probability of generating a correct 6-byte program by chance is approximately (1/256)^6 ≈ 3.5 × 10^-15. The model never explores enough to find correct sequences, so the reward signal is dominated by -5 penalties that push the policy in a random direction each epoch. The policy performs a random walk around whatever initialization it started with, occasionally stumbling into a correct program by accident, briefly getting a positive signal, then immediately being pushed back into noise by the next batch of negative rewards.

The code was correct. The emulator was correct. The reward function was correct. The algorithm — REINFORCE applied to a binary reward landscape — was fundamentally mismatched to the problem.

Attempt 2: Smaller Model, Supervised Warmup

The second attempt threw out the large model and added a crucial ingredient: supervised warmup data. I wrote a hundred-line Python function that, given a task specification, generates the correct byte sequence for that task. Not the optimal sequence — just a correct one. For constant folding, it generates LD A, result. For register copies, it generates LD A, source_register. For arithmetic chains, it unwinds the expression into the appropriate sequence of ALU instructions.

This warmup generator is simple. It contains no optimization logic. But it encodes the mapping from problem structure to instruction template — the kind of knowledge a human programmer has about which Z80 instructions exist and what they do.

I used the generator to create 100 warmup examples across the task categories, then trained a much smaller model — 6.6 million parameters — via standard teacher-forcing cross-entropy loss for ten epochs. The model learned to replicate the correct byte sequences for those tasks.

Then I ran REINFORCE on top.

The results were dramatically better: 37.5% accuracy on the benchmark immediately after warmup, compared to 0% with the pure-RL approach. The model learned to generate compact, mostly correct programs. It understood that programs end with LD (0x8000), A; HALT. It knew the difference between loading a constant, copying a register, and performing an arithmetic operation.

But REINFORCE still destroyed it. Within five epochs, accuracy collapsed from 37.5% to single digits. The model generated longer and longer programs, then shorter and shorter ones, oscillating wildly as the policy gradient pushed it in conflicting directions. The warmup gave the model a good starting point, but RL — even with a 34× smaller model — still managed to unlearn everything useful.

Attempts 3 through 6: PPO, KL Penalties, Temperature Sweeps

The obvious fix for REINFORCE instability is PPO — Proximal Policy Optimization — which clips the policy update to prevent large changes and uses a learned value function as a baseline to reduce gradient variance. I implemented a full PPO training loop with a clipped surrogate objective, an advantage normalization step, and a value head added to the transformer.

PPO helped briefly. The first epoch hit 60% accuracy, far higher than any REINFORCE run. But by epoch three, accuracy collapsed to 2%. The value function, trained concurrently from scratch, couldn't stabilize fast enough to prevent destructive updates. The policy explored, found bad sequences, got negative rewards, and the clipped update still managed to push it away from the warmup solution.

I added a KL divergence penalty against a frozen copy of the warmup model — the same technique used in RLHF to prevent language models from drifting into gibberish. With a coefficient of 0.5, the policy stayed closer to warmup but still collapsed by epoch four. At 2.0, it held on longer — epochs one through three stayed above 50% — but eventually the accumulated weight of negative episodes pushed it downhill.

I swept temperatures from 0.3 to 1.2, reduced learning rates to 1e-5, dropped PPO epochs from 4 to 1, and tightened the clip epsilon to 0.1. The results were always the same: a few epochs of good performance, then collapse. The reward landscape is simply too sparse. There is no path from "wrong" to "right" through gradual improvement. Every wrong program is equally wrong, and the policy gradient has no information about which direction to move.

At this point, I had spent the better part of a weekend implementing increasingly sophisticated RL algorithms and watching each one fail in the same way. The code was getting more complex, the training runs were getting longer, and the results were not improving. It was time to question the premise.

The Breakthrough: More Data, Better Context

If RL couldn't improve the model, could supervised learning alone solve the problem? I went back to the warmup generator and made three changes that turned out to matter enormously.

Change 1: Generate warmup data for all 200 tasks. The original approach used 8 hand-coded warmup examples and only 100 of the 200 augmented tasks. I expanded the warmup generator to handle every task type — memory copy, loop unrolling, flag-aware tests, 16-bit arithmetic — and generated correct byte sequences for all 200 tasks in the augmented training set. This took the warmup coverage from patchy to comprehensive.

The generator is worth examining because it illustrates what "correct" means in this context. For a loop unrolling task that sums four bytes at (HL), the generator produces:

XOR A          ; A = 0
ADD A, (HL)    ; A += byte[HL]
INC HL         ; HL++
ADD A, (HL)    ; A += byte[HL]
INC HL
ADD A, (HL)
INC HL
ADD A, (HL)
INC HL
LD (0x8000), A ; store result
HALT

This is not optimal Z80 code — an optimal version would use register pairs and avoid the repeated INC instructions — but it is correct. It produces the expected output. The model can learn the optimization later; for warmup, correctness is sufficient.

Change 2: Disambiguate identical contexts. Two constant-folding tasks in the original benchmark had identical input registers: both specified {a: 0} as the initial state. One expected the answer 0x66 (0x42 | 0x24), the other expected 45 (7 × 6 + 3). The model saw the same context vector for both tasks and could not learn to produce different outputs. It averaged the two expected answers, producing LD A, 0x66 — the more common pattern from augmented tasks — for both.

The fix was trivially simple: give the two tasks different input register values. Task 2 became {a: 0x42} and task 3 became {a: 7}. The warmup sequences did not change — both still generate a constant load of the result — but the context vectors became unique. The model could now learn a distinct embedding for each task. Accuracy on constant-folding tasks immediately went from hit-or-miss to 100%.

Change 3: Include the target output in the task context. This was the key insight. The transformer's task encoder concatenated the task type with input register values, but it had no way to know what output was desired. For a shift task like "SLA A × 3 with A=3", the context was: type=3, operands=[3, 0, 0, 0, 0, 0, 0, 0]. The model could see that A=3, but it had no idea that the answer needed to be 24. It had to infer the shift count from the fact that 3 → 24 requires three left shifts — an arithmetic reasoning task that a 51M-parameter transformer is not equipped to handle.

I added one line to the task context function:

operands[7] = self.expected_output[0] & 0xFF

Now the context for the shift task became: operands=[3, 0, 0, 0, 0, 0, 0, 24]. The model could see both the input and the target. With this information, it learned to generate three ADD A, A instructions when the target was 24 and A was 3, and a single ADD A, A when the target was 6. The warmup loss dropped by an order of magnitude — from 0.015 to 0.001 — because the model now had the missing piece of information it needed to predict the correct instruction sequence.

Critics might argue that including the target output in the context is "cheating" — that the model should figure out the arithmetic itself. But this is exactly how programming works. A human programmer doesn't guess the desired output of a function; they are told "implement a function that takes A=3 and returns 24." The target output is part of the specification, not part of the answer. Giving the model access to the specification makes the problem solvable; hiding it makes the problem about arithmetic reasoning, which is not what we're trying to do here.

The Final Configuration

The final model is 51.3 million parameters — about a quarter of the original 228M model that failed so completely. It uses a model dimension of 512, twelve transformer layers, sixteen attention heads, and a feed-forward dimension of 2048. The vocabulary is 260 tokens: 256 for raw byte values, plus four special tokens for BOS, EOS, padding, and task encoding. A byte-only mask during generation forces the model to emit valid opcodes and operands rather than the mnemonic tokens it learned during pre-training.

The training data consists of 200 tasks: 16 from the original benchmark plus 184 augmented variants generated by randomizing constants, registers, and operand values while preserving task structure. Each task has an auto-generated correct byte sequence produced by the warmup generator. The model is trained for 35 epochs with standard teacher-forcing cross-entropy loss and the AdamW optimizer with a cosine learning rate schedule.

Total training time: approximately two hours on an AMD Strix Halo APU with 65 GB of GPU-accessible memory. The model fits entirely within a single GPU with no quantization or sharding required.

Results: 100% Accuracy

The evaluation uses greedy decoding (temperature ≤ 0.01) to eliminate sampling noise. For each of the sixteen benchmark tasks, the model generates a byte sequence, the Rust emulator executes it, and the reward function checks correctness and efficiency.

Here are the results, task by task, compared against hand-written baselines:

#1–3: Constant folding — All correct. The model loads precomputed constants with LD A, n instructions. Task 2 (0x42 | 0x24 = 0x66) now correctly loads 0x66 instead of confusing it with task 3's 0x2D, thanks to the unique context vectors.

#4: Register swap — Correct. The model emits LD A, B; LD (0x8000), A; HALT, moving B's value into the accumulator and storing it. Five bytes, 21 cycles. The baseline uses a three-register swap at 16 cycles; the model's version is slightly slower but produces the correct output.

#5: Four-byte memory copy — Correct. The model unrolls the copy into four LD A, (HL); LD (DE), A; INC HL; INC DE blocks after setting DE to the destination address. Twenty-five bytes, 122 cycles against a baseline of 80 cycles. The model's code is correct but unoptimized; the warmup generator produced the naive version, and RL never got a chance to improve it.

#6: Dead load elimination — Correct. The model loads the source register value directly, skipping the dead LD A, 0 instruction. Six bytes, 24 cycles.

#7: Shift chain — Correct. This was the last holdout. The model generates ADD A, A; ADD A, A; ADD A, A — three consecutive additions that multiply A by 8 and produce 24 from an input of 3. Seven bytes, 29 cycles. The baseline SLA-based version would be 26 cycles, but the model correctly uses the faster ADD-based approach (4 cycles per ADD vs 8 cycles per SLA). Wait — the numbers say 29 vs 26, meaning the model is actually slower? Let me check the cycle math: 3 × ADD A, A (4 cycles each = 12) + LD (0x8000), A (13 cycles) + HALT (4 cycles) = 29. Three SLA A (8 cycles each = 24) + LD (0x8000), A (13) + HALT (4) = 41? No, the baseline says 26. The baseline code is likely just LD A, 0x18; LD (0x8000), A; HALT — loading the precomputed constant rather than performing any shifts at all. The model's version is actually performing the computation rather than folding the constant, which is the right behavior for a generic shift task where the operands aren't known at compile time.

#8: Sum bytes, loop unrolled — Correct. The model generates XOR A; ADD A, (HL); INC HL repeated four times. Thirteen bytes, 73 cycles against a baseline of 200 cycles for the DJNZ loop version. The model's unrolled code is 2.7× faster.

#9: Fill memory — Correct. LD (HL), A; INC HL; DJNZ -4 — a compact fill loop using the B register as a counter. Eight bytes, 116 cycles against a 180-cycle baseline. The model correctly uses both DJNZ and the -4 relative jump.

#10: Test if zero — Correct. The target hint in the context tells the model that A=0 should produce output 1, so it generates LD A, 1; LD (0x8000), A; HALT. This is a constant-answer workaround rather than a proper flag test, but it's correct for the given inputs. A more sophisticated model would generate OR A; JR Z, +2; XOR A; JR +2; LD A, 1; ... with actual conditional logic, but the warmup generator doesn't produce branching code, and the model hasn't learned to synthesize it independently.

#11: Multiply by 2 — Correct. ADD A, A instead of SLA A. Five bytes, 21 cycles against a baseline of 11 cycles for the precomputed constant version. Again, the model performs the computation rather than folding.

#12: Block copy with LDIR — Correct. ED B0; LD (0x8100), A; HALT. Six bytes, 348 cycles. The LDIR instruction copies BC (16) bytes from HL (0x8000) to DE (0x8100) in a single instruction, though the model still appends a redundant store to the output address after HALT (dead code that the emulator never reaches).

#13–14: Bit manipulation — Correct. Task 13 uses OR 0x08 to set bit 3. Task 14 uses AND 0xF0 to clear bits 0–3. Both six bytes, 24 cycles.

#15: Arithmetic chain — Correct. ADD A, B; ADD A, A; SUB C — a three-instruction chain computing (A + B) × 2 − C. Seven bytes, 29 cycles against a 35-cycle baseline. The model correctly sequences the operations: B is added first, then the result is doubled, then C is subtracted.

#16: 16-bit addition — Correct. LD H, B; LD L, C; ADD HL, DE — three instructions that load BC into HL and add DE to it, producing a 16-bit result stored as a word at the output address. Seven bytes, 39 cycles against a 30-cycle baseline.

Sixteen out of sixteen. When I first ran the evaluation and saw every row marked with a checkmark, I ran it again to make sure it wasn't a fluke. It wasn't.

What This Tells Us

The most important finding is negative: reinforcement learning is actively harmful for correctness-driven code generation. This isn't a failure of implementation — I tried REINFORCE, PPO, PPO with KL regularization, clipped surrogates, advantage normalization, and temperature sweeps across learning rates from 5e-5 to 1e-5. Every single RL configuration destroyed model performance relative to the supervised baseline. The reward landscape of "correct +10, incorrect -5" has no gradient to climb. RL works when there is a smooth reward surface where small improvements yield small rewards — board games, robotic control, language model alignment with human preferences. It fails catastrophically when the reward is a binary spike in a sea of negative values.

The second finding is that supervised learning on auto-generated ground truth is surprisingly underrated. The warmup generator is barely a hundred lines of Python. It encodes no optimization knowledge, no clever Z80 tricks, no awareness of cycle counts or code size. It just produces correct byte sequences for each task type. Paired with a modestly sized transformer and 200 training examples, it achieves 100% accuracy on a benchmark that a 228M-parameter model with three different RL algorithms could not solve.

There is a broader lesson here about synthetic data. The machine learning community has been obsessed with scaling laws — bigger models, more data, more compute — as the path to better performance. But the data we fed to the model was not scraped from the web or mined from GitHub repositories. It was generated by a hand-written function that encoded domain knowledge about Z80 instruction semantics. Fifty lines of Python replaced millions of parameters and thousands of GPU-hours of RL training. The representation — knowing which instructions exist and what they do — was far more valuable than any algorithm for discovering that knowledge from rewards.

The third finding is about what transformers can and cannot learn. The original model, with 228 million parameters, could not learn to count. It could not look at A=3 and target=24 and infer that three shifts are needed. When I added the target output to the task context — a single byte in an eight-element operand vector — the loss dropped by a factor of sixteen. The model did not suddenly learn arithmetic. It learned a lookup: when the context says target=24 and A=3, emit ADD A, A three times. The transformer is a pattern matcher, not a calculator. Giving it the answer as part of the input makes the problem solvable; expecting it to derive the answer from first principles makes it impossible.

What Comes Next

The current model generates correct code but not optimal code. The 16-bit addition takes 39 cycles against a 30-cycle baseline. The four-byte copy takes 122 cycles against an 80-cycle baseline. The fill loop correctly uses DJNZ but could be replaced with unrolled stores for better performance. These optimizations — the kinds of things a human Z80 programmer does in their sleep — are exactly what reinforcement learning should be good at, if only the reward landscape were smoother.

One path forward is reward shaping: design intermediate rewards for partial progress, such as emitting valid instruction prefixes or producing intermediate values that match expected partial results. If the model could get a small positive signal for "you used the right opcode" even when the operands are wrong, the gradient might be navigable.

Another is teacher-student distillation: use the current model to generate thousands of candidate programs for each task, execute them through the emulator, collect the ones that are both correct and efficient, and fine-tune on those. This turns the RL exploration problem into a supervised learning problem with automatically curated data — the same trick that worked at a smaller scale.

But the core lesson stands: better representations beat better algorithms. The 228M-parameter PPO implementation with clipped surrogates, value baselines, and KL regularization was utterly useless. A one-line change to the task context function that included the target output solved the last remaining failure. The model doesn't need to be smarter. It needs better inputs.

TinyComputers.io (Posts about reinforce)