Teaching a Transformer to Write Z80 Assembly: Why Supervised Learning Crushed Reinforcement Learning
An empirical account of training a 51M-parameter transformer to generate correct, optimized Z80 assembly code from task specifications, and of the surprising discovery that reinforcement learning actively destroys model performance while simple supervised learning on auto-generated ground truth achieves 100% accuracy.