Running DiffusionGemma on AMD Strix Halo and Decade-Old Tesla P40s

A.C. Jokela — Wed, 10 Jun 2026 23:30:00 GMT

Google released DiffusionGemma this week, and it's the most interesting thing to come out of the Gemma program in a while — not because it's bigger or smarter than what came before, but because it generates text in a fundamentally different way. Instead of predicting one token at a time, left to right, the way every mainstream LLM has worked since GPT-2, DiffusionGemma starts with a canvas of 256 masked tokens and iteratively denoises the whole block at once. Google's headline claim is up to 4x faster generation: over 1,000 tokens per second on an H100, 700+ on an RTX 5090.

I do not own an H100. I own an AMD Strix Halo APU and a rack-mount server with four NVIDIA Tesla P40s from 2016. Neither appears anywhere in Google's launch material, and for good reason — the official deployment paths assume hardware features that one of these machines lacks entirely and the other only half-has. Getting DiffusionGemma running on both took an unmerged llama.cpp pull request, a community quantization, and the usual amount of stubbornness.

It runs on both. And the benchmark result surprised me: the integrated GPU in a mini PC beat four discrete NVIDIA server cards — 96GB of combined VRAM, roughly 48 TFLOPS of aggregate FP32 — by a factor of more than two. This post covers the setup on each machine, the head-to-head numbers, and why diffusion inference inverts some of the assumptions I've built up over years of running autoregressive models on this hardware.

What Block Diffusion Actually Does

DiffusionGemma is built on the Gemma 4 26B A4B architecture: a Mixture-of-Experts model with 25.2 billion total parameters, of which only 3.8 billion are active for any given token. The MoE part is familiar. The generation process is not.

An autoregressive model produces text serially. Each new token requires a full forward pass conditioned on everything before it, so a 256-token response means 256 sequential passes through the network. The KV cache makes each pass cheap, but the serial dependency is structural: token N cannot begin until token N-1 is finished. Generation speed is dominated by how fast you can stream weights through the GPU's memory bus, over and over.

DiffusionGemma instead works on a 256-token block it calls a canvas. Every position starts as a mask token. Each denoising step runs one forward pass over the entire canvas and proposes tokens for every position simultaneously. Positions where the model is confident get committed; uncertain ones stay masked for the next round. The sampler in the released implementation is called entropy-bound: rather than unmasking a fixed number of tokens per step, it commits every position whose predicted distribution has entropy below a threshold, which means easy spans of text resolve in a handful of steps while tricky ones get more iterations. For sequences longer than one canvas, the model chains blocks autoregressively — each new 256-token canvas conditions on the completed text before it. Google calls this block-autoregressive multi-canvas sampling, which is a lot of words for "diffusion inside the block, autoregression between blocks."
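
To make the commit rule concrete, here is a toy sketch of an entropy-bound unmasking loop in Python. It is not the code from the llama.cpp PR or from Google's release: `toy_model`, `denoise_step`, `generate`, the three-token vocabulary, and the 0.5 threshold are all hypothetical names and values chosen for illustration. The only thing taken from the description above is the rule itself: each pass commits every masked position whose predicted distribution has entropy below a bound, and leaves the rest masked.

```python
import math

MASK = None  # marker for a canvas position that is still masked


def entropy(probs):
    """Shannon entropy (in nats) of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def denoise_step(canvas, predict, entropy_bound):
    """One pass over the whole canvas: commit every masked position whose
    predicted distribution has entropy below the bound."""
    dists = predict(canvas)  # one probability list per position
    new_canvas = list(canvas)
    for i, (tok, probs) in enumerate(zip(canvas, dists)):
        if tok is MASK and entropy(probs) < entropy_bound:
            # Commit the most likely token at this position.
            new_canvas[i] = max(range(len(probs)), key=probs.__getitem__)
    return new_canvas


def generate(canvas_len, predict, entropy_bound, max_steps):
    """Repeat denoising passes until nothing is masked or max_steps is hit."""
    canvas = [MASK] * canvas_len
    for step in range(1, max_steps + 1):
        canvas = denoise_step(canvas, predict, entropy_bound)
        if MASK not in canvas:
            return canvas, step
    return canvas, max_steps


def toy_model(canvas):
    """Hypothetical stand-in for the forward pass. Even positions are 'easy'
    and confident at once; odd positions become confident only after both
    neighbours have been committed."""
    n = len(canvas)
    dists = []
    for i in range(n):
        left_ok = i == 0 or canvas[i - 1] is not MASK
        right_ok = i == n - 1 or canvas[i + 1] is not MASK
        if i % 2 == 0 or (left_ok and right_ok):
            dists.append([0.98, 0.01, 0.01])  # low entropy, about 0.11 nats
        else:
            dists.append([0.4, 0.3, 0.3])     # high entropy, about 1.09 nats
    return dists


canvas, steps = generate(8, toy_model, entropy_bound=0.5, max_steps=48)
print(steps, canvas)
```

In this toy the easy positions land in the first pass and the dependent ones in the second, so an 8-position canvas fills in two passes rather than eight. The real sampler has more knobs than this, but the shape of the loop is the point.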

The performance implication is the whole point. If a canvas resolves in 20 denoising steps, you've produced 256 tokens with 20 forward passes instead of 256. The trade is that each pass is much heavier: you're computing attention and expert routing for 2,300-odd positions of context-plus-canvas every step, not one new position. Diffusion converts text generation from a memory-bandwidth-bound serial problem into a compute-bound parallel one.
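
The trade can be put in one line of arithmetic. The sketch below, in Python, computes how much heavier a single denoising pass may be than a single autoregressive pass before diffusion stops winning. The function name is my own, and the model is deliberately crude: it treats every pass of a given kind as costing the same. The step counts of 15 and 48 are the low end I observed and the configured maximum, both discussed later in this post.

```python
CANVAS_TOKENS = 256  # block size described above


def break_even_cost_ratio(steps, canvas_tokens=CANVAS_TOKENS):
    """How many times more expensive one denoising pass can be than one
    autoregressive pass before the two approaches take the same time."""
    return canvas_tokens / steps


for steps in (15, 20, 48):
    ratio = break_even_cost_ratio(steps)
    print(f"{steps} steps: diffusion wins if a pass costs < {ratio:.1f}x")
```

At 20 steps the budget is 12.8x: a denoising pass can cost up to about thirteen autoregressive passes and still come out ahead. Whether a given GPU can deliver that is a question about its arithmetic throughput, not its memory bus.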

Keep that sentence in mind. It decides the entire benchmark.

The Official Paths Don't Fit

The weights are on Hugging Face under Apache 2.0 as google/diffusiongemma-26B-A4B-it — about 50GB in BF16. Google's developer guide shows a vLLM serving command and notes support in Transformers, SGLang, and MLX, with quantized deployment fitting in 18GB of VRAM.

None of that helps either of my machines.

The P40 problem is the same one I keep running into with this hardware. vLLM requires compute capability 7.0 or higher; the P40's Pascal GP102 die is 6.1, so vLLM won't even initialize. The Transformers path technically exists, but the weights are BF16 — a format Pascal has no hardware support for — and the fallback, FP16, runs on the P40 at 1/64th the rate of FP32 because Pascal's half-precision units were an afterthought on this die. I've written before about the contortions required to run BF16-native models on these cards, and a 25-billion-parameter model in FP32 would need 100GB — more than the 96GB the four cards have between them.
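
The memory arithmetic is easy to check. This small Python sketch counts weight storage only, using the 25.2 billion total-parameter figure from above and decimal gigabytes; activations, caches, and runtime overhead are ignored, so real requirements are higher. The helper name is hypothetical.

```python
TOTAL_PARAMS = 25.2e9    # total parameters, from the model description
P40_VRAM_GB = 4 * 24     # four cards with 24GB each

BYTES_PER_PARAM = {"FP32": 4, "BF16": 2, "FP16": 2}


def weights_gb(params, bytes_per_param):
    """Weight storage in decimal GB. Ignores activations and caches."""
    return params * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    size = weights_gb(TOTAL_PARAMS, width)
    verdict = "fits" if size <= P40_VRAM_GB else "does not fit"
    print(f"{name}: {size:.1f} GB, {verdict} in {P40_VRAM_GB} GB")
```

FP32 comes out just over 100GB, which is already past 96GB before a single activation is allocated.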

The Strix Halo could plausibly run the Transformers path, since ROCm's PyTorch handles BF16 on RDNA 3.5 fine. But a 50GB BF16 model plus activations through HuggingFace Transformers on an iGPU is the slow, painful version of this experiment. What both machines actually want is what they always want: llama.cpp and a good GGUF.

The launch blog said llama.cpp support was "coming soon." Soon turned out to already be in flight: pull request #24423 on the llama.cpp repository implements the DiffusionGemma architecture and a dedicated llama-diffusion-cli binary, and the Unsloth team had already published GGUF quantizations built against it — the kind of community velocity that has become the norm within days of any open-weights release. The PR is unmerged as I write this, which means building it yourself, but that's hardly a hardship.

The Unsloth repository offers the usual ladder:

Quantization	Size	Notes
BF16	50.5 GB	Full precision reference
Q8_0	26.9 GB	Near-lossless
Q6_K	22.7 GB
Q5_K_M	19.1 GB
Q4_K_M	16.8 GB	Fits a single 24GB card

I went with Q8_0 on both machines. The P40 box has 96GB of VRAM to spread it across, the Strix Halo has 121GB of unified memory, and for a model whose output quality is already documented as a step below standard Gemma 4, I didn't want quantization noise muddying the comparison.

Machine One: Four Tesla P40s

The P40 server is the familiar workhorse: four GP102GL cards with 24GB of GDDR5X each, Ubuntu 24.04, the NVIDIA 580-series driver, and a 64-core host. It already had a llama.cpp checkout from December, with a working CUDA build configured for CMAKE_CUDA_ARCHITECTURES=61. Rather than disturb a known-good build, I pulled the PR into a git worktree alongside it:

cd ~/llama.cpp
git fetch origin pull/24423/head:diffusiongemma
git worktree add ~/llama.cpp-diffusiongemma diffusiongemma

cd ~/llama.cpp-diffusiongemma
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61
cmake --build build -j 32 --target llama-diffusion-cli

The worktree trick is worth adopting if you haven't: one clone, multiple checked-out branches in separate directories, no second .git download, and the original build directory stays untouched. The build target matters too — DiffusionGemma doesn't run under the standard llama-cli or llama-server; the diffusion sampling loop lives in its own llama-diffusion-cli binary.

The CUDA 12.0 toolkit that ships in Ubuntu 24.04's repositories compiled the sm_61 kernels without complaint. While it built, the Q8_0 GGUF downloaded from Hugging Face. First run:

./build/bin/llama-diffusion-cli \
  -m ~/models/diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -n 256 \
  -p "Explain why text diffusion models can generate text faster than autoregressive models."

Two log lines worth noting before the results. First:

sched_reserve: layer 5 is assigned to device CUDA0 but the Flash Attention
tensor is assigned to device CPU (usually due to missing support)
sched_reserve: Flash Attention was auto, set to disabled

The PR doesn't yet have Flash Attention kernels wired up for this architecture on this hardware, so attention falls back to the unfused path. That's a real performance tax on a workload that's almost entirely attention-over-2,300-positions, and it suggests headroom once the PR matures.

Second:

diffusion_eb: kv cache auto-off (4 GPUs; pass --diffusion-kv-cache on to force)

The diffusion sampler maintains an optional KV cache over the already-committed prompt prefix, so each denoising step only recomputes attention for the active canvas. With the model split across four GPUs, the PR disables this by default. File that away; it becomes a benchmark variable later.

And then it worked. Nineteen entropy-bound denoising steps, 42 seconds, and a coherent 256-token answer — produced by a model that was, frankly, fascinating to watch. With --diffusion-visual you can see the canvas refine in place: scattered high-confidence words appear first, connective tissue fills in around them, and the text snaps into focus over successive steps like a developing photograph. It is the most legible window into "what is the model doing" I've encountered since attention map visualizations stopped being useful.

Machine Two: Strix Halo

The Strix Halo machine is the same one that ran DeepSeek V4 Flash: a Ryzen AI MAX+ 395 with the integrated Radeon 8060S (gfx1151, 40 RDNA 3.5 compute units), 128GB of LPDDR5X unified memory, and ROCm 7.2. The setup mirrors the P40 box almost line for line, with the CUDA flags swapped for HIP:

cd ~/llama.cpp
git fetch --no-tags origin pull/24423/head:diffusiongemma
git worktree add ~/llama.cpp-diffusiongemma diffusiongemma

cd ~/llama.cpp-diffusiongemma
cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j 24 --target llama-diffusion-cli

Instead of downloading the 26.9GB GGUF a second time, I rsynced it across the LAN from the P40 server. Both machines sit on the same switch, and Hugging Face's CDN doesn't get faster than a local wire.

One memorable wrinkle: this box's disk was 97% full when I started, with 65GB free, which is enough for the model but only barely. The subsequent archaeology turned up 329GB of forgotten Ollama models and 118GB of GGUFs cached by root from September experiments, among other leftovers, and the cleanup freed almost 600GB in total. The home lab equivalent of finding grocery money in last winter's coat.

The HIP build compiled the PR without a single source change — the same commit, c84e85af6, that built for CUDA sm_61 also built for gfx1151. Whatever else you want to say about the ggml project, its backend abstraction has earned its keep. The same Flash Attention fallback warning appeared, so both machines run the same unfused attention path, which keeps the comparison honest. One difference: as a single-GPU configuration, the Strix Halo got the diffusion KV cache enabled by default.

The Benchmark

Same model file, same PR commit, same prompt, same flags, two generation lengths. The only differences are the silicon and the KV cache default. The llama-diffusion-cli output reports total wall time and per-step time directly:

	Strix Halo (Radeon 8060S)	4x Tesla P40
256 tokens (1 canvas)	17.4s — 17 steps, 1,025 ms/step	42.5s — 19 steps, 2,235 ms/step
512 tokens (2 canvases)	37.1s — 36 steps, 1,031 ms/step	92.1s — 38 steps, 2,423 ms/step
Effective throughput	~14 tokens/sec	~5.6 tokens/sec

The integrated GPU in a mini PC that idles at a few dozen watts finishes about 2.4x sooner than four server cards drawing 250 watts apiece: 17.4 seconds versus 42.5 for one canvas. Per denoising step, it's 1.03 seconds versus 2.2–2.4 seconds, a ratio of roughly 2.2x. On the AMD side that per-step time holds almost perfectly flat as the generation grows from one canvas to two (1,025ms to 1,031ms), and the P40s drift up only about 8%, so longer outputs scale close to linearly with block count on both machines.
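
For anyone who wants to re-derive the summary figures, this Python sketch recomputes them from the table. The tuples are copied from the rows above; the function and key names are my own. Derived per-step times differ from the reported ones by a millisecond or two because the wall times in the table are rounded.

```python
# (tokens, wall seconds, denoising steps), copied from the benchmark table
RUNS = {
    "Strix Halo": [(256, 17.4, 17), (512, 37.1, 36)],
    "4x Tesla P40": [(256, 42.5, 19), (512, 92.1, 38)],
}


def summarize(tokens, seconds, steps):
    """Derive throughput and per-step figures for one run."""
    return {
        "tok_per_s": tokens / seconds,
        "ms_per_step": 1000.0 * seconds / steps,
        "tok_per_step": tokens / steps,
    }


for machine, runs in RUNS.items():
    for tokens, seconds, steps in runs:
        s = summarize(tokens, seconds, steps)
        print(f"{machine:13s} {tokens:4d} tok: {s['tok_per_s']:5.1f} tok/s, "
              f"{s['ms_per_step']:6.0f} ms/step, "
              f"{s['tok_per_step']:4.1f} tok/step")

# Wall-clock ratio between the two machines at each generation length.
for amd, p40 in zip(RUNS["Strix Halo"], RUNS["4x Tesla P40"]):
    print(f"{amd[0]} tokens: P40 wall time is {p40[1] / amd[1]:.2f}x the APU's")
```

The wall-clock ratio lands between 2.4x and 2.5x at both lengths, while the per-step ratio is a little lower because the P40 runs also needed two more steps each.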

My first suspect for the gap was the KV cache asymmetry, since the P40 box had it disabled. Easy to test: force it on with --diffusion-kv-cache on and rerun. Result: 2,179 ms/step versus 2,235 — a 2.5% improvement. Not the answer. The gap is the hardware, and it's worth understanding why, because the explanation is the inverse of every previous benchmark I've run on these two machines.

Why the iGPU Wins

For autoregressive inference, the P40s' saving grace has always been memory bandwidth. Each card moves about 346 GB/s from GDDR5X, and token-by-token generation is essentially a memory streaming benchmark — which is why these $200 relics have stayed economically relevant for chat workloads years after their compute became obsolete. The Strix Halo's LPDDR5X manages roughly 256 GB/s shared between CPU and GPU, so for ordinary LLM generation the P40s usually hold their own or win.

Diffusion flips the workload. Every denoising step is one enormous batched forward pass: 2,300+ positions of attention, MoE routing, and expert FFNs computed simultaneously. The weights are read once per step and amortized across all 256 canvas positions, so memory bandwidth stops being the bottleneck. What matters is raw arithmetic throughput on big matrix multiplies — exactly the regime where modern architectures embarrass Pascal.
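
A rough floor calculation backs this up. The Python sketch below asks how long one step would take if the only cost were streaming the weights from memory once. It leans on assumptions of mine that are pessimistic for the bandwidth argument: every byte of the 26.9GB Q8_0 file is read each step (as if the batch touched every expert), and on the P40 box the layer split means the cards work one after another, so one card's 346 GB/s applies at any moment. Bandwidth and measured step times are the figures quoted in this post.

```python
MODEL_GB = 26.9  # Q8_0 file size from the quantization table

# name: (memory bandwidth in GB/s, measured ms per denoising step)
MACHINES = {
    "Strix Halo": (256.0, 1025.0),
    "Tesla P40 (layer split)": (346.0, 2235.0),
}


def bandwidth_floor_ms(model_gb, gb_per_s):
    """Milliseconds needed just to stream every weight once.
    Assumes a worst case where each step reads the entire model."""
    return 1000.0 * model_gb / gb_per_s


for name, (bandwidth, measured_ms) in MACHINES.items():
    floor = bandwidth_floor_ms(MODEL_GB, bandwidth)
    print(f"{name}: floor {floor:.0f} ms, measured {measured_ms:.0f} ms, "
          f"about {measured_ms / floor:.0f}x above the floor")
```

Even under that worst case the floor is around 105ms on the Strix Halo and 78ms on the P40s, against measured steps of 1,025ms and 2,235ms. Both machines spend most of each step on something other than reading weights, and the P40s spend far more of it.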

Three specific factors stack up against the P40s:

Pascal's arithmetic is stuck in 2016. No tensor cores, useless FP16, no DP4A path that helps here. Every matmul in the unfused attention and expert layers runs through plain FP32 CUDA cores at GP102's native rate. RDNA 3.5 brings WMMA instructions and dual-issue FP32 — per-clock, per-unit, it simply does more math, and on a compute-bound workload that's the whole game.

Four GPUs synchronize every step. llama.cpp splits the model by layers, so each denoising step's forward pass marches through GPU 0, then 1, then 2, then 3, handing activations across PCIe 3.0 at every boundary. For a batch of 2,300 positions, that is a meaningfully larger transfer than single-token inference ever produces. In autoregressive mode this pipeline overhead hides behind memory streaming; at 38 steps per 512-token generation, it's pure tax. The unified-memory APU pays nothing. It doesn't even cross a PCIe bus to read the weights.

Neither machine has Flash Attention here, but the penalty isn't symmetric. Unfused attention materializes large intermediate matrices and burns bandwidth and compute on a canvas-sized sequence every step. The architecture with more arithmetic headroom absorbs that better.

The result is a benchmark where one of the oldest tricks in the home lab playbook — gang cheap cards together until the VRAM adds up — actively hurts, and the boring little APU that just holds everything in one pool of memory wins by a wide margin. The same dynamic decided the DeepSeek V4 experiment, but for a different reason: there it was instruction set support; here the P40s run the model correctly and still lose on the shape of the computation. Ten-year-old hardware doesn't fail all at once. It fails one workload category at a time.

Watching It Think

A few qualitative observations that the timing table doesn't capture.

The entropy-bound sampler's step count genuinely varies with content. Across runs I saw single canvases resolve in anywhere from 15 to 19 steps against a configured maximum of 48 — the sampler's confidence and entropy thresholds (t=[0.400, 0.800], entropy_bound=0.1, confidence=0.005 in the defaults) decide when each position commits, so boilerplate prose converges fast while denser passages take more iterations. The practical effect is that generation time depends on how hard the text is, not just how long. That's a strange property to develop intuitions for after years of fixed per-token costs.
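
Since per-step time is close to constant, the step count translates directly into wall time. A minimal Python sketch, using the Strix Halo's measured 1,025 ms/step and the step counts mentioned above; the function name is hypothetical, and the 48-step case is the configured ceiling rather than anything I actually observed.

```python
CANVAS_TOKENS = 256
STRIX_MS_PER_STEP = 1025  # measured on the Strix Halo, one canvas


def canvas_seconds(steps, ms_per_step):
    """Wall time for one canvas, assuming a constant cost per step."""
    return steps * ms_per_step / 1000.0


for steps in (15, 19, 48):
    seconds = canvas_seconds(steps, STRIX_MS_PER_STEP)
    print(f"{steps} steps: {seconds:.1f} s per canvas, "
          f"{CANVAS_TOKENS / seconds:.1f} tok/s")
```

The same 256 tokens cost anywhere from roughly 15 seconds to roughly 49 seconds on this machine depending on how many passes the sampler needs, which is the strange property in numbers.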

The instruction-tuned model also produces an explicit planning trace — drafting bullet points, weighing alternatives, revising phrasing — before its final answer, in the now-familiar reasoning-model style. Watching a diffusion model do this is doubly strange, because the plan and the answer crystallize as blocks rather than as a stream, paragraph-scale thoughts emerging whole.

And the quality caveat is real, so I'll repeat Google's own framing: DiffusionGemma trades output quality for speed relative to standard Gemma 4. It's an experimental release aimed at speed-critical applications — real-time editing, latency-sensitive drafting — not a production daily driver. On my hardware, which can't reach the speeds that make the trade compelling, it's best understood as a working preview of a genuinely different inference paradigm. That's worth 27GB of disk to me.

The Recipe, Condensed

For either machine, the full setup is a handful of commands and a download. CUDA flavor:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/24423/head:diffusiongemma && git checkout diffusiongemma
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=61   # 61 = Pascal/P40
cmake --build build -j "$(nproc)" --target llama-diffusion-cli   # a bare -j can mean unlimited jobs under make

ROCm flavor, swap the configure line:

cmake -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1151 -DCMAKE_BUILD_TYPE=Release

Then grab a GGUF from unsloth/diffusiongemma-26B-A4B-it-GGUF on Hugging Face — Q8_0 if you have 27GB of memory to spend, Q4_K_M if you're fitting a single 24GB card — and run:

./build/bin/llama-diffusion-cli \
  -m diffusiongemma-26B-A4B-it-Q8_0.gguf \
  -ngl 99 -cnv -n 2048 --diffusion-visual

The --diffusion-visual flag is optional and you should absolutely use it anyway. Once the PR merges into mainline llama.cpp, the fetch-and-checkout step disappears and this becomes as routine as running any other GGUF.

The deeper takeaway from this experiment isn't about DiffusionGemma specifically. It's that inference hardware assumptions are workload assumptions in disguise. The P40s survive in my rack because autoregressive generation is kind to old silicon with decent memory bandwidth. The first mainstream model family to change how tokens get generated — not just how many parameters produce them — quietly rewrote that bargain. If text diffusion graduates from experiment to standard practice, the hardware hierarchy of the home lab gets reshuffled, and the winners will be whatever has the most matmul per dollar, not the most gigabytes per second. I suspect the P40s will still find work. They always do. But I've stopped assuming I know which jobs they'll be good at.

TinyComputers.io (Posts about text diffusion)