DeepSeek released V4 in late April of 2026, and the numbers on the Flash variant are genuinely impressive: 284 billion total parameters, 13 billion activated per token, a one-million-token context window, and benchmarks that rival models three times its size. The architecture introduces hybrid compressed attention, manifold-constrained hyper-connections, and FP4 expert weights — a set of choices that makes running it locally both an interesting engineering challenge and a practical impossibility on most consumer hardware.
Unless you happen to have a Strix Halo APU sitting on your desk.
The AMD Ryzen AI MAX+ 395 has become my go-to machine for these kinds of experiments. It's not the fastest GPU, and it's certainly not the most power-efficient, but its unified memory architecture — 128GB of LPDDR5X shared between CPU and GPU — opens doors that discrete GPU setups can't walk through. When DeepSeek V4 dropped, I had two candidate machines: the Strix Halo, and a server with four NVIDIA Tesla P40s from 2016. On paper, the P40s have more raw compute: four GP102GL dies with 3,840 CUDA cores each, roughly 48 TFLOPS of FP32 in aggregate. The Strix Halo's integrated Radeon 8060S, by contrast, has 40 RDNA 3.5 compute units and maybe a quarter of that throughput.
The P40s couldn't even load the model. The Strix Halo ran it. This post is about why.
The Model: DeepSeek V4 Flash
DeepSeek V4 is a Mixture-of-Experts architecture that's both enormous and economical. The Flash variant, which is the smaller of the two released models, packs 284 billion parameters into 256 routed experts plus one shared expert. At inference time, only six experts are activated per token, meaning only 13 billion parameters actually participate in any given forward pass. The rest sit idle in memory.
The weights use a mixed precision scheme: FP4 for the expert parameters, FP8 for most everything else. This is aggressive. FP4 can only represent sixteen distinct values, and the quantization is done with per-block scaling to preserve dynamic range. It works, but it means the model is fundamentally tied to hardware that supports these formats natively — or to inference engines that can transparently dequantize them.
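To make the "sixteen values plus per-block scaling" idea concrete, here is a minimal sketch of decoding 4-bit E2M1 codes and applying one scale per block. The nibble order, the float scale type, and the `block_size` parameter are assumptions for illustration, and the function names are hypothetical; this is not DeepSeek's or llama.cpp's actual storage layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// The eight non-negative magnitudes an E2M1 code can take
// (1 sign bit, 2 exponent bits, 1 mantissa bit).
constexpr std::array<float, 8> kE2M1Magnitudes = {
    0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Decode one 4-bit code: bit 3 is the sign, bits 0..2 index the table.
inline float decode_fp4(std::uint8_t code) {
    assert(code < 16);
    const float magnitude = kE2M1Magnitudes[code & 0x7];
    return (code & 0x8) ? -magnitude : magnitude;
}

// Dequantize packed FP4 codes with one float scale per block.
// Assumptions (illustrative only): two codes per byte, low nibble first,
// and a caller-chosen block_size.
inline std::vector<float> dequantize_fp4_blocks(
        const std::vector<std::uint8_t>& packed,
        const std::vector<float>& scales,
        std::size_t block_size) {
    const std::size_t n = packed.size() * 2;
    assert(block_size > 0 && n % block_size == 0);
    assert(scales.size() == n / block_size);

    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint8_t byte = packed[i / 2];
        const std::uint8_t code = (i % 2 == 0)
            ? static_cast<std::uint8_t>(byte & 0x0F)
            : static_cast<std::uint8_t>(byte >> 4);
        // The per-block scale is what restores dynamic range.
        out[i] = decode_fp4(code) * scales[i / block_size];
    }
    return out;
}
```

The sixteen codes cover only the magnitudes 0 to 6, so all of the useful range comes from the scale: a block whose largest weight is 0.03 and a block whose largest weight is 30 use the same sixteen codes with different multipliers.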
The model was released as a HuggingFace repository with 46 sharded safetensors files totaling roughly 149 GB. DeepSeek also published reference inference code using torchrun with model parallelism across GPUs and a tilelang kernel library for the FP8 and FP4 matrix multiplies. That reference code assumes Hopper or newer NVIDIA GPUs with native FP8 tensor cores. The P40s, as we'll see, did not get that memo.
Fortunately, the community moved fast. Within weeks, the teamblobfish group had produced GGUF quantizations of the model in formats ranging from Q8_0 down to IQ1_S, using a custom fork of llama.cpp implementing the new deepseek4 architecture. The IQ1_S variant — 1.6 bits per weight on average — compresses the model to 58 GB on disk. That's the one that fit.
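As a sanity check on that file size, the arithmetic can be sketched as a tiny helper. The function name is hypothetical, and the estimate ignores GGUF metadata and any tensors stored at a higher bit width than the average.

```cpp
#include <cassert>

// Approximate size in decimal gigabytes of a model stored at a given
// average bit width. Ignores metadata and higher-precision tensors.
inline double quantized_size_gb(double n_params, double bits_per_weight) {
    assert(n_params >= 0.0 && bits_per_weight > 0.0);
    const double bytes = n_params * bits_per_weight / 8.0;
    return bytes / 1e9;
}
```

Calling `quantized_size_gb(284e9, 1.6)` gives a little under 57 GB, which is consistent with the 58 GB on disk once the non-IQ1 tensors and file overhead are counted.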
Attempt One: Four NVIDIA Tesla P40s
The P40 machine is a familiar presence in my lab. Four Pascal-era GP102GL GPUs, each with 24 GB of GDDR5, running Ubuntu 24.04 with NVIDIA's 580-series server driver. They've run diffusion models, quantized LLMs, and even a 57-billion-parameter image editing pipeline. I figured they'd at least be able to load DeepSeek V4, even if inference was slow.
They couldn't. The failure was architectural, not quantitative.
The first problem was PyTorch. DeepSeek's reference code requires PyTorch 2.10 or newer, which brings native float8_e4m3fn and float4_e2m1fn_x2 dtype support. These dtypes map to CUDA kernels that were compiled for compute capability 7.5 and above. The P40 is compute capability 6.1. When I installed PyTorch 2.11 with CUDA 12.8, it could see all four GPUs, but every CUDA kernel launch — from FP8 tensor creation to basic FP32 matrix multiplication — failed with cudaErrorNoKernelImageForDevice. The PyTorch binaries simply contain no sm_61 machine code.
You can build PyTorch from source with sm_61 support. I've done it before. It takes hours, requires exact CUDA toolkit version alignment, and produces a wheel that you have to manually maintain. But even if I'd taken that route, it wouldn't have helped, because the P40's Pascal architecture has no hardware support for FP8 or FP4 data types. The tensor cores that Hopper uses for FP8 matrix multiply don't exist here. Any FP8 operation would have to be emulated in software, destroying whatever performance advantage the larger die area might have provided.
The second problem was VRAM capacity. 96 GB across four GPUs sounds like a lot, but the IQ1_M GGUF needs 67.5 GB of allocation buffer for the weights alone, plus compute buffers, plus a KV cache. Even if the CUDA compatibility issues were solved, the model wouldn't fit without aggressive CPU offloading. And the P40, with its PCIe 3.0 x16 interface and ~12 GB/s of host-to-device bandwidth, would have turned CPU offloading into a slideshow.
Sometimes the right tool for the job is the newer one with less raw compute but the right instruction set.
The Strix Halo Setup
The Strix Halo machine is an AMD Ryzen AI MAX+ 395 APU running Ubuntu 24.04 with ROCm 7.2. Key specs:
- GPU: Radeon 8060S integrated graphics, gfx1151, RDNA 3.5 architecture, 40 compute units
- Memory: 128 GB LPDDR5X unified, with 96 GB configured as GTT for GPU access
- ROCm: 7.2.0 with HIP runtime
- llama.cpp: Existing build at commit d82b7a7c1 (the mainline GGUF tools work fine for reading the model files)
The Strix Halo has one thing the P40s don't: the right instruction set. RDNA 3.5 includes native FP8 support and, more importantly, the gfx1151 target in ROCm can compile and run kernels that operate on quantized data without emulation. The integrated GPU isn't fast, but it's compatible. And because the memory is unified, there's no PCIe bus standing between the CPU and the weights.
I started by downloading the IQ1_S-XL variant from teamblobfish's HuggingFace repository. The download was 58 GB across two sharded GGUF files — 47 GB for the first shard and 11 GB for the second. HuggingFace's CDN delivered them in about ten minutes.
Building llama.cpp with DeepSeek V4 Support
The mainline llama.cpp repository doesn't have support for the deepseek4 architecture yet. There is a pull request, #22607, with a full implementation by the nisparks team that adds the model architecture, tensor definitions, and inference graph construction. The PR has since been closed, but the nisparks fork still carries the implementation on the pr/01-deepseek-v4-arch branch.
Cloning and building was straightforward:
    git clone https://github.com/nisparks/llama.cpp.git llama.cpp-dsv4
    cd llama.cpp-dsv4
    git checkout pr/01-deepseek-v4-arch
    mkdir build && cd build
    cmake .. \
      -DGGML_HIP=ON \
      -DAMDGPU_TARGETS=gfx1151 \
      -DGGML_HIP_NO_VMM=OFF \
      -DGGML_HIP_MMQ_MFMA=ON \
      -DCMAKE_BUILD_TYPE=Release
    cmake --build . --target llama-cli -j16
The GGML_HIP_NO_VMM=OFF flag is important. With VMM (Virtual Memory Management) enabled, the HIP backend can use the GTT memory pool — the 96 GB of system memory the GPU can address directly. Without it, only the 4 GB of dedicated VRAM carveout is visible, and the model simply won't fit.
The build took a couple of minutes on the 16-core Zen 5 processor. The resulting binary was a standard llama-cli but with the deepseek4 architecture baked in.
Tensor Name Mismatches and the Art of sed
This is where things got interesting. The GGUF files from teamblobfish were created with a different version of the llama.cpp converter than the one in the nisparks fork. The model architecture code and the GGUF tensor names didn't agree on what anything was called.
The first run failed with "unknown model architecture: deepseek4." The cause was a branch mismatch in the nisparks fork: the experiment/deepseek-v4-gguf-convert branch had the right GGUF conversion scripts but not the model code, while the pr/01-deepseek-v4-arch branch had the model code but expected different tensor names.
The mismatches fell into a few categories:
HC (hyper-connection) tensor naming: The GGUF uses output_hc_base, output_hc_fn, output_hc_scale for the output head hyper-connections. The code expected hc_head_base, hc_head_fn, hc_head_scale. Per-layer, the GGUF uses blk.N.hc_attn_base and blk.N.hc_ffn_base, which the code expected, but it needed the .weight suffix on the tensor lookup calls.
Compressor vs. compress naming: The GGUF names the compressed attention components attn_compressor_ape, attn_compressor_gate, etc. The code expected attn_compress_ape, attn_compress_gate. Same for indexer compressors: indexer_compressor_ape in the GGUF, indexer.compress_ape in the code (note the dot vs. underscore, too).
KV latent tensor: The GGUF named it attn_kv, the code expected attn_kv_latent.
Missing .weight suffixes: Several tensor creation calls in the model loader omitted the "weight" suffix when calling create_tensor(). For example, create_tensor(tn(LLM_TENSOR_ATTN_SINKS, i), ...) instead of create_tensor(tn(LLM_TENSOR_ATTN_SINKS, "weight", i), ...). The GGUF tensor names all ended in .weight or .bias, so every lookup that omitted the suffix returned null.
Each fix was a one-line change to src/llama-arch.cpp (the tensor name registry) or src/llama-model.cpp (the model loader), followed by a rebuild. The full list:
- hc_head_base → output_hc_base, hc_head_fn → output_hc_fn, hc_head_scale → output_hc_scale (3 lines)
- Added "weight" suffix to HC head, HC layer, attn_sinks, attn_compress_ape, indexer_compress_ape, ffn_exp_probs_b, ffn_gate_tid2eid tensor calls (10+ lines)
- attn_compress_* → attn_compressor_* (4 lines)
- indexer.compress_* → indexer_compressor_* (4 lines)
- attn_kv_latent → attn_kv (1 line)
- attn_out_* → was already attn_output_* in the code (no change needed)
After all fixes, the tensor count matched: 1,325 loaded, 1,325 expected. The model loaded without errors.
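The renames listed above can be summarized as a small lookup table. The sketch below is a hypothetical standalone helper, not code from the fork: the actual fixes were edits to the name tables in src/llama-arch.cpp and the create_tensor() calls in src/llama-model.cpp. It also assumes that a name with no suffix should default to ".weight".

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Map a tensor name as the loader code spelled it to the name the GGUF
// files use. Hypothetical helper for illustration only.
inline std::string to_gguf_name(std::string name) {
    assert(!name.empty());

    // (loader spelling, GGUF spelling), taken from the list of fixes.
    static const std::vector<std::pair<std::string, std::string>> renames = {
        {"hc_head_",          "output_hc_"},
        {"attn_compress_",    "attn_compressor_"},
        {"indexer.compress_", "indexer_compressor_"},
        {"attn_kv_latent",    "attn_kv"},
    };
    for (const auto& rule : renames) {
        const std::string::size_type pos = name.find(rule.first);
        if (pos != std::string::npos) {
            name.replace(pos, rule.first.size(), rule.second);
        }
    }

    // Lookups that omitted the suffix returned null, so add ".weight"
    // unless the name already ends in ".weight" or ".bias".
    const auto ends_with = [&name](const std::string& suffix) {
        return name.size() >= suffix.size() &&
               name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
    };
    if (!ends_with(".weight") && !ends_with(".bias")) {
        name += ".weight";
    }
    return name;
}
```

For example, `to_gguf_name("blk.0.indexer.compress_ape")` yields `blk.0.indexer_compressor_ape.weight`, and a name that is already in GGUF form passes through unchanged. Dumping the GGUF tensor names and diffing them against the output of a table like this is a quick way to see which lookups will still fail.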
GPU Binary Broadcast and Dequantization
The model loaded — and immediately crashed during prompt processing with an assertion failure in the GPU binary broadcast kernel:
ggml_cuda_op_bin_bcast: GGML_ASSERT(src1->type == GGML_TYPE_F32 || src1->type == GGML_TYPE_F16) failed
The IQ1_S quantization stores most tensors as iq1_m type, but the output projection and token embedding layers are q5_K — a 5-bit quantization format. When the model tried to add a bias or scale to one of these tensors during inference, the binary broadcast kernel received a q5_K tensor as src1 and had no idea what to do with it.
The binary broadcast kernel in ggml/src/ggml-cuda/binbcast.cu is designed for element-wise operations like add, multiply, and divide where src1 is broadcast across src0. It natively supports float32 and float16 for src1, but nothing quantized. The fix required three things:
- Adding #include "convert.cuh" to get access to the dequantize_row_q5_K_cuda and dequantize_row_q2_K_cuda functions
- Adding a ggml_backend_cuda_context & ctx parameter to ggml_cuda_op_bin_bcast so we could allocate temporary GPU memory from the pool
- Inserting a dequantization step before the binary operation:
    if (src1->type != GGML_TYPE_F32 && src1->type != GGML_TYPE_F16) {
        int64_t nelements = ggml_nelements(src1);
        float * src1_f32 = src1_f32_buf.alloc(ctx.pool(), nelements);
        to_fp32_cuda_t to_fp32 = ggml_get_to_fp32_cuda(src1->type);
        to_fp32(src1_dd, src1_f32, nelements, stream);
        src1_effective = (const void *)src1_f32;

        // Build float-compatible strides for the contiguous dequantized buffer
        memcpy(&src1_copy, src1, sizeof(ggml_tensor));
        src1_copy.type = GGML_TYPE_F32;
        src1_copy.nb[0] = sizeof(float);
        for (int d = 1; d < GGML_MAX_DIMS; d++) {
            src1_copy.nb[d] = src1_copy.ne[d-1] * src1_copy.nb[d-1];
        }
        src1_ptr = &src1_copy;
    }
The stride reconstruction was necessary because the original tensor's strides are computed for the quantized block layout — a q5_K block is 176 bytes, not 4 bytes like a float. Passing float data with quantized strides causes alignment assertion failures deeper in the launch pipeline. The fix constructs a temporary tensor view with float-compatible strides that match the dequantized data's memory layout.
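To see why the strides have to be rebuilt, it helps to compute both layouts for the same shape. The sketch below is standalone and does not use ggml; the type aliases and function names are hypothetical, and it assumes a contiguous tensor whose first dimension is a multiple of the block size. The q5_K figures (256 elements per 176-byte block) follow the block size quoted above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kMaxDims = 4;  // same value as GGML_MAX_DIMS
using Shape   = std::array<std::int64_t, kMaxDims>;  // ne[]: elements per dim
using Strides = std::array<std::size_t, kMaxDims>;   // nb[]: bytes per step

// Strides for a contiguous block-quantized tensor: nb[0] is the size of
// one block, and a row holds ne[0] / block_elems blocks.
inline Strides quantized_strides(const Shape& ne, std::size_t block_bytes,
                                 std::int64_t block_elems) {
    assert(block_elems > 0 && ne[0] % block_elems == 0);
    Strides nb{};
    nb[0] = block_bytes;
    nb[1] = nb[0] * static_cast<std::size_t>(ne[0] / block_elems);
    for (int d = 2; d < kMaxDims; ++d) {
        nb[d] = nb[d - 1] * static_cast<std::size_t>(ne[d - 1]);
    }
    return nb;
}

// Strides for the same shape after dequantizing into a contiguous float
// buffer; this mirrors the loop in the patch.
inline Strides float_strides(const Shape& ne) {
    Strides nb{};
    nb[0] = sizeof(float);
    for (int d = 1; d < kMaxDims; ++d) {
        nb[d] = nb[d - 1] * static_cast<std::size_t>(ne[d - 1]);
    }
    return nb;
}
```

For a 4096 x 8 tensor, the q5_K row stride is 16 blocks x 176 bytes = 2,816 bytes, while the float row stride is 4096 x 4 = 16,384 bytes. A kernel that walks the float buffer using the quantized strides would read the wrong rows, which is the mismatch the temporary tensor view avoids.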
This patch is specific to the nisparks fork's version of the CUDA backend and will need to be re-applied or reimplemented differently when deepseek4 support lands in mainline llama.cpp. But it worked.
The First Successful Run
With all fixes applied and the binary compiled, the model loaded in about 30 seconds — the time it takes to allocate 58 GB of GPU buffers through the VMM allocator and initialize the compute graph. Memory breakdown showed 58,337 MiB for the model weights, 18 MiB for context, and a tiny compute buffer.
I sent a simple prompt:
What is 2+2? Answer briefly.
The model responded with [Start thinking] followed by a series of <<<<<<<<<< characters — the DeepSeek thinking prefix. It was actually processing. The IQ1_S quantization is aggressive enough that the model occasionally gets stuck in token repetition loops, which is a known artifact of extreme compression. But the fundamental achievement was clear: a 284B-parameter model, running inference on a consumer APU, completely locally.
The generation speed was not measured precisely in this initial test — I was focused on getting it running at all rather than benchmarking throughput. But qualitatively, token generation was slow, on the order of 1-2 tokens per second. The Strix Halo's LPDDR5X memory bandwidth (~120 GB/s) is the binding constraint. Each token requires streaming the active expert weights through the compute units, and 13 billion activated parameters at 1.6 bits per weight means roughly 2.6 GB of data movement per token. At 120 GB/s, that's a theoretical maximum of about 46 tokens per second, but the reality is much lower due to the overhead of the attention mechanism, the non-expert weights that also need to be accessed, and the compute time for the matrix multiplies. In practice, expect single-digit tokens per second.
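The back-of-the-envelope ceiling above can be written out as two small functions. This is a rough sketch with hypothetical names: it assumes that only the activated parameters are streamed once per token and that memory bandwidth is the sole limit, which is exactly why the real number comes out much lower.

```cpp
#include <cassert>

// Bytes that must be read per generated token if only the activated
// parameters are streamed, at an average bit width.
inline double bytes_per_token(double active_params, double bits_per_weight) {
    assert(active_params > 0.0 && bits_per_weight > 0.0);
    return active_params * bits_per_weight / 8.0;
}

// Upper bound on tokens per second when memory bandwidth is the only limit.
inline double bandwidth_bound_tokens_per_s(double bandwidth_bytes_per_s,
                                           double bytes_per_tok) {
    assert(bandwidth_bytes_per_s > 0.0 && bytes_per_tok > 0.0);
    return bandwidth_bytes_per_s / bytes_per_tok;
}
```

With 13e9 activated parameters at 1.6 bits, `bytes_per_token` returns 2.6e9, and dividing 120e9 bytes per second by that gives roughly 46 tokens per second. Plugging in a measured bandwidth instead of the nominal one is an easy way to get a more realistic ceiling.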
Why This Matters
There's a narrative in AI hardware discourse that goes something like: to run large models, you need large GPUs. Multiple H100s. A DGX. Cloud credits. The reality, increasingly, is that model compression and inference optimization are advancing faster than model size growth.
DeepSeek V4 Flash in IQ1_S quantization runs on a system you can buy for around $3,500. It occupies 58 GB of disk space and fits within 65 GB of GPU-addressable memory. It's not fast. It's not production-ready. But it runs.
The contrast with the P40s is instructive. Those cards have more raw teraflops, more memory bandwidth (346 GB/s per card vs. ~120 GB/s shared), and a dedicated inference pedigree — the P40 was literally designed for this kind of workload. But they lack the instruction set support that modern quantized models depend on. Hardware compatibility is becoming more important than hardware capability. FP8 isn't just a nice-to-have; it's table stakes for running 2026-vintage models.
The AMD Strix Halo platform, for all its compromises (limited memory bandwidth, an immature software ecosystem, the eternal dance with ROCm version pinning), is the platform where this experiment worked. The unified memory architecture means you never have to think about PCIe transfers. The RDNA 3.5 instruction set means FP8 just works. And the 96 GB of GPU-addressable GTT memory means a model that needs roughly 60 GB of buffer allocation actually fits.
One wrinkle worth mentioning: the VMM (Virtual Memory Management) allocator in this build doesn't survive CUDA graph warmup. Running without --no-warmup triggers a HipVMM Failure: invalid argument inside ggml_cuda_pool_vmm::alloc during the warmup's matrix-vector multiply pass. The first token inference itself works — it's the graph capture that fails, likely because the VMM allocator can't satisfy the warmup's allocation pattern within the remaining pool space. Skipping warmup with --no-warmup avoids the issue entirely at the cost of slightly higher per-token latency. For a research setup running a 284B-parameter model at single-digit tokens per second, that trade-off is easy to accept.
What's Next
The nisparks fork's deepseek4 support will eventually make its way into mainline llama.cpp. When that happens, Ollama will be able to import the model directly with a simple Modelfile, and the whole setup process will collapse to an ollama create command. That's the dream, at least.
In the meantime, there's room for improvement on the performance side. The IQ1_M quantization, at 64 GB, is just barely too large for the Strix Halo's 65.2 GB of HIP-visible VRAM — the weight allocation alone needs 67.5 GB. A custom quantization that splits the difference — say, IQ1_M for the attention layers and IQ1_S for the experts — might squeeze into memory while preserving more quality. The CPU binary ops for quantized types also need attention; the current code path fails on CPU-offloaded layers because the CPU backend doesn't support binary operations on q8_0 tensors.
And then there's the bigger picture. DeepSeek V4 Pro, at 1.6 trillion parameters with 49 billion activated, is the next target. Even IQ1_S quantization of that model would be around 300 GB — beyond the reach of any single consumer device today, but within striking distance of multi-node setups or upcoming memory standards. The gap between what's possible and what's practical keeps shrinking.
For now, the takeaway is simple: if you have a Strix Halo machine and a willingness to patch a few tensor names and GPU kernels, you can run one of the most capable open-source language models ever released, completely offline, in your own home. That felt like science fiction three years ago. Today it's just an afternoon of debugging.
Reproducing This
If you want to try this yourself, here's the condensed setup:
    # Clone and build the nisparks llama.cpp fork
    git clone https://github.com/nisparks/llama.cpp.git llama.cpp-dsv4
    cd llama.cpp-dsv4
    git checkout pr/01-deepseek-v4-arch

    # Apply tensor name fixes (see this post for details)

    # Build with HIP VMM enabled for GTT memory access
    mkdir build && cd build
    cmake .. \
      -DGGML_HIP=ON \
      -DAMDGPU_TARGETS=gfx1151 \
      -DGGML_HIP_NO_VMM=OFF \
      -DCMAKE_BUILD_TYPE=Release
    cmake --build . --target llama-cli -j16

    # Download IQ1_S GGUF
    pip install huggingface_hub
    hf download teamblobfish/DeepSeek-V4-Flash-GGUF \
      IQ1_S-XL/ --local-dir ./model

    # Run
    ./bin/llama-cli \
      -m ./model/IQ1_S-XL/DeepSeek-V4-Flash-IQ1_S-XL-00001-of-00002.gguf \
      -p "Your prompt here" -n 100 -ngl 99 -c 256 --temp 0 -sp
The full set of code changes (tensor name mappings, suffix additions, and the binbcast kernel fix) is detailed in the sections above. A consolidated patch will be available once these changes stabilize.