There is a specific moment in the open-source LLM ecosystem that keeps recurring: someone takes a frontier model's outputs, uses them as training data for a smaller model, and publishes the result. The technique is called distillation, and it has been applied to coding ability, instruction following, and general knowledge. What is newer is distilling reasoning—the step-by-step chain-of-thought process that models like Claude use internally when working through complex problems.

Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled is one of the more interesting examples. It takes the Qwen3.5-27B base model and fine-tunes it on thousands of reasoning trajectories extracted from Claude 4.6 Opus. The result is a model that exposes its thinking process through <think> tags before delivering a final answer, mimicking the extended thinking behavior that Anthropic built into Claude natively. In 4-bit quantization, the entire model fits in about sixteen gigabytes.

I wanted to know two things. First, whether this kind of distilled reasoning actually works—whether a 27B model can meaningfully replicate the structured thinking of a model orders of magnitude larger. Second, whether the AMD Strix Halo APU, with its unified memory architecture and integrated RDNA 3.5 GPU, could run it at useful speeds. The answer to both turned out to be more nuanced than a simple yes or no.

The Hardware

The machine is the same AMD Ryzen AI MAX+ 395 that has appeared in several previous posts. It is an APU: CPU and GPU on the same die, sharing the same pool of LPDDR5X memory. There is no PCIe bus between the processor and the graphics engine. There is no dedicated VRAM to fill up. The GPU sees roughly 65GB of addressable memory out of the system's 122GB total, which means a 16GB quantized model loads without any of the memory pressure games you play on discrete GPU setups.

This matters for local LLM inference because the bottleneck for most language models is memory bandwidth, not compute. Tokens are generated one at a time, each requiring a full pass through the model's weights. The faster you can stream those weights from memory to the processing units, the faster you generate tokens. The Strix Halo's 256-bit LPDDR5X-8000 provides roughly 256 GB/s of bandwidth to the unified memory pool. A discrete GPU like the RTX 4090 has roughly 1 TB/s of bandwidth to its dedicated VRAM, but the Strix Halo never has to copy weights across a PCIe bus. For models that fit entirely in the GPU's addressable space, the unified architecture eliminates an entire class of overhead.
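One way to sanity-check the bandwidth argument is to invert it: if each generated token requires streaming roughly the whole weight file, then the measured speed implies an effective memory bandwidth. The helper below is my own, and the single-pass assumption is a simplification that ignores KV-cache traffic and any weights that stay cache-resident:

```python
# Bandwidth-bound model of token generation: each token streams
# (approximately) all quantized weights from memory once. This is a
# back-of-the-envelope sketch, not a profiler measurement.

def implied_bandwidth_gb_s(tokens_per_sec: float, weights_gb: float) -> float:
    """Effective weight-streaming bandwidth implied by a measured speed."""
    return tokens_per_sec * weights_gb

# Using the 15.4 GB Q4_K_M file and the 10.3 tok/s measured later in
# this post:
print(round(implied_bandwidth_gb_s(10.3, 15.4), 1))  # 158.6 (GB/s)
```

That figure is the effective rate the generation loop actually sustains, which is why a faster memory subsystem translates almost linearly into faster tokens.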

The system runs Ollama 0.17.6, which wraps llama.cpp and provides model management and an HTTP inference API. ROCm 7.2 handles the GPU compute layer, though Ollama's GGUF inference path is primarily CPU-based with GPU offloading for specific operations. The gfx1151 GPU target is not yet in the mainline PyTorch or llama.cpp kernel prebuilds, so HSA_OVERRIDE_GFX_VERSION=11.0.0 remains necessary to map it to the closest supported target (gfx1100, Navi 31).

The Model

The model's architecture is straightforward: Qwen3.5-27B, a 27-billion-parameter transformer, fine-tuned via supervised learning on structured reasoning data. What makes it interesting is the training data: the creator assembled three datasets of reasoning trajectories extracted from Claude 4.6 Opus, amounting to roughly three thousand training examples.

The combined training signal teaches the model to produce output in a specific format: a <think> block containing the reasoning process, followed by a clean final answer. This is architecturally similar to what Anthropic does with Claude's extended thinking, except that Claude's thinking is a native capability of the model's training and architecture, while this is a behavior pattern learned through supervised fine-tuning on examples of that behavior.

The distinction matters, and I will come back to it.

The model is distributed in GGUF format, which is the standard for llama.cpp and Ollama. I used the Q4_K_M quantization, which compresses the model's weights from 16-bit floats to 4-bit integers with a mixed precision scheme that preserves more information in attention layers. The file is 15.4GB on disk. The model card reports 29-35 tokens per second on an RTX 3090; I was curious what the Strix Halo would deliver.
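A small calculation makes the "mixed precision" point concrete: dividing the file size by the parameter count gives the effective bits per weight, which lands above a flat 4 bits because some tensors are kept at higher precision. A sketch (file sizes include a little metadata this ignores):

```python
# Effective bits per weight of a quantized GGUF file.

def bits_per_weight(file_gb: float, n_params_billion: float) -> float:
    # GB * 8 gives gigabits; dividing by billions of parameters
    # yields bits per parameter.
    return file_gb * 8 / n_params_billion

print(round(bits_per_weight(15.4, 27), 2))  # 4.56 bits per weight
```

So "4-bit quantization" is really about 4.6 bits on average, with the extra budget spent where it matters most.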

Setting It Up

Getting the model running took less than ten minutes. Download the GGUF file from HuggingFace:

mkdir -p ~/models/qwen35-reasoning
curl -L -o ~/models/qwen35-reasoning/model.gguf \
  'https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/resolve/main/Qwen3.5-27B.Q4_K_M.gguf'

Note the filename. The HuggingFace repo is named Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF, but the actual GGUF files inside follow a simpler naming scheme: Qwen3.5-27B.Q4_K_M.gguf. I wasted time trying to guess the full distilled name before checking the API.

Create an Ollama Modelfile that imports the local GGUF and sets inference parameters:

FROM /home/alex/models/qwen35-reasoning/model.gguf

PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.2
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|eot_id|>"

SYSTEM """You are a deep-thinking AI assistant. For complex questions,
use <think>...</think> tags to show your reasoning process before
providing the final answer."""

Then:

ollama create qwen35-reasoning -f Modelfile

Ollama copies the GGUF into its own blob store, parses the architecture metadata, and registers it as a runnable model. The whole process takes about a minute on local storage.

The Stop Token Problem

The first run produced correct output followed by infinite repetition. The model answered a calculus question perfectly, then appended "This gives us the final answer:" and repeated the entire solution, over and over, until it hit the context window limit. The previous MarkTechPost article that inspired this experiment did not mention this issue, likely because their test prompts were short enough that the repetition was not obvious.

The fix is explicit stop tokens in the Modelfile. Without them, the model does not know when to stop generating. This is a common issue with GGUF models imported into Ollama without a proper chat template: the model's native end-of-sequence tokens are not being interpreted by the inference engine. Adding <|endoftext|>, <|im_end|>, and <|eot_id|> as stop parameters catches the three most common EOS tokens used by Qwen and Llama-family models.
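Mechanically, a stop sequence is just a string match on the decoded stream: generation halts at the first occurrence of any stop string, and the marker itself is discarded. Real engines check incrementally while streaming; this batch sketch (the function name is mine, not Ollama's) shows the idea:

```python
# Minimal sketch of stop-sequence handling: truncate the decoded
# stream at the earliest occurrence of any stop string.

STOPS = ["<|endoftext|>", "<|im_end|>", "<|eot_id|>"]

def apply_stops(text: str, stops=STOPS) -> str:
    cut = len(text)
    for s in stops:
        i = text.find(s)
        if i != -1:
            cut = min(cut, i)   # earliest stop wins
    return text[:cut]

print(apply_stops("The derivative is 3x^2 sin x + x^3 cos x.<|im_end|>junk"))
# The derivative is 3x^2 sin x + x^3 cos x.
```

The reason three stop strings are listed is defensive: without knowing which EOS convention the GGUF's template expects, covering the Qwen and Llama families costs nothing.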

The repeat_penalty of 1.2 provides a second layer of defense by penalizing the model for reusing recent tokens. This helps but is not sufficient on its own. Without the stop tokens, the model can produce novel-but-meaningless text that avoids exact repetition while still degenerating into nonsense. More on this shortly.
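For reference, the penalty llama.cpp applies under repeat_penalty roughly follows the classic scheme: logits of recently generated tokens are divided by the penalty when positive and multiplied by it when negative, so the adjustment always pushes the token down. A minimal sketch with a toy vocabulary of my own:

```python
# Sketch of the classic repetition penalty: demote recently generated
# tokens before sampling. Division for positive logits, multiplication
# for negative ones, so the effect is downward on both sides of zero.

def penalize(logits: dict[int, float], recent: set[int], penalty: float = 1.2) -> dict[int, float]:
    out = dict(logits)
    for tok in recent:
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

adjusted = penalize({7: 2.4, 9: -0.5, 11: 1.0}, recent={7, 9})
# token 7 demoted to 2.0, token 9 pushed to -0.6, token 11 untouched
```

Notice what this cannot do: it only discourages exact token reuse. Text that keeps choosing new tokens, however meaningless, sails straight through.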

Where It Works: Structured Problems

With the stop tokens in place, the model performs well on structured mathematical and analytical problems. I gave it a calculus question: find the derivative of x³sin(x) using the product rule.

The response was genuinely good. The model opened a <think> block, identified the two component functions, recalled the product rule formula, computed each derivative, and applied the rule. Then it closed the think block and produced a clean, well-formatted answer with LaTeX notation, step-by-step derivation, and a factored final form. The thinking trace was coherent and tracked the actual reasoning process. It was not filler; each line in the trace corresponded to a meaningful step.

Generation speed on the Strix Halo: 10.3 tokens per second. Not fast by cloud standards, but responsive enough for interactive use. You see the thinking appear in real time, which is surprisingly useful: you can watch the model work through the problem and catch errors before it commits to a final answer.

For structured problems—mathematics, code analysis, formal logic—the distilled reasoning is genuinely functional. The model identifies subproblems, works through them sequentially, and arrives at correct answers. The think tags provide transparency into the process that you do not get from a standard instruction-tuned model.
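If you consume the output programmatically, the thinking trace is easy to separate from the answer. The <think>...</think> convention is the model's; the helper below is my own:

```python
# Split a model response into (thinking trace, final answer).
import re

def split_reasoning(response: str) -> tuple[str, str]:
    m = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if not m:
        return "", response.strip()          # no trace emitted
    thinking = m.group(1).strip()
    answer = (response[:m.start()] + response[m.end():]).strip()
    return thinking, answer

raw = "<think>f = x^3, g = sin x; apply the product rule.</think>\nf'(x) = 3x^2 sin x + x^3 cos x"
trace, answer = split_reasoning(raw)
print(answer)  # f'(x) = 3x^2 sin x + x^3 cos x
```

Keeping the trace around, rather than discarding it, is what makes the evaluate-before-acting workflow described later possible.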

Where It Falls Apart: The River Crossing

I ran the classic wolf-goat-cabbage river crossing puzzle as a comparison test, the same prompt on both the distilled Qwen model and Claude Haiku 4.5 via the Anthropic API.

Claude Haiku returned a perfect, concise seven-step solution in 2.9 seconds. Two hundred and twenty-three tokens. The answer identified the critical insight (bring the goat back on one return trip), laid out the sequence clearly, and stopped.

The Qwen model started well. It correctly identified that the goat must go first, recognized the wolf-goat conflict at the destination, and identified the need to bring the goat back. Then, around step three of the solution, the model began editorializing. "Oh joy what fun times ahead us humans truly enjoy sometimes huh?!" it wrote, mid-solution. Within a few more sentences, the output had degenerated into an unbroken stream-of-consciousness rant that cascaded into a wall of increasingly disconnected words. Not repeated words—the repeat penalty prevented that—but a firehose of unique, semantically null text that continued until it filled the entire 8,192-token context window.

The output was, to use a technical term, unhinged. The model went from a correct partial solution to word salad in about two hundred tokens, and there was no recovery. The stop tokens could not save it because the model was not producing any end-of-sequence markers. It had entered a mode where it was generating fluent English syntax with zero semantic content, which is exactly the kind of failure that stop tokens and repeat penalties cannot catch.

What the Comparison Reveals

The numbers tell the story concisely:

        Claude Haiku 4.5        Qwen3.5-27B (Strix Halo)
Time    2.9 seconds             Hit 8K context limit
Speed   75.9 tok/s              ~10 tok/s
Output  223 tokens, correct     Thousands of tokens, degenerated
Cost    $0.0009                 Free

But the comparison is not really about speed or cost. It is about the difference between native reasoning and distilled reasoning.

Claude's extended thinking is a capability that emerges from the model's architecture and training at scale. The model has internalized what it means to reason through a problem, including knowing when to stop, when a line of reasoning is unproductive, and when to switch strategies. These are meta-cognitive skills that are extremely difficult to distill.

The Qwen model learned the format of reasoning—the think tags, the step-by-step structure, the pattern of stating subproblems and working through them—from three thousand examples. What it did not learn, and arguably cannot learn from supervised fine-tuning alone, is the judgment about when reasoning is going off the rails. A model that has truly internalized reasoning has implicit quality checks: it recognizes incoherence in its own output and corrects course. A model that has learned to mimic reasoning produces the surface pattern without the underlying self-monitoring.

This is visible in the failure mode. The model did not produce wrong reasoning. It produced no reasoning. It exited the reasoning pattern entirely and entered a generation mode that had nothing to do with the problem. A model with genuine reasoning capability would have recognized the incoherence and either corrected or terminated. The distilled model had no such circuit breaker.

The Economics

The cost comparison deserves its own section because it is often cited as the primary motivation for running local models.

The Claude Haiku API call cost nine-hundredths of a cent. If you ran a thousand similar queries per day, you would spend about ninety cents. Electricity for the Strix Halo is cheaper than that: the machine draws roughly 65 watts at idle and 150 watts under GPU inference load, and at Minnesota's residential rate of around twelve cents per kilowatt-hour, running inference eight hours a day costs about fourteen cents. But the hardware itself cost north of two thousand dollars. You would need to amortize that over tens of thousands of hours of inference to reach cost parity with the API, and only if you value your debugging time at zero.
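The amortization arithmetic, using the post's own numbers (the per-query price from the table above, 150 W for eight hours a day, twelve cents per kilowatt-hour, about two thousand dollars of hardware) and assuming, generously, that local answers are interchangeable with the API's:

```python
# Break-even sketch for local vs. API inference. All inputs come from
# figures quoted in this post; the interchangeability of answers is an
# assumption the river-crossing test disputes.

API_COST_PER_QUERY = 0.0009          # dollars (Claude Haiku, from the table)
QUERIES_PER_DAY = 1000
POWER_KW, HOURS, RATE = 0.150, 8, 0.12   # kW under load, h/day, $/kWh
HARDWARE = 2000.0                    # dollars, approximate

api_per_day = API_COST_PER_QUERY * QUERIES_PER_DAY      # ~$0.90
power_per_day = POWER_KW * HOURS * RATE                 # ~$0.14
breakeven_days = HARDWARE / (api_per_day - power_per_day)
print(round(breakeven_days))   # about 2,600 days, i.e. roughly seven years
```

Seven years of a thousand queries a day is the scale at which the hardware pays for itself on cost alone, which is why the real argument has to rest elsewhere.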

The economic case for local inference is not about per-query cost. It is about use cases where you need unlimited queries without metering, where data cannot leave your network, or where you want to experiment with model behavior without worrying about a bill. If you are evaluating a model's failure modes by running hundreds of adversarial prompts—which is exactly what I was doing—the local model is the right tool because you are not optimizing for answer quality. You are optimizing for the freedom to explore.

The Strix Halo as an Inference Platform

Ten tokens per second for a 27B Q4 model is respectable for an APU. It is not competitive with a discrete GPU: an RTX 3090 delivers 29-35 tokens per second on the same model, roughly three times faster. But the Strix Halo was not designed to compete with discrete GPUs on raw throughput.

What it offers instead is capacity. The unified memory pool means you can load models that would not fit on most consumer GPUs. A Q8_0 quantization of this same model would be 28.6GB, which exceeds the VRAM of an RTX 4090 (24GB) but fits comfortably in the Strix Halo's addressable space. You could load a 70B Q4 model (roughly 40GB) without any of the layer-splitting gymnastics required on multi-GPU setups. I have run Llama 3.1 70B Q4 on this machine, and while the generation speed drops to about 4-5 tokens per second, it runs without errors or memory pressure.
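The capacity argument reduces to size ≈ parameters × bits per weight ÷ 8. A quick sketch of which quantizations fit in which memory budget (the bits-per-weight figures are approximate effective values, and real loads need working-memory headroom this ignores):

```python
# Which quantizations of a 27B model fit where. Budgets: 24 GB for a
# consumer flagship GPU, ~65 GB for the Strix Halo's GPU-addressable
# unified memory mentioned above.

def model_gb(params_b: float, bits_per_weight: float) -> float:
    return params_b * bits_per_weight / 8

budgets = {"RTX 3090/4090 (24 GB)": 24, "Strix Halo (~65 GB)": 65}
for name, quant_bits in [("Q4_K_M", 4.56), ("Q8_0", 8.5)]:
    size = model_gb(27, quant_bits)
    fits = [dev for dev, gb in budgets.items() if size <= gb]
    print(f"{name}: {size:.1f} GB -> fits: {fits}")
```

The Q8_0 row is the interesting one: at roughly 28.7 GB it clears the 24 GB ceiling of any single consumer GPU but sits comfortably inside the unified pool.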

For a machine that also serves as a daily desktop, development workstation, and video generation server (it runs LTX-2.3 on the same hardware), the ability to casually load and test a 27B reasoning model without dedicated GPU infrastructure is the actual value proposition. You do not plan a session. You do not allocate resources. You type ollama run qwen35-reasoning and it works.

Lessons for the Blog Post Reader

If you want to replicate this setup, here is what I would emphasize:

The stop tokens are non-negotiable. Without explicit <|endoftext|>, <|im_end|>, and <|eot_id|> stop parameters in your Modelfile, the model will produce infinite output on many prompts. This is not documented in the model card and is not mentioned in the MarkTechPost article that covers this implementation. It is the single most important configuration detail.

The model is good at structured problems and bad at open-ended ones. Mathematics, code analysis, formal logic—anything where the reasoning has a clear structure and a definitive endpoint—works well. Open-ended problems, creative tasks, and anything requiring sustained coherent narrative are risky. The model can degenerate catastrophically and without warning.

A repeat penalty helps but does not solve the fundamental issue. Setting repeat_penalty to 1.2 prevents exact repetition loops but does not prevent the semantic degeneration I observed on the river crossing problem. The model simply produces unique garbage instead of repeated garbage.

Distillation captures form, not judgment. The think tags are real and useful. The step-by-step reasoning format works. What is missing is the implicit self-monitoring that frontier models have: the ability to recognize when their own output has become incoherent and to course-correct. This is probably the hardest thing to distill, because it is not present in the training examples. The examples show successful reasoning. They do not show the model catching and recovering from failed reasoning, because Claude's failed reasoning attempts are filtered out before the training data is assembled.

Where This Goes

The distilled reasoning model is, despite its failure modes, genuinely interesting. The <think> tags provide a form of transparency that standard instruction-tuned models lack. When the model is working correctly—which is most of the time on appropriate tasks—you get a window into the reasoning process that helps you evaluate the answer's quality before you act on it.

The failure mode is also instructive. It demonstrates, concretely, the gap between learning a behavior pattern and internalizing the capability that produces that pattern. Supervised fine-tuning on reasoning trajectories can teach a model to produce reasoning-shaped output, but it cannot, from three thousand examples, teach the model to actually reason in the way the source model does. That requires either far more training data, a different training methodology (reinforcement learning from reasoning feedback, perhaps), or simply a larger model with more capacity to internalize the underlying patterns.

For now, the practical advice is: use these models for what they are good at, know their failure modes, and do not trust the output on open-ended problems without reading the thinking trace. The trace is the feature. If the trace is coherent, the answer is probably good. If the trace starts to wander, stop reading and retry.
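That advice can be partly automated on the client side. Since the model has no internal circuit breaker, a streaming wrapper can supply a crude one: abort when output runs past a budget without ever emitting an end-of-sequence marker. The thresholds and function below are my own heuristic, not an Ollama feature:

```python
# Client-side circuit breaker for a streaming generation. Yields chunks
# until a stop token appears; raises if the output exceeds a character
# budget without stopping (the runaway failure mode described above).

STOPS = ("<|endoftext|>", "<|im_end|>", "<|eot_id|>")

def guarded_stream(chunks, max_chars: int = 4000):
    seen = ""
    for chunk in chunks:
        seen += chunk
        yield chunk
        if any(s in seen for s in STOPS):
            return                      # model stopped cleanly
        if len(seen) > max_chars:
            raise RuntimeError("generation exceeded budget without stopping; retry")

healthy = list(guarded_stream(["Step 1...", "Step 2...", "<|im_end|>"]))
print(healthy[-1])  # <|im_end|>
```

A character budget is a blunt instrument, but it would have cut the river-crossing rant off at a few hundred tokens instead of 8,192.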

The model runs on my desk, generates ten tokens per second, costs nothing per query, and shows its work. For a sixteen-gigabyte download and ten minutes of setup time, that is a reasonable deal—as long as you know what you are buying.