<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TinyComputers.io (Posts about inference)</title><link>https://tinycomputers.io/</link><description></description><atom:link href="https://tinycomputers.io/categories/inference.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 A.C. Jokela 
&lt;!-- div style="width: 100%" --&gt;
&lt;a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"&gt;&lt;img alt="" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /&gt; Creative Commons Attribution-ShareAlike&lt;/a&gt;&amp;nbsp;|&amp;nbsp;
&lt;!-- /div --&gt;
</copyright><lastBuildDate>Mon, 06 Apr 2026 22:13:01 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Distilled Reasoning on Strix Halo: Running a Claude-Trained Thinking Model Locally</title><link>https://tinycomputers.io/posts/distilled-reasoning-on-strix-halo-qwen35-claude-thinking.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/distilled-reasoning-on-strix-halo-qwen35-claude-thinking_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;27 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;There is a specific moment in the open-source LLM ecosystem that keeps recurring: someone takes a frontier model's outputs, uses them as training data for a smaller model, and publishes the result. The technique is called distillation, and it has been applied to coding ability, instruction following, and general knowledge. What is newer is distilling &lt;em&gt;reasoning&lt;/em&gt;—the step-by-step chain-of-thought process that models like Claude use internally when working through complex problems.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"&gt;Jackrong's Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled&lt;/a&gt; is one of the more interesting examples. It takes the Qwen3.5-27B base model and fine-tunes it on thousands of reasoning trajectories extracted from Claude 4.6 Opus. The result is a model that exposes its thinking process through &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags before delivering a final answer, mimicking the extended thinking behavior that Anthropic built into Claude natively. In 4-bit quantization, the entire model fits in about sixteen gigabytes.&lt;/p&gt;
&lt;p&gt;I wanted to know two things. First, whether this kind of distilled reasoning actually works—whether a 27B model can meaningfully replicate the structured thinking of a model orders of magnitude larger. Second, whether the AMD Strix Halo APU, with its unified memory architecture and integrated RDNA 3.5 GPU, could run it at useful speeds. The answer to both turned out to be more nuanced than a simple yes or no.&lt;/p&gt;
&lt;h3&gt;The Hardware&lt;/h3&gt;
&lt;p&gt;The machine is the same &lt;a href="https://tinycomputers.io/posts/amd-ai-max+-395-system-review-a-comprehensive-analysis.html"&gt;AMD Ryzen AI MAX+ 395&lt;/a&gt; that has appeared in several previous posts. It is an APU: CPU and GPU on the same die, sharing the same pool of LPDDR5X memory. There is no PCIe bus between the processor and the graphics engine. There is no dedicated VRAM to fill up. The GPU sees roughly 65GB of addressable memory out of the system's 122GB total, which means a 16GB quantized model loads without any of the memory pressure games you play on discrete GPU setups.&lt;/p&gt;
&lt;p&gt;This matters for local LLM inference because the bottleneck for most language models is memory bandwidth, not compute. Tokens are generated one at a time, each requiring a full pass through the model's weights. The faster you can stream those weights from memory to the processing units, the faster you generate tokens. The Strix Halo's 256-bit LPDDR5X provides roughly 256 GB/s of peak bandwidth to the unified memory pool. A discrete GPU like the RTX 4090 has roughly 1 TB/s of bandwidth to its dedicated VRAM, but the Strix Halo never has to copy weights across a PCIe bus. For models that fit entirely in the GPU's addressable space, the unified architecture eliminates an entire class of overhead.&lt;/p&gt;
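&lt;p&gt;A back-of-envelope sketch makes the bandwidth argument concrete: if each generated token requires streaming the full set of quantized weights once, then bandwidth divided by model size gives a rough ceiling on decode speed. The figures below are the approximate numbers used in this post, not measurements of your hardware:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Rough upper bound on decode speed for a bandwidth-bound model.
# Assumes one full pass over the weights per token; real decoding also
# touches the KV cache, so the practical ceiling is somewhat lower.
model_size_gb = 15.4    # Q4_K_M weights on disk
bandwidth_gbs = 256.0   # approximate Strix Halo LPDDR5X peak

ceiling = bandwidth_gbs / model_size_gb
print(f"bandwidth-bound ceiling: {ceiling:.1f} tok/s")  # ~16.6 tok/s
&lt;/pre&gt;&lt;/div&gt;
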
&lt;p&gt;The system runs Ollama 0.17.6, which wraps llama.cpp and provides model management and an HTTP inference API. ROCm 7.2 handles the GPU compute layer, though Ollama's GGUF inference path is primarily CPU-based with GPU offloading for specific operations. The &lt;code&gt;gfx1151&lt;/code&gt; GPU target is not yet in the mainline PyTorch or llama.cpp kernel prebuilds, so &lt;code&gt;HSA_OVERRIDE_GFX_VERSION=11.0.0&lt;/code&gt; remains necessary to map it to the closest supported target (gfx1100, Navi 31).&lt;/p&gt;
&lt;h3&gt;The Model&lt;/h3&gt;
&lt;p&gt;The model's architecture is straightforward: Qwen3.5-27B, a 27 billion parameter transformer, fine-tuned via supervised learning on structured reasoning data. What makes it interesting is the training data. The creator assembled three datasets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/datasets/nohurry/Opus-4.6-Reasoning-3000x-filtered"&gt;Opus-4.6-Reasoning-3000x-filtered&lt;/a&gt;&lt;/strong&gt;: Three thousand reasoning trajectories extracted from Claude 4.6 Opus, filtered for quality.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/datasets/TeichAI/claude-4.5-opus-high-reasoning-250x"&gt;claude-4.5-opus-high-reasoning-250x&lt;/a&gt;&lt;/strong&gt;: Two hundred and fifty examples of high-intensity structured reasoning from an earlier Claude version.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;a href="https://huggingface.co/datasets/Jackrong/Qwen3.5-reasoning-700x"&gt;Qwen3.5-reasoning-700x&lt;/a&gt;&lt;/strong&gt;: Seven hundred step-by-step problem-solving examples.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The combined training signal teaches the model to produce output in a specific format: a &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; block containing the reasoning process, followed by a clean final answer. This is architecturally similar to what Anthropic does with Claude's extended thinking, except that Claude's thinking is a native capability of the model's training and architecture, while this is a behavior pattern learned through supervised fine-tuning on examples of that behavior.&lt;/p&gt;
&lt;p&gt;The distinction matters, and I will come back to it.&lt;/p&gt;
&lt;p&gt;The model is distributed in GGUF format, which is the standard for llama.cpp and Ollama. I used the Q4_K_M quantization, which compresses the model's weights from 16-bit floats to 4-bit integers with a mixed precision scheme that preserves more information in attention layers. The file is 15.4GB on disk. The &lt;a href="https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"&gt;model card&lt;/a&gt; reports 29-35 tokens per second on an RTX 3090; I was curious what the Strix Halo would deliver.&lt;/p&gt;
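&lt;p&gt;The 15.4GB file size follows from simple arithmetic. A rough estimate, assuming Q4_K_M averages about 4.5 bits per weight once the higher-precision tensors and quantization scales are folded in:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Rough size estimate for a Q4_K_M GGUF; ~4.5 bits per weight is an
# approximation, not a spec.
params = 27e9
bits_per_weight = 4.5
size_gb = params * bits_per_weight / 8 / 1e9
print(f"estimated size: {size_gb:.1f} GB")  # ~15.2 GB, close to the 15.4GB on disk
&lt;/pre&gt;&lt;/div&gt;
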
&lt;h3&gt;Setting It Up&lt;/h3&gt;
&lt;p&gt;Getting the model running took less than ten minutes. Download the GGUF file from HuggingFace:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;~/models/qwen35-reasoning
curl&lt;span class="w"&gt; &lt;/span&gt;-L&lt;span class="w"&gt; &lt;/span&gt;-o&lt;span class="w"&gt; &lt;/span&gt;~/models/qwen35-reasoning/model.gguf&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="s1"&gt;'https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF/resolve/main/Qwen3.5-27B.Q4_K_M.gguf'&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note the filename. The HuggingFace repo is named &lt;code&gt;Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF&lt;/code&gt;, but the actual GGUF files inside follow a simpler naming scheme: &lt;code&gt;Qwen3.5-27B.Q4_K_M.gguf&lt;/code&gt;. I wasted time trying to guess the full distilled name before checking the API.&lt;/p&gt;
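&lt;p&gt;If you would rather not guess at all, the Hugging Face Hub API will list the filenames in a repo. A minimal sketch with the &lt;code&gt;huggingface_hub&lt;/code&gt; client:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# List the files in the GGUF repo instead of guessing the filename.
# Requires: pip install huggingface_hub
from huggingface_hub import HfApi

repo = "Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF"
for name in HfApi().list_repo_files(repo):
    if name.endswith(".gguf"):
        print(name)  # e.g. Qwen3.5-27B.Q4_K_M.gguf
&lt;/pre&gt;&lt;/div&gt;
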
&lt;p&gt;Create an Ollama Modelfile that imports the local GGUF and sets inference parameters:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;FROM&lt;span class="w"&gt; &lt;/span&gt;/home/alex/models/qwen35-reasoning/model.gguf

PARAMETER&lt;span class="w"&gt; &lt;/span&gt;temperature&lt;span class="w"&gt; &lt;/span&gt;0.6
PARAMETER&lt;span class="w"&gt; &lt;/span&gt;top_p&lt;span class="w"&gt; &lt;/span&gt;0.95
PARAMETER&lt;span class="w"&gt; &lt;/span&gt;num_ctx&lt;span class="w"&gt; &lt;/span&gt;8192
PARAMETER&lt;span class="w"&gt; &lt;/span&gt;repeat_penalty&lt;span class="w"&gt; &lt;/span&gt;1.2
PARAMETER&lt;span class="w"&gt; &lt;/span&gt;stop&lt;span class="w"&gt; &lt;/span&gt;"&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;|endoftext|&amp;gt;"
PARAMETER&lt;span class="w"&gt; &lt;/span&gt;stop&lt;span class="w"&gt; &lt;/span&gt;"&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;|im_end|&amp;gt;"
PARAMETER&lt;span class="w"&gt; &lt;/span&gt;stop&lt;span class="w"&gt; &lt;/span&gt;"&lt;span class="err"&gt;&amp;lt;&lt;/span&gt;|eot_id|&amp;gt;"

SYSTEM&lt;span class="w"&gt; &lt;/span&gt;"You&lt;span class="w"&gt; &lt;/span&gt;are&lt;span class="w"&gt; &lt;/span&gt;a&lt;span class="w"&gt; &lt;/span&gt;deep-thinking&lt;span class="w"&gt; &lt;/span&gt;AI&lt;span class="w"&gt; &lt;/span&gt;assistant.&lt;span class="w"&gt; &lt;/span&gt;For&lt;span class="w"&gt; &lt;/span&gt;complex&lt;span class="w"&gt; &lt;/span&gt;questions,
use&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;think&amp;gt;&lt;/span&gt;...&lt;span class="nt"&gt;&amp;lt;/think&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;tags&lt;span class="w"&gt; &lt;/span&gt;to&lt;span class="w"&gt; &lt;/span&gt;show&lt;span class="w"&gt; &lt;/span&gt;your&lt;span class="w"&gt; &lt;/span&gt;reasoning&lt;span class="w"&gt; &lt;/span&gt;process&lt;span class="w"&gt; &lt;/span&gt;before
providing&lt;span class="w"&gt; &lt;/span&gt;the&lt;span class="w"&gt; &lt;/span&gt;final&lt;span class="w"&gt; &lt;/span&gt;answer."
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;ollama&lt;span class="w"&gt; &lt;/span&gt;create&lt;span class="w"&gt; &lt;/span&gt;qwen35-reasoning&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;Modelfile
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Ollama copies the GGUF into its own blob store, parses the architecture metadata, and registers it as a runnable model. The whole process takes about a minute on local storage.&lt;/p&gt;
&lt;h3&gt;The Stop Token Problem&lt;/h3&gt;
&lt;p&gt;The first run produced correct output followed by infinite repetition. The model answered a calculus question perfectly, then appended "This gives us the final answer:" and repeated the entire solution, over and over, until it hit the context window limit. The &lt;a href="https://www.marktechpost.com/2026/03/26/a-coding-implementation-to-run-qwen3-5-reasoning-models-distilled-with-claude-style-thinking-using-gguf-and-4-bit-quantization/"&gt;MarkTechPost&lt;/a&gt; article that inspired this experiment does not mention this issue, likely because its test prompts were short enough that the repetition was not obvious.&lt;/p&gt;
&lt;p&gt;The fix is explicit stop tokens in the Modelfile. Without them, the model does not know when to stop generating. This is a common issue with GGUF models imported into Ollama without a proper chat template: the model's native end-of-sequence tokens are not being interpreted by the inference engine. Adding &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;|im_end|&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;|eot_id|&amp;gt;&lt;/code&gt; as stop parameters catches the three most common EOS tokens used by Qwen and Llama-family models.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;repeat_penalty&lt;/code&gt; of 1.2 provides a second layer of defense by penalizing the model for reusing recent tokens. This helps but is not sufficient on its own. Without the stop tokens, the model can produce novel-but-meaningless text that avoids exact repetition while still degenerating into nonsense. More on this shortly.&lt;/p&gt;
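&lt;p&gt;The same parameters can also be supplied per request through Ollama's HTTP API, which is convenient when experimenting with different stop lists without rebuilding the model. A minimal sketch against the local &lt;code&gt;/api/generate&lt;/code&gt; endpoint:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Override stop tokens and sampling options per request. Assumes Ollama
# is listening on its default port with the model created above.
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen35-reasoning",
    "prompt": "Find the derivative of x^3 sin(x) using the product rule.",
    "stream": False,
    "options": {
        "stop": ["&amp;lt;|endoftext|&amp;gt;", "&amp;lt;|im_end|&amp;gt;", "&amp;lt;|eot_id|&amp;gt;"],
        "repeat_penalty": 1.2,
        "temperature": 0.6,
    },
})
print(resp.json()["response"])
&lt;/pre&gt;&lt;/div&gt;
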
&lt;h3&gt;Where It Works: Structured Problems&lt;/h3&gt;
&lt;p&gt;With the stop tokens in place, the model performs well on structured mathematical and analytical problems. I gave it a calculus question: find the derivative of x³sin(x) using the product rule.&lt;/p&gt;
&lt;p&gt;The response was genuinely good. The model opened a &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; block, identified the two component functions, recalled the product rule formula, computed each derivative, and applied the rule. Then it closed the think block and produced a clean, well-formatted answer with LaTeX notation, step-by-step derivation, and a factored final form. The thinking trace was coherent and tracked the actual reasoning process. It was not filler; each line in the trace corresponded to a meaningful step.&lt;/p&gt;
&lt;p&gt;Generation speed on the Strix Halo: &lt;strong&gt;10.3 tokens per second&lt;/strong&gt;. Not fast by cloud standards, but responsive enough for interactive use. You see the thinking appear in real time, which is surprisingly useful: you can watch the model work through the problem and catch errors before it commits to a final answer.&lt;/p&gt;
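&lt;p&gt;Ollama reports &lt;code&gt;eval_count&lt;/code&gt; and &lt;code&gt;eval_duration&lt;/code&gt; (in nanoseconds) with each non-streaming response, which is one way to measure throughput. A small sketch that computes the rate and separates the thinking trace from the final answer; the regex assumes the &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; block appears once at the start of the output:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Measure decode speed from Ollama's response metadata and split the
# reasoning trace from the final answer.
import re
import requests

resp = requests.post("http://localhost:11434/api/generate", json={
    "model": "qwen35-reasoning",
    "prompt": "Find the derivative of x^3 sin(x) using the product rule.",
    "stream": False,
}).json()

tok_per_s = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tok_per_s:.1f} tok/s")

m = re.search(r"&amp;lt;think&amp;gt;(.*?)&amp;lt;/think&amp;gt;(.*)", resp["response"], re.S)
thinking = m.group(1).strip() if m else ""
answer = m.group(2).strip() if m else resp["response"]
print("--- trace ---", thinking[:300], sep="\n")
print("--- answer ---", answer, sep="\n")
&lt;/pre&gt;&lt;/div&gt;
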
&lt;p&gt;For structured problems—mathematics, code analysis, formal logic—the distilled reasoning is genuinely functional. The model identifies subproblems, works through them sequentially, and arrives at correct answers. The think tags provide transparency into the process that you do not get from a standard instruction-tuned model.&lt;/p&gt;
&lt;h3&gt;Where It Falls Apart: The River Crossing&lt;/h3&gt;
&lt;p&gt;I ran the classic &lt;a href="https://en.wikipedia.org/wiki/Wolf,_goat_and_cabbage_problem"&gt;wolf-goat-cabbage river crossing&lt;/a&gt; puzzle as a comparison test, the same prompt on both the distilled Qwen model and Claude Haiku 4.5 via the Anthropic API.&lt;/p&gt;
&lt;p&gt;Claude Haiku returned a perfect, concise seven-step solution in 2.9 seconds. Two hundred and twenty-three tokens. The answer identified the critical insight (bring the goat back on one return trip), laid out the sequence clearly, and stopped.&lt;/p&gt;
&lt;p&gt;The Qwen model started well. It correctly identified that the goat must go first, recognized the wolf-goat conflict at the destination, and identified the need to bring the goat back. Then, around step three of the solution, the model began editorializing. "Oh joy what fun times ahead us humans truly enjoy sometimes huh?!" it wrote, mid-solution. Within a few more sentences, the output had degenerated into an unbroken stream-of-consciousness rant that cascaded into a wall of increasingly disconnected words. Not repeated words—the repeat penalty prevented that—but a firehose of unique, semantically null text that continued until it filled the entire 8,192-token context window.&lt;/p&gt;
&lt;p&gt;The output was, to use a technical term, unhinged. The model went from a correct partial solution to word salad in about two hundred tokens, and there was no recovery. The stop tokens could not save it because the model was not producing any end-of-sequence markers. It had entered a mode where it was generating fluent English syntax with zero semantic content, which is exactly the kind of failure that stop tokens and repeat penalties cannot catch.&lt;/p&gt;
&lt;h3&gt;What the Comparison Reveals&lt;/h3&gt;
&lt;p&gt;The numbers tell the story concisely:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Claude Haiku 4.5&lt;/th&gt;
&lt;th&gt;Qwen3.5-27B (Strix Halo)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Time&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2.9 seconds&lt;/td&gt;
&lt;td&gt;Hit 8K context limit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;75.9 tok/s&lt;/td&gt;
&lt;td&gt;~10 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;223 tokens, correct&lt;/td&gt;
&lt;td&gt;Thousands of tokens, degenerated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cost&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$0.0009&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;But the comparison is not really about speed or cost. It is about the difference between native reasoning and distilled reasoning.&lt;/p&gt;
&lt;p&gt;Claude's extended thinking is a capability that emerges from the model's architecture and training at scale. The model has internalized what it means to reason through a problem, including knowing when to stop, when a line of reasoning is unproductive, and when to switch strategies. These are meta-cognitive skills that are extremely difficult to distill.&lt;/p&gt;
&lt;p&gt;The Qwen model learned the &lt;em&gt;format&lt;/em&gt; of reasoning—the think tags, the step-by-step structure, the pattern of stating subproblems and working through them—from three thousand examples. What it did not learn, and arguably cannot learn from supervised fine-tuning alone, is the judgment about when reasoning is going off the rails. A model that has truly internalized reasoning has implicit quality checks: it recognizes incoherence in its own output and corrects course. A model that has learned to &lt;em&gt;mimic&lt;/em&gt; reasoning produces the surface pattern without the underlying self-monitoring.&lt;/p&gt;
&lt;p&gt;This is visible in the failure mode. The model did not produce wrong reasoning. It produced &lt;em&gt;no&lt;/em&gt; reasoning. It exited the reasoning pattern entirely and entered a generation mode that had nothing to do with the problem. A model with genuine reasoning capability would have recognized the incoherence and either corrected or terminated. The distilled model had no such circuit breaker.&lt;/p&gt;
&lt;h3&gt;The Economics&lt;/h3&gt;
&lt;p&gt;The cost comparison deserves its own section because it is often cited as the primary motivation for running local models.&lt;/p&gt;
&lt;p&gt;The Claude Haiku API call cost nine-hundredths of a cent. If you ran a thousand similar queries per day, you would spend about ninety cents. The Strix Halo draws roughly 65 watts at idle and 150 watts under GPU inference load. At Minnesota's residential electricity rate of around twelve cents per kilowatt-hour, running inference eight hours a day costs about fourteen cents, so the local machine does undercut the API on marginal cost. But the hardware itself cost north of two thousand dollars. You would need to amortize that over thousands of hours of inference to reach cost parity with the API, and only if you value your debugging time at zero.&lt;/p&gt;
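&lt;p&gt;The arithmetic, for anyone who wants to plug in their own rates and volumes:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Daily cost sketch using the figures above; adjust to your own
# electricity rate, query volume, and hardware price.
queries_per_day = 1000
cost_per_query = 0.0009   # USD, the Haiku call from the table
api_per_day = queries_per_day * cost_per_query

watts_under_load = 150
hours_per_day = 8
rate_per_kwh = 0.12       # approximate Minnesota residential rate
electricity_per_day = watts_under_load / 1000 * hours_per_day * rate_per_kwh

print(f"API:         ${api_per_day:.2f}/day")          # $0.90
print(f"electricity: ${electricity_per_day:.2f}/day")  # $0.14
# Neither number is decisive; the $2,000+ of hardware is what has to amortize.
&lt;/pre&gt;&lt;/div&gt;
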
&lt;p&gt;The economic case for local inference is not about per-query cost. It is about use cases where you need unlimited queries without metering, where data cannot leave your network, or where you want to experiment with model behavior without worrying about a bill. If you are evaluating a model's failure modes by running hundreds of adversarial prompts—which is exactly what I was doing—the local model is the right tool because you are not optimizing for answer quality. You are optimizing for the freedom to explore.&lt;/p&gt;
&lt;h3&gt;The Strix Halo as an Inference Platform&lt;/h3&gt;
&lt;p&gt;Ten tokens per second for a 27B Q4 model is respectable for an APU. It is not competitive with a discrete GPU: an RTX 3090 delivers 29-35 tokens per second on the same model, roughly three times faster. But the Strix Halo was not designed to compete with discrete GPUs on raw throughput.&lt;/p&gt;
&lt;p&gt;What it offers instead is capacity. The unified memory pool means you can load models that would not fit on most consumer GPUs. A Q8_0 quantization of this same model would be 28.6GB, which exceeds the VRAM of an RTX 4090 (24GB) but fits comfortably in the Strix Halo's addressable space. You could load a 70B Q4 model (roughly 40GB) without any of the layer-splitting gymnastics required on multi-GPU setups. I have run Llama 3.1 70B Q4 on this machine, and while the generation speed drops to about 4-5 tokens per second, it runs without errors or memory pressure.&lt;/p&gt;
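&lt;p&gt;The same bits-per-weight arithmetic from earlier shows why capacity is the selling point. A rough sketch of what fits inside the roughly 65GB the GPU can address:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Rough GGUF size estimates versus the ~65 GB of GPU-addressable memory.
# Bits-per-weight values are approximations for each quantization scheme.
def est_gb(params_b, bits_per_weight):
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

budget_gb = 65
for name, params_b, bits in [("27B Q4_K_M", 27, 4.5),
                             ("27B Q8_0", 27, 8.5),
                             ("70B Q4_K_M", 70, 4.5)]:
    size = est_gb(params_b, bits)
    print(f"{name}: ~{size:.1f} GB, fits: {size &amp;lt; budget_gb}")
&lt;/pre&gt;&lt;/div&gt;
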
&lt;p&gt;For a machine that also serves as a daily desktop, development workstation, and &lt;a href="https://tinycomputers.io/posts/ltx-api.html"&gt;video generation server&lt;/a&gt; (it runs LTX-2.3 on the same hardware), the ability to casually load and test a 27B reasoning model without dedicated GPU infrastructure is the actual value proposition. You do not plan a session. You do not allocate resources. You type &lt;code&gt;ollama run qwen35-reasoning&lt;/code&gt; and it works.&lt;/p&gt;
&lt;h3&gt;Lessons for the Blog Post Reader&lt;/h3&gt;
&lt;p&gt;If you want to replicate this setup, here is what I would emphasize:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The stop tokens are non-negotiable.&lt;/strong&gt; Without explicit &lt;code&gt;&amp;lt;|endoftext|&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;|im_end|&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;|eot_id|&amp;gt;&lt;/code&gt; stop parameters in your Modelfile, the model will produce infinite output on many prompts. This is not documented in the model card and is not mentioned in the MarkTechPost article that covers this implementation. It is the single most important configuration detail.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The model is good at structured problems and bad at open-ended ones.&lt;/strong&gt; Mathematics, code analysis, formal logic—anything where the reasoning has a clear structure and a definitive endpoint—works well. Open-ended problems, creative tasks, or anything requiring sustained coherent narrative are risky. The model can degenerate catastrophically and without warning.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A repeat penalty helps but does not solve the fundamental issue.&lt;/strong&gt; Setting &lt;code&gt;repeat_penalty&lt;/code&gt; to 1.2 prevents exact repetition loops but does not prevent the semantic degeneration I observed on the river crossing problem. The model simply produces unique garbage instead of repeated garbage.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Distillation captures form, not judgment.&lt;/strong&gt; The think tags are real and useful. The step-by-step reasoning format works. What is missing is the implicit self-monitoring that frontier models have: the ability to recognize when their own output has become incoherent and to course-correct. This is probably the hardest thing to distill, because it is not present in the training examples. The examples show successful reasoning. They do not show the model catching and recovering from failed reasoning, because Claude's failed reasoning attempts are filtered out before the training data is assembled.&lt;/p&gt;
&lt;h3&gt;Where This Goes&lt;/h3&gt;
&lt;p&gt;The distilled reasoning model is, despite its failure modes, genuinely interesting. The &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; tags provide a form of transparency that standard instruction-tuned models lack. When the model is working correctly—which is most of the time on appropriate tasks—you get a window into the reasoning process that helps you evaluate the answer's quality before you act on it.&lt;/p&gt;
&lt;p&gt;The failure mode is also instructive. It demonstrates, concretely, the gap between learning a behavior pattern and internalizing the capability that produces that pattern. Supervised fine-tuning on reasoning trajectories can teach a model to produce reasoning-shaped output, but it cannot, from three thousand examples, teach the model to actually reason in the way the source model does. That requires either far more training data, a different training methodology (reinforcement learning from reasoning feedback, perhaps), or simply a larger model with more capacity to internalize the underlying patterns.&lt;/p&gt;
&lt;p&gt;For now, the practical advice is: use these models for what they are good at, know their failure modes, and do not trust the output on open-ended problems without reading the thinking trace. The trace is the feature. If the trace is coherent, the answer is probably good. If the trace starts to wander, stop reading and retry.&lt;/p&gt;
&lt;p&gt;The model runs on my desk, generates ten tokens per second, costs nothing per query, and shows its work. For a sixteen-gigabyte download and ten minutes of setup time, that is a reasonable deal—as long as you know what you are buying.&lt;/p&gt;</description><category>amd</category><category>chain-of-thought</category><category>claude</category><category>distillation</category><category>gguf</category><category>inference</category><category>llm</category><category>ollama</category><category>open-source</category><category>quantization</category><category>qwen</category><category>reasoning</category><category>strix halo</category><guid>https://tinycomputers.io/posts/distilled-reasoning-on-strix-halo-qwen35-claude-thinking.html</guid><pubDate>Sun, 29 Mar 2026 14:00:00 GMT</pubDate></item><item><title>Running a 22B Video Model on Four Tesla P40s</title><link>https://tinycomputers.io/posts/running-ltx-video-on-four-tesla-p40s.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/running-ltx-video-on-four-tesla-p40s_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;22 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;LTX-Video 2.3 is a 22 billion parameter model that generates video from text prompts. It was designed for modern hardware: GPUs with bfloat16 support, high-bandwidth memory, and enough VRAM to hold the full model on one or two cards. The &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;Tesla P40&lt;/a&gt; has none of these things. It is a Pascal-generation GPU from 2016, with 24GB of GDDR5 per card, no native bfloat16, no Tensor Cores, and a PCIe 3.0 bus. It was built for data center inference workloads that no longer exist.&lt;/p&gt;
&lt;p&gt;I have four of them in a rack-mount server in an unheated shop building in Minnesota. Together they provide 96GB of VRAM. The question was whether that 96GB, spread across four old cards, could run a model that was never meant to run on any of them.&lt;/p&gt;
&lt;p&gt;The answer is yes, with significant caveats and a substantial amount of code to work around hardware limitations that the model's authors never anticipated.&lt;/p&gt;
&lt;h3&gt;The Problem&lt;/h3&gt;
&lt;p&gt;LTX-Video 2.3's transformer has 48 blocks. At fp16 precision, the model weights alone consume roughly 44GB. With the Gemma text encoder, the video VAE encoder/decoder, the spatial upsampler, and the audio components, the full pipeline needs more memory than any single P40 can provide. The model doesn't fit on one card. It doesn't fit on two. It barely fits on three, with no room for activations during inference.&lt;/p&gt;
&lt;p&gt;Four cards at 24GB each gives 96GB total, which is enough for the weights with room for intermediate activations. But CUDA doesn't automatically spread a model across multiple GPUs. You have to tell it how.&lt;/p&gt;
&lt;p&gt;The standard approach for multi-GPU inference is &lt;code&gt;accelerate&lt;/code&gt;'s &lt;code&gt;dispatch_model&lt;/code&gt;, which automatically distributes model layers across available GPUs based on memory constraints. This works for the Gemma text encoder, which is a straightforward transformer. For the LTX transformer, it doesn't work, because the model has a custom forward pass with audio-video cross-attention that &lt;code&gt;accelerate&lt;/code&gt;'s automatic dispatch can't handle correctly. The model needs to move data between GPUs at specific points in the forward pass, and &lt;code&gt;accelerate&lt;/code&gt; doesn't know where those points are.&lt;/p&gt;
&lt;p&gt;The solution was manual pipeline parallelism: split the 48 transformer blocks evenly across four GPUs (12 blocks per card), keep the shared components (patchify projections, normalization, output projections) on GPU 0, and write a custom forward pass that moves tensors between devices at block boundaries.&lt;/p&gt;
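&lt;p&gt;A minimal sketch of that split, using the same names (&lt;code&gt;ltx.transformer_blocks&lt;/code&gt;, &lt;code&gt;block_devices&lt;/code&gt;, &lt;code&gt;device0&lt;/code&gt;) that the patched forward pass shown later relies on. The real script also relocates the shared projections and the audio components; &lt;code&gt;ltx&lt;/code&gt; here stands for the already-loaded transformer:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Assign the 48 transformer blocks to 4 GPUs, 12 blocks per card, and keep
# a lookup table the patched forward pass can consult at block boundaries.
import torch

num_gpus = 4
blocks_per_gpu = len(ltx.transformer_blocks) // num_gpus  # 48 // 4 = 12

block_devices = {}
for i, block in enumerate(ltx.transformer_blocks):
    dev = torch.device(f"cuda:{i // blocks_per_gpu}")
    block.to(dev)            # this block's weights live on exactly one card
    block_devices[i] = dev

device0 = torch.device("cuda:0")  # shared patchify/norm/output layers stay here
&lt;/pre&gt;&lt;/div&gt;
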
&lt;h3&gt;The Precision Problem&lt;/h3&gt;
&lt;p&gt;Even with the model split across four cards, nothing worked on the first attempt. Or the fifth. Getting LTX-Video running on Pascal hardware was an iterative process, with Claude Code generating solutions and me testing them against the actual hardware. Each failure revealed another assumption the model made about the GPU it would run on. The feedback loop was brutal: load a 22B model across four GPUs, wait eight minutes for a test generation, get a black frame or a NaN error, diagnose which precision boundary caused it, generate a fix, and try again.&lt;/p&gt;
&lt;p&gt;The first problem was bfloat16. The model weights are stored in bf16 format. Pascal GPUs cannot compute in bf16. PyTorch handles this silently for some operations by promoting to fp32, but other operations fail or produce garbage. The initial approach was the obvious one: monkey-patch &lt;code&gt;torch.bfloat16&lt;/code&gt; to redirect to &lt;code&gt;torch.float16&lt;/code&gt;. This seemed to work at load time. The model loaded, the weights populated, no errors. Then the first forward pass produced NaN everywhere. The monkey-patch had corrupted the safetensors weight loading. The weights loaded as fp16 bit patterns interpreted as bf16 values, which is not the same thing. A bf16 value of 1.0 has a different bit pattern than an fp16 value of 1.0. Reinterpret one as the other and you get a number that's either wildly wrong or NaN.&lt;/p&gt;
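&lt;p&gt;The bit-pattern problem is easy to demonstrate: reinterpreting (rather than converting) a bfloat16 tensor as float16 changes the values, because the two formats divide their 16 bits between exponent and mantissa differently. A quick check in PyTorch:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import torch

x = torch.tensor([1.0, 0.5, 100.0], dtype=torch.bfloat16)

converted = x.to(torch.float16)        # proper conversion: values preserved
reinterpreted = x.view(torch.float16)  # same bits read as fp16: values change

print(converted)      # 1.0, 0.5, 100.0
print(reinterpreted)  # garbage; bf16 1.0 (bits 0x3F80) reads as fp16 1.875
&lt;/pre&gt;&lt;/div&gt;
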
&lt;p&gt;The second attempt tried running everything in fp16 natively, converting weights properly during load. This got further: the model produced output that wasn't NaN. But the output was a solid green frame. The intermediate activations in the transformer blocks were overflowing fp16 range. Values above 65,504 become infinity in fp16, and the model's internal representations regularly exceed that during the attention and feedforward passes. The green frame was the model's attempt to decode latents that had been clipped to infinity at some point in the pipeline.&lt;/p&gt;
&lt;p&gt;The working solution was to let the model builder properly convert weights from bf16 to fp16 on load, then run the entire computation pipeline in float32. The weights sit in memory as fp16 (saving space), but every computation promotes to fp32 before executing. This required patching &lt;code&gt;F.linear&lt;/code&gt; to handle mixed dtype inputs:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;_orig_linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_mixed_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_orig_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mixed_linear&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The same pattern extends to every normalization function and every convolution operation. Layer norm, group norm, RMS norm, conv1d through conv_transpose3d: all patched to handle mixed dtypes and accumulate in float32. Without these patches, intermediate values overflow fp16 range (values above 65,504 become infinity) and the output is a black frame.&lt;/p&gt;
&lt;h3&gt;The Gemma Problem&lt;/h3&gt;
&lt;p&gt;The text encoder is Google's Gemma 3, a separate model that converts text prompts into embeddings the video transformer can condition on. Gemma's attention mechanism overflows when run in fp16 on Pascal hardware. The attention scores grow large enough to exceed fp16 range, producing NaN values that propagate through the rest of the pipeline.&lt;/p&gt;
&lt;p&gt;The fix was running the entire Gemma encoder in float32. This uses more memory, but the text encoder only runs once per generation (to encode the prompt), and its weights can be freed from GPU memory before the transformer starts. The sequence is: load Gemma across all four GPUs using &lt;code&gt;accelerate&lt;/code&gt;, encode the prompt in float32, delete the encoder, free the memory, then load the video transformer.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;encode_prompt_float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_ledger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model_ledger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;
    &lt;span class="n"&gt;te&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_ledger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_encoder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Dispatch across all 4 GPUs for memory&lt;/span&gt;
    &lt;span class="n"&gt;max_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_balanced_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"22GiB"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
        &lt;span class="n"&gt;no_split_module_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Gemma3DecoderLayer"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;te&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dispatch_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hidden_states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attention_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Free GPU memory before transformer loads&lt;/span&gt;
    &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;te&lt;/span&gt;
    &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This load-encode-delete cycle is ugly but necessary. There isn't enough total memory to hold both Gemma and the video transformer simultaneously, even across four cards. The sequential approach works because each component only needs to exist during its phase of the pipeline.&lt;/p&gt;
&lt;h3&gt;The Pipeline&lt;/h3&gt;
&lt;p&gt;The generation runs in two stages, matching LTX-Video's distilled inference schedule.&lt;/p&gt;
&lt;p&gt;Stage 1 generates a half-resolution latent video (e.g., 256x384) through 8 denoising steps. Each step runs the full 48-block transformer, with data moving across all four GPUs:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;patched_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;perturbations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ltx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transformer_blocks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;dev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block_devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;perturbations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;perturbations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Every GPU boundary involves a tensor transfer across PCIe 3.0. With 12 blocks per GPU, there are 3 boundary crossings per denoising step (GPU 0 to 1, 1 to 2, 2 to 3), plus a final transfer back to GPU 0. With 8 denoising steps, that's 32 cross-device transfers per stage, each moving both video and audio state tensors. PCIe 3.0 x16 has a theoretical bandwidth of ~16 GB/s. The tensors being transferred are small relative to the bandwidth (attention states and activations, not full weight matrices), so the overhead is manageable. But it adds up.&lt;/p&gt;
&lt;p&gt;Stage 1 takes roughly 4 minutes for 241 frames at 24 fps (a 10-second clip). The spatial upsampler then doubles the resolution. Stage 2 runs 3 more denoising steps at full resolution (512x768), taking roughly 6.5 minutes. The VAE decoder converts latents to pixels and generates the audio track in another 40 seconds.&lt;/p&gt;
&lt;p&gt;Total generation time for a 10-second, 512x768 video with audio: approximately 18.5 minutes. For a 1-second clip (25 frames): about 8 minutes. For a 4-second clip (97 frames): about 10.5 minutes.&lt;/p&gt;
&lt;h3&gt;The Memory Layout&lt;/h3&gt;
&lt;p&gt;During inference, the four GPUs aren't loaded equally. GPU 0 carries extra weight because it hosts all the shared components (patchify projections, normalization layers, output projections) plus its 12 transformer blocks. The actual memory distribution:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;VRAM Used&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;10.8 GB&lt;/td&gt;
&lt;td&gt;Shared components + blocks 0-11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;9.3 GB&lt;/td&gt;
&lt;td&gt;Blocks 12-23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;9.3 GB&lt;/td&gt;
&lt;td&gt;Blocks 24-35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;9.3 GB&lt;/td&gt;
&lt;td&gt;Blocks 36-47&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That's 38.7 GB of the available 96 GB. The remaining 57 GB provides headroom for activations, KV cache growth, and the VAE decoder. There's enough margin that generation never OOMs, even at 241 frames.&lt;/p&gt;
&lt;h3&gt;The API&lt;/h3&gt;
&lt;p&gt;Running inference from the command line is fine for testing, but generating videos for blog content requires something more practical. I wrapped the generation script in a FastAPI server with an async job queue:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# Submit a text-to-video job&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;-X&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prompt=A cinematic flyover of a Zilog Z80 processor on a PCB"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duration=10"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"seed=42"&lt;/span&gt;

&lt;span class="c1"&gt;# Submit an image-to-video job&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;-X&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prompt=A fluffy orange cat dancing"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duration=4"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"image=@cat.jpg"&lt;/span&gt;

&lt;span class="c1"&gt;# Check status&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs/07420abb6d82

&lt;span class="c1"&gt;# Download result&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs/07420abb6d82/video&lt;span class="w"&gt; &lt;/span&gt;-o&lt;span class="w"&gt; &lt;/span&gt;output.mp4
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Jobs queue and execute sequentially. The GPU can only handle one generation at a time, and the load-encode-delete cycle for Gemma means there's significant setup overhead per job. The API spawns each job as a subprocess, which gives clean GPU memory cleanup between runs. If a generation crashes (which happened frequently during development), the next job starts fresh.&lt;/p&gt;
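&lt;p&gt;The server itself is not reproduced here, but the queue-plus-subprocess pattern is simple enough to sketch. This is an illustration of the structure described above, not the actual implementation; the generation script name and its flags are placeholders:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Minimal FastAPI job queue: one worker drains the queue and runs each
# generation as a subprocess so GPU memory is fully released between jobs.
# Requires: pip install fastapi uvicorn python-multipart
import asyncio
import uuid

from fastapi import FastAPI, Form

app = FastAPI()
jobs = {}                   # job_id -&amp;gt; {"status": ..., "prompt": ..., "output": ...}
queue = asyncio.Queue()

@app.on_event("startup")
async def start_worker():
    asyncio.create_task(worker())

async def worker():
    while True:
        job_id = await queue.get()
        job = jobs[job_id]
        job["status"] = "running"
        # A fresh subprocess per job: a crash or leak cannot poison the next run.
        proc = await asyncio.create_subprocess_exec(
            "python", "generate_video.py",
            "--prompt", job["prompt"], "--out", job["output"])
        job["status"] = "done" if await proc.wait() == 0 else "failed"

@app.post("/jobs")
async def submit(prompt: str = Form(...), duration: int = Form(4)):
    job_id = uuid.uuid4().hex[:12]
    jobs[job_id] = {"prompt": prompt, "duration": duration,
                    "status": "queued", "output": f"/tmp/{job_id}.mp4"}
    await queue.put(job_id)
    return {"id": job_id, "status": "queued"}

@app.get("/jobs/{job_id}")
async def status(job_id: str):
    return jobs[job_id]
&lt;/pre&gt;&lt;/div&gt;
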
&lt;p&gt;The server supports both text-to-video and image-to-video. Image conditioning locks the first frame to a provided image and generates subsequent frames from it, which produces more controllable results for specific visual subjects. In practice, image-to-video is the more useful mode. Text-to-video gives the model complete creative freedom, which means the output is unpredictable. You might ask for a Z80 processor and get something that looks like a generic IC, or something that looks like a Z80, depending on the seed. Image-to-video lets you provide the exact first frame you want and the model animates from there. For blog content where visual accuracy matters, starting from a real photograph or a specific reference image gives consistently better results.&lt;/p&gt;
&lt;h3&gt;What the Output Looks Like&lt;/h3&gt;
&lt;p&gt;The video quality is genuinely good. LTX-Video 2.3 produces coherent motion, reasonable physics, and detailed textures. Here are three examples, generated entirely on the P40 server:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text-to-video: "A cinematic flyover of a Zilog Z80 processor on a printed circuit board" (10 seconds, 18.5 minutes to generate)&lt;/strong&gt;&lt;/p&gt;
&lt;video controls preload="metadata" style="max-width: 100%; border-radius: 6px; box-shadow: 0 10px 20px rgba(0,0,0,.1); margin: 1em 0;"&gt;
&lt;source src="https://tinycomputers.io/ltx-z80-flyover.mp4" type="video/mp4"&gt;
&lt;/source&gt;&lt;/video&gt;

&lt;p&gt;&lt;strong&gt;Image-to-video: "A fluffy orange cat with a hat dancing" (4 seconds, 10.5 minutes to generate)&lt;/strong&gt;&lt;/p&gt;
&lt;video controls preload="metadata" style="max-width: 100%; border-radius: 6px; box-shadow: 0 10px 20px rgba(0,0,0,.1); margin: 1em 0;"&gt;
&lt;source src="https://tinycomputers.io/ltx-cat-dancing.mp4" type="video/mp4"&gt;
&lt;/source&gt;&lt;/video&gt;

&lt;p&gt;&lt;strong&gt;Text-to-video: "A cat sitting on a windowsill, sunlight streaming in" (1 second, 8 minutes to generate)&lt;/strong&gt;&lt;/p&gt;
&lt;video controls preload="metadata" style="max-width: 100%; border-radius: 6px; box-shadow: 0 10px 20px rgba(0,0,0,.1); margin: 1em 0;"&gt;
&lt;source src="https://tinycomputers.io/ltx-cat-windowsill.mp4" type="video/mp4"&gt;
&lt;/source&gt;&lt;/video&gt;

&lt;p&gt;The model understands object permanence, lighting consistency, and basic spatial relationships. The Z80 flyover produces a recognizable IC package with surrounding components, proper lighting, and smooth camera movement.&lt;/p&gt;
&lt;p&gt;The audio is a different story. LTX-Video 2.3 generates an audio track alongside the video, but the results are inconsistent. Prompts describing characters speaking produce odd ambient music instead of voices. Prompts describing environments produce vaguely appropriate soundscapes. The audio pipeline works mechanically (it generates real audio waveforms via a separate VAE decoder and vocoder), but the semantic connection between prompt and audio output is weak. For blog content, I'd likely strip the generated audio and add narration or music separately.&lt;/p&gt;
&lt;p&gt;The 512x768 resolution at 24fps is usable for web content. It's not 4K. It's not going to replace stock footage for production video. But for blog hero images in motion, visual demonstrations, or supplementary content alongside text, it works.&lt;/p&gt;
&lt;h3&gt;What This Cost&lt;/h3&gt;
&lt;p&gt;The hardware cost is zero incremental. The four P40s and the server already existed for &lt;a href="https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html"&gt;LLM inference&lt;/a&gt;. LTX-Video is an additional workload on the same hardware.&lt;/p&gt;
&lt;p&gt;The electricity cost is modest. The server draws roughly 500W under full GPU load. An 18.5-minute generation (10-second video at full resolution) consumes about 0.15 kWh, roughly $0.024 at Minnesota residential rates. You could generate forty 10-second clips for a dollar.&lt;/p&gt;
&lt;p&gt;The real cost was development time. Getting from "model downloaded" to "working generation pipeline" took many iterations across multiple sessions with Claude Code. Each precision-related failure mode (bf16 corruption, fp16 overflow, mixed-dtype kernel errors, NaN propagation through attention) required diagnosis, a hypothesis, a code change, and a test cycle that involved loading a 22B model across four GPUs. The feedback loop was slow. A single test takes 8 to 18 minutes to confirm whether a change worked. Many didn't.&lt;/p&gt;
&lt;h3&gt;The Broader Point&lt;/h3&gt;
&lt;p&gt;A 22 billion parameter video generation model was not designed to run on 2016 hardware. The authors assumed bf16, assumed modern attention kernels, assumed enough memory on one or two cards. None of those assumptions hold on the P40.&lt;/p&gt;
&lt;p&gt;But the model runs anyway, because the underlying math doesn't actually require any of those features. Bfloat16 is a convenience, not a requirement; float32 computes the same function. Flash attention is an optimization, not a necessity; standard attention produces identical results. And 96GB across four cards is 96GB, regardless of whether it's cutting-edge HBM3 or decade-old GDDR5.&lt;/p&gt;
&lt;p&gt;The generation is slow. Eighteen minutes for ten seconds of video is not competitive with a single A100, which would finish the same job in under two minutes. The float32 computation pipeline runs at roughly half the arithmetic throughput of the bf16 path the model was designed for, and the PCIe 3.0 transfers between four separate memory pools add latency that a single modern GPU with unified HBM would never incur. But competitive wasn't the point. The point was that four GPUs I bought on eBay for a thousand dollars total, sitting in a server in a shop building, can run a model that was released this month. The gap between "latest model" and "latest hardware" is not as wide as the spec sheets suggest, as long as you're willing to write the code that bridges it.&lt;/p&gt;
&lt;p&gt;The P40 server was already paying for itself on &lt;a href="https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html"&gt;LLM inference&lt;/a&gt; and &lt;a href="https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html"&gt;TTS generation&lt;/a&gt;. Video generation is one more workload on a machine that I own, running models that I choose, on a schedule that I control. The 18-minute wait is the price of not asking anyone's permission.&lt;/p&gt;</description><category>ai</category><category>cuda</category><category>gpu</category><category>home lab</category><category>inference</category><category>ltx video</category><category>multi-gpu</category><category>pascal</category><category>pipeline parallelism</category><category>tesla p40</category><category>video generation</category><guid>https://tinycomputers.io/posts/running-ltx-video-on-four-tesla-p40s.html</guid><pubDate>Fri, 20 Mar 2026 13:00:00 GMT</pubDate></item><item><title>The Economics of Owning Your Own Inference</title><link>https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/the-economics-of-owning-your-own-inference_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;21 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;I own $5,500 worth of GPU hardware dedicated to running AI models locally. I also pay for a Claude Max subscription that I use for nearly everything that matters. If that sounds like a contradiction, it is the entire subject of this article.&lt;/p&gt;
&lt;p&gt;The local inference conversation online is dominated by two positions. The first: why pay for API calls when you can run models on your own hardware? The second: local models are worse, so just pay for the good ones. Both are correct. Both are incomplete. The interesting question is where the boundary falls between them, and the answer turns out to depend less on cost-per-token arithmetic than on what kind of work you are doing.&lt;/p&gt;
&lt;h3&gt;The Split&lt;/h3&gt;
&lt;p&gt;I use Claude for research, code review, writing feedback, technical analysis, and anything that used to be a Google search. The frontier models are better at all of these tasks than anything I can run locally. Not marginally better; categorically better. An 8B parameter model running on my hardware is not in the same conversation as Claude Opus or GPT-5.4 for anything requiring reasoning, nuance, or broad knowledge. The subscription cost is fixed regardless of volume, which eliminates per-query friction entirely. For interactive, quality-sensitive work, I pay for the best model available and I do not think about it.&lt;/p&gt;
&lt;p&gt;Local inference handles everything else: the batch jobs, the grunt work, the high-volume tasks where model quality matters less than model availability. The work that would be expensive at cloud API rates not because any single call costs much, but because the calls number in the tens of thousands.&lt;/p&gt;
&lt;p&gt;This is not a temporary arrangement while local models catch up. It is a structural split. Frontier models are getting better. Local models are also getting better. The gap is not closing in the ways that matter for my usage, because the tasks I send to each side are fundamentally different. I do not need my local 8B model to reason better. I need it to process text cheaply and without metering.&lt;/p&gt;
&lt;h3&gt;What the Local Hardware Actually Does&lt;/h3&gt;
&lt;p&gt;Three workloads. All batch. All quality-tolerant.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text-to-speech.&lt;/strong&gt; Every post on this site has an &lt;a href="https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html"&gt;AI-generated audio narration&lt;/a&gt;. This is the workload that justifies the hardware on its own. Google Cloud Platform has superior TTS voices; Chirp3-HD sounds noticeably more natural than any open-source model I have tested. I ran a novel through it once: 82,000 words, 500,000 characters, $17.25. That is reasonable for a one-off project.&lt;/p&gt;
&lt;p&gt;It is not reasonable for a library of blog posts that I revise and regenerate periodically. At GCP rates ($16 per million characters, more for premium voices), narrating every post on this site would cost $200 to $400, and that bill resets every time I edit an article and regenerate the audio. Open-source TTS (&lt;a href="https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html"&gt;F5-TTS and Qwen TTS&lt;/a&gt;) mispronounces technical terms. The prosody goes flat on dense jargon. But it is good enough for blog narration. "Good enough" at zero marginal cost beats "excellent" at $4 to $10 per post when you are generating audio daily.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Code scanning.&lt;/strong&gt; Running local models over source files for pattern detection, documentation extraction, and automated analysis. These jobs produce high token volume at low quality requirements. An 8B model is adequate. The token count across a full codebase makes API pricing add up in a way that individual queries do not.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Infrastructure work.&lt;/strong&gt; Benchmarking hardware (as in this article), testing prompt structures across quantization levels, evaluating model behavior under different configurations. These queries have no value individually. They are the test drives, not the commute. Paying per-token for test drives is paying per-mile to drive your own car around the block.&lt;/p&gt;
&lt;p&gt;None of these workloads require a frontier model. All of them generate enough volume to make metered pricing uncomfortable. That is the boundary.&lt;/p&gt;
&lt;h3&gt;The Machines&lt;/h3&gt;
&lt;p&gt;Two machines. Both mine. Both running &lt;a href="https://ollama.com"&gt;Ollama&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;four-GPU Tesla P40 server&lt;/a&gt;: Penguin Computing 2U chassis, Xeon E5-2697A v4, 252GB DDR4 ECC, four Tesla P40s with 24GB GDDR5X each. Ninety-six gigabytes of VRAM. Pascal architecture, 2016 vintage. Built from eBay parts for about $2,500. Lives in an unheated shop building in Minnesota.&lt;/p&gt;
&lt;p&gt;A Bosgame M5 mini desktop: AMD Ryzen AI MAX+ 395, Strix Halo APU with integrated RDNA 3.5 graphics. No discrete GPU. CPU and GPU share 128GB DDR5, roughly 60GB addressable as VRAM through ROCm 7.2. Cost about $3,000. Fits on a desk.&lt;/p&gt;
&lt;h3&gt;What They Cost to Run&lt;/h3&gt;
&lt;p&gt;I logged GPU power draw at 500-millisecond intervals during inference using &lt;code&gt;nvidia-smi&lt;/code&gt; on the P40 server and &lt;code&gt;rocm-smi&lt;/code&gt; on the Strix Halo. Same prompt, same models, same Ollama configuration. All models ran 100% on GPU.&lt;/p&gt;
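&lt;p&gt;For the curious, the logging side is nothing elaborate. Here is a minimal sketch of the kind of sampler I mean for the NVIDIA side; the &lt;code&gt;nvidia-smi&lt;/code&gt; query flags are standard, while the &lt;code&gt;rocm-smi&lt;/code&gt; variant needs its own output parsing and is omitted here:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;#!/usr/bin/env python3
"""Log per-GPU power draw to CSV at a fixed interval (Ctrl-C to stop)."""
import csv
import subprocess
import time

INTERVAL_S = 0.5            # 500 ms, matching the sampling interval above
OUTFILE = "gpu_power_log.csv"

def read_power_watts():
    """Return one power reading (watts) per GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=power.draw",
         "--format=csv,noheader,nounits"], text=True)
    return [float(line) for line in out.splitlines() if line.strip()]

with open(OUTFILE, "a", newline="") as f:
    writer = csv.writer(f)
    while True:
        writer.writerow([time.time()] + read_power_watts())
        f.flush()
        time.sleep(INTERVAL_S)&lt;/code&gt;&lt;/pre&gt;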
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;P40 tok/s&lt;/th&gt;
&lt;th&gt;P40 GPU Power&lt;/th&gt;
&lt;th&gt;Halo tok/s&lt;/th&gt;
&lt;th&gt;Halo GPU Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 3B&lt;/td&gt;
&lt;td&gt;91.2&lt;/td&gt;
&lt;td&gt;170W avg&lt;/td&gt;
&lt;td&gt;78.4&lt;/td&gt;
&lt;td&gt;64W avg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;47.5&lt;/td&gt;
&lt;td&gt;278W avg&lt;/td&gt;
&lt;td&gt;40.2&lt;/td&gt;
&lt;td&gt;82W avg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 70B (4K ctx)&lt;/td&gt;
&lt;td&gt;6.3&lt;/td&gt;
&lt;td&gt;278W avg&lt;/td&gt;
&lt;td&gt;5.6&lt;/td&gt;
&lt;td&gt;81W avg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The P40 is 12-18% faster in raw throughput. It draws 3-4x the power. The 3B model lives on a single P40; the other three cards idle at ~9W each but still cost electricity. The 8B and 70B models span two GPUs while two idle. You always pay for cards that are not working. The Strix Halo has one GPU. No idle penalty.&lt;/p&gt;
&lt;p&gt;GPU power is not total system power. The P40 server's Xeons, 252GB of RAM, dual PSUs, and fans add roughly 200W to the GPU figures. The Strix Halo's APU and DDR5 add roughly 40-60W. Conservative estimates for total system draw: 500W for the P40 under load, 120W for the Strix Halo.&lt;/p&gt;
&lt;p&gt;At Minnesota residential electricity rates ($0.157/kWh), the cost per million tokens:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;3B&lt;/th&gt;
&lt;th&gt;8B&lt;/th&gt;
&lt;th&gt;70B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P40 Server&lt;/td&gt;
&lt;td&gt;$0.19/M&lt;/td&gt;
&lt;td&gt;$0.46/M&lt;/td&gt;
&lt;td&gt;$3.47/M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strix Halo&lt;/td&gt;
&lt;td&gt;$0.06/M&lt;/td&gt;
&lt;td&gt;$0.13/M&lt;/td&gt;
&lt;td&gt;$0.94/M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
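&lt;p&gt;The arithmetic behind that table is simple enough to reproduce. A quick sketch, using the measured throughput and the conservative whole-system power estimates above rather than wall measurements:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Electricity cost per million tokens, from throughput and system power.
RATE_KWH = 0.157                              # Minnesota residential, $/kWh

def cost_per_million(tokens_per_s, system_watts):
    hours = 1_000_000 / tokens_per_s / 3600   # wall-clock hours per million tokens
    kwh = hours * system_watts / 1000
    return kwh * RATE_KWH

print(f"P40, 8B:   ${cost_per_million(47.5, 500):.2f}/M")   # about $0.46
print(f"Halo, 8B:  ${cost_per_million(40.2, 120):.2f}/M")   # about $0.13
print(f"P40, 70B:  ${cost_per_million(6.3, 500):.2f}/M")    # about $3.47
print(f"Halo, 70B: ${cost_per_million(5.6, 120):.2f}/M")    # about $0.94&lt;/code&gt;&lt;/pre&gt;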
&lt;h3&gt;Why the Per-Token Number Is Misleading&lt;/h3&gt;
&lt;p&gt;Those numbers look competitive with hosted inference, which runs $0.05 to $0.20 per million tokens for 8B-class models through providers like Together AI or Groq. The Strix Halo at $0.13/M sits squarely in that range. The P40 at $0.46/M does not.&lt;/p&gt;
&lt;p&gt;But per-token cost during active inference is the wrong metric for two reasons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hardware amortization changes the math.&lt;/strong&gt; The P40 server cost $2,500. The Strix Halo cost $3,000. Amortized over two years, that adds $0.14/hr and $0.11/hr respectively. On the 8B model, the all-in cost per million tokens rises to about $1.28 for the P40 and $0.90 for the Strix Halo. Both are more expensive than every hosted inference API for the same model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Idle power is the dominant cost.&lt;/strong&gt; The P40 server draws roughly 340W at idle: $38.50 per month whether I run a single query or not. The Strix Halo draws roughly 35W at idle: $4.20 per month. Over a year, idle electricity alone costs $462 on the P40 and $50 on the Strix Halo. If you are not using the hardware frequently, idle power overwhelms everything else in the cost model.&lt;/p&gt;
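&lt;p&gt;Extending the same sketch to the all-in figures is mechanical: fold the quoted hourly amortization into each hour of generation, and price the idle draw by the month.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# All-in cost per million 8B-class tokens (electricity plus amortization),
# and monthly idle electricity, using the figures quoted above.
RATE_KWH = 0.157

def all_in_per_million(tokens_per_s, system_watts, amort_per_hr):
    hours = 1_000_000 / tokens_per_s / 3600
    electricity = hours * system_watts / 1000 * RATE_KWH
    return electricity + hours * amort_per_hr

def idle_per_month(idle_watts):
    return idle_watts / 1000 * 24 * 30 * RATE_KWH

print(f"P40 all-in, 8B:  ${all_in_per_million(47.5, 500, 0.14):.2f}/M")   # about $1.28
print(f"Halo all-in, 8B: ${all_in_per_million(40.2, 120, 0.11):.2f}/M")   # about $0.89
print(f"P40 idle:        ${idle_per_month(340):.2f}/month")               # about $38.50
print(f"Halo idle:       ${idle_per_month(35):.2f}/month")                # about $4&lt;/code&gt;&lt;/pre&gt;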
&lt;p&gt;Per-token math at load flatters local inference by ignoring the hours when the hardware is doing nothing. It is like calculating your car's fuel economy only during highway driving and ignoring that it sits in the driveway 22 hours a day with the engine running.&lt;/p&gt;
&lt;h3&gt;Why I Run Both Anyway&lt;/h3&gt;
&lt;p&gt;The per-token economics favor API providers. The per-workload economics favor local hardware for specific tasks. TTS is the starkest example.&lt;/p&gt;
&lt;p&gt;Generating a 20-minute blog narration on the Strix Halo takes about 45 minutes of inference at roughly 85W above idle power. The incremental electricity cost is about $0.02. The same narration through Google Cloud TTS would cost $4 to $10 depending on character count and voice tier.&lt;/p&gt;
&lt;p&gt;That is a 200-to-500x cost difference on the marginal unit. And the marginal unit is what matters, because the question is never "should I generate TTS at all?" It is "should I regenerate the audio for this post I just edited?" or "should I try a different voice on this article?" or "should I narrate this niche post about PCB trace routing that maybe fifty people will listen to?"&lt;/p&gt;
&lt;p&gt;At $4 to $10 per narration, the answer to all of those is "probably not." At $0.02, the answer is "why wouldn't I?" That shift from "probably not" to "why not" is the entire economic argument for owning TTS hardware. It is not about the average cost. It is about the marginal decision.&lt;/p&gt;
&lt;p&gt;Before running local TTS, I narrated posts selectively with Google Cloud's Text-to-Speech. Some were too long or too niche to justify the GCP cost. Now every post gets audio. I regenerate after revisions without thinking about it. I have run the same post through three different TTS models to compare voice quality. I experiment with speaker voices, pacing parameters, and chunk sizes. The total volume of audio I have generated locally exceeds what I would have purchased from Google at any price point. This is &lt;a href="https://tinycomputers.io/posts/jevons-paradox.html"&gt;Jevons Paradox&lt;/a&gt; at the smallest possible scale: make TTS cheap enough and I do not produce the same amount of TTS for less money; I produce vastly more TTS for slightly less money.&lt;/p&gt;
&lt;p&gt;The same logic applies to code scanning. Any individual scan is cheap enough through an API. But the friction of metered pricing discourages the kind of speculative, exploratory analysis that turns up unexpected findings. When the marginal cost is zero, I scan more freely and more often. The value is not in any single scan; it is in the scans I would not have run otherwise.&lt;/p&gt;
&lt;h3&gt;The Strix Halo Problem&lt;/h3&gt;
&lt;p&gt;The most surprising result in the benchmarks is the Strix Halo's efficiency. An integrated APU with no discrete GPU delivers 40.2 tokens per second at 82W of GPU power. The P40 server delivers 47.5 tokens per second at 278W of GPU power. The P40 is 18% faster. The Strix Halo uses 70% less power. In performance per GPU watt, the Strix Halo (0.49 tok/s per watt) is nearly three times more efficient than the P40 (0.17 tok/s per watt).&lt;/p&gt;
&lt;p&gt;This creates a problem for the P40 server's economics. The server's advantage is VRAM: 96GB lets it run 120B MoE models that the Strix Halo cannot fit. For the gpt-oss 120B model, the P40 server is the only viable option. But for everything 8B and below, the Strix Halo is cheaper to buy ($2,000 vs. $2,500), cheaper to idle ($4.20/month vs. $38.50/month), cheaper per token ($0.13/M vs. $0.46/M), quieter, smaller, and only 18% slower.&lt;/p&gt;
&lt;p&gt;If I were building a local inference setup today from scratch and my workload were 8B models and TTS, I would buy the Strix Halo and nothing else. The P40 server justifies its existence only because of the large models that need its VRAM, and because I put it together well before the current RAM price spike.&lt;/p&gt;
&lt;p&gt;This is worth sitting with for a moment, because it inverts the conventional wisdom about inference hardware. The enterprise GPU server that looks impressive on paper (four GPUs, 96GB VRAM, 2U rack mount) loses on total cost of ownership to a $3,000 mini desktop for the workloads that dominate my actual usage. The P40's raw throughput advantage is real but small. Its power cost advantage is negative. The VRAM advantage matters only for models most people do not run.&lt;/p&gt;
&lt;h3&gt;The Maintenance Tax&lt;/h3&gt;
&lt;p&gt;The per-token calculations ignore the cost of keeping these machines running. It is not zero.&lt;/p&gt;
&lt;p&gt;I have had two kernel updates break the NVIDIA DKMS module on the P40 server. The AMD machine requires &lt;a href="https://tinycomputers.io/posts/qwen-tts-on-amd-strix-halo.html"&gt;specific pre-release PyTorch wheels&lt;/a&gt; and environment variable overrides for ROCm to function on gfx1151 hardware. While running the benchmarks for this article, I discovered that Ollama on the Strix Halo had been running entirely on CPU because the systemd service file lacked the &lt;code&gt;HSA_OVERRIDE_GFX_VERSION=11.5.1&lt;/code&gt; variable. Every benchmark I had run on that machine prior to catching this was measuring CPU inference, not GPU inference. The fix took two minutes. Finding it took longer.&lt;/p&gt;
&lt;p&gt;The P40 server's fans run at full speed from October through April because the BMC interprets Minnesota winter temperatures as a hardware malfunction. The noise is audible from the house, 150 feet away.&lt;/p&gt;
&lt;p&gt;None of this is catastrophic. All of it is time. And time spent debugging DKMS modules or adding environment variables to systemd units is time not spent on the work that the hardware is supposed to enable. A Claude Max subscription requires zero maintenance. The local hardware requires ongoing attention. That asymmetry does not show up in per-token cost tables, but it is real.&lt;/p&gt;
&lt;h3&gt;Who This Is For&lt;/h3&gt;
&lt;p&gt;Most people should not build a local inference server. If you use AI for interactive tasks (questions, code, analysis, writing), a frontier model subscription is a better product at a lower total cost than any local setup. The quality gap between a local 8B model and Claude or GPT-5.4 is not closing in the ways that matter for conversational use. Pay for the good models. Use them freely.&lt;/p&gt;
&lt;p&gt;Local inference makes economic sense when you have a specific, high-volume, quality-tolerant workload that you will run often enough to justify hardware sitting on 24/7. TTS is the clearest case. Batch code analysis is another. If you cannot name the workload, you do not have one, and the hardware will cost you $40 to $50 per month in idle electricity to find out.&lt;/p&gt;
&lt;p&gt;The split between frontier subscriptions and local batch processing is not a compromise. It is, for my usage, the correct architecture. The frontier model handles the work where quality determines value. The local hardware handles the work where volume determines cost. Neither replaces the other. The mistake is thinking they compete.&lt;/p&gt;</description><category>ai</category><category>amd</category><category>benchmarks</category><category>claude</category><category>economics</category><category>gpu</category><category>home lab</category><category>inference</category><category>jevons paradox</category><category>local inference</category><category>power consumption</category><category>strix halo</category><category>tesla p40</category><category>tts</category><guid>https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html</guid><pubDate>Tue, 17 Mar 2026 13:00:00 GMT</pubDate></item><item><title>The Real Cost of Running Qwen TTS Locally: Three Machines Compared</title><link>https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/the-real-cost-of-running-qwen-tts-locally-three-machines-compared_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;17 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;&lt;img src="https://tinycomputers.io/images/qwen-tts-benchmark/p40-server-shop.jpg" alt="The Tesla P40 server standing on its side in an unheated Minnesota shop building, one of three machines benchmarked for local TTS generation" style="float: right; max-width: 40%; margin: 0 0 1em 1.5em; border-radius: 4px; box-shadow: 0 30px 40px rgba(0,0,0,.1);"&gt;&lt;/p&gt;
&lt;p&gt;Every post on this site has an audio version. A small player at the top, a few minutes of narration, generated entirely on local hardware. No cloud API, no per-character fees, no data leaving the network. I wrote about &lt;a href="https://tinycomputers.io/posts/qwen-tts-on-amd-strix-halo.html"&gt;setting up the pipeline on AMD Strix Halo&lt;/a&gt; earlier this year, and the system has been running in production since, generating narrations for new posts, regenerating old ones when I revise them, and occasionally processing long-form content that would cost real money through Google Cloud TTS or ElevenLabs.&lt;/p&gt;
&lt;p&gt;But I now have three machines capable of running Qwen3-TTS, and they could not be more different from each other. An Apple M3 Max laptop. An AMD Ryzen AI MAX+ 395 mini desktop with integrated Radeon graphics. And a &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;four-GPU Tesla P40 server&lt;/a&gt; built from decade-old enterprise hardware bought on eBay. Three different silicon vendors, three different compute backends (MPS, ROCm, and CUDA) running the same model on the same text.&lt;/p&gt;
&lt;p&gt;The question I wanted to answer is simple: how do they actually compare? Not on paper. Not in theoretical FLOPS. In wall-clock time, generating real audio from a real blog post.&lt;/p&gt;
&lt;p&gt;The answer turned out to be more interesting than I expected, because the numbers tell a story about hardware architecture that raw specifications completely miss.&lt;/p&gt;
&lt;h3&gt;The Setup&lt;/h3&gt;
&lt;p&gt;The model is &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"&gt;Qwen3-TTS-12Hz-1.7B-CustomVoice&lt;/a&gt;, a 1.7 billion parameter autoregressive text-to-speech model from Alibaba's Qwen team. It generates natural-sounding speech with multiple speaker voices. I use the Eric voice for all blog narrations: clear, professional, well-paced for technical content.&lt;/p&gt;
&lt;p&gt;The three machines:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apple M3 Max&lt;/strong&gt;, a &lt;a href="https://amzn.to/4rwlTa6"&gt;MacBook Pro&lt;/a&gt; with Apple's M3 Max chip. 14 CPU cores, 30 GPU cores, 64GB unified memory. The GPU runs through PyTorch's MPS (Metal Performance Shaders) backend. This is my daily driver laptop, and it generates TTS when I am writing and editing posts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AMD Radeon 8060S&lt;/strong&gt;, a Bosgame M5 mini desktop running &lt;a href="https://amzn.to/4bv5CMG"&gt;AMD's Ryzen AI MAX+ 395&lt;/a&gt;. This is a Strix Halo APU with integrated RDNA 3.5 graphics, not a discrete GPU. It shares 128GB of DDR5 system memory with the CPU, with roughly 96GB addressable as VRAM. The GPU runs through ROCm 7.2 with PyTorch 2.9.1. The gfx1151 architecture requires specific PyTorch wheels from AMD's pre-release index and several environment variable overrides to function. I wrote a &lt;a href="https://tinycomputers.io/posts/qwen-tts-on-amd-strix-halo.html"&gt;full setup guide&lt;/a&gt; for this machine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NVIDIA Tesla P40&lt;/strong&gt;, a 2U rack-mount server with four &lt;a href="https://www.ebay.com/itm/306087510352?_skw=nvidia+tesla+p40+24gb+gpu&amp;amp;epid=27032254618&amp;amp;itmmeta=01KKJEGQKSK110HNM6214EB0TT&amp;amp;hash=item47443cc150:g:qAwAAOSwy0toUHXh&amp;amp;itmprp=enc%3AAQALAAABAGfYFPkwiKCW4ZNSs2u11xAq6UjArKrgnuEyMVTZhAZhOSUGYags6TsDJvvCEOa51UH2r%2BRe%2F182ah6rgiTIAIRULQNEL9rbiinCXMor%2FBNNZk0GaNKqTWkq9pLWGoRBM8NL%2BjC1aSA63XPe4YsFHjQkb%2Fmup21S3UM7oqwBrW%2BHep1E07lnrt2vzkljSA4xg7SnrA%2BFDtOdqvDwO4tpgB0t%2BtCv9%2BlXoh%2BeoEgpJqXgaaM0ad48OfmgKB13PF9RIPXLNI6z4SjV2O%2FXOk6nYPyD9Eg5wbzdmsXfNRhwitz7HEZ1bTRUnRmvKzQrw4B3r3LAag5f8%2B8CcCWfCRAkkG8%3D%7Ctkp%3ABk9SR4j6ws6cZw&amp;amp;mkcid=1&amp;amp;mkrid=711-53200-19255-0&amp;amp;siteid=0&amp;amp;campid=5338960379&amp;amp;customid=&amp;amp;toolid=10001&amp;amp;mkevt=1"&gt;Tesla P40 GPUs&lt;/a&gt;, each with 24GB of GDDR5X. Pascal architecture from 2016. Compute capability 6.1. No Tensor Cores, no native bfloat16 support. The benchmark uses a single P40, since Qwen TTS runs on one GPU. This machine lives in an unheated shop building in Minnesota and &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;screams through the winter&lt;/a&gt; when the BMC misinterprets sub-zero ambient temperatures as a hardware malfunction.&lt;/p&gt;
&lt;p&gt;All three machines run the same model checkpoint, the same text input, and the same speaker voice. The only differences are the silicon and the compute backend.&lt;/p&gt;
&lt;h3&gt;The Benchmark&lt;/h3&gt;
&lt;p&gt;I used a standardized 2,411-character passage, five paragraphs on the Jevons Paradox, dense enough to exercise the model's prosody and pacing on real written content. Each machine ran three consecutive generations from the same loaded model, producing roughly three minutes of audio per run. The first run includes kernel compilation and cache warmup; subsequent runs reflect steady-state performance.&lt;/p&gt;
&lt;p&gt;The metric that matters is Real-Time Factor (RTF): how many seconds of wall-clock time it takes to generate one second of audio. An RTF of 1.0 means the model generates audio at exactly real-time speed. Below 1.0 is faster than real-time. Above 1.0 means you are waiting.&lt;/p&gt;
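&lt;p&gt;The bookkeeping for RTF is trivial; the part worth getting right is timing only the generation call and treating the first run as warmup. A sketch of the harness shape, where &lt;code&gt;generate_speech&lt;/code&gt; is a placeholder for whatever your TTS pipeline's entry point is, not the Qwen API itself:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import time
import soundfile as sf   # assumes the soundfile package for reading the output WAV

def benchmark(generate_speech, text, runs=3):
    """Time consecutive generations and report the real-time factor (RTF)."""
    results = []
    for i in range(runs):
        start = time.perf_counter()
        wav_path = generate_speech(text)        # placeholder: returns a path to audio
        gen_time = time.perf_counter() - start

        audio, sr = sf.read(wav_path)
        audio_len = len(audio) / sr             # seconds of audio produced
        rtf = gen_time / audio_len              # seconds of compute per second of audio
        results.append(rtf)
        label = "warmup" if i == 0 else "steady"
        print(f"run {i + 1} ({label}): {gen_time:.1f}s for {audio_len:.1f}s audio, RTF {rtf:.2f}")
    return results&lt;/code&gt;&lt;/pre&gt;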
&lt;h4&gt;Individual Runs&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Apple M3 Max (MPS)&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Generation Time&lt;/th&gt;
&lt;th&gt;Audio Length&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;698.5s&lt;/td&gt;
&lt;td&gt;197.7s&lt;/td&gt;
&lt;td&gt;3.53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;533.1s&lt;/td&gt;
&lt;td&gt;184.2s&lt;/td&gt;
&lt;td&gt;2.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;447.8s&lt;/td&gt;
&lt;td&gt;179.2s&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;559.8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;187.0s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.97&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;AMD Radeon 8060S (ROCm)&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Generation Time&lt;/th&gt;
&lt;th&gt;Audio Length&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;729.2s&lt;/td&gt;
&lt;td&gt;173.6s&lt;/td&gt;
&lt;td&gt;4.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;460.0s&lt;/td&gt;
&lt;td&gt;204.8s&lt;/td&gt;
&lt;td&gt;2.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;548.2s&lt;/td&gt;
&lt;td&gt;214.2s&lt;/td&gt;
&lt;td&gt;2.56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;579.1s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;197.5s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.00&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;NVIDIA Tesla P40 (CUDA)&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Generation Time&lt;/th&gt;
&lt;th&gt;Audio Length&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1511.4s&lt;/td&gt;
&lt;td&gt;204.1s&lt;/td&gt;
&lt;td&gt;7.41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1225.7s&lt;/td&gt;
&lt;td&gt;171.6s&lt;/td&gt;
&lt;td&gt;7.14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1537.2s&lt;/td&gt;
&lt;td&gt;206.7s&lt;/td&gt;
&lt;td&gt;7.44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1424.8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;194.1s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.33&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Summary&lt;/h4&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Avg RTF&lt;/th&gt;
&lt;th&gt;Best RTF&lt;/th&gt;
&lt;th&gt;Avg Gen Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro&lt;/td&gt;
&lt;td&gt;M3 Max (MPS)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.97&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;559.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bosgame M5&lt;/td&gt;
&lt;td&gt;Radeon 8060S (ROCm)&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;2.25&lt;/td&gt;
&lt;td&gt;579.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Penguin 2U&lt;/td&gt;
&lt;td&gt;Tesla P40 (CUDA)&lt;/td&gt;
&lt;td&gt;7.33&lt;/td&gt;
&lt;td&gt;7.14&lt;/td&gt;
&lt;td&gt;1424.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;What the Numbers Mean&lt;/h3&gt;
&lt;p&gt;The headline result is that the M3 Max and Radeon 8060S are essentially tied, and the Tesla P40 is roughly 2.4 times slower than both. But that summary hides the interesting details.&lt;/p&gt;
&lt;h4&gt;The Warmup Effect Is Massive&lt;/h4&gt;
&lt;p&gt;On both the M3 Max and the Radeon 8060S, the first run is dramatically slower than subsequent runs. The M3 Max goes from RTF 3.53 on run 1 to RTF 2.50 on run 3, a 29% improvement. The AMD shows an even larger swing: RTF 4.20 on run 1 dropping to RTF 2.25 on run 2, a 46% improvement.&lt;/p&gt;
&lt;p&gt;This is kernel compilation. Both MPS and ROCm compile GPU kernels on first use and cache them for subsequent calls. The Qwen TTS model hits a wide variety of kernel shapes during autoregressive generation (different sequence lengths, different attention patterns) and each new shape triggers a compilation on the first encounter. By run 2, most of the common shapes are cached, and performance stabilizes.&lt;/p&gt;
&lt;p&gt;The P40 shows almost no warmup effect. RTF 7.41 on run 1, 7.14 on run 2, 7.44 on run 3. CUDA's kernel compilation is faster and more mature, so the overhead is absorbed within the first few seconds rather than spread across the entire run. But this maturity does not translate into faster inference; CUDA compiles faster, but the P40's hardware is fundamentally slower at the operations this model requires.&lt;/p&gt;
&lt;p&gt;This has a practical implication that matters: &lt;strong&gt;short benchmarks on MPS and ROCm are misleading.&lt;/strong&gt; I initially ran a quick 276-character test on all three machines before doing the full benchmark. The short test showed the AMD at RTF 9.20, almost identical to the P40's RTF 10.01, and far behind the M3 Max's RTF 2.84. That result nearly led me to conclude the AMD was performing as poorly as decade-old hardware. The longer benchmark, with its warmup effect amortized across more generation, revealed the truth: the AMD is just as fast as the M3 Max once the kernels are cached. If I had stopped at the short test, I would have drawn exactly the wrong conclusion.&lt;/p&gt;
&lt;h4&gt;Why the P40 Is So Slow&lt;/h4&gt;
&lt;p&gt;The Tesla P40 is a Pascal-generation GPU from 2016. It has 3,840 CUDA cores and 24GB of GDDR5X memory. On paper, it should be competitive; 12 TFLOPS of FP32 compute is not trivial. And for LLM inference through Ollama, the P40 &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;performs remarkably well&lt;/a&gt;, outperforming quad T4 instances on models up to 8B parameters.&lt;/p&gt;
&lt;p&gt;TTS is a different workload. Qwen3-TTS is an autoregressive transformer that generates audio tokens one at a time, each conditioned on all previous tokens. This means the inference is heavily memory-bandwidth bound during the decoding phase, and compute-bound during the attention and feedforward passes. The model is distributed in bfloat16 precision, which the P40 cannot compute natively; Pascal predates bfloat16 support entirely. PyTorch silently promotes bf16 operations to fp32 on the P40, roughly doubling the computation per operation and halving the effective throughput.&lt;/p&gt;
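&lt;p&gt;You can see the mismatch from PyTorch before running anything heavy. A minimal check; note that what &lt;code&gt;is_bf16_supported()&lt;/code&gt; reports has shifted across PyTorch versions, so the compute capability (8.0 and up for native bf16) is the more reliable signal:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import torch

# Pascal reports compute capability 6.1, which predates native bf16;
# bf16 tensors on the P40 end up going through fp32 math instead.
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: compute capability {major}.{minor}")
    print(f"PyTorch considers bf16 usable: {torch.cuda.is_bf16_supported()}")&lt;/code&gt;&lt;/pre&gt;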
&lt;p&gt;The P40 also lacks the SDPA (Scaled Dot-Product Attention) hardware acceleration that newer architectures provide. On the M3 Max, MPS routes attention through Metal's optimized primitives. On the AMD, ROCm's AOTriton provides experimental flash attention support. On the P40, attention runs through standard CUDA kernels without any of these accelerations. For a model that generates thousands of autoregressive steps per audio clip, each involving a full attention pass over the growing sequence, this compounds dramatically.&lt;/p&gt;
&lt;p&gt;The P40 is not bad hardware. It is excellent hardware for the workloads it was designed for: batch inference on quantized LLMs where its 24GB of VRAM per card creates a memory advantage. But autoregressive TTS in bfloat16 hits every one of its architectural weaknesses simultaneously.&lt;/p&gt;
&lt;h4&gt;Unified Memory Wins This Workload&lt;/h4&gt;
&lt;p&gt;Both the M3 Max and the Radeon 8060S use unified memory architectures, where the CPU and GPU share the same physical memory pool. The M3 Max has 64GB of unified LPDDR5. The Radeon 8060S shares 128GB of DDR5 with the CPU, with roughly 96GB addressable as VRAM.&lt;/p&gt;
&lt;p&gt;For a 1.7B parameter model in bf16, the weights occupy roughly 3.4GB. The model fits comfortably on all three machines. But the autoregressive generation pattern creates a stream of intermediate activations (KV cache entries, attention scores, feedforward intermediates) that grow with the sequence length. On a unified memory architecture, these intermediates exist in the same memory space as the model weights, avoiding any PCIe transfer overhead. On the P40, every interaction between CPU and GPU crosses a PCIe 3.0 bus.&lt;/p&gt;
&lt;p&gt;For LLM inference, where the bottleneck is token generation throughput and the KV cache fits in VRAM, the P40's discrete memory is fine. For TTS, where the model generates hundreds of audio tokens per second of speech and the attention window grows continuously, the memory access pattern favors unified architectures.&lt;/p&gt;
&lt;p&gt;This is not a universal statement about unified versus discrete memory. A modern discrete GPU with HBM2e or GDDR6X and PCIe 4.0 or 5.0 would likely outperform both the M3 Max and the Radeon 8060S on this workload. The P40's problem is not that its memory is discrete; it is that its memory is slow and its bus is narrow by 2026 standards.&lt;/p&gt;
&lt;h3&gt;The Model Architecture Question&lt;/h3&gt;
&lt;p&gt;While benchmarking Qwen TTS, I also ran a quick comparison with &lt;a href="https://huggingface.co/SWivid/F5-TTS"&gt;F5-TTS&lt;/a&gt; on the AMD machine to sanity-check the results. F5-TTS is a flow-matching model, fundamentally different from Qwen's autoregressive approach. Where Qwen generates audio tokens sequentially, each conditioned on all previous tokens, F5 generates audio in parallel through an iterative refinement process.&lt;/p&gt;
&lt;p&gt;The difference is stark. On the same Radeon 8060S, the same text, the same hardware:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Generation Time&lt;/th&gt;
&lt;th&gt;Audio Length&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-TTS&lt;/td&gt;
&lt;td&gt;579.1s (avg)&lt;/td&gt;
&lt;td&gt;197.5s&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F5-TTS&lt;/td&gt;
&lt;td&gt;17.4s&lt;/td&gt;
&lt;td&gt;27.2s&lt;/td&gt;
&lt;td&gt;0.64&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;F5-TTS is faster than real-time. Qwen3-TTS takes three times longer than the audio it produces. On normalized terms, F5 is roughly five times faster than Qwen at steady-state, and the gap widens on shorter content where Qwen's warmup overhead is proportionally larger.&lt;/p&gt;
&lt;p&gt;This is not an apples-to-apples quality comparison. Qwen3-TTS generally produces more natural prosody, better handling of complex sentence structures, and more consistent speaker identity across long passages. F5-TTS is excellent but can occasionally drift in voice character or pacing on very long content. For blog narration, both are well above the threshold of "good enough," and the quality difference is smaller than you might expect given the architectural gap.&lt;/p&gt;
&lt;p&gt;The point is that hardware is only half the story. The choice of model architecture can matter more than the choice of GPU. A flow-matching model on integrated AMD graphics outperforms an autoregressive model on Apple's best laptop silicon by a wide margin. If generation speed is the constraint, switching models gains more than switching hardware.&lt;/p&gt;
&lt;h3&gt;What This Costs in Practice&lt;/h3&gt;
&lt;p&gt;The abstract benchmark numbers translate into concrete time and electricity costs when you are generating audio for a library of blog posts.&lt;/p&gt;
&lt;p&gt;A typical TinyComputers post runs 3,000 to 5,000 words, producing 15 to 25 minutes of narrated audio. At steady-state RTF:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;15 min audio&lt;/th&gt;
&lt;th&gt;25 min audio&lt;/th&gt;
&lt;th&gt;System Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M3 Max&lt;/td&gt;
&lt;td&gt;~38 min&lt;/td&gt;
&lt;td&gt;~63 min&lt;/td&gt;
&lt;td&gt;~50W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Radeon 8060S&lt;/td&gt;
&lt;td&gt;~38 min&lt;/td&gt;
&lt;td&gt;~63 min&lt;/td&gt;
&lt;td&gt;~100W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tesla P40&lt;/td&gt;
&lt;td&gt;~110 min&lt;/td&gt;
&lt;td&gt;~183 min&lt;/td&gt;
&lt;td&gt;~400W&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The M3 Max and Radeon 8060S are tied on generation time, but the M3 Max draws roughly half the system power. For a single post, the electricity cost difference is negligible, a fraction of a cent. For batch processing a backlog of thirty posts, the M3 Max costs about $0.18 in electricity versus $0.36 for the AMD and $3.50 for the P40.&lt;/p&gt;
&lt;p&gt;None of these numbers are alarming. Even the P40, at nearly two and a half hours per post and 400 watts from the wall, costs under fifteen cents in electricity per narration at Minnesota residential rates. The equivalent Google Cloud TTS job would cost $4 to $16 per post depending on the voice quality tier.&lt;/p&gt;
&lt;p&gt;To put cloud costs in perspective: I recently ran a fiction novel through Google's Chirp3-HD voice: 82,000 words, roughly 500,000 characters of text plus SSML markup. The bill came to $17.25 at Google's rate of $30 per million characters. That is not unreasonable for a one-off project, but it adds up quickly if you are generating audio regularly. The entire library of TinyComputers narrations (dozens of posts, hours of audio) has cost me nothing beyond the electricity to run the machines I already own. The economics of local TTS are favorable on every machine in the comparison.&lt;/p&gt;
&lt;p&gt;The real cost is time. If I am generating audio for a single new post, I start it on whichever machine is idle and check back in an hour. If I am regenerating audio for twenty posts after changing the speaker voice or updating the pipeline, the M3 Max or AMD will finish overnight. The P40 would take most of a weekend.&lt;/p&gt;
&lt;h3&gt;The Right Machine for the Job&lt;/h3&gt;
&lt;p&gt;After running these benchmarks, my workflow has shifted. The M3 Max is the default for new post narration; it is fast, quiet, and I am usually sitting in front of it when I finish writing. The AMD handles batch jobs and overnight processing, where its slightly higher power draw does not matter and its equivalent speed makes it interchangeable with the Mac. The P40 server is reserved for what it does best: &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;running large language models&lt;/a&gt; through Ollama, where its 96GB of aggregate VRAM gives it an advantage that neither the Mac nor the AMD can match.&lt;/p&gt;
&lt;p&gt;The P40 can still generate TTS in a pinch, and it does; when both other machines are occupied, I will queue a job on the P40 and accept the longer wait. But for a workload that is inherently autoregressive, memory-bandwidth sensitive, and dependent on bf16 precision, a ten-year-old Pascal GPU is the wrong tool.&lt;/p&gt;
&lt;p&gt;What surprised me most is how well the AMD performs. The Radeon 8060S is an integrated GPU sharing system memory with the CPU. It has no HBM, no dedicated VRAM, no NVLink. Its ROCm software stack requires environment variable hacks, pre-release PyTorch wheels, and a GFX version override to function at all. And yet, once the kernels warm up, it matches Apple's best laptop silicon stride for stride. The raw hardware is there: 40 RDNA 3.5 compute units with access to a deep pool of DDR5 memory. The software just needs to get out of the way, and on run 2 and beyond, it does.&lt;/p&gt;
&lt;h3&gt;Lessons&lt;/h3&gt;
&lt;p&gt;Three takeaways from this exercise that generalize beyond TTS:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Short benchmarks lie.&lt;/strong&gt; Kernel compilation overhead on MPS and ROCm is large enough to dominate a short test. If you are evaluating a new model on non-CUDA hardware, run it at least twice before drawing conclusions. The first run is measuring the software stack, not the hardware.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Architecture matters more than clock speed.&lt;/strong&gt; The P40 has more raw FLOPS than the Radeon 8060S. It does not matter. The P40 lacks native bf16, lacks efficient attention primitives, and sits behind a PCIe 3.0 bus. The Radeon has all three, and ties a chip designed by Apple's custom silicon team. For autoregressive models, the architectural fit between model and hardware dominates everything else.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Model choice can outweigh hardware choice.&lt;/strong&gt; F5-TTS running on the weakest GPU in this comparison is five times faster than Qwen3-TTS running on the strongest. If your constraint is generation speed and you can accept a modest quality trade-off, switching to a flow-matching architecture gains more than any hardware upgrade short of a data center GPU.&lt;/p&gt;
&lt;p&gt;The audio player at the top of each post on this site represents a few minutes of machine time on one of these three machines. Which machine generated it depends on the day, the workload, and what else is running. The listener cannot tell the difference. The audio sounds the same regardless of whether it was generated on a laptop, a mini desktop, or a rack-mount server in a cold Minnesota shop. That is the real benchmark: not which machine is fastest, but that all three are fast enough.&lt;/p&gt;</description><category>amd</category><category>apple silicon</category><category>audio</category><category>benchmarks</category><category>cuda</category><category>gpu</category><category>inference</category><category>m3 max</category><category>machine learning</category><category>mps</category><category>nvidia</category><category>qwen</category><category>rocm</category><category>strix halo</category><category>tesla p40</category><category>text-to-speech</category><category>tts</category><guid>https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html</guid><pubDate>Thu, 12 Mar 2026 14:00:00 GMT</pubDate></item><item><title>Repurposing Enterprise GPUs: The Tesla P40 Home Lab Story</title><link>https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;17 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;There is a window, maybe eighteen months wide, where enterprise hardware hits a pricing sweet spot. The first-generation buyers (the hyperscalers, the research labs, the Fortune 500 AI teams) have moved on to the next generation. The second-hand market floods. Prices crater. And if you know what you're looking for, you can build something genuinely capable for less than a month of cloud compute.&lt;/p&gt;
&lt;p&gt;I built a four-GPU inference server for about twenty-five hundred dollars. This is the story of how, why, and whether you should do the same.&lt;/p&gt;
&lt;h3&gt;The Buy&lt;/h3&gt;
&lt;p&gt;The acquisition strategy is straightforward: eBay, patience, and knowing what to look for.&lt;/p&gt;
&lt;p&gt;Tesla P40s started appearing in volume on the secondary market around 2023, when cloud providers and enterprise data centers began cycling them out in favor of A100s and H100s. A card that sold for over five thousand dollars new was suddenly available for three hundred, then two hundred and fifty, then, if you watched listings carefully and were willing to buy from decommissioned lot sellers, sometimes less. I picked up four cards over the course of about two months, averaging two hundred and fifty dollars each.&lt;/p&gt;
&lt;p&gt;The chassis was a Penguin Computing 2U rack-mount server, also from eBay. These show up when government labs and research institutions liquidate equipment. The Penguin Computing systems are well-built, with proper server-grade construction, redundant power supplies, and engineered airflow. Mine takes a pair of Xeon E5-2697A v4s, both purchased from eBay: eighteen Broadwell cores apiece, more than enough CPU to keep four GPUs fed. The chassis cost around six hundred dollars.&lt;/p&gt;
&lt;p&gt;Memory was the lucky purchase. I bought 252GB of DDR4 ECC RAM before the memory price spike that hit in late 2024 when every company on Earth decided they needed AI infrastructure simultaneously. What I paid around two hundred and fifty dollars for would cost significantly more today. Total build: roughly twenty-five hundred dollars.&lt;/p&gt;
&lt;h3&gt;The Hardware&lt;/h3&gt;
&lt;p&gt;The Tesla P40 is a 2016-era data center GPU. NVIDIA designed it for the Pascal generation, targeting inference workloads in enterprise environments. The specifications, for something you can buy on eBay for two hundred and fifty dollars, are remarkable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;24GB GDDR5X&lt;/strong&gt; per card, as much memory as an RTX 4090&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;3,840 CUDA cores&lt;/strong&gt;, Pascal architecture, compute capability 6.1&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;12 TFLOPS FP32&lt;/strong&gt;, respectable even by 2026 standards for inference&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;250W TDP&lt;/strong&gt;: this is a data center card and it draws power like one&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Multiply by four and you get 96GB of VRAM for a thousand dollars. That is an extraordinary amount of GPU memory for the price. For context, a single NVIDIA A100 80GB still sells for north of five thousand dollars on the secondary market. Four P40s give you more total VRAM for a fraction of the cost.&lt;/p&gt;
&lt;h3&gt;What You Give Up&lt;/h3&gt;
&lt;p&gt;There is no free lunch in computing, and the P40 makes you pay for its low price in specific, sometimes painful ways.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No Tensor Cores.&lt;/strong&gt; The P40 predates NVIDIA's Tensor Core architecture, which arrived with Volta in 2017. Tensor Cores accelerate matrix multiplication (the fundamental operation in neural network inference) by factors of 4x to 16x depending on precision. The P40 does everything with its CUDA cores, the old-fashioned way. This matters less than you might think for inference at moderate batch sizes, but it means you will never match the throughput of a V100 or newer card, clock for clock.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No native BF16 or FP16.&lt;/strong&gt; This is the real gotcha. BF16 (bfloat16) has become the default precision for large language models. It is what most model weights are distributed in. The P40 cannot compute in BF16 natively; it emulates it through FP32 operations, which is roughly 21% slower than native support. In practice, this means you are running quantized models (Q4, Q5, Q8) through llama.cpp or similar frameworks, which handle the precision conversion for you. It works. It is not optimal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Passive cooling designed for server airflow.&lt;/strong&gt; The P40 is a blower-style card designed for 1U and 2U server chassis with front-to-back forced airflow. In a proper server, this is fine. In anything else, you need to solve cooling yourself. I put mine in a Penguin Computing 2U rack-mount chassis, which has the right airflow characteristics, but this is not a card you drop into a desktop tower.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PCIe 3.0 x16.&lt;/strong&gt; The P40 connects via PCIe 3.0, which provides about 16 GB/s of bandwidth per direction. When you are running a model that spans four GPUs, the inter-GPU communication goes over PCIe, not NVLink. This creates a bottleneck for models that require heavy cross-GPU communication. For inference, where the communication pattern is more predictable than training, this is manageable. For training, it would be a serious constraint.&lt;/p&gt;
&lt;h3&gt;The Minnesota Problem&lt;/h3&gt;
&lt;p&gt;My server lives in an unheated shop building in northern Minnesota. This has created an issue that no hardware review will prepare you for.&lt;/p&gt;
&lt;p&gt;When ambient temperatures drop below freezing (which, in Minnesota, means roughly October through April) the onboard temperature sensors report values that the baseboard management controller interprets as a malfunction. The BMC's response is to spin every fan to maximum RPM as a protective measure.&lt;/p&gt;
&lt;p&gt;The result is a machine that, on quiet winter nights, is audible from the house. The house is a hundred and fifty feet away.&lt;/p&gt;
&lt;p&gt;I have not solved this problem. I have learned to live with it. You can override BMC fan curves on some platforms, but the Penguin Computing firmware is locked down in ways that make this nontrivial, and frankly, a server that runs its fans at full speed because it thinks it is dying is doing exactly what it should be doing. The firmware's assumptions are just wrong for the environment.&lt;/p&gt;
&lt;p&gt;The server runs 24/7 regardless of the season, and the cold air actually keeps the GPUs well within thermal limits. The irony is that the machine has never been cooler or louder than when it is twenty below zero outside. If you are considering a similar setup in a garage, basement, or outbuilding, factor in noise. A 2U server with four 250W GPUs is not quiet under any circumstances, and server-grade fans at full RPM are genuinely loud.&lt;/p&gt;
&lt;h3&gt;Setting Up the Software Stack&lt;/h3&gt;
&lt;p&gt;The driver situation for the P40 in 2026 is straightforward, though it was not always. NVIDIA's &lt;code&gt;nvidia-driver-570-server&lt;/code&gt; package works cleanly on Ubuntu, and the DKMS module rebuilds automatically on kernel updates, most of the time. I have had exactly two occasions where a kernel update broke the NVIDIA module and required manual intervention. This is fewer than I expected.&lt;/p&gt;
&lt;p&gt;For inference, I run &lt;a href="https://ollama.com"&gt;Ollama&lt;/a&gt;, which wraps llama.cpp and provides a simple API for model management and inference. Ollama handles multi-GPU sharding automatically: when you load a model, it distributes layers across GPUs based on available memory and model size. A 65GB model like gpt-oss:120b fits across three of the four P40s, leaving one free. Smaller models may only need one or two cards. The allocation is generally sensible, though you have less control over placement than you would with raw llama.cpp.&lt;/p&gt;
&lt;p&gt;The alternative stack (vLLM, TGI, or raw llama.cpp) offers more control over GPU assignment but requires more configuration. With llama.cpp directly, you can pin specific GPU layers to specific devices, which lets you optimize for the P40's memory topology. vLLM provides better batching and continuous batching for serving multiple concurrent requests. For a home lab where the primary use case is running various models for experimentation and development rather than serving production traffic, Ollama's simplicity wins.&lt;/p&gt;
&lt;p&gt;One thing worth noting: the P40 is well-supported by the GGUF ecosystem that llama.cpp (and therefore Ollama) uses. GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0) run without issues on Pascal hardware. The quantization handles the BF16 problem for you: model weights are stored in 4-bit or 8-bit integer formats and dequantized to FP32 at runtime, which the P40 handles natively. You are not fighting the hardware; you are working with it.&lt;/p&gt;
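&lt;p&gt;To make the "dequantize at runtime" point concrete, here is a toy block-wise 4-bit scheme in the spirit of Q4_0. This is not llama.cpp's actual layout or kernels, just the shape of the idea: a scale per block plus small integers, expanded back to float32 when the math happens.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

BLOCK = 32   # block size in the spirit of Q4_0; real GGUF formats differ in detail

def quantize_q4(weights):
    """Toy block-wise 4-bit quantization: one scale per block, ints in [-8, 7]."""
    w = weights.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)          # avoid dividing by all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_q4(q, scale):
    """Expand back to float32 at runtime, which Pascal computes natively."""
    return (q.astype(np.float32) * scale).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4(w)
print("max abs error:", np.abs(w - dequantize_q4(q, s)).max())&lt;/code&gt;&lt;/pre&gt;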
&lt;h3&gt;The Benchmarks&lt;/h3&gt;
&lt;p&gt;Theory is cheap. Benchmarks are what matter. I ran the same inference workload across three configurations: my four P40 home lab, a single AWS Tesla T4 instance, and a quad T4 instance on AWS. The T4 is the closest cloud comparison; it is the workhorse inference GPU in AWS's fleet, one generation newer than the P40 (Turing architecture, 2018), with 16GB of GDDR6 and actual Tensor Cores.&lt;/p&gt;
&lt;p&gt;All benchmarks used Ollama with the same prompt, measuring tokens per second during the evaluation phase (excluding model load time).&lt;/p&gt;
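&lt;p&gt;The throughput number comes straight from Ollama's own accounting: the generate endpoint reports &lt;code&gt;eval_count&lt;/code&gt; and &lt;code&gt;eval_duration&lt;/code&gt; for the response tokens, separate from load and prompt-processing time. A minimal sketch of pulling it over the local API (the model name here is just an example):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import requests   # assumes the requests package; Ollama listens on port 11434 by default

def eval_tokens_per_second(model, prompt, host="http://localhost:11434"):
    """Request a completion and compute tok/s from Ollama's reported timings."""
    resp = requests.post(f"{host}/api/generate",
                         json={"model": model, "prompt": prompt, "stream": False},
                         timeout=600)
    resp.raise_for_status()
    data = resp.json()
    # eval_duration is reported in nanoseconds and excludes model load time
    return data["eval_count"] / (data["eval_duration"] / 1e9)

print(f'{eval_tokens_per_second("llama3.1:8b", "Explain PCIe 3.0 in one paragraph."):.1f} tok/s')&lt;/code&gt;&lt;/pre&gt;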
&lt;h4&gt;Dense Models&lt;/h4&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;4x P40 (Home Lab)&lt;/th&gt;
&lt;th&gt;1x T4 (AWS $0.53/hr)&lt;/th&gt;
&lt;th&gt;4x T4 (AWS $3.91/hr)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2&lt;/td&gt;
&lt;td&gt;3B&lt;/td&gt;
&lt;td&gt;94.3 tok/s&lt;/td&gt;
&lt;td&gt;81.5 tok/s&lt;/td&gt;
&lt;td&gt;101.5 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;52.7 tok/s&lt;/td&gt;
&lt;td&gt;36.9 tok/s&lt;/td&gt;
&lt;td&gt;40.3 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;47.8 tok/s&lt;/td&gt;
&lt;td&gt;35.7 tok/s&lt;/td&gt;
&lt;td&gt;29.2 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The P40 wins on the 7B and 8B models by substantial margins, 31% and 64% respectively over the quad T4 configuration. The only model where the T4 edges ahead is the 3B, which is small enough to fit entirely on a single GPU. Here, the T4's higher clock speeds and faster GDDR6 memory give it an advantage because there is no multi-GPU overhead to penalize it.&lt;/p&gt;
&lt;p&gt;The 8B result is particularly interesting. The quad T4 actually performs &lt;em&gt;worse&lt;/em&gt; than a single T4 on this model (29.2 vs 35.7 tok/s). Ollama shards the model across all four GPUs even though it fits on one, and the PCIe communication overhead between four T4s costs more than it gains. The P40, with its larger 24GB per-card memory, likely fits more of the model per GPU, reducing cross-GPU transfers.&lt;/p&gt;
&lt;h4&gt;The MoE Advantage&lt;/h4&gt;
&lt;p&gt;The most compelling benchmark comes from OpenAI's gpt-oss, a 120-billion parameter mixture-of-experts model with only 5.1 billion active parameters per token. The MoE architecture means the model's total weight is large (it needs the memory), but the computation per token is modest (only a fraction of the parameters fire for any given input).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;4x P40&lt;/th&gt;
&lt;th&gt;4x T4 (AWS $3.91/hr)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-oss&lt;/td&gt;
&lt;td&gt;120B MoE (5.1B active)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28.1 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20.6 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The P40 runs OpenAI's 120B model at 28.1 tokens per second, 36% faster than the cloud instance, and fast enough for comfortable interactive use. This is a state-of-the-art model running on decade-old GPUs at a speed that would have been impressive on much newer hardware a year ago.&lt;/p&gt;
&lt;p&gt;The reason is memory. The gpt-oss model uses MXFP4 quantization on its MoE weights, bringing the total model size to about 65GB. Four P40s offer 96GB of VRAM, enough to hold the entire model in GPU memory. Four T4s offer only 64GB, which means some of the model likely spills to system RAM, adding latency on every token.&lt;/p&gt;
&lt;p&gt;This is the P40's superpower: 24GB per card was overkill in 2016, and it is exactly right in 2026. Models have grown to fill the memory, and the P40 has more of it per dollar than almost anything else on the market.&lt;/p&gt;
&lt;h4&gt;Where It Falls Apart&lt;/h4&gt;
&lt;p&gt;Dense 70B models are a different story. Llama 3.1 70B at Q4_0 quantization (39GB) fits across 96GB of P40 VRAM, but the inference speed is essentially unusable: 0.033 tokens per second. One token every thirty seconds. Answering "What is 2+2?" took six and a half minutes. The combination of no Tensor Cores, PCIe 3.0 interconnect, and the sheer volume of cross-GPU data transfers for a dense 70B model pushes the per-token latency beyond any practical threshold.&lt;/p&gt;
&lt;p&gt;The quad T4 on AWS managed 2.0 tokens per second on the same model, sixty times faster. Slow, but functional. The T4's Tensor Cores make the difference here; at this scale, the P40's raw CUDA cores simply cannot keep up with the matrix math.&lt;/p&gt;
&lt;p&gt;The lesson: MoE models and quantized models up to about 8B parameters are the P40's sweet spot. Dense models above 13B start hitting diminishing returns. Dense 70B is a wall.&lt;/p&gt;
&lt;h3&gt;The Cost Argument&lt;/h3&gt;
&lt;p&gt;Here is the math that justifies the project.&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;g4dn.12xlarge&lt;/code&gt; on AWS (four Tesla T4s, 48 vCPUs, 192GB RAM) costs $3.91 per hour. My home lab outperforms it on every model except the smallest. If I run inference for just four hours a day, the cloud cost would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Daily&lt;/strong&gt;: $15.64&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monthly&lt;/strong&gt;: $469&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Yearly&lt;/strong&gt;: $5,694&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My server cost $2,500 to build. It pays for itself in roughly five months of equivalent cloud usage. After that, the only ongoing cost is electricity. At Minnesota residential rates (roughly $0.12/kWh) and an average draw of 800W under load, that is about $70 per month. Less than a single day of the equivalent cloud instance.&lt;/p&gt;
&lt;p&gt;Even if you factor in the P40's lower performance on some workloads and assume you only get 70% of the cloud equivalent's utility, the break-even point is still well under a year. For a home lab that runs 24/7 for development, experimentation, and &lt;a href="https://tinycomputers.io/posts/clean-room-z80-emulator.html"&gt;text-to-speech generation&lt;/a&gt;, the economics are overwhelming.&lt;/p&gt;
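&lt;p&gt;The break-even arithmetic, including that 70% haircut, is short enough to check in a few lines:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Payback period for the $2,500 build against a g4dn.12xlarge at 4 hours/day.
CLOUD_HOURLY = 3.91
HOURS_PER_DAY = 4
BUILD_COST = 2500
ELECTRICITY_MONTHLY = 70            # roughly 800W average at about $0.12/kWh

cloud_monthly = CLOUD_HOURLY * HOURS_PER_DAY * 30                     # about $469
print(f"simple payback: {BUILD_COST / cloud_monthly:.1f} months")     # about 5.3

# Pessimistic case: the P40s deliver only 70% of the cloud box's utility,
# and local electricity counts against the savings.
net_saving = cloud_monthly * 0.7 - ELECTRICITY_MONTHLY
print(f"pessimistic payback: {BUILD_COST / net_saving:.1f} months")   # about 9.7&lt;/code&gt;&lt;/pre&gt;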
&lt;h3&gt;What I Actually Use It For&lt;/h3&gt;
&lt;p&gt;The server runs several workloads:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Local LLM inference.&lt;/strong&gt; This is the primary use case. Having a local inference server with 96GB of VRAM means I can run frontier-class open-weight models without sending data to a cloud API. For development work, where I might make hundreds of inference calls while iterating on a project, the zero marginal cost changes how I work. I experiment more freely when each query costs nothing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text-to-speech.&lt;/strong&gt; I run &lt;a href="https://tinycomputers.io/posts/clean-room-z80-emulator.html"&gt;Qwen TTS&lt;/a&gt; on the P40s to generate audio narration for blog posts. The model fits comfortably in the P40's memory, and the generation speed is acceptable for batch processing. The narration you hear on posts across this site was generated on these GPUs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Development and testing.&lt;/strong&gt; When I am building projects like &lt;a href="https://tinycomputers.io/posts/sampo-designing-a-16-bit-risc-cpu-from-scratch-part-1-theory-and-architecture.html"&gt;Sampo&lt;/a&gt; or &lt;a href="https://tinycomputers.io/posts/introducing-lattice-a-crystallization-based-programming-language.html"&gt;Lattice&lt;/a&gt;, having local GPU compute available for testing AI-assisted workflows means I do not need to worry about API rate limits or costs during intensive development sessions.&lt;/p&gt;
&lt;p&gt;The server sits on my local network at a static IP, accessible from any machine in the house. It is always on, always available, and always free to use. That availability changes your relationship with AI inference in ways that are hard to appreciate until you have lived with it. There is a psychological difference between "this costs two cents per query" and "this costs nothing per query." The first makes you think about whether the query is worth it. The second lets you experiment without friction, and that friction reduction, compounded across hundreds of daily interactions, fundamentally changes how you work.&lt;/p&gt;
&lt;p&gt;This is, incidentally, a small-scale example of the &lt;a href="https://tinycomputers.io/posts/jevons-paradox.html"&gt;Jevons Paradox&lt;/a&gt; I have been writing about in this blog's economics series. Making inference cheaper did not cause me to run the same number of queries and pocket the savings. It caused me to run dramatically more queries, on more models, for more projects, consuming more total compute than I ever would have purchased from a cloud provider. The efficiency created demand.&lt;/p&gt;
&lt;h3&gt;Should You Build One?&lt;/h3&gt;
&lt;p&gt;The honest answer is: it depends on what you value.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Build one if:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You run local inference regularly and the cloud costs are adding up&lt;/li&gt;
&lt;li&gt;You want 96GB of VRAM for under a thousand dollars in GPU costs&lt;/li&gt;
&lt;li&gt;You have the physical space, electrical capacity, and noise tolerance for a rack-mount server&lt;/li&gt;
&lt;li&gt;You enjoy the process of building and configuring systems; this is not a plug-and-play experience&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Do not build one if:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need the latest model performance (Tensor Cores, FP8, NVLink)&lt;/li&gt;
&lt;li&gt;You are training models, not running inference&lt;/li&gt;
&lt;li&gt;You need reliability guarantees; this is a home lab, not a production environment&lt;/li&gt;
&lt;li&gt;You are not comfortable with Linux system administration, driver debugging, and occasional hardware troubleshooting&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The P40 window will not last forever. As newer GPUs age out of data centers (the V100, the A100), the P40 will eventually lose its price-to-performance advantage. The V100, with its first-generation Tensor Cores and 32GB of HBM2, is already starting to appear at attractive secondary-market prices. Within a year, it may be the new sweet spot. But right now, in early 2026, four P40s on eBay represent one of the best deals in GPU computing. Ninety-six gigabytes of VRAM, proven CUDA compatibility, and a decade of driver maturity, for the price of a weekend trip.&lt;/p&gt;
&lt;p&gt;The server in my shop building will keep running. The fans will keep screaming through the Minnesota winter. And I will keep running models on hardware that a hyperscaler discarded three years ago, at speeds that would have been remarkable on any hardware five years ago. That is the beauty of the secondary market: someone else paid for the R&amp;amp;D, someone else paid for the depreciation, and you get the compute.&lt;/p&gt;</description><category>ai</category><category>benchmarks</category><category>cuda</category><category>deep learning</category><category>ebay</category><category>enterprise hardware</category><category>gpu</category><category>home lab</category><category>inference</category><category>nvidia</category><category>ollama</category><category>tesla p40</category><guid>https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html</guid><pubDate>Wed, 11 Mar 2026 14:00:00 GMT</pubDate></item><item><title>Moore's Law for Intelligence: What Happens When Thinking Gets Cheap</title><link>https://tinycomputers.io/posts/moores-law-for-intelligence-what-happens-when-thinking-gets-cheap.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/moores-law-for-intelligence-what-happens-when-thinking-gets-cheap_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;24 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;&lt;img src="https://tinycomputers.io/images/moores-law-intelligence/silicon-wafer.jpg" alt="A silicon wafer with an array of integrated circuit dies, the physical foundation of Moore's Law" style="float: right; max-width: 40%; margin: 0 0 1em 1.5em; border-radius: 4px;"&gt;&lt;/p&gt;
&lt;p&gt;I have written about &lt;a href="https://tinycomputers.io/posts/jevons-paradox.html"&gt;Jevons Paradox&lt;/a&gt; twice now, once through the history of the semiconductor industry, and once as a &lt;a href="https://tinycomputers.io/posts/the-jevons-counter-thesis-why-ai-displacement-scenarios-underweight-demand-expansion.html"&gt;broader examination&lt;/a&gt; of what happens when the cost of a critical economic input collapses. The pattern is consistent: demand expands to overwhelm the savings. Coal. Transistors. Bandwidth. Lighting.&lt;/p&gt;
&lt;p&gt;Those pieces looked at the pattern itself. This one is different. I want to run a thought experiment forward, not backward.&lt;/p&gt;
&lt;p&gt;I've also spent a lot of time on this site looking backward at computing history, watching &lt;a href="https://tinycomputers.io/posts/stewart-cheifet-and-his-computer-chronicles.html"&gt;Stewart Cheifet walk viewers through the early personal computer revolution&lt;/a&gt; on &lt;em&gt;The Computer Chronicles&lt;/em&gt;, examining how &lt;a href="https://tinycomputers.io/posts/language-manipulators-what-a-1983-episode-of-the-computer-chronicles-got-right-and-wrong-about-word-processing.html"&gt;word processing went from a curiosity to a necessity&lt;/a&gt; in a single decade, tracing &lt;a href="https://tinycomputers.io/posts/george-morrow-pioneer-of-personal-computing.html"&gt;George Morrow's&lt;/a&gt; role in making personal computing real, and following &lt;a href="https://tinycomputers.io/posts/cpm-history-and-legacy.html"&gt;CP/M's arc&lt;/a&gt; from operating system of the future to historical footnote. I've &lt;a href="https://tinycomputers.io/posts/cpm-on-physical-retroshield-z80.html"&gt;run CP/M on physical RetroShield hardware&lt;/a&gt;, explored the &lt;a href="https://tinycomputers.io/posts/motorola-68000-processor-and-the-ti-89-graphing-calculator.html"&gt;Motorola 68000&lt;/a&gt; that powered a generation of machines, and dug into &lt;a href="https://tinycomputers.io/posts/infocom-zork-history.html"&gt;how Infocom turned text adventures into a business&lt;/a&gt; at a time when 64K of RAM was generous. That immersion in where computing came from is exactly what makes the forward question so vivid, because at every stage, the people living through the transition couldn't see what was coming next. The engineers building CP/M didn't anticipate DOS. The engineers building DOS didn't anticipate the web. The engineers building the web didn't anticipate the iPhone. The pattern is always the same: cheaper compute enables things that were unimaginable at the prior cost.&lt;/p&gt;
&lt;p&gt;The question isn't "will AI destroy jobs?" or "is the doom scenario wrong?" The question is: &lt;strong&gt;what becomes possible when thinking gets cheap?&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Because AI compute is following a cost curve that looks remarkably like the early decades of Moore's Law. And if that continues (if the cost per unit of machine intelligence drops by an order of magnitude every few years) the consequences extend far beyond making today's chatbots cheaper to run.&lt;/p&gt;
&lt;h3&gt;The Cost Curve&lt;/h3&gt;
&lt;p&gt;&lt;img src="https://tinycomputers.io/images/moores-law-intelligence/moores-law-transistor-count.png" alt="Microprocessor transistor counts from 1971 to 2011 plotted on a logarithmic scale, showing Moore's Law doubling trend" style="max-width: 100%; margin: 0 0 1.5em 0; border-radius: 4px;"&gt;&lt;/p&gt;
&lt;p&gt;Moore's Law, in its original formulation, described the doubling of transistors per integrated circuit roughly every two years. But the economic consequence that mattered wasn't transistor density; it was cost per unit of compute. From the 1960s through the 2010s, the cost per FLOP declined at a compound rate that delivered roughly a 10x improvement every four to five years. A computation that cost \$1 million in 1975 cost \$1 by 2010. That decline didn't just make existing applications cheaper. It created entirely new categories of computing that were inconceivable at the prior cost structure.&lt;/p&gt;
&lt;p&gt;AI inference costs are now following a similar trajectory, but faster. OpenAI's text-davinci-003, released in late 2022, cost \$20 per million tokens. GPT-4o mini, released in mid-2024, delivers substantially better performance at \$0.15 per million input tokens, a 99% cost reduction in under two years. Claude, Gemini, and open-source models have followed similar curves. DeepSeek entered the market in early 2025 with pricing that undercut Western frontier models by roughly 90%, compressing the timeline further through competitive pressure.&lt;/p&gt;
&lt;p&gt;The GPU hardware underneath these models is on its own Moore's Law trajectory. GPU price-performance in FLOP/s per dollar doubles approximately every 2.5 years for ML-class hardware. Architectural improvements in transformers, mixture-of-experts routing, quantization, speculative decoding, and distillation compound on top of the hardware gains. The result is a cost curve where the effective price of a unit of machine reasoning is falling faster than the price of a transistor did during the semiconductor industry's most explosive growth phase.&lt;/p&gt;
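&lt;p&gt;To see what that compounding looks like on paper, here is a toy projection. The nine-month halving interval is an illustrative assumption, not a measured constant; the point is the shape of the curve, not the specific dates:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Toy projection of inference cost per million tokens, assuming the effective
# cost halves every N months. The halving interval is an assumption chosen
# for illustration; swap in your own estimate.

START_PRICE = 3.00     # USD per million input tokens, frontier-class model today
HALVING_MONTHS = 9     # assumed halving interval for effective inference cost

for years in (1, 2, 3, 5, 8):
    price = START_PRICE * 0.5 ** (years * 12 / HALVING_MONTHS)
    print(f"Year {years}: ${price:.4f} per million tokens "
          f"(~{START_PRICE / price:,.0f}x cheaper)")
&lt;/code&gt;&lt;/pre&gt;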
&lt;p&gt;This matters because we know, empirically, what happens when the cost of a foundational input follows an exponential decline. We have sixty years of data on it. The compute industry went from a few thousand mainframes serving governments and large corporations to billions of devices in every pocket, every appliance, every traffic light. Total spending on computing didn't shrink as costs fell; it expanded by orders of magnitude, because each 10x cost reduction unlocked categories of use that didn't exist at the prior price point.&lt;/p&gt;
&lt;p&gt;The thought experiment is straightforward: apply that pattern to intelligence itself.&lt;/p&gt;
&lt;h3&gt;Today's Price Points Create Today's Use Cases&lt;/h3&gt;
&lt;p&gt;At current pricing (roughly \$3 per million input tokens for a frontier model like Claude Sonnet), AI is economically viable for a specific class of applications. Customer support automation. Code assistance. Document summarization. Marketing copy. Translation. These are the use cases where the value generated per token comfortably exceeds the cost per token, and where the interaction pattern involves relatively short exchanges.&lt;/p&gt;
&lt;p&gt;But there are vast categories of potential use where current pricing makes the math uncomfortable or impossible. Consider:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Continuous monitoring and analysis.&lt;/strong&gt; A financial analyst who wants an AI to continuously watch SEC filings, earnings calls, patent applications, and news feeds across 500 companies (analyzing each document in full, cross-referencing against historical patterns, and generating alerts) would consume billions of tokens per month. At current prices, this costs tens of thousands of dollars monthly. At 100x cheaper, it costs the price of a SaaS subscription.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Full-codebase reasoning.&lt;/strong&gt; This one is already arriving. Anthropic's Claude Opus 4.6, working through Claude Code, can operate at the repository level, reading files, understanding architecture, running tests, and making changes across an entire codebase in a single session. I've used it to build a &lt;a href="https://tinycomputers.io/posts/open-sourcing-a-high-performance-rust-based-ballistics-engine.html"&gt;high-performance Rust-based ballistics engine&lt;/a&gt; and to develop &lt;a href="https://tinycomputers.io/posts/introducing-lattice-a-crystallization-based-programming-language.html"&gt;Lattice, an entire programming language&lt;/a&gt; with a &lt;a href="https://tinycomputers.io/posts/from-tree-walker-to-bytecode-vm-compiling-lattice.html"&gt;bytecode VM compiler&lt;/a&gt;, projects where the AI wasn't autocompleting fragments but reasoning across thousands of lines of interconnected code, tracking type systems, managing compiler passes, and understanding how changes in one module ripple through the rest. The constraint today isn't capability; it's cost. These sessions consume large volumes of tokens, which means they're viable for serious engineering work but not yet cheap enough to run continuously on every commit, every pull request, every deployment. At 100x cheaper, that changes. At 1,000x cheaper, every codebase has an always-on collaborator that has read everything and forgets nothing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Personalized education at scale.&lt;/strong&gt; A truly personalized AI tutor that adapts to a student's learning style, tracks their understanding across subjects, reviews their homework in detail, explains mistakes with patience, and adjusts its teaching strategy over months requires sustained, high-volume token consumption per student. Multiply by millions of students and the current cost structure breaks. At 100x cheaper, it's viable for a school district. At 1,000x cheaper, it's viable for an individual family.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Preventive medicine.&lt;/strong&gt; Analyzing a patient's complete medical history, genetic data, lifestyle information, lab results, and the current research literature to generate genuinely personalized health recommendations (not the generic advice a five-minute doctor's visit produces, but the kind of comprehensive analysis that currently only concierge medicine patients paying \$10,000+ per year receive). At current token prices, this is prohibitively expensive for routine use. At 100x cheaper, it could be embedded in every annual checkup.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Ambient intelligence.&lt;/strong&gt; The concept of AI that runs continuously in the background of your life (understanding your calendar, email, documents, and goals, proactively surfacing relevant information, drafting responses, scheduling meetings, flagging conflicts) requires sustained inference at volumes that would cost hundreds of dollars per day at current prices. At 1,000x cheaper, it costs less than your phone bill.&lt;/p&gt;
&lt;p&gt;These aren't science fiction scenarios. They're applications of current model capabilities at price points that don't yet exist. The models can already do most of this work. The cost curve is the bottleneck.&lt;/p&gt;
&lt;h3&gt;The 10x / 100x / 1,000x Framework&lt;/h3&gt;
&lt;p&gt;Moore's Law didn't deliver its benefits in a smooth, continuous flow. It came in thresholds, price points at which qualitatively new applications became viable. The pattern with AI compute is likely to follow the same staircase function.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At 10x cheaper&lt;/strong&gt; (plausible within 1-2 years): AI becomes viable for tasks that are currently "almost worth it." Small businesses that can't justify \$500/month for AI tooling find it worthwhile at \$50/month. Individual professionals (accountants, lawyers, doctors, engineers) integrate AI into their daily workflow not as an occasional tool but as a constant companion. The volume of AI-mediated work increases dramatically, but the character of work doesn't fundamentally change. This is the equivalent of the minicomputer era: the same kind of computing, available to more people.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At 100x cheaper&lt;/strong&gt; (plausible within 3-5 years): The applications listed above become economically viable. Continuous analysis, full-codebase reasoning, personalized education, preventive medicine at scale. At this price point, AI stops being a tool you use and starts being infrastructure you run on. Every document you write gets reviewed. Every decision you make gets a second opinion. Every student gets a tutor. Every patient gets a diagnostician. The total volume of inference consumed per capita increases by far more than 100x, because new use cases emerge that weren't contemplated at the prior price. This is the personal computer moment: qualitatively new categories of use.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;At 1,000x cheaper&lt;/strong&gt; (plausible within 5-8 years): Intelligence becomes ambient and disposable. You don't think about whether to use AI for a task any more than you think about whether to use electricity for a task. Every appliance, every vehicle, every building, every piece of infrastructure has embedded reasoning running continuously. Your home understands your patterns and adapts. Your car negotiates traffic in real time not just with sensors but with models that predict the behavior of every other vehicle. Agricultural equipment analyzes soil conditions at the individual plant level. Supply chains optimize in real time across thousands of variables. This is the smartphone moment: computing so cheap and pervasive that it becomes invisible.&lt;/p&gt;
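&lt;p&gt;To make the thresholds concrete, take the continuous-monitoring analyst from the previous section. The five-billion-token monthly volume is an assumed round number for illustration:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# What 10x / 100x / 1,000x cheaper means for one workload: continuously
# analyzing filings, earnings calls, and news across hundreds of companies.
# The monthly token volume is an assumed figure for illustration.

TOKENS_PER_MONTH = 5_000_000_000   # 5B tokens/month
PRICE_TODAY = 3.00                 # USD per million input tokens

for cheaper in (1, 10, 100, 1000):
    monthly_cost = TOKENS_PER_MONTH / 1_000_000 * PRICE_TODAY / cheaper
    print(f"{cheaper:5}x cheaper: ${monthly_cost:,.0f} per month")
&lt;/code&gt;&lt;/pre&gt;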
&lt;h3&gt;The Compounding Effect&lt;/h3&gt;
&lt;p&gt;There's a dynamic in AI cost reduction that didn't exist with traditional Moore's Law: cheaper inference enables better models, which enables even cheaper inference.&lt;/p&gt;
&lt;p&gt;When inference is expensive, researchers are constrained in how they can train and evaluate models. Each experiment costs real money. Each architecture search consumes significant compute budgets. When inference costs drop, researchers can run more experiments, evaluate more architectures, and discover more efficient approaches, which further reduces costs. Distillation (training a smaller model to mimic a larger one) becomes more practical when the larger model is cheaper to run at scale. Synthetic data generation (using AI to create training data for other AI) becomes more economical. The cost reduction compounds on itself.&lt;/p&gt;
&lt;p&gt;This is already happening. GPT-4 was used to generate synthetic training data for GPT-4o. Claude's training pipeline uses prior Claude models to evaluate and filter training examples. Google's Gemini models help design the next generation of TPU chips that will run future Gemini models. The AI equivalent of "using computers to design better computers" arrived in year three of the current wave, decades earlier in the relative timeline than it took the semiconductor industry to reach the same recursive dynamic.&lt;/p&gt;
&lt;p&gt;The implication is that the cost curve isn't just declining; it's declining at an accelerating rate because each improvement enables the next one. The semiconductor industry saw this acceleration plateau after about fifty years as it approached physical limits of silicon. AI has no equivalent physical constraint on the horizon. The limits are architectural and algorithmic, and those limits have been falling faster than hardware limits ever did.&lt;/p&gt;
&lt;h3&gt;What the Semiconductor Analogy Actually Predicts&lt;/h3&gt;
&lt;p&gt;&lt;img src="https://tinycomputers.io/images/moores-law-intelligence/cray-1.jpg" alt="A Cray-1 supercomputer on display, showing its distinctive cylindrical tower design with bench seating and exposed cooling plumbing" style="float: right; max-width: 45%; margin: 0 0 1em 1.5em; border-radius: 4px;"&gt;&lt;/p&gt;
&lt;p&gt;In 1975, a Cray-1 supercomputer delivered about 160 MFLOPS and cost \$8 million. In 2025, an iPhone delivers roughly 2 TFLOPS of neural engine performance and costs \$800. That's a 12,500x performance increase at a 10,000x cost decrease, a net improvement of roughly 100 million times in price-performance over fifty years.&lt;/p&gt;
&lt;p&gt;Nobody in 1975 predicted Instagram, Uber, Google Maps, or Spotify. Not because these applications required fundamentally new physics; they just required compute that was cheap enough to run in a device that fit in your pocket. The applications were latent, waiting for the cost curve to reach them.&lt;/p&gt;
&lt;p&gt;The history is instructive at each threshold. When a capable computer crossed below \$20,000 in the early 1980s, it unlocked small business accounting, the same work mainframes did, just for smaller organizations. When it crossed below \$2,000 in the mid-1990s, it unlocked home computing, and with it the web browser, email, and e-commerce. When capable compute crossed below \$200 in the smartphone era, it unlocked ride-sharing, mobile payments, and social media, none of which had any conceptual precursor at the \$20,000 price point. Each 10x reduction didn't just expand the existing market. It created a market that was literally unimaginable at the prior price.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tinycomputers.io/images/moores-law-intelligence/ibm-system-360.jpg" alt="An IBM System/360 Model 30 mainframe computer with its distinctive red cabinet and operator control panel" style="float: right; max-width: 45%; margin: 0 0 1em 1.5em; border-radius: 4px;"&gt;&lt;/p&gt;
&lt;p&gt;The same principle applies to intelligence. We are in the mainframe era of AI. The applications we see today (chatbots, code assistants, image generators) are the equivalent of payroll processing and scientific computation on 1960s mainframes. They are real and valuable, but they represent a tiny fraction of what becomes possible when the cost drops by five or six orders of magnitude.&lt;/p&gt;
&lt;p&gt;What are the Instagram and Uber equivalents of cheap intelligence? By definition, we can't fully predict them. But we can identify the structural conditions that will enable them:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When intelligence costs less than attention, delegation becomes default.&lt;/strong&gt; Today, the cognitive cost of formulating a good prompt, evaluating the output, and iterating often exceeds the cost of just doing the task yourself. As models get cheaper, faster, and better at understanding context, the threshold shifts. Eventually, not delegating a cognitive task to AI becomes the irrational choice, the way not using a calculator for arithmetic became irrational.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When intelligence costs less than data storage, everything gets analyzed.&lt;/strong&gt; Today, most data that organizations collect is never analyzed. It's stored, archived, and forgotten, because the cost of human analysis exceeds the expected value of the insights. When AI analysis is effectively free, every dataset gets examined. Every log file gets reviewed. Every customer interaction gets analyzed for patterns. The volume of insight generated from existing data increases by orders of magnitude.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;When intelligence costs less than communication overhead, organizations restructure.&lt;/strong&gt; This is already starting. A significant fraction of white-collar work is coordination: meetings, emails, status updates, project management. These exist because humans need to synchronize their mental models of shared projects. AI tools are already compressing this layer: meeting summaries that eliminate the need for half the attendees, project dashboards that maintain themselves, codebases where an AI agent tracks the state of every open issue so developers don't have to sit through standup. When AI can maintain a comprehensive, always-current model of a project's state, much of the coordination overhead that justifies entire job categories (project managers, program managers, business analysts, internal consultants) begins to evaporate. An organization that currently needs 50 people to coordinate a complex project might need 10, with AI handling the information synthesis that previously required human intermediaries. That's a genuine productivity gain. It's also 40 people who need to find something else to do, and the honest answer is that we don't yet know how fast the demand side creates new roles to absorb them.&lt;/p&gt;
&lt;h3&gt;The Demand Expansion Is the Story&lt;/h3&gt;
&lt;p&gt;The instinct when hearing "AI gets 1,000x cheaper" is to think about cost savings. That's the substitution frame: doing the same things for less money. And yes, that will happen. But the semiconductor analogy tells us that cost savings are the boring part of the story.&lt;/p&gt;
&lt;p&gt;When compute got 1,000x cheaper between 1980 and 2000, the interesting story wasn't that scientific simulations got cheaper to run. It was that entirely new industries (PC software, internet services, mobile apps, social media, cloud computing) emerged to consume orders of magnitude more compute than the entire prior industry had used. The efficiency gain on existing applications was dwarfed by the demand expansion from new applications.&lt;/p&gt;
&lt;p&gt;The same will likely be true for intelligence. Consider bandwidth as a parallel case. In 1995, a 28.8 kbps modem made email and basic web pages viable. Nobody was streaming video; it was physically impossible at that bandwidth, not merely expensive. By 2005, broadband had made streaming music viable. By 2015, streaming 4K video was routine. By 2025, cloud gaming and real-time video conferencing were infrastructure-level assumptions. Total bandwidth consumption didn't decline as it got cheaper. It increased by roughly a million times, because each generation of cost reduction enabled applications that consumed orders of magnitude more bandwidth than the previous generation's entire output.&lt;/p&gt;
&lt;p&gt;The interesting story isn't that customer support gets cheaper. It's the applications that are currently impossible (not difficult, not expensive, but literally impossible at current price points) that become not just possible but routine.&lt;/p&gt;
&lt;p&gt;A world where every small business has a CFO-grade financial analyst. Where every patient has a diagnostician who has read every relevant paper published in the last decade. Where every student has a tutor who knows exactly where they're struggling and why. Where every local government has the analytical capacity currently reserved for federal agencies.&lt;/p&gt;
&lt;p&gt;And the nature of building software itself is changing in ways that go beyond "engineers with better tools." For most of computing history, writing code meant a human translating intent into syntax, line by line, function by function. AI assistance started as autocomplete: suggesting the next line, filling in boilerplate. But that phase is already ending. Today, with tools like Claude Code, the workflow has inverted. The human describes what they want (an architecture, a feature, a behavior) and the AI writes the implementation across files, runs the tests, and iterates on failures. The engineer's role shifts from writing code to directing and reviewing it, from syntax to judgment. At 10x cheaper, this is how professional developers work. At 100x cheaper, it's how small teams build products that previously required departments. At 1,000x cheaper, the barrier between "person with an idea" and "working software" functionally disappears. The entire concept of what it means to be a software engineer is being rewritten in real time, not by replacing engineers, but by redefining the skill from "can you write this code?" to "do you know what to build and why?"&lt;/p&gt;
&lt;p&gt;These aren't efficiency improvements on existing systems. They're new capabilities that create new categories of economic activity, new forms of organization, and new kinds of products and services that don't have current analogs, just as social media, ride-sharing, and cloud computing had no analogs in the mainframe era.&lt;/p&gt;
&lt;h3&gt;The Question That Matters&lt;/h3&gt;
&lt;p&gt;I should be honest about what I don't know. The displacement scenarios for white-collar labor are not fantasy. AI is already capable enough to handle work that was solidly middle-class professional territory two years ago: document review, financial analysis, code generation, customer support, content production. The scenarios where this accelerates faster than the economy can absorb are plausible, and anyone who dismisses them outright isn't paying attention. When a technology can replicate cognitive labor at a fraction of the cost, the transitional pain for the people whose livelihoods depend on that labor is real and potentially severe. The speed matters: prior technology transitions unfolded over decades, and AI compression of that timeline into years is a genuine uncertainty that historical analogy doesn't fully resolve.&lt;/p&gt;
&lt;p&gt;But there is a question that displacement scenarios consistently underweight, and it's the one I explored in my &lt;a href="https://tinycomputers.io/posts/the-jevons-counter-thesis-why-ai-displacement-scenarios-underweight-demand-expansion.html"&gt;Jevons counter-thesis&lt;/a&gt;: what happens on the demand side? Every model that projects mass unemployment from cheap AI is implicitly assuming that the economy remains roughly the same size, with machines doing the work humans used to do. That's the substitution frame. And the substitution frame has been wrong at every prior technological inflection point, not slightly wrong, but wrong by orders of magnitude.&lt;/p&gt;
&lt;p&gt;The semiconductor industry's answer, delivered over sixty years of data, is unambiguous. Every order-of-magnitude cost reduction generated more economic activity, more employment, and more total compute consumption than the one before it. The economy didn't shrink as compute got cheaper. It restructured around cheap compute and grew. Roughly 80% of Americans who need legal help can't afford it today. Personalized tutoring is a luxury good. Custom software is out of reach for most small businesses. These aren't speculative markets; they're documented unmet demand suppressed by the cost of human intelligence. When that cost collapses, the demand doesn't stay static.&lt;/p&gt;
&lt;p&gt;The honest answer is that both things will happen simultaneously. Jobs will be displaced, some permanently. And new categories of economic activity will emerge that are currently inconceivable, just as social media and cloud computing were inconceivable in the mainframe era. The question is which force dominates, and how fast the transition occurs. I think the historical pattern favors demand expansion, but I hold that view with the humility of someone who knows the speed of this particular transition is unprecedented.&lt;/p&gt;
&lt;p&gt;AI inference costs are following the same curve as semiconductors, possibly faster. The tokens-per-dollar ratio will improve by orders of magnitude. And when it does, the applications that emerge will make today's AI use cases look as quaint as running payroll on a room-sized mainframe.&lt;/p&gt;
&lt;p&gt;The thought experiment ends where all Jevons stories end: with more consumption, not less. More intelligence deployed, not less. More economic activity built on cheap cognition, not less. The cost curve is the enabling condition. What gets built on top of it is the part we can't fully predict, and historically, that's always been the most interesting part.&lt;/p&gt;</description><category>ai</category><category>compute costs</category><category>demand expansion</category><category>economics</category><category>inference</category><category>jevons paradox</category><category>moore's law</category><category>semiconductors</category><category>technology</category><category>tokens</category><guid>https://tinycomputers.io/posts/moores-law-for-intelligence-what-happens-when-thinking-gets-cheap.html</guid><pubDate>Sat, 28 Feb 2026 14:00:00 GMT</pubDate></item></channel></rss>