<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TinyComputers.io (Posts about tesla p40)</title><link>https://tinycomputers.io/</link><description></description><atom:link href="https://tinycomputers.io/categories/tesla-p40.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 A.C. Jokela 
&lt;!-- div style="width: 100%" --&gt;
&lt;a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"&gt;&lt;img alt="" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /&gt; Creative Commons Attribution-ShareAlike&lt;/a&gt;&amp;nbsp;|&amp;nbsp;
&lt;!-- /div --&gt;
</copyright><lastBuildDate>Mon, 06 Apr 2026 22:12:50 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Running a 22B Video Model on Four Tesla P40s</title><link>https://tinycomputers.io/posts/running-ltx-video-on-four-tesla-p40s.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/running-ltx-video-on-four-tesla-p40s_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;22 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;LTX-Video 2.3 is a 22 billion parameter model that generates video from text prompts. It was designed for modern hardware: GPUs with bfloat16 support, high-bandwidth memory, and enough VRAM to hold the full model on one or two cards. The &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;Tesla P40&lt;/a&gt; has none of these things. It is a Pascal-generation GPU from 2016, with 24GB of GDDR5X per card, no native bfloat16, no Tensor Cores, and a PCIe 3.0 bus. It was built for data center inference workloads that no longer exist.&lt;/p&gt;
&lt;p&gt;I have four of them in a rack-mount server in an unheated shop building in Minnesota. Together they provide 96GB of VRAM. The question was whether that 96GB, spread across four old cards, could run a model that was never meant to run on any of them.&lt;/p&gt;
&lt;p&gt;The answer is yes, with significant caveats and a substantial amount of code to work around hardware limitations that the model's authors never anticipated.&lt;/p&gt;
&lt;h3&gt;The Problem&lt;/h3&gt;
&lt;p&gt;LTX-Video 2.3's transformer has 48 blocks. At fp16 precision, the model weights alone consume roughly 44GB. With the Gemma text encoder, the video VAE encoder/decoder, the spatial upsampler, and the audio components, the full pipeline needs more memory than any single P40 can provide. The model doesn't fit on one card. It doesn't fit on two. It barely fits on three, with no room for activations during inference.&lt;/p&gt;
&lt;p&gt;Four cards at 24GB each give 96GB total, which is enough for the weights with room for intermediate activations. But CUDA doesn't automatically spread a model across multiple GPUs. You have to tell it how.&lt;/p&gt;
&lt;p&gt;The standard approach for multi-GPU inference is &lt;code&gt;accelerate&lt;/code&gt;'s &lt;code&gt;dispatch_model&lt;/code&gt;, which automatically distributes model layers across available GPUs based on memory constraints. This works for the Gemma text encoder, which is a straightforward transformer. For the LTX transformer, it doesn't work, because the model has a custom forward pass with audio-video cross-attention that &lt;code&gt;accelerate&lt;/code&gt;'s automatic dispatch can't handle correctly. The model needs to move data between GPUs at specific points in the forward pass, and &lt;code&gt;accelerate&lt;/code&gt; doesn't know where those points are.&lt;/p&gt;
&lt;p&gt;The solution was manual pipeline parallelism: split the 48 transformer blocks evenly across four GPUs (12 blocks per card), keep the shared components (patchify projections, normalization, output projections) on GPU 0, and write a custom forward pass that moves tensors between devices at block boundaries.&lt;/p&gt;
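&lt;p&gt;The split itself is simple index arithmetic. A minimal sketch of the block-to-device mapping (variable names here are illustrative, not the actual pipeline code):&lt;/p&gt;

```python
# Assign 48 transformer blocks in contiguous chunks of 12
# to four CUDA devices.
NUM_BLOCKS = 48
NUM_GPUS = 4
BLOCKS_PER_GPU = NUM_BLOCKS // NUM_GPUS  # 12

# block_devices[i] is the device that owns transformer block i.
block_devices = {i: f"cuda:{i // BLOCKS_PER_GPU}" for i in range(NUM_BLOCKS)}

print(block_devices[0], block_devices[11], block_devices[12], block_devices[47])
# cuda:0 cuda:0 cuda:1 cuda:3
```

&lt;p&gt;The custom forward pass then consults this mapping at every block boundary to decide when a tensor has to cross to the next card.&lt;/p&gt;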
&lt;h3&gt;The Precision Problem&lt;/h3&gt;
&lt;p&gt;Even with the model split across four cards, nothing worked on the first attempt. Or the fifth. Getting LTX-Video running on Pascal hardware was an iterative process, with Claude Code generating solutions and me testing them against the actual hardware. Each failure revealed another assumption the model made about the GPU it would run on. The feedback loop was brutal: load a 22B model across four GPUs, wait eight minutes for a test generation, get a black frame or a NaN error, diagnose which precision boundary caused it, generate a fix, and try again.&lt;/p&gt;
&lt;p&gt;The first problem was bfloat16. The model weights are stored in bf16 format. Pascal GPUs cannot compute in bf16. PyTorch handles this silently for some operations by promoting to fp32, but other operations fail or produce garbage. The initial approach was the obvious one: monkey-patch &lt;code&gt;torch.bfloat16&lt;/code&gt; to redirect to &lt;code&gt;torch.float16&lt;/code&gt;. This seemed to work at load time. The model loaded, the weights populated, no errors. Then the first forward pass produced NaN everywhere. The monkey-patch had corrupted the safetensors weight loading. The weights loaded as fp16 bit patterns interpreted as bf16 values, which is not the same thing. A bf16 value of 1.0 has a different bit pattern than an fp16 value of 1.0. Reinterpret one as the other and you get a number that's either wildly wrong or NaN.&lt;/p&gt;
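&lt;p&gt;The mismatch is easy to demonstrate. Below is a simplified pure-Python decoder for both formats (it ignores the special inf/NaN encodings); the bit pattern 0x3F80 means 1.0 under bfloat16 rules but 1.875 under float16 rules:&lt;/p&gt;

```python
def fp16_from_bits(bits):
    # IEEE float16: 1 sign bit, 5 exponent bits, 10 mantissa bits.
    sign = -1.0 if bits // 32768 else 1.0
    exp = (bits // 1024) % 32
    frac = bits % 1024
    if exp == 0:  # subnormal
        return sign * (frac / 1024.0) * 2.0 ** -14
    return sign * (1.0 + frac / 1024.0) * 2.0 ** (exp - 15)

def bf16_from_bits(bits):
    # bfloat16: 1 sign bit, 8 exponent bits, 7 mantissa bits.
    sign = -1.0 if bits // 32768 else 1.0
    exp = (bits // 128) % 256
    frac = bits % 128
    if exp == 0:  # subnormal
        return sign * (frac / 128.0) * 2.0 ** -126
    return sign * (1.0 + frac / 128.0) * 2.0 ** (exp - 127)

# 0x3F80 encodes 1.0 in bfloat16 -- misread as float16 it is 1.875.
print(bf16_from_bits(0x3F80), fp16_from_bits(0x3F80))  # 1.0 1.875
```

&lt;p&gt;Every weight in the checkpoint gets skewed the same way, which is why the monkey-patched load produced a model that was superficially intact but numerically ruined.&lt;/p&gt;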
&lt;p&gt;The second attempt tried running everything in fp16 natively, converting weights properly during load. This got further: the model produced output that wasn't NaN. But the output was a solid green frame. The intermediate activations in the transformer blocks were overflowing fp16 range. Values above 65,504 become infinity in fp16, and the model's internal representations regularly exceed that during the attention and feedforward passes. The green frame was the model's attempt to decode latents that had been clipped to infinity at some point in the pipeline.&lt;/p&gt;
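&lt;p&gt;The overflow itself takes one line to reproduce (a numpy illustration, not the pipeline code):&lt;/p&gt;

```python
import numpy as np

# fp16 tops out at 65,504; bf16 shares float32's exponent range.
a = np.float16(300.0)
print(a * a)                           # inf: 90,000 overflows fp16
print(np.float32(a) * np.float32(a))   # 90000.0: fine in float32
```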
&lt;p&gt;The working solution was to let the model builder properly convert weights from bf16 to fp16 on load, then run the entire computation pipeline in float32. The weights sit in memory as fp16 (saving space), but every computation promotes to fp32 before executing. This required patching &lt;code&gt;F.linear&lt;/code&gt; to handle mixed dtype inputs:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;_orig_linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_mixed_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_orig_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mixed_linear&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The same pattern extends to every normalization function and every convolution operation. Layer norm, group norm, RMS norm, conv1d through conv_transpose3d: all patched to handle mixed dtypes and accumulate in float32. Without these patches, intermediate values overflow fp16 range and the output is a black frame.&lt;/p&gt;
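&lt;p&gt;The pattern generalizes to a single wrapper. Here is a numpy stand-in for the idea (illustrative only; the actual patches target the torch functionals listed above):&lt;/p&gt;

```python
import numpy as np

def with_fp32_inputs(op):
    # Promote any fp16 array argument to float32 before calling op,
    # mirroring the mixed-dtype patches described above.
    def promote(v):
        if isinstance(v, np.ndarray) and v.dtype == np.float16:
            return v.astype(np.float32)
        return v
    def wrapped(*args, **kwargs):
        return op(*[promote(a) for a in args],
                  **{k: promote(v) for k, v in kwargs.items()})
    return wrapped

# Toy stand-in for a norm's sum-of-squares reduction.
def sumsq(x):
    return (x * x).sum()

x = np.full(8, 300.0, dtype=np.float16)
print(sumsq(x))                    # inf: per-element squares overflow fp16
print(with_fp32_inputs(sumsq)(x))  # 720000.0
```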
&lt;h3&gt;The Gemma Problem&lt;/h3&gt;
&lt;p&gt;The text encoder is Google's Gemma 3, a separate model that converts text prompts into embeddings the video transformer can condition on. Gemma's attention mechanism overflows when run in fp16 on Pascal hardware. The attention scores grow large enough to exceed fp16 range, producing NaN values that propagate through the rest of the pipeline.&lt;/p&gt;
&lt;p&gt;The fix was running the entire Gemma encoder in float32. This uses more memory, but the text encoder only runs once per generation (to encode the prompt), and its weights can be freed from GPU memory before the transformer starts. The sequence is: load Gemma across all four GPUs using &lt;code&gt;accelerate&lt;/code&gt;, encode the prompt in float32, delete the encoder, free the memory, then load the video transformer.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;encode_prompt_float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_ledger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model_ledger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;
    &lt;span class="n"&gt;te&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_ledger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_encoder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Dispatch across all 4 GPUs for memory&lt;/span&gt;
    &lt;span class="n"&gt;max_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_balanced_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"22GiB"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
        &lt;span class="n"&gt;no_split_module_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Gemma3DecoderLayer"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;te&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dispatch_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hidden_states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attention_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Free GPU memory before transformer loads&lt;/span&gt;
    &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;te&lt;/span&gt;
    &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This load-encode-delete cycle is ugly but necessary. There isn't enough total memory to hold both Gemma and the video transformer simultaneously, even across four cards. The sequential approach works because each component only needs to exist during its phase of the pipeline.&lt;/p&gt;
&lt;h3&gt;The Pipeline&lt;/h3&gt;
&lt;p&gt;The generation runs in two stages, matching LTX-Video's distilled inference schedule.&lt;/p&gt;
&lt;p&gt;Stage 1 generates a half-resolution latent video (e.g., 256x384) through 8 denoising steps. Each step runs the full 48-block transformer, with data moving across all four GPUs:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;patched_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;perturbations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ltx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transformer_blocks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;dev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block_devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;perturbations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;perturbations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Every GPU boundary involves a tensor transfer across PCIe 3.0. With 12 blocks per GPU, there are 3 boundary crossings per denoising step (GPU 0 to 1, 1 to 2, 2 to 3), plus a final transfer back to GPU 0. With 8 denoising steps, that's 32 cross-device transfers per stage, each moving both video and audio state tensors. PCIe 3.0 x16 has a theoretical bandwidth of ~16 GB/s. The tensors being transferred are small relative to the bandwidth (attention states and activations, not full weight matrices), so the overhead is manageable. But it adds up.&lt;/p&gt;
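&lt;p&gt;The transfer count is straightforward to tally:&lt;/p&gt;

```python
# Cross-device transfers per denoising stage.
gpus = 4
steps = 8

# GPU 0 to 1, 1 to 2, 2 to 3, then a final hop back to GPU 0.
crossings_per_step = (gpus - 1) + 1

transfers_per_stage = crossings_per_step * steps
print(transfers_per_stage)  # 32
```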
&lt;p&gt;Stage 1 takes roughly 4 minutes for 241 frames at 24 fps (a 10-second clip). The spatial upsampler then doubles the resolution. Stage 2 runs 3 more denoising steps at full resolution (512x768), taking roughly 6.5 minutes. The VAE decoder converts latents to pixels and generates the audio track in another 40 seconds.&lt;/p&gt;
&lt;p&gt;Total generation time for a 10-second, 512x768 video with audio: approximately 18.5 minutes. For a 1-second clip (25 frames): about 8 minutes. For a 4-second clip (97 frames): about 10.5 minutes.&lt;/p&gt;
&lt;h3&gt;The Memory Layout&lt;/h3&gt;
&lt;p&gt;During inference, the four GPUs aren't loaded equally. GPU 0 carries extra weight because it hosts all the shared components (patchify projections, normalization layers, output projections) plus its 12 transformer blocks. The actual memory distribution:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;VRAM Used&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;10.8 GB&lt;/td&gt;
&lt;td&gt;Shared components + blocks 0-11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;9.3 GB&lt;/td&gt;
&lt;td&gt;Blocks 12-23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;9.3 GB&lt;/td&gt;
&lt;td&gt;Blocks 24-35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;9.3 GB&lt;/td&gt;
&lt;td&gt;Blocks 36-47&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That's 38.7 GB of the available 96 GB. The remaining 57 GB provides headroom for activations, KV cache growth, and the VAE decoder. There's enough margin that generation never OOMs, even at 241 frames.&lt;/p&gt;
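&lt;p&gt;The headroom figure follows directly from the table:&lt;/p&gt;

```python
# Resident weights per GPU, from the table above (GB).
used_gb = [10.8, 9.3, 9.3, 9.3]
total_gb = 4 * 24  # four P40s at 24GB each

resident = sum(used_gb)
headroom = total_gb - resident
print(round(resident, 1), round(headroom, 1))  # 38.7 57.3
```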
&lt;h3&gt;The API&lt;/h3&gt;
&lt;p&gt;Running inference from the command line is fine for testing, but generating videos for blog content requires something more practical. I wrapped the generation script in a FastAPI server with an async job queue:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# Submit a text-to-video job&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;-X&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prompt=A cinematic flyover of a Zilog Z80 processor on a PCB"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duration=10"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"seed=42"&lt;/span&gt;

&lt;span class="c1"&gt;# Submit an image-to-video job&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;-X&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prompt=A fluffy orange cat dancing"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duration=4"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"image=@cat.jpg"&lt;/span&gt;

&lt;span class="c1"&gt;# Check status&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs/07420abb6d82

&lt;span class="c1"&gt;# Download result&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs/07420abb6d82/video&lt;span class="w"&gt; &lt;/span&gt;-o&lt;span class="w"&gt; &lt;/span&gt;output.mp4
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Jobs queue and execute sequentially. The GPU can only handle one generation at a time, and the load-encode-delete cycle for Gemma means there's significant setup overhead per job. The API spawns each job as a subprocess, which gives clean GPU memory cleanup between runs. If a generation crashes (which happened frequently during development), the next job starts fresh.&lt;/p&gt;
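&lt;p&gt;The subprocess-per-job pattern can be sketched in a few lines of standard-library Python (hypothetical names; the real server adds FastAPI routing on top):&lt;/p&gt;

```python
import queue
import subprocess
import sys
import threading
import uuid

jobs = queue.Queue()
status = {}

def worker():
    # Drain the queue one job at a time: the GPUs can only run
    # a single generation anyway.
    while True:
        job_id, cmd = jobs.get()
        status[job_id] = "running"
        # A fresh process per generation means GPU memory is fully
        # released afterward, even if the job crashes.
        result = subprocess.run(cmd)
        status[job_id] = "done" if result.returncode == 0 else "failed"
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = uuid.uuid4().hex[:12]
# Stand-in for the actual generation command line.
jobs.put((job_id, [sys.executable, "-c", "pass"]))
jobs.join()
print(job_id, status[job_id])
```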
&lt;p&gt;The server supports both text-to-video and image-to-video. Image conditioning locks the first frame to a provided image and generates subsequent frames from it, which produces more controllable results for specific visual subjects. In practice, image-to-video is the more useful mode. Text-to-video gives the model complete creative freedom, which means the output is unpredictable. You might ask for a Z80 processor and get something that looks like a generic IC, or something that looks like a Z80, depending on the seed. Image-to-video lets you provide the exact first frame you want and the model animates from there. For blog content where visual accuracy matters, starting from a real photograph or a specific reference image gives consistently better results.&lt;/p&gt;
&lt;h3&gt;What the Output Looks Like&lt;/h3&gt;
&lt;p&gt;The video quality is genuinely good. LTX-Video 2.3 produces coherent motion, reasonable physics, and detailed textures. Here are three examples, generated entirely on the P40 server:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text-to-video: "A cinematic flyover of a Zilog Z80 processor on a printed circuit board" (10 seconds, 18.5 minutes to generate)&lt;/strong&gt;&lt;/p&gt;
&lt;video controls preload="metadata" style="max-width: 100%; border-radius: 6px; box-shadow: 0 10px 20px rgba(0,0,0,.1); margin: 1em 0;"&gt;
&lt;source src="https://tinycomputers.io/ltx-z80-flyover.mp4" type="video/mp4"&gt;
&lt;/source&gt;&lt;/video&gt;

&lt;p&gt;&lt;strong&gt;Image-to-video: "A fluffy orange cat with a hat dancing" (4 seconds, 10.5 minutes to generate)&lt;/strong&gt;&lt;/p&gt;
&lt;video controls preload="metadata" style="max-width: 100%; border-radius: 6px; box-shadow: 0 10px 20px rgba(0,0,0,.1); margin: 1em 0;"&gt;
&lt;source src="https://tinycomputers.io/ltx-cat-dancing.mp4" type="video/mp4"&gt;
&lt;/source&gt;&lt;/video&gt;

&lt;p&gt;&lt;strong&gt;Text-to-video: "A cat sitting on a windowsill, sunlight streaming in" (1 second, 8 minutes to generate)&lt;/strong&gt;&lt;/p&gt;
&lt;video controls preload="metadata" style="max-width: 100%; border-radius: 6px; box-shadow: 0 10px 20px rgba(0,0,0,.1); margin: 1em 0;"&gt;
&lt;source src="https://tinycomputers.io/ltx-cat-windowsill.mp4" type="video/mp4"&gt;
&lt;/source&gt;&lt;/video&gt;

&lt;p&gt;The model understands object permanence, lighting consistency, and basic spatial relationships. The Z80 flyover produces a recognizable IC package with surrounding components, proper lighting, and smooth camera movement.&lt;/p&gt;
&lt;p&gt;The audio is a different story. LTX-Video 2.3 generates an audio track alongside the video, but the results are inconsistent. Prompts describing characters speaking produce odd ambient music instead of voices. Prompts describing environments produce vaguely appropriate soundscapes. The audio pipeline works mechanically (it generates real audio waveforms via a separate VAE decoder and vocoder), but the semantic connection between prompt and audio output is weak. For blog content, I'd likely strip the generated audio and add narration or music separately.&lt;/p&gt;
&lt;p&gt;The 512x768 resolution at 24fps is usable for web content. It's not 4K. It's not going to replace stock footage for production video. But for blog hero images in motion, visual demonstrations, or supplementary content alongside text, it works.&lt;/p&gt;
&lt;h3&gt;What This Cost&lt;/h3&gt;
&lt;p&gt;The hardware cost is zero incremental. The four P40s and the server already existed for &lt;a href="https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html"&gt;LLM inference&lt;/a&gt;. LTX-Video is an additional workload on the same hardware.&lt;/p&gt;
&lt;p&gt;The electricity cost is modest. The server draws roughly 500W under full GPU load. An 18.5-minute generation (10-second video at full resolution) consumes about 0.15 kWh, roughly $0.024 at Minnesota residential rates. You could generate forty 10-second clips for a dollar.&lt;/p&gt;
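&lt;p&gt;The math, with the electricity rate as an assumption (roughly $0.155/kWh):&lt;/p&gt;

```python
watts = 500       # server draw under full GPU load
minutes = 18.5    # one 10-second full-resolution generation
rate = 0.155      # dollars per kWh (assumed residential rate)

kwh = watts / 1000 * minutes / 60
cost = kwh * rate
print(round(kwh, 2), round(cost, 3))  # 0.15 0.024
```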
&lt;p&gt;The real cost was development time. Getting from "model downloaded" to "working generation pipeline" took many iterations across multiple sessions with Claude Code. Each precision-related failure mode (bf16 corruption, fp16 overflow, mixed-dtype kernel errors, NaN propagation through attention) required diagnosis, a hypothesis, a code change, and a test cycle that involved loading a 22B model across four GPUs. The feedback loop was slow. A single test takes 8 to 18 minutes to confirm whether a change worked. Many didn't.&lt;/p&gt;
&lt;h3&gt;The Broader Point&lt;/h3&gt;
&lt;p&gt;A 22 billion parameter video generation model was not designed to run on 2016 hardware. The authors assumed bf16, assumed modern attention kernels, assumed enough memory on one or two cards. None of those assumptions hold on the P40.&lt;/p&gt;
&lt;p&gt;But the model runs anyway, because the underlying math doesn't actually require any of those features. Bfloat16 is a convenience, not a requirement; float32 computes the same function. Flash attention is an optimization, not a necessity; standard attention produces identical results. And 96GB across four cards is 96GB, regardless of whether it's cutting-edge HBM3 or decade-old GDDR5X.&lt;/p&gt;
&lt;p&gt;The generation is slow. Eighteen minutes for ten seconds of video is not competitive with a single A100, which would finish the same job in under two minutes. The float32 computation pipeline roughly halves arithmetic throughput compared to the bf16 path the model was designed for, and the PCIe 3.0 transfers between four separate memory pools add latency that a single modern GPU with unified HBM would never incur. But competitive wasn't the point. The point was that four GPUs I bought on eBay for a thousand dollars total, sitting in a server in a shop building, can run a model that was released this month. The gap between "latest model" and "latest hardware" is not as wide as the spec sheets suggest, as long as you're willing to write the code that bridges it.&lt;/p&gt;
&lt;p&gt;The P40 server was already paying for itself on &lt;a href="https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html"&gt;LLM inference&lt;/a&gt; and &lt;a href="https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html"&gt;TTS generation&lt;/a&gt;. Video generation is one more workload on a machine that I own, running models that I choose, on a schedule that I control. The 18-minute wait is the price of not asking anyone's permission.&lt;/p&gt;</description><category>ai</category><category>cuda</category><category>gpu</category><category>home lab</category><category>inference</category><category>ltx video</category><category>multi-gpu</category><category>pascal</category><category>pipeline parallelism</category><category>tesla p40</category><category>video generation</category><guid>https://tinycomputers.io/posts/running-ltx-video-on-four-tesla-p40s.html</guid><pubDate>Fri, 20 Mar 2026 13:00:00 GMT</pubDate></item><item><title>The Economics of Owning Your Own Inference</title><link>https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/the-economics-of-owning-your-own-inference_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;21 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;I own $5,500 worth of GPU hardware dedicated to running AI models locally. I also pay for a Claude Max subscription that I use for nearly everything that matters. If that sounds like a contradiction, it is the entire subject of this article.&lt;/p&gt;
&lt;p&gt;The local inference conversation online is dominated by two positions. The first: why pay for API calls when you can run models on your own hardware? The second: local models are worse, so just pay for the good ones. Both are correct. Both are incomplete. The interesting question is where the boundary falls between them, and the answer turns out to depend less on cost-per-token arithmetic than on what kind of work you are doing.&lt;/p&gt;
&lt;h3&gt;The Split&lt;/h3&gt;
&lt;p&gt;I use Claude for research, code review, writing feedback, technical analysis, and anything that used to be a Google search. The frontier models are better at all of these tasks than anything I can run locally. Not marginally better; categorically better. An 8B parameter model running on my hardware is not in the same conversation as Claude Opus or GPT-5.4 for anything requiring reasoning, nuance, or broad knowledge. The subscription cost is fixed regardless of volume, which eliminates per-query friction entirely. For interactive, quality-sensitive work, I pay for the best model available and I do not think about it.&lt;/p&gt;
&lt;p&gt;Local inference handles everything else: the batch jobs, the grunt work, the high-volume tasks where model quality matters less than model availability. The work that would be expensive at cloud API rates not because any single call costs much, but because the calls number in the tens of thousands.&lt;/p&gt;
&lt;p&gt;This is not a temporary arrangement while local models catch up. It is a structural split. Frontier models are getting better. Local models are also getting better. The gap is not closing in the ways that matter for my usage, because the tasks I send to each side are fundamentally different. I do not need my local 8B model to reason better. I need it to process text cheaply and without metering.&lt;/p&gt;
&lt;h3&gt;What the Local Hardware Actually Does&lt;/h3&gt;
&lt;p&gt;Three workloads. All batch. All quality-tolerant.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text-to-speech.&lt;/strong&gt; Every post on this site has an &lt;a href="https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html"&gt;AI-generated audio narration&lt;/a&gt;. This is the workload that justifies the hardware on its own. Google Cloud Platform has superior TTS voices; Chirp3-HD sounds noticeably more natural than any open-source model I have tested. I ran a novel through it once: 82,000 words, 500,000 characters, $17.25. That is reasonable for a one-off project.&lt;/p&gt;
&lt;p&gt;It is not reasonable for a library of blog posts that I revise and regenerate periodically. At GCP rates ($16 per million characters, more for premium voices), narrating every post on this site would cost $200 to $400, and that bill resets every time I edit an article and regenerate the audio. Open-source TTS (&lt;a href="https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html"&gt;F5-TTS and Qwen TTS&lt;/a&gt;) mispronounces technical terms. The prosody goes flat on dense jargon. But it is good enough for blog narration. "Good enough" at zero marginal cost beats "excellent" at $4 to $10 per post when you are generating audio daily.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Code scanning.&lt;/strong&gt; Running local models over source files for pattern detection, documentation extraction, and automated analysis. These jobs produce high token volume at low quality requirements. An 8B model is adequate. The token count across a full codebase makes API pricing add up in a way that individual queries do not.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Infrastructure work.&lt;/strong&gt; Benchmarking hardware (as in this article), testing prompt structures across quantization levels, evaluating model behavior under different configurations. These queries have no value individually. They are the test drives, not the commute. Paying per-token for test drives is paying per-mile to drive your own car around the block.&lt;/p&gt;
&lt;p&gt;None of these workloads require a frontier model. All of them generate enough volume to make metered pricing uncomfortable. That is the boundary.&lt;/p&gt;
&lt;h3&gt;The Machines&lt;/h3&gt;
&lt;p&gt;Two machines. Both mine. Both running &lt;a href="https://ollama.com"&gt;Ollama&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;four-GPU Tesla P40 server&lt;/a&gt;: Penguin Computing 2U chassis, Xeon E5-2697A v4, 252GB DDR4 ECC, four Tesla P40s with 24GB GDDR5X each. Ninety-six gigabytes of VRAM. Pascal architecture, 2016 vintage. Built from eBay parts for about $2,500. Lives in an unheated shop building in Minnesota.&lt;/p&gt;
&lt;p&gt;A Bosgame M5 mini desktop: AMD Ryzen AI MAX+ 395, Strix Halo APU with integrated RDNA 3.5 graphics. No discrete GPU. CPU and GPU share 128GB DDR5, roughly 60GB addressable as VRAM through ROCm 7.2. Cost about $3,000. Fits on a desk.&lt;/p&gt;
&lt;h3&gt;What They Cost to Run&lt;/h3&gt;
&lt;p&gt;I logged GPU power draw at 500-millisecond intervals during inference using &lt;code&gt;nvidia-smi&lt;/code&gt; on the P40 server and &lt;code&gt;rocm-smi&lt;/code&gt; on the Strix Halo. Same prompt, same models, same Ollama configuration. All models ran 100% on GPU.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;P40 tok/s&lt;/th&gt;
&lt;th&gt;P40 GPU Power&lt;/th&gt;
&lt;th&gt;Halo tok/s&lt;/th&gt;
&lt;th&gt;Halo GPU Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 3B&lt;/td&gt;
&lt;td&gt;91.2&lt;/td&gt;
&lt;td&gt;170W avg&lt;/td&gt;
&lt;td&gt;78.4&lt;/td&gt;
&lt;td&gt;64W avg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;47.5&lt;/td&gt;
&lt;td&gt;278W avg&lt;/td&gt;
&lt;td&gt;40.2&lt;/td&gt;
&lt;td&gt;82W avg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 70B (4K ctx)&lt;/td&gt;
&lt;td&gt;6.3&lt;/td&gt;
&lt;td&gt;278W avg&lt;/td&gt;
&lt;td&gt;5.6&lt;/td&gt;
&lt;td&gt;81W avg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The P40 is 12 to 18% faster in raw throughput. It draws 3-4x the power. The 3B model lives on a single P40; the other three cards idle at ~9W each but still cost electricity. The 8B and 70B models span two GPUs while two idle. You always pay for cards that are not working. The Strix Halo has one GPU. No idle penalty.&lt;/p&gt;
&lt;p&gt;GPU power is not total system power. The P40 server's Xeons, 252GB of RAM, dual PSUs, and fans add roughly 200W to the GPU figures. The Strix Halo's APU and DDR5 add roughly 40-60W. Conservative estimates for total system draw: 500W for the P40 under load, 120W for the Strix Halo.&lt;/p&gt;
&lt;p&gt;At Minnesota residential electricity rates ($0.157/kWh), the cost per million tokens:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;3B&lt;/th&gt;
&lt;th&gt;8B&lt;/th&gt;
&lt;th&gt;70B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P40 Server&lt;/td&gt;
&lt;td&gt;$0.19/M&lt;/td&gt;
&lt;td&gt;$0.46/M&lt;/td&gt;
&lt;td&gt;$3.47/M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strix Halo&lt;/td&gt;
&lt;td&gt;$0.06/M&lt;/td&gt;
&lt;td&gt;$0.13/M&lt;/td&gt;
&lt;td&gt;$0.94/M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
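&lt;p&gt;The arithmetic behind that table is a one-liner. This sketch (the function name is mine) reproduces the 8B figures from the stated total-system draw and throughput:&lt;/p&gt;

```python
# Electricity-only cost to generate one million tokens, given
# sustained throughput and total system power draw at load.
def cost_per_million_tokens(tok_per_s, system_watts, usd_per_kwh=0.157):
    seconds = 1_000_000 / tok_per_s            # wall-clock time for 1M tokens
    kwh = system_watts * seconds / 3_600_000   # watt-seconds to kWh
    return kwh * usd_per_kwh

print(round(cost_per_million_tokens(47.5, 500), 2))   # P40 server, 8B: 0.46
print(round(cost_per_million_tokens(40.2, 120), 2))   # Strix Halo, 8B: 0.13
```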
&lt;h3&gt;Why the Per-Token Number Is Misleading&lt;/h3&gt;
&lt;p&gt;Those numbers look competitive with hosted inference, which runs $0.05 to $0.20 per million tokens for 8B-class models through providers like Together AI or Groq. The Strix Halo at $0.13/M sits squarely in that range. The P40 at $0.46/M does not.&lt;/p&gt;
&lt;p&gt;But per-token cost during active inference is the wrong metric for two reasons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hardware amortization changes the math.&lt;/strong&gt; The P40 server cost $2,500. The Strix Halo cost $3,000. Amortized over two years of continuous ownership, that adds $0.14/hr and $0.17/hr respectively. On the 8B model, the all-in cost per million tokens rises to about $1.28 for the P40 and $1.31 for the Strix Halo. Both are more expensive than every hosted inference API for the same model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Idle power is the dominant cost.&lt;/strong&gt; The P40 server draws roughly 340W at idle: $38.50 per month whether I run a single query or not. The Strix Halo draws roughly 35W at idle: $4.20 per month. Over a year, idle electricity alone costs $462 on the P40 and $50 on the Strix Halo. If you are not using the hardware frequently, idle power overwhelms everything else in the cost model.&lt;/p&gt;
&lt;p&gt;Per-token math at load flatters local inference by ignoring the hours when the hardware is doing nothing. It is like calculating your car's fuel economy only during highway driving and ignoring that it sits in the driveway 22 hours a day with the engine running.&lt;/p&gt;
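&lt;p&gt;Both effects fold into simple formulas. A sketch (helper names are mine; amortization assumes the 24/7, two-year basis above, and the idle figure assumes a 720-hour month):&lt;/p&gt;

```python
# All-in cost per million tokens: electricity at load plus hardware
# price amortized over continuous 24/7 ownership.
def all_in_cost_per_m(tok_per_s, system_watts, hw_usd,
                      years=2, usd_per_kwh=0.157):
    hours_per_m = 1_000_000 / tok_per_s / 3600   # hours to emit 1M tokens
    electricity = system_watts / 1000 * hours_per_m * usd_per_kwh
    amortization = hw_usd / (years * 8760) * hours_per_m
    return electricity + amortization

# Idle electricity per month, assuming a 720-hour month.
def idle_cost_per_month(watts, usd_per_kwh=0.157):
    return watts * 720 / 1000 * usd_per_kwh
```

&lt;p&gt;&lt;code&gt;all_in_cost_per_m(47.5, 500, 2500)&lt;/code&gt; lands near the $1.28/M all-in figure for the P40 on the 8B model, and &lt;code&gt;idle_cost_per_month(340)&lt;/code&gt; reproduces the roughly $38.50 monthly idle bill.&lt;/p&gt;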
&lt;h3&gt;Why I Run Both Anyway&lt;/h3&gt;
&lt;p&gt;The per-token economics favor API providers. The per-workload economics favor local hardware for specific tasks. TTS is the starkest example.&lt;/p&gt;
&lt;p&gt;Generating a 20-minute blog narration on the Strix Halo takes about 45 minutes of inference at roughly 85W above idle power. The incremental electricity cost is about $0.02. The same narration through Google Cloud TTS would cost $4 to $10 depending on character count and voice tier.&lt;/p&gt;
&lt;p&gt;That is a 200-to-500x cost difference on the marginal unit. And the marginal unit is what matters, because the question is never "should I generate TTS at all?" It is "should I regenerate the audio for this post I just edited?" or "should I try a different voice on this article?" or "should I narrate this niche post about PCB trace routing that maybe fifty people will listen to?"&lt;/p&gt;
&lt;p&gt;At $4 to $10 per narration, the answer to all of those is "probably not." At $0.02, the answer is "why wouldn't I?" That shift from "probably not" to "why not" is the entire economic argument for owning TTS hardware. It is not about the average cost. It is about the marginal decision.&lt;/p&gt;
&lt;p&gt;Before running local TTS, I narrated posts selectively with Google Cloud's Text-to-Speech. Some were too long or too niche to justify the GCP cost. Now every post gets audio. I regenerate after revisions without thinking about it. I have run the same post through three different TTS models to compare voice quality. I experiment with speaker voices, pacing parameters, and chunk sizes. The total volume of audio I have generated locally exceeds what I would have purchased from Google at any price point. This is &lt;a href="https://tinycomputers.io/posts/jevons-paradox.html"&gt;Jevons Paradox&lt;/a&gt; at the smallest possible scale: make TTS cheap enough and I do not produce the same amount of TTS for less money; I produce vastly more TTS for slightly less money.&lt;/p&gt;
&lt;p&gt;The same logic applies to code scanning. Any individual scan is cheap enough through an API. But the friction of metered pricing discourages the kind of speculative, exploratory analysis that turns up unexpected findings. When the marginal cost is zero, I scan more freely and more often. The value is not in any single scan; it is in the scans I would not have run otherwise.&lt;/p&gt;
&lt;h3&gt;The Strix Halo Problem&lt;/h3&gt;
&lt;p&gt;The most surprising result in the benchmarks is the Strix Halo's efficiency. An integrated APU with no discrete GPU delivers 40.2 tokens per second at 82W of GPU power. The P40 server delivers 47.5 tokens per second at 278W of GPU power. The P40 is 18% faster. The Strix Halo uses 70% less power. In performance per GPU watt, the Strix Halo (0.49 tok/s per watt) is nearly three times more efficient than the P40 (0.17 tok/s per watt).&lt;/p&gt;
&lt;p&gt;This creates a problem for the P40 server's economics. The server's advantage is VRAM: 96GB lets it run 120B MoE models that the Strix Halo cannot fit. For the gpt-oss 120B model, the P40 server is the only viable option. But for everything 8B and below, the Strix Halo is comparable in price to buy ($3,000 vs. $2,500), cheaper to idle ($4.20/month vs. $38.50/month), cheaper per token ($0.13/M vs. $0.46/M), quieter, smaller, and only 18% slower.&lt;/p&gt;
&lt;p&gt;If I were building a local inference setup today from scratch and my workload was 8B models and TTS, I would buy the Strix Halo and nothing else. The P40 server justifies its existence only because of the large models that need its VRAM, and because I put it together well before the current RAM price spike.&lt;/p&gt;
&lt;p&gt;This is worth sitting with for a moment, because it inverts the conventional wisdom about inference hardware. The enterprise GPU server that looks impressive on paper (four GPUs, 96GB VRAM, 2U rack mount) loses on total cost of ownership to a $3,000 mini desktop for the workloads that dominate my actual usage. The P40's raw throughput advantage is real but small. Its power cost advantage is negative. The VRAM advantage matters only for models most people do not run.&lt;/p&gt;
&lt;h3&gt;The Maintenance Tax&lt;/h3&gt;
&lt;p&gt;The per-token calculations ignore the cost of keeping these machines running. It is not zero.&lt;/p&gt;
&lt;p&gt;I have had two kernel updates break the NVIDIA DKMS module on the P40 server. The AMD machine requires &lt;a href="https://tinycomputers.io/posts/qwen-tts-on-amd-strix-halo.html"&gt;specific pre-release PyTorch wheels&lt;/a&gt; and environment variable overrides for ROCm to function on gfx1151 hardware. While running the benchmarks for this article, I discovered that Ollama on the Strix Halo had been running entirely on CPU because the systemd service file lacked the &lt;code&gt;HSA_OVERRIDE_GFX_VERSION=11.5.1&lt;/code&gt; variable. Every benchmark I had run on that machine prior to catching this was measuring CPU inference, not GPU inference. The fix took two minutes. Finding it took longer.&lt;/p&gt;
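&lt;p&gt;For reference, the durable version of that fix is a systemd drop-in rather than an edit to the packaged service file, so it survives upgrades. A sketch (the unit name and drop-in path are assumptions; adjust for your install):&lt;/p&gt;

```
# /etc/systemd/system/ollama.service.d/rocm.conf  (assumed path)
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
```

&lt;p&gt;Followed by &lt;code&gt;systemctl daemon-reload&lt;/code&gt; and a service restart.&lt;/p&gt;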
&lt;p&gt;The P40 server's fans run at full speed from October through April because the BMC interprets Minnesota winter temperatures as a hardware malfunction. The noise is audible from the house, 150 feet away.&lt;/p&gt;
&lt;p&gt;None of this is catastrophic. All of it is time. And time spent debugging DKMS modules or adding environment variables to systemd units is time not spent on the work that the hardware is supposed to enable. A Claude Max subscription requires zero maintenance. The local hardware requires ongoing attention. That asymmetry does not show up in per-token cost tables, but it is real.&lt;/p&gt;
&lt;h3&gt;Who This Is For&lt;/h3&gt;
&lt;p&gt;Most people should not build a local inference server. If you use AI for interactive tasks (questions, code, analysis, writing), a frontier model subscription is a better product at a lower total cost than any local setup. The quality gap between a local 8B model and Claude or GPT-5.4 is not closing in the ways that matter for conversational use. Pay for the good models. Use them freely.&lt;/p&gt;
&lt;p&gt;Local inference makes economic sense when you have a specific, high-volume, quality-tolerant workload that you will run often enough to justify hardware sitting on 24/7. TTS is the clearest case. Batch code analysis is another. If you cannot name the workload, you do not have one, and the hardware will cost you $40 to $50 per month in idle electricity to find out.&lt;/p&gt;
&lt;p&gt;The split between frontier subscriptions and local batch processing is not a compromise. It is, for my usage, the correct architecture. The frontier model handles the work where quality determines value. The local hardware handles the work where volume determines cost. Neither replaces the other. The mistake is thinking they compete.&lt;/p&gt;</description><category>ai</category><category>amd</category><category>benchmarks</category><category>claude</category><category>economics</category><category>gpu</category><category>home lab</category><category>inference</category><category>jevons paradox</category><category>local inference</category><category>power consumption</category><category>strix halo</category><category>tesla p40</category><category>tts</category><guid>https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html</guid><pubDate>Tue, 17 Mar 2026 13:00:00 GMT</pubDate></item><item><title>The Real Cost of Running Qwen TTS Locally: Three Machines Compared</title><link>https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/the-real-cost-of-running-qwen-tts-locally-three-machines-compared_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;17 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;&lt;img src="https://tinycomputers.io/images/qwen-tts-benchmark/p40-server-shop.jpg" alt="The Tesla P40 server standing on its side in an unheated Minnesota shop building, one of three machines benchmarked for local TTS generation" style="float: right; max-width: 40%; margin: 0 0 1em 1.5em; border-radius: 4px; box-shadow: 0 30px 40px rgba(0,0,0,.1);"&gt;&lt;/p&gt;
&lt;p&gt;Every post on this site has an audio version. A small player at the top, a few minutes of narration, generated entirely on local hardware. No cloud API, no per-character fees, no data leaving the network. I wrote about &lt;a href="https://tinycomputers.io/posts/qwen-tts-on-amd-strix-halo.html"&gt;setting up the pipeline on AMD Strix Halo&lt;/a&gt; earlier this year, and the system has been running in production since, generating narrations for new posts, regenerating old ones when I revise them, and occasionally processing long-form content that would cost real money through Google Cloud TTS or ElevenLabs.&lt;/p&gt;
&lt;p&gt;But I now have three machines capable of running Qwen3-TTS, and they could not be more different from each other. An Apple M3 Max laptop. An AMD Ryzen AI MAX+ 395 mini desktop with integrated Radeon graphics. And a &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;four-GPU Tesla P40 server&lt;/a&gt; built from decade-old enterprise hardware bought on eBay. Three different silicon vendors, three different compute backends (MPS, ROCm, and CUDA) running the same model on the same text.&lt;/p&gt;
&lt;p&gt;The question I wanted to answer is simple: how do they actually compare? Not on paper. Not in theoretical FLOPS. In wall-clock time, generating real audio from a real blog post.&lt;/p&gt;
&lt;p&gt;The answer turned out to be more interesting than I expected, because the numbers tell a story about hardware architecture that raw specifications completely miss.&lt;/p&gt;
&lt;h3&gt;The Setup&lt;/h3&gt;
&lt;p&gt;The model is &lt;a href="https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice"&gt;Qwen3-TTS-12Hz-1.7B-CustomVoice&lt;/a&gt;, a 1.7 billion parameter autoregressive text-to-speech model from Alibaba's Qwen team. It generates natural-sounding speech with multiple speaker voices. I use the Eric voice for all blog narrations: clear, professional, well-paced for technical content.&lt;/p&gt;
&lt;p&gt;The three machines:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Apple M3 Max&lt;/strong&gt;, a &lt;a href="https://amzn.to/4rwlTa6"&gt;MacBook Pro&lt;/a&gt; with Apple's M3 Max chip. 14 CPU cores, 30 GPU cores, 64GB unified memory. The GPU runs through PyTorch's MPS (Metal Performance Shaders) backend. This is my daily driver laptop, and it generates TTS when I am writing and editing posts.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;AMD Radeon 8060S&lt;/strong&gt;, a Bosgame M5 mini desktop running &lt;a href="https://amzn.to/4bv5CMG"&gt;AMD's Ryzen AI MAX+ 395&lt;/a&gt;. This is a Strix Halo APU with integrated RDNA 3.5 graphics, not a discrete GPU. It shares 128GB of DDR5 system memory with the CPU, with roughly 96GB addressable as VRAM. The GPU runs through ROCm 7.2 with PyTorch 2.9.1. The gfx1151 architecture requires specific PyTorch wheels from AMD's pre-release index and several environment variable overrides to function. I wrote a &lt;a href="https://tinycomputers.io/posts/qwen-tts-on-amd-strix-halo.html"&gt;full setup guide&lt;/a&gt; for this machine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;NVIDIA Tesla P40&lt;/strong&gt;, a 2U rack-mount server with four &lt;a href="https://www.ebay.com/itm/306087510352?_skw=nvidia+tesla+p40+24gb+gpu&amp;amp;epid=27032254618&amp;amp;itmmeta=01KKJEGQKSK110HNM6214EB0TT&amp;amp;hash=item47443cc150:g:qAwAAOSwy0toUHXh&amp;amp;itmprp=enc%3AAQALAAABAGfYFPkwiKCW4ZNSs2u11xAq6UjArKrgnuEyMVTZhAZhOSUGYags6TsDJvvCEOa51UH2r%2BRe%2F182ah6rgiTIAIRULQNEL9rbiinCXMor%2FBNNZk0GaNKqTWkq9pLWGoRBM8NL%2BjC1aSA63XPe4YsFHjQkb%2Fmup21S3UM7oqwBrW%2BHep1E07lnrt2vzkljSA4xg7SnrA%2BFDtOdqvDwO4tpgB0t%2BtCv9%2BlXoh%2BeoEgpJqXgaaM0ad48OfmgKB13PF9RIPXLNI6z4SjV2O%2FXOk6nYPyD9Eg5wbzdmsXfNRhwitz7HEZ1bTRUnRmvKzQrw4B3r3LAag5f8%2B8CcCWfCRAkkG8%3D%7Ctkp%3ABk9SR4j6ws6cZw&amp;amp;mkcid=1&amp;amp;mkrid=711-53200-19255-0&amp;amp;siteid=0&amp;amp;campid=5338960379&amp;amp;customid=&amp;amp;toolid=10001&amp;amp;mkevt=1"&gt;Tesla P40 GPUs&lt;/a&gt;, each with 24GB of GDDR5X. Pascal architecture from 2016. Compute capability 6.1. No Tensor Cores, no native bfloat16 support. The benchmark uses a single P40, since Qwen TTS runs on one GPU. This machine lives in an unheated shop building in Minnesota and &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;screams through the winter&lt;/a&gt; when the BMC misinterprets sub-zero ambient temperatures as a hardware malfunction.&lt;/p&gt;
&lt;p&gt;All three machines run the same model checkpoint, the same text input, and the same speaker voice. The only differences are the silicon and the compute backend.&lt;/p&gt;
&lt;h3&gt;The Benchmark&lt;/h3&gt;
&lt;p&gt;I used a standardized 2,411-character passage, five paragraphs on the Jevons Paradox, dense enough to exercise the model's prosody and pacing on real written content. Each machine ran three consecutive generations from the same loaded model, producing roughly three minutes of audio per run. The first run includes kernel compilation and cache warmup; subsequent runs reflect steady-state performance.&lt;/p&gt;
&lt;p&gt;The metric that matters is Real-Time Factor (RTF): how many seconds of wall-clock time it takes to generate one second of audio. An RTF of 1.0 means the model generates audio at exactly real-time speed. Below 1.0 is faster than real-time. Above 1.0 means you are waiting.&lt;/p&gt;
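&lt;p&gt;In code, the metric is a single division (the helper name is mine):&lt;/p&gt;

```python
# Real-Time Factor: wall-clock seconds spent per second of audio produced.
def rtf(generation_seconds, audio_seconds):
    return generation_seconds / audio_seconds

print(round(rtf(698.5, 197.7), 2))  # M3 Max, run 1: 3.53
```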
&lt;h4&gt;Individual Runs&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Apple M3 Max (MPS)&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Generation Time&lt;/th&gt;
&lt;th&gt;Audio Length&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;698.5s&lt;/td&gt;
&lt;td&gt;197.7s&lt;/td&gt;
&lt;td&gt;3.53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;533.1s&lt;/td&gt;
&lt;td&gt;184.2s&lt;/td&gt;
&lt;td&gt;2.89&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;447.8s&lt;/td&gt;
&lt;td&gt;179.2s&lt;/td&gt;
&lt;td&gt;2.50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;559.8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;187.0s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.97&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;AMD Radeon 8060S (ROCm)&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Generation Time&lt;/th&gt;
&lt;th&gt;Audio Length&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;729.2s&lt;/td&gt;
&lt;td&gt;173.6s&lt;/td&gt;
&lt;td&gt;4.20&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;460.0s&lt;/td&gt;
&lt;td&gt;204.8s&lt;/td&gt;
&lt;td&gt;2.25&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;548.2s&lt;/td&gt;
&lt;td&gt;214.2s&lt;/td&gt;
&lt;td&gt;2.56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;579.1s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;197.5s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.00&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;&lt;strong&gt;NVIDIA Tesla P40 (CUDA)&lt;/strong&gt;&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;Generation Time&lt;/th&gt;
&lt;th&gt;Audio Length&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1511.4s&lt;/td&gt;
&lt;td&gt;204.1s&lt;/td&gt;
&lt;td&gt;7.41&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;1225.7s&lt;/td&gt;
&lt;td&gt;171.6s&lt;/td&gt;
&lt;td&gt;7.14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;1537.2s&lt;/td&gt;
&lt;td&gt;206.7s&lt;/td&gt;
&lt;td&gt;7.44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1424.8s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;194.1s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.33&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h4&gt;Summary&lt;/h4&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Avg RTF&lt;/th&gt;
&lt;th&gt;Best RTF&lt;/th&gt;
&lt;th&gt;Avg Gen Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MacBook Pro&lt;/td&gt;
&lt;td&gt;M3 Max (MPS)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.97&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.50&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;559.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bosgame M5&lt;/td&gt;
&lt;td&gt;Radeon 8060S (ROCm)&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;2.25&lt;/td&gt;
&lt;td&gt;579.1s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Penguin 2U&lt;/td&gt;
&lt;td&gt;Tesla P40 (CUDA)&lt;/td&gt;
&lt;td&gt;7.33&lt;/td&gt;
&lt;td&gt;7.14&lt;/td&gt;
&lt;td&gt;1424.8s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;What the Numbers Mean&lt;/h3&gt;
&lt;p&gt;The headline result is that the M3 Max and Radeon 8060S are essentially tied, and the Tesla P40 is roughly 2.4 times slower than both. But that summary hides the interesting details.&lt;/p&gt;
&lt;h4&gt;The Warmup Effect Is Massive&lt;/h4&gt;
&lt;p&gt;On both the M3 Max and the Radeon 8060S, the first run is dramatically slower than subsequent runs. The M3 Max goes from RTF 3.53 on run 1 to RTF 2.50 on run 3, a 29% improvement. The AMD shows an even larger swing: RTF 4.20 on run 1 dropping to RTF 2.25 on run 2, a 46% improvement.&lt;/p&gt;
&lt;p&gt;This is kernel compilation. Both MPS and ROCm compile GPU kernels on first use and cache them for subsequent calls. The Qwen TTS model hits a wide variety of kernel shapes during autoregressive generation (different sequence lengths, different attention patterns) and each new shape triggers a compilation on the first encounter. By run 2, most of the common shapes are cached, and performance stabilizes.&lt;/p&gt;
&lt;p&gt;The P40 shows almost no warmup effect. RTF 7.41 on run 1, 7.14 on run 2, 7.44 on run 3. CUDA's kernel compilation is faster and more mature, so the overhead is absorbed within the first few seconds rather than spread across the entire run. But this maturity does not translate into faster inference; CUDA compiles faster, but the P40's hardware is fundamentally slower at the operations this model requires.&lt;/p&gt;
&lt;p&gt;This has a practical implication that matters: &lt;strong&gt;short benchmarks on MPS and ROCm are misleading.&lt;/strong&gt; I initially ran a quick 276-character test on all three machines before doing the full benchmark. The short test showed the AMD at RTF 9.20, almost identical to the P40's RTF 10.01, and far behind the M3 Max's RTF 2.84. That result nearly led me to conclude the AMD was performing as poorly as decade-old hardware. The longer benchmark, with its warmup effect amortized across more generation, revealed the truth: the AMD is just as fast as the M3 Max once the kernels are cached. If I had stopped at the short test, I would have drawn exactly the wrong conclusion.&lt;/p&gt;
&lt;h4&gt;Why the P40 Is So Slow&lt;/h4&gt;
&lt;p&gt;The Tesla P40 is a Pascal-generation GPU from 2016. It has 3,840 CUDA cores and 24GB of GDDR5X memory. On paper, it should be competitive; 12 TFLOPS of FP32 compute is not trivial. And for LLM inference through Ollama, the P40 &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;performs remarkably well&lt;/a&gt;, outperforming quad T4 instances on models up to 8B parameters.&lt;/p&gt;
&lt;p&gt;TTS is a different workload. Qwen3-TTS is an autoregressive transformer that generates audio tokens one at a time, each conditioned on all previous tokens. This means the inference is heavily memory-bandwidth bound during the decoding phase, and compute-bound during the attention and feedforward passes. The model is distributed in bfloat16 precision, which the P40 cannot compute natively; Pascal predates bfloat16 support entirely. PyTorch silently promotes bf16 operations to fp32 on the P40, roughly doubling the computation per operation and halving the effective throughput.&lt;/p&gt;
&lt;p&gt;The P40 also lacks the SDPA (Scaled Dot-Product Attention) hardware acceleration that newer architectures provide. On the M3 Max, MPS routes attention through Metal's optimized primitives. On the AMD, ROCm's AOTriton provides experimental flash attention support. On the P40, attention runs through standard CUDA kernels without any of these accelerations. For a model that generates thousands of autoregressive steps per audio clip, each involving a full attention pass over the growing sequence, this compounds dramatically.&lt;/p&gt;
&lt;p&gt;The P40 is not bad hardware. It is excellent hardware for the workloads it was designed for: batch inference on quantized LLMs where its 24GB of VRAM per card creates a memory advantage. But autoregressive TTS in bfloat16 hits every one of its architectural weaknesses simultaneously.&lt;/p&gt;
&lt;h4&gt;Unified Memory Wins This Workload&lt;/h4&gt;
&lt;p&gt;Both the M3 Max and the Radeon 8060S use unified memory architectures, where the CPU and GPU share the same physical memory pool. The M3 Max has 64GB of unified LPDDR5. The Radeon 8060S shares 128GB of DDR5 with the CPU, with roughly 96GB addressable as VRAM.&lt;/p&gt;
&lt;p&gt;For a 1.7B parameter model in bf16, the weights occupy roughly 3.4GB. The model fits comfortably on all three machines. But the autoregressive generation pattern creates a stream of intermediate activations (KV cache entries, attention scores, feedforward intermediates) that grow with the sequence length. On a unified memory architecture, these intermediates exist in the same memory space as the model weights, avoiding any PCIe transfer overhead. On the P40, every interaction between CPU and GPU crosses a PCIe 3.0 bus.&lt;/p&gt;
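&lt;p&gt;The weight footprint is back-of-envelope arithmetic, and it shows why the model fits comfortably everywhere even though the P40 pays a compute penalty (the fp32 promotion described earlier costs arithmetic throughput; the weights themselves can still be stored in bf16):&lt;/p&gt;

```python
params = 1.7e9                 # Qwen3-TTS parameter count
print(params * 2 / 1e9)        # bf16: 2 bytes/param -> 3.4 GB of weights
print(params * 4 / 1e9)        # fp32: 4 bytes/param -> 6.8 GB
```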
&lt;p&gt;For LLM inference, where the bottleneck is token generation throughput and the KV cache fits in VRAM, the P40's discrete memory is fine. For TTS, where the model generates hundreds of audio tokens per second of speech and the attention window grows continuously, the memory access pattern favors unified architectures.&lt;/p&gt;
&lt;p&gt;This is not a universal statement about unified versus discrete memory. A modern discrete GPU with HBM2e or GDDR6X and PCIe 4.0 or 5.0 would likely outperform both the M3 Max and the Radeon 8060S on this workload. The P40's problem is not that its memory is discrete; it is that its memory is slow and its bus is narrow by 2026 standards.&lt;/p&gt;
&lt;h3&gt;The Model Architecture Question&lt;/h3&gt;
&lt;p&gt;While benchmarking Qwen TTS, I also ran a quick comparison with &lt;a href="https://huggingface.co/SWivid/F5-TTS"&gt;F5-TTS&lt;/a&gt; on the AMD machine to sanity-check the results. F5-TTS is a flow-matching model, fundamentally different from Qwen's autoregressive approach. Where Qwen generates audio tokens sequentially, each conditioned on all previous tokens, F5 generates audio in parallel through an iterative refinement process.&lt;/p&gt;
&lt;p&gt;The difference is stark. On the same Radeon 8060S, the same text, the same hardware:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Generation Time&lt;/th&gt;
&lt;th&gt;Audio Length&lt;/th&gt;
&lt;th&gt;RTF&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-TTS&lt;/td&gt;
&lt;td&gt;579.1s (avg)&lt;/td&gt;
&lt;td&gt;197.5s&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F5-TTS&lt;/td&gt;
&lt;td&gt;17.4s&lt;/td&gt;
&lt;td&gt;27.2s&lt;/td&gt;
&lt;td&gt;0.64&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;F5-TTS is faster than real-time. Qwen3-TTS takes three times longer than the audio it produces. In normalized terms, F5 is roughly five times faster than Qwen at steady state, and the gap widens on shorter content, where Qwen's warmup overhead is proportionally larger.&lt;/p&gt;
&lt;p&gt;This is not an apples-to-apples quality comparison. Qwen3-TTS generally produces more natural prosody, better handling of complex sentence structures, and more consistent speaker identity across long passages. F5-TTS is excellent but can occasionally drift in voice character or pacing on very long content. For blog narration, both are well above the threshold of "good enough," and the quality difference is smaller than you might expect given the architectural gap.&lt;/p&gt;
&lt;p&gt;The point is that hardware is only half the story. The choice of model architecture can matter more than the choice of GPU. A flow-matching model on integrated AMD graphics outperforms an autoregressive model on Apple's best laptop silicon by a wide margin. If generation speed is the constraint, switching models gains more than switching hardware.&lt;/p&gt;
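&lt;p&gt;The real-time factor in the table above is just generation time divided by audio duration; computing it yourself makes the gap concrete:&lt;/p&gt;

```python
def rtf(generation_seconds, audio_seconds):
    # Real-time factor: values below 1.0 mean faster than real time.
    return generation_seconds / audio_seconds

f5 = rtf(17.4, 27.2)      # F5-TTS run from the table
print(f"F5-TTS RTF: {f5:.2f}")
print(f"Steady-state speedup over Qwen3-TTS: {3.00 / f5:.1f}x")
```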
&lt;h3&gt;What This Costs in Practice&lt;/h3&gt;
&lt;p&gt;The abstract benchmark numbers translate into concrete time and electricity costs when you are generating audio for a library of blog posts.&lt;/p&gt;
&lt;p&gt;A typical TinyComputers post runs 3,000 to 5,000 words, producing 15 to 25 minutes of narrated audio. At steady-state RTF:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;15 min audio&lt;/th&gt;
&lt;th&gt;25 min audio&lt;/th&gt;
&lt;th&gt;System Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M3 Max&lt;/td&gt;
&lt;td&gt;~38 min&lt;/td&gt;
&lt;td&gt;~63 min&lt;/td&gt;
&lt;td&gt;~50W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Radeon 8060S&lt;/td&gt;
&lt;td&gt;~38 min&lt;/td&gt;
&lt;td&gt;~63 min&lt;/td&gt;
&lt;td&gt;~100W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tesla P40&lt;/td&gt;
&lt;td&gt;~110 min&lt;/td&gt;
&lt;td&gt;~183 min&lt;/td&gt;
&lt;td&gt;~400W&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The M3 Max and Radeon 8060S are tied on generation time, but the M3 Max draws roughly half the system power. For a single post, the electricity cost difference is negligible, a fraction of a cent. For batch processing a backlog of thirty posts, the M3 Max costs about $0.18 in electricity versus $0.36 for the AMD and $3.50 for the P40.&lt;/p&gt;
&lt;p&gt;None of these numbers are alarming. Even the P40, at nearly two and a half hours per post and 400 watts from the wall, costs under fifteen cents in electricity per narration at Minnesota residential rates. The equivalent Google Cloud TTS job would cost $4 to $16 per post depending on the voice quality tier.&lt;/p&gt;
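&lt;p&gt;The per-narration electricity math is a one-liner, using the wall-power figures from the table above:&lt;/p&gt;

```python
def narration_cost_usd(watts, minutes, rate_per_kwh=0.12):
    # Energy in kWh times the residential rate.
    return watts / 1000 * minutes / 60 * rate_per_kwh

# P40 server: ~400W for roughly 150 minutes of generation
print(f"${narration_cost_usd(400, 150):.2f}")   # well under fifteen cents
```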
&lt;p&gt;To put cloud costs in perspective: I recently ran a fiction novel through Google's Chirp3-HD voice: 82,000 words, roughly 500,000 characters of text plus SSML markup. The bill came to $17.25 at Google's rate of $30 per million characters. That is not unreasonable for a one-off project, but it adds up quickly if you are generating audio regularly. The entire library of TinyComputers narrations (dozens of posts, hours of audio) has cost me nothing beyond the electricity to run the machines I already own. The economics of local TTS are favorable on every machine in the comparison.&lt;/p&gt;
&lt;p&gt;The real cost is time. If I am generating audio for a single new post, I start it on whichever machine is idle and check back in an hour. If I am regenerating audio for twenty posts after changing the speaker voice or updating the pipeline, the M3 Max or AMD will finish overnight. The P40 would take most of a weekend.&lt;/p&gt;
&lt;h3&gt;The Right Machine for the Job&lt;/h3&gt;
&lt;p&gt;After running these benchmarks, my workflow has shifted. The M3 Max is the default for new post narration; it is fast, quiet, and I am usually sitting in front of it when I finish writing. The AMD handles batch jobs and overnight processing, where its slightly higher power draw does not matter and its equivalent speed makes it interchangeable with the Mac. The P40 server is reserved for what it does best: &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;running large language models&lt;/a&gt; through Ollama, where its 96GB of aggregate VRAM gives it an advantage that neither the Mac nor the AMD can match.&lt;/p&gt;
&lt;p&gt;The P40 can still generate TTS in a pinch, and it does; when both other machines are occupied, I will queue a job on the P40 and accept the longer wait. But for a workload that is inherently autoregressive, memory-bandwidth sensitive, and dependent on bf16 precision, a ten-year-old Pascal GPU is the wrong tool.&lt;/p&gt;
&lt;p&gt;What surprised me most is how well the AMD performs. The Radeon 8060S is an integrated GPU sharing system memory with the CPU. It has no HBM, no dedicated VRAM, no NVLink. Its ROCm software stack requires environment variable hacks, pre-release PyTorch wheels, and a GFX version override to function at all. And yet, once the kernels warm up, it matches Apple's best laptop silicon stride for stride. The raw hardware is there: 40 RDNA 3.5 compute units with access to a deep pool of DDR5 memory. The software just needs to get out of the way, and on run 2 and beyond, it does.&lt;/p&gt;
&lt;h3&gt;Lessons&lt;/h3&gt;
&lt;p&gt;Three takeaways from this exercise that generalize beyond TTS:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Short benchmarks lie.&lt;/strong&gt; Kernel compilation overhead on MPS and ROCm is large enough to dominate a short test. If you are evaluating a new model on non-CUDA hardware, run it at least twice before drawing conclusions. The first run is measuring the software stack, not the hardware.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Architecture matters more than clock speed.&lt;/strong&gt; The P40 has more raw FLOPS than the Radeon 8060S. It does not matter. The P40 lacks native bf16, lacks efficient attention primitives, and sits behind a PCIe 3.0 bus. The Radeon has all three, and ties a chip designed by Apple's custom silicon team. For autoregressive models, the architectural fit between model and hardware dominates everything else.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Model choice can outweigh hardware choice.&lt;/strong&gt; F5-TTS running on the weakest GPU in this comparison is five times faster than Qwen3-TTS running on the strongest. If your constraint is generation speed and you can accept a modest quality trade-off, switching to a flow-matching architecture gains more than any hardware upgrade short of a data center GPU.&lt;/p&gt;
&lt;p&gt;The audio player at the top of each post on this site represents a few minutes of machine time on one of these three machines. Which machine generated it depends on the day, the workload, and what else is running. The listener cannot tell the difference. The audio sounds the same regardless of whether it was generated on a laptop, a mini desktop, or a rack-mount server in a cold Minnesota shop. That is the real benchmark: not which machine is fastest, but that all three are fast enough.&lt;/p&gt;</description><category>amd</category><category>apple silicon</category><category>audio</category><category>benchmarks</category><category>cuda</category><category>gpu</category><category>inference</category><category>m3 max</category><category>machine learning</category><category>mps</category><category>nvidia</category><category>qwen</category><category>rocm</category><category>strix halo</category><category>tesla p40</category><category>text-to-speech</category><category>tts</category><guid>https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html</guid><pubDate>Thu, 12 Mar 2026 14:00:00 GMT</pubDate></item><item><title>Repurposing Enterprise GPUs: The Tesla P40 Home Lab Story</title><link>https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;17 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;There is a window, maybe eighteen months wide, where enterprise hardware hits a pricing sweet spot. The first-generation buyers (the hyperscalers, the research labs, the Fortune 500 AI teams) have moved on to the next generation. The second-hand market floods. Prices crater. And if you know what you're looking for, you can build something genuinely capable for less than a month of cloud compute.&lt;/p&gt;
&lt;p&gt;I built a four-GPU inference server for about twenty-five hundred dollars. This is the story of how, why, and whether you should do the same.&lt;/p&gt;
&lt;h3&gt;The Buy&lt;/h3&gt;
&lt;p&gt;The acquisition strategy is straightforward: eBay, patience, and knowing what to look for.&lt;/p&gt;
&lt;p&gt;Tesla P40s started appearing in volume on the secondary market around 2023, when cloud providers and enterprise data centers began cycling them out in favor of A100s and H100s. A card that sold for over five thousand dollars new was suddenly available for three hundred, then two hundred and fifty, then, if you watched listings carefully and were willing to buy from decommissioned lot sellers, sometimes less. I picked up four cards over the course of about two months, averaging two hundred and fifty dollars each.&lt;/p&gt;
&lt;p&gt;The chassis was a Penguin Computing 2U rack-mount server, also from eBay. These show up when government labs and research institutions liquidate equipment. The Penguin Computing systems are well-built, with proper server-grade construction, redundant power supplies, and engineered airflow. Mine takes two Xeon E5-2697A v4 processors, also sourced from eBay: thirty-two Broadwell cores between them, more than enough CPU to keep four GPUs fed. The chassis cost around six hundred dollars.&lt;/p&gt;
&lt;p&gt;Memory was the lucky purchase. I bought 252GB of DDR4 ECC RAM before the memory price spike that hit in late 2024 when every company on Earth decided they needed AI infrastructure simultaneously. What I paid around two hundred and fifty dollars for would cost significantly more today. Total build: roughly twenty-five hundred dollars.&lt;/p&gt;
&lt;h3&gt;The Hardware&lt;/h3&gt;
&lt;p&gt;The Tesla P40 is a 2016-era data center GPU. NVIDIA designed it for the Pascal generation, targeting inference workloads in enterprise environments. The specifications, for something you can buy on eBay for two hundred and fifty dollars, are remarkable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;24GB GDDR5X&lt;/strong&gt; per card, as much memory as an RTX 4090&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;3,840 CUDA cores&lt;/strong&gt;, Pascal architecture, compute capability 6.1&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;12 TFLOPS FP32&lt;/strong&gt;, respectable even by 2026 standards for inference&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;250W TDP&lt;/strong&gt;: this is a data center card and it draws power like one&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Multiply by four and you get 96GB of VRAM for a thousand dollars. That is an extraordinary amount of GPU memory for the price. For context, a single NVIDIA A100 80GB still sells for north of five thousand dollars on the secondary market. Four P40s give you more total VRAM for a fraction of the cost.&lt;/p&gt;
&lt;h3&gt;What You Give Up&lt;/h3&gt;
&lt;p&gt;There is no free lunch in computing, and the P40 makes you pay for its low price in specific, sometimes painful ways.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No Tensor Cores.&lt;/strong&gt; The P40 predates NVIDIA's Tensor Core architecture, which arrived with Volta in 2017. Tensor Cores accelerate matrix multiplication (the fundamental operation in neural network inference) by factors of 4x to 16x depending on precision. The P40 does everything with its CUDA cores, the old-fashioned way. This matters less than you might think for inference at moderate batch sizes, but it means you will never match the throughput of a V100 or newer card, clock for clock.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No native BF16 or FP16.&lt;/strong&gt; This is the real gotcha. BF16 (bfloat16) has become the default precision for large language models. It is what most model weights are distributed in. The P40 cannot compute in BF16 natively; it emulates it through FP32 operations, which is roughly 21% slower than native support. In practice, this means you are running quantized models (Q4, Q5, Q8) through llama.cpp or similar frameworks, which handle the precision conversion for you. It works. It is not optimal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Passive cooling designed for server airflow.&lt;/strong&gt; The P40 is a blower-style card designed for 1U and 2U server chassis with front-to-back forced airflow. In a proper server, this is fine. In anything else, you need to solve cooling yourself. I put mine in a Penguin Computing 2U rack-mount chassis, which has the right airflow characteristics, but this is not a card you drop into a desktop tower.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PCIe 3.0 x16.&lt;/strong&gt; The P40 connects via PCIe 3.0, which provides about 16 GB/s of bandwidth per direction. When you are running a model that spans four GPUs, the inter-GPU communication goes over PCIe, not NVLink. This creates a bottleneck for models that require heavy cross-GPU communication. For inference, where the communication pattern is more predictable than training, this is manageable. For training, it would be a serious constraint.&lt;/p&gt;
&lt;h3&gt;The Minnesota Problem&lt;/h3&gt;
&lt;p&gt;My server lives in an unheated shop building in northern Minnesota. This has created an issue that no hardware review will prepare you for.&lt;/p&gt;
&lt;p&gt;When ambient temperatures drop below freezing (which, in Minnesota, means roughly October through April) the onboard temperature sensors report values that the baseboard management controller interprets as a malfunction. The BMC's response is to spin every fan to maximum RPM as a protective measure.&lt;/p&gt;
&lt;p&gt;The result is a machine that, on quiet winter nights, is audible from the house. The house is a hundred and fifty feet away.&lt;/p&gt;
&lt;p&gt;I have not solved this problem. I have learned to live with it. You can override BMC fan curves on some platforms, but the Penguin Computing firmware is locked down in ways that make this nontrivial, and frankly, a server that runs its fans at full speed because it thinks it is dying is doing exactly what it should be doing. The firmware's assumptions are just wrong for the environment.&lt;/p&gt;
&lt;p&gt;The server runs 24/7 regardless of the season, and the cold air actually keeps the GPUs well within thermal limits. The irony is that the machine has never been cooler or louder than when it is twenty below zero outside. If you are considering a similar setup in a garage, basement, or outbuilding, factor in noise. A 2U server with four 250W GPUs is not quiet under any circumstances, and server-grade fans at full RPM are genuinely loud.&lt;/p&gt;
&lt;h3&gt;Setting Up the Software Stack&lt;/h3&gt;
&lt;p&gt;The driver situation for the P40 in 2026 is straightforward, though it was not always. NVIDIA's &lt;code&gt;nvidia-driver-570-server&lt;/code&gt; package works cleanly on Ubuntu, and the DKMS module rebuilds automatically on kernel updates, most of the time. I have had exactly two occasions where a kernel update broke the NVIDIA module and required manual intervention. This is fewer than I expected.&lt;/p&gt;
&lt;p&gt;For inference, I run &lt;a href="https://ollama.com"&gt;Ollama&lt;/a&gt;, which wraps llama.cpp and provides a simple API for model management and inference. Ollama handles multi-GPU sharding automatically: when you load a model, it distributes layers across GPUs based on available memory and model size. A 65GB model like gpt-oss:120b fits across three of the four P40s, leaving one free. Smaller models may only need one or two cards. The allocation is generally sensible, though you have less control over placement than you would with raw llama.cpp.&lt;/p&gt;
&lt;p&gt;The alternative stack (vLLM, TGI, or raw llama.cpp) offers more control over GPU assignment but requires more configuration. With llama.cpp directly, you can pin specific GPU layers to specific devices, which lets you optimize for the P40's memory topology. vLLM provides better batching and continuous batching for serving multiple concurrent requests. For a home lab where the primary use case is running various models for experimentation and development rather than serving production traffic, Ollama's simplicity wins.&lt;/p&gt;
&lt;p&gt;One thing worth noting: the P40 is well-supported by the GGUF ecosystem that llama.cpp (and therefore Ollama) uses. GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0) run without issues on Pascal hardware. The quantization handles the BF16 problem for you: model weights are stored in 4-bit or 8-bit integer formats and dequantized to FP32 at runtime, which the P40 handles natively. You are not fighting the hardware; you are working with it.&lt;/p&gt;
&lt;h3&gt;The Benchmarks&lt;/h3&gt;
&lt;p&gt;Theory is cheap. Benchmarks are what matter. I ran the same inference workload across three configurations: my four P40 home lab, a single AWS Tesla T4 instance, and a quad T4 instance on AWS. The T4 is the closest cloud comparison; it is the workhorse inference GPU in AWS's fleet, one generation newer than the P40 (Turing architecture, 2018), with 16GB of GDDR6 and actual Tensor Cores.&lt;/p&gt;
&lt;p&gt;All benchmarks used Ollama with the same prompt, measuring tokens per second during the evaluation phase (excluding model load time).&lt;/p&gt;
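&lt;p&gt;Ollama's non-streaming &lt;code&gt;/api/generate&lt;/code&gt; response reports &lt;code&gt;eval_count&lt;/code&gt; (generated tokens) and &lt;code&gt;eval_duration&lt;/code&gt; (nanoseconds), which is where numbers like these come from. A minimal measurement sketch; the model name and host are placeholders, not my exact harness:&lt;/p&gt;

```python
import json
from urllib import request

def eval_tokens_per_second(resp):
    # Ollama reports eval_duration in nanoseconds.
    return resp["eval_count"] / resp["eval_duration"] * 1e9

def benchmark(model, prompt, host="http://localhost:11434"):
    body = json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()
    req = request.Request(f"{host}/api/generate", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as r:
        return eval_tokens_per_second(json.load(r))

# tok_s = benchmark("llama3.1:8b", "Summarize PCIe 3.0 in one paragraph.")
```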
&lt;h4&gt;Dense Models&lt;/h4&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;4x P40 (Home Lab)&lt;/th&gt;
&lt;th&gt;1x T4 (AWS $0.53/hr)&lt;/th&gt;
&lt;th&gt;4x T4 (AWS $3.91/hr)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2&lt;/td&gt;
&lt;td&gt;3B&lt;/td&gt;
&lt;td&gt;94.3 tok/s&lt;/td&gt;
&lt;td&gt;81.5 tok/s&lt;/td&gt;
&lt;td&gt;101.5 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;52.7 tok/s&lt;/td&gt;
&lt;td&gt;36.9 tok/s&lt;/td&gt;
&lt;td&gt;40.3 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;47.8 tok/s&lt;/td&gt;
&lt;td&gt;35.7 tok/s&lt;/td&gt;
&lt;td&gt;29.2 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The P40 wins on the 7B and 8B models by substantial margins, 31% and 64% respectively over the quad T4 configuration. The only model where the T4 edges ahead is the 3B, which is small enough to fit entirely on a single GPU. Here, the T4's higher clock speeds and faster GDDR6 memory give it an advantage because there is no multi-GPU overhead to penalize it.&lt;/p&gt;
&lt;p&gt;The 8B result is particularly interesting. The quad T4 actually performs &lt;em&gt;worse&lt;/em&gt; than a single T4 on this model (29.2 vs 35.7 tok/s). Ollama shards the model across all four GPUs even though it fits on one, and the PCIe communication overhead between four T4s costs more than it gains. The P40, with its larger 24GB per-card memory, likely fits more of the model per GPU, reducing cross-GPU transfers.&lt;/p&gt;
&lt;h4&gt;The MoE Advantage&lt;/h4&gt;
&lt;p&gt;The most compelling benchmark comes from OpenAI's gpt-oss, a 120-billion parameter mixture-of-experts model with only 5.1 billion active parameters per token. The MoE architecture means the model's total weight is large (it needs the memory), but the computation per token is modest (only a fraction of the parameters fire for any given input).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;4x P40&lt;/th&gt;
&lt;th&gt;4x T4 (AWS $3.91/hr)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-oss&lt;/td&gt;
&lt;td&gt;120B MoE (5.1B active)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28.1 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20.6 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The P40 runs OpenAI's 120B model at 28.1 tokens per second, 36% faster than the cloud instance, and fast enough for comfortable interactive use. This is a state-of-the-art model running on decade-old GPUs at a speed that would have been impressive on much newer hardware a year ago.&lt;/p&gt;
&lt;p&gt;The reason is memory. The gpt-oss model uses MXFP4 quantization on its MoE weights, bringing the total model size to about 65GB. Four P40s offer 96GB of VRAM, enough to hold the entire model in GPU memory. Four T4s offer only 64GB, which means some of the model likely spills to system RAM, adding latency on every token.&lt;/p&gt;
&lt;p&gt;This is the P40's superpower: 24GB per card was overkill in 2016, and it is exactly right in 2026. Models have grown to fill the memory, and the P40 has more of it per dollar than almost anything else on the market.&lt;/p&gt;
&lt;h4&gt;Where It Falls Apart&lt;/h4&gt;
&lt;p&gt;Dense 70B models are a different story. Llama 3.1 70B at Q4_0 quantization (39GB) fits across 96GB of P40 VRAM, but the inference speed is essentially unusable: 0.033 tokens per second. One token every thirty seconds. Answering "What is 2+2?" took six and a half minutes. The combination of no Tensor Cores, PCIe 3.0 interconnect, and the sheer volume of cross-GPU data transfers for a dense 70B model pushes the per-token latency beyond any practical threshold.&lt;/p&gt;
&lt;p&gt;The quad T4 on AWS managed 2.0 tokens per second on the same model, sixty times faster. Slow, but functional. The T4's Tensor Cores make the difference here; at this scale, the P40's raw CUDA cores simply cannot keep up with the matrix math.&lt;/p&gt;
&lt;p&gt;The lesson: MoE models and quantized models up to about 8B parameters are the P40's sweet spot. Dense models above 13B start hitting diminishing returns. Dense 70B is a wall.&lt;/p&gt;
&lt;h3&gt;The Cost Argument&lt;/h3&gt;
&lt;p&gt;Here is the math that justifies the project.&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;g4dn.12xlarge&lt;/code&gt; on AWS (four Tesla T4s, 48 vCPUs, 192GB RAM) costs $3.91 per hour. My home lab outperforms it on every model except the smallest. If I run inference for just four hours a day, the cloud cost would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Daily&lt;/strong&gt;: $15.64&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monthly&lt;/strong&gt;: $469&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Yearly&lt;/strong&gt;: $5,694&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My server cost $2,500 to build. It pays for itself in roughly five months of equivalent cloud usage. After that, the only ongoing cost is electricity. At Minnesota residential rates (roughly $0.12/kWh) and an average draw of 800W under load, that is about $70 per month, less than a single day of the equivalent cloud instance running around the clock.&lt;/p&gt;
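&lt;p&gt;The break-even arithmetic, spelled out (same assumptions as above: four hours a day against the on-demand rate, 800W continuous draw at $0.12/kWh):&lt;/p&gt;

```python
cloud_monthly = 3.91 * 4 * 30                 # g4dn.12xlarge, 4 hr/day
payback_months = 2500 / cloud_monthly         # build cost over cloud savings
electricity_monthly = 0.800 * 24 * 30 * 0.12  # 800W continuous

print(f"cloud equivalent: ${cloud_monthly:.0f}/month")
print(f"payback: {payback_months:.1f} months")
print(f"electricity after that: ${electricity_monthly:.0f}/month")
```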
&lt;p&gt;Even if you factor in the P40's lower performance on some workloads and assume you only get 70% of the cloud equivalent's utility, the break-even point is still well under a year. For a home lab that runs 24/7 for development, experimentation, and &lt;a href="https://tinycomputers.io/posts/clean-room-z80-emulator.html"&gt;text-to-speech generation&lt;/a&gt;, the economics are overwhelming.&lt;/p&gt;
&lt;h3&gt;What I Actually Use It For&lt;/h3&gt;
&lt;p&gt;The server runs several workloads:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Local LLM inference.&lt;/strong&gt; This is the primary use case. Having a local inference server with 96GB of VRAM means I can run frontier-class open-weight models without sending data to a cloud API. For development work, where I might make hundreds of inference calls while iterating on a project, the zero marginal cost changes how I work. I experiment more freely when each query costs nothing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text-to-speech.&lt;/strong&gt; I run &lt;a href="https://tinycomputers.io/posts/clean-room-z80-emulator.html"&gt;Qwen TTS&lt;/a&gt; on the P40s to generate audio narration for blog posts. The model fits comfortably in the P40's memory, and the generation speed is acceptable for batch processing. The narration you hear on posts across this site was generated on these GPUs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Development and testing.&lt;/strong&gt; When I am building projects like &lt;a href="https://tinycomputers.io/posts/sampo-designing-a-16-bit-risc-cpu-from-scratch-part-1-theory-and-architecture.html"&gt;Sampo&lt;/a&gt; or &lt;a href="https://tinycomputers.io/posts/introducing-lattice-a-crystallization-based-programming-language.html"&gt;Lattice&lt;/a&gt;, having local GPU compute available for testing AI-assisted workflows means I do not need to worry about API rate limits or costs during intensive development sessions.&lt;/p&gt;
&lt;p&gt;The server sits on my local network at a static IP, accessible from any machine in the house. It is always on, always available, and always free to use. That availability changes your relationship with AI inference in ways that are hard to appreciate until you have lived with it. There is a psychological difference between "this costs two cents per query" and "this costs nothing per query." The first makes you think about whether the query is worth it. The second lets you experiment without friction, and that friction reduction, compounded across hundreds of daily interactions, fundamentally changes how you work.&lt;/p&gt;
&lt;p&gt;This is, incidentally, a small-scale example of the &lt;a href="https://tinycomputers.io/posts/jevons-paradox.html"&gt;Jevons Paradox&lt;/a&gt; I have been writing about in this blog's economics series. Making inference cheaper did not cause me to run the same number of queries and pocket the savings. It caused me to run dramatically more queries, on more models, for more projects, consuming more total compute than I ever would have purchased from a cloud provider. The efficiency created demand.&lt;/p&gt;
&lt;h3&gt;Should You Build One?&lt;/h3&gt;
&lt;p&gt;The honest answer is: it depends on what you value.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Build one if:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You run local inference regularly and the cloud costs are adding up&lt;/li&gt;
&lt;li&gt;You want 96GB of VRAM for under a thousand dollars in GPU costs&lt;/li&gt;
&lt;li&gt;You have the physical space, electrical capacity, and noise tolerance for a rack-mount server&lt;/li&gt;
&lt;li&gt;You enjoy the process of building and configuring systems; this is not a plug-and-play experience&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Do not build one if:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need the latest model performance (Tensor Cores, FP8, NVLink)&lt;/li&gt;
&lt;li&gt;You are training models, not running inference&lt;/li&gt;
&lt;li&gt;You need reliability guarantees; this is a home lab, not a production environment&lt;/li&gt;
&lt;li&gt;You are not comfortable with Linux system administration, driver debugging, and occasional hardware troubleshooting&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The P40 window will not last forever. As newer GPUs age out of data centers (the V100, the A100) the P40 will eventually lose its price-to-performance advantage. The V100, with its first-generation Tensor Cores and 32GB of HBM2, is already starting to appear at attractive secondary market prices. Within a year, it may be the new sweet spot. But right now, in early 2026, four P40s on eBay represent one of the best deals in GPU computing. Ninety-six gigabytes of VRAM, proven CUDA compatibility, and a decade of driver maturity, for the price of a weekend trip.&lt;/p&gt;
&lt;p&gt;The server in my shop building will keep running. The fans will keep screaming through the Minnesota winter. And I will keep running models on hardware that a hyperscaler discarded three years ago, at speeds that would have been remarkable on any hardware five years ago. That is the beauty of the secondary market: someone else paid for the R&amp;amp;D, someone else paid for the depreciation, and you get the compute.&lt;/p&gt;</description><category>ai</category><category>benchmarks</category><category>cuda</category><category>deep learning</category><category>ebay</category><category>enterprise hardware</category><category>gpu</category><category>home lab</category><category>inference</category><category>nvidia</category><category>ollama</category><category>tesla p40</category><guid>https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html</guid><pubDate>Wed, 11 Mar 2026 14:00:00 GMT</pubDate></item></channel></rss>