<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TinyComputers.io (Posts about home lab)</title><link>https://tinycomputers.io/</link><description></description><atom:link href="https://tinycomputers.io/categories/home-lab.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 A.C. Jokela 
&lt;!-- div style="width: 100%" --&gt;
&lt;a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"&gt;&lt;img alt="" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /&gt; Creative Commons Attribution-ShareAlike&lt;/a&gt;&amp;nbsp;|&amp;nbsp;
&lt;!-- /div --&gt;
</copyright><lastBuildDate>Fri, 17 Apr 2026 03:38:51 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>The Thing and the Endpoint: Why a Z80 Gathers a World and an API Doesn't</title><link>https://tinycomputers.io/posts/the-thing-and-the-endpoint.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/the-thing-and-the-endpoint_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;28 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;A Z80 DIP-40 weighs 5.7 grams. Run Zork on it, or run Zork in a browser emulator. The bytes execute the same way. One of these is a thing. The other isn't.&lt;/p&gt;
&lt;p&gt;That distinction has a name. Heidegger called it &lt;em&gt;Das Ding&lt;/em&gt;, the thing. He meant it in a specific sense that has nothing to do with how we normally use the word. A thing, for him, is something that gathers a world. A wine jug gathers earth (the clay, the grape), sky (the rain that watered the vines, the sun that ripened them), the mortals who drink from it and made it, and the occasion of its use. The jug is not a container that happens to have history. The gathering is the jug's being a jug.&lt;/p&gt;
&lt;p&gt;That sounds mystical on first read. On second read it describes something you already know. A Z80 RetroShield running CP/M and Zork at 2 a.m. on a workbench gathers a world in this specific sense. A request to an OpenAI endpoint does not, and cannot, and was deliberately designed not to. This essay is about why that difference matters, and why the people who build home labs and retro computing setups feel it even when they can't name it.&lt;/p&gt;
&lt;h3&gt;What the RetroShield Gathers&lt;/h3&gt;
&lt;p&gt;Start with the chip. The Z80 on my bench was fabricated by Zilog sometime in the late 1990s, which I know because the date code stamped on the plastic reads 9734. The silicon die underneath that plastic implements an instruction set designed in 1975 by Masatoshi Shima, the engineer who had already co-designed the Intel 4004 and 8080, and Federico Faggin, who had defected from Intel in 1974 to found Zilog. The Z80's register set inherits the 8080's. The opcode encoding is backwards-compatible with 8080 binaries. The chip in my hand is a physical artifact of a specific engineering defection.&lt;/p&gt;
&lt;p&gt;The plastic package is a DIP-40. Two rows of twenty pins, 0.6 inches between rows, 0.1 inches pin-to-pin. When you drop it into a machined socket, the pins bind slightly before seating. That's not sloppy tolerance, it's designed in: the socket contacts have to wipe against the pin to break through the oxide layer that forms on the tin plating. Every retro computer from the TRS-80 to the ZX Spectrum to the MSX used this package.&lt;/p&gt;
&lt;p&gt;The RetroShield is an Arduino shield. Erturk Kocalar published the original design on GitLab as open hardware. His version fits one Z80 on a 2x18 Arduino Mega bus header. Mine is a revision that fits two Z80s on the same shield, shared address and data buses, separate control signals on a supplementary header. The Gerbers were exported from pcb-rnd, a fork of the original gEDA PCB program maintained by Tibor Palinkas. The traces were placed by Freerouting, which Alfons Wirtz originally wrote on Oracle's dime, then re-released as MIT-licensed after Oracle lost interest. The board was fabricated by PCBWay in Shenzhen with a four-week turnaround. I soldered the DIP-40 sockets myself and discovered on the first power-up that every bus line was shorted to ground because the ground fill polygon's clearance cutouts hadn't fully encircled the pins in the Gerber export. Part 2 of this series is the story of finding that out.&lt;/p&gt;
&lt;p&gt;When the chip executes, it reads an opcode. 0xDB is &lt;code&gt;IN A, (n)&lt;/code&gt;, read a byte from an I/O port. The Arduino Mega firmware intercepts that read, treats the Z80 as if it were a CPU attached to a port-mapped terminal, and feeds bytes back. The bytes are a Z3 story file: Zork I, compiled in 1980 by the Dynamic Modeling Group at MIT's Laboratory for Computer Science using a language called ZIL, originally written in MDL on a PDP-10 and ported into a virtual machine that could run on any 8-bit or 16-bit home computer of the era. Infocom ported it to CP/M. CP/M ran on the Z80. The chain closes on itself.&lt;/p&gt;
&lt;p&gt;Typing "GO NORTH" into the serial terminal produces, after a pause, the text "You are in an open field west of a big white house, with a boarded front door." That pause is not latency in any network sense. It is the Z80, at 4 MHz, running the Z-machine interpreter through thirty or so thousand clock cycles, each of which is a real transition of real silicon on real power.&lt;/p&gt;
&lt;p&gt;That is the gathering. It is not decoration. Zilog is present. MIT is present. Kocalar is present. PCBWay is present. pcb-rnd is present. The engineers who decided in 1975 that the return instruction should be one byte are present, because the silicon they designed is still decoding that byte at 4 MHz on my bench. My own ground-fill debugging is present, because the fill polygon is gone from this revision and that absence has a history. The thing gathers.&lt;/p&gt;
&lt;h3&gt;What the Cloud Endpoint Gathers&lt;/h3&gt;
&lt;p&gt;You can play Zork in a browser. archive.org hosts a Frotz build compiled to WebAssembly. You click a link, a Z3 interpreter materializes in a JavaScript sandbox, a virtual screen renders a virtual terminal, and "GO NORTH" produces "You are in an open field west of a big white house." Bit for bit, the same bytes of output. The game is the same. The Z-machine is the same. The story file is the same.&lt;/p&gt;
&lt;p&gt;But nothing gathers.&lt;/p&gt;
&lt;p&gt;The browser tab is not a thing in Heidegger's sense. It is a runtime. Runtimes are designed to be interchangeable. Run the same Frotz build in Chrome, in Firefox, in Safari. Run it on a phone, on a desktop, on a Chromebook in a school. Each one produces identical output from identical input. That interchangeability is not an accident or a failure. It is the entire engineering accomplishment of the web stack. A Z-machine interpreter that only ran on one specific browser on one specific machine would be a lesser piece of software, not a greater one.&lt;/p&gt;
&lt;p&gt;This is even clearer if the emulator is on a cloud-hosted runtime. You click a link to play-zork.com, it spawns a container in some datacenter, the container runs Frotz, the output streams back to you over HTTPS. Where is Zork running right now? Physically, electrically, in which building? You do not know. You are not meant to know. The service's value proposition depends on you not knowing. If us-east-1 fails over to us-east-2, your session survives with at most a reconnect. If Vercel goes under and the operator moves to Cloudflare Workers, your experience is identical. The gathering is suppressed by design.&lt;/p&gt;
&lt;p&gt;The same is true at a higher level of abstraction. A call to &lt;code&gt;api.openai.com/v1/chat/completions&lt;/code&gt; hits some cluster of H100s somewhere. Maybe in Texas. Maybe in Iowa. Maybe in Norway. The model behind the endpoint has weights, trained on hardware you will never see by engineers you will never meet. Tomorrow OpenAI might swap the backing model. Or add a rate limit that returns 429. Or migrate the inference stack to Blackwell. Your code does not change. That is the contract. The contract is the endpoint. The thing behind the endpoint is deliberately, structurally, invisible.&lt;/p&gt;
&lt;p&gt;This is not a complaint. The contract is useful. A company running a Rails app wants exactly this: stable interface, invisible infrastructure, someone else's problem. But the cost of that abstraction, the thing you pay with, is the gathering. The endpoint cannot gather a world because the world behind it is required to be interchangeable with any other world that can satisfy the contract.&lt;/p&gt;
&lt;h3&gt;Heidegger's Jug&lt;/h3&gt;
&lt;p&gt;In 1950 Heidegger gave a lecture called &lt;em&gt;Das Ding&lt;/em&gt;. He spent most of it talking about a wine jug. The essay is notoriously hard to read and almost comically literal. He describes the jug's sides, its base, its void. He distinguishes the jug from a cup and from a bottle. He asks what it means for a jug to be a jug.&lt;/p&gt;
&lt;p&gt;His answer is that a jug is not defined by its shape, its material, or its containing function. A jug is defined by what it gathers. When wine is poured from the jug, there gathers in that pouring: the earth (the grape that grew in soil, the clay fired into the vessel), the sky (the rain, the sun), the mortals (the drinker, the potter, the host), and what he calls the divinities (the toast, the libation, the occasion that makes this pouring different from running tap water into a glass). The fourfold, he called it. Earth, sky, mortals, divinities.&lt;/p&gt;
&lt;p&gt;The fourfold is the part of the essay that reads as mystical. Ignore the specific terminology if it grates. The structural claim underneath is simpler: a thing is a thing to the extent that it is a node in a web of presence. The jug is not just a container. The jug is a place where a whole world becomes, briefly and locally, present.&lt;/p&gt;
&lt;p&gt;Heidegger's companion example, in a later essay, is the bridge. The old bridge at Heidelberg is a thing in his sense. It gathers the two banks, the river underneath, the road that runs across it, the people who cross. The bridge is what makes those things into a coherent place. The counter-example is the hydroelectric plant on the Rhine, which he treats in &lt;em&gt;The Question Concerning Technology&lt;/em&gt;. It is not a thing. It is a piece of what he called the standing-reserve, &lt;em&gt;Bestand&lt;/em&gt;. The plant converts the river into potential electrical output, on demand, interchangeable with any other kilowatt on the grid. The plant does not gather. It extracts.&lt;/p&gt;
&lt;p&gt;This is the same distinction that separates the Z80 on my bench from the cloud-hosted Frotz emulator. The Z80 is a bridge. The cloud emulator is a power plant.&lt;/p&gt;
&lt;h3&gt;What's Actually Different&lt;/h3&gt;
&lt;p&gt;The functional output is the same. That is the central puzzle. The bytes of Zork's output are identical. The game is playable in either location. The player's subjective experience of "GO NORTH" producing a description of the open field is the same to within the tolerance of noticing.&lt;/p&gt;
&lt;p&gt;What is different is what each running copy &lt;em&gt;means&lt;/em&gt;, in a sense of meaning that is not about semantics but about presence.&lt;/p&gt;
&lt;p&gt;The Z80 running Zork on my bench means: Zilog's 1975 design decisions, Infocom's 1980 implementation, Kocalar's open hardware, my four-week wait for PCBWay, my ground-fill debugging session, the specific 4 MHz crystal that drives this specific chip tonight. The game is the surface. The gathering is what makes the game &lt;em&gt;this&lt;/em&gt; game and not an abstract instance of gameplay.&lt;/p&gt;
&lt;p&gt;The cloud-hosted Zork means: the game. That's the whole content. The infrastructure is interchangeable by contract, the hardware is invisible by design, the history is irrelevant to the service. You play Zork. That is the product. The product is the endpoint. The endpoint is the product.&lt;/p&gt;
&lt;p&gt;This is why people who run home labs can tell you war stories and people who use APIs cannot. "Remember the fan seizing on the P40 in July." "Remember when the ground fill shorted every bus line." "Remember the first time Forth actually loaded and we watched OK appear on the terminal." These stories are possible because the thing is specific, present, and has its own biography. "Remember that 503 from OpenAI last Tuesday" is not a story. It is a status page entry. The difference is not nostalgia or sentimentality. The difference is that one event happened to a thing and the other event happened to a contract.&lt;/p&gt;
&lt;h3&gt;The Enframing Connection&lt;/h3&gt;
&lt;p&gt;I wrote earlier about &lt;em&gt;Enframing&lt;/em&gt;, Heidegger's term for the mode of revealing that dominates the modern technological era. Enframing, &lt;em&gt;Gestell&lt;/em&gt;, is the stance that frames everything in advance as standing-reserve: resources on call, available on demand, interchangeable for the purpose at hand. The hydroelectric plant enframes the river as kilowatts. The modern timber industry enframes the forest as board-feet. The cloud endpoint enframes computation as a billable unit.&lt;/p&gt;
&lt;p&gt;Enframing is not a villain in Heidegger's telling. It is not a mistake. It is a stance that reveals certain truths about things, specifically their exchangeability as resources, at the cost of concealing other truths, specifically their being as things.&lt;/p&gt;
&lt;p&gt;The cloud endpoint is what Enframing looks like at the level of infrastructure. The GPU cluster is enframed as tokens-per-second, which are enframed as dollars-per-million-tokens, which are enframed as a line item on an invoice. That enframing is what makes the cloud economically tractable. It is also what makes the cloud unable to gather.&lt;/p&gt;
&lt;p&gt;The Z80 on my bench resists Enframing. Not because it's old or small or personal, but because I haven't framed it that way. I haven't asked it to be interchangeable. I haven't said "give me CP/M compute on demand at the lowest price." I have said "here is this specific chip, running this specific program, in this specific session." That's not a resource request. That's a relationship with a thing.&lt;/p&gt;
&lt;p&gt;This essay is not a sequel to &lt;em&gt;Enframing the Code&lt;/em&gt;. It is a companion piece, addressing what Enframing costs. Enframing names the stance. This one names what falls out of view when the stance becomes total.&lt;/p&gt;
&lt;h3&gt;Why People Build&lt;/h3&gt;
&lt;p&gt;The retro computing and home lab communities do something that looks, from an economic standpoint, irrational. They spend four-week lead times and hundreds of dollars to produce hardware that they could replace with a five-minute browser session for free. They run LLMs on Tesla P40s pulled out of eBay auction lots when the equivalent API call would cost fractions of a cent. They solder DIP-40 sockets in their basements when the emulator is a click away.&lt;/p&gt;
&lt;p&gt;You can explain this as nostalgia, and people sometimes do. You can explain it as hobby, and that's also partly right. You can explain it as skill acquisition, which is closer but still not the reason. The economic irrationality goes away the moment you stop assuming that the only value of running Zork is playing Zork.&lt;/p&gt;
&lt;p&gt;People build because the thing gathers. The RetroShield is not just a way to run Zork. It is a way to make Infocom's 1980 engineering present in the room tonight. It is a way to put Faggin's chip design decisions into active service at 4 MHz. It is a way to hold the physical object that descends from Zilog's break with Intel, from MIT's Dynamic Modeling Group, from the whole genealogy of 8-bit personal computing, and to use that object for its intended purpose on a Tuesday evening fifty years after the design was finalized.&lt;/p&gt;
&lt;p&gt;None of that is available through the endpoint. The endpoint is a contract for Zork. It is not a gathering of Zork's world.&lt;/p&gt;
&lt;p&gt;The feeling that people describe when they say "running Zork on a real Z80 feels different" is not aesthetic preference. It is the presence of the gathering. Something is actually there that is not there when you run the emulator in a browser tab, and that something is not information. It is a specific thing's being a thing.&lt;/p&gt;
&lt;h3&gt;What This Predicts&lt;/h3&gt;
&lt;p&gt;A test of the claim: this framework predicts that communities will form around specific hardware and not around cloud providers, and it predicts which specific hardware will gather the most.&lt;/p&gt;
&lt;p&gt;Communities form around the Tesla P40. Around the Raspberry Pi. Around the RetroShield. Around specific FPGA boards like the ULX3S and the Tang Nano 9K. Around the PDP-11 (still). Around the Apple IIe. Around the Amiga. Around AMD's Strix Halo in my own recent posts. The common feature: these are things with specific histories, specific constraints, specific failure modes, specific communities of use.&lt;/p&gt;
&lt;p&gt;Communities do not form around "the API endpoint for a frontier LLM." They do not form around "managed Postgres." They do not form around "us-east-1." There are users of those things, and there are engineers who get very good at using them, but the thing itself is not a gathering point because the thing is structurally interchangeable. You can run managed Postgres on AWS or GCP or Azure. It doesn't matter. That's the value. That's also why no one has a tattoo of managed Postgres.&lt;/p&gt;
&lt;p&gt;Within the cloud, communities do sometimes form, but they form around thing-like artifacts: specific open-source projects like Kubernetes or Postgres itself, specific hardware generations like the original A100 launch or the H200 launch, specific incidents like the us-east-1 outage of December 2021. The gathering happens when the abstraction fails or when a specific thing peeks through.&lt;/p&gt;
&lt;p&gt;This is not a prediction that cloud computing will fail or that people will abandon it. They won't. The endpoint is too useful. The prediction is narrower: the cloud will never gather the way things gather, and people will keep building physical hardware in their basements even when it is economically irrational, because the gathering is not available any other way.&lt;/p&gt;
&lt;h3&gt;The Chip on the Bench&lt;/h3&gt;
&lt;p&gt;I started with the weight of a Z80, 5.7 grams. End there. The chip is still on my bench. It is in a socket. The socket is on a PCB. The PCB is in a Mega header. The Mega is connected to my laptop by USB. The laptop is rendering a serial terminal. The terminal is showing the Zork prompt. The prompt is waiting.&lt;/p&gt;
&lt;p&gt;The physical object in front of me is small. It fits under my thumb. It was designed fifty years ago by an engineer who had just quit Intel. It has been sitting in a drawer for some years. Tonight it is running. Tonight a specific piece of silicon, fabricated in 1997, is decoding instructions written in 1980 by people in Cambridge, Massachusetts, to produce text that was designed to be read by someone sitting at a CRT terminal in a dorm room in 1982. That whole world is present on my bench, gathered by this chip, for as long as I keep the 4 MHz crystal running.&lt;/p&gt;
&lt;p&gt;When I type "GO NORTH" and the text appears, I am not receiving a service. I am participating in a thing that is thinging, in Heidegger's awkward verb form. I am one of the mortals in the fourfold. Faggin is one. Shima is one. The Infocom implementers are some. Kocalar is one. PCBWay's fabrication technicians are some. We are all gathered around the chip for the duration of this session.&lt;/p&gt;
&lt;p&gt;The API endpoint offers me none of this. The API endpoint offers me Zork. That's a different thing entirely, and most of the time it's what I want. But sometimes, on a Tuesday evening in 2026, it isn't, and the reason why has a name.&lt;/p&gt;</description><category>abstraction</category><category>cloud</category><category>das ding</category><category>hardware</category><category>heidegger</category><category>home lab</category><category>philosophy</category><category>retro computing</category><category>retroshield</category><category>z80</category><guid>https://tinycomputers.io/posts/the-thing-and-the-endpoint.html</guid><pubDate>Thu, 16 Apr 2026 13:00:00 GMT</pubDate></item><item><title>Running a 22B Video Model on Four Tesla P40s</title><link>https://tinycomputers.io/posts/running-ltx-video-on-four-tesla-p40s.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/running-ltx-video-on-four-tesla-p40s_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;22 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;LTX-Video 2.3 is a 22 billion parameter model that generates video from text prompts. It was designed for modern hardware: GPUs with bfloat16 support, high-bandwidth memory, and enough VRAM to hold the full model on one or two cards. The &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;Tesla P40&lt;/a&gt; has none of these things. It is a Pascal-generation GPU from 2016, with 24GB of GDDR5 per card, no native bfloat16, no Tensor Cores, and a PCIe 3.0 bus. It was built for data center inference workloads that no longer exist.&lt;/p&gt;
&lt;p&gt;I have four of them in a rack-mount server in an unheated shop building in Minnesota. Together they provide 96GB of VRAM. The question was whether that 96GB, spread across four old cards, could run a model that was never meant to run on any of them.&lt;/p&gt;
&lt;p&gt;The answer is yes, with significant caveats and a substantial amount of code to work around hardware limitations that the model's authors never anticipated.&lt;/p&gt;
&lt;h3&gt;The Problem&lt;/h3&gt;
&lt;p&gt;LTX-Video 2.3's transformer has 48 blocks. At fp16 precision, the model weights alone consume roughly 44GB. With the Gemma text encoder, the video VAE encoder/decoder, the spatial upsampler, and the audio components, the full pipeline needs more memory than any single P40 can provide. The model doesn't fit on one card. It doesn't fit on two. It barely fits on three, with no room for activations during inference.&lt;/p&gt;
&lt;p&gt;Four cards at 24GB each gives 96GB total, which is enough for the weights with room for intermediate activations. But CUDA doesn't automatically spread a model across multiple GPUs. You have to tell it how.&lt;/p&gt;
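&lt;p&gt;The arithmetic behind those numbers can be sketched in a few lines. This is only a back-of-envelope check: it counts the 44GB of fp16 weights and nothing else, so the headroom it reports still has to absorb the text encoder, VAE, upsampler, audio components, and activations, whose sizes are not given here. The function names are illustrative.&lt;/p&gt;

```python
# Back-of-envelope VRAM budget: raw weights only, no activations
# and none of the other pipeline components.
PARAMS = 22e9          # parameter count stated for the model
BYTES_PER_PARAM = 2    # fp16 stores each weight in two bytes
CARD_GB = 24           # VRAM per Tesla P40


def weights_gb(params, bytes_per_param):
    """Size of the raw weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def pooled_vram_gb(cards, per_card_gb=CARD_GB):
    """Total VRAM across identical cards."""
    return cards * per_card_gb


model_gb = weights_gb(PARAMS, BYTES_PER_PARAM)
for cards in range(1, 5):
    pooled = pooled_vram_gb(cards)
    headroom = pooled - model_gb
    print(f"{cards} card(s): {pooled} GB pooled, {headroom:+.0f} GB left after weights")
```

&lt;p&gt;The pooled figure is also optimistic in another way: memory on separate cards is not one address space, so every layer has to fit entirely on the card it is assigned to.&lt;/p&gt;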
&lt;p&gt;The standard approach for multi-GPU inference is &lt;code&gt;accelerate&lt;/code&gt;'s &lt;code&gt;dispatch_model&lt;/code&gt;, which automatically distributes model layers across available GPUs based on memory constraints. This works for the Gemma text encoder, which is a straightforward transformer. For the LTX transformer, it doesn't work, because the model has a custom forward pass with audio-video cross-attention that &lt;code&gt;accelerate&lt;/code&gt;'s automatic dispatch can't handle correctly. The model needs to move data between GPUs at specific points in the forward pass, and &lt;code&gt;accelerate&lt;/code&gt; doesn't know where those points are.&lt;/p&gt;
&lt;p&gt;The solution was manual pipeline parallelism: split the 48 transformer blocks evenly across four GPUs (12 blocks per card), keep the shared components (patchify projections, normalization, output projections) on GPU 0, and write a custom forward pass that moves tensors between devices at block boundaries.&lt;/p&gt;
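&lt;p&gt;A minimal sketch of that placement plan, in plain Python so it runs without a GPU. It assumes contiguous stages and an even split; the function names and the &lt;code&gt;cuda:N&lt;/code&gt; device strings are illustrative and not taken from the actual script.&lt;/p&gt;

```python
# Sketch of the even split: 48 blocks over 4 GPUs in contiguous stages.
def build_block_placement(num_blocks=48, num_gpus=4):
    """Map each transformer block index to a device string."""
    if num_blocks % num_gpus != 0:
        raise ValueError("this sketch only handles an even split")
    per_gpu = num_blocks // num_gpus
    return {i: f"cuda:{i // per_gpu}" for i in range(num_blocks)}


def transfer_points(placement):
    """Block indices whose input must first be moved to a new device."""
    return [
        i for i in sorted(placement)
        if i > 0 and placement[i] != placement[i - 1]
    ]


placement = build_block_placement()
hops = transfer_points(placement)
print(hops)  # [12, 24, 36]
```

&lt;p&gt;In a custom forward pass, the hidden states would be moved with something like &lt;code&gt;x.to(placement[i])&lt;/code&gt; just before each block listed in &lt;code&gt;hops&lt;/code&gt;, and moved back to GPU 0 at the end, since that is where the output projections live.&lt;/p&gt;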
&lt;h3&gt;The Precision Problem&lt;/h3&gt;
&lt;p&gt;Even with the model split across four cards, nothing worked on the first attempt. Or the fifth. Getting LTX-Video running on Pascal hardware was an iterative process, with Claude Code generating solutions and me testing them against the actual hardware. Each failure revealed another assumption the model made about the GPU it would run on. The feedback loop was brutal: load a 22B model across four GPUs, wait eight minutes for a test generation, get a black frame or a NaN error, diagnose which precision boundary caused it, generate a fix, and try again.&lt;/p&gt;
&lt;p&gt;The first problem was bfloat16. The model weights are stored in bf16 format. Pascal GPUs cannot compute in bf16. PyTorch handles this silently for some operations by promoting to fp32, but other operations fail or produce garbage. The initial approach was the obvious one: monkey-patch &lt;code&gt;torch.bfloat16&lt;/code&gt; to redirect to &lt;code&gt;torch.float16&lt;/code&gt;. This seemed to work at load time. The model loaded, the weights populated, no errors. Then the first forward pass produced NaN everywhere. The monkey-patch had corrupted the safetensors weight loading. The bf16 bit patterns in the file were being read as if they were fp16 values, which is not the same thing. A bf16 value of 1.0 has a different bit pattern than an fp16 value of 1.0. Reinterpret one as the other and you get a number that's either wildly wrong or NaN.&lt;/p&gt;
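&lt;p&gt;The bit-pattern mismatch is easy to see with the standard library alone. The sketch below builds the bf16 pattern by taking the top 16 bits of the float32 encoding (plain truncation, which is exact for 1.0) and then reads those same bits back as fp16. The helper names are made up for illustration.&lt;/p&gt;

```python
import struct


def bf16_bits(x):
    """bfloat16 bit pattern: top 16 bits of the float32 encoding (truncated)."""
    (bits32,) = struct.unpack("!I", struct.pack("!f", x))
    return bits32 >> 16


def fp16_bits(x):
    """IEEE half-precision bit pattern."""
    (bits16,) = struct.unpack("!H", struct.pack("!e", x))
    return bits16


def read_as_fp16(bits16):
    """Reinterpret a raw 16-bit pattern as fp16, with no numeric conversion."""
    (value,) = struct.unpack("!e", struct.pack("!H", bits16))
    return value


print(hex(bf16_bits(1.0)))           # 0x3f80
print(hex(fp16_bits(1.0)))           # 0x3c00
print(read_as_fp16(bf16_bits(1.0)))  # 1.875
```

&lt;p&gt;bf16 spends 8 bits on the exponent and 7 on the mantissa, while fp16 spends 5 and 10, so the same sixteen bits decode to different numbers. A weight of 1.0 silently becomes 1.875, and small weights come out far larger than they should.&lt;/p&gt;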
&lt;p&gt;The second attempt tried running everything in fp16 natively, converting weights properly during load. This got further: the model produced output that wasn't NaN. But the output was a solid green frame. The intermediate activations in the transformer blocks were overflowing fp16 range. Values above 65,504 become infinity in fp16, and the model's internal representations regularly exceed that during the attention and feedforward passes. The green frame was the model's attempt to decode latents that had been clipped to infinity at some point in the pipeline.&lt;/p&gt;
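&lt;p&gt;The 65,504 ceiling follows directly from the fp16 layout: a 10-bit mantissa and a largest normal exponent of 15. The sketch below derives it and uses Python's &lt;code&gt;struct&lt;/code&gt; module as a stand-in range check. One difference to keep in mind: &lt;code&gt;struct&lt;/code&gt; raises an error on overflow, whereas GPU arithmetic quietly produces infinity.&lt;/p&gt;

```python
import struct

# Largest finite fp16 value: mantissa all ones, maximum normal exponent 15.
FP16_MAX = (2 - 2 ** -10) * 2 ** 15


def overflows_fp16(x):
    """True if the finite float x is too large in magnitude for fp16."""
    try:
        struct.pack("!e", x)
        return False
    except OverflowError:
        return True


print(FP16_MAX)                 # 65504.0
print(overflows_fp16(65504.0))  # False
print(overflows_fp16(70000.0))  # True
```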
&lt;p&gt;The working solution was to let the model builder properly convert weights from bf16 to fp16 on load, then run the entire computation pipeline in float32. The weights sit in memory as fp16 (saving space), but every computation promotes to fp32 before executing. This required patching &lt;code&gt;F.linear&lt;/code&gt; to handle mixed dtype inputs:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;_orig_linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_mixed_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;bias&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_orig_linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linear&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mixed_linear&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The same pattern extends to every normalization function and every convolution operation. Layer norm, group norm, RMS norm, conv1d through conv_transpose3d: all patched to handle mixed dtypes and accumulate in float32. Without these patches, intermediate values overflow fp16 range (values above 65,504 become infinity) and the output is a black frame.&lt;/p&gt;
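&lt;p&gt;One way to avoid writing that patch by hand for each function is a generic wrapper. The sketch below is an assumption about how such a wrapper could look, not the patch set actually used: it casts every tensor-like argument to the dtype of the first one, mirroring the &lt;code&gt;F.linear&lt;/code&gt; patch. It is duck-typed on &lt;code&gt;dtype&lt;/code&gt; and &lt;code&gt;to&lt;/code&gt;, and &lt;code&gt;FakeTensor&lt;/code&gt; and &lt;code&gt;fake_layer_norm&lt;/code&gt; are stand-ins so the example runs without PyTorch.&lt;/p&gt;

```python
import functools


def promote_to_first_dtype(fn):
    """Wrap fn so every tensor-like argument is cast to the first one's dtype."""
    def is_tensor_like(a):
        return hasattr(a, "dtype") and hasattr(a, "to")

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        tensors = [a for a in list(args) + list(kwargs.values()) if is_tensor_like(a)]
        if not tensors:
            return fn(*args, **kwargs)
        target = tensors[0].dtype

        def cast(a):
            # Leave non-tensors (shapes, eps values, None) untouched.
            if is_tensor_like(a) and a.dtype != target:
                return a.to(target)
            return a

        return fn(*[cast(a) for a in args], **{k: cast(v) for k, v in kwargs.items()})
    return wrapper


class FakeTensor:
    """Stand-in with just enough surface (dtype, to) to exercise the wrapper."""
    def __init__(self, dtype):
        self.dtype = dtype

    def to(self, dtype):
        return FakeTensor(dtype)


@promote_to_first_dtype
def fake_layer_norm(x, weight=None, bias=None):
    # Report the dtypes the wrapped op actually receives.
    return (x.dtype, weight.dtype, bias)


result = fake_layer_norm(FakeTensor("float32"), weight=FakeTensor("float16"))
print(result)  # ('float32', 'float32', None)
```

&lt;p&gt;With real PyTorch the same wrapper could presumably be applied as &lt;code&gt;F.layer_norm = promote_to_first_dtype(F.layer_norm)&lt;/code&gt;, as long as the activations reaching it are already float32.&lt;/p&gt;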
&lt;h3&gt;The Gemma Problem&lt;/h3&gt;
&lt;p&gt;The text encoder is Google's Gemma 3, a separate model that converts text prompts into embeddings the video transformer can condition on. Gemma's attention mechanism overflows when run in fp16 on Pascal hardware. The attention scores grow large enough to exceed fp16 range, producing NaN values that propagate through the rest of the pipeline.&lt;/p&gt;
&lt;p&gt;The fix was running the entire Gemma encoder in float32. This uses more memory, but the text encoder only runs once per generation (to encode the prompt), and its weights can be freed from GPU memory before the transformer starts. The sequence is: load Gemma across all four GPUs using &lt;code&gt;accelerate&lt;/code&gt;, encode the prompt in float32, delete the encoder, free the memory, then load the video transformer.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;encode_prompt_float32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_ledger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model_ledger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;
    &lt;span class="n"&gt;te&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_ledger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_encoder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Dispatch across all 4 GPUs for memory&lt;/span&gt;
    &lt;span class="n"&gt;max_memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_balanced_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"22GiB"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
        &lt;span class="n"&gt;no_split_module_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Gemma3DecoderLayer"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;te&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dispatch_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;hidden_states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attention_mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;te&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Free GPU memory before transformer loads&lt;/span&gt;
    &lt;span class="k"&gt;del&lt;/span&gt; &lt;span class="n"&gt;te&lt;/span&gt;
    &lt;span class="n"&gt;gc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
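    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;empty_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hidden_states&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attention_mask&lt;/span&gt;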
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This load-encode-delete cycle is ugly but necessary. There isn't enough total memory to hold both Gemma and the video transformer simultaneously, even across four cards. The sequential approach works because each component only needs to exist during its phase of the pipeline.&lt;/p&gt;
&lt;h3&gt;The Pipeline&lt;/h3&gt;
&lt;p&gt;The generation runs in two stages, matching LTX-Video's distilled inference schedule.&lt;/p&gt;
&lt;p&gt;Stage 1 generates a half-resolution latent video (e.g., 256x384) through 8 denoising steps. Each step runs the full 48-block transformer, with data moving across all four GPUs:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;patched_process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;perturbations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ltx&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transformer_blocks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;dev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block_devices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;perturbations&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;perturbations&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;video&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;move_args_to_device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;video&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audio&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Every GPU boundary involves a tensor transfer across PCIe 3.0. With 12 blocks per GPU, there are 3 boundary crossings per denoising step (GPU 0 to 1, 1 to 2, 2 to 3), plus a final transfer back to GPU 0. With 8 denoising steps, that's 32 cross-device transfers in Stage 1 (and another 12 for Stage 2's 3 steps), each moving both video and audio state tensors. PCIe 3.0 x16 has a theoretical bandwidth of ~16 GB/s. The tensors being transferred are small relative to the bandwidth (attention states and activations, not full weight matrices), so the overhead is manageable. But it adds up.&lt;/p&gt;
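&lt;p&gt;The transfer count can be checked with a few lines of Python. This is a sketch that assumes the 48 blocks are split into four contiguous groups of 12 and that the data starts and ends each step on GPU 0; the function name is hypothetical, and &lt;code&gt;block_devices&lt;/code&gt; here holds plain integers rather than real device handles.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
NUM_BLOCKS = 48
NUM_GPUS = 4

# Assumed layout: contiguous, equal-sized groups of blocks per GPU
block_devices = [i // (NUM_BLOCKS // NUM_GPUS) for i in range(NUM_BLOCKS)]


def transfers_per_step(devices, home=0):
    """Count device changes in one pass through the blocks, plus the return trip."""
    hops = sum(1 for a, b in zip(devices, devices[1:]) if a != b)
    if devices[-1] != home:
        hops += 1  # final move back to the home GPU
    return hops


per_step = transfers_per_step(block_devices)
print(per_step)      # 4
print(per_step * 8)  # 32 for the 8 steps of Stage 1
print(per_step * 3)  # 12 for the 3 steps of Stage 2
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Each counted hop moves both the video and the audio state, so the number of individual tensor copies is higher than the hop count.&lt;/p&gt;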
&lt;p&gt;Stage 1 takes roughly 4 minutes for 241 frames at 24 fps (a 10-second clip). The spatial upsampler then doubles the resolution. Stage 2 runs 3 more denoising steps at full resolution (512x768), taking roughly 6.5 minutes. The VAE decoder converts latents to pixels and generates the audio track in another 40 seconds.&lt;/p&gt;
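&lt;p&gt;Total generation time for a 10-second, 512x768 video with audio: approximately 18.5 minutes. The three stages above account for a little over 11 minutes of that; the rest is per-job setup, such as loading the models and running the Gemma load-encode-delete cycle. For a 1-second clip (25 frames): about 8 minutes. For a 4-second clip (97 frames): about 10.5 minutes.&lt;/p&gt;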
&lt;h3&gt;The Memory Layout&lt;/h3&gt;
&lt;p&gt;During inference, the four GPUs aren't loaded equally. GPU 0 carries extra weight because it hosts all the shared components (patchify projections, normalization layers, output projections) plus its 12 transformer blocks. The actual memory distribution:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;VRAM Used&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;10.8 GB&lt;/td&gt;
&lt;td&gt;Shared components + blocks 0-11&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;9.3 GB&lt;/td&gt;
&lt;td&gt;Blocks 12-23&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;9.3 GB&lt;/td&gt;
&lt;td&gt;Blocks 24-35&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;9.3 GB&lt;/td&gt;
&lt;td&gt;Blocks 36-47&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;That's 38.7 GB of the available 96 GB. The remaining 57 GB provides headroom for activations, KV cache growth, and the VAE decoder. There's enough margin that generation never OOMs, even at 241 frames.&lt;/p&gt;
&lt;h3&gt;The API&lt;/h3&gt;
&lt;p&gt;Running inference from the command line is fine for testing, but generating videos for blog content requires something more practical. I wrapped the generation script in a FastAPI server with an async job queue:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# Submit a text-to-video job&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;-X&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prompt=A cinematic flyover of a Zilog Z80 processor on a PCB"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duration=10"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"seed=42"&lt;/span&gt;

&lt;span class="c1"&gt;# Submit an image-to-video job&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;-X&lt;span class="w"&gt; &lt;/span&gt;POST&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prompt=A fluffy orange cat dancing"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"duration=4"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-F&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"image=@cat.jpg"&lt;/span&gt;

&lt;span class="c1"&gt;# Check status&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs/07420abb6d82

&lt;span class="c1"&gt;# Download result&lt;/span&gt;
curl&lt;span class="w"&gt; &lt;/span&gt;http://10.1.1.24:8585/jobs/07420abb6d82/video&lt;span class="w"&gt; &lt;/span&gt;-o&lt;span class="w"&gt; &lt;/span&gt;output.mp4
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Jobs queue and execute sequentially. The four GPUs can only handle one generation at a time, and the load-encode-delete cycle for Gemma means there's significant setup overhead per job. The API spawns each job as a subprocess, which gives clean GPU memory cleanup between runs. If a generation crashes (which happened frequently during development), the next job starts fresh.&lt;/p&gt;
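&lt;p&gt;The subprocess-per-job idea can be sketched independently of FastAPI. The following is a simplified illustration, not the server's actual code: the job IDs and the stand-in commands are hypothetical, and a real worker would launch the generation script instead of a one-line Python command.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
import queue
import subprocess
import sys


def run_jobs(jobs):
    """Run each job command in its own process, one at a time, in FIFO order."""
    results = {}
    while not jobs.empty():
        job_id, command = jobs.get()
        # A fresh process per job: when it exits, everything it held is
        # released, whether it finished cleanly or crashed.
        completed = subprocess.run(command, capture_output=True, text=True)
        results[job_id] = "done" if completed.returncode == 0 else "failed"
    return results


# Stand-in commands for illustration only
jobs = queue.Queue()
jobs.put(("job-1", [sys.executable, "-c", "raise SystemExit(1)"]))  # simulated crash
jobs.put(("job-2", [sys.executable, "-c", "print('ok')"]))

print(run_jobs(jobs))  # {'job-1': 'failed', 'job-2': 'done'}
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The failed first job has no effect on the second. That isolation is the reason to pay the process startup cost on every run.&lt;/p&gt;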
&lt;p&gt;The server supports both text-to-video and image-to-video. Image conditioning locks the first frame to a provided image and generates subsequent frames from it, which produces more controllable results for specific visual subjects. In practice, image-to-video is the more useful mode. Text-to-video gives the model complete creative freedom, which means the output is unpredictable. You might ask for a Z80 processor and get something that looks like a generic IC, or something that looks like a Z80, depending on the seed. Image-to-video lets you provide the exact first frame you want and the model animates from there. For blog content where visual accuracy matters, starting from a real photograph or a specific reference image gives consistently better results.&lt;/p&gt;
&lt;h3&gt;What the Output Looks Like&lt;/h3&gt;
&lt;p&gt;The video quality is genuinely good. LTX-Video 2.3 produces coherent motion, reasonable physics, and detailed textures. Here are three examples, generated entirely on the P40 server:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text-to-video: "A cinematic flyover of a Zilog Z80 processor on a printed circuit board" (10 seconds, 18.5 minutes to generate)&lt;/strong&gt;&lt;/p&gt;
&lt;video controls preload="metadata" style="max-width: 100%; border-radius: 6px; box-shadow: 0 10px 20px rgba(0,0,0,.1); margin: 1em 0;"&gt;
&lt;source src="https://tinycomputers.io/ltx-z80-flyover.mp4" type="video/mp4"&gt;
&lt;/source&gt;&lt;/video&gt;

&lt;p&gt;&lt;strong&gt;Image-to-video: "A fluffy orange cat with a hat dancing" (4 seconds, 10.5 minutes to generate)&lt;/strong&gt;&lt;/p&gt;
&lt;video controls preload="metadata" style="max-width: 100%; border-radius: 6px; box-shadow: 0 10px 20px rgba(0,0,0,.1); margin: 1em 0;"&gt;
&lt;source src="https://tinycomputers.io/ltx-cat-dancing.mp4" type="video/mp4"&gt;
&lt;/source&gt;&lt;/video&gt;

&lt;p&gt;&lt;strong&gt;Text-to-video: "A cat sitting on a windowsill, sunlight streaming in" (1 second, 8 minutes to generate)&lt;/strong&gt;&lt;/p&gt;
&lt;video controls preload="metadata" style="max-width: 100%; border-radius: 6px; box-shadow: 0 10px 20px rgba(0,0,0,.1); margin: 1em 0;"&gt;
&lt;source src="https://tinycomputers.io/ltx-cat-windowsill.mp4" type="video/mp4"&gt;
&lt;/source&gt;&lt;/video&gt;

&lt;p&gt;The model understands object permanence, lighting consistency, and basic spatial relationships. The Z80 flyover produces a recognizable IC package with surrounding components, proper lighting, and smooth camera movement.&lt;/p&gt;
&lt;p&gt;The audio is a different story. LTX-Video 2.3 generates an audio track alongside the video, but the results are inconsistent. Prompts describing characters speaking produce odd ambient music instead of voices. Prompts describing environments produce vaguely appropriate soundscapes. The audio pipeline works mechanically (it generates real audio waveforms via a separate VAE decoder and vocoder), but the semantic connection between prompt and audio output is weak. For blog content, I'd likely strip the generated audio and add narration or music separately.&lt;/p&gt;
&lt;p&gt;The 512x768 resolution at 24fps is usable for web content. It's not 4K. It's not going to replace stock footage for production video. But for blog hero images in motion, visual demonstrations, or supplementary content alongside text, it works.&lt;/p&gt;
&lt;h3&gt;What This Cost&lt;/h3&gt;
&lt;p&gt;The hardware cost is zero incremental. The four P40s and the server already existed for &lt;a href="https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html"&gt;LLM inference&lt;/a&gt;. LTX-Video is an additional workload on the same hardware.&lt;/p&gt;
&lt;p&gt;The electricity cost is modest. The server draws roughly 500W under full GPU load. An 18.5-minute generation (10-second video at full resolution) consumes about 0.15 kWh, roughly $0.024 at Minnesota residential rates. You could generate forty 10-second clips for a dollar.&lt;/p&gt;
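&lt;p&gt;The arithmetic behind those figures, as a short sketch. It assumes a constant 500W draw for the whole run and a rate of $0.157/kWh, which is the Minnesota residential rate quoted in the inference economics post; the function name is made up for illustration.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
def generation_cost(watts, minutes, dollars_per_kwh):
    """Return (kWh, dollars) for one run at a constant power draw."""
    kwh = (watts / 1000) * (minutes / 60)
    return kwh, kwh * dollars_per_kwh


kwh, dollars = generation_cost(watts=500, minutes=18.5, dollars_per_kwh=0.157)
print(round(kwh, 3))      # 0.154
print(round(dollars, 3))  # 0.024
print(int(1 / dollars))   # 41 clips per dollar
```
&lt;/pre&gt;&lt;/div&gt;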
&lt;p&gt;The real cost was development time. Getting from "model downloaded" to "working generation pipeline" took many iterations across multiple sessions with Claude Code. Each precision-related failure mode (bf16 corruption, fp16 overflow, mixed-dtype kernel errors, NaN propagation through attention) required diagnosis, a hypothesis, a code change, and a test cycle that involved loading a 22B model across four GPUs. The feedback loop was slow. A single test takes 8 to 18 minutes to confirm whether a change worked. Many didn't.&lt;/p&gt;
&lt;h3&gt;The Broader Point&lt;/h3&gt;
&lt;p&gt;A 22 billion parameter video generation model was not designed to run on 2016 hardware. The authors assumed bf16, assumed modern attention kernels, assumed enough memory on one or two cards. None of those assumptions hold on the P40.&lt;/p&gt;
&lt;p&gt;But the model runs anyway, because the underlying math doesn't actually require any of those features. Bfloat16 is a convenience, not a requirement; float32 computes the same function. Flash attention is an optimization, not a necessity; standard attention computes the same result, up to floating-point rounding. And 96GB across four cards is 96GB, regardless of whether it's cutting-edge HBM3 or decade-old GDDR5X.&lt;/p&gt;
&lt;p&gt;The generation is slow. Eighteen minutes for ten seconds of video is not competitive with a single A100, which would finish the same job in under two minutes. The float32 pipeline performs the same number of floating-point operations as the bf16 path the model was designed for, but every value takes twice the memory and bandwidth, and Pascal has no tensor cores to accelerate the math. On top of that, the PCIe 3.0 transfers between four separate memory pools add latency that a single modern GPU with unified HBM would never incur. But competitive wasn't the point. The point was that four GPUs I bought on eBay for a thousand dollars total, sitting in a server in a shop building, can run a model that was released this month. The gap between "latest model" and "latest hardware" is not as wide as the spec sheets suggest, as long as you're willing to write the code that bridges it.&lt;/p&gt;
&lt;p&gt;The P40 server was already paying for itself on &lt;a href="https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html"&gt;LLM inference&lt;/a&gt; and &lt;a href="https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html"&gt;TTS generation&lt;/a&gt;. Video generation is one more workload on a machine that I own, running models that I choose, on a schedule that I control. The 18-minute wait is the price of not asking anyone's permission.&lt;/p&gt;</description><category>ai</category><category>cuda</category><category>gpu</category><category>home lab</category><category>inference</category><category>ltx video</category><category>multi-gpu</category><category>pascal</category><category>pipeline parallelism</category><category>tesla p40</category><category>video generation</category><guid>https://tinycomputers.io/posts/running-ltx-video-on-four-tesla-p40s.html</guid><pubDate>Fri, 20 Mar 2026 13:00:00 GMT</pubDate></item><item><title>The Economics of Owning Your Own Inference</title><link>https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/the-economics-of-owning-your-own-inference_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;21 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;I own $5,500 worth of GPU hardware dedicated to running AI models locally. I also pay for a Claude Max subscription that I use for nearly everything that matters. If that sounds like a contradiction, it is the entire subject of this article.&lt;/p&gt;
&lt;p&gt;The local inference conversation online is dominated by two positions. The first: why pay for API calls when you can run models on your own hardware? The second: local models are worse, so just pay for the good ones. Both are correct. Both are incomplete. The interesting question is where the boundary falls between them, and the answer turns out to depend less on cost-per-token arithmetic than on what kind of work you are doing.&lt;/p&gt;
&lt;h3&gt;The Split&lt;/h3&gt;
&lt;p&gt;I use Claude for research, code review, writing feedback, technical analysis, and anything that used to be a Google search. The frontier models are better at all of these tasks than anything I can run locally. Not marginally better; categorically better. An 8B parameter model running on my hardware is not in the same conversation as Claude Opus or GPT-5.4 for anything requiring reasoning, nuance, or broad knowledge. The subscription cost is fixed regardless of volume, which eliminates per-query friction entirely. For interactive, quality-sensitive work, I pay for the best model available and I do not think about it.&lt;/p&gt;
&lt;p&gt;Local inference handles everything else: the batch jobs, the grunt work, the high-volume tasks where model quality matters less than model availability. The work that would be expensive at cloud API rates not because any single call costs much, but because the calls number in the tens of thousands.&lt;/p&gt;
&lt;p&gt;This is not a temporary arrangement while local models catch up. It is a structural split. Frontier models are getting better. Local models are also getting better. The gap is not closing in the ways that matter for my usage, because the tasks I send to each side are fundamentally different. I do not need my local 8B model to reason better. I need it to process text cheaply and without metering.&lt;/p&gt;
&lt;h3&gt;What the Local Hardware Actually Does&lt;/h3&gt;
&lt;p&gt;Three workloads. All batch. All quality-tolerant.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text-to-speech.&lt;/strong&gt; Every post on this site has an &lt;a href="https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html"&gt;AI-generated audio narration&lt;/a&gt;. This is the workload that justifies the hardware on its own. Google Cloud Platform has superior TTS voices; Chirp3-HD sounds noticeably more natural than any open-source model I have tested. I ran a novel through it once: 82,000 words, 500,000 characters, $17.25. That is reasonable for a one-off project.&lt;/p&gt;
&lt;p&gt;It is not reasonable for a library of blog posts that I revise and regenerate periodically. At GCP rates ($16 per million characters, more for premium voices), narrating every post on this site would cost $200 to $400, and that bill resets every time I edit an article and regenerate the audio. Open-source TTS (&lt;a href="https://tinycomputers.io/posts/the-real-cost-of-running-qwen-tts-locally-three-machines-compared.html"&gt;F5-TTS and Qwen TTS&lt;/a&gt;) mispronounces technical terms. The prosody goes flat on dense jargon. But it is good enough for blog narration. "Good enough" at zero marginal cost beats "excellent" at $4 to $10 per post when you are generating audio daily.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Code scanning.&lt;/strong&gt; Running local models over source files for pattern detection, documentation extraction, and automated analysis. These jobs produce high token volume at low quality requirements. An 8B model is adequate. The token count across a full codebase makes API pricing add up in a way that individual queries do not.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Infrastructure work.&lt;/strong&gt; Benchmarking hardware (as in this article), testing prompt structures across quantization levels, evaluating model behavior under different configurations. These queries have no value individually. They are the test drives, not the commute. Paying per-token for test drives is paying per-mile to drive your own car around the block.&lt;/p&gt;
&lt;p&gt;None of these workloads require a frontier model. All of them generate enough volume to make metered pricing uncomfortable. That is the boundary.&lt;/p&gt;
&lt;h3&gt;The Machines&lt;/h3&gt;
&lt;p&gt;Two machines. Both mine. Both running &lt;a href="https://ollama.com"&gt;Ollama&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;A &lt;a href="https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html"&gt;four-GPU Tesla P40 server&lt;/a&gt;: Penguin Computing 2U chassis, Xeon E5-2697A v4, 252GB DDR4 ECC, four Tesla P40s with 24GB GDDR5X each. Ninety-six gigabytes of VRAM. Pascal architecture, 2016 vintage. Built from eBay parts for about $2,500. Lives in an unheated shop building in Minnesota.&lt;/p&gt;
&lt;p&gt;A Bosgame M5 mini desktop: AMD Ryzen AI MAX+ 395, Strix Halo APU with integrated RDNA 3.5 graphics. No discrete GPU. CPU and GPU share 128GB DDR5, roughly 60GB addressable as VRAM through ROCm 7.2. Cost about $3,000. Fits on a desk.&lt;/p&gt;
&lt;h3&gt;What They Cost to Run&lt;/h3&gt;
&lt;p&gt;I logged GPU power draw at 500-millisecond intervals during inference using &lt;code&gt;nvidia-smi&lt;/code&gt; on the P40 server and &lt;code&gt;rocm-smi&lt;/code&gt; on the Strix Halo. Same prompt, same models, same Ollama configuration. All models ran 100% on GPU.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;P40 tok/s&lt;/th&gt;
&lt;th&gt;P40 GPU Power&lt;/th&gt;
&lt;th&gt;Halo tok/s&lt;/th&gt;
&lt;th&gt;Halo GPU Power&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 3B&lt;/td&gt;
&lt;td&gt;91.2&lt;/td&gt;
&lt;td&gt;170W avg&lt;/td&gt;
&lt;td&gt;78.4&lt;/td&gt;
&lt;td&gt;64W avg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;47.5&lt;/td&gt;
&lt;td&gt;278W avg&lt;/td&gt;
&lt;td&gt;40.2&lt;/td&gt;
&lt;td&gt;82W avg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 70B (4K ctx)&lt;/td&gt;
&lt;td&gt;6.3&lt;/td&gt;
&lt;td&gt;278W avg&lt;/td&gt;
&lt;td&gt;5.6&lt;/td&gt;
&lt;td&gt;81W avg&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The P40 is 13-18% faster in raw throughput. It draws roughly three times the GPU power (2.7x to 3.4x, depending on the model). The 3B model lives on a single P40; the other three cards idle at ~9W each but still cost electricity. The 8B and 70B models span two GPUs while two idle. You always pay for cards that are not working. The Strix Halo has one GPU. No idle penalty.&lt;/p&gt;
&lt;p&gt;GPU power is not total system power. The P40 server's Xeons, 252GB of RAM, dual PSUs, and fans add roughly 200W to the GPU figures. The Strix Halo's APU and DDR5 add roughly 40-60W. Conservative estimates for total system draw: 500W for the P40 under load, 120W for the Strix Halo.&lt;/p&gt;
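&lt;p&gt;For reference, per-GPU averages like the ones in the table can be computed from a log produced by &lt;code&gt;nvidia-smi --query-gpu=index,power.draw --format=csv,noheader,nounits -lms 500&lt;/code&gt;. The parser below is a minimal sketch rather than the script used for the benchmarks, and the sample log lines are invented values for illustration, not measurements.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
from collections import defaultdict


def average_power(lines):
    """Average watts per GPU index from 'index, watts' CSV lines."""
    readings = defaultdict(list)
    for line in lines:
        index, watts = (field.strip() for field in line.split(","))
        readings[int(index)].append(float(watts))
    return {gpu: sum(w) / len(w) for gpu, w in readings.items()}


# Invented sample: two 500 ms samples from a four-GPU machine
log = """0, 168.2
1, 9.1
2, 9.0
3, 9.3
0, 171.8
1, 9.1
2, 9.2
3, 9.1"""

averages = average_power(log.splitlines())
print({gpu: round(w, 1) for gpu, w in averages.items()})
# {0: 170.0, 1: 9.1, 2: 9.1, 3: 9.2}
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Summing the per-GPU averages gives the GPU-only figure; the system overhead described above still has to be added on top.&lt;/p&gt;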
&lt;p&gt;At Minnesota residential electricity rates ($0.157/kWh), the cost per million tokens:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;3B&lt;/th&gt;
&lt;th&gt;8B&lt;/th&gt;
&lt;th&gt;70B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P40 Server&lt;/td&gt;
&lt;td&gt;$0.19/M&lt;/td&gt;
&lt;td&gt;$0.46/M&lt;/td&gt;
&lt;td&gt;$3.47/M&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Strix Halo&lt;/td&gt;
&lt;td&gt;$0.06/M&lt;/td&gt;
&lt;td&gt;$0.13/M&lt;/td&gt;
&lt;td&gt;$0.94/M&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
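&lt;p&gt;The 8B and 70B columns follow from the throughput table and the system draw estimates. A short sketch, assuming a constant 500W for the P40 server and 120W for the Strix Halo; the function name is made up for illustration. The 3B column is left out because it appears to use a lower system draw that the text does not state.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
RATE = 0.157  # dollars per kWh, from the text


def cost_per_million_tokens(system_watts, tokens_per_second, rate=RATE):
    """Electricity cost of generating one million tokens at a steady rate."""
    hours = 1_000_000 / tokens_per_second / 3600
    return (system_watts / 1000) * hours * rate


print(round(cost_per_million_tokens(500, 47.5), 2))  # 0.46  P40, 8B
print(round(cost_per_million_tokens(500, 6.3), 2))   # 3.46  P40, 70B
print(round(cost_per_million_tokens(120, 40.2), 2))  # 0.13  Strix Halo, 8B
print(round(cost_per_million_tokens(120, 5.6), 2))   # 0.93  Strix Halo, 70B
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The results land within a cent of the table; the small differences come from rounding in the inputs.&lt;/p&gt;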
&lt;h3&gt;Why the Per-Token Number Is Misleading&lt;/h3&gt;
&lt;p&gt;Those numbers look competitive with hosted inference, which runs $0.05 to $0.20 per million tokens for 8B-class models through providers like Together AI or Groq. The Strix Halo at $0.13/M sits squarely in that range. The P40 at $0.46/M does not.&lt;/p&gt;
&lt;p&gt;But per-token cost during active inference is the wrong metric for two reasons.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hardware amortization changes the math.&lt;/strong&gt; The P40 server cost $2,500. The Strix Halo cost $3,000. Amortized over two years of continuous operation, that adds $0.14/hr and $0.17/hr respectively. On the 8B model, the all-in cost per million tokens rises to about $1.28 for the P40 and $1.30 for the Strix Halo. Both are more expensive than every hosted inference API for the same model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Idle power is the dominant cost.&lt;/strong&gt; The P40 server draws roughly 340W at idle: $38.50 per month whether I run a single query or not. The Strix Halo draws roughly 35W at idle: $4.20 per month. Over a year, idle electricity alone costs $462 on the P40 and $50 on the Strix Halo. If you are not using the hardware frequently, idle power overwhelms everything else in the cost model.&lt;/p&gt;
&lt;p&gt;Per-token math at load flatters local inference by ignoring the hours when the hardware is doing nothing. It is like calculating your car's fuel economy only during highway driving and ignoring that it sits in the driveway 22 hours a day with the engine running.&lt;/p&gt;
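&lt;p&gt;The idle and amortization figures can be reproduced with the same kind of arithmetic. This sketch assumes a 30-day month, the $0.157/kWh rate from above, and straight-line amortization over every hour of two years; the function names are made up for illustration.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;
```python
RATE = 0.157  # dollars per kWh, from the text
HOURS_PER_MONTH = 24 * 30


def idle_cost_per_month(idle_watts, rate=RATE):
    """Electricity cost of leaving a machine powered on and idle for 30 days."""
    return (idle_watts / 1000) * HOURS_PER_MONTH * rate


def amortized_per_hour(price, years=2):
    """Purchase price spread evenly over every hour of the amortization period."""
    return price / (years * 365 * 24)


print(round(idle_cost_per_month(340), 2))  # 38.43  P40 server
print(round(idle_cost_per_month(35), 2))   # 3.96   Strix Halo
print(round(amortized_per_hour(2500), 2))  # 0.14   P40 server, dollars per hour
```
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The Strix Halo result comes out slightly under the $4.20 quoted above, which would correspond to an idle draw closer to 37W; both wattages are rough estimates.&lt;/p&gt;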
&lt;h3&gt;Why I Run Both Anyway&lt;/h3&gt;
&lt;p&gt;The per-token economics favor API providers. The per-workload economics favor local hardware for specific tasks. TTS is the starkest example.&lt;/p&gt;
&lt;p&gt;Generating a 20-minute blog narration on the Strix Halo takes about 45 minutes of inference at roughly 85W above idle power. The incremental electricity cost is about $0.02. The same narration through Google Cloud TTS would cost $4 to $10 depending on character count and voice tier.&lt;/p&gt;
&lt;p&gt;That is a 200-to-500x cost difference on the marginal unit. And the marginal unit is what matters, because the question is never "should I generate TTS at all?" It is "should I regenerate the audio for this post I just edited?" or "should I try a different voice on this article?" or "should I narrate this niche post about PCB trace routing that maybe fifty people will listen to?"&lt;/p&gt;
&lt;p&gt;At $4 to $10 per narration, the answer to all of those is "probably not." At $0.02, the answer is "why wouldn't I?" That shift from "probably not" to "why not" is the entire economic argument for owning TTS hardware. It is not about the average cost. It is about the marginal decision.&lt;/p&gt;
&lt;p&gt;Before running local TTS, I narrated posts selectively with Google Cloud's Text-to-Speech. Some were too long or too niche to justify the GCP cost. Now every post gets audio. I regenerate after revisions without thinking about it. I have run the same post through three different TTS models to compare voice quality. I experiment with speaker voices, pacing parameters, and chunk sizes. The total volume of audio I have generated locally exceeds what I would have purchased from Google at any price point. This is &lt;a href="https://tinycomputers.io/posts/jevons-paradox.html"&gt;Jevons Paradox&lt;/a&gt; at the smallest possible scale: make TTS cheap enough and I do not produce the same amount of TTS for less money; I produce vastly more TTS for slightly less money.&lt;/p&gt;
&lt;p&gt;The same logic applies to code scanning. Any individual scan is cheap enough through an API. But the friction of metered pricing discourages the kind of speculative, exploratory analysis that turns up unexpected findings. When the marginal cost is zero, I scan more freely and more often. The value is not in any single scan; it is in the scans I would not have run otherwise.&lt;/p&gt;
&lt;h3&gt;The Strix Halo Problem&lt;/h3&gt;
&lt;p&gt;The most surprising result in the benchmarks is the Strix Halo's efficiency. An integrated APU with no discrete GPU delivers 40.2 tokens per second at 82W of GPU power. The P40 server delivers 47.5 tokens per second at 278W of GPU power. The P40 is 18% faster. The Strix Halo uses 70% less power. In performance per GPU watt, the Strix Halo (0.49 tok/s per watt) is nearly three times more efficient than the P40 (0.17 tok/s per watt).&lt;/p&gt;
&lt;p&gt;This creates a problem for the P40 server's economics. The server's advantage is VRAM: 96GB lets it run 120B MoE models that the Strix Halo cannot fit. For the gpt-oss 120B model, the P40 server is the only option. For everything 8B and below, though, the Strix Halo costs a little more up front ($3,000 vs. $2,500) and wins everywhere else: it is cheaper to idle ($4.20/month vs. $38.50/month), cheaper per token ($0.13/M vs. $0.46/M), quieter, smaller, and only 18% slower.&lt;/p&gt;
&lt;p&gt;If I were building a local inference setup today from scratch and my workload was 8B models and TTS, I would buy the Strix Halo and nothing else. The P40 server justifies its existence only because of the large models that need its VRAM, and because I built it well before the current RAM price spike.&lt;/p&gt;
&lt;p&gt;This is worth sitting with for a moment, because it inverts the conventional wisdom about inference hardware. The enterprise GPU server that looks impressive on paper (four GPUs, 96GB VRAM, 2U rack mount) loses on total cost of ownership to a $3,000 mini desktop for the workloads that dominate my actual usage. The P40's raw throughput advantage is real but small. Its power cost advantage is negative. The VRAM advantage matters only for models most people do not run.&lt;/p&gt;
&lt;h3&gt;The Maintenance Tax&lt;/h3&gt;
&lt;p&gt;The per-token calculations ignore the cost of keeping these machines running. It is not zero.&lt;/p&gt;
&lt;p&gt;I have had two kernel updates break the NVIDIA DKMS module on the P40 server. The AMD machine requires &lt;a href="https://tinycomputers.io/posts/qwen-tts-on-amd-strix-halo.html"&gt;specific pre-release PyTorch wheels&lt;/a&gt; and environment variable overrides for ROCm to function on gfx1151 hardware. While running the benchmarks for this article, I discovered that Ollama on the Strix Halo had been running entirely on CPU because the systemd service file lacked the &lt;code&gt;HSA_OVERRIDE_GFX_VERSION=11.5.1&lt;/code&gt; variable. Every benchmark I had run on that machine prior to catching this was measuring CPU inference, not GPU inference. The fix took two minutes. Finding it took longer.&lt;/p&gt;
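&lt;p&gt;A silent CPU fallback like this is cheap to guard against. The following is a minimal sketch that checks whether a systemd unit file sets a given variable in an &lt;code&gt;Environment=&lt;/code&gt; line. The unit contents and the &lt;code&gt;ExecStart&lt;/code&gt; path are hypothetical, and the parser is deliberately simplified: it does not handle quoted values containing spaces or &lt;code&gt;EnvironmentFile=&lt;/code&gt; directives.&lt;/p&gt;

```python
def unit_sets_variable(unit_text, name):
    """Return True if a systemd unit file assigns `name` in an Environment= line."""
    for raw_line in unit_text.splitlines():
        line = raw_line.strip()
        if not line.startswith("Environment="):
            continue
        # Simplification: split on whitespace and strip optional double quotes.
        for assignment in line[len("Environment="):].split():
            if assignment.strip('"').startswith(name + "="):
                return True
    return False


# Hypothetical unit file contents, for illustration only.
broken_unit = """[Service]
ExecStart=/usr/local/bin/ollama serve
"""
fixed_unit = broken_unit + 'Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"\n'

print(unit_sets_variable(broken_unit, "HSA_OVERRIDE_GFX_VERSION"))
print(unit_sets_variable(fixed_unit, "HSA_OVERRIDE_GFX_VERSION"))
```

&lt;p&gt;A check like this only confirms the variable is configured. Confirming that inference actually lands on the GPU still means watching GPU utilization while a model runs.&lt;/p&gt;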
&lt;p&gt;The P40 server's fans run at full speed from October through April because the BMC interprets Minnesota winter temperatures as a hardware malfunction. The noise is audible from the house, 150 feet away.&lt;/p&gt;
&lt;p&gt;None of this is catastrophic. All of it is time. And time spent debugging DKMS modules or adding environment variables to systemd units is time not spent on the work that the hardware is supposed to enable. A Claude Max subscription requires zero maintenance. The local hardware requires ongoing attention. That asymmetry does not show up in per-token cost tables, but it is real.&lt;/p&gt;
&lt;h3&gt;Who This Is For&lt;/h3&gt;
&lt;p&gt;Most people should not build a local inference server. If you use AI for interactive tasks (questions, code, analysis, writing), a frontier model subscription is a better product at a lower total cost than any local setup. The quality gap between a local 8B model and Claude or GPT-5.4 is not closing in the ways that matter for conversational use. Pay for the good models. Use them freely.&lt;/p&gt;
&lt;p&gt;Local inference makes economic sense when you have a specific, high-volume, quality-tolerant workload that you will run often enough to justify hardware sitting on 24/7. TTS is the clearest case. Batch code analysis is another. If you cannot name the workload, you do not have one, and the hardware will cost you $40 to $50 per month in idle electricity to find out.&lt;/p&gt;
&lt;p&gt;The split between frontier subscriptions and local batch processing is not a compromise. It is, for my usage, the correct architecture. The frontier model handles the work where quality determines value. The local hardware handles the work where volume determines cost. Neither replaces the other. The mistake is thinking they compete.&lt;/p&gt;</description><category>ai</category><category>amd</category><category>benchmarks</category><category>claude</category><category>economics</category><category>gpu</category><category>home lab</category><category>inference</category><category>jevons paradox</category><category>local inference</category><category>power consumption</category><category>strix halo</category><category>tesla p40</category><category>tts</category><guid>https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html</guid><pubDate>Tue, 17 Mar 2026 13:00:00 GMT</pubDate></item><item><title>Repurposing Enterprise GPUs: The Tesla P40 Home Lab Story</title><link>https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;17 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;There is a window, maybe eighteen months wide, where enterprise hardware hits a pricing sweet spot. The first-generation buyers (the hyperscalers, the research labs, the Fortune 500 AI teams) have moved on to the next generation. The second-hand market floods. Prices crater. And if you know what you're looking for, you can build something genuinely capable for less than a month of cloud compute.&lt;/p&gt;
&lt;p&gt;I built a four-GPU inference server for about twenty-five hundred dollars. This is the story of how, why, and whether you should do the same.&lt;/p&gt;
&lt;h3&gt;The Buy&lt;/h3&gt;
&lt;p&gt;The acquisition strategy is straightforward: eBay, patience, and knowing what to look for.&lt;/p&gt;
&lt;p&gt;Tesla P40s started appearing in volume on the secondary market around 2023, when cloud providers and enterprise data centers began cycling them out in favor of A100s and H100s. A card that sold for over five thousand dollars new was suddenly available for three hundred, then two hundred and fifty, then, if you watched listings carefully and were willing to buy from decommissioned lot sellers, sometimes less. I picked up four cards over the course of about two months, averaging two hundred and fifty dollars each.&lt;/p&gt;
&lt;p&gt;The chassis was a Penguin Computing 2U rack-mount server, also from eBay. These show up when government labs and research institutions liquidate equipment. The Penguin Computing systems are well-built, with proper server-grade construction, redundant power supplies, and engineered airflow. Mine takes the Xeon E5-2697A v4, and I bought two of those on eBay as well: sixteen Broadwell cores apiece, more than enough CPU to keep four GPUs fed. The chassis cost around six hundred dollars.&lt;/p&gt;
&lt;p&gt;Memory was the lucky purchase. I bought 252GB of DDR4 ECC RAM before the memory price spike that hit in late 2025, when every company on Earth decided they needed AI infrastructure simultaneously. What I paid around two hundred and fifty dollars for would cost significantly more today. Total build: roughly twenty-five hundred dollars.&lt;/p&gt;
&lt;h3&gt;The Hardware&lt;/h3&gt;
&lt;p&gt;The Tesla P40 is a 2016-era data center GPU. NVIDIA designed it for the Pascal generation, targeting inference workloads in enterprise environments. The specifications, for something you can buy on eBay for two hundred and fifty dollars, are remarkable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;24GB GDDR5&lt;/strong&gt; per card, as much memory as an RTX 4090&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;3,840 CUDA cores&lt;/strong&gt;, Pascal architecture, compute capability 6.1&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;12 TFLOPS FP32&lt;/strong&gt;, respectable even by 2026 standards for inference&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;250W TDP&lt;/strong&gt;: this is a data center card and it draws power like one&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Multiply by four and you get 96GB of VRAM for a thousand dollars. That is an extraordinary amount of GPU memory for the price. For context, a single NVIDIA A100 80GB still sells for north of five thousand dollars on the secondary market. Four P40s give you more total VRAM for a fraction of the cost.&lt;/p&gt;
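&lt;p&gt;Put as cost per gigabyte, the gap is easier to see. This sketch uses only the prices quoted above; the A100 figure is a floor ("north of five thousand dollars"), so its real cost per gigabyte is at least the value printed. The function and dictionary names are illustrative.&lt;/p&gt;

```python
# Capacities and prices as quoted in the article; the A100 price is a lower bound.
options = {
    "4x Tesla P40": {"vram_gb": 4 * 24, "price_usd": 4 * 250},
    "1x A100 80GB": {"vram_gb": 80, "price_usd": 5000},
}


def usd_per_gb(vram_gb, price_usd):
    """Purchase price divided by VRAM capacity."""
    return price_usd / vram_gb


cost = {name: usd_per_gb(**spec) for name, spec in options.items()}
for name, value in cost.items():
    print(f"{name}: {value:.2f} USD per GB of VRAM")
```

&lt;p&gt;The comparison covers capacity only. It says nothing about memory bandwidth or compute, where the A100 is far ahead.&lt;/p&gt;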
&lt;h3&gt;What You Give Up&lt;/h3&gt;
&lt;p&gt;There is no free lunch in computing, and the P40 makes you pay for its low price in specific, sometimes painful ways.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No Tensor Cores.&lt;/strong&gt; The P40 predates NVIDIA's Tensor Core architecture, which arrived with Volta in 2017. Tensor Cores accelerate matrix multiplication (the fundamental operation in neural network inference) by factors of 4x to 16x depending on precision. The P40 does everything with its CUDA cores, the old-fashioned way. This matters less than you might think for inference at moderate batch sizes, but it means you will never match the throughput of a V100 or newer card, clock for clock.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;No native BF16 or FP16.&lt;/strong&gt; This is the real gotcha. BF16 (bfloat16) has become the default precision for large language models. It is what most model weights are distributed in. The P40 cannot compute in BF16 natively; it emulates it through FP32 operations, which is roughly 21% slower than native support. In practice, this means you are running quantized models (Q4, Q5, Q8) through llama.cpp or similar frameworks, which handle the precision conversion for you. It works. It is not optimal.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Passive cooling designed for server airflow.&lt;/strong&gt; The P40 has no fan of its own; it is a passively cooled card designed for 1U and 2U server chassis with front-to-back forced airflow. In a proper server, this is fine. In anything else, you need to solve cooling yourself. I put mine in a Penguin Computing 2U rack-mount chassis, which has the right airflow characteristics, but this is not a card you drop into a desktop tower.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PCIe 3.0 x16.&lt;/strong&gt; The P40 connects via PCIe 3.0, which provides about 16 GB/s of bandwidth per direction. When you are running a model that spans four GPUs, the inter-GPU communication goes over PCIe, not NVLink. This creates a bottleneck for models that require heavy cross-GPU communication. For inference, where the communication pattern is more predictable than training, this is manageable. For training, it would be a serious constraint.&lt;/p&gt;
&lt;h3&gt;The Minnesota Problem&lt;/h3&gt;
&lt;p&gt;My server lives in an unheated shop building in northern Minnesota. This has created an issue that no hardware review will prepare you for.&lt;/p&gt;
&lt;p&gt;When ambient temperatures drop below freezing (which, in Minnesota, means roughly October through April), the onboard temperature sensors report values that the baseboard management controller interprets as a malfunction. The BMC's response is to spin every fan to maximum RPM as a protective measure.&lt;/p&gt;
&lt;p&gt;The result is a machine that, on quiet winter nights, is audible from the house. The house is a hundred and fifty feet away.&lt;/p&gt;
&lt;p&gt;I have not solved this problem. I have learned to live with it. You can override BMC fan curves on some platforms, but the Penguin Computing firmware is locked down in ways that make this nontrivial, and frankly, a server that runs its fans at full speed because it thinks it is dying is doing exactly what it should be doing. The firmware's assumptions are just wrong for the environment.&lt;/p&gt;
&lt;p&gt;The server runs 24/7 regardless of the season, and the cold air actually keeps the GPUs well within thermal limits. The irony is that the machine has never been cooler or louder than when it is twenty below zero outside. If you are considering a similar setup in a garage, basement, or outbuilding, factor in noise. A 2U server with four 250W GPUs is not quiet under any circumstances, and server-grade fans at full RPM are genuinely loud.&lt;/p&gt;
&lt;h3&gt;Setting Up the Software Stack&lt;/h3&gt;
&lt;p&gt;The driver situation for the P40 in 2026 is straightforward, though it was not always. NVIDIA's &lt;code&gt;nvidia-driver-570-server&lt;/code&gt; package works cleanly on Ubuntu, and the DKMS module rebuilds automatically on kernel updates, most of the time. I have had exactly two occasions where a kernel update broke the NVIDIA module and required manual intervention. This is fewer than I expected.&lt;/p&gt;
&lt;p&gt;For inference, I run &lt;a href="https://ollama.com"&gt;Ollama&lt;/a&gt;, which wraps llama.cpp and provides a simple API for model management and inference. Ollama handles multi-GPU sharding automatically: when you load a model, it distributes layers across GPUs based on available memory and model size. A 65GB model like gpt-oss:120b fits across three of the four P40s, leaving one free. Smaller models may only need one or two cards. The allocation is generally sensible, though you have less control over placement than you would with raw llama.cpp.&lt;/p&gt;
&lt;p&gt;The alternative stacks (vLLM, TGI, or raw llama.cpp) offer more control over GPU assignment but require more configuration. With llama.cpp directly, you can pin specific layers to specific devices, which lets you optimize for the P40's memory topology. vLLM provides continuous batching, which serves multiple concurrent requests more efficiently. For a home lab where the primary use case is running various models for experimentation and development rather than serving production traffic, Ollama's simplicity wins.&lt;/p&gt;
&lt;p&gt;One thing worth noting: the P40 is well-supported by the GGUF ecosystem that llama.cpp (and therefore Ollama) uses. GGUF quantized models (Q4_K_M, Q5_K_M, Q8_0) run without issues on Pascal hardware. The quantization handles the BF16 problem for you: model weights are stored in 4-bit or 8-bit integer formats and dequantized to FP32 at runtime, which the P40 handles natively. You are not fighting the hardware; you are working with it.&lt;/p&gt;
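&lt;p&gt;To make the dequantization step concrete, here is a pure-Python sketch of the arithmetic behind the Q8_0 format: weights are grouped into blocks of 32, each block stores one half-precision scale plus 32 signed 8-bit integers, and a weight is recovered as scale times integer. This illustrates the math only. The real kernels in llama.cpp are CUDA code operating on packed buffers, and the function names here are made up for the example.&lt;/p&gt;

```python
import struct

BLOCK_SIZE = 32  # weights per Q8_0 block


def quantize_q8_0(weights):
    """Quantize one block of floats to (scale, list of int8-range values)."""
    amax = max(abs(w) for w in weights)
    # The format stores the scale as fp16, so round-trip it through half precision.
    scale = struct.unpack("e", struct.pack("e", amax / 127))[0]
    if scale == 0.0:
        return 0.0, [0] * len(weights)
    quants = [max(-127, min(127, round(w / scale))) for w in weights]
    return scale, quants


def dequantize_q8_0(scale, quants):
    """Recover approximate float weights: one multiply per weight."""
    return [scale * q for q in quants]


# Illustrative block of 32 evenly spaced weights, not taken from any real model.
weights = [(i - 16) / 16 for i in range(BLOCK_SIZE)]
scale, quants = quantize_q8_0(weights)
restored = dequantize_q8_0(scale, quants)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.6f} worst-case error={worst:.6f}")
```

&lt;p&gt;The point for Pascal hardware is that the multiply happens in ordinary floating point, so nothing in this path depends on BF16 or Tensor Core support.&lt;/p&gt;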
&lt;h3&gt;The Benchmarks&lt;/h3&gt;
&lt;p&gt;Theory is cheap. Benchmarks are what matter. I ran the same inference workload across three configurations: my four P40 home lab, a single AWS Tesla T4 instance, and a quad T4 instance on AWS. The T4 is the closest cloud comparison; it is the workhorse inference GPU in AWS's fleet, one generation newer than the P40 (Turing architecture, 2018), with 16GB of GDDR6 and actual Tensor Cores.&lt;/p&gt;
&lt;p&gt;All benchmarks used Ollama with the same prompt, measuring tokens per second during the evaluation phase (excluding model load time).&lt;/p&gt;
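&lt;p&gt;For readers who want to reproduce this kind of measurement, here is a minimal sketch. It assumes a default Ollama install listening on &lt;code&gt;localhost:11434&lt;/code&gt; and relies on the &lt;code&gt;eval_count&lt;/code&gt; and &lt;code&gt;eval_duration&lt;/code&gt; (nanoseconds) fields that Ollama's &lt;code&gt;/api/generate&lt;/code&gt; endpoint returns for non-streaming requests. Whether the numbers below were collected exactly this way is an assumption, and the sample response is invented for illustration.&lt;/p&gt;

```python
import json
import urllib.request


def eval_tokens_per_second(response):
    """Generation speed: eval_count tokens over eval_duration (ns), load time excluded."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_prompt(model, prompt, host="http://localhost:11434"):
    """POST a non-streaming request to Ollama's /api/generate and return the JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        host + "/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Made-up response values for illustration; not a measurement from this article.
sample = {"eval_count": 200, "eval_duration": 4_000_000_000, "load_duration": 9_000_000_000}
print(f"{eval_tokens_per_second(sample):.1f} tok/s")
```

&lt;p&gt;Against a live server, &lt;code&gt;eval_tokens_per_second(run_prompt("llama3.1:8b", "Why is the sky blue?"))&lt;/code&gt; would give a comparable figure, with the model tag being whichever one is installed locally.&lt;/p&gt;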
&lt;h4&gt;Dense Models&lt;/h4&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;4x P40 (Home Lab)&lt;/th&gt;
&lt;th&gt;1x T4 (AWS \$0.53/hr)&lt;/th&gt;
&lt;th&gt;4x T4 (AWS \$3.91/hr)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2&lt;/td&gt;
&lt;td&gt;3B&lt;/td&gt;
&lt;td&gt;94.3 tok/s&lt;/td&gt;
&lt;td&gt;81.5 tok/s&lt;/td&gt;
&lt;td&gt;101.5 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;52.7 tok/s&lt;/td&gt;
&lt;td&gt;36.9 tok/s&lt;/td&gt;
&lt;td&gt;40.3 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1&lt;/td&gt;
&lt;td&gt;8B&lt;/td&gt;
&lt;td&gt;47.8 tok/s&lt;/td&gt;
&lt;td&gt;35.7 tok/s&lt;/td&gt;
&lt;td&gt;29.2 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The P40 wins on the 7B and 8B models by substantial margins, 31% and 64% respectively over the quad T4 configuration. The only model where the T4 edges ahead is the 3B, which is small enough to fit entirely on a single GPU. Here, the T4's higher clock speeds and faster GDDR6 memory give it an advantage because there is no multi-GPU overhead to penalize it.&lt;/p&gt;
&lt;p&gt;The 8B result is particularly interesting. The quad T4 actually performs &lt;em&gt;worse&lt;/em&gt; than a single T4 on this model (29.2 vs 35.7 tok/s). Ollama shards the model across all four GPUs even though it fits on one, and the PCIe communication overhead between four T4s costs more than it gains. The P40, with its larger 24GB per-card memory, likely fits more of the model per GPU, reducing cross-GPU transfers.&lt;/p&gt;
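&lt;p&gt;The margins quoted above follow directly from the table. A short sketch that recomputes them, with the table values copied in and a hypothetical &lt;code&gt;margin&lt;/code&gt; helper:&lt;/p&gt;

```python
# Tokens per second from the dense-model table above.
results = {
    "Llama 3.2 3B": {"p40x4": 94.3, "t4x1": 81.5, "t4x4": 101.5},
    "Qwen 2.5 7B": {"p40x4": 52.7, "t4x1": 36.9, "t4x4": 40.3},
    "Llama 3.1 8B": {"p40x4": 47.8, "t4x1": 35.7, "t4x4": 29.2},
}


def margin(ours, theirs):
    """Relative speed difference; positive means `ours` is faster."""
    return ours / theirs - 1


for model, r in results.items():
    print(
        f"{model}: 4x P40 vs 4x T4 {margin(r['p40x4'], r['t4x4']):+.0%}, "
        f"4x T4 vs 1x T4 {margin(r['t4x4'], r['t4x1']):+.0%}"
    )
```

&lt;p&gt;The second column makes the multi-GPU penalty visible: a positive number means four T4s helped, a negative one means sharding cost more than it gained.&lt;/p&gt;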
&lt;h4&gt;The MoE Advantage&lt;/h4&gt;
&lt;p&gt;The most compelling benchmark comes from OpenAI's gpt-oss, a 120-billion parameter mixture-of-experts model with only 5.1 billion active parameters per token. The MoE architecture means the model's total weight is large (it needs the memory), but the computation per token is modest (only a fraction of the parameters fire for any given input).&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;4x P40&lt;/th&gt;
&lt;th&gt;4x T4 (AWS \$3.91/hr)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;gpt-oss&lt;/td&gt;
&lt;td&gt;120B MoE (5.1B active)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28.1 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20.6 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The P40 runs OpenAI's 120B model at 28.1 tokens per second, 36% faster than the cloud instance, and fast enough for comfortable interactive use. This is a state-of-the-art model running on decade-old GPUs at a speed that would have been impressive on much newer hardware a year ago.&lt;/p&gt;
&lt;p&gt;The reason is memory. The gpt-oss model uses MXFP4 quantization on its MoE weights, bringing the total model size to about 65GB. Four P40s offer 96GB of VRAM, enough to hold the entire model in GPU memory. Four T4s offer only 64GB, which means some of the model likely spills to system RAM, adding latency on every token.&lt;/p&gt;
&lt;p&gt;This is the P40's superpower: 24GB per card was overkill in 2016, and it is exactly right in 2026. Models have grown to fill the memory, and the P40 has more of it per dollar than almost anything else on the market.&lt;/p&gt;
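&lt;p&gt;The fit argument reduces to one subtraction. The sketch below compares the approximate model size against combined VRAM for each configuration. It ignores KV cache, context buffers, and per-card runtime overhead, so the real headroom is smaller than what it prints; the function name is illustrative.&lt;/p&gt;

```python
def vram_shortfall_gb(model_gb, cards, gb_per_card):
    """GB of model that cannot fit in combined VRAM (0 if everything fits)."""
    return max(0, model_gb - cards * gb_per_card)


MODEL_GB = 65  # approximate gpt-oss:120b size quoted above

for label, cards, per_card in [("4x P40", 4, 24), ("4x T4", 4, 16)]:
    total = cards * per_card
    short = vram_shortfall_gb(MODEL_GB, cards, per_card)
    print(f"{label}: {total} GB total, {short} GB short")
```

&lt;p&gt;Even a small shortfall matters, because whatever spills to system RAM is touched on the way to producing every token.&lt;/p&gt;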
&lt;h4&gt;Where It Falls Apart&lt;/h4&gt;
&lt;p&gt;Dense 70B models are a different story. Llama 3.1 70B at Q4_0 quantization (39GB) fits across 96GB of P40 VRAM, but the inference speed is essentially unusable: 0.033 tokens per second. One token every thirty seconds. Answering "What is 2+2?" took six and a half minutes. The combination of no Tensor Cores, PCIe 3.0 interconnect, and the sheer volume of cross-GPU data transfers for a dense 70B model pushes the per-token latency beyond any practical threshold.&lt;/p&gt;
&lt;p&gt;The quad T4 on AWS managed 2.0 tokens per second on the same model, sixty times faster. Slow, but functional. The T4's Tensor Cores make the difference here; at this scale, the P40's raw CUDA cores simply cannot keep up with the matrix math.&lt;/p&gt;
&lt;p&gt;The lesson: MoE models and quantized models up to about 8B parameters are the P40's sweet spot. Dense models above 13B start hitting diminishing returns. Dense 70B is a wall.&lt;/p&gt;
&lt;h3&gt;The Cost Argument&lt;/h3&gt;
&lt;p&gt;Here is the math that justifies the project.&lt;/p&gt;
&lt;p&gt;A &lt;code&gt;g4dn.12xlarge&lt;/code&gt; on AWS (four Tesla T4s, 48 vCPUs, 192GB RAM) costs \$3.91 per hour. My home lab outperforms it on the 7B, 8B, and 120B MoE models, which are the ones I actually run. If I run inference for just four hours a day, the cloud cost would be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Daily&lt;/strong&gt;: \$15.64&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monthly&lt;/strong&gt;: \$469&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Yearly&lt;/strong&gt;: \$5,709&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;My server cost \$2,500 to build. It pays for itself in roughly five months of equivalent cloud usage. After that, the only ongoing cost is electricity. At Minnesota residential rates (roughly \$0.12/kWh) and an average draw of 800W under load, that is about \$70 per month. Less than a single day of the equivalent cloud instance.&lt;/p&gt;
&lt;p&gt;Even if you factor in the P40's lower performance on some workloads and assume you only get 70% of the cloud equivalent's utility, the break-even point is still well under a year. For a home lab that runs 24/7 for development, experimentation, and &lt;a href="https://tinycomputers.io/posts/clean-room-z80-emulator.html"&gt;text-to-speech generation&lt;/a&gt;, the economics are overwhelming.&lt;/p&gt;
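&lt;p&gt;The arithmetic behind this section can be written out in a few lines. The sketch uses the rates quoted above with a 30-day month and a 365-day year, so results may differ from the rounded figures in the text by a few dollars. The last line goes one step beyond the text and nets electricity against the cloud bill, which is an added assumption about how to count break-even.&lt;/p&gt;

```python
HOURLY_RATE = 3.91       # g4dn.12xlarge, USD per hour
HOURS_PER_DAY = 4        # assumed daily usage
BUILD_COST = 2500        # USD
POWER_KW = 0.8           # average draw under load
ELECTRICITY_RATE = 0.12  # USD per kWh

daily = HOURLY_RATE * HOURS_PER_DAY
monthly = daily * 30
yearly = daily * 365
break_even_months = BUILD_COST / monthly

# The server runs 24/7, so electricity is billed for every hour of the month.
electricity_monthly = POWER_KW * 24 * 30 * ELECTRICITY_RATE
net_break_even_months = BUILD_COST / (monthly - electricity_monthly)

print(f"Cloud: {daily:.2f}/day, {monthly:.2f}/month, {yearly:.2f}/year (USD)")
print(f"Break-even vs cloud bill alone: {break_even_months:.1f} months")
print(f"Electricity: {electricity_monthly:.2f} USD/month")
print(f"Break-even net of electricity: {net_break_even_months:.1f} months")
```

&lt;p&gt;Changing &lt;code&gt;HOURS_PER_DAY&lt;/code&gt; is the quickest way to see how sensitive the result is to usage: the build cost is fixed, so lighter use stretches the payback period proportionally.&lt;/p&gt;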
&lt;h3&gt;What I Actually Use It For&lt;/h3&gt;
&lt;p&gt;The server runs several workloads:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Local LLM inference.&lt;/strong&gt; This is the primary use case. Having a local inference server with 96GB of VRAM means I can run frontier-class open-weight models without sending data to a cloud API. For development work, where I might make hundreds of inference calls while iterating on a project, the zero marginal cost changes how I work. I experiment more freely when each query costs nothing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Text-to-speech.&lt;/strong&gt; I run &lt;a href="https://tinycomputers.io/posts/clean-room-z80-emulator.html"&gt;Qwen TTS&lt;/a&gt; on the P40s to generate audio narration for blog posts. The model fits comfortably in the P40's memory, and the generation speed is acceptable for batch processing. The narration you hear on posts across this site was generated on these GPUs.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Development and testing.&lt;/strong&gt; When I am building projects like &lt;a href="https://tinycomputers.io/posts/sampo-designing-a-16-bit-risc-cpu-from-scratch-part-1-theory-and-architecture.html"&gt;Sampo&lt;/a&gt; or &lt;a href="https://tinycomputers.io/posts/introducing-lattice-a-crystallization-based-programming-language.html"&gt;Lattice&lt;/a&gt;, having local GPU compute available for testing AI-assisted workflows means I do not need to worry about API rate limits or costs during intensive development sessions.&lt;/p&gt;
&lt;p&gt;The server sits on my local network at a static IP, accessible from any machine in the house. It is always on, always available, and always free to use. That availability changes your relationship with AI inference in ways that are hard to appreciate until you have lived with it. There is a psychological difference between "this costs two cents per query" and "this costs nothing per query." The first makes you think about whether the query is worth it. The second lets you experiment without friction, and that friction reduction, compounded across hundreds of daily interactions, fundamentally changes how you work.&lt;/p&gt;
&lt;p&gt;This is, incidentally, a small-scale example of the &lt;a href="https://tinycomputers.io/posts/jevons-paradox.html"&gt;Jevons Paradox&lt;/a&gt; I have been writing about in this blog's economics series. Making inference cheaper did not cause me to run the same number of queries and pocket the savings. It caused me to run dramatically more queries, on more models, for more projects, consuming more total compute than I ever would have purchased from a cloud provider. The efficiency created demand.&lt;/p&gt;
&lt;h3&gt;Should You Build One?&lt;/h3&gt;
&lt;p&gt;The honest answer is: it depends on what you value.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Build one if:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You run local inference regularly and the cloud costs are adding up&lt;/li&gt;
&lt;li&gt;You want 96GB of VRAM for about a thousand dollars in GPU costs&lt;/li&gt;
&lt;li&gt;You have the physical space, electrical capacity, and noise tolerance for a rack-mount server&lt;/li&gt;
&lt;li&gt;You enjoy the process of building and configuring systems; this is not a plug-and-play experience&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Do not build one if:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You need the latest model performance (Tensor Cores, FP8, NVLink)&lt;/li&gt;
&lt;li&gt;You are training models, not running inference&lt;/li&gt;
&lt;li&gt;You need reliability guarantees; this is a home lab, not a production environment&lt;/li&gt;
&lt;li&gt;You are not comfortable with Linux system administration, driver debugging, and occasional hardware troubleshooting&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The P40 window will not last forever. As newer GPUs age out of data centers (the V100, the A100), the P40 will eventually lose its price-to-performance advantage. The V100, with its first-generation Tensor Cores and 32GB of HBM2, is already starting to appear at attractive secondary market prices. Within a year, it may be the new sweet spot. But right now, in early 2026, four P40s on eBay represent one of the best deals in GPU computing. Ninety-six gigabytes of VRAM, proven CUDA compatibility, and a decade of driver maturity, for the price of a weekend trip.&lt;/p&gt;
&lt;p&gt;The server in my shop building will keep running. The fans will keep screaming through the Minnesota winter. And I will keep running models on hardware that a hyperscaler discarded three years ago, at speeds that would have been remarkable on any hardware five years ago. That is the beauty of the secondary market: someone else paid for the R&amp;amp;D, someone else paid for the depreciation, and you get the compute.&lt;/p&gt;</description><category>ai</category><category>benchmarks</category><category>cuda</category><category>deep learning</category><category>ebay</category><category>enterprise hardware</category><category>gpu</category><category>home lab</category><category>inference</category><category>nvidia</category><category>ollama</category><category>tesla p40</category><guid>https://tinycomputers.io/posts/repurposing-enterprise-gpus-the-tesla-p40-home-lab-story.html</guid><pubDate>Wed, 11 Mar 2026 14:00:00 GMT</pubDate></item></channel></rss>