I own $5,500 worth of GPU hardware dedicated to running AI models locally. I also pay for a Claude Max subscription that I use for nearly everything that matters. If that sounds like a contradiction, it is the entire subject of this article.
The local inference conversation online is dominated by two positions. The first: why pay for API calls when you can run models on your own hardware? The second: local models are worse, so just pay for the good ones. Both are correct. Both are incomplete. The interesting question is where the boundary falls between them, and the answer turns out to depend less on cost-per-token arithmetic than on what kind of work you are doing.
The Split
I use Claude for research, code review, writing feedback, technical analysis, and anything that used to be a Google search. The frontier models are better at all of these tasks than anything I can run locally. Not marginally better; categorically better. An 8B parameter model running on my hardware is not in the same conversation as Claude Opus or GPT-5.4 for anything requiring reasoning, nuance, or broad knowledge. The subscription cost is fixed regardless of volume, which eliminates per-query friction entirely. For interactive, quality-sensitive work, I pay for the best model available and I do not think about it.
Local inference handles everything else: the batch jobs, the grunt work, the high-volume tasks where model quality matters less than model availability. The work that would be expensive at cloud API rates not because any single call costs much, but because the calls number in the tens of thousands.
This is not a temporary arrangement while local models catch up. It is a structural split. Frontier models are getting better. Local models are also getting better. The gap is not closing in the ways that matter for my usage, because the tasks I send to each side are fundamentally different. I do not need my local 8B model to reason better. I need it to process text cheaply and without metering.
What the Local Hardware Actually Does
Three workloads. All batch. All quality-tolerant.
Text-to-speech. Every post on this site has an AI-generated audio narration. This is the workload that justifies the hardware on its own. Google Cloud Platform has superior TTS voices; Chirp3-HD sounds noticeably more natural than any open-source model I have tested. I ran a novel through it once: 82,000 words, 500,000 characters, $17.25. That is reasonable for a one-off project.
It is not reasonable for a library of blog posts that I revise and regenerate periodically. At GCP rates ($16 per million characters, more for premium voices), narrating every post on this site would cost $200 to $400, and that bill resets every time I edit an article and regenerate the audio. Open-source TTS (F5-TTS and Qwen TTS) mispronounces technical terms. The prosody goes flat on dense jargon. But it is good enough for blog narration. "Good enough" at zero marginal cost beats "excellent" at $4 to $10 per post when you are generating audio daily.
Code scanning. Running local models over source files for pattern detection, documentation extraction, and automated analysis. These jobs produce high token volume at low quality requirements. An 8B model is adequate. The token count across a full codebase makes API pricing add up in a way that individual queries do not.
Infrastructure work. Benchmarking hardware (as in this article), testing prompt structures across quantization levels, evaluating model behavior under different configurations. These queries have no value individually. They are the test drives, not the commute. Paying per-token for test drives is paying per-mile to drive your own car around the block.
None of these workloads require a frontier model. All of them generate enough volume to make metered pricing uncomfortable. That is the boundary.
The Machines
Two machines. Both mine. Both running Ollama.
A four-GPU Tesla P40 server: Penguin Computing 2U chassis, Xeon E5-2697A v4, 252GB DDR4 ECC, four Tesla P40s with 24GB GDDR5 each. Ninety-six gigabytes of VRAM. Pascal architecture, 2016 vintage. Built from eBay parts for about $2,500. Lives in an unheated shop building in Minnesota.
A Bosgame M5 mini desktop: AMD Ryzen AI MAX+ 395, Strix Halo APU with integrated RDNA 3.5 graphics. No discrete GPU. CPU and GPU share 128GB DDR5, roughly 60GB addressable as VRAM through ROCm 7.2. Cost about $3,000. Fits on a desk.
What They Cost to Run
I logged GPU power draw at 500-millisecond intervals during inference using nvidia-smi on the P40 server and rocm-smi on the Strix Halo. Same prompt, same models, same Ollama configuration. All models ran 100% on GPU.
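The sampling loop is easy to reproduce. A minimal sketch using nvidia-smi's CSV query mode; the helper names and defaults are mine, and rocm-smi needs different flags and parsing:

```python
import subprocess
import time

def parse_watts(output: str) -> float:
    """Sum per-GPU power readings (one number per line, as emitted by
    nvidia-smi's CSV mode) so multi-GPU draw is counted once per sample."""
    return sum(float(line) for line in output.split() if line.strip())

def sample_gpu_power(cmd: list[str], interval_s: float = 0.5,
                     duration_s: float = 10.0) -> float:
    """Poll GPU power draw at a fixed interval; return the average in watts."""
    samples = []
    end = time.monotonic() + duration_s
    while time.monotonic() < end:
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        samples.append(parse_watts(out))
        time.sleep(interval_s)
    return sum(samples) / len(samples)

# NVIDIA query command used on the P40 server.
NVIDIA_CMD = ["nvidia-smi", "--query-gpu=power.draw",
              "--format=csv,noheader,nounits"]
```

Summing across lines matters on the four-GPU box: idle cards still report ~9W each, and that draw belongs in the average.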
| Model | P40 tok/s | P40 GPU Power | Halo tok/s | Halo GPU Power |
|---|---|---|---|---|
| Llama 3.2 3B | 91.2 | 170W avg | 78.4 | 64W avg |
| Llama 3.1 8B | 47.5 | 278W avg | 40.2 | 82W avg |
| Llama 3.1 70B (4K ctx) | 6.3 | 278W avg | 5.6 | 81W avg |
The P40 server is 12-18% faster in raw throughput. It draws roughly three times the power. The 3B model lives on a single P40; the other three cards idle at ~9W each but still cost electricity. The 8B and 70B models span two GPUs while the other two idle. You always pay for cards that are not working. The Strix Halo has one GPU. No idle penalty.
GPU power is not total system power. The P40 server's Xeons, 252GB of RAM, dual PSUs, and fans add roughly 200W to the GPU figures. The Strix Halo's APU and DDR5 add roughly 40-60W. Conservative estimates for total system draw: 500W for the P40 under load, 120W for the Strix Halo.
At Minnesota residential electricity rates ($0.157/kWh), the cost per million tokens:
| Machine | 3B | 8B | 70B |
|---|---|---|---|
| P40 Server | $0.19/M | $0.46/M | $3.47/M |
| Strix Halo | $0.06/M | $0.13/M | $0.94/M |
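The arithmetic behind those figures is simple: time to produce a million tokens, times total system draw, times the electricity rate. A quick sketch (the function name is mine; the 8B figures use the total-system draw estimates above):

```python
def cost_per_million_tokens(tok_per_s: float, system_watts: float,
                            usd_per_kwh: float = 0.157) -> float:
    """Electricity cost of generating one million tokens at steady state."""
    hours = 1_000_000 / tok_per_s / 3600      # wall-clock time for 1M tokens
    kwh = hours * system_watts / 1000         # energy drawn over that time
    return kwh * usd_per_kwh

print(cost_per_million_tokens(47.5, 500))   # P40 server, 8B -> ~0.46
print(cost_per_million_tokens(40.2, 120))   # Strix Halo, 8B -> ~0.13
```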
Why the Per-Token Number Is Misleading
Those numbers look competitive with hosted inference, which runs $0.05 to $0.20 per million tokens for 8B-class models through providers like Together AI or Groq. The Strix Halo at $0.13/M sits squarely in that range. The P40 at $0.46/M does not.
But per-token cost during active inference is the wrong metric for two reasons.
Hardware amortization changes the math. The P40 server cost $2,500. The Strix Halo cost $3,000. Amortized over two years of continuous operation, that adds $0.14/hr and $0.17/hr respectively. On the 8B model, the all-in cost per million tokens rises to about $1.28 for the P40 and $1.31 for the Strix Halo. Both are more expensive than every hosted inference API for the same model.
Idle power is the dominant cost. The P40 server draws roughly 340W at idle: $38.50 per month whether I run a single query or not. The Strix Halo draws roughly 35W at idle: $4.20 per month. Over a year, idle electricity alone costs $462 on the P40 and $50 on the Strix Halo. If you are not using the hardware frequently, idle power overwhelms everything else in the cost model.
Per-token math at load flatters local inference by ignoring the hours when the hardware is doing nothing. It is like calculating your car's fuel economy only during highway driving and ignoring that it sits in the driveway 22 hours a day with the engine running.
Why I Run Both Anyway
The per-token economics favor API providers. The per-workload economics favor local hardware for specific tasks. TTS is the starkest example.
Generating a 20-minute blog narration on the Strix Halo takes about 45 minutes of inference at roughly 85W above idle power. The incremental electricity cost is about $0.02. The same narration through Google Cloud TTS would cost $4 to $10 depending on character count and voice tier.
That is a 200-to-500x cost difference on the marginal unit. And the marginal unit is what matters, because the question is never "should I generate TTS at all?" It is "should I regenerate the audio for this post I just edited?" or "should I try a different voice on this article?" or "should I narrate this niche post about PCB trace routing that maybe fifty people will listen to?"
At $4 to $10 per narration, the answer to all of those is "probably not." At $0.02, the answer is "why wouldn't I?" That shift from "probably not" to "why not" is the entire economic argument for owning TTS hardware. It is not about the average cost. It is about the marginal decision.
Before running local TTS, I narrated posts selectively with Google Cloud's Text-to-Speech. Some were too long or too niche to justify the GCP cost. Now every post gets audio. I regenerate after revisions without thinking about it. I have run the same post through three different TTS models to compare voice quality. I experiment with speaker voices, pacing parameters, and chunk sizes. The total volume of audio I have generated locally exceeds what I would have purchased from Google at any price point. This is Jevons Paradox at the smallest possible scale: make TTS cheap enough and I do not produce the same amount of TTS for less money; I produce vastly more TTS for slightly less money.
The same logic applies to code scanning. Any individual scan is cheap enough through an API. But the friction of metered pricing discourages the kind of speculative, exploratory analysis that turns up unexpected findings. When the marginal cost is zero, I scan more freely and more often. The value is not in any single scan; it is in the scans I would not have run otherwise.
The Strix Halo Problem
The most surprising result in the benchmarks is the Strix Halo's efficiency. An integrated APU with no discrete GPU delivers 40.2 tokens per second at 82W of GPU power. The P40 server delivers 47.5 tokens per second at 278W of GPU power. The P40 is 18% faster. The Strix Halo uses 70% less power. In performance per GPU watt, the Strix Halo (0.49 tok/s per watt) is nearly three times more efficient than the P40 (0.17 tok/s per watt).
This creates a problem for the P40 server's economics. The server's advantage is VRAM: 96GB lets it run 120B MoE models that the Strix Halo cannot fit. For the gpt-oss 120B model, the P40 server is the only viable option. But for everything 8B and below, the Strix Halo is cheaper to idle ($4.20/month vs. $38.50/month), cheaper per token ($0.13/M vs. $0.46/M), quieter, smaller, and only 18% slower, at a comparable purchase price ($3,000 vs. $2,500).
If I were building a local inference setup today from scratch and my workload was 8B models and TTS, I would buy the Strix Halo and nothing else. The P40 server earns its keep only through the large models that need its VRAM, and only because I assembled it well before the current RAM price spike.
This is worth sitting with for a moment, because it inverts the conventional wisdom about inference hardware. The enterprise GPU server that looks impressive on paper (four GPUs, 96GB VRAM, 2U rack mount) loses on total cost of ownership to a $3,000 mini desktop for the workloads that dominate my actual usage. The P40's raw throughput advantage is real but small. Its power cost advantage is negative. The VRAM advantage matters only for models most people do not run.
The Maintenance Tax
The per-token calculations ignore the cost of keeping these machines running. It is not zero.
I have had two kernel updates break the NVIDIA DKMS module on the P40 server. The AMD machine requires specific pre-release PyTorch wheels and environment variable overrides for ROCm to function on gfx1151 hardware. While running the benchmarks for this article, I discovered that Ollama on the Strix Halo had been running entirely on CPU because the systemd service file lacked the HSA_OVERRIDE_GFX_VERSION=11.5.1 variable. Every benchmark I had run on that machine prior to catching this was measuring CPU inference, not GPU inference. The fix took two minutes. Finding it took longer.
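The repair itself is a systemd drop-in so the service sees the variable. A sketch, assuming the stock `ollama.service` unit name (the override path is the standard drop-in location, but your distribution may differ):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=11.5.1"
```

Then `systemctl daemon-reload` and a service restart. Afterward, `ollama ps` should report the loaded model as 100% GPU rather than CPU, which is the check I should have run in the first place.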
The P40 server's fans run at full speed from October through April because the BMC interprets Minnesota winter temperatures as a hardware malfunction. The noise is audible from the house, 150 feet away.
None of this is catastrophic. All of it is time. And time spent debugging DKMS modules or adding environment variables to systemd units is time not spent on the work that the hardware is supposed to enable. A Claude Max subscription requires zero maintenance. The local hardware requires ongoing attention. That asymmetry does not show up in per-token cost tables, but it is real.
Who This Is For
Most people should not build a local inference server. If you use AI for interactive tasks (questions, code, analysis, writing), a frontier model subscription is a better product at a lower total cost than any local setup. The quality gap between a local 8B model and Claude or GPT-5.4 is not closing in the ways that matter for conversational use. Pay for the good models. Use them freely.
Local inference makes economic sense when you have a specific, high-volume, quality-tolerant workload that you will run often enough to justify hardware sitting on 24/7. TTS is the clearest case. Batch code analysis is another. If you cannot name the workload, you do not have one, and the hardware will cost you $40 to $50 per month in idle electricity to find out.
The split between frontier subscriptions and local batch processing is not a compromise. It is, for my usage, the correct architecture. The frontier model handles the work where quality determines value. The local hardware handles the work where volume determines cost. Neither replaces the other. The mistake is thinking they compete.