There is a window, maybe eighteen months wide, where enterprise hardware hits a pricing sweet spot. The first-generation buyers — the hyperscalers, the research labs, the Fortune 500 AI teams — have moved on to the next generation. The second-hand market floods. Prices crater. And if you know what you're looking for, you can build something genuinely capable for less than a month of cloud compute.
I built a four-GPU inference server for about twenty-five hundred dollars. This is the story of how, why, and whether you should do the same.
The Buy
The acquisition strategy is straightforward: eBay, patience, and knowing what to look for.
Tesla P40s started appearing in volume on the secondary market around 2023, when cloud providers and enterprise data centers began cycling them out in favor of A100s and H100s. A card that sold for over five thousand dollars new was suddenly available for three hundred, then two hundred and fifty, then — if you watched listings carefully and were willing to buy from decommissioned lot sellers — sometimes less. I picked up four cards over the course of about two months, averaging two hundred and fifty dollars each.
The chassis was a Penguin Computing 2U rack-mount server, also from eBay. These show up when government labs and research institutions liquidate equipment. The Penguin Computing systems are well-built — proper server-grade construction with redundant power supplies and engineered airflow. Mine takes a pair of Xeon E5-2697A v4 processors, both also sourced from eBay: eighteen Broadwell cores each, more than enough CPU to keep four GPUs fed. The chassis cost around six hundred dollars.
Memory was the lucky purchase. I bought 252GB of DDR4 ECC RAM before the memory price spike that hit in late 2024 when every company on Earth decided they needed AI infrastructure simultaneously. What I paid around two hundred and fifty dollars for would cost significantly more today. Total build: roughly twenty-five hundred dollars.
The Hardware
The Tesla P40 is a 2016-era data center GPU. NVIDIA designed it for the Pascal generation, targeting inference workloads in enterprise environments. The specifications, for something you can buy on eBay for two hundred and fifty dollars, are remarkable:
- 24GB GDDR5 per card — as much memory as an RTX 4090
- 3,840 CUDA cores — Pascal architecture, compute capability 6.1
- 12 TFLOPS FP32 — respectable even by 2026 standards for inference
- 250W TDP — this is a data center card and it draws power like one
Multiply by four and you get 96GB of VRAM for a thousand dollars. That is an extraordinary amount of GPU memory for the price. For context, a single NVIDIA A100 80GB still sells for north of five thousand dollars on the secondary market. Four P40s give you more total VRAM for a fraction of the cost.
What You Give Up
There is no free lunch in computing, and the P40 makes you pay for its low price in specific, sometimes painful ways.
No Tensor Cores. The P40 predates NVIDIA's Tensor Core architecture, which arrived with Volta in 2017. Tensor Cores accelerate matrix multiplication — the fundamental operation in neural network inference — by factors of 4x to 16x depending on precision. The P40 does everything with its CUDA cores, the old-fashioned way. This matters less than you might think for inference at moderate batch sizes, but it means you will never match the throughput of a V100 or newer card, clock for clock.
No native BF16 or usable FP16. This is the real gotcha. BF16 (bfloat16) has become the default precision for large language models; it is what most model weights are distributed in. The P40 cannot compute in BF16 at all — frameworks emulate it through FP32 operations, roughly 21% slower than native support — and Pascal's FP16 path is so heavily rate-limited that it is useless in practice. This means you are running quantized models (Q4, Q5, Q8) through llama.cpp or similar frameworks, which handle the precision conversion for you. It works. It is not optimal.
Passive cooling designed for server airflow. The P40 is a blower-style card designed for 1U and 2U server chassis with front-to-back forced airflow. In a proper server, this is fine. In anything else, you need to solve cooling yourself. I put mine in a Penguin Computing 2U rack-mount chassis, which has the right airflow characteristics, but this is not a card you drop into a desktop tower.
PCIe 3.0 x16. The P40 connects via PCIe 3.0, which provides about 16 GB/s of bandwidth per direction. When you are running a model that spans four GPUs, the inter-GPU communication goes over PCIe, not NVLink. This creates a bottleneck for models that require heavy cross-GPU communication. For inference, where the communication pattern is more predictable than training, this is manageable. For training, it would be a serious constraint.
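To see why PCIe is manageable for inference, it helps to put numbers on it. The sketch below estimates the raw time to move one token's activations across the GPU boundaries in a layer-split pipeline. The hidden dimension and FP32 activations are assumptions for illustration, not measurements from this build — the takeaway is that raw bandwidth is rarely the bottleneck for single-stream inference; per-transfer latency and synchronization are what hurt, especially under tensor-parallel splits that communicate every layer.

```python
# Back-of-envelope: per-token PCIe cost for a model split across GPUs
# by layers. Each token's hidden state crosses every GPU boundary once
# per forward pass. All figures are illustrative assumptions.

PCIE3_X16_BPS = 16e9   # ~16 GB/s per direction, PCIe 3.0 x16
HIDDEN_DIM = 8192      # assumed hidden size of a 70B-class dense model
BYTES_PER_VALUE = 4    # FP32 activations
NUM_GPUS = 4

def pcie_transfer_us(hidden_dim=HIDDEN_DIM, n_gpus=NUM_GPUS,
                     bytes_per_value=BYTES_PER_VALUE,
                     bandwidth=PCIE3_X16_BPS):
    """Microseconds spent moving one token's activations across the
    (n_gpus - 1) boundaries of a layer-split pipeline."""
    bytes_per_hop = hidden_dim * bytes_per_value
    hops = n_gpus - 1
    return bytes_per_hop * hops / bandwidth * 1e6

print(f"{pcie_transfer_us():.1f} us per token")  # microseconds, tiny
```

A few microseconds per token is negligible next to the tens of milliseconds of compute, which is why pipeline-style inference tolerates PCIe 3.0 while training, with its constant gradient all-reduces, would not.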
The Minnesota Problem
My server lives in an unheated shop building in northern Minnesota. This has created an issue that no hardware review will prepare you for.
When ambient temperatures drop below freezing — which, in Minnesota, means roughly October through April — the onboard temperature sensors report values that the baseboard management controller interprets as a malfunction. The BMC's response is to spin every fan to maximum RPM as a protective measure.
The result is a machine that, on quiet winter nights, is audible from the house. The house is a hundred and fifty feet away.
I have not solved this problem. I have learned to live with it. You can override BMC fan curves on some platforms, but the Penguin Computing firmware is locked down in ways that make this nontrivial, and frankly, a server that runs its fans at full speed because it thinks it is dying is doing exactly what it should be doing. The firmware's assumptions are just wrong for the environment.
The server runs 24/7 regardless of the season, and the cold air actually keeps the GPUs well within thermal limits — the irony is that the machine has never been cooler or louder than when it is twenty below zero outside. If you are considering a similar setup in a garage, basement, or outbuilding, factor in noise. A 2U server with four 250W GPUs is not quiet under any circumstances, and server-grade fans at full RPM are genuinely loud.
Setting Up the Software Stack
The driver situation for the P40 in 2026 is straightforward, though it was not always. NVIDIA's nvidia-driver-570-server package works cleanly on Ubuntu, and the DKMS module rebuilds automatically on kernel updates — most of the time. I have had exactly two occasions where a kernel update broke the NVIDIA module and required manual intervention. This is fewer than I expected.
For inference, I run Ollama, which wraps llama.cpp and provides a simple API for model management and inference. Ollama handles multi-GPU sharding automatically — when you load a model, it distributes layers across GPUs based on available memory and model size. A 65GB model like gpt-oss:120b fits across three of the four P40s, leaving one free. Smaller models may only need one or two cards. The allocation is generally sensible, though you have less control over placement than you would with raw llama.cpp.
The alternative stack — vLLM, TGI, or raw llama.cpp — offers more control over GPU assignment but requires more configuration. With llama.cpp directly, you can pin specific GPU layers to specific devices, which lets you optimize for the P40's memory topology. vLLM provides better batching and continuous batching for serving multiple concurrent requests. For a home lab where the primary use case is running various models for experimentation and development rather than serving production traffic, Ollama's simplicity wins.
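The layer-sharding behavior is easier to reason about with a toy model of it. The sketch below is a greedy allocator — fill each GPU in order until its budget is exhausted, then spill to the next. Ollama's real allocator is more sophisticated than this; the layer size, usable-memory figures, and layer count here are assumptions chosen to mirror a ~65GB model on 24GB cards.

```python
# Illustrative greedy layer sharding, NOT Ollama's actual algorithm:
# place layers on the current GPU until its memory budget runs out,
# then move to the next card.

def shard_layers(n_layers, layer_gb, gpu_budgets_gb):
    """Return (gpu_index, layer_count) pairs for each GPU used."""
    placement = []
    gpu, free, count = 0, gpu_budgets_gb[0], 0
    for _ in range(n_layers):
        if layer_gb > free:
            placement.append((gpu, count))
            gpu += 1
            if gpu >= len(gpu_budgets_gb):
                raise MemoryError("model does not fit in GPU memory")
            free, count = gpu_budgets_gb[gpu], 0
        free -= layer_gb
        count += 1
    placement.append((gpu, count))
    return placement

# Assumed: 36 layers of ~1.8GB (~65GB total), ~22GB usable per P40.
print(shard_layers(36, 1.8, [22, 22, 22, 22]))
```

With these assumed numbers the model lands on three cards and leaves the fourth idle, which matches the behavior described above for gpt-oss:120b.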
One thing worth noting: the P40 is well-supported by the GGUF ecosystem that llama.cpp (and therefore Ollama) uses. GGUF quantized models — Q4_K_M, Q5_K_M, Q8_0 — run without issues on Pascal hardware. The quantization handles the BF16 problem for you: model weights are stored in 4-bit or 8-bit integer formats and dequantized to FP32 at runtime, which the P40 handles natively. You are not fighting the hardware; you are working with it.
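The store-as-integers, dequantize-to-FP32 trick is simple enough to show in a few lines. This is a simplified sketch in the spirit of Q8_0 (one scale per 32-value block, int8 payload), not llama.cpp's actual implementation — but it is the shape of the mechanism that lets Pascal skip FP16 entirely.

```python
# Simplified Q8_0-style block quantization: store int8 values plus one
# scale per 32-value block, dequantize back to full-precision floats at
# runtime. Inspired by llama.cpp's format; not its actual code.

BLOCK = 32

def quantize_q8(block):
    """Map a block of floats to (scale, int8 values)."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in block]
    return scale, q

def dequantize_q8(scale, q):
    """Recover approximate floats — plain FP32 math, which Pascal
    handles natively, so the missing FP16/BF16 paths never matter."""
    return [scale * v for v in q]

weights = [0.031 * i - 0.5 for i in range(BLOCK)]
scale, q = quantize_q8(weights)
restored = dequantize_q8(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_err:.5f}")
```

The reconstruction error is bounded by half the scale factor — small enough that, at Q4/Q5/Q8 granularity, model quality holds up far better than the bit counts suggest.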
The Benchmarks
Theory is cheap. Benchmarks are what matter. I ran the same inference workload across three configurations: my four P40 home lab, a single AWS Tesla T4 instance, and a quad T4 instance on AWS. The T4 is the closest cloud comparison — it is the workhorse inference GPU in AWS's fleet, one generation newer than the P40 (Turing architecture, 2018), with 16GB of GDDR6 and actual Tensor Cores.
All benchmarks used Ollama with the same prompt, measuring tokens per second during the evaluation phase (excluding model load time).
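For anyone reproducing these numbers: Ollama's generate API reports `eval_count` (tokens produced) and `eval_duration` (nanoseconds spent producing them), which already excludes model load and prompt processing, so the throughput figure is a one-line division. The sample values below are made up for illustration.

```python
# Deriving tokens-per-second the way the tables below do: from Ollama's
# eval_count and eval_duration response fields (duration is nanoseconds
# and excludes model load time).

def tokens_per_second(eval_count, eval_duration_ns):
    """Decode-phase throughput in tokens per second."""
    return eval_count / (eval_duration_ns / 1e9)

# Illustrative values: 512 tokens generated in 10.85s of eval time.
rate = tokens_per_second(512, 10_850_000_000)
print(f"{rate:.1f} tok/s")  # 47.2 tok/s
```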
Dense Models
| Model | Parameters | 4x P40 (Home Lab) | 1x T4 (AWS \$0.53/hr) | 4x T4 (AWS \$3.91/hr) |
|---|---|---|---|---|
| Llama 3.2 | 3B | 94.3 tok/s | 81.5 tok/s | 101.5 tok/s |
| Qwen 2.5 | 7B | 52.7 tok/s | 36.9 tok/s | 40.3 tok/s |
| Llama 3.1 | 8B | 47.8 tok/s | 35.7 tok/s | 29.2 tok/s |
The P40 wins on the 7B and 8B models by substantial margins — 31% and 64% respectively over the quad T4 configuration. The only model where the T4 edges ahead is the 3B, which is small enough to fit entirely on a single GPU. Here, the T4's higher clock speeds and faster GDDR6 memory give it an advantage because there is no multi-GPU overhead to penalize it.
The 8B result is particularly interesting. The quad T4 actually performs worse than a single T4 on this model (29.2 vs 35.7 tok/s). Ollama shards the model across all four GPUs even though it fits on one, and the PCIe communication overhead between four T4s costs more than it gains. The P40, with its larger 24GB per-card memory, likely fits more of the model per GPU, reducing cross-GPU transfers.
The MoE Advantage
The most compelling benchmark comes from OpenAI's gpt-oss — a 120-billion parameter mixture-of-experts model with only 5.1 billion active parameters per token. The MoE architecture means the model's total weight is large (it needs the memory), but the computation per token is modest (only a fraction of the parameters fire for any given input).
| Model | Architecture | 4x P40 | 4x T4 (AWS \$3.91/hr) |
|---|---|---|---|
| gpt-oss | 120B MoE (5.1B active) | 28.1 tok/s | 20.6 tok/s |
The P40 runs OpenAI's 120B model at 28.1 tokens per second — 36% faster than the cloud instance, and fast enough for comfortable interactive use. This is a state-of-the-art model running on decade-old GPUs at a speed that would have been impressive on much newer hardware a year ago.
The reason is memory. The gpt-oss model uses MXFP4 quantization on its MoE weights, bringing the total model size to about 65GB. Four P40s offer 96GB of VRAM — enough to hold the entire model in GPU memory. Four T4s offer only 64GB, which means some of the model likely spills to system RAM, adding latency on every token.
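The memory arithmetic is worth spelling out, since it is the whole result. The sketch below uses an assumed 4GB allowance for KV cache and activation overhead — real loaders vary, so treat the margin, not the exact cutoff, as the point.

```python
# The VRAM arithmetic behind the MoE benchmark: does the model fit
# entirely in GPU memory, or does it spill to system RAM? The 4GB
# overhead allowance for KV cache/activations is an assumption.

def fits_in_vram(model_gb, cards, gb_per_card, overhead_gb=4.0):
    total = cards * gb_per_card
    return model_gb + overhead_gb <= total

print(fits_in_vram(65, 4, 24))  # 4x P40, 96GB total -> True
print(fits_in_vram(65, 4, 16))  # 4x T4, 64GB total -> False
```

Once any slice of the model spills to system RAM, every token pays the PCIe round-trip for those weights, which is consistent with the quad T4 falling behind despite its Tensor Cores.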
This is the P40's superpower: 24GB per card was overkill in 2016, and it is exactly right in 2026. Models have grown to fill the memory, and the P40 has more of it per dollar than almost anything else on the market.
Where It Falls Apart
Dense 70B models are a different story. Llama 3.1 70B at Q4_0 quantization (39GB) fits across 96GB of P40 VRAM, but the inference speed is essentially unusable: 0.033 tokens per second. One token every thirty seconds. Answering "What is 2+2?" took six and a half minutes. The combination of no Tensor Cores, PCIe 3.0 interconnect, and the sheer volume of cross-GPU data transfers for a dense 70B model pushes the per-token latency beyond any practical threshold.
The quad T4 on AWS managed 2.0 tokens per second on the same model — sixty times faster. Slow, but functional. The T4's Tensor Cores make the difference here — at this scale, the P40's raw CUDA cores simply cannot keep up with the matrix math.
The lesson: MoE models and quantized models up to about 8B parameters are the P40's sweet spot. Dense models above 13B start hitting diminishing returns. Dense 70B is a wall.
The Cost Argument
Here is the math that justifies the project.
A g4dn.12xlarge on AWS — four Tesla T4s, 48 vCPUs, 192GB RAM — costs \$3.91 per hour. My home lab outperforms it on every model except the smallest. If I run inference for just four hours a day, the cloud cost would be:
- Daily: \$15.64
- Monthly: \$469
- Yearly: \$5,709
My server cost \$2,500 to build. It pays for itself in roughly five months of equivalent cloud usage. After that, the only ongoing cost is electricity. At Minnesota residential rates (roughly \$0.12/kWh) and an average draw of 800W under load, that is about \$70 per month. Less than a single day of the equivalent cloud instance.
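The payback arithmetic above, spelled out so you can substitute your own rates. Counting the electricity bill stretches the break-even slightly past the five-month figure for gross cloud spend, but not by much.

```python
# Break-even calculation with the figures from the text: $2,500 build,
# $3.91/hr cloud equivalent used 4 hrs/day, 800W draw billed 24/7 at
# $0.12/kWh.

BUILD_COST = 2500.00
CLOUD_RATE_HR = 3.91
HOURS_PER_DAY = 4
WATTS = 800
KWH_PRICE = 0.12

cloud_monthly = CLOUD_RATE_HR * HOURS_PER_DAY * 30        # ~$469
power_monthly = WATTS / 1000 * 24 * 30 * KWH_PRICE        # ~$69

gross_months = BUILD_COST / cloud_monthly                 # ignoring power
net_months = BUILD_COST / (cloud_monthly - power_monthly) # counting power

print(f"payback: {gross_months:.1f} months "
      f"({net_months:.1f} counting electricity)")
```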
Even if you factor in the P40's lower performance on some workloads and assume you only get 70% of the cloud equivalent's utility, the break-even point is still well under a year. For a home lab that runs 24/7 for development, experimentation, and text-to-speech generation, the economics are overwhelming.
What I Actually Use It For
The server runs several workloads:
Local LLM inference. This is the primary use case. Having a local inference server with 96GB of VRAM means I can run frontier-class open-weight models without sending data to a cloud API. For development work — where I might make hundreds of inference calls while iterating on a project — the zero marginal cost changes how I work. I experiment more freely when each query costs nothing.
Text-to-speech. I run Qwen TTS on the P40s to generate audio narration for blog posts. The model fits comfortably in the P40's memory, and the generation speed is acceptable for batch processing. The narration you hear on posts across this site was generated on these GPUs.
Development and testing. When I am building projects like Sampo or Lattice, having local GPU compute available for testing AI-assisted workflows means I do not need to worry about API rate limits or costs during intensive development sessions.
The server sits on my local network at a static IP, accessible from any machine in the house. It is always on, always available, and always free to use. That availability changes your relationship with AI inference in ways that are hard to appreciate until you have lived with it. There is a psychological difference between "this costs two cents per query" and "this costs nothing per query." The first makes you think about whether the query is worth it. The second lets you experiment without friction — and that friction reduction, compounded across hundreds of daily interactions, fundamentally changes how you work.
This is, incidentally, a small-scale example of the Jevons Paradox I have been writing about in this blog's economics series. Making inference cheaper did not cause me to run the same number of queries and pocket the savings. It caused me to run dramatically more queries, on more models, for more projects, consuming more total compute than I ever would have purchased from a cloud provider. The efficiency created demand.
Should You Build One?
The honest answer is: it depends on what you value.
Build one if:

- You run local inference regularly and the cloud costs are adding up
- You want 96GB of VRAM for under a thousand dollars in GPU costs
- You have the physical space, electrical capacity, and noise tolerance for a rack-mount server
- You enjoy the process of building and configuring systems — this is not a plug-and-play experience
Do not build one if:

- You need the latest model performance (Tensor Cores, FP8, NVLink)
- You are training models, not running inference
- You need reliability guarantees — this is a home lab, not a production environment
- You are not comfortable with Linux system administration, driver debugging, and occasional hardware troubleshooting
The P40 window will not last forever. As newer GPUs age out of data centers — the V100, the A100 — the P40 will eventually lose its price-to-performance advantage. The V100, with its first-generation Tensor Cores and 32GB of HBM2, is already starting to appear at attractive secondary market prices. Within a year, it may be the new sweet spot. But right now, in early 2026, four P40s on eBay represent one of the best deals in GPU computing. Ninety-six gigabytes of VRAM, proven CUDA compatibility, and a decade of driver maturity, for the price of a weekend trip.
The server in my shop building will keep running. The fans will keep screaming through the Minnesota winter. And I will keep running models on hardware that a hyperscaler discarded three years ago, at speeds that would have been remarkable on any hardware five years ago. That is the beauty of the secondary market — someone else paid for the R&D, someone else paid for the depreciation, and you get the compute.