
The Tesla P40 server standing on its side in an unheated Minnesota shop building — one of three machines benchmarked for local TTS generation

Every post on this site has an audio version. A small player at the top, a few minutes of narration, generated entirely on local hardware. No cloud API, no per-character fees, no data leaving the network. I wrote about setting up the pipeline on AMD Strix Halo earlier this year, and the system has been running in production since — generating narrations for new posts, regenerating old ones when I revise them, and occasionally processing long-form content that would cost real money through Google Cloud TTS or ElevenLabs.

But I now have three machines capable of running Qwen3-TTS, and they could not be more different from each other. An Apple M3 Max laptop. An AMD Ryzen AI MAX+ 395 mini desktop with integrated Radeon graphics. And a four-GPU Tesla P40 server built from decade-old enterprise hardware bought on eBay. Three different silicon vendors, three different compute backends — MPS, ROCm, and CUDA — running the same model on the same text.

The question I wanted to answer is simple: how do they actually compare? Not on paper. Not in theoretical FLOPS. In wall-clock time, generating real audio from a real blog post.

The answer turned out to be more interesting than I expected, because the numbers tell a story about hardware architecture that raw specifications completely miss.

The Setup

The model is Qwen3-TTS-12Hz-1.7B-CustomVoice, a 1.7 billion parameter autoregressive text-to-speech model from Alibaba's Qwen team. It generates natural-sounding speech with multiple speaker voices. I use the Eric voice for all blog narrations — clear, professional, well-paced for technical content.

The three machines:

Apple M3 Max — a MacBook Pro with Apple's M3 Max chip. 14 CPU cores, 30 GPU cores, 64GB unified memory. The GPU runs through PyTorch's MPS (Metal Performance Shaders) backend. This is my daily driver laptop, and it generates TTS when I am writing and editing posts.

AMD Radeon 8060S — a Bosgame M5 mini desktop running AMD's Ryzen AI MAX+ 395. This is a Strix Halo APU with integrated RDNA 3.5 graphics — not a discrete GPU. It shares 128GB of DDR5 system memory with the CPU, with roughly 96GB addressable as VRAM. The GPU runs through ROCm 7.2 with PyTorch 2.9.1. The gfx1151 architecture requires specific PyTorch wheels from AMD's pre-release index and several environment variable overrides to function. I wrote a full setup guide for this machine.

NVIDIA Tesla P40 — a 2U rack-mount server with four Tesla P40 GPUs, each with 24GB of GDDR5. Pascal architecture from 2016. Compute capability 6.1. No Tensor Cores, no native bfloat16 support. The benchmark uses a single P40, since Qwen TTS runs on one GPU. This machine lives in an unheated shop building in Minnesota and screams through the winter when the BMC misinterprets sub-zero ambient temperatures as a hardware malfunction.

All three machines run the same model checkpoint, the same text input, and the same speaker voice. The only differences are the silicon and the compute backend.

The Benchmark

I used a standardized 2,411-character passage — five paragraphs on the Jevons Paradox, dense enough to exercise the model's prosody and pacing on real written content. Each machine ran three consecutive generations from the same loaded model, producing roughly three minutes of audio per run. The first run includes kernel compilation and cache warmup; subsequent runs reflect steady-state performance.

The metric that matters is Real-Time Factor (RTF): how many seconds of wall-clock time it takes to generate one second of audio. An RTF of 1.0 means the model generates audio at exactly real-time speed. Below 1.0 is faster than real-time. Above 1.0 means you are waiting.
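Concretely, RTF is just generation wall-clock time divided by audio duration. A minimal helper (the function name is mine, not part of the benchmark harness), using two runs from the tables below as examples:

```python
def rtf(generation_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: wall-clock seconds spent per second of audio produced."""
    return generation_seconds / audio_seconds

# M3 Max run 3: 447.8s of generation for 179.2s of audio
print(round(rtf(447.8, 179.2), 2))  # → 2.5

# M3 Max run 1 (cold start): 698.5s for 197.7s of audio
print(round(rtf(698.5, 197.7), 2))  # → 3.53
```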

Individual Runs

Apple M3 Max (MPS)

| Run | Generation Time | Audio Length | RTF |
|---------|--------|--------|------|
| 1 | 698.5s | 197.7s | 3.53 |
| 2 | 533.1s | 184.2s | 2.89 |
| 3 | 447.8s | 179.2s | 2.50 |
| Average | 559.8s | 187.0s | 2.97 |

AMD Radeon 8060S (ROCm)

| Run | Generation Time | Audio Length | RTF |
|---------|--------|--------|------|
| 1 | 729.2s | 173.6s | 4.20 |
| 2 | 460.0s | 204.8s | 2.25 |
| 3 | 548.2s | 214.2s | 2.56 |
| Average | 579.1s | 197.5s | 3.00 |

NVIDIA Tesla P40 (CUDA)

| Run | Generation Time | Audio Length | RTF |
|---------|---------|--------|------|
| 1 | 1511.4s | 204.1s | 7.41 |
| 2 | 1225.7s | 171.6s | 7.14 |
| 3 | 1537.2s | 206.7s | 7.44 |
| Average | 1424.8s | 194.1s | 7.33 |

Summary

| Machine | GPU | Avg RTF | Best RTF | Avg Gen Time |
|-------------|----------------------|------|------|---------|
| MacBook Pro | M3 Max (MPS) | 2.97 | 2.50 | 559.8s |
| Bosgame M5 | Radeon 8060S (ROCm) | 3.00 | 2.25 | 579.1s |
| Penguin 2U | Tesla P40 (CUDA) | 7.33 | 7.14 | 1424.8s |

What the Numbers Mean

The headline result is that the M3 Max and Radeon 8060S are essentially tied, and the Tesla P40 is roughly 2.4 times slower than both. But that summary hides the interesting details.

The Warmup Effect Is Massive

On both the M3 Max and the Radeon 8060S, the first run is dramatically slower than subsequent runs. The M3 Max goes from RTF 3.53 on run 1 to RTF 2.50 on run 3 — a 29% improvement. The AMD shows an even larger swing: RTF 4.20 on run 1 dropping to RTF 2.25 on run 2, a 46% improvement.

This is kernel compilation. Both MPS and ROCm compile GPU kernels on first use and cache them for subsequent calls. The Qwen TTS model hits a wide variety of kernel shapes during autoregressive generation — different sequence lengths, different attention patterns — and each new shape triggers a compilation on the first encounter. By run 2, most of the common shapes are cached, and performance stabilizes.

The P40 shows almost no warmup effect. RTF 7.41 on run 1, 7.14 on run 2, 7.44 on run 3. CUDA's kernel compilation is faster and more mature, so the overhead is absorbed within the first few seconds rather than spread across the entire run. But this maturity does not translate into faster inference — CUDA compiles faster, but the P40's hardware is fundamentally slower at the operations this model requires.

This has a practical implication that matters: short benchmarks on MPS and ROCm are misleading. I initially ran a quick 276-character test on all three machines before doing the full benchmark. The short test showed the AMD at RTF 9.20 — almost identical to the P40's RTF 10.01, and far behind the M3 Max's RTF 2.84. That result nearly led me to conclude the AMD was performing as poorly as decade-old hardware. The longer benchmark, with its warmup effect amortized across more generation, revealed the truth: the AMD is just as fast as the M3 Max once the kernels are cached. If I had stopped at the short test, I would have drawn exactly the wrong conclusion.
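The warmup percentages quoted above fall straight out of the run tables. A small sketch (the helper name is my own) that reproduces them:

```python
def warmup_improvement(first_rtf: float, steady_rtf: float) -> float:
    """Percent improvement from the cold first run to a warm run.
    Lower RTF is faster, so improvement is the drop relative to run 1."""
    return (first_rtf - steady_rtf) / first_rtf * 100

print(round(warmup_improvement(3.53, 2.50)))  # M3 Max, run 1 → run 3: 29
print(round(warmup_improvement(4.20, 2.25)))  # Radeon 8060S, run 1 → run 2: 46
```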

Why the P40 Is So Slow

The Tesla P40 is a Pascal-generation GPU from 2016. It has 3,840 CUDA cores and 24GB of GDDR5 memory. On paper, it should be competitive — 12 TFLOPS of FP32 compute is not trivial. And for LLM inference through Ollama, the P40 performs remarkably well, outperforming quad T4 instances on models up to 8B parameters.

TTS is a different workload. Qwen3-TTS is an autoregressive transformer that generates audio tokens one at a time, each conditioned on all previous tokens. That makes the decode loop heavily memory-bandwidth bound, punctuated by compute-heavy attention and feedforward passes. The model is distributed in bfloat16 precision, which the P40 cannot compute natively — Pascal predates bfloat16 support entirely. PyTorch silently falls back to fp32 for bf16 operations on the P40, roughly doubling the arithmetic and memory traffic per operation.
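For context on what "cannot compute natively" means: bfloat16 is simply fp32 with the low 16 mantissa bits dropped, which is why hardware without bf16 units ends up doing full fp32 work to emulate it. A stdlib-only sketch of the truncation:

```python
import struct

def truncate_to_bf16(x: float) -> float:
    """Round-trip a value through bfloat16 by keeping only the top 16 bits
    of its fp32 representation (1 sign, 8 exponent, 7 mantissa bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# Same exponent range as fp32, but only ~2-3 decimal digits of precision
print(truncate_to_bf16(3.14159))  # → 3.140625
```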

The P40 also lacks the hardware features behind the fused scaled-dot-product-attention (SDPA) kernels that newer architectures benefit from. On the M3 Max, MPS routes attention through Metal's optimized primitives. On the AMD, ROCm's AOTriton provides experimental flash attention support. On the P40, attention falls back to plain CUDA kernels with none of these optimizations. For a model that generates thousands of autoregressive steps per audio clip, each involving a full attention pass over the growing sequence, this compounds dramatically.

The P40 is not bad hardware. It is excellent hardware for the workloads it was designed for — batch inference on quantized LLMs where its 24GB of VRAM per card creates a memory advantage. But autoregressive TTS in bfloat16 hits every one of its architectural weaknesses simultaneously.

Unified Memory Wins This Workload

Both the M3 Max and the Radeon 8060S use unified memory architectures — the CPU and GPU share the same physical memory pool. The M3 Max has 64GB of unified LPDDR5. The Radeon 8060S shares 128GB of DDR5 with the CPU, with roughly 96GB addressable as VRAM.

For a 1.7B parameter model in bf16, the weights occupy roughly 3.4GB. The model fits comfortably on all three machines. But the autoregressive generation pattern creates a stream of intermediate activations — KV cache entries, attention scores, feedforward intermediates — that grow with the sequence length. On a unified memory architecture, these intermediates exist in the same memory space as the model weights, avoiding any PCIe transfer overhead. On the P40, every interaction between CPU and GPU crosses a PCIe 3.0 bus.
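The weight arithmetic is easy to sanity-check. A back-of-envelope sketch (helper name mine), showing both the bf16 footprint and what fp32 would cost for comparison:

```python
def weight_footprint_gb(params: float, bytes_per_param: int) -> float:
    """Model weight memory only — activations and KV cache come on top."""
    return params * bytes_per_param / 1e9

print(weight_footprint_gb(1.7e9, 2))  # bf16, 2 bytes/param → 3.4
print(weight_footprint_gb(1.7e9, 4))  # fp32 for comparison → 6.8
```

The KV cache is harder to pin down without the model's exact layer count and head dimensions, but it grows linearly with the generated sequence, which is what makes the access pattern bandwidth-sensitive.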

For LLM inference, where the bottleneck is token generation throughput and the KV cache fits in VRAM, the P40's discrete memory is fine. For TTS, where the model generates hundreds of audio tokens per second of speech and the attention window grows continuously, the memory access pattern favors unified architectures.

This is not a universal statement about unified versus discrete memory. A modern discrete GPU with HBM2e or GDDR6X and PCIe 4.0 or 5.0 would likely outperform both the M3 Max and the Radeon 8060S on this workload. The P40's problem is not that its memory is discrete — it is that its memory is slow and its bus is narrow by 2026 standards.

The Model Architecture Question

While benchmarking Qwen TTS, I also ran a quick comparison with F5-TTS on the AMD machine to sanity-check the results. F5-TTS is a flow-matching model — fundamentally different from Qwen's autoregressive approach. Where Qwen generates audio tokens sequentially, each conditioned on all previous tokens, F5 generates audio in parallel through an iterative refinement process.

The difference is stark. On the same Radeon 8060S, the same text, the same hardware:

| Model | Generation Time | Audio Length | RTF |
|-----------|--------------|--------|------|
| Qwen3-TTS | 579.1s (avg) | 197.5s | 3.00 |
| F5-TTS | 17.4s | 27.2s | 0.64 |

F5-TTS is faster than real-time. Qwen3-TTS takes three times longer than the audio it produces. Normalized by audio length, F5 is roughly four to five times faster than Qwen, and the gap widens on shorter content, where Qwen's warmup overhead is proportionally larger.

This is not an apples-to-apples quality comparison. Qwen3-TTS generally produces more natural prosody, better handling of complex sentence structures, and more consistent speaker identity across long passages. F5-TTS is excellent but can occasionally drift in voice character or pacing on very long content. For blog narration, both are well above the threshold of "good enough," and the quality difference is smaller than you might expect given the architectural gap.

The point is that hardware is only half the story. The choice of model architecture can matter more than the choice of GPU. A flow-matching model on integrated AMD graphics outperforms an autoregressive model on Apple's best laptop silicon by a wide margin. If generation speed is the constraint, switching models gains more than switching hardware.

What This Costs in Practice

The abstract benchmark numbers translate into concrete time and electricity costs when you are generating audio for a library of blog posts.

A typical TinyComputers post runs 3,000 to 5,000 words, producing 15 to 25 minutes of narrated audio. At steady-state RTF:

| Machine | 15 min audio | 25 min audio | System Power |
|--------------|----------|----------|-------|
| M3 Max | ~38 min | ~63 min | ~50W |
| Radeon 8060S | ~38 min | ~63 min | ~100W |
| Tesla P40 | ~110 min | ~183 min | ~400W |

The M3 Max and Radeon 8060S are tied on generation time, but the M3 Max draws roughly half the system power. For a single post, the electricity cost difference is negligible — a fraction of a cent. For batch processing a backlog of thirty posts, the M3 Max costs about \$0.18 in electricity versus \$0.36 for the AMD and \$3.50 for the P40.
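Those batch figures are straightforward arithmetic. A hedged sketch, assuming a \$0.14/kWh residential rate and per-post generation times consistent with the table above (both numbers are my assumptions, chosen to reproduce the article's estimates):

```python
def electricity_cost_usd(watts: float, hours: float,
                         rate_per_kwh: float = 0.14) -> float:
    """Energy cost of a generation batch. The $0.14/kWh default is an
    assumed rate, roughly in line with Minnesota residential pricing."""
    return watts / 1000 * hours * rate_per_kwh

# Thirty-post backlog: ~51 min/post on the M3 Max, ~125 min/post on the P40
print(round(electricity_cost_usd(50, 30 * 51 / 60), 2))    # M3 Max → 0.18
print(round(electricity_cost_usd(400, 30 * 125 / 60), 2))  # Tesla P40 → 3.5
```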

None of these numbers are alarming. Even the P40, at nearly two and a half hours per post and 400 watts from the wall, costs under fifteen cents in electricity per narration at Minnesota residential rates. The equivalent Google Cloud TTS job would cost \$4 to \$16 per million characters depending on the voice quality tier: well under a dollar for a typical post, but it compounds across a library and across every regeneration.

To put cloud costs in perspective: I recently ran a fiction novel through Google's Chirp3-HD voice — 82,000 words, roughly 500,000 characters of text plus SSML markup. The bill came to \$17.25 at Google's rate of \$30 per million characters. That is not unreasonable for a one-off project, but it adds up quickly if you are generating audio regularly. The entire library of TinyComputers narrations — dozens of posts, hours of audio — has cost me nothing beyond the electricity to run the machines I already own. The economics of local TTS are favorable on every machine in the comparison.
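The per-character billing is simple to model. Working backward from the stated rate, the \$17.25 bill implies about 575,000 billable characters once the SSML markup is counted — that character count is my inference, not a figure from the invoice:

```python
def cloud_tts_cost_usd(characters: int, rate_per_million: float) -> float:
    """Cloud TTS billing: a flat rate per million input characters."""
    return characters * rate_per_million / 1e6

# The novel, at Chirp3-HD's $30 per million characters
print(cloud_tts_cost_usd(575_000, 30))  # → 17.25

# A typical ~25k-character blog post at the same rate
print(cloud_tts_cost_usd(25_000, 30))   # → 0.75
```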

The real cost is time. If I am generating audio for a single new post, I start it on whichever machine is idle and check back in an hour. If I am regenerating audio for twenty posts after changing the speaker voice or updating the pipeline, the M3 Max or AMD will finish overnight. The P40 would take most of a weekend.

The Right Machine for the Job

After running these benchmarks, my workflow has shifted. The M3 Max is the default for new post narration — it is fast, quiet, and I am usually sitting in front of it when I finish writing. The AMD handles batch jobs and overnight processing, where its slightly higher power draw does not matter and its equivalent speed makes it interchangeable with the Mac. The P40 server is reserved for what it does best: running large language models through Ollama, where its 96GB of aggregate VRAM gives it an advantage that neither the Mac nor the AMD can match.

The P40 can still generate TTS in a pinch, and it does — when both other machines are occupied, I will queue a job on the P40 and accept the longer wait. But for a workload that is inherently autoregressive, memory-bandwidth sensitive, and dependent on bf16 precision, a ten-year-old Pascal GPU is the wrong tool.

What surprised me most is how well the AMD performs. The Radeon 8060S is an integrated GPU sharing system memory with the CPU. It has no HBM, no dedicated VRAM, no NVLink. Its ROCm software stack requires environment variable hacks, pre-release PyTorch wheels, and a GFX version override to function at all. And yet, once the kernels warm up, it matches Apple's best laptop silicon stride for stride. The raw hardware is there — 40 RDNA 3.5 compute units with access to a deep pool of DDR5 memory. The software just needs to get out of the way, and on run 2 and beyond, it does.

Lessons

Three takeaways from this exercise that generalize beyond TTS:

Short benchmarks lie. Kernel compilation overhead on MPS and ROCm is large enough to dominate a short test. If you are evaluating a new model on non-CUDA hardware, run it at least twice before drawing conclusions. The first run is measuring the software stack, not the hardware.
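That advice can be baked into the harness itself. A sketch (function name mine) that discards the cold first run before averaging, applied to the M3 Max data from this benchmark:

```python
def steady_state_rtf(runs: list[tuple[float, float]]) -> float:
    """Average RTF with the first (cold) run discarded. On MPS or ROCm the
    first run mostly measures kernel compilation, not the hardware."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to discard the warmup run")
    warm = runs[1:]
    return sum(gen / audio for gen, audio in warm) / len(warm)

# (generation_seconds, audio_seconds) per run, from the M3 Max table
m3_runs = [(698.5, 197.7), (533.1, 184.2), (447.8, 179.2)]
print(round(steady_state_rtf(m3_runs), 2))  # → 2.7
```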

Architecture matters more than clock speed. The P40 has more raw FLOPS than the Radeon 8060S. It does not matter. The P40 lacks native bf16, lacks efficient attention primitives, and sits behind a PCIe 3.0 bus. The Radeon has all three — and ties a chip designed by Apple's custom silicon team. For autoregressive models, the architectural fit between model and hardware dominates everything else.

Model choice can outweigh hardware choice. F5-TTS running on the weakest GPU in this comparison is five times faster than Qwen3-TTS running on the strongest. If your constraint is generation speed and you can accept a modest quality trade-off, switching to a flow-matching architecture gains more than any hardware upgrade short of a data center GPU.

The audio player at the top of each post on this site represents a few minutes of machine time on one of these three machines. Which machine generated it depends on the day, the workload, and what else is running. The listener cannot tell the difference. The audio sounds the same regardless of whether it was generated on a laptop, a mini desktop, or a rack-mount server in a cold Minnesota shop. That is the real benchmark — not which machine is fastest, but that all three are fast enough.