The rise of high-quality text-to-speech models has opened new possibilities for content creators, accessibility advocates, and developers alike. Qwen3-TTS, developed by Alibaba's Qwen team, represents a significant leap forward in neural TTS technology, offering natural-sounding speech synthesis with multiple speaker voices. In this guide, we'll walk through setting up Qwen3-TTS on AMD's Strix Halo platform—specifically the AI Max+ 395 with its integrated Radeon 8060S graphics—and demonstrate how we use it to generate audio narrations for blog posts right here on TinyComputers.
Why Qwen3-TTS?
The text-to-speech landscape has evolved dramatically over the past few years. While cloud-based services like Amazon Polly, Google Cloud TTS, and ElevenLabs offer impressive quality, they come with ongoing costs, privacy considerations, and internet dependency. Local TTS solutions have historically lagged behind in quality, often producing robotic or unnatural speech.
Qwen3-TTS changes this equation. The model produces remarkably natural speech with proper intonation, pacing, and emphasis. It supports multiple pre-trained speaker voices—including options like Eric, Aiden, Dylan, Serena, and others—each with distinct characteristics suitable for different content types. For technical content like our blog posts, the Eric voice provides clear, professional narration that listeners find easy to follow.
The model we're using, Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, weighs in at 1.7 billion parameters. While not small, this is manageable on modern hardware and runs efficiently on GPU. The 12Hz designation refers to the audio frame rate used during generation, balancing quality with computational requirements.
The Hardware: AMD AI Max+ 395
AMD's Strix Halo architecture represents their latest push into the high-performance APU market, combining powerful CPU cores with substantial integrated graphics. Our test system features:
- CPU: AMD Ryzen AI Max+ 395 with 16 Zen 5 cores (32 threads)
- GPU: Integrated Radeon 8060S (RDNA 3.5 architecture)
- Memory: 128GB unified LPDDR5X, configured with 96GB VRAM and 32GB system RAM
- Compute Units: 40 CUs dedicated to graphics/compute workloads
Our test system is the Bosgame M5 AI Mini Desktop, one of the first mini PCs to ship with AMD's Strix Halo silicon. The GMKtec EVO-X2 is an extremely similar system if you're looking to replicate this setup. The unified memory architecture is particularly relevant for machine learning workloads. Unlike discrete GPUs with their own VRAM, the Radeon 8060S shares system memory with the CPU. This means no PCIe bottleneck for data transfers, and with 96GB allocated as VRAM, even large models fit comfortably.
For our TTS workload, the 8060S provides adequate performance. The 1.7B parameter model fits comfortably in memory, and inference runs entirely on GPU once loaded. We see 100% GPU utilization during speech synthesis, indicating the hardware is being fully leveraged.
Setting Up the Environment
The first challenge with AMD GPUs is getting PyTorch working correctly with ROCm, AMD's open-source GPU compute stack. The Strix Halo uses a newer GPU architecture (gfx1151) that requires ROCm 6.x and some environment variable overrides.
Step 1: Create a Python Virtual Environment
We'll use a dedicated virtual environment to isolate our TTS dependencies:
```bash
mkdir -p ~/qwen-tts
cd ~/qwen-tts
python3 -m venv venv
source venv/bin/activate
```
Step 2: Install PyTorch with ROCm Support
The standard PyTorch installation won't work—we need the ROCm-enabled build. As of this writing, ROCm 6.4 is the latest stable release:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4
```
This downloads PyTorch builds compiled specifically for AMD GPUs. The installation is larger than the standard CUDA builds due to the different compute libraries involved.
Step 3: Install Qwen-TTS
With PyTorch in place, install the Qwen TTS package:
```bash
pip install qwen-tts soundfile numpy
```
The soundfile library handles WAV file I/O, while numpy is needed for audio array manipulation.
Step 4: Install xformers for ROCm (Optional but Recommended)
The xformers library provides optimized attention implementations that can improve performance:
```bash
pip install xformers --index-url https://download.pytorch.org/whl/rocm6.4
```
While Qwen-TTS will work without xformers, having it available enables more memory-efficient attention paths during inference.
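A quick import check confirms the package installed cleanly against the ROCm build of PyTorch:

```bash
python -c "import xformers; print(xformers.__version__)"
```

If this errors out, Qwen-TTS still runs; it simply won't have the xformers kernels available.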
Step 5: Configure Environment Variables
The Strix Halo's gfx1151 architecture isn't explicitly recognized by all ROCm components yet. We need to tell the system to treat it as a compatible architecture:
```bash
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export GPU_MAX_ALLOC_PERCENT=100
export GPU_MAX_HEAP_SIZE=100
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
```
Let's break down what these do:
- HSA_OVERRIDE_GFX_VERSION=11.0.0: Tells the HSA runtime to report the GPU as gfx1100, which has broader library support
- GPU_MAX_ALLOC_PERCENT=100: Allows the GPU to use up to 100% of available memory for allocations
- GPU_MAX_HEAP_SIZE=100: Similar memory allocation setting for heap operations
- TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1: Enables experimental efficient attention implementations for AMD GPUs
Add these to your .bashrc or create an activation script for convenience.
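If you go the activation-script route, a small wrapper like the following keeps the venv and the ROCm overrides together (the filename and venv path are just our convention; adjust them to your setup):

```bash
#!/usr/bin/env bash
# activate-tts.sh - source this before running any Qwen3-TTS scripts
source ~/qwen-tts/venv/bin/activate
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export GPU_MAX_ALLOC_PERCENT=100
export GPU_MAX_HEAP_SIZE=100
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
```

A single `source activate-tts.sh` then gets a shell ready for synthesis.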
Step 6: Verify GPU Detection
Before proceeding, confirm PyTorch can see your GPU:
```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")
```
You should see output like:
```
CUDA available: True
Device count: 1
Device name: AMD Radeon 8060S
```
Note that PyTorch uses "CUDA" terminology even for AMD GPUs when using ROCm—this is for API compatibility.
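If you want to be certain you're on a ROCm wheel rather than a CPU-only or CUDA build, PyTorch also reports the HIP version it was compiled against:

```python
import torch

# Prints a HIP version string on ROCm builds; None on CUDA or CPU-only builds
print(torch.version.hip)
```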
Basic TTS Usage
With the environment configured, let's test basic speech synthesis:
```python
from qwen_tts import Qwen3TTSModel
import soundfile as sf
import torch

# Load model on GPU with bfloat16 precision
model = Qwen3TTSModel.from_pretrained(
    'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice',
    attn_implementation='sdpa',
    device_map='cuda:0',
    dtype=torch.bfloat16
)

# Check available speakers
print(f"Available speakers: {model.get_supported_speakers()}")

# Generate speech
text = "Hello, and welcome to TinyComputers. Today we're exploring text-to-speech on AMD hardware."
audios, sample_rate = model.generate_custom_voice(
    text,
    speaker='eric',
    language='english'
)

# Save to file
sf.write('output.wav', audios[0], sample_rate)
print(f"Saved audio at {sample_rate}Hz")
```
A few important notes:
- We use `attn_implementation='sdpa'` for scaled dot-product attention, which works on ROCm
- The `device_map='cuda:0'` setting explicitly places the model on the GPU
- Using `dtype=torch.bfloat16` reduces memory usage while maintaining quality
- The language parameter must be the full word `'english'`, not the abbreviation `'en'`
Building a Blog-to-Speech Pipeline
For our use case—generating audio versions of blog posts—we need more than basic TTS. Blog posts contain markdown formatting, code blocks, images, and other elements that shouldn't be read aloud. We built a complete pipeline that handles these challenges.
The Blog Cleaner
Our cleaning process strips out non-spoken content while preserving the narrative flow:
````python
import re

def clean_markdown(content):
    # Remove YAML frontmatter
    if content.startswith('---'):
        end = content.find('---', 3)
        if end != -1:
            content = content[end + 3:]

    # Strip HTML tags (audio, video, images)
    content = re.sub(r'<audio[^>]*>[\s\S]*?</audio>', '', content, flags=re.IGNORECASE)
    content = re.sub(r'<video[^>]*>[\s\S]*?</video>', '', content, flags=re.IGNORECASE)
    content = re.sub(r'<img[^>]*/?>', '', content, flags=re.IGNORECASE)
    content = re.sub(r'<[^>]+>', '', content)

    # Remove markdown images and convert links to just text
    content = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', content)
    content = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', content)

    # Remove code blocks
    content = re.sub(r'```[\s\S]*?```', '', content)
    content = re.sub(r'`[^`]+`', '', content)

    # Convert headers to sentences
    content = re.sub(r'^(#{1,6})\s+(.+)$', r'\2.', content, flags=re.MULTILINE)

    # Remove emphasis markers
    content = re.sub(r'\*\*([^*]+)\*\*', r'\1', content)
    content = re.sub(r'\*([^*]+)\*', r'\1', content)

    return content
````
Unit Conversion for Speech
Technical content often includes abbreviations that sound awkward when read literally. We convert common units to their spoken forms:
```python
def convert_units_for_speech(text):
    text = re.sub(r'(\d+)\s*GB\b', r'\1 gigabytes', text)
    text = re.sub(r'(\d+)\s*MB\b', r'\1 megabytes', text)
    text = re.sub(r'(\d+)\s*GHz\b', r'\1 gigahertz', text)
    text = re.sub(r'(\d+)\s*MHz\b', r'\1 megahertz', text)
    text = re.sub(r'(\d+)\s*KB\b', r'\1 kilobytes', text)
    return text
```
Chunking Long Content
TTS models work best with moderate-length inputs. Very long passages can cause quality degradation or memory issues. We split content into chunks at sentence boundaries:
```python
def chunk_text(text, max_chars=500):
    chunks = []
    sentences = re.split(r'(?<=[.!?])\s+', text)
    current = ""

    for sentence in sentences:
        if len(current) + len(sentence) < max_chars:
            current += sentence + " "
        else:
            if current:
                chunks.append(current.strip())
            current = sentence + " "

    if current:
        chunks.append(current.strip())

    return chunks
```
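These helpers compose in a fixed order before synthesis: clean, convert units, then chunk. A quick dry run on any post (the filename here is just an example) shows what the model will actually receive:

```python
from pathlib import Path

raw = Path('example-post.md').read_text()
text = convert_units_for_speech(clean_markdown(raw))
chunks = chunk_text(text, max_chars=500)

print(f"{len(chunks)} chunks, longest is {max(len(c) for c in chunks)} characters")
```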
The Complete Script
Putting it all together, here's our blog_to_speech.py script:
```python
#!/usr/bin/env python3
import argparse
import re
import torch
from pathlib import Path
from qwen_tts import Qwen3TTSModel
import soundfile as sf
import numpy as np

# clean_markdown, convert_units_for_speech, and chunk_text from the
# sections above live in this same file.


def clean_blog_post(source):
    content = Path(source).read_text()
    # Apply the cleaning functions defined above
    content = clean_markdown(content)
    content = convert_units_for_speech(content)
    return content


def synthesize_speech(text, output_file, speaker="eric"):
    model = Qwen3TTSModel.from_pretrained(
        'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice',
        attn_implementation='sdpa',
        device_map='cuda:0',
        dtype=torch.bfloat16
    )

    chunks = chunk_text(text)
    all_audio = []

    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}")
        audios, sample_rate = model.generate_custom_voice(
            chunk,
            speaker=speaker,
            language='english'
        )
        all_audio.append(audios[0])

    combined = np.concatenate(all_audio)
    sf.write(output_file, combined, sample_rate)

    duration = len(combined) / sample_rate
    print(f"Saved {duration:.1f}s audio to: {output_file}")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('source', help='Blog post markdown file')
    parser.add_argument('-o', '--output', default='output.wav')
    parser.add_argument('--speaker', default='eric')
    args = parser.parse_args()

    text = clean_blog_post(args.source)
    synthesize_speech(text, args.output, args.speaker)
```
Choosing the Right Speaker Voice
Qwen3-TTS ships with nine pre-trained speaker voices, each with distinct characteristics:
| Speaker | Characteristics | Best For |
|---|---|---|
| Eric | Clear, professional male voice with measured pacing | Technical content, tutorials, documentation |
| Aiden | Younger male voice, slightly more casual | Blog posts, conversational content |
| Dylan | Deeper male voice with authoritative tone | Formal presentations, announcements |
| Ryan | Energetic male voice | Marketing content, product demos |
| Serena | Clear female voice, professional | Corporate content, tutorials |
| Vivian | Warm female voice | Storytelling, narrative content |
| Ono Anna | Female voice with distinct character | Creative content |
| Sohee | Female voice, versatile | General purpose |
| Uncle Fu | Character voice | Specialized applications |
For our technical blog content, we primarily use Eric. His clear enunciation and measured pacing work well for complex technical explanations. The voice handles acronyms, numbers, and technical terminology naturally, making it ideal for content about hardware, programming, and system administration.
You can easily switch voices by changing the speaker parameter:
```python
audios, sample_rate = model.generate_custom_voice(
    text,
    speaker='serena',  # Try different voices
    language='english'
)
```
Consider matching voice characteristics to content type. A hardware review might work better with Eric's authoritative tone, while a personal essay might benefit from Aiden's more conversational style.
Comparing TTS Options
Before settling on Qwen3-TTS, we evaluated several alternatives. Here's how they compare for our use case:
Cloud Services
Amazon Polly and Google Cloud TTS offer excellent quality with minimal setup. However, costs accumulate quickly for long-form content. At roughly \$4-16 per million characters (depending on voice quality), a 3000-word blog post costs \$0.10-0.40 per generation. For a site with dozens of posts requiring periodic regeneration, this adds up.
ElevenLabs produces arguably the most natural voices available, with impressive emotional range. But their pricing model—based on character quotas—makes it expensive for regular content generation. The quality is exceptional, but overkill for straightforward narration.
Local Alternatives
Coqui TTS (now deprecated) was a popular open-source option but development has stalled. Bark from Suno produces impressive results but runs slowly and lacks fine-grained control. XTTS offers voice cloning but requires more setup and compute resources.
Piper deserves special mention as a lightweight option. It runs quickly even on CPU and produces acceptable quality for many applications. However, the voices sound noticeably synthetic compared to Qwen3-TTS—fine for notifications or short snippets, but fatiguing for 30-minute narrations.
Qwen3-TTS hits a sweet spot: quality approaching cloud services, reasonable compute requirements, and fully local operation. The 1.7B parameter model is large enough for natural prosody but small enough to run on consumer hardware.
Batch Processing for Multiple Posts
When generating audio for multiple blog posts, efficiency matters. Loading the model takes 15-30 seconds, so we keep it loaded while processing multiple files:
#!/usr/bin/env python3 """Batch TTS processing for multiple blog posts""" import os from pathlib import Path from qwen_tts import Qwen3TTSModel import torch import soundfile as sf import numpy as np # Load model once print("Loading model...") model = Qwen3TTSModel.from_pretrained( 'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', attn_implementation='sdpa', device_map='cuda:0', dtype=torch.bfloat16 ) posts = [ 'post1.md', 'post2.md', 'post3.md', ] for post in posts: print(f"\n{'='*50}") print(f"Processing: {post}") print(f"{'='*50}") text = clean_blog_post(post) output = f"/tmp/{Path(post).stem}_tts.wav" # Process chunks chunks = chunk_text(text) all_audio = [] for i, chunk in enumerate(chunks): print(f" Chunk {i+1}/{len(chunks)}") audios, sr = model.generate_custom_voice( chunk, speaker='eric', language='english' ) all_audio.append(audios[0]) combined = np.concatenate(all_audio) sf.write(output, combined, sr) print(f"Saved: {output}")
This approach processes our five-post backlog overnight, with results ready for review in the morning.
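One way to run it unattended is a nightly cron entry; the paths below are placeholders for wherever your activation script and batch script actually live:

```bash
# Run the TTS batch job at 2 AM and keep a log of the output
0 2 * * * cd /home/user/qwen-tts && . ./activate-tts.sh && python batch_tts.py >> /tmp/batch_tts.log 2>&1
```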
Performance Characteristics
On the AI Max+ 395, speech synthesis runs at roughly one-third of real-time speed—meaning a 30-minute audio file takes around 90 minutes to generate. This is slower than high-end discrete GPUs but perfectly acceptable for batch processing.
For reference, here's how different content lengths performed in our testing:
| Content | Characters | Chunks | Audio Duration | Generation Time |
|---|---|---|---|---|
| Short post | 5,000 | 12 | ~5 min | ~15 min |
| Medium post | 15,000 | 35 | ~15 min | ~45 min |
| Long post | 25,000 | 55 | ~27 min | ~90 min |
| Very long | 40,000 | 85 | ~45 min | ~150 min |
The relationship between content length and generation time is roughly linear after the initial model warmup.
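That linearity makes planning easy. A rough estimator fitted to the table above (about 3.5 minutes of generation per 1,000 characters on this hardware, plus a minute or so of warmup) is enough for scheduling batch runs; the constants are just a fit to our own numbers, not anything the model guarantees:

```python
def estimate_generation_minutes(char_count, minutes_per_1k=3.5, warmup_minutes=1.0):
    """Back-of-the-envelope generation time, fitted to the table above."""
    return warmup_minutes + (char_count / 1000) * minutes_per_1k

# A 25,000-character post comes out around an hour and a half, in line with the table
print(f"~{estimate_generation_minutes(25000):.0f} minutes")
```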
Some observations from our testing:
- First chunk latency: The first chunk takes longer due to GPU kernel compilation and caching
- Memory usage: Peak usage around 8-10GB during inference
- GPU utilization: Consistent 100% during active synthesis
- Quality: Indistinguishable from cloud TTS services for most content
The MIOpen library sometimes logs workspace warnings during execution. These don't affect output quality and can be safely ignored:
```
MIOpen(HIP): Warning [IsEnoughWorkspace] Solver <GemmFwdRest>, workspace required: 103133184
```
Integrating Audio into Blog Posts
Once we have the WAV file, we convert to MP3 for web delivery and embed an HTML5 audio player:
```bash
ffmpeg -i blog_post.wav -codec:a libmp3lame -qscale:a 2 blog_post.mp3
```
For reviewing TTS output quality, we recommend using studio monitor headphones that reveal any artifacts or unnatural tones in the generated speech.
The player HTML is straightforward:
<div style="background: #f8f9fa; border: 1px solid #e9ecef; border-radius: 8px; padding: 16px 20px; margin: 20px 0;"> <div style="display: flex; align-items: center; margin-bottom: 10px;"> <span style="font-size: 20px; margin-right: 8px;">🎧</span> <span style="color: #495057; font-weight: 600;">Listen to this article</span> </div> <audio controls preload="metadata" style="width: 100%;"> <source src="/audio/blog_post.mp3" type="audio/mpeg"> </audio> <div style="color: #6c757d; font-size: 12px; margin-top: 8px;"> 27 min · AI-generated narration </div> </div>
Why We're Doing This
Adding audio narration to blog posts serves multiple purposes:
- Accessibility: Readers with visual impairments or reading difficulties can consume content aurally
- Convenience: Listeners can enjoy posts during commutes, workouts, or other activities
- Engagement: Audio content creates a more personal connection with the audience
- Reach: Some audiences prefer audio format, expanding our potential readership
Running TTS locally rather than using cloud services gives us:
- Cost control: No per-character or per-minute fees
- Privacy: Content never leaves our infrastructure
- Consistency: Same voice and quality across all posts
- Flexibility: Full control over processing pipeline
Troubleshooting Common Issues
"CUDA not available" despite GPU present
Ensure you've installed the ROCm version of PyTorch, not the standard build:
```bash
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4
```
Model runs on CPU instead of GPU
Check that `device_map='cuda:0'` is specified when loading the model. Also verify the environment variables are set before starting Python.
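A quick way to confirm the weights actually landed on the GPU is to check allocated GPU memory right after loading the model; with this 1.7B model in bfloat16 you should see several gigabytes in use:

```python
import torch

# Run after loading the model; a near-zero value means the weights stayed on the CPU
print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")
```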
"Unsupported language 'en'"
Use the full language name: `language='english'`, not `language='en'`.
Out of memory errors
Try reducing chunk size or using a smaller batch. The model should fit in 16GB, but very long chunks can spike memory usage.
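Since our chunker already takes a size limit, shrinking chunks is a one-argument change:

```python
# Smaller chunks lower peak memory at the cost of a little more per-chunk overhead
chunks = chunk_text(text, max_chars=300)
```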
Slow first chunk
This is normal—ROCm compiles GPU kernels on first use. Subsequent chunks process faster.
Future Improvements
Our current pipeline works well but has room for enhancement. Some improvements we're considering:
Voice cloning: Qwen3-TTS supports custom voice training. With sufficient audio samples, we could create a unique voice for TinyComputers rather than using the stock speakers. This would provide brand consistency and differentiation.
Automatic post detection: Currently we manually select posts for TTS generation. A CI/CD integration could automatically generate audio for new posts when they're published, keeping the audio library current without manual intervention.
Chapter markers: For longer posts, embedding chapter markers in the audio file would allow listeners to skip to specific sections. This requires parsing the markdown headers and mapping them to audio timestamps.
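A rough sketch of how that could work, assuming we split the post into (title, text) sections at each header, reuse the chunker and loaded model from earlier, and accumulate sample counts into start times; none of this is in our pipeline yet:

```python
import numpy as np

def synthesize_with_chapters(sections, model):
    """sections: list of (title, text) pairs split at markdown headers."""
    chapter_offsets = []   # (title, start offset in samples)
    all_audio = []
    total_samples = 0

    for title, body in sections:
        chapter_offsets.append((title, total_samples))
        for chunk in chunk_text(body):
            audios, sample_rate = model.generate_custom_voice(
                chunk, speaker='eric', language='english'
            )
            all_audio.append(audios[0])
            total_samples += len(audios[0])

    combined = np.concatenate(all_audio)
    # Convert sample offsets to seconds now that we know the sample rate
    chapters = [(title, offset / sample_rate) for title, offset in chapter_offsets]
    return combined, sample_rate, chapters
```

The resulting (title, seconds) pairs could then be written into the audio file as chapter metadata at encode time.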
Multiple format export: Beyond MP3, offering Opus or AAC formats could reduce file sizes while maintaining quality, benefiting listeners on metered connections.
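Encoding the existing WAV output to Opus, for instance, is a one-liner with ffmpeg; the bitrate here is just a plausible setting for speech:

```bash
ffmpeg -i blog_post.wav -c:a libopus -b:a 48k blog_post.opus
```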
Speed adjustment: Some listeners prefer 1.25x or 1.5x playback speed. Pre-generating speed-adjusted versions could provide better quality than real-time speed adjustment in the browser.
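Pre-rendering those variants is straightforward with ffmpeg's atempo filter, which changes tempo without shifting pitch:

```bash
ffmpeg -i blog_post.mp3 -filter:a "atempo=1.25" blog_post_1.25x.mp3
ffmpeg -i blog_post.mp3 -filter:a "atempo=1.5" blog_post_1.5x.mp3
```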
Conclusion
Running Qwen3-TTS on AMD's Strix Halo platform demonstrates that high-quality local TTS is now accessible beyond NVIDIA hardware. While setup requires some ROCm-specific configuration, the results are impressive—natural-sounding narration suitable for professional content.
The democratization of AI capabilities continues apace. What once required expensive cloud subscriptions or high-end NVIDIA GPUs now runs on integrated graphics. The AI Max+ 395's Radeon 8060S, primarily designed for gaming and general compute tasks, handles a 1.7-billion parameter language model without breaking a sweat.
We're actively using this pipeline to generate audio versions of posts across TinyComputers, making our technical content more accessible and convenient for our readers. As of this writing, we've processed our retrocomputing series, hardware reviews, and technical tutorials—dozens of hours of content generated entirely on local hardware.
The combination of AMD's capable integrated graphics and Qwen's excellent TTS model proves that you don't need expensive discrete GPUs or cloud subscriptions to achieve broadcast-quality speech synthesis. For content creators, educators, and accessibility advocates, this opens new possibilities for enriching written content with audio without ongoing service costs.
If you're running AMD hardware and want to add audio narration to your own content, this guide should get you started. The initial setup investment pays dividends in ongoing cost savings and the satisfaction of running capable AI models entirely on your own infrastructure. And if you encounter issues along the way, the troubleshooting section above addresses the most common pitfalls we discovered during our own setup process.
The audio player at the top of many TinyComputers posts now represents a small but meaningful step toward making technical content more accessible. Every post you can listen to while commuting, exercising, or doing dishes is content that might otherwise go unread. That's the real value of local TTS—not just cost savings, but expanded reach for the ideas we share.
