The rise of high-quality text-to-speech models has opened new possibilities for content creators, accessibility advocates, and developers alike. Qwen3-TTS, developed by Alibaba's Qwen team, represents a significant leap forward in neural TTS technology, offering natural-sounding speech synthesis with multiple speaker voices. In this guide, we'll walk through setting up Qwen3-TTS on AMD's Strix Halo platform—specifically the AI Max+ 395 with its integrated Radeon 8060S graphics—and demonstrate how we use it to generate audio narrations for blog posts right here on TinyComputers.
Why Qwen3-TTS?
The text-to-speech landscape has evolved dramatically over the past few years. While cloud-based services like Amazon Polly, Google Cloud TTS, and ElevenLabs offer impressive quality, they come with ongoing costs, privacy considerations, and internet dependency. Local TTS solutions have historically lagged behind in quality, often producing robotic or unnatural speech.
Qwen3-TTS changes this equation. The model produces remarkably natural speech with proper intonation, pacing, and emphasis. It supports multiple pre-trained speaker voices—including options like Eric, Aiden, Dylan, Serena, and others—each with distinct characteristics suitable for different content types. For technical content like our blog posts, the Eric voice provides clear, professional narration that listeners find easy to follow.
The model we're using, Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, weighs in at 1.7 billion parameters. While not small, this is manageable on modern hardware and runs efficiently on GPU. The 12Hz designation refers to the audio frame rate used during generation, balancing quality with computational requirements.
The Hardware: AMD AI Max+ 395
AMD's Strix Halo architecture represents their latest push into the high-performance APU market, combining powerful CPU cores with substantial integrated graphics. Our test system features:
- CPU: AMD Ryzen AI Max+ 395 with 16 Zen 5 cores (32 threads)
- GPU: Integrated Radeon 8060S (RDNA 3.5 architecture)
- Memory: 128GB unified LPDDR5X, configured with 96GB VRAM and 32GB system RAM
- Compute Units: 40 CUs dedicated to graphics/compute workloads
Our test system is the Bosgame M5 AI Mini Desktop, one of the first mini PCs to ship with AMD's Strix Halo silicon. The GMKtec EVO-X2 is an extremely similar system if you're looking to replicate this setup. The unified memory architecture is particularly relevant for machine learning workloads. Unlike discrete GPUs with their own VRAM, the Radeon 8060S shares system memory with the CPU. This means no PCIe bottleneck for data transfers, and with 96GB allocated as VRAM, even large models fit comfortably.
For our TTS workload, the 8060S provides adequate performance. The 1.7B parameter model fits comfortably in memory, and inference runs entirely on GPU once loaded. We see 100% GPU utilization during speech synthesis, indicating the hardware is being fully leveraged.
Setting Up the Environment
The first challenge with AMD GPUs is getting PyTorch working correctly with ROCm, AMD's open-source GPU compute stack. The Strix Halo uses a newer GPU architecture (gfx1151) that requires ROCm 6.x and some environment variable overrides.
Step 1: Create a Python Virtual Environment
We'll use a dedicated virtual environment to isolate our TTS dependencies:
```bash
mkdir -p ~/qwen-tts
cd ~/qwen-tts
python3 -m venv venv
source venv/bin/activate
```
Step 2: Install PyTorch with ROCm Support
The standard PyTorch installation won't work—we need the ROCm-enabled build. As of this writing, ROCm 6.4 is the latest stable release:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4
```
This downloads PyTorch builds compiled specifically for AMD GPUs. The installation is larger than the standard CUDA builds due to the different compute libraries involved.
Step 3: Install Qwen-TTS
With PyTorch in place, install the Qwen TTS package:
```bash
pip install qwen-tts soundfile numpy
```
The soundfile library handles WAV file I/O, while numpy is needed for audio array manipulation.
Step 4: Install xformers for ROCm (Optional but Recommended)
The xformers library provides optimized attention implementations that can improve performance:
```bash
pip install xformers --index-url https://download.pytorch.org/whl/rocm6.4
```
While Qwen-TTS will work without xformers, having it available enables more memory-efficient attention paths during inference.
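A quick import check confirms the package installed cleanly against the ROCm build of PyTorch:

```bash
python -c "import xformers; print(xformers.__version__)"
```

If this errors out, Qwen-TTS still runs; it simply won't have the xformers kernels available.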
Step 5: Configure Environment Variables
The Strix Halo's gfx1151 architecture isn't explicitly recognized by all ROCm components yet. We need to tell the system to treat it as a compatible architecture:
```bash
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export GPU_MAX_ALLOC_PERCENT=100
export GPU_MAX_HEAP_SIZE=100
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
```
Let's break down what these do:
- HSA_OVERRIDE_GFX_VERSION=11.0.0: Tells the HSA runtime to report the GPU as gfx1100, which has broader library support
- GPU_MAX_ALLOC_PERCENT=100: Allows the GPU to use up to 100% of available memory for allocations
- GPU_MAX_HEAP_SIZE=100: Similar memory allocation setting for heap operations
- TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1: Enables experimental efficient attention implementations for AMD GPUs
Add these to your .bashrc or create an activation script for convenience.
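If you go the activation-script route, a small wrapper like the following keeps the venv and the ROCm overrides together (the filename and venv path are just our convention; adjust them to your setup):

```bash
#!/usr/bin/env bash
# activate-tts.sh - source this before running any Qwen3-TTS scripts
source ~/qwen-tts/venv/bin/activate
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export GPU_MAX_ALLOC_PERCENT=100
export GPU_MAX_HEAP_SIZE=100
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
```

A single `source activate-tts.sh` then gets a shell ready for synthesis.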
Step 6: Verify GPU Detection
Before proceeding, confirm PyTorch can see your GPU:
```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
print(f"Device name: {torch.cuda.get_device_name(0)}")
```
You should see output like:
```
CUDA available: True
Device count: 1
Device name: AMD Radeon 8060S
```
Note that PyTorch uses "CUDA" terminology even for AMD GPUs when using ROCm—this is for API compatibility.
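If you want to be certain you're on a ROCm wheel rather than a CPU-only or CUDA build, PyTorch also reports the HIP version it was compiled against:

```python
import torch

# Prints a HIP version string on ROCm builds; None on CUDA or CPU-only builds
print(torch.version.hip)
```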
Basic TTS Usage
With the environment configured, let's test basic speech synthesis:
```python
from qwen_tts import Qwen3TTSModel
import soundfile as sf
import torch

# Load model on GPU with bfloat16 precision
model = Qwen3TTSModel.from_pretrained(
    'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice',
    attn_implementation='sdpa',
    device_map='cuda:0',
    dtype=torch.bfloat16
)

# Check available speakers
print(f"Available speakers: {model.get_supported_speakers()}")

# Generate speech
text = "Hello, and welcome to TinyComputers. Today we're exploring text-to-speech on AMD hardware."
audios, sample_rate = model.generate_custom_voice(
    text,
    speaker='eric',
    language='english'
)

# Save to file
sf.write('output.wav', audios[0], sample_rate)
print(f"Saved audio at {sample_rate}Hz")
```
A few important notes:
- We use `attn_implementation='sdpa'` for scaled dot-product attention, which works on ROCm
- The `device_map='cuda:0'` setting explicitly places the model on the GPU
- Using `dtype=torch.bfloat16` reduces memory usage while maintaining quality
- The language parameter must be the full word `'english'`, not the abbreviation `'en'`
Building a Blog-to-Speech Pipeline
For our use case—generating audio versions of blog posts—we need more than basic TTS. Blog posts contain markdown formatting, code blocks, images, and other elements that shouldn't be read aloud. We built a complete pipeline that handles these challenges.
The Blog Cleaner
Our cleaning process strips out non-spoken content while preserving the narrative flow:
````python
import re

def clean_markdown(content):
    # Remove YAML frontmatter
    if content.startswith('---'):
        end = content.find('---', 3)
        if end != -1:
            content = content[end + 3:]

    # Strip HTML tags (audio, video, images)
    content = re.sub(r'<audio[^>]*>[\s\S]*?</audio>', '', content, flags=re.IGNORECASE)
    content = re.sub(r'<video[^>]*>[\s\S]*?</video>', '', content, flags=re.IGNORECASE)
    content = re.sub(r'<img[^>]*/?>', '', content, flags=re.IGNORECASE)
    content = re.sub(r'<[^>]+>', '', content)

    # Remove markdown images and convert links to just text
    content = re.sub(r'!\[[^\]]*\]\([^)]+\)', '', content)
    content = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', content)

    # Remove code blocks
    content = re.sub(r'```[\s\S]*?```', '', content)
    content = re.sub(r'`[^`]+`', '', content)

    # Convert headers to sentences
    content = re.sub(r'^(#{1,6})\s+(.+)$', r'\2.', content, flags=re.MULTILINE)

    # Remove emphasis markers
    content = re.sub(r'\*\*([^*]+)\*\*', r'\1', content)
    content = re.sub(r'\*([^*]+)\*', r'\1', content)

    return content
````
Unit Conversion for Speech
Technical content often includes abbreviations that sound awkward when read literally. We convert common units to their spoken forms:
```python
def convert_units_for_speech(text):
    text = re.sub(r'(\d+)\s*GB\b', r'\1 gigabytes', text)
    text = re.sub(r'(\d+)\s*MB\b', r'\1 megabytes', text)
    text = re.sub(r'(\d+)\s*GHz\b', r'\1 gigahertz', text)
    text = re.sub(r'(\d+)\s*MHz\b', r'\1 megahertz', text)
    text = re.sub(r'(\d+)\s*KB\b', r'\1 kilobytes', text)
    return text
```
Chunking Long Content
TTS models work best with moderate-length inputs. Very long passages can cause quality degradation or memory issues. We split content into chunks at sentence boundaries:
```python
def chunk_text(text, max_chars=500):
    chunks = []
    sentences = re.split(r'(?<=[.!?])\s+', text)
    current = ""

    for sentence in sentences:
        if len(current) + len(sentence) < max_chars:
            current += sentence + " "
        else:
            if current:
                chunks.append(current.strip())
            current = sentence + " "

    if current:
        chunks.append(current.strip())

    return chunks
```
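These helpers compose in a fixed order before synthesis: clean, convert units, then chunk. A quick dry run on any post (the filename here is just an example) shows what the model will actually receive:

```python
from pathlib import Path

raw = Path('example-post.md').read_text()
text = convert_units_for_speech(clean_markdown(raw))
chunks = chunk_text(text, max_chars=500)

print(f"{len(chunks)} chunks, longest is {max(len(c) for c in chunks)} characters")
```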
The Complete Script
Putting it all together, here's our blog_to_speech.py script:
```python
#!/usr/bin/env python3
import argparse
import re
import torch
from pathlib import Path
from qwen_tts import Qwen3TTSModel
import soundfile as sf
import numpy as np

# clean_markdown, convert_units_for_speech, and chunk_text from the
# sections above live in this same file.


def clean_blog_post(source):
    content = Path(source).read_text()
    # Apply the cleaning functions defined above
    content = clean_markdown(content)
    content = convert_units_for_speech(content)
    return content


def synthesize_speech(text, output_file, speaker="eric"):
    model = Qwen3TTSModel.from_pretrained(
        'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice',
        attn_implementation='sdpa',
        device_map='cuda:0',
        dtype=torch.bfloat16
    )

    chunks = chunk_text(text)
    all_audio = []

    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i+1}/{len(chunks)}")
        audios, sample_rate = model.generate_custom_voice(
            chunk,
            speaker=speaker,
            language='english'
        )
        all_audio.append(audios[0])

    combined = np.concatenate(all_audio)
    sf.write(output_file, combined, sample_rate)

    duration = len(combined) / sample_rate
    print(f"Saved {duration:.1f}s audio to: {output_file}")


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('source', help='Blog post markdown file')
    parser.add_argument('-o', '--output', default='output.wav')
    parser.add_argument('--speaker', default='eric')
    args = parser.parse_args()

    text = clean_blog_post(args.source)
    synthesize_speech(text, args.output, args.speaker)
```
Choosing the Right Speaker Voice
Qwen3-TTS ships with nine pre-trained speaker voices, each with distinct characteristics:
| Speaker | Characteristics | Best For |
|---|---|---|
| Eric | Clear, professional male voice with measured pacing | Technical content, tutorials, documentation |
| Aiden | Younger male voice, slightly more casual | Blog posts, conversational content |
| Dylan | Deeper male voice with authoritative tone | Formal presentations, announcements |
| Ryan | Energetic male voice | Marketing content, product demos |
| Serena | Clear female voice, professional | Corporate content, tutorials |
| Vivian | Warm female voice | Storytelling, narrative content |
| Ono Anna | Female voice with distinct character | Creative content |
| Sohee | Female voice, versatile | General purpose |
| Uncle Fu | Character voice | Specialized applications |
For our technical blog content, we primarily use Eric. His clear enunciation and measured pacing work well for complex technical explanations. The voice handles acronyms, numbers, and technical terminology naturally, making it ideal for content about hardware, programming, and system administration.
You can easily switch voices by changing the speaker parameter:
```python
audios, sample_rate = model.generate_custom_voice(
    text,
    speaker='serena',  # Try different voices
    language='english'
)
```
Consider matching voice characteristics to content type. A hardware review might work better with Eric's authoritative tone, while a personal essay might benefit from Aiden's more conversational style.
Comparing TTS Options
Before settling on Qwen3-TTS, we evaluated several alternatives. Here's how they compare for our use case:
Cloud Services
Amazon Polly and Google Cloud TTS offer excellent quality with minimal setup. However, costs accumulate quickly for long-form content. At roughly \$4-16 per million characters (depending on voice quality), a 3000-word blog post costs \$0.10-0.40 per generation. For a site with dozens of posts requiring periodic regeneration, this adds up.
ElevenLabs produces arguably the most natural voices available, with impressive emotional range. But their pricing model—based on character quotas—makes it expensive for regular content generation. The quality is exceptional, but overkill for straightforward narration.
Local Alternatives
Coqui TTS (now deprecated) was a popular open-source option but development has stalled. Bark from Suno produces impressive results but runs slowly and lacks fine-grained control. XTTS offers voice cloning but requires more setup and compute resources.
Piper deserves special mention as a lightweight option. It runs quickly even on CPU and produces acceptable quality for many applications. However, the voices sound noticeably synthetic compared to Qwen3-TTS—fine for notifications or short snippets, but fatiguing for 30-minute narrations.
Qwen3-TTS hits a sweet spot: quality approaching cloud services, reasonable compute requirements, and fully local operation. The 1.7B parameter model is large enough for natural prosody but small enough to run on consumer hardware.
Batch Processing for Multiple Posts
When generating audio for multiple blog posts, efficiency matters. Loading the model takes 15-30 seconds, so we keep it loaded while processing multiple files:
#!/usr/bin/env python3 """Batch TTS processing for multiple blog posts""" import os from pathlib import Path from qwen_tts import Qwen3TTSModel import torch import soundfile as sf import numpy as np # Load model once print("Loading model...") model = Qwen3TTSModel.from_pretrained( 'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', attn_implementation='sdpa', device_map='cuda:0', dtype=torch.bfloat16 ) posts = [ 'post1.md', 'post2.md', 'post3.md', ] for post in posts: print(f"\n{'='*50}") print(f"Processing: {post}") print(f"{'='*50}") text = clean_blog_post(post) output = f"/tmp/{Path(post).stem}_tts.wav" # Process chunks chunks = chunk_text(text) all_audio = [] for i, chunk in enumerate(chunks): print(f" Chunk {i+1}/{len(chunks)}") audios, sr = model.generate_custom_voice( chunk, speaker='eric', language='english' ) all_audio.append(audios[0]) combined = np.concatenate(all_audio) sf.write(output, combined, sr) print(f"Saved: {output}")
This approach processes our five-post backlog overnight, with results ready for review in the morning.
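One way to run it unattended is a nightly cron entry; the paths below are placeholders for wherever your activation script and batch script actually live:

```bash
# Run the TTS batch job at 2 AM and keep a log of the output
0 2 * * * cd /home/user/qwen-tts && . ./activate-tts.sh && python batch_tts.py >> /tmp/batch_tts.log 2>&1
```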
Performance Characteristics
On the AI Max+ 395, speech synthesis runs at roughly one-third of real-time speed—meaning a 30-minute audio file takes around 90 minutes to generate. This is slower than high-end discrete GPUs but perfectly acceptable for batch processing.
For reference, here's how different content lengths performed in our testing:
| Content | Characters | Chunks | Audio Duration | Generation Time |
|---|---|---|---|---|
| Short post | 5,000 | 12 | ~5 min | ~15 min |
| Medium post | 15,000 | 35 | ~15 min | ~45 min |
| Long post | 25,000 | 55 | ~27 min | ~90 min |
| Very long | 40,000 | 85 | ~45 min | ~150 min |
The relationship between content length and generation time is roughly linear after the initial model warmup.
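That linearity makes planning easy. A rough estimator fitted to the table above (about 3.5 minutes of generation per 1,000 characters on this hardware, plus a minute or so of warmup) is enough for scheduling batch runs; the constants are just a fit to our own numbers, not anything the model guarantees:

```python
def estimate_generation_minutes(char_count, minutes_per_1k=3.5, warmup_minutes=1.0):
    """Back-of-the-envelope generation time, fitted to the table above."""
    return warmup_minutes + (char_count / 1000) * minutes_per_1k

# A 25,000-character post comes out around an hour and a half, in line with the table
print(f"~{estimate_generation_minutes(25000):.0f} minutes")
```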
Some observations from our testing:
- First chunk latency: The first chunk takes longer due to GPU kernel compilation and caching
- Memory usage: Peak usage around 8-10GB during inference
- GPU utilization: Consistent 100% during active synthesis
- Quality: Indistinguishable from cloud TTS services for most content
The MIOpen library sometimes logs workspace warnings during execution. These don't affect output quality and can be safely ignored:
```
MIOpen(HIP): Warning [IsEnoughWorkspace] Solver <GemmFwdRest>, workspace required: 103133184
```
Integrating Audio into Blog Posts
Once we have the WAV file, we convert to MP3 for web delivery and embed an HTML5 audio player:
```bash
ffmpeg -i blog_post.wav -codec:a libmp3lame -qscale:a 2 blog_post.mp3
```
For reviewing TTS output quality, we recommend using studio monitor headphones that reveal any artifacts or unnatural tones in the generated speech.
The player HTML is straightforward:
<div style="background: #f8f9fa; border: 1px solid #e9ecef; border-radius: 8px; padding: 16px 20px; margin: 20px 0;"> <div style="display: flex; align-items: center; margin-bottom: 10px;"> <span style="font-size: 20px; margin-right: 8px;">🎧</span> <span style="color: #495057; font-weight: 600;">Listen to this article</span> </div> <audio controls preload="metadata" style="width: 100%;"> <source src="/audio/blog_post.mp3" type="audio/mpeg"> </audio> <div style="color: #6c757d; font-size: 12px; margin-top: 8px;"> 27 min · AI-generated narration </div> </div>
Why We're Doing This
Adding audio narration to blog posts serves multiple purposes:
- Accessibility: Readers with visual impairments or reading difficulties can consume content aurally
- Convenience: Listeners can enjoy posts during commutes, workouts, or other activities
- Engagement: Audio content creates a more personal connection with the audience
- Reach: Some audiences prefer audio format, expanding our potential readership
Running TTS locally rather than using cloud services gives us:
- Cost control: No per-character or per-minute fees
- Privacy: Content never leaves our infrastructure
- Consistency: Same voice and quality across all posts
- Flexibility: Full control over processing pipeline
Troubleshooting Common Issues
"CUDA not available" despite GPU present
Ensure you've installed the ROCm version of PyTorch, not the standard build:
```bash
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.4
```
Model runs on CPU instead of GPU
Check that `device_map='cuda:0'` is specified when loading the model. Also verify the environment variables are set before starting Python.
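A quick way to confirm the weights actually landed on the GPU is to check allocated GPU memory right after loading the model; with this 1.7B model in bfloat16 you should see several gigabytes in use:

```python
import torch

# Run after loading the model; a near-zero value means the weights stayed on the CPU
print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1e9:.1f} GB")
```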
"Unsupported language 'en'"
Use the full language name: `language='english'`, not `language='en'`.
Out of memory errors
Try reducing chunk size or using a smaller batch. The model should fit in 16GB, but very long chunks can spike memory usage.
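Since our chunker already takes a size limit, shrinking chunks is a one-argument change:

```python
# Smaller chunks lower peak memory at the cost of a little more per-chunk overhead
chunks = chunk_text(text, max_chars=300)
```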
Slow first chunk
This is normal—ROCm compiles GPU kernels on first use. Subsequent chunks process faster.
Future Improvements
Our current pipeline works well but has room for enhancement. Some improvements we're considering:
Voice cloning: Qwen3-TTS supports custom voice training. With sufficient audio samples, we could create a unique voice for TinyComputers rather than using the stock speakers. This would provide brand consistency and differentiation.
Automatic post detection: Currently we manually select posts for TTS generation. A CI/CD integration could automatically generate audio for new posts when they're published, keeping the audio library current without manual intervention.
Chapter markers: For longer posts, embedding chapter markers in the audio file would allow listeners to skip to specific sections. This requires parsing the markdown headers and mapping them to audio timestamps.
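A rough sketch of how that could work, assuming we split the post into (title, text) sections at each header, reuse the chunker and loaded model from earlier, and accumulate sample counts into start times; none of this is in our pipeline yet:

```python
import numpy as np

def synthesize_with_chapters(sections, model):
    """sections: list of (title, text) pairs split at markdown headers."""
    chapter_offsets = []   # (title, start offset in samples)
    all_audio = []
    total_samples = 0

    for title, body in sections:
        chapter_offsets.append((title, total_samples))
        for chunk in chunk_text(body):
            audios, sample_rate = model.generate_custom_voice(
                chunk, speaker='eric', language='english'
            )
            all_audio.append(audios[0])
            total_samples += len(audios[0])

    combined = np.concatenate(all_audio)
    # Convert sample offsets to seconds now that we know the sample rate
    chapters = [(title, offset / sample_rate) for title, offset in chapter_offsets]
    return combined, sample_rate, chapters
```

The resulting (title, seconds) pairs could then be written into the audio file as chapter metadata at encode time.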
Multiple format export: Beyond MP3, offering Opus or AAC formats could reduce file sizes while maintaining quality, benefiting listeners on metered connections.
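Encoding the existing WAV output to Opus, for instance, is a one-liner with ffmpeg; the bitrate here is just a plausible setting for speech:

```bash
ffmpeg -i blog_post.wav -c:a libopus -b:a 48k blog_post.opus
```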
Speed adjustment: Some listeners prefer 1.25x or 1.5x playback speed. Pre-generating speed-adjusted versions could provide better quality than real-time speed adjustment in the browser.
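Pre-rendering those variants is straightforward with ffmpeg's atempo filter, which changes tempo without shifting pitch:

```bash
ffmpeg -i blog_post.mp3 -filter:a "atempo=1.25" blog_post_1.25x.mp3
ffmpeg -i blog_post.mp3 -filter:a "atempo=1.5" blog_post_1.5x.mp3
```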
Conclusion
Running Qwen3-TTS on AMD's Strix Halo platform demonstrates that high-quality local TTS is now accessible beyond NVIDIA hardware. While setup requires some ROCm-specific configuration, the results are impressive—natural-sounding narration suitable for professional content.
The democratization of AI capabilities continues apace. What once required expensive cloud subscriptions or high-end NVIDIA GPUs now runs on integrated graphics. The AI Max+ 395's Radeon 8060S, primarily designed for gaming and general compute tasks, handles a 1.7-billion parameter language model without breaking a sweat.
We're actively using this pipeline to generate audio versions of posts across TinyComputers, making our technical content more accessible and convenient for our readers. As of this writing, we've processed our retrocomputing series, hardware reviews, and technical tutorials—dozens of hours of content generated entirely on local hardware.
The combination of AMD's capable integrated graphics and Qwen's excellent TTS model proves that you don't need expensive discrete GPUs or cloud subscriptions to achieve broadcast-quality speech synthesis. For content creators, educators, and accessibility advocates, this opens new possibilities for enriching written content with audio without ongoing service costs.
If you're running AMD hardware and want to add audio narration to your own content, this guide should get you started. The initial setup investment pays dividends in ongoing cost savings and the satisfaction of running capable AI models entirely on your own infrastructure. And if you encounter issues along the way, the troubleshooting section above addresses the most common pitfalls we discovered during our own setup process.
The audio player at the top of many TinyComputers posts now represents a small but meaningful step toward making technical content more accessible. Every post you can listen to while commuting, exercising, or doing dishes is content that might otherwise go unread. That's the real value of local TTS—not just cost savings, but expanded reach for the ideas we share.
