<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TinyComputers.io (Posts about strix halo)</title><link>https://tinycomputers.io/</link><description></description><atom:link href="https://tinycomputers.io/categories/strix-halo.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 A.C. Jokela 
&lt;!-- div style="width: 100%" --&gt;
&lt;a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"&gt;&lt;img alt="" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /&gt; Creative Commons Attribution-ShareAlike&lt;/a&gt;&amp;nbsp;|&amp;nbsp;
&lt;!-- /div --&gt;
</copyright><lastBuildDate>Wed, 11 Mar 2026 00:05:51 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Upgrading ROCm 7.0 to 7.2 on AMD Strix Halo (gfx1151)</title><link>https://tinycomputers.io/posts/upgrading-rocm-7.0-to-7.2-on-amd-strix-halo-gfx1151.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/upgrading-rocm-7.0-to-7.2-on-amd-strix-halo-gfx1151_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;15 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;h3&gt;Introduction&lt;/h3&gt;
&lt;p&gt;If you're running AMD's Strix Halo hardware -- specifically the Ryzen AI MAX+ 395 with its integrated Radeon 8060S GPU -- you already know the software ecosystem is a moving target. The gfx1151 architecture sits in an awkward spot: powerful hardware that isn't officially listed on AMD's ROCm support matrix, yet functional enough to run real workloads with the right driver stack. When ROCm 7.2 landed in early 2026, upgrading from 7.0.2 was a priority. The newer stack brings an updated HSA runtime, a refreshed amdgpu kernel module, and broader compatibility improvements that matter on bleeding-edge silicon.&lt;/p&gt;
&lt;p&gt;This post documents the complete upgrade procedure from ROCm 7.0.2 to 7.2 on a production Ubuntu 24.04 system. It's not a theoretical exercise -- this was performed on a live server running QEMU virtual machines and network services, with the expectation that everything would come back online after a single reboot.&lt;/p&gt;
&lt;p&gt;AMD's official documentation states that in-place ROCm upgrades are not supported. The recommended path is a full uninstall followed by a clean reinstall. That's exactly what we did, and the entire process took about 20 minutes of wall-clock time (excluding the reboot).&lt;/p&gt;
&lt;h3&gt;System Overview&lt;/h3&gt;
&lt;p&gt;The target system is a &lt;a href="https://baud.rs/WZgnl1"&gt;Bosgame mini PC&lt;/a&gt; running the Ryzen AI MAX+ 395 APU. If you've read the &lt;a href="https://tinycomputers.io/posts/amd-ai-max+-395-system-review-a-comprehensive-analysis/"&gt;earlier review&lt;/a&gt; of this hardware, you'll be familiar with the specs. For context on this upgrade, here's what matters:&lt;/p&gt;
&lt;h4&gt;Hardware&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: AMD Ryzen AI MAX+ 395, 16 cores / 32 threads, Zen 5&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: Integrated Radeon 8060S, 40 Compute Units, RDNA 3.5 (gfx1151)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt;: 128 GB unified DDR5 (96 GB allocatable to the GPU, 32 GB for the system)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Peak GPU Clock&lt;/strong&gt;: 2,900 MHz&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Software (Pre-Upgrade)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OS&lt;/strong&gt;: Ubuntu 24.04.3 LTS (Noble Numbat)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kernel&lt;/strong&gt;: 6.14.0-37-generic (HWE, pinned)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROCm&lt;/strong&gt;: 7.0.2&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;amdgpu-dkms&lt;/strong&gt;: 6.14.14 (from &lt;code&gt;repo.radeon.com/amdgpu/30.10.2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROCk Module&lt;/strong&gt;: 6.14.14&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Running Services&lt;/h4&gt;
&lt;p&gt;The system was actively serving several roles during the upgrade:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Five QEMU virtual machines (three x86, two aarch64)&lt;/li&gt;
&lt;li&gt;A PXE boot server (dnsmasq) for the local network&lt;/li&gt;
&lt;li&gt;Docker daemon with various containers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;None of these services are tied to the GPU driver stack, so the plan was to perform the upgrade and reboot without shutting them down first. The VMs and network services would come back automatically after the reboot.&lt;/p&gt;
&lt;h3&gt;Why Upgrade&lt;/h3&gt;
&lt;p&gt;ROCm 7.0.2 worked on this hardware. Models loaded, inference ran, &lt;code&gt;rocminfo&lt;/code&gt; detected the GPU. So why bother upgrading?&lt;/p&gt;
&lt;p&gt;Three reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Driver maturity for gfx1151&lt;/strong&gt;: The amdgpu kernel module jumped from 6.14.14 to 6.16.13 between the two releases. That's not a minor revision -- it represents months of kernel driver development. On hardware that isn't officially supported, newer drivers tend to bring meaningful stability improvements as AMD's internal teams encounter and fix issues on adjacent architectures.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;HSA Runtime improvements&lt;/strong&gt;: ROCm 7.2 ships HSA Runtime Extension version 1.15, up from 1.11 in ROCm 7.0.2. The HSA (Heterogeneous System Architecture) runtime is the lowest layer of the ROCm software stack -- it handles device discovery, memory management, and kernel dispatch. Improvements here affect everything built on top of it.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Ecosystem alignment&lt;/strong&gt;: PyTorch wheels, Ollama builds, and other ROCm-dependent tools increasingly target 7.2 as the baseline. Running 7.0.2 was becoming an exercise in version pinning and compatibility workarounds.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;The Kernel Hold: Why It Matters&lt;/h3&gt;
&lt;p&gt;Before diving into the procedure, a note on kernel management. This system runs the Ubuntu HWE (Hardware Enablement) kernel, which provides newer kernel versions on LTS releases. At the time of this upgrade, the HWE kernel was 6.14.0-37-generic. The upstream kernel had already moved to 6.17, but we didn't want the ROCm upgrade to pull in a kernel that AMD's DKMS module might not build against.&lt;/p&gt;
&lt;p&gt;The solution is &lt;code&gt;apt-mark hold&lt;/code&gt;:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;apt-mark&lt;span class="w"&gt; &lt;/span&gt;hold&lt;span class="w"&gt; &lt;/span&gt;linux-generic-hwe-24.04&lt;span class="w"&gt; &lt;/span&gt;linux-headers-generic-hwe-24.04&lt;span class="w"&gt; &lt;/span&gt;linux-image-generic-hwe-24.04
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This prevents &lt;code&gt;apt&lt;/code&gt; from upgrading the kernel meta-packages, effectively pinning the system to 6.14.0-37-generic. The hold was already in place before the upgrade and remained untouched throughout. After the upgrade, we confirmed it was still active:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;apt-mark&lt;span class="w"&gt; &lt;/span&gt;showhold
&lt;/pre&gt;&lt;/div&gt;

&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;linux-generic-hwe-24.04
linux-headers-generic-hwe-24.04
linux-image-generic-hwe-24.04
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;If you're running Strix Halo or any other hardware where kernel compatibility with &lt;code&gt;amdgpu-dkms&lt;/code&gt; is uncertain, kernel holds are essential. A kernel upgrade that breaks the DKMS build means no GPU driver after reboot.&lt;/p&gt;
&lt;h3&gt;Upgrade Procedure&lt;/h3&gt;
&lt;h4&gt;Step 1: Uninstall the Current ROCm Stack&lt;/h4&gt;
&lt;p&gt;AMD provides the &lt;code&gt;amdgpu-uninstall&lt;/code&gt; script for exactly this purpose. It removes all ROCm userspace packages and the amdgpu-dkms kernel module in a single operation:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;amdgpu-uninstall&lt;span class="w"&gt; &lt;/span&gt;-y
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This command removed approximately 120 packages, including the full HIP runtime, rocBLAS, MIOpen, MIGraphX, ROCm SMI, the LLVM-based compiler toolchain, and the Mesa graphics drivers that ship with ROCm. The DKMS module was purged, which means the amdgpu kernel module was removed from the 6.14.0-37-generic kernel's module tree.&lt;/p&gt;
&lt;p&gt;After the ROCm stack was removed, we purged the &lt;code&gt;amdgpu-install&lt;/code&gt; meta-package itself:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;apt&lt;span class="w"&gt; &lt;/span&gt;purge&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;amdgpu-install
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This also cleaned up the APT repository entries that &lt;code&gt;amdgpu-install&lt;/code&gt; had configured in &lt;code&gt;/etc/apt/sources.list.d/&lt;/code&gt;. The old repos -- &lt;code&gt;repo.radeon.com/amdgpu/30.10.2&lt;/code&gt;, &lt;code&gt;repo.radeon.com/rocm/apt/7.0.2&lt;/code&gt;, and &lt;code&gt;repo.radeon.com/graphics/7.0.2&lt;/code&gt; -- were all removed automatically.&lt;/p&gt;
&lt;h4&gt;Step 2: Clean Up Leftover Files&lt;/h4&gt;
&lt;p&gt;The package removal was thorough but not perfect. A few leftover directories remained in &lt;code&gt;/opt/&lt;/code&gt;:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;ls&lt;span class="w"&gt; &lt;/span&gt;/opt/&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;grep&lt;span class="w"&gt; &lt;/span&gt;rocm
&lt;/pre&gt;&lt;/div&gt;

&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;rocm-7.0.0
rocm-7.0.2
rocm-7.9.0
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;rocm-7.0.0&lt;/code&gt; directory was from a previous installation attempt. The &lt;code&gt;rocm-7.9.0&lt;/code&gt; directory was from an earlier experiment with a release candidate build. The &lt;code&gt;rocm-7.0.2&lt;/code&gt; directory contained a single orphaned shared library (&lt;code&gt;libamdhip64.so.6&lt;/code&gt;) that dpkg couldn't remove because the directory wasn't empty. All three were cleaned up manually:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;rm&lt;span class="w"&gt; &lt;/span&gt;-rf&lt;span class="w"&gt; &lt;/span&gt;/opt/rocm-7.0.0&lt;span class="w"&gt; &lt;/span&gt;/opt/rocm-7.0.2&lt;span class="w"&gt; &lt;/span&gt;/opt/rocm-7.9.0
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It's worth checking for stale ROCm directories after any uninstall. They consume negligible disk space but can confuse build systems and scripts that scan &lt;code&gt;/opt/rocm*&lt;/code&gt; for active installations.&lt;/p&gt;
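&lt;p&gt;A quick illustration of that failure mode: a script scanning &lt;code&gt;/opt/rocm*&lt;/code&gt; can't tell a live tree from a leftover one unless it also resolves the &lt;code&gt;/opt/rocm&lt;/code&gt; symlink that &lt;code&gt;amdgpu-install&lt;/code&gt; maintains. A minimal sketch of such a check:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Sketch: list /opt/rocm-* trees and flag any that don't match the
# active install (the target of the /opt/rocm symlink).
from pathlib import Path

rocm = Path("/opt/rocm")
active = rocm.resolve() if rocm.exists() else None

for d in sorted(Path("/opt").glob("rocm-*")):
    tag = "active" if active == d.resolve() else "stale?"
    print(f"{d}  [{tag}]")
&lt;/pre&gt;&lt;/div&gt;
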
&lt;h4&gt;Step 3: Install the ROCm 7.2 Installer&lt;/h4&gt;
&lt;p&gt;AMD distributes ROCm through a meta-package called &lt;code&gt;amdgpu-install&lt;/code&gt;. Each ROCm release has its own version of this package, which configures the appropriate APT repositories. The 7.2 installer was downloaded directly from AMD's repository:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;/tmp
wget&lt;span class="w"&gt; &lt;/span&gt;https://repo.radeon.com/amdgpu-install/7.2/ubuntu/noble/amdgpu-install_7.2.70200-1_all.deb
sudo&lt;span class="w"&gt; &lt;/span&gt;apt&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;./amdgpu-install_7.2.70200-1_all.deb
sudo&lt;span class="w"&gt; &lt;/span&gt;apt&lt;span class="w"&gt; &lt;/span&gt;update
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;After installation and &lt;code&gt;apt update&lt;/code&gt;, three new repositories were active:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;https://repo.radeon.com/amdgpu/30.30/ubuntu noble&lt;/code&gt; -- the kernel driver and Mesa components&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://repo.radeon.com/rocm/apt/7.2 noble&lt;/code&gt; -- the ROCm userspace stack&lt;/li&gt;
&lt;li&gt;&lt;code&gt;https://repo.radeon.com/graphics/7.2/ubuntu noble&lt;/code&gt; -- graphics libraries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The version numbering can be confusing. The &lt;code&gt;amdgpu-install&lt;/code&gt; package version is &lt;code&gt;30.30.0.0.30300000-2278356.24.04&lt;/code&gt;, which maps to the amdgpu driver release 30.30. The ROCm version is 7.2.0. These are different version tracks that AMD maintains in parallel.&lt;/p&gt;
&lt;h4&gt;Step 4: Install ROCm 7.2&lt;/h4&gt;
&lt;p&gt;With the repositories configured, the actual installation was a single command:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;amdgpu-install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;--usecase&lt;span class="o"&gt;=&lt;/span&gt;graphics,rocm
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;--usecase=graphics,rocm&lt;/code&gt; flag tells the installer to include both the Mesa graphics drivers and the full ROCm compute stack. This is the right choice for a system that needs both display output and GPU compute capabilities.&lt;/p&gt;
&lt;p&gt;The installation took approximately 10 minutes and included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;amdgpu-dkms 6.16.13&lt;/strong&gt;: The kernel module, compiled via DKMS against the running kernel&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full ROCm 7.2 stack&lt;/strong&gt;: HIP runtime, hipcc compiler, rocBLAS, rocFFT, MIOpen, MIGraphX, RCCL, ROCm SMI, ROCProfiler, and dozens of other libraries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mesa graphics&lt;/strong&gt;: Updated EGL, OpenGL, and Vulkan drivers from the amdgpu Mesa fork&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;ROCm LLVM toolchain&lt;/strong&gt;: The LLVM-based compiler infrastructure that HIP uses for kernel compilation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The DKMS build is the critical step. During installation, DKMS compiled the amdgpu module against the kernel headers for 6.14.0-37-generic. The output confirmed a successful build:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;depmod...
update-initramfs: Generating /boot/initrd.img-6.14.0-37-generic
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The initramfs was regenerated to include the new module, ensuring it would be loaded at boot.&lt;/p&gt;
&lt;h4&gt;Step 5: Verify DKMS&lt;/h4&gt;
&lt;p&gt;Before rebooting, we confirmed the DKMS status:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;dkms&lt;span class="w"&gt; &lt;/span&gt;status
&lt;/pre&gt;&lt;/div&gt;

&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;amdgpu/6.16.13-2278356.24.04, 6.14.0-37-generic, x86_64: installed
virtualbox/7.0.16, 6.14.0-36-generic, x86_64: installed
virtualbox/7.0.16, 6.14.0-37-generic, x86_64: installed
virtualbox/7.0.16, 6.8.0-100-generic, x86_64: installed
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The new amdgpu module (6.16.13) was built and installed for 6.14.0-37-generic. Note that it only built for the currently running kernel, unlike VirtualBox which had modules built for older kernels as well. This is expected -- DKMS builds against available kernel headers, and the old kernel headers for 6.14.0-36 and 6.8.0-100 were still present from earlier installations.&lt;/p&gt;
&lt;h4&gt;Step 6: Reboot&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;reboot
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The server came back online in approximately 50 seconds.&lt;/p&gt;
&lt;h3&gt;Post-Reboot Verification&lt;/h3&gt;
&lt;h4&gt;rocminfo&lt;/h4&gt;
&lt;p&gt;The first check after reboot was &lt;code&gt;rocminfo&lt;/code&gt;, which queries the HSA runtime for available agents:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;rocminfo
&lt;/pre&gt;&lt;/div&gt;

&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;ROCk&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;6.16&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loaded&lt;/span&gt;
&lt;span class="o"&gt;=====================&lt;/span&gt;
&lt;span class="n"&gt;HSA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Attributes&lt;/span&gt;
&lt;span class="o"&gt;=====================&lt;/span&gt;
&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;         &lt;/span&gt;&lt;span class="mf"&gt;1.18&lt;/span&gt;
&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Ext&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="mf"&gt;1.15&lt;/span&gt;
&lt;span class="o"&gt;...&lt;/span&gt;
&lt;span class="o"&gt;==========&lt;/span&gt;
&lt;span class="n"&gt;HSA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Agents&lt;/span&gt;
&lt;span class="o"&gt;==========&lt;/span&gt;
&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;RYZEN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AI&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MAX&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;395&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Radeon&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8060&lt;/span&gt;&lt;span class="n"&gt;S&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;gfx1151&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;GPU&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Marketing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;          &lt;/span&gt;&lt;span class="n"&gt;AMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Radeon&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Graphics&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Compute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Unit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;            &lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Clock&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Freq&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MHz&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="mi"&gt;2900&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;Memory&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Properties&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;       &lt;/span&gt;&lt;span class="n"&gt;APU&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ISA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;amdgcn&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;amd&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;amdhsa&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;gfx1151&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="n"&gt;ISA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;amdgcn&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;amd&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;amdhsa&lt;/span&gt;&lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;gfx11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;generic&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Key observations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ROCk module 6.16.13&lt;/strong&gt;: The new kernel module loaded successfully.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Runtime Ext Version 1.15&lt;/strong&gt;: Upgraded from 1.11 in ROCm 7.0.2.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;gfx1151 detected&lt;/strong&gt;: The GPU was recognized with its correct ISA identifier.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;gfx11-generic ISA&lt;/strong&gt;: ROCm 7.2 also exposes a generic gfx11 ISA, which allows software compiled for the broader RDNA 3 family to run on this device without gfx1151-specific builds.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;APU memory&lt;/strong&gt;: The memory properties correctly identify this as an APU with unified memory.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;ROCm SMI&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;rocm-smi
&lt;/pre&gt;&lt;/div&gt;

&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;Device  Node  Temp    Power     SCLK  MCLK     Fan  Perf  VRAM%  GPU%
0       1     33.0C   9.087W    N/A   1000Mhz  0%   auto  0%     0%
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The GPU was visible and reporting telemetry. The 0% VRAM reading is expected on an APU -- &lt;code&gt;rocm-smi&lt;/code&gt; reports dedicated VRAM usage, but on a unified memory architecture, GPU memory allocations come from system RAM and aren't reflected in this counter.&lt;/p&gt;
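&lt;p&gt;If you want to see what the GPU is actually using on an APU, the HIP allocator's own counters are more informative than &lt;code&gt;rocm-smi&lt;/code&gt;. A minimal sketch using the PyTorch build installed later in this post; the 2 GB allocation is an arbitrary test size:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Observe GPU allocations that rocm-smi's VRAM% column never shows on an APU.
import torch

x = torch.empty(1024, 1024, 1024, dtype=torch.bfloat16, device="cuda")  # ~2 GB
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
&lt;/pre&gt;&lt;/div&gt;
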
&lt;h4&gt;ROCm Version&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;cat&lt;span class="w"&gt; &lt;/span&gt;/opt/rocm/.info/version
&lt;/pre&gt;&lt;/div&gt;

&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="mf"&gt;7.2.0&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;DKMS&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;dkms&lt;span class="w"&gt; &lt;/span&gt;status
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Confirmed &lt;code&gt;amdgpu/6.16.13&lt;/code&gt; remained installed for 6.14.0-37-generic after reboot.&lt;/p&gt;
&lt;h3&gt;PyTorch Validation&lt;/h3&gt;
&lt;p&gt;With the driver stack verified, the next step was confirming that PyTorch could see and use the GPU. ROCm 7.2 ships with prebuilt PyTorch wheels on AMD's repository.&lt;/p&gt;
&lt;h4&gt;Installing PyTorch for ROCm 7.2&lt;/h4&gt;
&lt;p&gt;We set up a Python virtual environment and installed the ROCm-specific wheels:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;python3&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;venv&lt;span class="w"&gt; &lt;/span&gt;.venv
&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;.venv/bin/activate
pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;--upgrade&lt;span class="w"&gt; &lt;/span&gt;pip
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The PyTorch wheel for ROCm 7.2 requires a matching ROCm-specific build of Triton. Both are available from AMD's manylinux repository. The order matters -- Triton must be installed first, since the PyTorch wheel declares it as a dependency with a specific version that doesn't exist on PyPI:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/triton-3.5.1%2Brocm7.2.0.gita272dfa8-cp312-cp312-linux_x86_64.whl
pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/torch-2.9.1%2Brocm7.2.0.lw.git7e1940d4-cp312-cp312-linux_x86_64.whl
pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;https://repo.radeon.com/rocm/manylinux/rocm-rel-7.2/torchvision-0.24.0%2Brocm7.2.0.gitb919bd0c-cp312-cp312-linux_x86_64.whl
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These are the ROCm 7.2 builds for Python 3.12. AMD also provides wheels for Python 3.10, 3.11, and 3.13.&lt;/p&gt;
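&lt;p&gt;The &lt;code&gt;cp312&lt;/code&gt; tag in those filenames must match the interpreter in your virtual environment, or pip will reject the wheel. A one-liner to check which tag you need:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Print the cpXY tag for the running interpreter (e.g. cp312 for Python 3.12).
import sys

print(f"cp{sys.version_info.major}{sys.version_info.minor}")
&lt;/pre&gt;&lt;/div&gt;
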
&lt;h4&gt;Smoke Test&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"PyTorch:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__version__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"CUDA available:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Device:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_device_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"VRAM:"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_device_properties&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_memory&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s2"&gt;"GB"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;PyTorch&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.9&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="n"&gt;rocm7&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;git7e1940d4&lt;/span&gt;
&lt;span class="n"&gt;CUDA&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;available&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;True&lt;/span&gt;
&lt;span class="n"&gt;Device&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Radeon&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Graphics&lt;/span&gt;
&lt;span class="n"&gt;VRAM&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;103.1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;GB&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;PyTorch detected the GPU through ROCm's HIP-to-CUDA translation layer. The 103.1 GB figure represents the total addressable memory on this unified-memory APU, which includes both the 96 GB GPU allocation and additional system memory accessible through the HSA runtime.&lt;/p&gt;
&lt;p&gt;Note the use of &lt;code&gt;torch.cuda&lt;/code&gt; despite this being an AMD GPU. ROCm's HIP runtime presents itself through PyTorch's CUDA interface, so all CUDA API calls in PyTorch (device selection, memory management, kernel launches) work transparently with AMD hardware.&lt;/p&gt;
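&lt;p&gt;One quick way to confirm you're on a ROCm build rather than a CUDA build: ROCm wheels populate &lt;code&gt;torch.version.hip&lt;/code&gt; and leave &lt;code&gt;torch.version.cuda&lt;/code&gt; unset:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# On a ROCm wheel torch.version.hip is a version string and
# torch.version.cuda is None; on a CUDA wheel it's the reverse.
import torch

print("HIP runtime: ", torch.version.hip)
print("CUDA runtime:", torch.version.cuda)
&lt;/pre&gt;&lt;/div&gt;
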
&lt;h3&gt;Before and After Summary&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;ROCm 7.0.2&lt;/th&gt;
&lt;th&gt;ROCm 7.2.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ROCm Version&lt;/td&gt;
&lt;td&gt;7.0.2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.2.0&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;amdgpu-dkms&lt;/td&gt;
&lt;td&gt;6.14.14&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.16.13&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ROCk Module&lt;/td&gt;
&lt;td&gt;6.14.14&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.16.13&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HSA Runtime Ext&lt;/td&gt;
&lt;td&gt;1.11&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.15&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;amdgpu Repo&lt;/td&gt;
&lt;td&gt;30.10.2&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;30.30&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PyTorch&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;2.9.1+rocm7.2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Triton&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;3.5.1+rocm7.2.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel&lt;/td&gt;
&lt;td&gt;6.14.0-37-generic&lt;/td&gt;
&lt;td&gt;6.14.0-37-generic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel Holds&lt;/td&gt;
&lt;td&gt;In place&lt;/td&gt;
&lt;td&gt;In place&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h3&gt;Notes on gfx1151 Support&lt;/h3&gt;
&lt;p&gt;It's worth being explicit about the support situation. As of February 2026, gfx1151 (Strix Halo) is &lt;strong&gt;not listed&lt;/strong&gt; on AMD's official ROCm support matrix. The supported RDNA 3 targets are gfx1100 (Navi 31, RX 7900 XTX) and gfx1101 (Navi 32). Strix Halo's gfx1151 is an RDNA 3.5 derivative that shares much of the ISA with gfx1100 but has architectural differences in the memory subsystem and compute unit layout.&lt;/p&gt;
&lt;p&gt;In practice, ROCm 7.2 works on gfx1151. The kernel driver loads, &lt;code&gt;rocminfo&lt;/code&gt; detects the GPU, and PyTorch can allocate tensors and dispatch compute kernels. The &lt;code&gt;gfx11-generic&lt;/code&gt; ISA target in ROCm 7.2 is particularly helpful -- it provides a compatibility path for software that hasn't been explicitly compiled for gfx1151.&lt;/p&gt;
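&lt;p&gt;Before the generic target existed, the usual workaround was overriding the ISA version reported to the HSA runtime so that gfx1100 binaries would load on gfx1151. With ROCm 7.2 and the gfx11-generic path this is generally unnecessary, but it's worth knowing for older stacks. The override has to be set before the runtime initializes, i.e. before &lt;code&gt;torch&lt;/code&gt; is imported:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Legacy workaround: report gfx1100 (11.0.0) to the HSA runtime so kernels
# compiled for gfx1100 load on gfx1151. Set before importing torch.
import os
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "11.0.0"

import torch
print(torch.cuda.get_device_name(0))
&lt;/pre&gt;&lt;/div&gt;
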
&lt;p&gt;However, "works" and "fully supported" are different things. There are known quirks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;rocm-smi VRAM reporting&lt;/strong&gt;: Always shows 0% on the APU since it only tracks discrete VRAM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No official PyTorch gfx1151 builds&lt;/strong&gt;: The ROCm PyTorch wheels target gfx1100. They run on gfx1151 through ISA compatibility, but performance may not be optimal&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Large model loading latency&lt;/strong&gt;: Moving large models to the GPU device can be slow on the unified memory architecture, as the HSA runtime handles page migration differently than discrete GPU DMA transfers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you're considering this hardware for production AI workloads, treat ROCm support as "functional but experimental." It works well enough for development, testing, and moderate inference workloads. For production training or latency-sensitive deployment, stick with hardware on AMD's official support list.&lt;/p&gt;
&lt;h3&gt;Rollback Plan&lt;/h3&gt;
&lt;p&gt;If the upgrade fails -- the DKMS module doesn't build, the GPU isn't detected after reboot, or something else goes wrong -- the rollback path is straightforward:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Uninstall ROCm 7.2:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;sudo&lt;span class="w"&gt; &lt;/span&gt;amdgpu-uninstall&lt;span class="w"&gt; &lt;/span&gt;-y
sudo&lt;span class="w"&gt; &lt;/span&gt;apt&lt;span class="w"&gt; &lt;/span&gt;purge&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;amdgpu-install
&lt;/pre&gt;&lt;/div&gt;

&lt;ol&gt;
&lt;li&gt;Reinstall ROCm 7.0.2:&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;wget&lt;span class="w"&gt; &lt;/span&gt;https://repo.radeon.com/amdgpu-install/30.10.2/ubuntu/noble/amdgpu-install_30.10.2.0.30100200-2226257.24.04_all.deb
sudo&lt;span class="w"&gt; &lt;/span&gt;apt&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;./amdgpu-install_30.10.2.0.30100200-2226257.24.04_all.deb
sudo&lt;span class="w"&gt; &lt;/span&gt;apt&lt;span class="w"&gt; &lt;/span&gt;update
sudo&lt;span class="w"&gt; &lt;/span&gt;amdgpu-install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;--usecase&lt;span class="o"&gt;=&lt;/span&gt;graphics,rocm
sudo&lt;span class="w"&gt; &lt;/span&gt;reboot
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The entire rollback takes about 15 minutes. Keep the old &lt;code&gt;amdgpu-install&lt;/code&gt; deb URL handy -- it's not linked from AMD's current download pages once a newer version is published.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Upgrading ROCm on hardware that isn't officially supported always carries some risk, but this upgrade from 7.0.2 to 7.2 on gfx1151 was uneventful. The procedure follows AMD's documented uninstall-reinstall approach with no deviations. The kernel hold strategy kept the kernel stable, the DKMS module built cleanly against 6.14.0-37-generic, and all post-reboot checks passed.&lt;/p&gt;
&lt;p&gt;The improvements in ROCm 7.2 -- particularly the HSA runtime bump to 1.15 and the introduction of the &lt;code&gt;gfx11-generic&lt;/code&gt; ISA target -- represent meaningful progress for Strix Halo users. The ecosystem is slowly catching up to the hardware. It's not there yet, but each release closes the gap.&lt;/p&gt;
&lt;p&gt;For anyone running a Ryzen AI MAX+ 395 or similar Strix Halo hardware on Ubuntu 24.04, this upgrade is worth doing. The procedure is well-defined, the rollback path is clear, and the newer driver stack brings tangible benefits. Just remember to hold your kernel first.&lt;/p&gt;
&lt;h3&gt;Recommended Resources&lt;/h3&gt;
&lt;h4&gt;Hardware&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://baud.rs/WZgnl1"&gt;Bosgame M5 AI Mini PC (Ryzen AI MAX+ 395)&lt;/a&gt; - The system used in this post&lt;/li&gt;
&lt;li&gt;&lt;a href="https://baud.rs/q87EAZ"&gt;GMKtec EVO X2 (Ryzen AI MAX+ 395)&lt;/a&gt; - Another Strix Halo mini PC option on Amazon&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Books&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://baud.rs/NTAPGg"&gt;&lt;em&gt;Deep Learning with PyTorch&lt;/em&gt;&lt;/a&gt; by Stevens, Antiga, Huang, Viehmann - Comprehensive guide to building, training, and tuning neural networks with PyTorch&lt;/li&gt;
&lt;li&gt;&lt;a href="https://baud.rs/Iu8KR4"&gt;&lt;em&gt;Programming PyTorch for Deep Learning&lt;/em&gt;&lt;/a&gt; by Ian Pointer - Practical guide to creating and deploying deep learning applications&lt;/li&gt;
&lt;li&gt;&lt;a href="https://baud.rs/zmKSQj"&gt;&lt;em&gt;Understanding Deep Learning&lt;/em&gt;&lt;/a&gt; by Simon Prince - Modern treatment of deep learning fundamentals&lt;/li&gt;
&lt;/ul&gt;</description><category>amd</category><category>amdgpu</category><category>dkms</category><category>driver upgrade</category><category>gfx1151</category><category>gpu computing</category><category>linux</category><category>pytorch</category><category>rocm</category><category>ryzen ai</category><category>strix halo</category><category>ubuntu</category><guid>https://tinycomputers.io/posts/upgrading-rocm-7.0-to-7.2-on-amd-strix-halo-gfx1151.html</guid><pubDate>Wed, 18 Feb 2026 16:00:00 GMT</pubDate></item><item><title>Partial LLM Loading: Running Models Too Big for VRAM</title><link>https://tinycomputers.io/posts/partial-llm-loading-running-models-too-big-for-vram.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/partial-llm-loading-running-models-too-big-for-vram.mp3" type="audio/mpeg"&gt;
Your browser does not support the audio element.
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;6:59 · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;What happens when you want to run a 70B parameter model but only have 24GB of VRAM? Traditionally, you either quantize aggressively, &lt;a href="https://baud.rs/BJLKdd"&gt;rent cloud GPUs&lt;/a&gt;, or accept that the model is simply out of reach. But there's a third option that's becoming increasingly viable: partial loading, where you keep some layers on the CPU or disk and stream them to the GPU on demand.&lt;/p&gt;
&lt;p&gt;I spent a couple days testing partial loading strategies on an &lt;a href="https://baud.rs/3vAejv"&gt;AMD Strix Halo APU with 128GB of unified memory&lt;/a&gt;, configured with 96GB allocated to VRAM, trying to answer a simple question: can you actually run models that don't fit in VRAM, and if so, how much performance do you sacrifice?&lt;/p&gt;
&lt;p&gt;The answer turns out to be: yes, you can, and the performance penalty is more nuanced than I expected.&lt;/p&gt;
&lt;h3&gt;The Memory Problem&lt;/h3&gt;
&lt;p&gt;Large language models are memory hogs. A 7B parameter model in bfloat16 needs roughly 14GB just for the weights. A 70B model needs 140GB. An 80B MoE model might need 160GB or more. Most consumer GPUs max out at 24GB, with only a handful of prosumer cards reaching 48GB.&lt;/p&gt;
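&lt;p&gt;The arithmetic is simple but worth making explicit, since it drives everything that follows: weight memory is parameter count times bytes per element, before you account for the KV cache and activations.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Weight memory = parameter count x bytes per element (weights only).
def weight_gb(params_billions, bytes_per_param):
    return params_billions * 1e9 * bytes_per_param / 1e9

for params, dtype, nbytes in [(7, "bf16", 2), (70, "bf16", 2), (70, "int4", 0.5)]:
    print(f"{params}B {dtype}: {weight_gb(params, nbytes):.0f} GB")
# 7B bf16: 14 GB / 70B bf16: 140 GB / 70B int4: 35 GB
&lt;/pre&gt;&lt;/div&gt;
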
&lt;p&gt;The traditional solutions each have trade-offs:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Quantization&lt;/strong&gt; reduces memory requirements by storing weights in lower precision formats. INT8 cuts memory in half. INT4 cuts it to a quarter. But quantization also reduces quality, sometimes significantly for complex reasoning tasks.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Model sharding&lt;/strong&gt; across multiple GPUs works if you have multiple GPUs. Most people don't. Early on (about two years ago), this is how I experimented with models: a handful of Pascal-generation NVIDIA GPUs in a former crypto-mining server.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cloud inference&lt;/strong&gt; works but adds latency, costs money per token, and means your data leaves your machine.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Partial loading&lt;/strong&gt; offers a fourth path: keep the model weights somewhere other than VRAM (CPU RAM, disk, NVMe) and load them into the GPU only when needed. You take a latency hit on every layer that needs to be fetched, but you can run models that would otherwise be impossible.&lt;/p&gt;
&lt;h3&gt;Understanding Transformer Layer Architecture&lt;/h3&gt;
&lt;p&gt;To understand why partial loading works, you need to understand how transformers process information. A typical LLM consists of:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Embedding layer&lt;/strong&gt;: Converts input tokens to vectors. Relatively small.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Decoder layers&lt;/strong&gt;: The bulk of the model. A 70B parameter model might have 80+ decoder layers, each containing attention heads and a feed-forward network (FFN).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Final layer norm and output projection&lt;/strong&gt;: Converts the final hidden states back to token probabilities. Relatively small.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The key insight is that inference is sequential through the layers. When processing a token, you go through layer 0, then layer 1, then layer 2, and so on. You never need layer 5 while you're processing layer 3. This means you can theoretically keep only one layer's weights in VRAM at a time, loading the next layer while processing the current one.&lt;/p&gt;
&lt;p&gt;In practice, keeping &lt;em&gt;all&lt;/em&gt; layers streaming from disk adds too much latency. The sweet spot is typically keeping some layers resident in VRAM (usually the first few and last few, which see the most traffic) while streaming the middle layers on demand.&lt;/p&gt;
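&lt;p&gt;To make the idea concrete, here's a deliberately naive sketch of just-in-time layer streaming in plain PyTorch -- no prefetching, and stand-in &lt;code&gt;nn.Linear&lt;/code&gt; layers instead of real decoder blocks. This isn't how oLLM (covered below) is implemented, but it's the core loop every streaming approach builds on:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Naive just-in-time layer streaming: weights live in CPU RAM and each
# layer is copied to the GPU only for the duration of its forward pass.
import torch
import torch.nn as nn

layers = [nn.Linear(4096, 4096) for _ in range(16)]  # stand-in decoder stack
x = torch.randn(1, 4096, device="cuda")

with torch.no_grad():
    for layer in layers:
        layer.to("cuda")   # fetch weights (a page-mapping update on an APU)
        x = layer(x)
        layer.to("cpu")    # evict to free VRAM for the next layer
print(x.shape)
&lt;/pre&gt;&lt;/div&gt;
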
&lt;h3&gt;The Hardware Setup&lt;/h3&gt;
&lt;p&gt;My test machine is an AMD Strix Halo APU:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;Hardware Configuration:
&lt;span class="k"&gt;-&lt;/span&gt; AMD Radeon 8060S (gfx1151)
&lt;span class="k"&gt;-&lt;/span&gt; 128GB unified memory (96GB VRAM / 32GB system)
&lt;span class="k"&gt;-&lt;/span&gt; ROCm 7.0 with HSA_OVERRIDE_GFX_VERSION=11.0.0
&lt;span class="k"&gt;-&lt;/span&gt; PyTorch 2.9.1+rocm6.3
&lt;span class="k"&gt;-&lt;/span&gt; NVMe SSD: Samsung 990 Pro 2TB (PCIe 4.0, 7450 MB/s read)
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The unified memory architecture is interesting for this experiment. On a discrete GPU, moving weights from CPU RAM to VRAM requires crossing the PCIe bus, which tops out at around 32GB/s each way for PCIe 4.0 x16. On the Strix Halo APU, both "GPU memory" and "CPU memory" share the same physical RAM—it's just a question of which pages are mapped for GPU access.&lt;/p&gt;
&lt;p&gt;This should give partial loading an advantage on APUs, since there's no physical data movement—just page table updates. The actual numbers bear this out, as we'll see.&lt;/p&gt;
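&lt;p&gt;One rough way to probe this yourself is to time a large host-to-device transfer. This is a sanity check, not a rigorous benchmark -- the 1 GB buffer and single-shot timing are arbitrary choices:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Time a ~1 GB host-to-device copy. On a discrete PCIe 4.0 x16 card this
# is bounded by roughly 32 GB/s; on the APU nothing crosses an external bus.
import time
import torch

cpu_buf = torch.empty(512 * 1024 * 1024, dtype=torch.int16)  # ~1 GB
torch.cuda.synchronize()
t0 = time.perf_counter()
gpu_buf = cpu_buf.to("cuda")
torch.cuda.synchronize()
print(f"1 GB host-to-device: {(time.perf_counter() - t0) * 1e3:.1f} ms")
&lt;/pre&gt;&lt;/div&gt;
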
&lt;h3&gt;Three Approaches to Partial Loading&lt;/h3&gt;
&lt;p&gt;I tested three different strategies for loading models that exceed VRAM:&lt;/p&gt;
&lt;h4&gt;1. llama.cpp with Partial GPU Offloading&lt;/h4&gt;
&lt;p&gt;The simplest approach uses &lt;a href="https://baud.rs/llamacpp"&gt;llama.cpp's&lt;/a&gt; &lt;code&gt;-ngl&lt;/code&gt; (number of GPU layers) flag. This lets you specify exactly how many transformer layers go on the GPU, with the rest staying on CPU.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;./main&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;models/llama-70b-chat.gguf&lt;span class="w"&gt; &lt;/span&gt;-ngl&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;35&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The capital of France is"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-n&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;50&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With a 70B model that has 80 layers, setting &lt;code&gt;-ngl 35&lt;/code&gt; puts roughly 44% of the model on the GPU and 56% on CPU. The GPU handles the compute-intensive matrix multiplications, while the CPU layers run on AMD's Zen cores.&lt;/p&gt;
&lt;p&gt;Advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simple to configure&lt;/li&gt;
&lt;li&gt;Automatic handling of which layers go where&lt;/li&gt;
&lt;li&gt;Works with GGUF quantized models&lt;/li&gt;
&lt;li&gt;CPU layers use optimized AVX-512 implementations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Disadvantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Static partitioning—layers stay where they're assigned&lt;/li&gt;
&lt;li&gt;CPU inference is much slower than GPU&lt;/li&gt;
&lt;li&gt;Limited to llama.cpp's supported architectures&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;2. HuggingFace Accelerate Disk Offloading&lt;/h4&gt;
&lt;p&gt;&lt;a href="https://baud.rs/accelerate"&gt;HuggingFace's Accelerate library&lt;/a&gt; provides &lt;code&gt;device_map="auto"&lt;/code&gt; with disk offloading:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;transformers&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;accelerate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;infer_auto_device_map&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;"meta-llama/Llama-3.2-70B-Instruct"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"auto"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;offload_folder&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./offload"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When VRAM is insufficient, Accelerate automatically spills layers first to CPU RAM and then to disk (the &lt;code&gt;offload_folder&lt;/code&gt;). During inference, offloaded layers are loaded back as needed.&lt;/p&gt;
&lt;p&gt;Advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Works with any HuggingFace model&lt;/li&gt;
&lt;li&gt;Automatic layer placement decisions&lt;/li&gt;
&lt;li&gt;Can use disk for infinite capacity&lt;/li&gt;
&lt;li&gt;Integrates with the broader HuggingFace ecosystem&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Disadvantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Disk I/O is slow (even NVMe)&lt;/li&gt;
&lt;li&gt;Layer loading happens synchronously&lt;/li&gt;
&lt;li&gt;Each token generation can require full model traversal&lt;/li&gt;
&lt;li&gt;Memory peaks during layer swaps&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;3. oLLM Layer Streaming&lt;/h4&gt;
&lt;p&gt;The most sophisticated approach I tested was &lt;a href="https://baud.rs/ollm"&gt;oLLM&lt;/a&gt;, a library designed specifically for layer-by-layer streaming from SSD to GPU. Unlike HuggingFace's approach, oLLM implements asynchronous layer prefetching—while one layer is processing on the GPU, the next layer is being loaded.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;ollm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Inference&lt;/span&gt;

&lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"llama3-1B-chat"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"cuda:0"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ini_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./models/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Offload half the layers to CPU&lt;/span&gt;
&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;offload_layers_to_cpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layers_num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"The capital of France is"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The library instruments each layer load, giving you visibility into the streaming behavior:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;layer_load&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.002&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.004&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.004&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.004&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.004&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;0.242&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This tells you each layer took about 4ms to load, and the total token generation time was 242ms.&lt;/p&gt;
&lt;p&gt;Advantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Asynchronous prefetching reduces latency&lt;/li&gt;
&lt;li&gt;Per-layer timing instrumentation&lt;/li&gt;
&lt;li&gt;Designed specifically for memory-constrained scenarios&lt;/li&gt;
&lt;li&gt;Can leverage GPU Direct Storage (GDS) for faster NVMe-to-GPU transfers&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Disadvantages:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Limited model architecture support&lt;/li&gt;
&lt;li&gt;Requires transformers 4.x (incompatible with 5.0)&lt;/li&gt;
&lt;li&gt;Less mature than llama.cpp or HuggingFace&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Benchmarking Partial Loading&lt;/h3&gt;
&lt;p&gt;I ran a series of tests with Llama 3.2 1B (16 layers, 2.8GB model size) to measure the impact of partial loading:&lt;/p&gt;
&lt;h4&gt;Test Configuration&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;model_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"llama3-1B-chat"&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"The capital of France is"&lt;/span&gt;
&lt;span class="n"&gt;max_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;configurations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"gpu_layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"cpu_layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;   &lt;span class="c1"&gt;# Full GPU&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"gpu_layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"cpu_layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;   &lt;span class="c1"&gt;# 75% GPU&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"gpu_layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"cpu_layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;    &lt;span class="c1"&gt;# 50% GPU&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"gpu_layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"cpu_layers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;   &lt;span class="c1"&gt;# 25% GPU&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;Results: oLLM Layer Streaming&lt;/h4&gt;
&lt;p&gt;With the oLLM library and 8 of 16 layers offloaded to CPU:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;GPU&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CPU&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;
&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;loading&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;2.3&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;
&lt;span class="n"&gt;First&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;242&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;
&lt;span class="n"&gt;Per&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;layer&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;load&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;ms&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;average&lt;/span&gt;
&lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Correct&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"The capital of France is Paris."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The layer load times are interesting. At 4ms per layer, you might expect significant overhead when 8 layers need to be fetched from CPU RAM. But because oLLM prefetches the next layer while the current one is executing, the effective latency impact is much smaller.&lt;/p&gt;
&lt;p&gt;On a discrete GPU with PCIe transfers, these numbers would be different. Loading a 200MB layer across PCIe 4.0 x16 (roughly 32GB/s) takes about 6ms at full theoretical bandwidth, and because PCIe rarely achieves full bandwidth due to protocol overhead, real-world numbers are typically 8-10ms per layer, somewhat slower than the ~4ms I measured on the APU.&lt;/p&gt;
&lt;h4&gt;The Quality Question&lt;/h4&gt;
&lt;p&gt;A critical question with partial loading: does offloading layers affect output quality?&lt;/p&gt;
&lt;p&gt;The answer is no, with an important caveat. Partial loading doesn't change the weights—it just changes where they're stored. The same matrices participate in the same computations. The outputs are bit-identical to full GPU inference.&lt;/p&gt;
&lt;p&gt;The caveat is that some partial loading implementations run CPU layers at a different precision (FP32 instead of bfloat16, since many CPUs lack fast bfloat16 support, or FP16) to speed up CPU computation. This can introduce small numerical differences. In my testing with oLLM, both GPU and CPU layers used the same bfloat16 precision, so outputs matched exactly.&lt;/p&gt;
&lt;h3&gt;Practical Performance Analysis&lt;/h3&gt;
&lt;p&gt;Let's break down what partial loading actually costs in terms of latency.&lt;/p&gt;
&lt;h4&gt;Layer Loading Overhead&lt;/h4&gt;
&lt;p&gt;For a model with N layers, where K layers are on CPU:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Each token generation requires K layer loads&lt;/li&gt;
&lt;li&gt;If each load takes T_load milliseconds&lt;/li&gt;
&lt;li&gt;The total added latency per token is approximately K * T_load&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With oLLM's prefetching, the effective latency is lower because loads overlap with computation. In my tests:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;K = 8 layers on CPU&lt;/li&gt;
&lt;li&gt;T_load = 4ms per layer&lt;/li&gt;
&lt;li&gt;Naive overhead = 32ms per token&lt;/li&gt;
&lt;li&gt;Actual overhead (with prefetching) = ~10-15ms per token&lt;/li&gt;
&lt;/ul&gt;
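&lt;p&gt;To make that arithmetic concrete, here's a toy model of the overhead. This is a sketch, not oLLM's code: the 2.5ms compute-per-layer figure is an assumption chosen to land in the observed 10-15ms range, not a measured constant.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;def added_latency_ms(cpu_layers, load_ms, compute_ms_per_layer, prefetch=True):
    # Without prefetching, every load stalls the pipeline; with it, only
    # the part of each load that exceeds the previous layer's compute
    # time is exposed.
    if not prefetch:
        return cpu_layers * load_ms
    exposed = max(0.0, load_ms - compute_ms_per_layer)
    return cpu_layers * exposed

print(added_latency_ms(8, 4.0, 0.0, prefetch=False))  # 32.0ms, naive
print(added_latency_ms(8, 4.0, 2.5))                  # 12.0ms, with prefetch
&lt;/pre&gt;&lt;/div&gt;
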
&lt;h4&gt;Memory Bandwidth Bottleneck&lt;/h4&gt;
&lt;p&gt;The real constraint isn't CPU speed—it's memory bandwidth. A single transformer layer in a 70B model might be 800MB-1.2GB, depending on quantization. Loading this from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;NVMe SSD: 7.4GB/s = 108-162ms per layer&lt;/li&gt;
&lt;li&gt;DDR5 RAM: 80GB/s = 10-15ms per layer&lt;/li&gt;
&lt;li&gt;PCIe 4.0 x16: 32GB/s = 25-37ms per layer (in practice)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is why oLLM's authors recommend fast NVMe SSDs (&lt;a href="https://baud.rs/pqpsCq"&gt;Samsung 990 Pro&lt;/a&gt;, &lt;a href="https://baud.rs/XluQ37"&gt;WD SN850X&lt;/a&gt;) and ideally GPU Direct Storage, which bypasses the CPU entirely for disk-to-GPU transfers.&lt;/p&gt;
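&lt;p&gt;If you want to plug in your own layer sizes and link speeds, the arithmetic is a one-liner. A minimal sketch using the theoretical peak figures quoted above:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;def layer_load_ms(layer_gb, bandwidth_gb_s):
    # Transfer time in milliseconds for one layer at a given bandwidth
    return layer_gb / bandwidth_gb_s * 1000

for name, bw in [("NVMe SSD", 7.4), ("PCIe 4.0 x16", 32.0), ("DDR5 RAM", 80.0)]:
    print(f"{name}: {layer_load_ms(1.0, bw):.0f}ms per 1GB layer")
&lt;/pre&gt;&lt;/div&gt;
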
&lt;h4&gt;Token Generation Speed Comparison&lt;/h4&gt;
&lt;p&gt;For the Llama 3.2 1B model (16 layers total), I ran benchmarks across multiple prompts and averaged the results:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Avg Tokens/sec&lt;/th&gt;
&lt;th&gt;Avg Inference Time&lt;/th&gt;
&lt;th&gt;Load Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full GPU (0 offloaded)&lt;/td&gt;
&lt;td&gt;1.92 tok/s&lt;/td&gt;
&lt;td&gt;13.90s&lt;/td&gt;
&lt;td&gt;0.46s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4 layers on CPU&lt;/td&gt;
&lt;td&gt;2.23 tok/s&lt;/td&gt;
&lt;td&gt;11.09s&lt;/td&gt;
&lt;td&gt;0.55s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8 layers on CPU&lt;/td&gt;
&lt;td&gt;2.26 tok/s&lt;/td&gt;
&lt;td&gt;10.87s&lt;/td&gt;
&lt;td&gt;0.65s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12 layers on CPU&lt;/td&gt;
&lt;td&gt;3.36 tok/s&lt;/td&gt;
&lt;td&gt;7.30s&lt;/td&gt;
&lt;td&gt;0.75s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Wait—that's backwards from what you'd expect. More layers on CPU resulted in &lt;em&gt;faster&lt;/em&gt; inference?&lt;/p&gt;
&lt;p&gt;This counterintuitive result makes sense when you consider the Strix Halo's unified memory architecture. Unlike a discrete GPU where CPU-to-GPU transfers cross the PCIe bus, the APU's "CPU memory" and "GPU memory" are the same physical RAM. Moving layers between them is essentially just a page table operation, not a data copy.&lt;/p&gt;
&lt;p&gt;The performance improvement with more offloading likely comes from reduced memory bandwidth contention. When all layers are "on GPU," they're competing for the same memory channels. With layer streaming, only the active layer's weights occupy high-bandwidth GPU memory paths, while inactive layers sit in lower-priority memory regions.&lt;/p&gt;
&lt;p&gt;This finding suggests that on unified memory systems (AMD APUs, Apple Silicon), partial loading might actually be &lt;em&gt;preferable&lt;/em&gt; to full GPU loading for memory-bandwidth-bound workloads. The conventional wisdom—that GPU is always faster—doesn't hold when there's no physical separation between GPU and CPU memory.&lt;/p&gt;
&lt;h3&gt;Transformer Version Compatibility Issues&lt;/h3&gt;
&lt;p&gt;One challenge I encountered was library compatibility. oLLM was designed for transformers 4.x, and when I initially ran it with &lt;a href="https://baud.rs/transformers"&gt;transformers 5.0&lt;/a&gt;, I hit several errors:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="ne"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'Qwen3NextExperts'&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;iterable&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This error occurred because transformers 5.0 changed how Mixture of Experts (MoE) layers expose their expert modules. The oLLM library's layer streaming code assumed it could iterate over &lt;code&gt;self.mlp.experts&lt;/code&gt;, but the new implementation uses a different structure.&lt;/p&gt;
&lt;p&gt;There were also weight shape mismatches:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;model.layers.0.input_layernorm.weight: found shape torch.Size([2048])
in the checkpoint and torch.Size([0]) in the model instantiated
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This happened because oLLM creates placeholder layers with zero-size tensors to save memory, then loads the actual weights on demand. The new transformers version changed how these placeholder shapes were inferred.&lt;/p&gt;
&lt;p&gt;The solution was straightforward: pin transformers below 5.0, which resolved to 4.57.6 at the time:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'transformers&amp;lt;5.0'&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is a common pattern with cutting-edge ML libraries. The ecosystem moves fast, and specialized tools often lag behind major version updates.&lt;/p&gt;
&lt;h3&gt;Storage and System Requirements&lt;/h3&gt;
&lt;p&gt;Before diving into partial loading, it's worth understanding the storage requirements. Unlike full GPU loading where you only need enough VRAM, partial loading requires sufficient storage capacity and bandwidth.&lt;/p&gt;
&lt;h4&gt;Disk Space Calculations&lt;/h4&gt;
&lt;p&gt;Model files on disk are typically stored in safetensors or GGUF format. A rough calculation:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;7B model (bfloat16): ~14GB&lt;/li&gt;
&lt;li&gt;13B model (bfloat16): ~26GB&lt;/li&gt;
&lt;li&gt;70B model (bfloat16): ~140GB&lt;/li&gt;
&lt;li&gt;70B model (GGUF Q4_K_M): ~40GB&lt;/li&gt;
&lt;/ul&gt;
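&lt;p&gt;These figures follow directly from parameter count times bytes per weight. A small estimator (the 4.5 bits-per-weight value for Q4_K_M is a rough average, since GGUF mixes quantization levels and adds metadata):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;def model_size_gb(params_billion, bits_per_weight):
    # Approximate on-disk size: parameters times bits per weight
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_size_gb(7, 16))    # ~14 GB  (bfloat16)
print(model_size_gb(70, 16))   # ~140 GB (bfloat16)
print(model_size_gb(70, 4.5))  # ~39 GB  (roughly Q4_K_M)
&lt;/pre&gt;&lt;/div&gt;
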
&lt;p&gt;For oLLM's layer streaming, you also need the model to be split into per-layer shards, which the library handles automatically during the first load. This adds temporary storage overhead during the conversion process.&lt;/p&gt;
&lt;h4&gt;RAM Requirements&lt;/h4&gt;
&lt;p&gt;CPU offloading means the offloaded layers live in system RAM. If you're offloading 40 of 80 layers from a 70B model, you need roughly 70GB of system RAM available—in addition to whatever the operating system and other applications need.&lt;/p&gt;
&lt;p&gt;On my Strix Halo system with 128GB unified memory (96GB allocated to VRAM, 32GB to system), this gets interesting. The "CPU" portion of memory and the "GPU" portion share the same physical DIMMs. Allocating layers to "CPU" really just means they're in a different memory region that the GPU can still access, but through a different (slower) path.&lt;/p&gt;
&lt;h4&gt;SSD Endurance Considerations&lt;/h4&gt;
&lt;p&gt;If you're streaming weights from disk rather than CPU RAM, consider your SSD's endurance. A 70B bfloat16 model is roughly 140GB across 80 layers, or about 1.75GB per layer. Streaming even a single layer from disk per generated token reads 1.75GB per token; generate 1000 tokens and you've read 1.75TB from the SSD, and every additional streamed layer scales that up.&lt;/p&gt;
&lt;p&gt;For occasional use, this is fine. For continuous operation (like a chatbot running 24/7), you might wear out a consumer SSD within months. Enterprise SSDs with higher TBW (Terabytes Written) ratings are worth considering for heavy use cases, or prefer CPU RAM offloading over disk offloading.&lt;/p&gt;
&lt;h4&gt;Memory Mapping and Page Tables&lt;/h4&gt;
&lt;p&gt;Under the hood, partial loading relies on the operating system's memory management. When a layer is "loaded" to the GPU, this typically involves:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Reading the layer from storage (if not already in RAM)&lt;/li&gt;
&lt;li&gt;Pinning the memory pages so they can't be swapped&lt;/li&gt;
&lt;li&gt;Mapping the pages into GPU-accessible memory space&lt;/li&gt;
&lt;li&gt;Synchronizing to ensure the GPU sees the updated data&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;On Linux, this uses &lt;code&gt;mmap()&lt;/code&gt; and &lt;code&gt;mlock()&lt;/code&gt; syscalls. The &lt;code&gt;vm.max_map_count&lt;/code&gt; sysctl may need to be increased for very large models:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# Check current value&lt;/span&gt;
cat&lt;span class="w"&gt; &lt;/span&gt;/proc/sys/vm/max_map_count

&lt;span class="c1"&gt;# Increase if needed&lt;/span&gt;
sudo&lt;span class="w"&gt; &lt;/span&gt;sysctl&lt;span class="w"&gt; &lt;/span&gt;-w&lt;span class="w"&gt; &lt;/span&gt;vm.max_map_count&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1048576&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I hit this limit when testing 70B+ models and saw cryptic "cannot allocate memory" errors until increasing the map count.&lt;/p&gt;
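&lt;p&gt;To make step 1 concrete (and to show why the map count grows with model size), here's a minimal sketch that maps a hypothetical per-layer shard and wraps it as a tensor without an upfront copy. Pinning and GPU mapping are normally handled by the framework, not user code:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import mmap

import numpy as np
import torch

# Each mmap() call consumes an entry against vm.max_map_count, so a model
# split into hundreds of per-layer shards creates hundreds of mappings.
with open("layer_00.bin", "rb") as f:  # hypothetical per-layer shard file
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY)

weights = np.frombuffer(mm, dtype=np.float32)  # zero-copy view of the pages
layer = torch.from_numpy(weights)              # still backed by the mapping
&lt;/pre&gt;&lt;/div&gt;
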
&lt;h3&gt;When Partial Loading Makes Sense&lt;/h3&gt;
&lt;p&gt;Based on my testing, here's when partial loading is a good fit:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Good Use Cases:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Batch processing where latency isn't critical (overnight analysis, embedding generation)&lt;/li&gt;
&lt;li&gt;Interactive use with smaller models where the overhead is manageable&lt;/li&gt;
&lt;li&gt;Running larger models occasionally without investing in more VRAM&lt;/li&gt;
&lt;li&gt;Testing different model sizes before committing to hardware&lt;/li&gt;
&lt;li&gt;APU systems where CPU-GPU transfer costs are minimal&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Poor Use Cases:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Real-time applications (chatbots, live transcription)&lt;/li&gt;
&lt;li&gt;High-throughput production systems&lt;/li&gt;
&lt;li&gt;When quantization gives acceptable quality with lower overhead&lt;/li&gt;
&lt;li&gt;Systems with slow storage (spinning disks, older SSDs)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The break-even point depends heavily on your specific hardware. On my APU system with unified memory, offloading 50% of the layers actually improved throughput by about 18% (2.26 vs 1.92 tok/s). On a discrete GPU with PCIe 3.0, the same configuration might cost 60-70% of your throughput.&lt;/p&gt;
&lt;h3&gt;Future Directions&lt;/h3&gt;
&lt;p&gt;Several developments could make partial loading more practical:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GPU Direct Storage (GDS):&lt;/strong&gt; NVIDIA's GDS and AMD's equivalent allow direct SSD-to-GPU transfers over peer-to-peer PCIe DMA, bypassing the CPU's bounce buffers. Early implementations show 3-4x improvements in layer load times.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Better Prefetching Algorithms:&lt;/strong&gt; Current implementations use simple next-layer prefetching. More sophisticated approaches could predict multiple layers ahead, or prioritize layers that are accessed most frequently (relevant for some architectures with skip connections).&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Hardware Evolution:&lt;/strong&gt; Unified memory architectures like Apple Silicon and AMD APUs eliminate the CPU-GPU transfer bottleneck entirely. As these architectures gain more memory capacity, partial loading becomes increasingly attractive.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Compression:&lt;/strong&gt; Applying neural compression to stored weights (not quantization, but actual neural codecs) could reduce the bandwidth requirements by 2-4x without quality loss.&lt;/p&gt;
&lt;h3&gt;Building a Benchmark Framework&lt;/h3&gt;
&lt;p&gt;For those who want to measure partial loading on their own hardware, here's the framework I developed:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;dataclasses&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;typing&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;torch&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;BenchmarkResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;total_layers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;gpu_layers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;cpu_layers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;loading_mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;  &lt;span class="c1"&gt;# 'partial' or 'full'&lt;/span&gt;
    &lt;span class="n"&gt;load_time_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;tokens_generated&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;
    &lt;span class="n"&gt;total_inference_time&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;tokens_per_second&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;per_layer_load_times&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;peak_vram_gb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;benchmark_ollm_inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;offload_layers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;BenchmarkResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;ollm&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Inference&lt;/span&gt;

    &lt;span class="c1"&gt;# Measure loading time&lt;/span&gt;
    &lt;span class="n"&gt;load_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Inference&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"cuda:0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ini_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;models_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"./models/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;force_download&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;total_layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;offload_layers&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;offload_layers_to_cpu&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layers_num&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;offload_layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;load_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;load_start&lt;/span&gt;

    &lt;span class="c1"&gt;# Measure inference time&lt;/span&gt;
    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;infer_start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;synchronize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;infer_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;perf_counter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;infer_start&lt;/span&gt;

    &lt;span class="c1"&gt;# Count tokens&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;o&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;BenchmarkResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;total_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;total_layers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;gpu_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;total_layers&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;offload_layers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;cpu_layers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;offload_layers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;loading_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'partial'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;offload_layers&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s1"&gt;'full'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;load_time_seconds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;load_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tokens_generated&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;total_inference_time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;infer_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tokens_per_second&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;infer_time&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;infer_time&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;peak_vram_gb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_memory_allocated&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1e9&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This framework measures the key metrics: loading time, inference time, tokens per second, and VRAM usage. Run it with different &lt;code&gt;offload_layers&lt;/code&gt; values to map out the performance curve for your specific hardware.&lt;/p&gt;
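&lt;p&gt;A typical driver loop looks like this; the model ID and offload values are illustrative, so substitute whatever oLLM resolves on your setup:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;if __name__ == "__main__":
    prompt = "What is the capital of France?"
    for offload in (0, 4, 8, 12):  # sweep the offload axis
        r = benchmark_ollm_inference("llama3-1B-chat", offload, prompt)
        print(f"{r.cpu_layers} CPU layers: {r.tokens_per_second:.2f} tok/s, "
              f"load {r.load_time_seconds:.2f}s, peak VRAM {r.peak_vram_gb:.1f}GB")
&lt;/pre&gt;&lt;/div&gt;
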
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Partial LLM loading isn't a silver bullet, but it's a valuable technique for expanding what's possible on memory-constrained hardware. On my 128GB APU system, I found something unexpected: partial loading with 12 of 16 layers on CPU actually &lt;em&gt;outperformed&lt;/em&gt; full GPU loading by 75% (3.36 tok/s vs 1.92 tok/s).&lt;/p&gt;
&lt;p&gt;The key takeaways:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Unified memory changes everything.&lt;/strong&gt; On APUs and Apple Silicon, the conventional wisdom that "GPU is always faster" doesn't hold. Reduced memory bandwidth contention can make partial loading preferable even when you have enough VRAM.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Prefetching is essential.&lt;/strong&gt; Naive layer loading is too slow. Libraries like oLLM that prefetch the next layer during current layer computation can reduce overhead by 50% or more.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Memory bandwidth matters more than CPU speed.&lt;/strong&gt; The bottleneck is getting bytes from storage/RAM to the GPU, not processing them once they're there.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Library maturity varies.&lt;/strong&gt; Expect compatibility issues with newer transformers versions. Pin your dependencies—oLLM requires transformers 4.x.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Quality is preserved.&lt;/strong&gt; Partial loading changes where weights live, not what they are. Outputs match full GPU inference exactly (assuming matching precision).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Benchmark your specific hardware.&lt;/strong&gt; My results on a Strix Halo APU won't match discrete GPU performance. The only way to know what works best is to measure it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For batch processing and experimentation, partial loading lets you access models that would otherwise require more expensive hardware. For unified memory systems specifically, partial loading might be the &lt;em&gt;optimal&lt;/em&gt; configuration, not just a fallback.&lt;/p&gt;
&lt;p&gt;The era of "if it doesn't fit in VRAM, you can't run it" is ending. With the right techniques, nearly any model becomes accessible—and on the right hardware, you might even get a performance bonus for your trouble.&lt;/p&gt;</description><category>amd</category><category>gpu memory</category><category>layer streaming</category><category>llm</category><category>machine learning</category><category>memory optimization</category><category>ollm</category><category>partial loading</category><category>pytorch</category><category>rocm</category><category>strix halo</category><category>transformers</category><category>vram</category><guid>https://tinycomputers.io/posts/partial-llm-loading-running-models-too-big-for-vram.html</guid><pubDate>Thu, 05 Feb 2026 16:00:00 GMT</pubDate></item><item><title>Running Qwen TTS on AMD Strix Halo: A Complete Guide to Local Text-to-Speech</title><link>https://tinycomputers.io/posts/qwen-tts-on-amd-strix-halo.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;p&gt;The rise of high-quality text-to-speech models has opened new possibilities for content creators, accessibility advocates, and developers alike. Qwen3-TTS, developed by Alibaba's Qwen team, represents a significant leap forward in neural TTS technology, offering natural-sounding speech synthesis with multiple speaker voices. In this guide, we'll walk through setting up Qwen3-TTS on AMD's Strix Halo platform—specifically the AI Max+ 395 with its integrated Radeon 8060S graphics—and demonstrate how we use it to generate audio narrations for blog posts right here on TinyComputers.&lt;/p&gt;
&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/qwen-tts-on-amd-strix-halo_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;16 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;h3&gt;Why Qwen3-TTS?&lt;/h3&gt;
&lt;p&gt;The text-to-speech landscape has evolved dramatically over the past few years. While cloud-based services like Amazon Polly, Google Cloud TTS, and ElevenLabs offer impressive quality, they come with ongoing costs, privacy considerations, and internet dependency. Local TTS solutions have historically lagged behind in quality, often producing robotic or unnatural speech.&lt;/p&gt;
&lt;p&gt;Qwen3-TTS changes this equation. The model produces remarkably natural speech with proper intonation, pacing, and emphasis. It supports multiple pre-trained speaker voices—including options like Eric, Aiden, Dylan, Serena, and others—each with distinct characteristics suitable for different content types. For technical content like our blog posts, the Eric voice provides clear, professional narration that listeners find easy to follow.&lt;/p&gt;
&lt;p&gt;The model we're using, &lt;code&gt;Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice&lt;/code&gt;, weighs in at 1.7 billion parameters. While not small, this is manageable on modern hardware and runs efficiently on GPU. The 12Hz designation refers to the audio frame rate used during generation, balancing quality with computational requirements.&lt;/p&gt;
&lt;h3&gt;The Hardware: AMD AI Max+ 395&lt;/h3&gt;
&lt;p&gt;AMD's Strix Halo architecture represents their latest push into the high-performance APU market, combining powerful CPU cores with substantial integrated graphics. Our test system features:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;CPU&lt;/strong&gt;: AMD Ryzen AI Max+ 395 with 16 Zen 5 cores (32 threads)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU&lt;/strong&gt;: Integrated Radeon 8060S (RDNA 3.5 architecture)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory&lt;/strong&gt;: 128GB unified DDR5, configured with 96GB VRAM and 32GB system RAM&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute Units&lt;/strong&gt;: 40 CUs dedicated to graphics/compute workloads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Our test system is the Bosgame M5 AI Mini Desktop, one of the first mini PCs to ship with AMD's Strix Halo silicon. The &lt;a href="https://baud.rs/gmVPEI"&gt;GMKtec EVO-X2&lt;/a&gt; is an extremely similar system if you're looking to replicate this setup. The unified memory architecture is particularly relevant for machine learning workloads. Unlike discrete GPUs with their own VRAM, the Radeon 8060S shares system memory with the CPU. This means no PCIe bottleneck for data transfers, and with 96GB allocated as VRAM, even large models fit comfortably.&lt;/p&gt;
&lt;p&gt;For our TTS workload, the 8060S provides adequate performance. The 1.7B parameter model fits comfortably in memory, and inference runs entirely on GPU once loaded. We see 100% GPU utilization during speech synthesis, indicating the hardware is being fully leveraged.&lt;/p&gt;
&lt;h3&gt;Setting Up the Environment&lt;/h3&gt;
&lt;p&gt;The first challenge with AMD GPUs is getting PyTorch working correctly with ROCm, AMD's open-source GPU compute stack. The Strix Halo uses a newer GPU architecture (gfx1151) that requires ROCm 6.x and some environment variable overrides.&lt;/p&gt;
&lt;h4&gt;Step 1: Create a Python Virtual Environment&lt;/h4&gt;
&lt;p&gt;We'll use a dedicated virtual environment to isolate our TTS dependencies:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;mkdir&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;~/qwen-tts
&lt;span class="nb"&gt;cd&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;~/qwen-tts
python3&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;venv&lt;span class="w"&gt; &lt;/span&gt;venv
&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;venv/bin/activate
&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;Step 2: Install PyTorch with ROCm Support&lt;/h4&gt;
&lt;p&gt;The standard PyTorch installation won't work—we need the ROCm-enabled build. As of this writing, ROCm 6.4 is the latest stable release:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;torch&lt;span class="w"&gt; &lt;/span&gt;torchvision&lt;span class="w"&gt; &lt;/span&gt;torchaudio&lt;span class="w"&gt; &lt;/span&gt;--index-url&lt;span class="w"&gt; &lt;/span&gt;https://download.pytorch.org/whl/rocm6.4
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This downloads PyTorch builds compiled specifically for AMD GPUs. The installation is larger than the standard CUDA builds due to the different compute libraries involved.&lt;/p&gt;
&lt;h4&gt;Step 3: Install Qwen-TTS&lt;/h4&gt;
&lt;p&gt;With PyTorch in place, install the Qwen TTS package:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;qwen-tts&lt;span class="w"&gt; &lt;/span&gt;soundfile&lt;span class="w"&gt; &lt;/span&gt;numpy
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;soundfile&lt;/code&gt; library handles WAV file I/O, while &lt;code&gt;numpy&lt;/code&gt; is needed for audio array manipulation.&lt;/p&gt;
&lt;h4&gt;Step 4: Install xformers for ROCm (Optional but Recommended)&lt;/h4&gt;
&lt;p&gt;The xformers library provides optimized attention implementations that can improve performance:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;xformers&lt;span class="w"&gt; &lt;/span&gt;--index-url&lt;span class="w"&gt; &lt;/span&gt;https://download.pytorch.org/whl/rocm6.4
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;While Qwen-TTS will work without xformers, having it available enables more memory-efficient attention kernels during inference.&lt;/p&gt;
&lt;h4&gt;Step 5: Configure Environment Variables&lt;/h4&gt;
&lt;p&gt;The Strix Halo's gfx1151 architecture isn't explicitly recognized by all ROCm components yet. We need to tell the system to treat it as a compatible architecture:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;HSA_OVERRIDE_GFX_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;.0.0
&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;GPU_MAX_ALLOC_PERCENT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;
&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;GPU_MAX_HEAP_SIZE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;
&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let's break down what these do:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HSA_OVERRIDE_GFX_VERSION=11.0.0&lt;/strong&gt;: Tells the HSA runtime to report the GPU as gfx1100, which has broader library support&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU_MAX_ALLOC_PERCENT=100&lt;/strong&gt;: Allows the GPU to use up to 100% of available memory for allocations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU_MAX_HEAP_SIZE=100&lt;/strong&gt;: Similar memory allocation setting for heap operations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1&lt;/strong&gt;: Enables experimental efficient attention implementations for AMD GPUs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Add these to your &lt;code&gt;.bashrc&lt;/code&gt; or create an activation script for convenience.&lt;/p&gt;
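&lt;p&gt;If you'd rather not depend on shell state, the same overrides can be set from Python, as long as it happens before torch is imported (the HSA runtime reads them at initialization). A sketch mirroring the exports above:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import os

# Must run before `import torch`, which initializes the ROCm runtime
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "11.0.0")
os.environ.setdefault("GPU_MAX_ALLOC_PERCENT", "100")
os.environ.setdefault("GPU_MAX_HEAP_SIZE", "100")
os.environ.setdefault("TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL", "1")

import torch  # noqa: E402 -- deliberately imported after the env setup
&lt;/pre&gt;&lt;/div&gt;
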
&lt;h4&gt;Step 6: Verify GPU Detection&lt;/h4&gt;
&lt;p&gt;Before proceeding, confirm PyTorch can see your GPU:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"CUDA available: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"Device count: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device_count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"Device name: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_device_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should see output like:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;CUDA available: True
Device count: 1
Device name: AMD Radeon 8060S
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Note that PyTorch uses "CUDA" terminology even for AMD GPUs when using ROCm—this is for API compatibility.&lt;/p&gt;
&lt;h3&gt;Basic TTS Usage&lt;/h3&gt;
&lt;p&gt;With the environment configured, let's test basic speech synthesis:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;qwen_tts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;soundfile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# Load model on GPU with bfloat16 precision&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attn_implementation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'sdpa'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'cuda:0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Check available speakers&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"Available speakers: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;get_supported_speakers&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Generate speech&lt;/span&gt;
&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Hello, and welcome to TinyComputers. Today we're exploring text-to-speech on AMD hardware."&lt;/span&gt;
&lt;span class="n"&gt;audios&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate_custom_voice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'eric'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save to file&lt;/span&gt;
&lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'output.wav'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;audios&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"Saved audio at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;Hz"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;A few important notes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;We use &lt;code&gt;attn_implementation='sdpa'&lt;/code&gt; for scaled dot-product attention, which works on ROCm&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;device_map='cuda:0'&lt;/code&gt; explicitly places the model on the GPU&lt;/li&gt;
&lt;li&gt;Using &lt;code&gt;dtype=torch.bfloat16&lt;/code&gt; reduces memory usage while maintaining quality&lt;/li&gt;
&lt;li&gt;The language parameter must be the full word &lt;code&gt;'english'&lt;/code&gt;, not the abbreviation &lt;code&gt;'en'&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Building a Blog-to-Speech Pipeline&lt;/h3&gt;
&lt;p&gt;For our use case—generating audio versions of blog posts—we need more than basic TTS. Blog posts contain markdown formatting, code blocks, images, and other elements that shouldn't be read aloud. We built a complete pipeline that handles these challenges.&lt;/p&gt;
&lt;h4&gt;The Blog Cleaner&lt;/h4&gt;
&lt;p&gt;Our cleaning process strips out non-spoken content while preserving the narrative flow:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;re&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;clean_markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove YAML frontmatter&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'---'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;find&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'---'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;:]&lt;/span&gt;

    &lt;span class="c1"&gt;# Strip HTML tags (audio, video, images)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;audio[^&amp;gt;]*&amp;gt;[\s\S]*?&amp;lt;/audio&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;video[^&amp;gt;]*&amp;gt;[\s\S]*?&amp;lt;/video&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;img[^&amp;gt;]*/?&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IGNORECASE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;[^&amp;gt;]+&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove markdown images and convert links to just text&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'!\[[^\]]*\]\([^)]+\)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\[([^\]]+)\]\([^)]+\)'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove code blocks&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'```[\s\S]*?```'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'`[^`]+`'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Convert headers to sentences&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'^(#{1,6})\s+(.+)$'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\2.'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MULTILINE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove emphasis markers&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\*\*([^*]+)\*\*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\*([^*]+)\*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\1'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
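
&lt;p&gt;A quick sanity check on a toy post shows the intended behavior:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;sample = "## Hardware\nThe box has **16 cores** and a [review](https://example.com)."
print(clean_markdown(sample))
# Hardware.
# The box has 16 cores and a review.
&lt;/pre&gt;&lt;/div&gt;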

&lt;h4&gt;Unit Conversion for Speech&lt;/h4&gt;
&lt;p&gt;Technical content often includes abbreviations that sound awkward when read literally. We convert common units to their spoken forms:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;convert_units_for_speech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'(\d+)\s*GB\b'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\1 gigabytes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'(\d+)\s*MB\b'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\1 megabytes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'(\d+)\s*GHz\b'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\1 gigahertz'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'(\d+)\s*MHz\b'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\1 megahertz'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'(\d+)\s*KB\b'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'\1 kilobytes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
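
&lt;p&gt;A quick sanity check of what the substitutions produce:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;print(convert_units_for_speech("The APU pairs 128 GB of RAM with a 3 GHz boost clock"))
# The APU pairs 128 gigabytes of RAM with a 3 gigahertz boost clock
&lt;/pre&gt;&lt;/div&gt;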

&lt;h4&gt;Chunking Long Content&lt;/h4&gt;
&lt;p&gt;TTS models work best with moderate-length inputs. Very long passages can cause quality degradation or memory issues. We split content into chunks at sentence boundaries:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'(?&amp;lt;=[.!?])\s+'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;max_chars&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;" "&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
            &lt;span class="n"&gt;current&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;" "&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;chunks&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
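
&lt;p&gt;One edge case worth noting: a sentence longer than &lt;code&gt;max_chars&lt;/code&gt; still lands in a chunk by itself, oversized. If that ever causes trouble, a small guard (a sketch of ours, not part of the original script) can split such sentences at comma boundaries first:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;def split_long_sentence(sentence, max_chars=500):
    # Fall back to comma boundaries when a single sentence exceeds the budget
    if len(sentence) &amp;lt;= max_chars:
        return [sentence]
    parts, current = [], ""
    for piece in sentence.split(", "):
        if len(current) + len(piece) &amp;lt; max_chars:
            current += piece + ", "
        else:
            if current:
                parts.append(current.rstrip(", "))
            current = piece + ", "
    if current:
        parts.append(current.rstrip(", "))
    return parts
&lt;/pre&gt;&lt;/div&gt;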

&lt;h4&gt;The Complete Script&lt;/h4&gt;
&lt;p&gt;Putting it all together, here's our &lt;code&gt;blog_to_speech.py&lt;/code&gt; script:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="ch"&gt;#!/usr/bin/env python3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;argparse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pathlib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;qwen_tts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;soundfile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;clean_blog_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Apply the cleaning functions defined above (markdown stripping, etc.)&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;convert_units_for_speech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;synthesize_speech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"eric"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s1"&gt;'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;attn_implementation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'sdpa'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'cuda:0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;all_audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"Processing chunk &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;audios&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate_custom_voice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;all_audio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audios&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_audio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;duration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"Saved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;.1f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;s audio to: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output_file&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="vm"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s1"&gt;'__main__'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;argparse&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ArgumentParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'source'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;help&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Blog post markdown file'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'-o'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'--output'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'output.wav'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add_argument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'--speaker'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'eric'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;parser&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parse_args&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean_blog_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;synthesize_speech&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Choosing the Right Speaker Voice&lt;/h3&gt;
&lt;p&gt;Qwen3-TTS ships with nine pre-trained speaker voices, each with distinct characteristics:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Speaker&lt;/th&gt;
&lt;th&gt;Characteristics&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Eric&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clear, professional male voice with measured pacing&lt;/td&gt;
&lt;td&gt;Technical content, tutorials, documentation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Aiden&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Younger male voice, slightly more casual&lt;/td&gt;
&lt;td&gt;Blog posts, conversational content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dylan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Deeper male voice with authoritative tone&lt;/td&gt;
&lt;td&gt;Formal presentations, announcements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ryan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Energetic male voice&lt;/td&gt;
&lt;td&gt;Marketing content, product demos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Serena&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clear female voice, professional&lt;/td&gt;
&lt;td&gt;Corporate content, tutorials&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vivian&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Warm female voice&lt;/td&gt;
&lt;td&gt;Storytelling, narrative content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ono Anna&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Female voice with distinct character&lt;/td&gt;
&lt;td&gt;Creative content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sohee&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Female voice, versatile&lt;/td&gt;
&lt;td&gt;General purpose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Uncle Fu&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Character voice&lt;/td&gt;
&lt;td&gt;Specialized applications&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;For our technical blog content, we primarily use Eric. His clear enunciation and measured pacing work well for complex technical explanations. The voice handles acronyms, numbers, and technical terminology naturally, making it ideal for content about hardware, programming, and system administration.&lt;/p&gt;
&lt;p&gt;You can easily switch voices by changing the &lt;code&gt;speaker&lt;/code&gt; parameter:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;audios&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sample_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate_custom_voice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'serena'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Try different voices&lt;/span&gt;
    &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Consider matching voice characteristics to content type. A hardware review might work better with Eric's authoritative tone, while a personal essay might benefit from Aiden's more conversational style.&lt;/p&gt;
&lt;h3&gt;Comparing TTS Options&lt;/h3&gt;
&lt;p&gt;Before settling on Qwen3-TTS, we evaluated several alternatives. Here's how they compare for our use case:&lt;/p&gt;
&lt;h4&gt;Cloud Services&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Amazon Polly&lt;/strong&gt; and &lt;strong&gt;Google Cloud TTS&lt;/strong&gt; offer excellent quality with minimal setup. However, costs accumulate quickly for long-form content. At roughly $4-16 per million characters (depending on voice quality), a 3000-word blog post costs $0.10-0.40 per generation. For a site with dozens of posts requiring periodic regeneration, this adds up.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;ElevenLabs&lt;/strong&gt; produces arguably the most natural voices available, with impressive emotional range. But their pricing model—based on character quotas—makes it expensive for regular content generation. The quality is exceptional, but overkill for straightforward narration.&lt;/p&gt;
&lt;h4&gt;Local Alternatives&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Coqui TTS&lt;/strong&gt; (now deprecated) was a popular open-source option but development has stalled. &lt;strong&gt;Bark&lt;/strong&gt; from Suno produces impressive results but runs slowly and lacks fine-grained control. &lt;strong&gt;XTTS&lt;/strong&gt; offers voice cloning but requires more setup and compute resources.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Piper&lt;/strong&gt; deserves special mention as a lightweight option. It runs quickly even on CPU and produces acceptable quality for many applications. However, the voices sound noticeably synthetic compared to Qwen3-TTS—fine for notifications or short snippets, but fatiguing for 30-minute narrations.&lt;/p&gt;
&lt;p&gt;Qwen3-TTS hits a sweet spot: quality approaching cloud services, reasonable compute requirements, and fully local operation. The 1.7B parameter model is large enough for natural prosody but small enough to run on consumer hardware.&lt;/p&gt;
&lt;h3&gt;Batch Processing for Multiple Posts&lt;/h3&gt;
&lt;p&gt;When generating audio for multiple blog posts, efficiency matters. Loading the model takes 15-30 seconds, so we keep it loaded while processing multiple files:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="ch"&gt;#!/usr/bin/env python3&lt;/span&gt;
&lt;span class="sd"&gt;"""Batch TTS processing for multiple blog posts"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;pathlib&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;qwen_tts&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;torch&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;soundfile&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;sf&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;numpy&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;np&lt;/span&gt;
&lt;span class="c1"&gt;# Helper functions from blog_to_speech.py above&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;blog_to_speech&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clean_blog_post&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;

&lt;span class="c1"&gt;# Load model once&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Loading model..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Qwen3TTSModel&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s1"&gt;'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;attn_implementation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'sdpa'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;device_map&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'cuda:0'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;posts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s1"&gt;'post1.md'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'post2.md'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s1"&gt;'post3.md'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;posts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'='&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"Processing: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'='&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clean_blog_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"/tmp/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stem&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_tts.wav"&lt;/span&gt;

    &lt;span class="c1"&gt;# Process chunks&lt;/span&gt;
    &lt;span class="n"&gt;chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;all_audio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nb"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"  Chunk &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;audios&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;generate_custom_voice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'eric'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;all_audio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audios&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_audio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"Saved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This approach processes our five-post backlog overnight, with results ready for review in the morning.&lt;/p&gt;
&lt;h3&gt;Performance Characteristics&lt;/h3&gt;
&lt;p&gt;On the AI Max+ 395, speech synthesis runs at roughly 0.3 to 0.5x real-time speed, meaning a 30-minute audio file takes 60 to 100 minutes to generate. This is slower than high-end discrete GPUs but perfectly acceptable for batch processing.&lt;/p&gt;
&lt;p&gt;For reference, here's how different content lengths performed in our testing:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Content&lt;/th&gt;
&lt;th&gt;Characters&lt;/th&gt;
&lt;th&gt;Chunks&lt;/th&gt;
&lt;th&gt;Audio Duration&lt;/th&gt;
&lt;th&gt;Generation Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Short post&lt;/td&gt;
&lt;td&gt;5,000&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;~5 min&lt;/td&gt;
&lt;td&gt;~15 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium post&lt;/td&gt;
&lt;td&gt;15,000&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;~15 min&lt;/td&gt;
&lt;td&gt;~45 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Long post&lt;/td&gt;
&lt;td&gt;25,000&lt;/td&gt;
&lt;td&gt;55&lt;/td&gt;
&lt;td&gt;~27 min&lt;/td&gt;
&lt;td&gt;~90 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Very long&lt;/td&gt;
&lt;td&gt;40,000&lt;/td&gt;
&lt;td&gt;85&lt;/td&gt;
&lt;td&gt;~45 min&lt;/td&gt;
&lt;td&gt;~150 min&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The relationship between content length and generation time is roughly linear after the initial model warmup: about 3 to 4 minutes of generation per 1,000 characters of input.&lt;/p&gt;
&lt;p&gt;Some observations from our testing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;First chunk latency&lt;/strong&gt;: The first chunk takes longer due to GPU kernel compilation and caching&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory usage&lt;/strong&gt;: Peak usage around 8-10GB during inference&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU utilization&lt;/strong&gt;: Consistent 100% during active synthesis&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quality&lt;/strong&gt;: Indistinguishable from cloud TTS services for most content&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The MIOpen library sometimes logs workspace warnings during execution. These don't affect output quality and can be safely ignored:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;MIOpen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;HIP&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Warning&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;IsEnoughWorkspace&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Solver&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;GemmFwdRest&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;required&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;103133184&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Integrating Audio into Blog Posts&lt;/h3&gt;
&lt;p&gt;Once we have the WAV file, we convert to MP3 for web delivery and embed an HTML5 audio player:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;ffmpeg&lt;span class="w"&gt; &lt;/span&gt;-i&lt;span class="w"&gt; &lt;/span&gt;blog_post.wav&lt;span class="w"&gt; &lt;/span&gt;-codec:a&lt;span class="w"&gt; &lt;/span&gt;libmp3lame&lt;span class="w"&gt; &lt;/span&gt;-qscale:a&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;blog_post.mp3
&lt;/pre&gt;&lt;/div&gt;
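
&lt;p&gt;The duration shown in the widget footer comes from the audio file itself. Since soundfile is already part of the pipeline, one way to compute the label (a convenience sketch, not part of the published script):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import soundfile as sf

# Length of the narration in minutes, for the widget footer
info = sf.info('blog_post.wav')
print(f"{round(info.duration / 60)} min · AI-generated narration")
&lt;/pre&gt;&lt;/div&gt;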

&lt;p&gt;For reviewing TTS output quality, we recommend using &lt;a href="https://baud.rs/tn2v8w"&gt;studio monitor headphones&lt;/a&gt; that reveal any artifacts or unnatural tones in the generated speech.&lt;/p&gt;
&lt;p&gt;The player HTML is straightforward:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"background: #f8f9fa; border: 1px solid #e9ecef;&lt;/span&gt;
&lt;span class="s"&gt;            border-radius: 8px; padding: 16px 20px; margin: 20px 0;"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"audio-widget-header"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"audio-widget-icon"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;🎧&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"color: #495057; font-weight: 600;"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;Listen to this article&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;span&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;audio&lt;/span&gt; &lt;span class="na"&gt;controls&lt;/span&gt; &lt;span class="na"&gt;preload&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"metadata"&lt;/span&gt; &lt;span class="na"&gt;style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"width: 100%;"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;source&lt;/span&gt; &lt;span class="na"&gt;src&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"/audio/blog_post.mp3"&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"audio/mpeg"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="p"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt; &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"audio-widget-footer"&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
    27 min · AI-generated narration
  &lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;&amp;lt;/&lt;/span&gt;&lt;span class="nt"&gt;div&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;Why We're Doing This&lt;/h3&gt;
&lt;p&gt;Adding audio narration to blog posts serves multiple purposes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Accessibility&lt;/strong&gt;: Readers with visual impairments or reading difficulties can consume content aurally&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Convenience&lt;/strong&gt;: Listeners can enjoy posts during commutes, workouts, or other activities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Engagement&lt;/strong&gt;: Audio content creates a more personal connection with the audience&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reach&lt;/strong&gt;: Some audiences prefer audio format, expanding our potential readership&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Running TTS locally rather than using cloud services gives us:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Cost control&lt;/strong&gt;: No per-character or per-minute fees&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Privacy&lt;/strong&gt;: Content never leaves our infrastructure&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Same voice and quality across all posts&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;: Full control over processing pipeline&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Troubleshooting Common Issues&lt;/h3&gt;
&lt;h4&gt;"CUDA not available" despite GPU present&lt;/h4&gt;
&lt;p&gt;Ensure you've installed the ROCm version of PyTorch, not the standard build:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;pip&lt;span class="w"&gt; &lt;/span&gt;uninstall&lt;span class="w"&gt; &lt;/span&gt;torch&lt;span class="w"&gt; &lt;/span&gt;torchvision&lt;span class="w"&gt; &lt;/span&gt;torchaudio
pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;torch&lt;span class="w"&gt; &lt;/span&gt;torchvision&lt;span class="w"&gt; &lt;/span&gt;torchaudio&lt;span class="w"&gt; &lt;/span&gt;--index-url&lt;span class="w"&gt; &lt;/span&gt;https://download.pytorch.org/whl/rocm6.4
&lt;/pre&gt;&lt;/div&gt;
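
&lt;p&gt;To confirm the ROCm wheel is the one actually loaded, a quick check helps (on ROCm builds &lt;code&gt;torch.version.hip&lt;/code&gt; is populated; on CPU or CUDA builds it is None):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import torch

print(torch.version.hip)          # populated on ROCm builds, None otherwise
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
&lt;/pre&gt;&lt;/div&gt;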

&lt;h4&gt;Model runs on CPU instead of GPU&lt;/h4&gt;
&lt;p&gt;Check that &lt;code&gt;device_map='cuda:0'&lt;/code&gt; is specified when loading the model. Also verify the environment variables are set before starting Python.&lt;/p&gt;
&lt;h4&gt;"Unsupported language 'en'"&lt;/h4&gt;
&lt;p&gt;Use the full language name: &lt;code&gt;language='english'&lt;/code&gt; not &lt;code&gt;language='en'&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Out of memory errors&lt;/h4&gt;
&lt;p&gt;Try reducing chunk size or using a smaller batch. The model should fit in 16GB, but very long chunks can spike memory usage.&lt;/p&gt;
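&lt;p&gt;With the pipeline above, that means lowering the &lt;code&gt;max_chars&lt;/code&gt; budget, for example:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;chunks = chunk_text(text, max_chars=250)  # smaller chunks keep peak memory lower
&lt;/pre&gt;&lt;/div&gt;
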
&lt;h4&gt;Slow first chunk&lt;/h4&gt;
&lt;p&gt;This is normal—ROCm compiles GPU kernels on first use. Subsequent chunks process faster.&lt;/p&gt;
&lt;h3&gt;Future Improvements&lt;/h3&gt;
&lt;p&gt;Our current pipeline works well but has room for enhancement. Some improvements we're considering:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Voice cloning&lt;/strong&gt;: Qwen3-TTS supports custom voice training. With sufficient audio samples, we could create a unique voice for TinyComputers rather than using the stock speakers. This would provide brand consistency and differentiation.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Automatic post detection&lt;/strong&gt;: Currently we manually select posts for TTS generation. A CI/CD integration could automatically generate audio for new posts when they're published, keeping the audio library current without manual intervention.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Chapter markers&lt;/strong&gt;: For longer posts, embedding chapter markers in the audio file would allow listeners to skip to specific sections. This requires parsing the markdown headers and mapping them to audio timestamps.&lt;/p&gt;
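&lt;p&gt;A rough sketch of how that could work (hypothetical; none of this is in the current pipeline): split the source on markdown headers, synthesize section by section, and record the running offset as each chapter's start time. Here &lt;code&gt;raw_markdown&lt;/code&gt; and the loaded &lt;code&gt;model&lt;/code&gt; are assumed, and each body would still go through the cleaning and chunking steps described earlier:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import re

def sections_with_titles(markdown_text):
    parts = re.split(r'^#{1,6}\s+(.+)$', markdown_text, flags=re.MULTILINE)
    # With one capture group, re.split yields
    # [preamble, title1, body1, title2, body2, ...]
    for i in range(1, len(parts), 2):
        yield parts[i].strip(), parts[i + 1]

chapters = []
elapsed = 0.0
for title, body in sections_with_titles(raw_markdown):
    chapters.append((round(elapsed), title))   # (seconds offset, chapter title)
    audios, sample_rate = model.generate_custom_voice(body, speaker='eric', language='english')
    elapsed += len(audios[0]) / sample_rate
&lt;/pre&gt;&lt;/div&gt;
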
&lt;p&gt;&lt;strong&gt;Multiple format export&lt;/strong&gt;: Beyond MP3, offering Opus or AAC formats could reduce file sizes while maintaining quality, benefiting listeners on metered connections.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Speed adjustment&lt;/strong&gt;: Some listeners prefer 1.25x or 1.5x playback speed. Pre-generating speed-adjusted versions could provide better quality than real-time speed adjustment in the browser.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Running Qwen3-TTS on AMD's Strix Halo platform demonstrates that high-quality local TTS is now accessible beyond NVIDIA hardware. While setup requires some ROCm-specific configuration, the results are impressive—natural-sounding narration suitable for professional content.&lt;/p&gt;
&lt;p&gt;The democratization of AI capabilities continues apace. What once required expensive cloud subscriptions or high-end NVIDIA GPUs now runs on integrated graphics. The AI Max+ 395's Radeon 8060S, primarily designed for gaming and general compute tasks, handles a 1.7-billion parameter language model without breaking a sweat.&lt;/p&gt;
&lt;p&gt;We're actively using this pipeline to generate audio versions of posts across TinyComputers, making our technical content more accessible and convenient for our readers. As of this writing, we've processed our retrocomputing series, hardware reviews, and technical tutorials—dozens of hours of content generated entirely on local hardware.&lt;/p&gt;
&lt;p&gt;The combination of AMD's capable integrated graphics and Qwen's excellent TTS model proves that you don't need expensive discrete GPUs or cloud subscriptions to achieve broadcast-quality speech synthesis. For content creators, educators, and accessibility advocates, this opens new possibilities for enriching written content with audio without ongoing service costs.&lt;/p&gt;
&lt;p&gt;If you're running AMD hardware and want to add audio narration to your own content, this guide should get you started. The initial setup investment pays dividends in ongoing cost savings and the satisfaction of running capable AI models entirely on your own infrastructure. And if you encounter issues along the way, the troubleshooting section above addresses the most common pitfalls we discovered during our own setup process.&lt;/p&gt;
&lt;p&gt;The audio player at the top of many TinyComputers posts now represents a small but meaningful step toward making technical content more accessible. Every post you can listen to while commuting, exercising, or doing dishes is content that might otherwise go unread. That's the real value of local TTS—not just cost savings, but expanded reach for the ideas we share.&lt;/p&gt;</description><category>ai max+ 395</category><category>amd</category><category>audio</category><category>machine learning</category><category>pytorch</category><category>qwen</category><category>rocm</category><category>strix halo</category><category>text-to-speech</category><category>tts</category><guid>https://tinycomputers.io/posts/qwen-tts-on-amd-strix-halo.html</guid><pubDate>Sat, 24 Jan 2026 18:00:00 GMT</pubDate></item><item><title>A Bespoke LLM Code Scanner</title><link>https://tinycomputers.io/posts/a-bespoke-llm-code-scanner.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;h2&gt;Building a Nightly AI Code Scanner with vLLM, ROCm, and JIRA Integration&lt;/h2&gt;
&lt;p&gt;I've been running a ballistics calculation engine — a Rust physics library with several components: a Flask app wrapper with machine learning capabilities, Python bindings, and a Ruby gem. There are Android and iOS apps, too. The codebase has grown to about 15,000 lines of Rust and another 10,000 lines of Python. At this scale, bugs hide in edge cases: division by zero, floating-point precision issues in transonic drag calculations, unwrap() panics on unexpected input.&lt;/p&gt;
&lt;p&gt;What if I could run an AI code reviewer every night while I sleep? Not a cloud API with per-token billing that could run up a $500 bill scanning 50 files, but a local model running on my own hardware, grinding through the codebase and filing JIRA tickets for anything suspicious.&lt;/p&gt;
&lt;p&gt;This is the story of building that system.&lt;/p&gt;
&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/a-bespoke-llm-code-scanner_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;17 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;h3&gt;The Hardware: AMD Strix Halo on ROCm 7.0&lt;/h3&gt;
&lt;p&gt;I'm running this on a server with an AMD Radeon 8060S (Strix Halo APU) — specifically the &lt;code&gt;gfx1151&lt;/code&gt; architecture. This isn't a data center GPU. It's essentially an integrated GPU with 128GB of shared memory, configured to give 96GB to VRAM and the rest to system RAM. Not the 80GB of HBM3 you'd get on an H100, but enough to run a 32B parameter model comfortably.&lt;/p&gt;
&lt;p&gt;The key insight: for batch processing where latency doesn't matter, you don't need bleeding-edge hardware. A nightly scan can take hours. I'm not serving production traffic; I'm analyzing code files one at a time with a 30-second cooldown between requests. The APU handles this fine.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;Hardware Configuration:
&lt;span class="k"&gt;-&lt;/span&gt; AMD Radeon 8060S (gfx1151 Strix Halo APU)
&lt;span class="k"&gt;-&lt;/span&gt; 96GB shared memory
&lt;span class="k"&gt;-&lt;/span&gt; ROCm 7.0 with HSA_OVERRIDE_GFX_VERSION=11.5.1
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;HSA_OVERRIDE_GFX_VERSION&lt;/code&gt; environment variable is critical. Without it, ROCm doesn't recognize the Strix Halo architecture. This is the kind of sharp edge you hit running ML on AMD consumer hardware.&lt;/p&gt;
&lt;h3&gt;Model Selection: Qwen2.5-Coder-7B-Instruct&lt;/h3&gt;
&lt;p&gt;I tested several models:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-Coder-V2-Lite&lt;/td&gt;
&lt;td&gt;16B&lt;/td&gt;
&lt;td&gt;32k&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Requires flash_attn (ROCm issues)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder-30B&lt;/td&gt;
&lt;td&gt;30B&lt;/td&gt;
&lt;td&gt;32k&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Too slow on APU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-Coder-7B-Instruct&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;td&gt;Sweet spot&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TinyLlama-1.1B&lt;/td&gt;
&lt;td&gt;1.1B&lt;/td&gt;
&lt;td&gt;4k&lt;/td&gt;
&lt;td&gt;Poor&lt;/td&gt;
&lt;td&gt;Too small for code review&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Qwen2.5-Coder-7B-Instruct hits the sweet spot. It understands Rust and Python well enough to spot real issues, runs fast enough to process 50 files per night, and doesn't require flash attention (which has ROCm compatibility issues on consumer hardware).&lt;/p&gt;
&lt;h3&gt;vLLM Setup&lt;/h3&gt;
&lt;p&gt;vLLM provides an OpenAI-compatible API server that makes integration trivial. Here's the startup command:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="nb"&gt;source&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;~/vllm-rocm7-venv/bin/activate
&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;HSA_OVERRIDE_GFX_VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;11&lt;/span&gt;.5.1
python&lt;span class="w"&gt; &lt;/span&gt;-m&lt;span class="w"&gt; &lt;/span&gt;vllm.entrypoints.openai.api_server&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--model&lt;span class="w"&gt; &lt;/span&gt;Qwen/Qwen2.5-Coder-7B-Instruct&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--host&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.0.0.0&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--port&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8000&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--trust-remote-code&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--max-model-len&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;16384&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;--gpu-memory-utilization&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;.85
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;--max-model-len 16384&lt;/code&gt; limits context to 16k tokens. My code files rarely exceed 500 lines (longer ones get truncated before scanning), so this is plenty. The &lt;code&gt;--gpu-memory-utilization 0.85&lt;/code&gt; leaves headroom for the system.&lt;/p&gt;
&lt;p&gt;I run this in a Python venv rather than Docker because ROCm device passthrough with Docker on Strix Halo is finicky. Sometimes you have to choose pragmatism over elegance.&lt;/p&gt;
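&lt;p&gt;Because vLLM exposes an OpenAI-compatible endpoint, any stock client works for a smoke test. A minimal check with the openai Python package, pointed at the host and port from the startup command above (the api_key value is required by the client but ignored by vLLM unless you configure one):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-Coder-7B-Instruct",
    messages=[{"role": "user", "content": "Reply with OK if you are up."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
&lt;/pre&gt;&lt;/div&gt;
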
&lt;h3&gt;Docker Configuration (When It Works)&lt;/h3&gt;
&lt;p&gt;For reference, here's the Docker Compose configuration I initially built. It works on dedicated AMD GPUs but has issues on integrated APUs:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="nt"&gt;services&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;vllm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;rocm/vllm-dev:latest&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;vllm-code-scanner&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;devices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/dev/kfd:/dev/kfd&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/dev/dri:/dev/dri&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;group_add&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;video&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;render&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;security_opt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;seccomp:unconfined&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;cap_add&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;SYS_PTRACE&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;ipc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;host&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;HSA_OVERRIDE_GFX_VERSION=11.5.1&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;PYTORCH_ROCM_ARCH=gfx1151&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;HIP_VISIBLE_DEVICES=0&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/home/alex/models:/models&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/home/alex/.cache/huggingface:/root/.cache/huggingface&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;ports&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"8000:8000"&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;command&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="no"&gt;python -m vllm.entrypoints.openai.api_server&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="no"&gt;--model Qwen/Qwen2.5-Coder-7B-Instruct&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="no"&gt;--host 0.0.0.0&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="no"&gt;--port 8000&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="no"&gt;--trust-remote-code&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="no"&gt;--max-model-len 16384&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="no"&gt;--gpu-memory-utilization 0.85&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;healthcheck&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;test&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"CMD"&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"curl"&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"-f"&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"http://localhost:8000/health"&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;interval&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;30s&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;10s&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;5&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;start_period&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;120s&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;scanner&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;build&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;.&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;container_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;code-scanner-agent&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;depends_on&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;vllm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;        &lt;/span&gt;&lt;span class="nt"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;service_healthy&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;environment&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;VLLM_HOST=vllm&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;VLLM_PORT=8000&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;JIRA_EMAIL=${JIRA_EMAIL}&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;JIRA_API_KEY=${JIRA_API_KEY}&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;volumes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/home/alex/projects:/projects:ro&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;./config:/app/config:ro&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;/home/alex/projects/code-scanner-results:/app/results&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;ipc: host&lt;/code&gt; and &lt;code&gt;seccomp:unconfined&lt;/code&gt; settings are necessary for ROCm to function properly inside the container. The &lt;code&gt;depends_on&lt;/code&gt; condition &lt;code&gt;service_healthy&lt;/code&gt; ensures the scanner waits until vLLM is fully loaded before starting, which matters because model loading can take 2-3 minutes.&lt;/p&gt;
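&lt;p&gt;Compose handles this wait automatically, but the same readiness check is easy to reproduce when running the scanner outside Docker. Here's a minimal sketch that polls vLLM's &lt;code&gt;/health&lt;/code&gt; endpoint (the URL and timing values are illustrative, not pulled from the scanner):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import time

import requests

def wait_for_vllm(base_url: str = "http://localhost:8000", timeout_s: int = 300) -&amp;gt; bool:
    """Poll vLLM's /health endpoint until it responds or we give up."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() &amp;lt; deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # Server is still loading the model; keep polling
        time.sleep(10)
    return False
&lt;/pre&gt;&lt;/div&gt;
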
&lt;p&gt;The scanner Dockerfile is minimal:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;python:3.11-slim&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;/app&lt;/span&gt;

&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;update&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;apt-get&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-y&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;git&lt;span class="w"&gt; &lt;/span&gt;curl&lt;span class="w"&gt; &lt;/span&gt;ripgrep&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;rm&lt;span class="w"&gt; &lt;/span&gt;-rf&lt;span class="w"&gt; &lt;/span&gt;/var/lib/apt/lists/*

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;requirements.txt&lt;span class="w"&gt; &lt;/span&gt;.
&lt;span class="k"&gt;RUN&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;--no-cache-dir&lt;span class="w"&gt; &lt;/span&gt;-r&lt;span class="w"&gt; &lt;/span&gt;requirements.txt

&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;agent/&lt;span class="w"&gt; &lt;/span&gt;/app/agent/
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;prompts/&lt;span class="w"&gt; &lt;/span&gt;/app/prompts/
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;config/&lt;span class="w"&gt; &lt;/span&gt;/app/config/

&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"-m"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent.scanner"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Including &lt;code&gt;ripgrep&lt;/code&gt; in the container enables fast pattern matching when the scanner needs to search for related code.&lt;/p&gt;
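&lt;p&gt;For illustration, the ripgrep call from Python is just a subprocess invocation. This helper is a sketch of mine, not the scanner's actual search code:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import subprocess

def find_references(symbol: str, root: str = "/projects") -&amp;gt; list[str]:
    """Use ripgrep to find lines referencing a symbol (e.g. a function name)."""
    result = subprocess.run(
        ["rg", "--line-number", "--no-heading", symbol, root],
        capture_output=True, text=True,
    )
    # rg exits 1 when there are no matches; that's not an error here
    return result.stdout.splitlines()
&lt;/pre&gt;&lt;/div&gt;
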
&lt;h3&gt;The Scanner Architecture&lt;/h3&gt;
&lt;p&gt;The system has three main components:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Systemd       │     │    vLLM         │     │     JIRA        │
│   Timer         │────▶│    Server       │────▶│     API         │
│   (11pm daily)  │     │  (Qwen 7B)      │     │   (tickets)     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                               │
                               ▼
                    ┌─────────────────────┐
                    │   Scanner Agent     │
                    │ - File discovery    │
                    │ - Code analysis     │
                    │ - Finding validation│
                    │ - JIRA integration  │
                    └─────────────────────┘
&lt;/pre&gt;&lt;/div&gt;
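
&lt;p&gt;File discovery is plain glob matching: expand each repository's &lt;code&gt;scan_patterns&lt;/code&gt;, then drop anything matching &lt;code&gt;exclude_patterns&lt;/code&gt;. A simplified sketch, assuming the config fields shown in the next section (the function itself is illustrative):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import fnmatch
from pathlib import Path

def discover_files(repo: dict) -&amp;gt; list[Path]:
    """Expand scan_patterns under the repo path, dropping excluded matches."""
    root = Path(repo["path"])
    excludes = repo.get("exclude_patterns", [])

    def excluded(rel: str) -&amp;gt; bool:
        # "target/" excludes a directory prefix; "*.lock" is a filename glob
        return any(
            rel.startswith(ex) if ex.endswith("/") else fnmatch.fnmatch(rel, ex)
            for ex in excludes
        )

    files = []
    for pattern in repo["scan_patterns"]:
        for path in root.glob(pattern):
            if path.is_file() and not excluded(str(path.relative_to(root))):
                files.append(path)
    return sorted(files)
&lt;/pre&gt;&lt;/div&gt;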

&lt;h4&gt;Configuration&lt;/h4&gt;
&lt;p&gt;Everything is driven by a YAML configuration file:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="nt"&gt;vllm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;host&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"10.1.1.27"&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;port&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;8000&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"Qwen/Qwen2.5-Coder-7B-Instruct"&lt;/span&gt;

&lt;span class="nt"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;start_hour&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;23&lt;/span&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="c1"&gt;# 11pm&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;end_hour&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;6&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="c1"&gt;# 6am&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;50&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;cooldown_seconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;30&lt;/span&gt;

&lt;span class="nt"&gt;repositories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"ballistics-engine"&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/home/alex/projects/ballistics-engine"&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;languages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"rust"&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;scan_patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"src//*.rs"&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;exclude_patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"target/"&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"*.lock"&lt;/span&gt;

&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"ballistics-api"&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"/home/alex/projects/ballistics-api"&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;languages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"python"&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"rust"&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;scan_patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"ballistics//*.py"&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"ballistics_rust/src//*.rs"&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;exclude_patterns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"__pycache__/"&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"target/"&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;".venv/"&lt;/span&gt;

&lt;span class="nt"&gt;jira&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;true&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;project_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"MBA"&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;confidence_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;0.75&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p p-Indicator"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"ai-detected"&lt;/span&gt;&lt;span class="p p-Indicator"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;"code-scanner"&lt;/span&gt;&lt;span class="p p-Indicator"&gt;]&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;max_tickets_per_run&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;10&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;review_threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;5&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;confidence_threshold: 0.75&lt;/code&gt; is crucial. Without it, the model reports every minor style issue. At 75%, it focuses on things it's genuinely concerned about.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;review_threshold: 5&lt;/code&gt; triggers a different behavior: if the model finds more than 5 issues, it creates a single summary ticket for manual review rather than flooding JIRA with individual tickets. This is a safety valve for when the model goes haywire.&lt;/p&gt;
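&lt;p&gt;Together, the two thresholds form a small routing decision: filter by confidence first, then choose between individual tickets and a single summary. A sketch of that logic as described above (the function name is mine):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;def route_findings(findings, confidence_threshold=0.75, review_threshold=5):
    """Filter by confidence, then pick per-finding tickets or one summary."""
    confident = [f for f in findings if f.confidence &amp;gt;= confidence_threshold]
    if len(confident) &amp;gt; review_threshold:
        # A pile of findings at once usually means the model went haywire:
        # file one summary ticket for manual review instead of flooding JIRA
        return ("summary", confident)
    return ("tickets", confident)
&lt;/pre&gt;&lt;/div&gt;
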
&lt;h3&gt;Structured Outputs with Pydantic&lt;/h3&gt;
&lt;p&gt;LLMs are great at finding issues but terrible at formatting output consistently. Left to their own devices, they'll return findings as markdown, prose, JSON with missing fields, or creative combinations thereof.&lt;/p&gt;
&lt;p&gt;The solution is structured outputs. I define Pydantic models for exactly what I expect:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;Severity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;CRITICAL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"critical"&lt;/span&gt;
    &lt;span class="n"&gt;HIGH&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"high"&lt;/span&gt;
    &lt;span class="n"&gt;MEDIUM&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"medium"&lt;/span&gt;
    &lt;span class="n"&gt;LOW&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"low"&lt;/span&gt;
    &lt;span class="n"&gt;INFO&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"info"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;FindingType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;BUG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"bug"&lt;/span&gt;
    &lt;span class="n"&gt;PERFORMANCE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"performance"&lt;/span&gt;
    &lt;span class="n"&gt;SECURITY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"security"&lt;/span&gt;
    &lt;span class="n"&gt;CODE_QUALITY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"code_quality"&lt;/span&gt;
    &lt;span class="n"&gt;POTENTIAL_ISSUE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"potential_issue"&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nc"&gt;CodeFinding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Path to the file"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;line_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"Starting line number"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;line_end&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="kc"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;finding_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FindingType&lt;/span&gt;
    &lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Severity&lt;/span&gt;
    &lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;suggestion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;le&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;code_snippet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;None&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;confidence&lt;/code&gt; field is a float between 0 and 1. The model learns to be honest about uncertainty — "I think this might be a bug (0.6)" versus "This is definitely division by zero (0.95)."&lt;/p&gt;
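&lt;p&gt;Validation is strict rather than forgiving: a finding with an out-of-range confidence is rejected, not clamped. A quick illustration with hypothetical values:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;from pydantic import ValidationError

# Valid finding: passes all field constraints
ok = CodeFinding(
    file_path="src/lib.rs", line_start=42,
    finding_type="bug", severity="high",
    title="Possible division by zero",
    description="Denominator is unchecked.",
    confidence=0.9,
)

try:
    CodeFinding(
        file_path="src/lib.rs", line_start=42,
        finding_type="bug", severity="high",
        title="Bad confidence", description="x",
        confidence=1.5,  # Violates le=1.0, so validation raises
    )
except ValidationError:
    print("rejected as expected")
&lt;/pre&gt;&lt;/div&gt;
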
&lt;p&gt;In a perfect world, I'd use vLLM's Outlines integration for guided JSON generation. In practice, I found that prompting Qwen for JSON and parsing the response works reliably:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;_analyze_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CodeFinding&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"""Analyze this code for bugs and issues.&lt;/span&gt;

&lt;span class="s2"&gt;File: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;{content}&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;Return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;JSON&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Each&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;must&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;have&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;line_start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;finding_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bug"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"performance"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"security"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"code_quality"&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"critical"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"medium"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"low"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;chars&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;suggestion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;string&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="ow"&gt;or&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb nb-Type"&gt;null&lt;/span&gt;
&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;If&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;no&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;found&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;an&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;empty&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="s2"&gt;"""}&lt;/span&gt;
&lt;span class="s2"&gt;    ]&lt;/span&gt;

&lt;span class="s2"&gt;    response = self._call_llm(messages)&lt;/span&gt;

&lt;span class="s2"&gt;    # Parse JSON from response (handles markdown code blocks too)&lt;/span&gt;
&lt;span class="s2"&gt;    if response.strip().startswith('['):&lt;/span&gt;
&lt;span class="s2"&gt;        findings_data = json.loads(response)&lt;/span&gt;
&lt;span class="s2"&gt;    elif '```json' in response:&lt;/span&gt;
&lt;span class="s2"&gt;        json_str = response.split('```json')[1].split('```')[0]&lt;/span&gt;
&lt;span class="s2"&gt;        findings_data = json.loads(json_str)&lt;/span&gt;
&lt;span class="s2"&gt;    elif '[' in response:&lt;/span&gt;
&lt;span class="s2"&gt;        start = response.index('[')&lt;/span&gt;
&lt;span class="s2"&gt;        end = response.rindex(']') + 1&lt;/span&gt;
&lt;span class="s2"&gt;        findings_data = json.loads(response[start:end])&lt;/span&gt;
&lt;span class="s2"&gt;    else:&lt;/span&gt;
&lt;span class="s2"&gt;        return []&lt;/span&gt;

&lt;span class="s2"&gt;    # Validate each finding with Pydantic&lt;/span&gt;
&lt;span class="s2"&gt;    findings = []&lt;/span&gt;
&lt;span class="s2"&gt;    for item in findings_data:&lt;/span&gt;
&lt;span class="s2"&gt;        try:&lt;/span&gt;
&lt;span class="s2"&gt;            finding = CodeFinding(item)&lt;/span&gt;
&lt;span class="s2"&gt;            findings.append(finding)&lt;/span&gt;
&lt;span class="s2"&gt;        except ValidationError:&lt;/span&gt;
&lt;span class="s2"&gt;            pass  # Skip malformed findings&lt;/span&gt;

&lt;span class="s2"&gt;    return findings&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h3&gt;The System Prompt&lt;/h3&gt;
&lt;p&gt;The system prompt is where you teach the model what you care about. Here's mine:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;You are an expert code reviewer specializing in Rust and Python.
Your job is to find bugs, performance issues, security vulnerabilities,
and code quality problems.

You are analyzing code from a ballistics calculation project that includes:
- A Rust physics engine for trajectory calculations
- Python Flask API with ML models
- PyO3 bindings between Rust and Python

Key areas to focus on:
1. Numerical precision issues (floating point errors, rounding)
2. Edge cases in physics calculations (division by zero, negative values)
3. Memory safety in Rust code
4. Error handling (silent failures, unwrap panics)
5. Performance bottlenecks (unnecessary allocations, redundant calculations)
6. Security issues (input validation, injection vulnerabilities)

Be conservative with findings - only report issues you are confident about.
Avoid false positives.
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The phrase "Be conservative with findings" is doing heavy lifting. Without it, the model reports everything that looks slightly unusual. With it, it focuses on actual problems.&lt;/p&gt;
&lt;h3&gt;Timeout Handling&lt;/h3&gt;
&lt;p&gt;Large files (500+ lines) can take a while to analyze. My initial 120-second timeout caused failures on complex files. I bumped it to 600 seconds (10 minutes):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/chat/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;600&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I also truncate files to 300 lines, so for longer files the model only sees the beginning. This is a trade-off: I might miss bugs in the back half of long files, but it keeps scans predictable and prevents timeout cascades. I plan to revisit this in a future iteration.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;lines&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Truncated to 300 lines for analysis"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
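
&lt;p&gt;One option for revisiting this is overlapping chunks instead of a hard cut, so the back half of long files still gets scanned. A sketch of what that might look like (not in the scanner yet):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;def chunk_lines(content: str, size: int = 300, overlap: int = 30):
    """Yield (start_line, chunk) windows so no region of a file is skipped."""
    lines = content.split('\n')
    step = size - overlap
    for start in range(0, max(len(lines) - overlap, 1), step):
        # 1-based start line, so finding line numbers can be offset later
        yield start + 1, '\n'.join(lines[start:start + size])
&lt;/pre&gt;&lt;/div&gt;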

&lt;h3&gt;JIRA Integration&lt;/h3&gt;
&lt;p&gt;When the scanner finds issues, it creates JIRA tickets automatically. The API is straightforward:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;create_jira_tickets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;CodeFinding&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;jira_base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;jira_domain&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/rest/api/3"&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Map severity to JIRA priority&lt;/span&gt;
        &lt;span class="n"&gt;priority_map&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CRITICAL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Highest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;HIGH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"High"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MEDIUM&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LOW&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Severity&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;INFO&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Lowest"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"project"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"key"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"MBA"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="s2"&gt;"summary"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"[AI] &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="s2"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"doc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="s2"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="s2"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"paragraph"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;build_description&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
                    &lt;span class="p"&gt;]}]&lt;/span&gt;
                &lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="s2"&gt;"issuetype"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Bug"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;finding_type&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;FindingType&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BUG&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="s2"&gt;"Task"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="s2"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;priority_map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;finding&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;severity&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
                &lt;span class="s2"&gt;"labels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"ai-detected"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"code-scanner"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;jira_base_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/issue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jira_email&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;jira_api_key&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"Content-Type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"application/json"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;[AI]&lt;/code&gt; prefix in the summary makes it obvious these tickets came from the scanner. The &lt;code&gt;ai-detected&lt;/code&gt; label allows filtering.&lt;/p&gt;
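&lt;p&gt;The &lt;code&gt;build_description&lt;/code&gt; helper referenced in the payload isn't shown above. A minimal version just flattens the finding into readable ticket text; this sketch is mine, not necessarily the scanner's exact implementation:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;def build_description(finding: CodeFinding) -&amp;gt; str:
    """Render a finding as the plain-text body of a JIRA ticket."""
    parts = [
        f"File: {finding.file_path} (line {finding.line_start})",
        f"Confidence: {finding.confidence:.0%}",
        "",
        finding.description,
    ]
    if finding.suggestion:
        parts += ["", f"Suggested fix: {finding.suggestion}"]
    return "\n".join(parts)
&lt;/pre&gt;&lt;/div&gt;
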
&lt;p&gt;I add a 2-second delay between ticket creations to avoid rate limiting:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Rate limit protection&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
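
&lt;p&gt;A fixed sleep has been enough so far. A more defensive variant would also back off when JIRA actually returns a 429; a hypothetical sketch:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import time

import requests

def post_with_backoff(url, payload, auth, retries=3):
    """POST to JIRA, sleeping longer each time we get rate-limited."""
    for attempt in range(retries):
        response = requests.post(url, json=payload, auth=auth,
                                 headers={"Content-Type": "application/json"})
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt * 5)  # 5s, 10s, 20s
    return response
&lt;/pre&gt;&lt;/div&gt;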

&lt;h3&gt;Systemd Scheduling&lt;/h3&gt;
&lt;p&gt;The scanner runs nightly via systemd timer:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# /etc/systemd/system/code-scanner.timer&lt;/span&gt;
&lt;span class="k"&gt;[Unit]&lt;/span&gt;
&lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Run Code Scanner nightly at 11pm&lt;/span&gt;

&lt;span class="k"&gt;[Timer]&lt;/span&gt;
&lt;span class="na"&gt;OnCalendar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;*-*-* 23:00:00&lt;/span&gt;
&lt;span class="na"&gt;Persistent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;RandomizedDelaySec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;300&lt;/span&gt;

&lt;span class="k"&gt;[Install]&lt;/span&gt;
&lt;span class="na"&gt;WantedBy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;timers.target&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;RandomizedDelaySec=300&lt;/code&gt; adds up to 5 minutes of random delay. This prevents the scanner from always starting at exactly 11:00:00, which helps if multiple services share the same schedule.&lt;/p&gt;
&lt;p&gt;The service unit is a oneshot that runs the scanner script:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# /etc/systemd/system/code-scanner.service&lt;/span&gt;
&lt;span class="k"&gt;[Unit]&lt;/span&gt;
&lt;span class="na"&gt;Description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Code Scanner Agent&lt;/span&gt;
&lt;span class="na"&gt;After&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;docker.service&lt;/span&gt;

&lt;span class="k"&gt;[Service]&lt;/span&gt;
&lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;oneshot&lt;/span&gt;
&lt;span class="na"&gt;User&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;alex&lt;/span&gt;
&lt;span class="na"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/alex/projects/ballistics/code-scanner&lt;/span&gt;
&lt;span class="na"&gt;ExecStart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/alex/projects/ballistics/code-scanner/scripts/start_scanner.sh&lt;/span&gt;
&lt;span class="na"&gt;TimeoutStartSec&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;25200&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;TimeoutStartSec=25200&lt;/code&gt; (7 hours) gives the scanner enough time to complete even if it scans every file.&lt;/p&gt;
&lt;h3&gt;Sample Findings&lt;/h3&gt;
&lt;p&gt;Here's what the scanner actually finds. From a recent run:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"file_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/home/alex/projects/ballistics-engine/src/fast_trajectory.rs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"line_start"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;115&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"finding_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bug"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Division by zero in fast_integrate when velocity approaches zero"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The division dt / velocity_magnitude could result in division by zero if the projectile stalls (velocity_magnitude = 0). This can happen at the apex of a high-angle shot."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"suggestion"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Add a check for velocity_magnitude &amp;lt; epsilon before division, or clamp to a minimum value."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is a real issue. In ballistics calculations, a projectile fired at a near-vertical angle has almost no horizontal velocity to begin with, and its vertical velocity passes through zero at the apex, so the total velocity magnitude can approach zero. Without a guard, the division panics.&lt;/p&gt;
&lt;p&gt;Not every finding is valid. The model occasionally flags intentional design decisions as "issues." But at a 75% confidence threshold, the false positive rate is manageable — maybe 1 in 10 findings needs to be closed as "not a bug."&lt;/p&gt;
&lt;h3&gt;Trade-offs and Lessons&lt;/h3&gt;
&lt;p&gt;What works well:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Finding numerical edge cases (division by zero, overflow)&lt;/li&gt;
&lt;li&gt;Spotting &lt;code&gt;unwrap()&lt;/code&gt; calls on Options that might be None&lt;/li&gt;
&lt;li&gt;Identifying missing error handling&lt;/li&gt;
&lt;li&gt;Flagging dead code and unreachable branches&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;What doesn't work as well:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Understanding business logic (the model doesn't know physics)&lt;/li&gt;
&lt;li&gt;Spotting subtle race conditions in concurrent code&lt;/li&gt;
&lt;li&gt;False positives on intentional patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Operational lessons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Start with a low iteration limit (10-20 files) to test the pipeline&lt;/li&gt;
&lt;li&gt;Monitor the first few runs manually before trusting it&lt;/li&gt;
&lt;li&gt;Keep credentials in &lt;code&gt;.env&lt;/code&gt; files excluded from rsync&lt;/li&gt;
&lt;li&gt;The 300-line truncation is aggressive; consider chunking for long files&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Handling JSON Parse Failures&lt;/h3&gt;
&lt;p&gt;Despite asking for JSON, LLMs sometimes produce malformed output. I see two failure modes:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Truncated JSON: The model runs out of tokens mid-response, leaving an unterminated string or missing closing brackets.&lt;/li&gt;
&lt;li&gt;Wrapped JSON: The model adds explanatory text around the JSON, like "Here are the findings:" before the array.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;My parser handles both:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;parse_findings_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="sd"&gt;"""Extract JSON from potentially messy LLM output."""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Best case: raw JSON array&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'['&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;  &lt;span class="c1"&gt;# Fall through to extraction&lt;/span&gt;

    &lt;span class="c1"&gt;# Common case: JSON in markdown code block&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s1"&gt;'```json'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;json_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'```json'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'```'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="ne"&gt;IndexError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="c1"&gt;# Fallback: extract JSON array from surrounding text&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="s1"&gt;'['&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="s1"&gt;']'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'['&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;end&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rindex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;']'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;JSONDecodeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;pass&lt;/span&gt;

    &lt;span class="c1"&gt;# Give up&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Could not extract JSON from response"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;When parsing fails, I log the error and skip that file rather than crashing the entire scan. In a typical 50-file run, I see 2-3 parse failures — annoying but acceptable.&lt;/p&gt;
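&lt;p&gt;To make those failures debuggable, it's worth persisting the raw model output when extraction fails, so the prompt can be tuned later. Something along these lines (the dump path is illustrative):&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import time
from pathlib import Path

def save_failed_response(response: str, file_path: str) -&amp;gt; None:
    """Dump unparseable LLM output for later post-mortem."""
    dump_dir = Path("/app/results/parse-failures")
    dump_dir.mkdir(parents=True, exist_ok=True)
    name = f"{int(time.time())}_{Path(file_path).name}.txt"
    (dump_dir / name).write_text(f"# source: {file_path}\n\n{response}")
&lt;/pre&gt;&lt;/div&gt;
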
&lt;h3&gt;Testing the Pipeline&lt;/h3&gt;
&lt;p&gt;Before trusting the scanner with JIRA ticket creation, I ran it in "dry run" mode:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="c1"&gt;# Set max iterations low and disable JIRA&lt;/span&gt;
&lt;span class="nb"&gt;export&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;MAX_ITERATIONS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;
&lt;span class="c1"&gt;# In config: jira.enabled: false&lt;/span&gt;

python&lt;span class="w"&gt; &lt;/span&gt;run_scanner_direct.py
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This scans just 5 files and prints findings without creating tickets. I manually reviewed each finding:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;True positive: Division by zero in trajectory calculation — good catch&lt;/li&gt;
&lt;li&gt;False positive: Flagged intentional &lt;code&gt;unwrap()&lt;/code&gt; on a guaranteed-Some Option — needs better context&lt;/li&gt;
&lt;li&gt;True positive: Dead code path never executed — valid cleanup suggestion&lt;/li&gt;
&lt;li&gt;Marginal: Style suggestion about variable naming — below my quality threshold&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After tuning the confidence threshold and system prompt, the true positive rate improved to roughly 90%.&lt;/p&gt;
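&lt;p&gt;Mechanically, the threshold is just a filter over the parsed findings. A minimal sketch, assuming each finding dict carries a 0-100 &lt;code&gt;confidence&lt;/code&gt; field:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Hypothetical filter: drop findings below the configured confidence floor.
CONFIDENCE_THRESHOLD = 75  # percent; matches the "&amp;gt;= 75% confidence" log line

def filter_findings(findings):
    return [f for f in findings
            if f.get("confidence", 0) &amp;gt;= CONFIDENCE_THRESHOLD]
&lt;/pre&gt;&lt;/div&gt;
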
&lt;h3&gt;Monitoring and Observability&lt;/h3&gt;
&lt;p&gt;The scanner writes detailed logs to stdout and a JSON results file. Sample log output:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;CODE&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;SCANNER&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;AGENT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;STARTING&lt;/span&gt;
&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;Max&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;iterations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Qwen&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Qwen2&lt;/span&gt;&lt;span class="mf"&gt;.5&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Coder&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Instruct&lt;/span&gt;
&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Starting&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scan&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;ballistics&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;
&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;Found&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;scan&lt;/span&gt;
&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;Scanning&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;trajectory_sampling&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rs&lt;/span&gt;
&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;48&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="n"&gt;Truncated&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;
&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;49&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="k"&gt;Found&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;findings&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;49&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LOW&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Redundant&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;check&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;step_m&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;value&lt;/span&gt;
&lt;span class="mi"&gt;2025&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;49&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="w"&gt;     &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;LOW&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;Potential&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;off&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="k"&gt;by&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;one&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
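
&lt;p&gt;The timestamped format above is plain Python &lt;code&gt;logging&lt;/code&gt;. A minimal sketch that would reproduce it:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import logging

# Reproduces the "2025-11-26 15:48:25 - message" style shown above.
logging.basicConfig(
    format="%(asctime)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    level=logging.INFO,
)
logger = logging.getLogger("scanner")
logger.info("CODE SCANNER AGENT STARTING")
&lt;/pre&gt;&lt;/div&gt;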

&lt;p&gt;The JSON results include full finding details:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20251126_151136"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"total_findings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="nt"&gt;"repositories"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;"repository"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ballistics-engine"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;"files_scanned"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;"findings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;"duration_seconds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1842.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="nt"&gt;"iterations_used"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;35&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;I keep the last 30 result files (configurable) for historical comparison. Eventually I'll build a dashboard showing finding trends over time.&lt;/p&gt;
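&lt;p&gt;The retention logic is a few lines of &lt;code&gt;pathlib&lt;/code&gt;. A minimal sketch, assuming results land as timestamped &lt;code&gt;.json&lt;/code&gt; files in one directory:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;from pathlib import Path

KEEP = 30  # configurable retention count

def prune_results(results_dir, keep=KEEP):
    # Sort newest-first by modification time, then delete the remainder.
    files = sorted(Path(results_dir).glob("*.json"),
                   key=lambda p: p.stat().st_mtime, reverse=True)
    for old in files[keep:]:
        old.unlink()
&lt;/pre&gt;&lt;/div&gt;
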
&lt;h3&gt;What's Next&lt;/h3&gt;
&lt;p&gt;The current system is batch-oriented: run once per night, file tickets, done. Future improvements I'm considering:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Pre-commit integration: Run on changed files only, fast enough for CI&lt;/li&gt;
&lt;li&gt;Retrieval-augmented context: Include related files when analyzing (e.g., when scanning a function, include its callers)&lt;/li&gt;
&lt;li&gt;Learning from feedback: Track which tickets get closed as "not a bug" and use that to tune prompts&lt;/li&gt;
&lt;li&gt;Multi-model ensemble: Run the same code through two models and only file tickets when both agree (sketched below)&lt;/li&gt;
&lt;/ol&gt;
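&lt;p&gt;For the ensemble idea, "agreement" could be as crude as matching findings on file, line, and category. A minimal sketch of that intersection, assuming findings are dicts with those keys:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;# Hypothetical agreement check: keep only findings that both models
# report for the same file, line, and category.
def agreed_findings(model_a, model_b):
    keys_b = {(f["file"], f["line"], f["category"]) for f in model_b}
    return [f for f in model_a
            if (f["file"], f["line"], f["category"]) in keys_b]
&lt;/pre&gt;&lt;/div&gt;
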
&lt;p&gt;For now, though, the simple approach works. Every morning I check JIRA, triage the overnight findings, and fix the real bugs. The model isn't perfect, but it finds things I miss. And unlike a human reviewer, it never gets tired, never skips files, and never has a bad day.&lt;/p&gt;
&lt;h3&gt;Get the Code&lt;/h3&gt;
&lt;p&gt;I've open-sourced the complete scanner implementation on GitHub: &lt;strong&gt;&lt;a href="https://baud.rs/VzUVjf"&gt;llm-code-scanner&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;The project includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Dual scanning modes&lt;/strong&gt;: Fast nightly scans via vLLM and comprehensive weekly analyses through Ollama&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Smart deduplication&lt;/strong&gt;: SQLite database prevents redundant issue tracking across runs&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JIRA integration&lt;/strong&gt;: Automatically creates tickets for findings above your confidence threshold&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Email reports&lt;/strong&gt;: SendGrid integration for daily/weekly summaries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-language support&lt;/strong&gt;: Python, Rust, TypeScript, Kotlin, Swift, Go, and more&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To get started, clone the repo, configure your &lt;code&gt;scanner_config.yaml&lt;/code&gt; with your vLLM/Ollama server details, and run &lt;code&gt;python -m agent.scanner&lt;/code&gt;. The README has full setup instructions including environment variables for JIRA and SendGrid integration.&lt;/p&gt;</description><category>ai</category><category>amd</category><category>automation</category><category>code review</category><category>jira</category><category>llm</category><category>machine learning</category><category>python</category><category>qwen</category><category>rocm</category><category>rust</category><category>strix halo</category><category>vllm</category><guid>https://tinycomputers.io/posts/a-bespoke-llm-code-scanner.html</guid><pubDate>Wed, 26 Nov 2025 16:49:15 GMT</pubDate></item><item><title>AMD AI Max+ 395 System Review: A Comprehensive Analysis</title><link>https://tinycomputers.io/posts/amd-ai-max%2B-395-system-review-a-comprehensive-analysis.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/amd-ai-max+-395-system-review-a-comprehensive-analysis_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;29 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;h3&gt;Executive Summary&lt;/h3&gt;
&lt;p&gt;The AMD AI Max+ 395 system represents AMD's latest entry into the high-performance computing and AI acceleration market, featuring the company's cutting-edge Strix Halo architecture. This comprehensive review examines the system's performance characteristics, software compatibility, and overall viability for AI workloads and general computing tasks. While the hardware shows impressive potential with its 16-core CPU and integrated Radeon 8060S graphics, significant software ecosystem challenges, particularly with PyTorch/ROCm compatibility for the gfx1151 architecture, present substantial barriers to immediate adoption for AI development workflows.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://tinycomputers.io/images/IMG_3733.jpg" alt="AMD AI Max+ 395 Bosgame" style="float: left; width: 40%; margin: 0 20px 20px 0;"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Note: An Orange Pi 5 Max was photobombing this photograph&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;System Specifications and Architecture Overview&lt;/h3&gt;
&lt;h4&gt;CPU Specifications&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Processor&lt;/strong&gt;: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Architecture&lt;/strong&gt;: x86_64 with Zen 5 cores&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cores/Threads&lt;/strong&gt;: 16 cores / 32 threads&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Minimum Clock&lt;/strong&gt;: 599 MHz (idle scaling floor, not the marketed base clock)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Boost Clock&lt;/strong&gt;: 5,185 MHz (maximum)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cache Configuration&lt;/strong&gt;:&lt;/li&gt;
&lt;li&gt;L1d Cache: 768 KiB (16 instances, 48 KiB per core)&lt;/li&gt;
&lt;li&gt;L1i Cache: 512 KiB (16 instances, 32 KiB per core)&lt;/li&gt;
&lt;li&gt;L2 Cache: 16 MiB (16 instances, 1 MiB per core)&lt;/li&gt;
&lt;li&gt;L3 Cache: 64 MiB (2 instances, 32 MiB per CCX)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Instruction Set Extensions&lt;/strong&gt;: Full AVX-512, AVX-VNNI, BF16 support&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Memory Subsystem&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Total System Memory&lt;/strong&gt;: 32 GB DDR5&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Configuration&lt;/strong&gt;: Unified memory architecture with shared GPU/CPU access&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Bandwidth&lt;/strong&gt;: ~13.5 GB/s in multi-threaded sysbench tests (a synthetic figure, not the hardware peak)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Graphics Processing Unit&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;GPU Architecture&lt;/strong&gt;: Strix Halo (RDNA 3.5 based)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPU Designation&lt;/strong&gt;: gfx1151&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute Units&lt;/strong&gt;: 40 CUs (80 reported in ROCm, likely accounting for dual SIMD per CU)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Peak GPU Clock&lt;/strong&gt;: 2,900 MHz&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;VRAM&lt;/strong&gt;: 96 GB shared system memory (103 GB total addressable) - &lt;em&gt;Note: This allocation was intentionally configured to maximize GPU memory for large language model inference&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Bandwidth&lt;/strong&gt;: Shared with system memory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenCL Compute Units&lt;/strong&gt;: 20 (as reported by clinfo)&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Platform Details&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Operating System&lt;/strong&gt;: Ubuntu 24.04.3 LTS (Noble)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Kernel Version&lt;/strong&gt;: 6.8.0-83-generic&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Architecture&lt;/strong&gt;: x86_64&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Virtualization&lt;/strong&gt;: AMD-V enabled&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Performance Benchmarks&lt;/h3&gt;
&lt;p&gt;&lt;img alt="AMD AI Max+ 395 System Analysis Dashboard" src="https://tinycomputers.io/images/amd_system_analysis.png"&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Figure 1: Comprehensive performance analysis and compatibility overview of the AMD AI Max+ 395 system&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;CPU Performance Analysis&lt;/h4&gt;
&lt;h5&gt;Single-Threaded Performance&lt;/h5&gt;
&lt;p&gt;The sysbench CPU benchmark with prime number calculation revealed strong single-threaded performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Events per second&lt;/strong&gt;: 6,368.92&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Average latency&lt;/strong&gt;: 0.16 ms&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;95th percentile latency&lt;/strong&gt;: 0.16 ms&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This performance places the AMD AI Max+ 395 in the upper tier of modern processors for single-threaded workloads, demonstrating the effectiveness of the Zen 5 architecture's IPC improvements and high boost clocks.&lt;/p&gt;
&lt;h5&gt;Multi-Threaded Performance&lt;/h5&gt;
&lt;p&gt;Multi-threaded testing across all 32 threads showed solid, if sublinear, scaling:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Events per second&lt;/strong&gt;: 103,690.35&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scaling factor&lt;/strong&gt;: 16.3x over single-threaded (theoretical maximum 32x)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thread fairness&lt;/strong&gt;: Excellent distribution with minimal standard deviation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The scaling efficiency of approximately 51% indicates good multi-threading performance, though there's room for optimization in workloads that can fully utilize all available threads.&lt;/p&gt;
&lt;h4&gt;Memory Performance&lt;/h4&gt;
&lt;h5&gt;Memory Bandwidth Testing&lt;/h5&gt;
&lt;p&gt;Memory performance testing using sysbench revealed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Single-threaded bandwidth&lt;/strong&gt;: 9.3 GB/s&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multi-threaded bandwidth&lt;/strong&gt;: 13.5 GB/s (16 threads)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Latency characteristics&lt;/strong&gt;: Sub-millisecond access times&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The memory bandwidth results suggest the system is well-balanced for most workloads, though AI applications requiring extremely high memory bandwidth may find this a limiting factor compared to discrete GPU solutions with dedicated VRAM.&lt;/p&gt;
&lt;h4&gt;GPU Performance and Capabilities&lt;/h4&gt;
&lt;h5&gt;Hardware Specifications&lt;/h5&gt;
&lt;p&gt;The integrated Radeon 8060S GPU presents impressive specifications on paper:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Architecture&lt;/strong&gt;: RDNA 3.5 (Strix Halo)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute Units&lt;/strong&gt;: 40 CUs with 2 SIMDs each&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Memory Access&lt;/strong&gt;: Full 96 GB of shared system memory&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Clock Speed&lt;/strong&gt;: Up to 2.9 GHz&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;OpenCL Capabilities&lt;/h5&gt;
&lt;p&gt;OpenCL enumeration reveals solid compute capabilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Device Type&lt;/strong&gt;: GPU with full OpenCL 2.1 support&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Max Compute Units&lt;/strong&gt;: 20 (OpenCL reporting)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Max Work Group Size&lt;/strong&gt;: 256&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Image Support&lt;/strong&gt;: Full 2D/3D image processing capabilities&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Maximum Memory Allocation&lt;/strong&gt;: 87 GB&lt;/li&gt;
&lt;/ul&gt;
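&lt;p&gt;These figures can be cross-checked from Python with the third-party &lt;code&gt;pyopencl&lt;/code&gt; package -- not part of the review's toolchain, shown here only as an illustrative sketch:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import pyopencl as cl

# Enumerate OpenCL devices and print the fields cited above.
for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(dev.name)
        print("  compute units:       ", dev.max_compute_units)
        print("  max work-group size: ", dev.max_work_group_size)
        print("  max alloc (GB):       %.1f" % (dev.max_mem_alloc_size / 1e9))
&lt;/pre&gt;&lt;/div&gt;
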
&lt;h4&gt;Network Performance Testing&lt;/h4&gt;
&lt;p&gt;Network infrastructure testing using iperf3 demonstrated excellent localhost performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Loopback Bandwidth&lt;/strong&gt;: 122 Gbits/sec sustained&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reliability&lt;/strong&gt;: Zero TCP retransmissions across the run&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Consistency&lt;/strong&gt;: Stable performance across 10-second test duration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This indicates robust internal networking capabilities suitable for distributed computing scenarios and high-bandwidth data transfer requirements.&lt;/p&gt;
&lt;h3&gt;PyTorch/ROCm Compatibility Analysis&lt;/h3&gt;
&lt;h4&gt;Current State of ROCm Support&lt;/h4&gt;
&lt;p&gt;We installed ROCm 7.0 and related components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ROCm Version&lt;/strong&gt;: 7.0.0&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;HIP Version&lt;/strong&gt;: 7.0.51831&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PyTorch Version&lt;/strong&gt;: 2.5.1+rocm6.2&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;gfx1151 Compatibility Issues&lt;/h4&gt;
&lt;p&gt;The most significant finding of this review centers on the gfx1151 architecture compatibility with current AI software stacks. Testing revealed critical limitations:&lt;/p&gt;
&lt;h5&gt;PyTorch Compatibility Problems&lt;/h5&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;rocBLAS error: Cannot read TensileLibrary.dat: Illegal seek for GPU arch : gfx1151
List of available TensileLibrary Files:
&lt;span class="k"&gt;-&lt;/span&gt; TensileLibrary_lazy_gfx1030.dat
&lt;span class="k"&gt;-&lt;/span&gt; TensileLibrary_lazy_gfx906.dat
&lt;span class="k"&gt;-&lt;/span&gt; TensileLibrary_lazy_gfx908.dat
&lt;span class="k"&gt;-&lt;/span&gt; TensileLibrary_lazy_gfx942.dat
&lt;span class="k"&gt;-&lt;/span&gt; TensileLibrary_lazy_gfx900.dat
&lt;span class="k"&gt;-&lt;/span&gt; TensileLibrary_lazy_gfx90a.dat
&lt;span class="k"&gt;-&lt;/span&gt; TensileLibrary_lazy_gfx1100.dat
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This error indicates that PyTorch's ROCm backend lacks pre-compiled optimized kernels for the gfx1151 architecture. The absence of gfx1151 in the TensileLibrary files means:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;No Optimized BLAS Operations&lt;/strong&gt;: Matrix multiplication, convolutions, and other fundamental AI operations cannot leverage GPU acceleration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Training Workflows Broken&lt;/strong&gt;: Most deep learning training pipelines will fail or fall back to CPU execution&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inference Limitations&lt;/strong&gt;: Even basic neural network inference is compromised&lt;/li&gt;
&lt;/ol&gt;
&lt;h5&gt;Root Cause Analysis&lt;/h5&gt;
&lt;p&gt;The gfx1151 architecture represents a newer GPU design that hasn't been fully integrated into the ROCm software stack. While the hardware is detected and basic OpenCL operations function, the optimized compute libraries essential for AI workloads are missing.&lt;/p&gt;
&lt;h5&gt;Workaround Attempts&lt;/h5&gt;
&lt;p&gt;Testing various workarounds yielded limited success:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;HSA_OVERRIDE_GFX_VERSION=11.0.0&lt;/strong&gt;: Failed to resolve compatibility issues&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU Fallback&lt;/strong&gt;: PyTorch operates normally on CPU, but defeats the purpose of GPU acceleration&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Basic GPU Operations&lt;/strong&gt;: Simple tensor allocation succeeds, but compute operations fail (see the probe sketch below)&lt;/li&gt;
&lt;/ul&gt;
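&lt;p&gt;That last failure mode is easy to reproduce. A minimal probe sketch against a ROCm build of PyTorch (which exposes HIP devices through the &lt;code&gt;torch.cuda&lt;/code&gt; API); depending on the rocBLAS build, the failure may surface as a Python exception or as the process-level error shown earlier:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import torch

# On ROCm builds, HIP devices appear through the torch.cuda API.
if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # plain allocation: works
    try:
        y = x @ x                  # matmul dispatches through rocBLAS/Tensile
        torch.cuda.synchronize()   # force the kernel to actually execute
        print("GPU matmul OK:", y.shape)
    except RuntimeError as exc:    # on gfx1151: no precompiled kernels
        print("GPU compute failed:", exc)
else:
    print("No ROCm/HIP device visible to PyTorch")
&lt;/pre&gt;&lt;/div&gt;
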
&lt;h4&gt;Software Ecosystem Gaps&lt;/h4&gt;
&lt;p&gt;Beyond PyTorch, the gfx1151 compatibility issues extend to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;TensorFlow&lt;/strong&gt;: Likely similar rocBLAS dependency issues&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;JAX&lt;/strong&gt;: ROCm backend compatibility uncertain&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scientific Computing&lt;/strong&gt;: NumPy/SciPy GPU acceleration unavailable&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Machine Learning Frameworks&lt;/strong&gt;: Most frameworks dependent on rocBLAS will encounter issues&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;AMD GPU Software Support Ecosystem Analysis&lt;/h3&gt;
&lt;h4&gt;Current State Assessment&lt;/h4&gt;
&lt;p&gt;AMD's GPU software ecosystem has made significant strides but remains fragmented compared to NVIDIA's CUDA platform:&lt;/p&gt;
&lt;h5&gt;Strengths&lt;/h5&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Open Source Foundation&lt;/strong&gt;: ROCm's open-source nature enables community contributions&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Standard API Support&lt;/strong&gt;: OpenCL 2.1 and HIP provide industry-standard interfaces&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Linux Integration&lt;/strong&gt;: Strong kernel-level support through AMDGPU drivers&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Professional Tools&lt;/strong&gt;: rocm-smi and related utilities provide comprehensive monitoring&lt;/li&gt;
&lt;/ol&gt;
&lt;h5&gt;Weaknesses&lt;/h5&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Fragmented Architecture Support&lt;/strong&gt;: New architectures like gfx1151 lag behind in software support&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limited Documentation&lt;/strong&gt;: Less comprehensive than CUDA documentation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Smaller Developer Community&lt;/strong&gt;: Fewer third-party tools and optimizations&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compatibility Matrix Complexity&lt;/strong&gt;: Different software versions support different GPU architectures&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Long-term Viability Concerns&lt;/h4&gt;
&lt;p&gt;The gfx1151 compatibility issues highlight broader ecosystem challenges:&lt;/p&gt;
&lt;h5&gt;Release Coordination Problems&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Hardware releases outpace software ecosystem updates&lt;/li&gt;
&lt;li&gt;Critical libraries (rocBLAS, Tensile) require architecture-specific optimization&lt;/li&gt;
&lt;li&gt;Coordination between AMD hardware and software teams appears insufficient&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;Market Adoption Barriers&lt;/h5&gt;
&lt;ul&gt;
&lt;li&gt;Developers hesitant to adopt platform with uncertain software support&lt;/li&gt;
&lt;li&gt;Enterprise customers require guaranteed compatibility&lt;/li&gt;
&lt;li&gt;Academic researchers need stable, well-documented platforms&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Recommendations for AMD&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Accelerated Software Development&lt;/strong&gt;: Prioritize gfx1151 support in rocBLAS and related libraries&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Pre-release Testing&lt;/strong&gt;: Ensure software ecosystem readiness before hardware launches&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Better Documentation&lt;/strong&gt;: Comprehensive compatibility matrices and migration guides&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Community Engagement&lt;/strong&gt;: More responsive developer relations and support channels&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Network Infrastructure and Connectivity&lt;/h3&gt;
&lt;p&gt;The system demonstrates excellent network performance characteristics suitable for modern computing workloads:&lt;/p&gt;
&lt;h4&gt;Internal Performance&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Memory-to-Network Efficiency&lt;/strong&gt;: 122 Gbps loopback performance indicates minimal bottlenecks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;System Integration&lt;/strong&gt;: Unified memory architecture benefits network-intensive applications&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Scalability&lt;/strong&gt;: Architecture suitable for distributed computing scenarios&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;External Connectivity Assessment&lt;/h4&gt;
&lt;p&gt;While specific external network testing wasn't performed, the system's infrastructure suggests:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Support for high-speed Ethernet (2.5GbE+)&lt;/li&gt;
&lt;li&gt;Low-latency interconnects suitable for cluster computing&lt;/li&gt;
&lt;li&gt;Adequate bandwidth for data center deployment scenarios&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Power Efficiency and Thermal Characteristics&lt;/h3&gt;
&lt;p&gt;Limited thermal data was available during testing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Idle Temperature&lt;/strong&gt;: 29°C (GPU sensor)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Idle Power&lt;/strong&gt;: 8.059W (GPU subsystem)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Thermal Management&lt;/strong&gt;: Appears well-controlled under light loads&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The unified architecture's power efficiency represents a significant advantage over discrete GPU solutions, particularly for mobile and edge computing applications.&lt;/p&gt;
&lt;h3&gt;Competitive Analysis&lt;/h3&gt;
&lt;h4&gt;Comparison with Intel Arc&lt;/h4&gt;
&lt;p&gt;Intel's Arc GPUs face similar software ecosystem challenges, though Intel has made more aggressive investments in AI software stack development. The Arc series benefits from Intel's deeper software engineering resources but still lags behind NVIDIA in AI framework support.&lt;/p&gt;
&lt;h4&gt;Comparison with NVIDIA&lt;/h4&gt;
&lt;p&gt;NVIDIA maintains a substantial advantage in:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Software Maturity&lt;/strong&gt;: CUDA ecosystem is mature and well-supported&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Framework Integration&lt;/strong&gt;: Native support across all major frameworks&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Developer Tools&lt;/strong&gt;: Comprehensive profiling and debugging tools&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Documentation&lt;/strong&gt;: Extensive, well-maintained documentation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;AMD's advantages include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Open Source Approach&lt;/strong&gt;: More flexible licensing and community development&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unified Memory&lt;/strong&gt;: Simplified programming model for certain applications&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Cost&lt;/strong&gt;: Potentially more cost-effective solutions&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Market Positioning&lt;/h4&gt;
&lt;p&gt;The AMD AI Max+ 395 occupies a unique position as a high-performance integrated solution, but software limitations significantly impact its competitiveness in AI-focused markets.&lt;/p&gt;
&lt;h3&gt;Use Case Suitability Analysis&lt;/h3&gt;
&lt;h4&gt;Recommended Use Cases&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;General Computing&lt;/strong&gt;: Excellent performance for traditional computational workloads&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Development Platforms&lt;/strong&gt;: Strong for general software development (non-AI)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Edge Computing&lt;/strong&gt;: Unified architecture benefits power-constrained deployments&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Future AI Workloads&lt;/strong&gt;: When software ecosystem matures&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Not Recommended For&lt;/h4&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Current AI Development&lt;/strong&gt;: gfx1151 compatibility issues are blocking&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Production AI Inference&lt;/strong&gt;: Unreliable software support&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Machine Learning Research&lt;/strong&gt;: Limited framework compatibility&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Time-Critical Projects&lt;/strong&gt;: Uncertain timeline for software fixes&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;Large Language Model Performance and Stability&lt;/h3&gt;
&lt;h4&gt;Ollama LLM Inference Testing&lt;/h4&gt;
&lt;p&gt;Testing with Ollama reveals a mixed picture for LLM inference on the AMD AI Max+ 395 system. The platform successfully runs various models through CPU-based inference, though GPU acceleration faces significant challenges.&lt;/p&gt;
&lt;h5&gt;Performance Metrics&lt;/h5&gt;
&lt;p&gt;Testing with various model sizes revealed the following performance characteristics:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;GPT-OSS 20B Model Performance:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prompt evaluation rate: 61.29 tokens/second&lt;/li&gt;
&lt;li&gt;Text generation rate: 8.99 tokens/second&lt;/li&gt;
&lt;li&gt;Total inference time: ~13 seconds for 117 tokens&lt;/li&gt;
&lt;li&gt;Memory utilization: ~54 GB VRAM usage&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Llama 4 (67B) Model:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Successfully loads and runs&lt;/li&gt;
&lt;li&gt;Generation coherent and accurate&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The system demonstrates adequate performance for smaller models (20B parameters and below) when running through Ollama, though throughput significantly lags NVIDIA GPUs with proper CUDA acceleration. The large unified memory configuration (96 GB allocated to the GPU, deliberately maximized for this testing to evaluate large language model workloads) allows loading of substantial models that would typically require multiple GPUs or extensive system RAM on other platforms.&lt;/p&gt;
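&lt;p&gt;The token rates above come straight from Ollama's response metadata. A minimal sketch that queries the local API (default port 11434; the model tag is an assumption) and derives the same figures:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import json
import urllib.request

# Ollama's /api/generate response reports eval counts and durations (ns).
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps({"model": "gpt-oss:20b",
                     "prompt": "Explain unified memory briefly.",
                     "stream": False}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    stats = json.load(resp)

print("prompt eval: %.2f tok/s" %
      (stats["prompt_eval_count"] / (stats["prompt_eval_duration"] / 1e9)))
print("generation:  %.2f tok/s" %
      (stats["eval_count"] / (stats["eval_duration"] / 1e9)))
&lt;/pre&gt;&lt;/div&gt;
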
&lt;h4&gt;Critical Stability Issues with Large Models&lt;/h4&gt;
&lt;h5&gt;Driver Crashes with Advanced AI Workloads&lt;/h5&gt;
&lt;p&gt;Testing revealed severe stability issues when attempting to run larger models or when using AI-accelerated development tools:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Affected Scenarios:&lt;/strong&gt;&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Large Model Loading&lt;/strong&gt;: GPT-OSS 120B model causes immediate amdgpu driver crashes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;AI Development Tools&lt;/strong&gt;: Continue.dev with certain LLMs triggers GPU reset&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;OpenAI Codex Integration&lt;/strong&gt;: Consistent driver failures with models exceeding 70B parameters&lt;/li&gt;
&lt;/ol&gt;
&lt;h5&gt;GPU Reset Events&lt;/h5&gt;
&lt;p&gt;System logs reveal frequent GPU reset events during AI workload attempts:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt; 1030.960155&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;amdgpu&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0000&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="nl"&gt;c5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.0&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;amdgpu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;GPU&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;reset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;begin&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt; 1033.972213&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;amdgpu&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0000&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="nl"&gt;c5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.0&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;amdgpu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;MODE2&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;reset&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt; 1034.002615&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;amdgpu&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0000&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="nl"&gt;c5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.0&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;amdgpu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;GPU&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;reset&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;succeeded&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;trying&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;resume&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt; 1034.003141&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt;drm&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;VRAM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;lost&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;due&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;to&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;GPU&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;reset&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="n"&gt; 1034.037824&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;amdgpu&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0000&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="nl"&gt;c5&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mf"&gt;00.0&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;amdgpu&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;GPU&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;reset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;succeeded&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;These crashes result in:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Complete loss of VRAM contents&lt;/li&gt;
&lt;li&gt;Application termination&lt;/li&gt;
&lt;li&gt;Potential system instability requiring reboot&lt;/li&gt;
&lt;li&gt;Interrupted workflows and data loss&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Root Cause Analysis&lt;/h4&gt;
&lt;p&gt;The driver instability appears to stem from the same underlying issue as the PyTorch/ROCm incompatibility: &lt;strong&gt;immature driver support for the gfx1151 architecture&lt;/strong&gt;. The drivers struggle with:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Memory Management&lt;/strong&gt;: Large model allocations exceed driver's tested parameters&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Compute Dispatch&lt;/strong&gt;: Complex kernel launches trigger unhandled edge cases&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Power State Transitions&lt;/strong&gt;: Rapid load changes cause driver state machine failures&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Synchronization Issues&lt;/strong&gt;: Multi-threaded inference workloads expose race conditions&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Implications for AI Development&lt;/h4&gt;
&lt;p&gt;The combination of LLM testing results and driver stability issues reinforces that the AMD AI Max+ 395 system, despite impressive hardware specifications, remains unsuitable for production AI workloads. The platform shows promise for future AI applications once driver maturity improves, but current limitations include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Unreliable Large Model Support&lt;/strong&gt;: Models over 70B parameters risk system crashes&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Limited Tool Compatibility&lt;/strong&gt;: Popular AI development tools cause instability&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workflow Interruptions&lt;/strong&gt;: Frequent crashes disrupt development productivity&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data Loss Risk&lt;/strong&gt;: VRAM resets can lose unsaved work or model states&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Future Outlook and Development Roadmap&lt;/h3&gt;
&lt;h4&gt;Short-term Expectations (3-6 months)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;ROCm updates likely to address gfx1151 compatibility&lt;/li&gt;
&lt;li&gt;PyTorch/TensorFlow support should improve&lt;/li&gt;
&lt;li&gt;Community-driven workarounds may emerge&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Medium-term Prospects (6-18 months)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;Full AI framework support expected&lt;/li&gt;
&lt;li&gt;Optimization improvements for Strix Halo architecture&lt;/li&gt;
&lt;li&gt;Better documentation and developer resources&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Long-term Considerations (18+ months)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;AMD's commitment to open-source ecosystem should pay dividends&lt;/li&gt;
&lt;li&gt;Potential for superior price/performance ratios&lt;/li&gt;
&lt;li&gt;Growing developer community around ROCm platform&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Conclusions and Recommendations&lt;/h3&gt;
&lt;p&gt;The AMD AI Max+ 395 system represents impressive hardware engineering with its unified memory architecture, strong CPU performance, and substantial GPU compute capabilities. However, critical software ecosystem gaps, particularly the gfx1151 compatibility issues with PyTorch and ROCm, severely limit its immediate utility for AI and machine learning workloads.&lt;/p&gt;
&lt;h4&gt;Key Findings Summary&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Hardware Strengths:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Excellent CPU performance with 16 Zen 5 cores&lt;/li&gt;
&lt;li&gt;Innovative unified memory architecture with 96 GB addressable&lt;/li&gt;
&lt;li&gt;Strong integrated GPU with 40 compute units&lt;/li&gt;
&lt;li&gt;Efficient power management and thermal characteristics&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Software Limitations:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Critical gfx1151 architecture support gaps in ROCm ecosystem&lt;/li&gt;
&lt;li&gt;PyTorch integration completely broken for GPU acceleration&lt;/li&gt;
&lt;li&gt;Limited AI framework compatibility across the board&lt;/li&gt;
&lt;li&gt;Insufficient documentation for troubleshooting&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Market Position:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Competitive hardware specifications&lt;/li&gt;
&lt;li&gt;Unique integrated architecture advantages&lt;/li&gt;
&lt;li&gt;Significant software ecosystem disadvantages versus NVIDIA&lt;/li&gt;
&lt;li&gt;Uncertain timeline for compatibility improvements&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Purchasing Recommendations&lt;/h4&gt;
&lt;p&gt;&lt;strong&gt;Buy If:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Primary use case is general computing or traditional HPC workloads&lt;/li&gt;
&lt;li&gt;Willing to wait 6-12 months for AI software ecosystem maturity&lt;/li&gt;
&lt;li&gt;Value open-source software development approach&lt;/li&gt;
&lt;li&gt;Need power-efficient integrated solution&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Avoid If:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Immediate AI/ML development requirements&lt;/li&gt;
&lt;li&gt;Production AI inference deployments planned&lt;/li&gt;
&lt;li&gt;Time-critical project timelines&lt;/li&gt;
&lt;li&gt;Require guaranteed software support&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Final Verdict&lt;/h4&gt;
&lt;p&gt;The AMD AI Max+ 395 system shows tremendous promise as a unified computing platform, but its immature software ecosystem makes it unsuitable for current AI workloads. Organizations should monitor ROCm development progress closely, as this hardware could become highly competitive once software support matures. For general computing applications, the system offers excellent performance and value, representing AMD's continued progress in processor design and integration.&lt;/p&gt;
&lt;p&gt;The AMD AI Max+ 395 represents a glimpse into the future of integrated computing platforms, but early adopters should be prepared for software ecosystem growing pains. As AMD continues investing in ROCm development and the open-source community contributes solutions, this platform has the potential to become a compelling alternative to NVIDIA's ecosystem dominance.&lt;/p&gt;</description><category>ai hardware</category><category>amd</category><category>benchmarks</category><category>gfx1151</category><category>gpu computing</category><category>machine learning</category><category>pytorch</category><category>rocm</category><category>ryzen ai</category><category>strix halo</category><guid>https://tinycomputers.io/posts/amd-ai-max%2B-395-system-review-a-comprehensive-analysis.html</guid><pubDate>Sun, 21 Sep 2025 20:25:28 GMT</pubDate></item></channel></rss>