Getting PyTorch Working with AMD Radeon Pro W7900 (MAX+ 395): A Comprehensive Guide
Introduction
The AMD Radeon Pro W7900 represents a significant leap forward in professional GPU computing. With 96GB of unified memory and 20 compute units, this workstation-class GPU brings serious computational power to tasks like machine learning, scientific computing, and data analysis. However, getting deep learning frameworks like PyTorch to work with AMD GPUs has historically been more challenging than with NVIDIA's CUDA ecosystem.
Here's a complete walkthrough of setting up PyTorch with ROCm support on the AMD MAX+ 395, including installation, verification, and real-world testing. By the end, you'll have a fully functional PyTorch environment capable of leveraging your AMD GPU's computational power.
Understanding ROCm and PyTorch
What is ROCm?
ROCm (Radeon Open Compute) is AMD's open-source software platform for GPU computing. It serves as AMD's answer to NVIDIA's CUDA, providing:
- Low-level GPU programming interfaces
- Optimized libraries for linear algebra, FFT, and other operations
- Deep learning framework support
- Compatibility with CUDA-based code through HIP (Heterogeneous-compute Interface for Portability)
PyTorch and ROCm Integration
PyTorch has officially supported ROCm since version 1.8, and support has matured significantly over subsequent releases. The ROCm version of PyTorch uses the same API as the CUDA version, making it straightforward to port existing PyTorch code to AMD GPUs. In fact, most PyTorch code written for CUDA will work without modification on ROCm, as the framework abstracts away the underlying GPU platform.
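As a minimal illustration (a sketch of our own, not part of the original setup), the following standard "CUDA" PyTorch code runs unchanged on a ROCm build:

```python
import torch

# Ordinary "CUDA" PyTorch code: on a ROCm build, these calls are dispatched
# through HIP to the AMD GPU with no source changes.
x = torch.randn(4096, 4096, device="cuda")
y = x @ x.T
print(y.device)  # prints cuda:0, even though the backend is ROCm/HIP
```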
System Specifications
Testing was performed on a system with the following specifications:
- GPU: AMD Radeon Pro W7900 (MAX+ 395)
- GPU Memory: 96 GB
- Compute Units: 20
- CUDA Capability: 11.5 (on ROCm this reflects the GPU's gfx11.5 architecture generation, not an actual NVIDIA compute capability)
- Operating System: Linux
- Python: 3.12.11
- PyTorch Version: 2.8.0+rocm7.0.0
- ROCm Version: 7.0.0
Installation and Setup
This section provides detailed, step-by-step instructions for bootstrapping a complete ROCm 7.0 + PyTorch 2.8 environment on Ubuntu 24.04.3 LTS. These instructions are based on successful installations on the AMD Ryzen AI Max+395 platform.
Prerequisites
- Ubuntu 24.04.3 LTS (Server or Desktop)
- Administrator/sudo access
- Internet connection for downloading packages
Step 1: Update Linux Kernel
ROCm 7.0 works best with Linux kernel 6.14 or later. Update your kernel:
```bash
sudo apt-get install linux-generic-hwe-24.04
```
Verify the installation:
```bash
cat /proc/version
```
You should see output similar to:
```
Linux version 6.14.0-33-generic (buildd@lcy02-amd64-026)...
```
Reboot to load the new kernel:
```bash
sudo reboot
```
Step 2: Install AMDGPU Driver
First, set up the AMD repository:
```bash
# Create keyring directory if it doesn't exist
sudo mkdir --parents --mode=0755 /etc/apt/keyrings

# Download and install AMD GPG key
wget https://repo.radeon.com/rocm/rocm.gpg.key -O - | \
    gpg --dearmor | sudo tee /etc/apt/keyrings/rocm.gpg > /dev/null

# Add AMDGPU repository
sudo tee /etc/apt/sources.list.d/amdgpu.list << EOF
deb [arch=amd64,i386 signed-by=/etc/apt/keyrings/rocm.gpg] https://repo.radeon.com/amdgpu/latest/ubuntu noble main
EOF
```
Install the AMDGPU DKMS driver:
```bash
sudo apt update
sudo apt install amdgpu-dkms
sudo reboot
```
Verify the driver installation:
```bash
sudo dkms status
```
You should see output like:
```
amdgpu/6.14.14-2212064.24.04, 6.14.0-33-generic, x86_64: installed
```
Step 3: Install ROCm 7.0
Install prerequisites:
```bash
sudo apt install python3-setuptools python3-wheel
sudo apt update
```
Download and install the AMD GPU installer:
```bash
wget https://repo.radeon.com/amdgpu-install/7.0/ubuntu/noble/amdgpu-install_7.0.70000-1_all.deb
sudo apt install ./amdgpu-install_7.0.70000-1_all.deb
```
Install ROCm with the compute use case (choose Y when prompted to overwrite amdgpu.list):
```bash
amdgpu-install -y --usecase=rocm
sudo reboot
```
Add your user to the required groups:
```bash
sudo usermod -a -G render,video $LOGNAME
sudo reboot
```
Verify ROCm installation:
```bash
rocminfo
```
You should see your GPU listed as an agent with detailed properties.
Step 4: Configure ROCm Libraries
Configure the system to find ROCm shared libraries:
```bash
# Add ROCm library paths
sudo tee --append /etc/ld.so.conf.d/rocm.conf <<EOF
/opt/rocm/lib
/opt/rocm/lib64
EOF
sudo ldconfig

# Set library path environment variable (add to ~/.bashrc for persistence)
export LD_LIBRARY_PATH=/opt/rocm-7.0.0/lib:$LD_LIBRARY_PATH
```
Install and verify OpenCL runtime:
```bash
sudo apt install rocm-opencl-runtime clinfo
```
Running the `clinfo` command should then display information about your AMD GPU.
Step 5: Install PyTorch with ROCm Support
Create a conda environment and install PyTorch:
```bash
# Create conda environment
conda create -n pt2.8-rocm7 python=3.12
conda activate pt2.8-rocm7

# Install PyTorch 2.8.0 with ROCm 7.0 from AMD's repository
pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0/pytorch_triton_rocm-3.2.0%2Brocm7.0.0.4d510c3a44-cp312-cp312-linux_x86_64.whl
pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0/torch-2.8.0%2Brocm7.0.0-cp312-cp312-linux_x86_64.whl
pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0/torchvision-0.23.0%2Brocm7.0.0-cp312-cp312-linux_x86_64.whl
pip install https://repo.radeon.com/rocm/manylinux/rocm-rel-7.0/torchaudio-2.8.0%2Brocm7.0.0-cp312-cp312-linux_x86_64.whl

# Install GCC 12.1 (required for some operations)
conda install -c conda-forge gcc=12.1.0
```
Important Notes:
- The URLs above are for Python 3.12 (cp312). Adjust for your Python version if different.
- These wheels are built specifically for ROCm 7.0 and may not work with other ROCm versions.
- The `LD_LIBRARY_PATH` must be set correctly, or PyTorch won't find the ROCm libraries.
Verifying Installation
After installation, perform a quick verification:
```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"Device count: {torch.cuda.device_count()}")
if torch.cuda.is_available():
    print(f"Device name: {torch.cuda.get_device_name(0)}")
```
Note that despite using ROCm, PyTorch still refers to the GPU API as "CUDA" for compatibility reasons. This is intentional and allows CUDA-based code to run on AMD GPUs without modification.
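If you ever need to distinguish the two backends programmatically, one option (a small sketch; `torch.version.hip` is populated on ROCm wheels and is `None` on CUDA-only builds) is:

```python
import torch

# torch.version.hip carries the HIP/ROCm version string on ROCm wheels;
# on CUDA wheels it is None and torch.version.cuda is set instead.
if torch.version.hip is not None:
    print(f"ROCm/HIP build: {torch.version.hip}")
else:
    print(f"CUDA build: {torch.version.cuda}")
```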
Comprehensive GPU Testing
To thoroughly validate that PyTorch is working correctly with the MAX+ 395, we developed a comprehensive test suite that exercises various aspects of GPU computing.
Test Suite Overview
Our test suite includes five major components:
- Installation Verification: Confirms PyTorch version and GPU detection
- ROCm Availability Check: Validates GPU properties and capabilities
- Tensor Operations: Tests basic tensor creation and mathematical operations
- Neural Network Operations: Validates deep learning functionality
- Memory Management: Tests GPU memory allocation and deallocation
Test Script
Here's the complete test script we developed:
```python
#!/usr/bin/env python3
"""
ROCm PyTorch GPU Test POC
Tests if ROCm PyTorch can successfully detect and use AMD GPUs
"""

import torch
import sys


def print_section(title):
    """Print a formatted section header"""
    print(f"\n{'='*60}")
    print(f" {title}")
    print(f"{'='*60}")


def test_pytorch_installation():
    """Test basic PyTorch installation"""
    print_section("PyTorch Installation Info")
    print(f"PyTorch Version: {torch.__version__}")
    print(f"Python Version: {sys.version}")


def test_rocm_availability():
    """Test ROCm/CUDA availability"""
    print_section("ROCm/CUDA Availability")
    cuda_available = torch.cuda.is_available()
    print(f"CUDA Available: {cuda_available}")
    if cuda_available:
        print(f"CUDA Device Count: {torch.cuda.device_count()}")
        print(f"Current Device: {torch.cuda.current_device()}")
        print(f"Device Name: {torch.cuda.get_device_name(0)}")
        props = torch.cuda.get_device_properties(0)
        print(f"\nDevice Properties:")
        print(f" - Total Memory: {props.total_memory / 1024**3:.2f} GB")
        print(f" - Multi Processor Count: {props.multi_processor_count}")
        print(f" - CUDA Capability: {props.major}.{props.minor}")
    else:
        print("No CUDA/ROCm devices detected!")
        return False
    return True


def test_tensor_operations():
    """Test basic tensor operations on GPU"""
    print_section("Tensor Operations Test")
    try:
        cpu_tensor = torch.randn(1000, 1000)
        print(f"CPU Tensor created: {cpu_tensor.shape}")
        print(f"CPU Tensor device: {cpu_tensor.device}")

        gpu_tensor = cpu_tensor.cuda()
        print(f"\nGPU Tensor created: {gpu_tensor.shape}")
        print(f"GPU Tensor device: {gpu_tensor.device}")

        print("\nPerforming matrix multiplication on GPU...")
        result = torch.matmul(gpu_tensor, gpu_tensor)
        print(f"Result shape: {result.shape}")
        print(f"Result device: {result.device}")

        cpu_result = result.cpu()
        print(f"Moved result back to CPU: {cpu_result.device}")

        print("\n✓ Tensor operations successful!")
        return True
    except Exception as e:
        print(f"\n✗ Tensor operations failed: {e}")
        return False


def test_simple_neural_network():
    """Test a simple neural network operation on GPU"""
    print_section("Neural Network Test")
    try:
        model = torch.nn.Sequential(
            torch.nn.Linear(100, 50),
            torch.nn.ReLU(),
            torch.nn.Linear(50, 10)
        )
        print("Model created on CPU")
        print(f"Model device: {next(model.parameters()).device}")

        model = model.cuda()
        print(f"Model moved to GPU: {next(model.parameters()).device}")

        input_data = torch.randn(32, 100).cuda()
        print(f"\nInput data shape: {input_data.shape}")
        print(f"Input data device: {input_data.device}")

        print("Performing forward pass...")
        output = model(input_data)
        print(f"Output shape: {output.shape}")
        print(f"Output device: {output.device}")

        print("\n✓ Neural network test successful!")
        return True
    except Exception as e:
        print(f"\n✗ Neural network test failed: {e}")
        return False


def test_memory_management():
    """Test GPU memory management"""
    print_section("GPU Memory Management Test")
    try:
        if torch.cuda.is_available():
            print(f"Allocated Memory: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB")
            print(f"Cached Memory: {torch.cuda.memory_reserved(0) / 1024**2:.2f} MB")

            tensors = []
            for i in range(5):
                tensors.append(torch.randn(1000, 1000).cuda())

            print(f"\nAfter allocating 5 tensors:")
            print(f"Allocated Memory: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB")
            print(f"Cached Memory: {torch.cuda.memory_reserved(0) / 1024**2:.2f} MB")

            del tensors
            torch.cuda.empty_cache()

            print(f"\nAfter clearing cache:")
            print(f"Allocated Memory: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB")
            print(f"Cached Memory: {torch.cuda.memory_reserved(0) / 1024**2:.2f} MB")

            print("\n✓ Memory management test successful!")
            return True
        else:
            print("No GPU available for memory test")
            return False
    except Exception as e:
        print(f"\n✗ Memory management test failed: {e}")
        return False


def main():
    """Run all tests"""
    print("\n" + "="*60)
    print(" ROCm PyTorch GPU Test POC")
    print("="*60)

    test_pytorch_installation()

    if not test_rocm_availability():
        print("\n" + "="*60)
        print(" FAILED: No ROCm/CUDA devices available")
        print("="*60)
        sys.exit(1)

    results = []
    results.append(("Tensor Operations", test_tensor_operations()))
    results.append(("Neural Network", test_simple_neural_network()))
    results.append(("Memory Management", test_memory_management()))

    print_section("Test Summary")
    all_passed = True
    for test_name, passed in results:
        status = "✓ PASSED" if passed else "✗ FAILED"
        print(f"{test_name}: {status}")
        if not passed:
            all_passed = False

    print("\n" + "="*60)
    if all_passed:
        print(" SUCCESS: All tests passed! ROCm GPU is working.")
    else:
        print(" PARTIAL SUCCESS: Some tests failed.")
    print("="*60 + "\n")

    return 0 if all_passed else 1


if __name__ == "__main__":
    sys.exit(main())
```
Test Results and Analysis
Running our comprehensive test suite on the MAX+ 395 yielded excellent results across all categories.
GPU Detection and Properties
The first test confirmed that PyTorch successfully detected the AMD GPU:
```
CUDA Available: True
CUDA Device Count: 1
Current Device: 0
Device Name: AMD Radeon Graphics

Device Properties:
 - Total Memory: 96.00 GB
 - Multi Processor Count: 20
 - CUDA Capability: 11.5
```
The 96GB of memory is particularly impressive, far exceeding what's available on most consumer or even professional NVIDIA GPUs. This massive memory capacity opens up possibilities for:
- Training larger models without splitting across multiple GPUs
- Processing high-resolution images or long sequences
- Handling larger batch sizes for improved training efficiency
- Running multiple models simultaneously
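To get a feel for that headroom, here is a quick, hypothetical allocation sketch (not part of the test run; whether it succeeds depends on how much of the unified memory pool the OS reserves for itself):

```python
import torch

# 100,000 x 100,000 fp32 elements ≈ 40 GB, well beyond typical discrete GPUs.
# Scale N down if your system reserves part of the unified memory for the OS.
N = 100_000
x = torch.zeros(N, N, device="cuda")
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GB")
del x
torch.cuda.empty_cache()
```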
Tensor Operations Performance
Basic tensor operations executed flawlessly:
```
CPU Tensor created: torch.Size([1000, 1000])
CPU Tensor device: cpu

GPU Tensor created: torch.Size([1000, 1000])
GPU Tensor device: cuda:0

Performing matrix multiplication on GPU...
Result shape: torch.Size([1000, 1000])
Result device: cuda:0
Moved result back to CPU: cpu

✓ Tensor operations successful!
```
The seamless movement of tensors between CPU and GPU memory, along with successful matrix multiplication, confirms that the fundamental PyTorch operations work correctly on ROCm.
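One caveat before extending these checks into benchmarks: GPU kernels launch asynchronously on ROCm just as on CUDA, so wall-clock timing needs explicit synchronization. A minimal sketch:

```python
import time
import torch

x = torch.randn(4096, 4096, device="cuda")
torch.cuda.synchronize()   # finish any pending setup work first
start = time.perf_counter()
y = x @ x
torch.cuda.synchronize()   # wait for the matmul kernel to complete
print(f"GPU matmul: {time.perf_counter() - start:.4f} s")
```

Without the second synchronize, you would mostly be measuring the kernel launch, not the computation.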
Neural Network Operations
Our neural network test validated that PyTorch's high-level APIs work correctly:
```
Model created on CPU
Model device: cpu
Model moved to GPU: cuda:0

Input data shape: torch.Size([32, 100])
Input data device: cuda:0
Performing forward pass...
Output shape: torch.Size([32, 10])
Output device: cuda:0

✓ Neural network test successful!
```
This test confirms that:
- Models can be moved to the GPU with the `.cuda()` method
- Forward passes execute correctly on GPU
- All layers (Linear, ReLU) are properly accelerated
Memory Management
The memory management test showed efficient allocation and deallocation:
```
Allocated Memory: 32.00 MB
Cached Memory: 54.00 MB

After allocating 5 tensors:
Allocated Memory: 52.00 MB
Cached Memory: 54.00 MB

After clearing cache:
Allocated Memory: 32.00 MB
Cached Memory: 32.00 MB
```
PyTorch's memory management on ROCm works identically to CUDA, with proper caching behavior and the ability to manually clear cached memory when needed.
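For a more detailed view than the counters above, `torch.cuda.memory_summary()` prints a full report of the caching allocator's state; it behaves the same on ROCm builds:

```python
import torch

# Human-readable dump of the caching allocator's state (allocated and
# reserved blocks, fragmentation, OOM counters).
print(torch.cuda.memory_summary(device=0, abbreviated=True))
```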
Performance Considerations
Memory Bandwidth
The MAX+ 395's 96GB of memory is a significant advantage, but memory bandwidth matters just as much for deep learning workloads: many training and inference kernels are bandwidth-bound rather than compute-bound, so sustained throughput depends on how quickly data moves between memory and the compute units, not just on how much data fits. Benchmark your own models to see where the memory subsystem becomes the limiting factor.
Compute Performance
With 20 compute units, the MAX+ 395 provides substantial parallel processing capability. While direct comparisons to NVIDIA GPUs depend on the specific workload, ROCm's optimization for AMD architectures ensures efficient utilization of available compute resources.
Software Maturity
ROCm has matured significantly over recent years. Most PyTorch operations that work on CUDA now work seamlessly on ROCm. However, some edge cases and newer features may still have better support on CUDA, so testing your specific workload is recommended.
Practical Tips and Best Practices
Code Portability
To write code that works on both CUDA and ROCm:
```python
# Use device-agnostic code
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = inputs.to(device)
```
Monitoring GPU Utilization
Use `rocm-smi` to monitor GPU utilization:
```bash
watch -n 1 rocm-smi
```
This provides real-time information about GPU usage, memory consumption, temperature, and power draw.
Optimizing Memory Usage
With 96GB available, you might be tempted to use very large batch sizes. However, the optimal batch size depends on many factors, so it pays to benchmark a range of sizes. The loop below sketches the basic search pattern, with `train_and_measure` standing in for your own benchmarking helper:
```python
# Experiment with batch sizes to find the sweet spot between memory usage
# and throughput. train_and_measure is a hypothetical helper: train briefly
# at the given batch size and return the measured samples/sec.
for batch_size in [32, 64, 128, 256]:
    samples_per_sec = train_and_measure(batch_size)
    print(f"batch_size={batch_size}: {samples_per_sec:.1f} samples/sec")
```
Debugging
Enable PyTorch's anomaly detection during development:
```python
torch.autograd.set_detect_anomaly(True)
```
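Because anomaly detection adds significant overhead, the context-manager form is handy for scoping it to just the code under suspicion. A small sketch:

```python
import torch

x = torch.randn(8, 4, device="cuda", requires_grad=True)
w = torch.randn(4, 4, device="cuda", requires_grad=True)

# Only this backward pass runs with anomaly detection enabled; avoid leaving
# it on for full training runs.
with torch.autograd.detect_anomaly():
    loss = (x @ w).sum()
    loss.backward()
```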
Troubleshooting Common Issues
GPU Not Detected
If `torch.cuda.is_available()` returns `False`:
- Verify the ROCm installation with `rocm-smi`
- Check that PyTorch was installed with ROCm support: `print(torch.__version__)` should show a `+rocm` suffix
- Ensure ROCm drivers match PyTorch's ROCm version
Out of Memory Errors
Even with 96GB, you can run out of memory:
```python
# Clear cache periodically
torch.cuda.empty_cache()

# Use gradient checkpointing for large models
from torch.utils.checkpoint import checkpoint
```
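For context, here is a minimal, hypothetical sketch of gradient checkpointing in use; the block and shapes are arbitrary stand-ins for part of a real model:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Arbitrary stand-in for a memory-hungry section of a real model.
block = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
x = torch.randn(8, 1024, device="cuda", requires_grad=True)

# Activations inside `block` are recomputed during backward instead of being
# stored, trading extra compute for lower peak memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```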
Performance Issues
If training is slower than expected:
- Profile your code with `torch.profiler.profile()`
- Check for CPU-GPU transfer bottlenecks
- Verify data loading isn't the bottleneck
- Consider using mixed-precision training with `torch.cuda.amp` (see the sketch below)
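Here is a minimal mixed-precision training sketch, using the newer `torch.amp` spelling of the same API; the model and data are stand-ins for your own training objects:

```python
import torch

# Stand-in model and batch; replace with your own training objects.
model = torch.nn.Linear(100, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
inputs = torch.randn(32, 100, device="cuda")
targets = torch.randint(0, 10, (32,), device="cuda")

scaler = torch.amp.GradScaler("cuda")
for _ in range(10):
    optimizer.zero_grad()
    with torch.amp.autocast("cuda"):   # forward pass in mixed precision
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)             # unscale gradients, then apply the update
    scaler.update()                    # adapt the loss-scale factor
```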
Conclusion
The AMD Radeon Pro W7900 (MAX+ 395) with ROCm provides a robust, capable platform for PyTorch-based machine learning workloads. Our comprehensive testing demonstrated that:
- PyTorch 2.8.0 with ROCm 7.0.0 works seamlessly with the MAX+ 395
- All tested operations (tensors, neural networks, memory management) function correctly
- The massive 96GB memory capacity enables unique use cases
- Code written for CUDA generally works without modification
For organizations invested in AMD hardware or looking for alternatives to NVIDIA's ecosystem, the MAX+ 395 with ROCm represents a viable option for deep learning workloads. The open-source nature of ROCm and PyTorch's strong support for the platform ensure that AMD GPUs are first-class citizens in the deep learning community.
As ROCm continues to evolve and PyTorch support deepens, AMD's GPU offerings will only become more compelling for machine learning practitioners. The MAX+ 395, with its exceptional memory capacity and solid compute performance, stands ready to tackle demanding deep learning tasks.
Acknowledgments
The detailed ROCm 7.0 installation procedure is based on Wei Lu's excellent article "Ultralytics YOLO/SAM with ROCm 7.0 on AMD Ryzen AI Max+395 'Strix Halo'" published on Medium in October 2025. Wei Lu's pioneering work in documenting the complete bootstrapping process for ROCm 7.0 on the Max+395 platform made this possible.
Resources
- PyTorch ROCm Documentation
- ROCm Documentation
- AMD GPUs for Deep Learning
- AMD ROCm Installation Guide
- Wei Lu's Original Article
Based on real-world testing performed on October 10, 2025, using PyTorch 2.8.0 with ROCm 7.0.0 on an AMD Radeon Pro W7900 GPU with 96GB memory. Installation instructions adapted from Wei Lu's documentation of the AMD Ryzen AI Max+395 platform.