If you've been following the AI coding assistant space, you've probably noticed that most tools assume you're running NVIDIA hardware or using a cloud API. But what if you have AMD hardware and want to run large language models locally with full tool-calling support? This guide walks through setting up vLLM with AMD ROCm in Docker and connecting it to Continue.dev's cn command-line coding assistant.
Why vLLM?
There are several options for running LLMs locally, with llama.cpp, Ollama, and vLLM being the most popular. I chose vLLM for a specific reason: tool-calling support. vLLM implements the OpenAI-compatible API with proper function calling, which means coding assistants can use tools like file reading, code execution, and search. This is critical for getting a capable coding assistant rather than just a chat interface.
vLLM also offers excellent performance through continuous batching, PagedAttention for efficient memory management, and support for a wide range of models. The trade-off is that it's more resource-intensive than llama.cpp, but if you have the VRAM, the capabilities are worth it.
Hardware Setup
For this guide, I'm using an AMD Strix Halo system (Ryzen AI MAX+ 395) with 128GB of unified memory. If you're looking for a similar setup, the GMKtec EVO-X2 is one of the first mini PCs available with this chip. The integrated GPU shows up as gfx1151 in ROCm. However, this guide should work for any AMD GPU supported by ROCm, including discrete cards like the RX 7900 XTX, MI100, or MI250.
The unified memory architecture on Strix Halo is particularly interesting for LLM inference. Unlike discrete GPUs where you're limited by VRAM, the CPU and GPU share the same memory pool. This means you can run models that would normally require multiple high-end GPUs on a single chip, as long as you have enough system RAM.
Prerequisites
Before starting, you'll need:
- An AMD GPU supported by ROCm
- Docker installed on your system
- ROCm drivers installed (version 6.0 or later recommended)
- At least 16GB of RAM (more for larger models)
To verify your ROCm installation, run:
rocminfo | grep gfx
You should see your GPU architecture listed (e.g., gfx1100 for RDNA3, gfx1151 for Strix Halo).
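It's also worth checking how much GPU-addressable memory you have before picking a model later on. One way to do that with the ROCm tools (the exact output format varies between ROCm versions):

rocm-smi --showmeminfo vram

On unified-memory systems like Strix Halo, the figure reported here depends on how much of system memory is carved out for the GPU in the BIOS.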
Running vLLM in Docker
The easiest way to get vLLM running with ROCm is through Docker. The ROCm team maintains nightly images that include all necessary dependencies.
Pulling the Image
docker pull rocm/vllm-dev:nightly
This image is large (several GB) as it includes the full ROCm stack, PyTorch, and vLLM with all dependencies.
Starting the Container
Start the container with GPU access and port forwarding:
docker run -d \
  --name vllm-dev \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  rocm/vllm-dev:nightly \
  tail -f /dev/null
Let me break down the important flags:
- --device=/dev/kfd and --device=/dev/dri: Give the container access to the GPU
- --group-add video: Required for GPU access permissions
- -p 8000:8000: Expose the vLLM API port
- -v ~/.cache/huggingface:/root/.cache/huggingface: Persist downloaded models between container restarts
- tail -f /dev/null: Keep the container running so we can exec into it
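Before installing anything else, it's worth confirming the container can actually see the GPU. Since the image ships the full ROCm stack, rocminfo should be available inside it:

docker exec vllm-dev rocminfo | grep gfx

If this prints the same gfx target you saw on the host, the device passthrough is working; if it comes back empty, re-check the --device and --group-add flags above.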
Installing AMD SMI (Important!)
Before starting vLLM, you need to install the AMD SMI Python package inside the container. This is required for vLLM to detect the ROCm platform correctly:
docker exec -it vllm-dev bash
pip install /opt/rocm/share/amd_smi
exit
Without this step, vLLM will fail with an "UnspecifiedPlatform" error because it can't detect your AMD GPU.
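If you want to confirm the install took before starting the server, the package should be importable as amdsmi inside the container (the import name is an assumption based on the AMD SMI Python library; adjust if your ROCm version differs):

docker exec vllm-dev python3 -c "import amdsmi; print('amdsmi import OK')"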
Starting vLLM
Now start the vLLM server inside the container:
docker exec -d vllm-dev bash -c 'vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8000 > /tmp/vllm.log 2>&1'
The key flags here:
- Qwen/Qwen2.5-7B-Instruct: The model to serve. Qwen 2.5 is excellent for coding tasks and supports tool calling.
- --max-model-len 32768: Maximum context length. Coding assistants need long contexts for system prompts and code.
- --enable-auto-tool-choice: Enable function calling support
- --tool-call-parser hermes: Use the Hermes format for tool calls, which Qwen supports
The first startup takes a while as vLLM downloads the model weights, captures CUDA graphs, and warms up. Monitor progress with:
docker exec vllm-dev tail -f /tmp/vllm.log
You'll see it loading model shards, then capturing CUDA graphs. Once you see "Uvicorn running on http://0.0.0.0:8000", the server is ready.
Verifying the Server
Test that the server is responding:
curl http://localhost:8000/v1/models
You should see JSON output showing the loaded model with its maximum context length.
For a quick inference test:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
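Since tool calling is the whole reason for picking vLLM in this guide, it's worth verifying that end to end as well. Here's a sketch using a made-up get_weather function; the exact response varies by model, but with --enable-auto-tool-choice you should get a tool_calls entry back instead of plain text:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'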
Choosing a Model
The model you choose depends on your available memory and performance requirements. Here's a rough guide:
| Model | VRAM Required | Use Case |
|---|---|---|
| Qwen2.5-0.5B-Instruct | ~2GB | Testing, very fast responses |
| Qwen2.5-7B-Instruct | ~16GB | Good balance of speed and capability |
| Qwen2.5-32B-Instruct | ~70GB | Best quality, slower |
| DeepSeek-Coder-33B | ~70GB | Specialized for code |
For my system with 96GB allocated as VRAM, the 7B model leaves about 65GB free for KV cache, allowing concurrent requests with long contexts. The 32B model fits but leaves less headroom.
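If you want to keep some of that memory free for other workloads, vLLM's --gpu-memory-utilization flag controls what fraction of GPU memory it pre-allocates for weights and KV cache (it defaults to 0.9). A sketch of the serve command from earlier with a lower cap:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --gpu-memory-utilization 0.5 \
  --host 0.0.0.0 \
  --port 8000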
Setting Up Continue.dev CLI
Continue.dev is primarily known as a VS Code extension, but they also offer a command-line interface called cn that provides an AI coding assistant directly in your terminal.
Installing cn
The cn CLI is available via npm:
npm install -g @continuedev/cli
Or if you prefer not to install globally, you can use npx:
npx @continuedev/cli
Configuring cn for vLLM
Create or edit ~/.continue/config.yaml:
name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: Qwen2.5-7B-vLLM
    provider: openai
    model: Qwen/Qwen2.5-7B-Instruct
    apiBase: http://localhost:8000/v1
    apiKey: none
    roles:
      - chat
      - edit
    timeout: 60000000
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
The important settings:
- provider: openai: Use the OpenAI-compatible API format
- apiBase: Point to your vLLM server
- apiKey: none: vLLM doesn't require authentication by default
- timeout: Set high for longer operations
- context: Enable various context providers for code understanding
Using cn
Run cn from your project directory:
cn --config ~/.continue/config.yaml
This starts an interactive session where you can ask questions about your codebase, request changes, and have the assistant use tools to explore and modify files.
For quick one-off queries, use the print mode:
echo "What does this project do?" | cn -p --config ~/.continue/config.yaml
The -p flag prints the response and exits, useful for scripting or quick questions.
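Because -p reads the prompt from standard input, it also slots into ordinary shell pipelines. For example, a rough sketch for drafting a commit message from staged changes (assuming the diff fits comfortably within the model's context window):

{ echo "Write a one-line commit message for this diff:"; git diff --staged; } \
  | cn -p --config ~/.continue/config.yaml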
Example Session
Here's what a typical session looks like:
$ cn --config ~/.continue/config.yaml

> What files handle authentication in this project?

I'll search the codebase for authentication-related code.

[Uses grep tool to search for "auth", "login", "session"]

Based on my search, authentication is handled in:
- src/middleware/auth.js - JWT verification middleware
- src/routes/login.js - Login endpoint
- src/models/user.js - User model with password hashing

> Add rate limiting to the login endpoint

I'll read the current login route and add rate limiting...
The assistant can read files, search code, make edits, and run commands, all while maintaining context about your project.
Troubleshooting
"Context length exceeded" Error
If cn fails with a context length error, your vLLM server's --max-model-len is too low. The Continue CLI adds substantial system prompts. Restart vLLM with at least 32768 tokens:
docker exec vllm-dev pkill -f "vllm serve"
docker exec -d vllm-dev bash -c 'vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 > /tmp/vllm.log 2>&1'
GPU Memory Not Released
If vLLM fails to start due to insufficient memory, the previous instance may not have released GPU memory. The cleanest fix is to restart the container:
docker restart vllm-dev
Then start vLLM again.
Slow Inference
If inference is slow, check that GPU acceleration is actually being used:
watch -n 1 rocm-smi
You should see GPU utilization when generating tokens. If utilization is 0%, there may be a driver or permission issue.
Performance Notes
On Strix Halo with the 7B model, I see around 30-50 tokens per second for generation. The first request after starting is slower due to KV cache warmup. With the 32B model, speed drops to 10-15 tokens per second but quality improves significantly.
The unified memory architecture means there's no PCIe bottleneck for loading model weights, which helps with the initial prompt processing. However, the iGPU compute is slower than a discrete high-end GPU, so this setup prioritizes accessibility over raw speed.
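Numbers like these will vary with hardware, model, and context length, so it's worth measuring on your own setup. Since the OpenAI-compatible response includes a usage block, a rough throughput check is just a timed request; this includes prompt processing time, so treat it as a ballpark figure:

import time
import requests

start = time.time()
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Write a quicksort function in Python."}],
        "max_tokens": 256,
    },
).json()
elapsed = time.time() - start

# completion_tokens counts only generated tokens, not the prompt
tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")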
Remote Access
If your vLLM server is running on a different machine (like a dedicated inference server), you'll need to update the apiBase in your config to point to that machine's IP address:
apiBase: http://192.168.1.100:8000/v1
Make sure port 8000 is accessible through any firewalls. For secure remote access over the internet, consider setting up a VPN or SSH tunnel rather than exposing the port directly:
ssh -L 8000:localhost:8000 user@remote-server
This forwards local port 8000 to the remote server, so you can keep using localhost:8000 in your config while the actual inference happens remotely.
Alternative Clients
While this guide focuses on Continue.dev's cn CLI, the vLLM server works with any OpenAI-compatible client. Here are a few alternatives worth considering:
Aider
Aider is another excellent terminal-based coding assistant. Install it with pip:
pip install aider-chat
Then connect to your vLLM server:
aider --model openai/Qwen/Qwen2.5-7B-Instruct \
  --openai-api-base http://localhost:8000/v1 \
  --openai-api-key none
Aider has a different interaction style than Continue, using git-aware editing and a focus on making commits. It's worth trying both to see which fits your workflow.
Open WebUI
For a graphical interface, Open WebUI provides a ChatGPT-like experience that connects to local LLM servers. It's particularly nice for non-coding conversations or when you want to share access with others who prefer a web interface.
Direct API Calls
For scripting and automation, you can call the vLLM API directly. Here's a Python example:
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain this code: def fib(n): ..."}
        ],
        "max_tokens": 500
    }
)

print(response.json()["choices"][0]["message"]["content"])
This is useful for building custom tools or integrating LLM capabilities into existing scripts.
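If you prefer the official openai Python client over raw requests, it works against the same endpoint because vLLM speaks the OpenAI protocol. A sketch assuming openai version 1.x is installed; streaming is handy for interactive tools:

from openai import OpenAI

# Point the official client at the local vLLM server; the API key is ignored
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain Python list comprehensions in one paragraph."}],
    stream=True,  # yield tokens as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()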
Keeping the Server Running
For a production-like setup where you want vLLM to start automatically and stay running, consider creating a startup script or using Docker Compose. Here's a simple approach using a shell script:
#!/bin/bash
# start-vllm.sh
docker start vllm-dev 2>/dev/null || echo "Container already running"
docker exec vllm-dev pkill -f "vllm serve" 2>/dev/null
sleep 2
docker exec -d vllm-dev bash -c 'vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8000 > /tmp/vllm.log 2>&1'

echo "vLLM starting... check logs with: docker exec vllm-dev tail -f /tmp/vllm.log"
Make it executable with chmod +x start-vllm.sh and run it whenever you need to start or restart the server.
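If you'd rather have Docker manage the lifecycle, the same setup translates to Docker Compose. This is a sketch rather than an official recipe, so treat the details (especially the device and seccomp syntax) as assumptions to verify against your Compose version:

# docker-compose.yml -- rough equivalent of the docker run + exec steps above
services:
  vllm:
    image: rocm/vllm-dev:nightly
    container_name: vllm-dev
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp:unconfined
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    restart: unless-stopped
    command:
      - bash
      - -c
      - |
        pip install /opt/rocm/share/amd_smi
        exec vllm serve Qwen/Qwen2.5-7B-Instruct \
          --max-model-len 32768 \
          --enable-auto-tool-choice \
          --tool-call-parser hermes \
          --host 0.0.0.0 --port 8000

Bring it up with docker compose up -d and follow the logs with docker compose logs -f vllm.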
Conclusion
Running vLLM with ROCm opens up local AI coding assistants to AMD GPU users. Combined with Continue.dev's cn CLI, you get a capable terminal-based assistant that can understand your codebase, make edits, and use tools, all running on your own hardware with no cloud dependencies.
The setup isn't as plug-and-play as using a cloud API, but the privacy benefits and lack of per-token costs make it worthwhile for regular use. And as AMD's ROCm ecosystem continues to mature, expect the experience to get smoother with each release.
What I appreciate most about this setup is the flexibility. You're not locked into any particular client or workflow. The same vLLM server can power your terminal coding assistant, a web chat interface, custom scripts, and IDE integrations all at once. That's the advantage of running your own inference server: you control the stack from model selection to client interface.
If you're interested in exploring further, consider trying different models (DeepSeek Coder is excellent for code-focused tasks), experimenting with quantized models for better performance, or setting up the full Continue VS Code extension alongside the CLI for a complete local AI development environment.
