If you've been following the AI coding assistant space, you've probably noticed that most tools assume you're running NVIDIA hardware or using a cloud API. But what if you have AMD hardware and want to run large language models locally with full tool-calling support? This guide walks through setting up vLLM with AMD ROCm in Docker and connecting it to Continue.dev's cn command-line coding assistant.
Why vLLM?
There are several options for running LLMs locally, with llama.cpp, Ollama, and vLLM being the most popular. I chose vLLM for a specific reason: tool-calling support. vLLM implements the OpenAI-compatible API with proper function calling, which means coding assistants can use tools like file reading, code execution, and search. This is critical for getting a capable coding assistant rather than just a chat interface.
vLLM also offers excellent performance through continuous batching, PagedAttention for efficient memory management, and support for a wide range of models. The trade-off is that it's more resource-intensive than llama.cpp, but if you have the VRAM, the capabilities are worth it.
Hardware Setup
For this guide, I'm using an AMD Strix Halo system (Ryzen AI MAX+ 395) with 128GB of unified memory. If you're looking for a similar setup, the GMKtec EVO-X2 is one of the first mini PCs available with this chip. The integrated GPU shows up as gfx1151 in ROCm. However, this guide should work for any AMD GPU supported by ROCm, including discrete cards like the RX 7900 XTX, MI100, or MI250.
The unified memory architecture on Strix Halo is particularly interesting for LLM inference. Unlike discrete GPUs where you're limited by VRAM, the CPU and GPU share the same memory pool. This means you can run models that would normally require multiple high-end GPUs on a single chip, as long as you have enough system RAM.
Prerequisites
Before starting, you'll need:
- An AMD GPU supported by ROCm
- Docker installed on your system
- ROCm drivers installed (version 6.0 or later recommended)
- At least 16GB of RAM (more for larger models)
To verify your ROCm installation, run:
rocminfo | grep gfx
You should see your GPU architecture listed (e.g., gfx1100 for RDNA3, gfx1151 for Strix Halo).
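It's also worth checking how much GPU-addressable memory you have before picking a model later on. One way to do that with the ROCm tools (the exact output format varies between ROCm versions):

rocm-smi --showmeminfo vram

On unified-memory systems like Strix Halo, the figure reported here depends on how much of system memory is carved out for the GPU in the BIOS.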
Running vLLM in Docker
The easiest way to get vLLM running with ROCm is through Docker. The ROCm team maintains nightly images that include all necessary dependencies.
Pulling the Image
docker pull rocm/vllm-dev:nightly
This image is large (several GB) as it includes the full ROCm stack, PyTorch, and vLLM with all dependencies.
Starting the Container
Start the container with GPU access and port forwarding:
docker run -d \
  --name vllm-dev \
  --device=/dev/kfd \
  --device=/dev/dri \
  --group-add video \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  rocm/vllm-dev:nightly \
  tail -f /dev/null
Let me break down the important flags:
- --device=/dev/kfd and --device=/dev/dri: Give the container access to the GPU
- --group-add video: Required for GPU access permissions
- -p 8000:8000: Expose the vLLM API port
- -v ~/.cache/huggingface:/root/.cache/huggingface: Persist downloaded models between container restarts
- tail -f /dev/null: Keep the container running so we can exec into it
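Before installing anything else, it's worth confirming the container can actually see the GPU. Since the image ships the full ROCm stack, rocminfo should be available inside it:

docker exec vllm-dev rocminfo | grep gfx

If this prints the same gfx target you saw on the host, the device passthrough is working; if it comes back empty, re-check the --device and --group-add flags above.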
Installing AMD SMI (Important!)
Before starting vLLM, you need to install the AMD SMI Python package inside the container. This is required for vLLM to detect the ROCm platform correctly:
docker exec -it vllm-dev bash
pip install /opt/rocm/share/amd_smi
exit
Without this step, vLLM will fail with an "UnspecifiedPlatform" error because it can't detect your AMD GPU.
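If you want to confirm the install took before starting the server, the package should be importable as amdsmi inside the container (the import name is an assumption based on the AMD SMI Python library; adjust if your ROCm version differs):

docker exec vllm-dev python3 -c "import amdsmi; print('amdsmi import OK')"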
Starting vLLM
Now start the vLLM server inside the container:
docker exec -d vllm-dev bash -c 'vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8000 > /tmp/vllm.log 2>&1'
The key flags here:
- Qwen/Qwen2.5-7B-Instruct: The model to serve. Qwen 2.5 is excellent for coding tasks and supports tool calling.
- --max-model-len 32768: Maximum context length. Coding assistants need long contexts for system prompts and code.
- --enable-auto-tool-choice: Enable function calling support
- --tool-call-parser hermes: Use the Hermes format for tool calls, which Qwen supports
The first startup takes a while as vLLM downloads the model weights, captures CUDA graphs, and warms up. Monitor progress with:
docker exec vllm-dev tail -f /tmp/vllm.log
You'll see it loading model shards, then capturing CUDA graphs. Once you see "Uvicorn running on http://0.0.0.0:8000", the server is ready.
Verifying the Server
Test that the server is responding:
curl http://localhost:8000/v1/models
You should see JSON output showing the loaded model with its maximum context length.
For a quick inference test:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
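Since tool calling is the whole reason for picking vLLM in this guide, it's worth verifying that end to end as well. Here's a sketch using a made-up get_weather function; the exact response varies by model, but with --enable-auto-tool-choice you should get a tool_calls entry back instead of plain text:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'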
Choosing a Model
The model you choose depends on your available memory and performance requirements. Here's a rough guide:
| Model | VRAM Required | Use Case |
|---|---|---|
| Qwen2.5-0.5B-Instruct | ~2GB | Testing, very fast responses |
| Qwen2.5-7B-Instruct | ~16GB | Good balance of speed and capability |
| Qwen2.5-32B-Instruct | ~70GB | Best quality, slower |
| DeepSeek-Coder-33B | ~70GB | Specialized for code |
For my system with 96GB allocated as VRAM, the 7B model leaves about 65GB free for KV cache, allowing concurrent requests with long contexts. The 32B model fits but leaves less headroom.
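If you want to keep some of that memory free for other workloads, vLLM's --gpu-memory-utilization flag controls what fraction of GPU memory it pre-allocates for weights and KV cache (it defaults to 0.9). A sketch of the serve command from earlier with a lower cap:

vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --gpu-memory-utilization 0.5 \
  --host 0.0.0.0 \
  --port 8000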
Setting Up Continue.dev CLI
Continue.dev is primarily known as a VS Code extension, but they also offer a command-line interface called cn that provides an AI coding assistant directly in your terminal.
Installing cn
The cn CLI is available via npm:
npm install -g @continuedev/cli
Or if you prefer not to install globally, you can use npx:
npx @continuedev/cli
Configuring cn for vLLM
Create or edit ~/.continue/config.yaml:
name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: Qwen2.5-7B-vLLM
    provider: openai
    model: Qwen/Qwen2.5-7B-Instruct
    apiBase: http://localhost:8000/v1
    apiKey: none
    roles:
      - chat
      - edit
    timeout: 60000000
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
The important settings:
- provider: openai: Use the OpenAI-compatible API format
- apiBase: Point to your vLLM server
- apiKey: none: vLLM doesn't require authentication by default
- timeout: Set high for longer operations
- context: Enable various context providers for code understanding
Using cn
Run cn from your project directory:
cn --config ~/.continue/config.yaml
This starts an interactive session where you can ask questions about your codebase, request changes, and have the assistant use tools to explore and modify files.
For quick one-off queries, use the print mode:
echo "What does this project do?" | cn -p --config ~/.continue/config.yaml
The -p flag prints the response and exits, useful for scripting or quick questions.
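Because -p reads the prompt from standard input, it also slots into ordinary shell pipelines. For example, a rough sketch for drafting a commit message from staged changes (assuming the diff fits comfortably within the model's context window):

{ echo "Write a one-line commit message for this diff:"; git diff --staged; } \
  | cn -p --config ~/.continue/config.yaml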
Example Session
Here's what a typical session looks like:
$ cn --config ~/.continue/config.yaml

> What files handle authentication in this project?

I'll search the codebase for authentication-related code.

[Uses grep tool to search for "auth", "login", "session"]

Based on my search, authentication is handled in:
- src/middleware/auth.js - JWT verification middleware
- src/routes/login.js - Login endpoint
- src/models/user.js - User model with password hashing

> Add rate limiting to the login endpoint

I'll read the current login route and add rate limiting...
The assistant can read files, search code, make edits, and run commands, all while maintaining context about your project.
Troubleshooting
"Context length exceeded" Error
If cn fails with a context length error, your vLLM server's --max-model-len is too low. The Continue CLI adds substantial system prompts. Restart vLLM with at least 32768 tokens:
docker exec vllm-dev pkill -f "vllm serve"
docker exec -d vllm-dev bash -c 'vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 > /tmp/vllm.log 2>&1'
GPU Memory Not Released
If vLLM fails to start due to insufficient memory, the previous instance may not have released GPU memory. The cleanest fix is to restart the container:
docker restart vllm-dev
Then start vLLM again.
Slow Inference
If inference is slow, check that GPU acceleration is actually being used:
watch -n 1 rocm-smi
You should see GPU utilization when generating tokens. If utilization is 0%, there may be a driver or permission issue.
Performance Notes
On Strix Halo with the 7B model, I see around 30-50 tokens per second for generation. The first request after starting is slower due to KV cache warmup. With the 32B model, speed drops to 10-15 tokens per second but quality improves significantly.
The unified memory architecture means there's no PCIe bottleneck for loading model weights, which helps with the initial prompt processing. However, the iGPU compute is slower than a discrete high-end GPU, so this setup prioritizes accessibility over raw speed.
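Numbers like these will vary with hardware, model, and context length, so it's worth measuring on your own setup. Since the OpenAI-compatible response includes a usage block, a rough throughput check is just a timed request; this includes prompt processing time, so treat it as a ballpark figure:

import time
import requests

start = time.time()
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "Write a quicksort function in Python."}],
        "max_tokens": 256,
    },
).json()
elapsed = time.time() - start

# completion_tokens counts only generated tokens, not the prompt
tokens = resp["usage"]["completion_tokens"]
print(f"{tokens} tokens in {elapsed:.1f}s -> {tokens / elapsed:.1f} tok/s")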
Remote Access
If your vLLM server is running on a different machine (like a dedicated inference server), you'll need to update the apiBase in your config to point to that machine's IP address:
apiBase: http://192.168.1.100:8000/v1
Make sure port 8000 is accessible through any firewalls. For secure remote access over the internet, consider setting up a VPN or SSH tunnel rather than exposing the port directly:
ssh -L 8000:localhost:8000 user@remote-server
This forwards local port 8000 to the remote server, so you can keep using localhost:8000 in your config while the actual inference happens remotely.
Alternative Clients
While this guide focuses on Continue.dev's cn CLI, the vLLM server works with any OpenAI-compatible client. Here are a few alternatives worth considering:
Aider
Aider is another excellent terminal-based coding assistant. Install it with pip:
pip install aider-chat
Then connect to your vLLM server:
aider --model openai/Qwen/Qwen2.5-7B-Instruct \
  --openai-api-base http://localhost:8000/v1 \
  --openai-api-key none
Aider has a different interaction style than Continue, using git-aware editing and a focus on making commits. It's worth trying both to see which fits your workflow.
Open WebUI
For a graphical interface, Open WebUI provides a ChatGPT-like experience that connects to local LLM servers. It's particularly nice for non-coding conversations or when you want to share access with others who prefer a web interface.
Direct API Calls
For scripting and automation, you can call the vLLM API directly. Here's a Python example:
import requests

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [
            {"role": "user", "content": "Explain this code: def fib(n): ..."}
        ],
        "max_tokens": 500
    }
)

print(response.json()["choices"][0]["message"]["content"])
This is useful for building custom tools or integrating LLM capabilities into existing scripts.
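If you prefer the official openai Python client over raw requests, it works against the same endpoint because vLLM speaks the OpenAI protocol. A sketch assuming openai version 1.x is installed; streaming is handy for interactive tools:

from openai import OpenAI

# Point the official client at the local vLLM server; the API key is ignored
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain Python list comprehensions in one paragraph."}],
    stream=True,  # yield tokens as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()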
Keeping the Server Running
For a production-like setup where you want vLLM to start automatically and stay running, consider creating a startup script or using Docker Compose. Here's a simple approach using a shell script:
#!/bin/bash
# start-vllm.sh
docker start vllm-dev 2>/dev/null || echo "Container already running"
docker exec vllm-dev pkill -f "vllm serve" 2>/dev/null
sleep 2
docker exec -d vllm-dev bash -c 'vllm serve Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8000 > /tmp/vllm.log 2>&1'

echo "vLLM starting... check logs with: docker exec vllm-dev tail -f /tmp/vllm.log"
Make it executable with chmod +x start-vllm.sh and run it whenever you need to start or restart the server.
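If you'd rather have Docker manage the lifecycle, the same setup translates to Docker Compose. This is a sketch rather than an official recipe, so treat the details (especially the device and seccomp syntax) as assumptions to verify against your Compose version:

# docker-compose.yml -- rough equivalent of the docker run + exec steps above
services:
  vllm:
    image: rocm/vllm-dev:nightly
    container_name: vllm-dev
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
    cap_add:
      - SYS_PTRACE
    security_opt:
      - seccomp:unconfined
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    restart: unless-stopped
    command:
      - bash
      - -c
      - |
        pip install /opt/rocm/share/amd_smi
        exec vllm serve Qwen/Qwen2.5-7B-Instruct \
          --max-model-len 32768 \
          --enable-auto-tool-choice \
          --tool-call-parser hermes \
          --host 0.0.0.0 --port 8000

Bring it up with docker compose up -d and follow the logs with docker compose logs -f vllm.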
Conclusion
Running vLLM with ROCm opens up local AI coding assistants to AMD GPU users. Combined with Continue.dev's cn CLI, you get a capable terminal-based assistant that can understand your codebase, make edits, and use tools, all running on your own hardware with no cloud dependencies.
The setup isn't as plug-and-play as using a cloud API, but the privacy benefits and lack of per-token costs make it worthwhile for regular use. And as AMD's ROCm ecosystem continues to mature, expect the experience to get smoother with each release.
What I appreciate most about this setup is the flexibility. You're not locked into any particular client or workflow. The same vLLM server can power your terminal coding assistant, a web chat interface, custom scripts, and IDE integrations all at once. That's the advantage of running your own inference server: you control the stack from model selection to client interface.
If you're interested in exploring further, consider trying different models (DeepSeek Coder is excellent for code-focused tasks), experimenting with quantized models for better performance, or setting up the full Continue VS Code extension alongside the CLI for a complete local AI development environment.
