<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TinyComputers.io (Posts about continue)</title><link>https://tinycomputers.io/</link><description></description><atom:link href="https://tinycomputers.io/categories/continue.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 A.C. Jokela 
&lt;!-- div style="width: 100%" --&gt;
&lt;a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"&gt;&lt;img alt="" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /&gt; Creative Commons Attribution-ShareAlike&lt;/a&gt;&amp;nbsp;|&amp;nbsp;
&lt;!-- /div --&gt;
</copyright><lastBuildDate>Mon, 06 Apr 2026 22:13:04 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Running vLLM in Docker with AMD ROCm and the Continue.dev CLI</title><link>https://tinycomputers.io/posts/running-vllm-in-docker-with-amd-rocm-and-the-continuedev-cli.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/running-vllm-in-docker-with-amd-rocm-and-the-continuedev-cli_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;14 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;If you've been following the AI coding assistant space, you've probably noticed that most tools assume you're running NVIDIA hardware or using a cloud API. But what if you have AMD hardware and want to run large language models locally with full tool-calling support? This guide walks through setting up vLLM with AMD ROCm in Docker and connecting it to Continue.dev's &lt;code&gt;cn&lt;/code&gt; command-line coding assistant.&lt;/p&gt;
&lt;h3&gt;Why vLLM?&lt;/h3&gt;
&lt;p&gt;There are several options for running LLMs locally, with llama.cpp, Ollama, and vLLM being the most popular. I chose vLLM for a specific reason: &lt;strong&gt;tool calling support&lt;/strong&gt;. vLLM exposes an OpenAI-compatible API with proper function calling, which means coding assistants can use tools like file reading, code execution, and search. This is critical for getting a capable coding assistant rather than just a chat interface.&lt;/p&gt;
&lt;p&gt;vLLM also offers excellent performance through continuous batching, PagedAttention for efficient memory management, and support for a wide range of models. The trade-off is that it's more resource-intensive than llama.cpp, but if you have the VRAM, the capabilities are worth it.&lt;/p&gt;
&lt;h3&gt;Hardware Setup&lt;/h3&gt;
&lt;p&gt;For this guide, I'm using an AMD Strix Halo system (Ryzen AI MAX+ 395) with 128GB of unified memory. If you're looking for a similar setup, the &lt;a href="https://baud.rs/gmVPEI"&gt;GMKtec EVO-X2&lt;/a&gt; is one of the first mini PCs available with this chip. The integrated GPU shows up as &lt;code&gt;gfx1151&lt;/code&gt; in ROCm. However, this guide should work for any AMD GPU supported by ROCm, including discrete cards like the &lt;a href="https://baud.rs/0uWeZN"&gt;RX 7900 XTX&lt;/a&gt;, MI100, or MI250.&lt;/p&gt;
&lt;p&gt;The unified memory architecture on Strix Halo is particularly interesting for LLM inference. Unlike discrete GPUs where you're limited by VRAM, the CPU and GPU share the same memory pool. This means you can run models that would normally require multiple high-end GPUs on a single chip, as long as you have enough system RAM.&lt;/p&gt;
&lt;h3&gt;Prerequisites&lt;/h3&gt;
&lt;p&gt;Before starting, you'll need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An AMD GPU supported by ROCm&lt;/li&gt;
&lt;li&gt;Docker installed on your system&lt;/li&gt;
&lt;li&gt;ROCm drivers installed (version 6.0 or later recommended)&lt;/li&gt;
&lt;li&gt;At least 16GB of RAM (more for larger models)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To verify your ROCm installation, run:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;rocminfo&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;grep&lt;span class="w"&gt; &lt;/span&gt;gfx
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should see your GPU architecture listed (e.g., &lt;code&gt;gfx1100&lt;/code&gt; for RDNA3, &lt;code&gt;gfx1151&lt;/code&gt; for Strix Halo).&lt;/p&gt;
&lt;h3&gt;Running vLLM in Docker&lt;/h3&gt;
&lt;p&gt;The easiest way to get vLLM running with ROCm is through Docker. The ROCm team maintains nightly images that include all necessary dependencies.&lt;/p&gt;
&lt;h4&gt;Pulling the Image&lt;/h4&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;docker&lt;span class="w"&gt; &lt;/span&gt;pull&lt;span class="w"&gt; &lt;/span&gt;rocm/vllm-dev:nightly
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This image is large (several GB) as it includes the full ROCm stack, PyTorch, and vLLM with all dependencies.&lt;/p&gt;
&lt;h4&gt;Starting the Container&lt;/h4&gt;
&lt;p&gt;Start the container with GPU access and port forwarding:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;docker&lt;span class="w"&gt; &lt;/span&gt;run&lt;span class="w"&gt; &lt;/span&gt;-d&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--name&lt;span class="w"&gt; &lt;/span&gt;vllm-dev&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--device&lt;span class="o"&gt;=&lt;/span&gt;/dev/kfd&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--device&lt;span class="o"&gt;=&lt;/span&gt;/dev/dri&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--group-add&lt;span class="w"&gt; &lt;/span&gt;video&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--cap-add&lt;span class="o"&gt;=&lt;/span&gt;SYS_PTRACE&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--security-opt&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nv"&gt;seccomp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;unconfined&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8000&lt;/span&gt;:8000&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-v&lt;span class="w"&gt; &lt;/span&gt;~/.cache/huggingface:/root/.cache/huggingface&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;rocm/vllm-dev:nightly&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;tail&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;/dev/null
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Let me break down the important flags:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;--device=/dev/kfd&lt;/code&gt; and &lt;code&gt;--device=/dev/dri&lt;/code&gt;: Give the container access to the GPU&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--group-add video&lt;/code&gt;: Required for GPU access permissions&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--cap-add=SYS_PTRACE&lt;/code&gt; and &lt;code&gt;--security-opt seccomp=unconfined&lt;/code&gt;: Relax the container's default sandboxing, as commonly recommended for ROCm containers&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-p 8000:8000&lt;/code&gt;: Expose the vLLM API port&lt;/li&gt;
&lt;li&gt;&lt;code&gt;-v ~/.cache/huggingface:/root/.cache/huggingface&lt;/code&gt;: Persist downloaded models between container restarts&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tail -f /dev/null&lt;/code&gt;: Keep the container running so we can exec into it&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Installing AMD SMI (Important!)&lt;/h4&gt;
&lt;p&gt;Before starting vLLM, you need to install the AMD SMI Python package inside the container. This is required for vLLM to detect the ROCm platform correctly:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;docker&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-it&lt;span class="w"&gt; &lt;/span&gt;vllm-dev&lt;span class="w"&gt; &lt;/span&gt;bash
pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;/opt/rocm/share/amd_smi
&lt;span class="nb"&gt;exit&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Without this step, vLLM will fail with an "UnspecifiedPlatform" error because it can't detect your AMD GPU.&lt;/p&gt;
&lt;h4&gt;Starting vLLM&lt;/h4&gt;
&lt;p&gt;Now start the vLLM server inside the container:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;docker&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-d&lt;span class="w"&gt; &lt;/span&gt;vllm-dev&lt;span class="w"&gt; &lt;/span&gt;bash&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'vllm serve Qwen/Qwen2.5-7B-Instruct \&lt;/span&gt;
&lt;span class="s1"&gt;  --max-model-len 32768 \&lt;/span&gt;
&lt;span class="s1"&gt;  --enable-auto-tool-choice \&lt;/span&gt;
&lt;span class="s1"&gt;  --tool-call-parser hermes \&lt;/span&gt;
&lt;span class="s1"&gt;  --host 0.0.0.0 \&lt;/span&gt;
&lt;span class="s1"&gt;  --port 8000 &amp;gt; /tmp/vllm.log 2&amp;gt;&amp;amp;1'&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The key flags here:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Qwen/Qwen2.5-7B-Instruct&lt;/code&gt;: The model to serve. Qwen 2.5 is excellent for coding tasks and supports tool calling.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--max-model-len 32768&lt;/code&gt;: Maximum context length. Coding assistants need long contexts for system prompts and code.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--enable-auto-tool-choice&lt;/code&gt;: Let the model decide on its own when to emit a tool call, which is what coding assistants rely on&lt;/li&gt;
&lt;li&gt;&lt;code&gt;--tool-call-parser hermes&lt;/code&gt;: Use the Hermes format for tool calls, which Qwen supports&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The first startup takes a while as vLLM downloads the model weights, captures CUDA graphs, and warms up (vLLM keeps the CUDA terminology in its logs even when running on ROCm). Monitor progress with:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;docker&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;vllm-dev&lt;span class="w"&gt; &lt;/span&gt;tail&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;/tmp/vllm.log
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You'll see it loading model shards, then capturing CUDA graphs. Once you see "Uvicorn running on http://0.0.0.0:8000", the server is ready.&lt;/p&gt;
&lt;h4&gt;Verifying the Server&lt;/h4&gt;
&lt;p&gt;Test that the server is responding:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;curl&lt;span class="w"&gt; &lt;/span&gt;http://localhost:8000/v1/models
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should see JSON output showing the loaded model with its maximum context length.&lt;/p&gt;
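&lt;p&gt;If you'd rather script this check than rerun curl by hand, the same endpoint can be polled until the server comes up. The following is a rough sketch using only the Python standard library. The function names are made up for this example, and the &lt;code&gt;max_model_len&lt;/code&gt; field is read defensively because I'm assuming, not guaranteeing, that your vLLM version reports it under that name.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import json
import time
import urllib.request


def served_models(models_response):
    """Map each served model id to its reported context length (None if absent)."""
    return {
        entry["id"]: entry.get("max_model_len")  # field name is an assumption
        for entry in models_response.get("data", [])
    }


def wait_for_server(api_base="http://localhost:8000/v1", attempts=60, delay=5.0):
    """Poll the /models endpoint until it answers, then return the model mapping."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(api_base + "/models", timeout=5) as reply:
                return served_models(json.load(reply))
        except OSError:
            # Connection refused or timed out: the server is still starting up.
            time.sleep(delay)
    raise TimeoutError("vLLM did not become ready in time")
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;wait_for_server()&lt;/code&gt; blocks until the API responds, which is handy in scripts that start vLLM and then immediately launch a client.&lt;/p&gt;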
&lt;p&gt;For a quick inference test:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;curl&lt;span class="w"&gt; &lt;/span&gt;http://localhost:8000/v1/chat/completions&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-H&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;-d&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'{&lt;/span&gt;
&lt;span class="s1"&gt;    "model": "Qwen/Qwen2.5-7B-Instruct",&lt;/span&gt;
&lt;span class="s1"&gt;    "messages": [{"role": "user", "content": "Hello!"}],&lt;/span&gt;
&lt;span class="s1"&gt;    "max_tokens": 50&lt;/span&gt;
&lt;span class="s1"&gt;  }'&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
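
&lt;p&gt;The request above only exercises plain chat. Because tool calling is the reason for choosing vLLM in this guide, it can be worth sending one request that advertises a tool and checking that a structured call comes back. Here is a minimal Python sketch of that idea. &lt;code&gt;read_file&lt;/code&gt; is a hypothetical tool invented for illustration, the helper names are my own, and the code assumes the server from the previous steps is listening on &lt;code&gt;localhost:8000&lt;/code&gt;.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import json
import urllib.request

API_BASE = "http://localhost:8000/v1"  # same server as the curl test
MODEL = "Qwen/Qwen2.5-7B-Instruct"


def build_tool_request(question):
    """Build a chat completion payload that advertises one tool."""
    # read_file is a hypothetical tool used only for illustration.
    read_file_tool = {
        "type": "function",
        "function": {
            "name": "read_file",
            "description": "Return the contents of a text file.",
            "parameters": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
                "required": ["path"],
            },
        },
    }
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": question}],
        "tools": [read_file_tool],
        "tool_choice": "auto",
        "max_tokens": 200,
    }


def extract_tool_calls(response):
    """Return (name, arguments) pairs from a chat completion response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


def ask(question):
    """POST the payload to the server and return any parsed tool calls."""
    request = urllib.request.Request(
        API_BASE + "/chat/completions",
        data=json.dumps(build_tool_request(question)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return extract_tool_calls(json.load(reply))
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With the server running, &lt;code&gt;ask("What is in README.md?")&lt;/code&gt; returns a list of tool calls if the model decides to use the tool. An empty list means it answered in plain text instead, which is not necessarily an error but is worth a second look at the tool-call flags.&lt;/p&gt;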

&lt;h3&gt;Choosing a Model&lt;/h3&gt;
&lt;p&gt;The model you choose depends on your available memory and performance requirements. Here's a rough guide, assuming unquantized 16-bit weights:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;VRAM Required&lt;/th&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-0.5B-Instruct&lt;/td&gt;
&lt;td&gt;~2GB&lt;/td&gt;
&lt;td&gt;Testing, very fast responses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-7B-Instruct&lt;/td&gt;
&lt;td&gt;~16GB&lt;/td&gt;
&lt;td&gt;Good balance of speed and capability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-32B-Instruct&lt;/td&gt;
&lt;td&gt;~70GB&lt;/td&gt;
&lt;td&gt;Best quality, slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-Coder-33B&lt;/td&gt;
&lt;td&gt;~70GB&lt;/td&gt;
&lt;td&gt;Specialized for code&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
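&lt;p&gt;For my system with 96GB allocated as VRAM, the 7B model leaves about 65GB free for KV cache (vLLM claims 90% of GPU memory by default, and the weights and runtime overhead come out of that share), allowing concurrent requests with long contexts. The 32B model fits but leaves less headroom.&lt;/p&gt;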
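&lt;p&gt;To get a feel for where these numbers come from, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, not a measurement. The parameter count and the layer, KV-head, and head-dimension values for Qwen2.5-7B-Instruct are assumptions on my part, so check them against the model's &lt;code&gt;config.json&lt;/code&gt; before relying on the result.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;GIB = 1024 ** 3


def weight_bytes(params_billion, bytes_per_param=2):
    """Memory for the weights alone (2 bytes per parameter for FP16/BF16)."""
    return params_billion * 1e9 * bytes_per_param


def kv_cache_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed architecture values for Qwen2.5-7B-Instruct (verify in config.json).
per_token = kv_cache_bytes_per_token(layers=28, kv_heads=4, head_dim=128)

print(f"weights: {weight_bytes(7.6) / GIB:.1f} GiB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache for one 32768-token context: {per_token * 32768 / GIB:.2f} GiB")
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Under these assumptions a single full-length context costs under 2 GiB of KV cache, so a pool in the tens of gigabytes has room for a few dozen of them at once.&lt;/p&gt;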
&lt;h3&gt;Setting Up Continue.dev CLI&lt;/h3&gt;
&lt;p&gt;Continue.dev is primarily known as a VS Code extension, but they also offer a command-line interface called &lt;code&gt;cn&lt;/code&gt; that provides an AI coding assistant directly in your terminal.&lt;/p&gt;
&lt;h4&gt;Installing cn&lt;/h4&gt;
&lt;p&gt;The cn CLI is available via npm:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;npm&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;-g&lt;span class="w"&gt; &lt;/span&gt;@continuedev/cli
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Or if you prefer not to install globally, you can use npx:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;npx&lt;span class="w"&gt; &lt;/span&gt;@continuedev/cli
&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;Configuring cn for vLLM&lt;/h4&gt;
&lt;p&gt;Create or edit &lt;code&gt;~/.continue/config.yaml&lt;/code&gt;:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Local Assistant&lt;/span&gt;
&lt;span class="nt"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;1.0.0&lt;/span&gt;
&lt;span class="nt"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;v1&lt;/span&gt;

&lt;span class="nt"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Qwen2.5-7B-vLLM&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;openai&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;Qwen/Qwen2.5-7B-Instruct&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;apiBase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;http://localhost:8000/v1&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;none&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;roles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;chat&lt;/span&gt;
&lt;span class="w"&gt;      &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;edit&lt;/span&gt;
&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="nt"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;60000000&lt;/span&gt;

&lt;span class="nt"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;code&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;docs&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;diff&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;terminal&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;problems&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;folder&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p p-Indicator"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nt"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;codebase&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The important settings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;provider: openai&lt;/code&gt;: Use the OpenAI-compatible API format&lt;/li&gt;
&lt;li&gt;&lt;code&gt;apiBase&lt;/code&gt;: Point to your vLLM server&lt;/li&gt;
&lt;li&gt;&lt;code&gt;apiKey: none&lt;/code&gt;: Any placeholder works here, since vLLM only checks keys when it is started with &lt;code&gt;--api-key&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;timeout&lt;/code&gt;: Set high for longer operations&lt;/li&gt;
&lt;li&gt;&lt;code&gt;context&lt;/code&gt;: Enable various context providers for code understanding&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;Using cn&lt;/h4&gt;
&lt;p&gt;Run cn from your project directory:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;cn&lt;span class="w"&gt; &lt;/span&gt;--config&lt;span class="w"&gt; &lt;/span&gt;~/.continue/config.yaml
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This starts an interactive session where you can ask questions about your codebase, request changes, and have the assistant use tools to explore and modify files.&lt;/p&gt;
&lt;p&gt;For quick one-off queries, use the print mode:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What does this project do?"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;cn&lt;span class="w"&gt; &lt;/span&gt;-p&lt;span class="w"&gt; &lt;/span&gt;--config&lt;span class="w"&gt; &lt;/span&gt;~/.continue/config.yaml
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;-p&lt;/code&gt; flag prints the response and exits, useful for scripting or quick questions.&lt;/p&gt;
&lt;h4&gt;Example Session&lt;/h4&gt;
&lt;p&gt;Here's what a typical session looks like:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;$ cn --config ~/.continue/config.yaml
&amp;gt; What files handle authentication in this project?

I'll search the codebase for authentication-related code.

[Uses grep tool to search for "auth", "login", "session"]

Based on my search, authentication is handled in:
- src/middleware/auth.js - JWT verification middleware
- src/routes/login.js - Login endpoint
- src/models/user.js - User model with password hashing

&amp;gt; Add rate limiting to the login endpoint

I'll read the current login route and add rate limiting...
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The assistant can read files, search code, make edits, and run commands, all while maintaining context about your project.&lt;/p&gt;
&lt;h3&gt;Troubleshooting&lt;/h3&gt;
&lt;h4&gt;"Context length exceeded" Error&lt;/h4&gt;
&lt;p&gt;If cn fails with a context length error, your vLLM server's &lt;code&gt;--max-model-len&lt;/code&gt; is too low. The Continue CLI adds substantial system prompts. Restart vLLM with at least 32768 tokens:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;docker&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;vllm-dev&lt;span class="w"&gt; &lt;/span&gt;pkill&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vllm serve"&lt;/span&gt;
docker&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-d&lt;span class="w"&gt; &lt;/span&gt;vllm-dev&lt;span class="w"&gt; &lt;/span&gt;bash&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'vllm serve Qwen/Qwen2.5-7B-Instruct \&lt;/span&gt;
&lt;span class="s1"&gt;  --max-model-len 32768 \&lt;/span&gt;
&lt;span class="s1"&gt;  --enable-auto-tool-choice \&lt;/span&gt;
&lt;span class="s1"&gt;  --tool-call-parser hermes \&lt;/span&gt;
&lt;span class="s1"&gt;  --host 0.0.0.0 &amp;gt; /tmp/vllm.log 2&amp;gt;&amp;amp;1'&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;h4&gt;GPU Memory Not Released&lt;/h4&gt;
&lt;p&gt;If vLLM fails to start due to insufficient memory, the previous instance may not have released GPU memory. The cleanest fix is to restart the container:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;docker&lt;span class="w"&gt; &lt;/span&gt;restart&lt;span class="w"&gt; &lt;/span&gt;vllm-dev
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then start vLLM again.&lt;/p&gt;
&lt;h4&gt;Slow Inference&lt;/h4&gt;
&lt;p&gt;If inference is slow, check that GPU acceleration is actually being used:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;watch&lt;span class="w"&gt; &lt;/span&gt;-n&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;rocm-smi
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;You should see GPU utilization when generating tokens. If utilization is 0%, there may be a driver or permission issue.&lt;/p&gt;
&lt;h3&gt;Performance Notes&lt;/h3&gt;
&lt;p&gt;On Strix Halo with the 7B model, I see around 30-50 tokens per second for generation. The first request after starting is slower because of one-time warmup. With the 32B model, speed drops to 10-15 tokens per second but quality improves significantly.&lt;/p&gt;
&lt;p&gt;The unified memory architecture means there's no PCIe bottleneck for loading model weights, which helps with model load times. However, the iGPU compute is slower than a discrete high-end GPU, so this setup prioritizes accessibility over raw speed.&lt;/p&gt;
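&lt;p&gt;If you want to compare your own hardware against these figures, throughput can be estimated from the &lt;code&gt;usage&lt;/code&gt; block that the chat completions endpoint returns. The sketch below is one simple way to do that, with helper names of my own choosing. Note that it times the whole request, so prompt processing is included and the result slightly understates pure generation speed.&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_seconds):
    """Generation throughput for a single request."""
    if elapsed_seconds == 0:
        raise ValueError("elapsed_seconds must be non-zero")
    return completion_tokens / elapsed_seconds


def measure(prompt, api_base="http://localhost:8000/v1",
            model="Qwen/Qwen2.5-7B-Instruct", max_tokens=256):
    """Time one non-streaming request and report tokens per second."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode("utf-8")
    request = urllib.request.Request(
        api_base + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=600) as reply:
        usage = json.load(reply)["usage"]
    elapsed = time.perf_counter() - start
    return tokens_per_second(usage["completion_tokens"], elapsed)
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Run &lt;code&gt;measure("Write a haiku about GPUs")&lt;/code&gt; a couple of times and discard the first result, since that one includes the warmup cost.&lt;/p&gt;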
&lt;h3&gt;Remote Access&lt;/h3&gt;
&lt;p&gt;If your vLLM server is running on a different machine (like a dedicated inference server), you'll need to update the &lt;code&gt;apiBase&lt;/code&gt; in your config to point to that machine's IP address:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="nt"&gt;apiBase&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="l l-Scalar l-Scalar-Plain"&gt;http://192.168.1.100:8000/v1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Make sure port 8000 is accessible through any firewalls. For secure remote access over the internet, consider setting up a VPN or SSH tunnel rather than exposing the port directly:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;ssh&lt;span class="w"&gt; &lt;/span&gt;-L&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;8000&lt;/span&gt;:localhost:8000&lt;span class="w"&gt; &lt;/span&gt;user@remote-server
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This forwards local port 8000 to the remote server, so you can keep using &lt;code&gt;localhost:8000&lt;/code&gt; in your config while the actual inference happens remotely.&lt;/p&gt;
&lt;h3&gt;Alternative Clients&lt;/h3&gt;
&lt;p&gt;While this guide focuses on Continue.dev's &lt;code&gt;cn&lt;/code&gt; CLI, the vLLM server works with any OpenAI-compatible client. Here are a few alternatives worth considering:&lt;/p&gt;
&lt;h4&gt;Aider&lt;/h4&gt;
&lt;p&gt;Aider is another excellent terminal-based coding assistant. Install it with pip:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;pip&lt;span class="w"&gt; &lt;/span&gt;install&lt;span class="w"&gt; &lt;/span&gt;aider-chat
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Then connect to your vLLM server:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;aider&lt;span class="w"&gt; &lt;/span&gt;--model&lt;span class="w"&gt; &lt;/span&gt;openai/Qwen/Qwen2.5-7B-Instruct&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--openai-api-base&lt;span class="w"&gt; &lt;/span&gt;http://localhost:8000/v1&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;--openai-api-key&lt;span class="w"&gt; &lt;/span&gt;none
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Aider has a different interaction style than Continue, using git-aware editing and a focus on making commits. It's worth trying both to see which fits your workflow.&lt;/p&gt;
&lt;h4&gt;Open WebUI&lt;/h4&gt;
&lt;p&gt;For a graphical interface, Open WebUI provides a ChatGPT-like experience that connects to local LLM servers. It's particularly nice for non-coding conversations or when you want to share access with others who prefer a web interface.&lt;/p&gt;
&lt;h4&gt;Direct API Calls&lt;/h4&gt;
&lt;p&gt;For scripting and automation, you can call the vLLM API directly. Here's a Python example:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="kn"&gt;import&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nn"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;"http://localhost:8000/v1/chat/completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Qwen/Qwen2.5-7B-Instruct"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s2"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"Explain this code: def fib(n): ..."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="s2"&gt;"max_tokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# stop on HTTP errors instead of failing on a missing key&lt;/span&gt;
&lt;span class="nb"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="s2"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s2"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="s2"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;This is useful for building custom tools or integrating LLM capabilities into existing scripts.&lt;/p&gt;
&lt;h3&gt;Keeping the Server Running&lt;/h3&gt;
&lt;p&gt;For a production-like setup where you want vLLM to start automatically and stay running, consider creating a startup script or using Docker Compose. Here's a simple approach using a shell script:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="ch"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c1"&gt;# start-vllm.sh&lt;/span&gt;

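docker&lt;span class="w"&gt; &lt;/span&gt;start&lt;span class="w"&gt; &lt;/span&gt;vllm-dev&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&amp;gt;/dev/null&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;||&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Could not start container vllm-dev (does it exist?)"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exit&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;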

docker&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;vllm-dev&lt;span class="w"&gt; &lt;/span&gt;pkill&lt;span class="w"&gt; &lt;/span&gt;-f&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vllm serve"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&amp;gt;/dev/null

sleep&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;

docker&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;exec&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;-d&lt;span class="w"&gt; &lt;/span&gt;vllm-dev&lt;span class="w"&gt; &lt;/span&gt;bash&lt;span class="w"&gt; &lt;/span&gt;-c&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s1"&gt;'vllm serve Qwen/Qwen2.5-7B-Instruct \&lt;/span&gt;
&lt;span class="s1"&gt;  --max-model-len 32768 \&lt;/span&gt;
&lt;span class="s1"&gt;  --enable-auto-tool-choice \&lt;/span&gt;
&lt;span class="s1"&gt;  --tool-call-parser hermes \&lt;/span&gt;
&lt;span class="s1"&gt;  --host 0.0.0.0 \&lt;/span&gt;
&lt;span class="s1"&gt;  --port 8000 &amp;gt; /tmp/vllm.log 2&amp;gt;&amp;amp;1'&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"vLLM starting... check logs with: docker exec vllm-dev tail -f /tmp/vllm.log"&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Make it executable with &lt;code&gt;chmod +x start-vllm.sh&lt;/code&gt; and run it whenever you need to start or restart the server.&lt;/p&gt;
&lt;h3&gt;Conclusion&lt;/h3&gt;
&lt;p&gt;Running vLLM with ROCm opens up local AI coding assistants to AMD GPU users. Combined with Continue.dev's cn CLI, you get a capable terminal-based assistant that can understand your codebase, make edits, and use tools, all running on your own hardware with no cloud dependencies.&lt;/p&gt;
&lt;p&gt;The setup isn't as plug-and-play as using a cloud API, but the privacy benefits and lack of per-token costs make it worthwhile for regular use. And as AMD's ROCm ecosystem continues to mature, expect the experience to get smoother with each release.&lt;/p&gt;
&lt;p&gt;What I appreciate most about this setup is the flexibility. You're not locked into any particular client or workflow. The same vLLM server can power your terminal coding assistant, a web chat interface, custom scripts, and IDE integrations all at once. That's the advantage of running your own inference server: you control the stack from model selection to client interface.&lt;/p&gt;
&lt;p&gt;If you're interested in exploring further, consider trying different models (DeepSeek Coder is excellent for code-focused tasks), experimenting with quantized models for better performance, or setting up the full Continue VS Code extension alongside the CLI for a complete local AI development environment.&lt;/p&gt;</description><category>ai</category><category>amd</category><category>coding-assistant</category><category>continue</category><category>docker</category><category>llm</category><category>rocm</category><category>vllm</category><guid>https://tinycomputers.io/posts/running-vllm-in-docker-with-amd-rocm-and-the-continuedev-cli.html</guid><pubDate>Sun, 25 Jan 2026 20:13:00 GMT</pubDate></item></channel></rss>