Building a Nightly AI Code Scanner with vLLM, ROCm, and JIRA Integration
I've been running a ballistics calculation engine: a Rust physics library surrounded by several components, including a Flask API wrapper with machine learning capabilities, Python bindings, and a Ruby gem, plus Android and iOS apps. The codebase has grown to about 15,000 lines of Rust and another 10,000 lines of Python. At this scale, bugs hide in edge cases: division by zero, floating-point precision issues in transonic drag calculations, unwrap() panics on unexpected input.
What if I could run an AI code reviewer every night while I sleep? Not a cloud API with per-token billing that could run up a $500 bill scanning 50 files, but a local model running on my own hardware, grinding through the codebase and filing JIRA tickets for anything suspicious.
This is the story of building that system.
The Hardware: AMD Strix Halo on ROCm 7.0
I'm running this on a server with an AMD Radeon 8060S (Strix Halo APU) — specifically the gfx1151 architecture. This isn't a data center GPU. It's essentially an integrated GPU with 128GB of shared memory, configured to allocate 96GB as VRAM and leave the rest to system RAM. Not the 80GB of HBM3 you'd get on an H100, but enough to run a 32B parameter model comfortably.
The key insight: for batch processing where latency doesn't matter, you don't need bleeding-edge hardware. A nightly scan can take hours. I'm not serving production traffic; I'm analyzing code files one at a time with a 30-second cooldown between requests. The APU handles this fine.
Hardware Configuration:
- AMD Radeon 8060S (gfx1151 Strix Halo APU)
- 96GB shared memory
- ROCm 7.0 with HSA_OVERRIDE_GFX_VERSION=11.5.1
The HSA_OVERRIDE_GFX_VERSION environment variable is critical. Without it, ROCm doesn't recognize the Strix Halo architecture. This is the kind of sharp edge you hit running ML on AMD consumer hardware.
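A quick way to confirm the override took effect is to ask PyTorch, since the ROCm build that vLLM pulls into the venv exposes the GPU through the usual torch.cuda namespace. A minimal check, not part of the scanner itself:

```python
# Sanity check after exporting HSA_OVERRIDE_GFX_VERSION=11.5.1.
# Assumes the ROCm build of PyTorch installed in the vLLM venv.
import torch

print(torch.cuda.is_available())       # True once ROCm recognizes the APU
print(torch.cuda.get_device_name(0))   # Should mention the Radeon 8060S / gfx1151
print(torch.version.hip)               # Non-empty only on ROCm builds
```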
Model Selection: Qwen2.5-Coder-7B-Instruct
I tested several models:
| Model | Parameters | Context | Quality | Notes |
|---|---|---|---|---|
| DeepSeek-Coder-V2-Lite | 16B | 32k | Good | Requires flash_attn (ROCm issues) |
| Qwen3-Coder-30B | 30B | 32k | Excellent | Too slow on APU |
| Qwen2.5-Coder-7B-Instruct | 7B | 16k | Good | Sweet spot |
| TinyLlama-1.1B | 1.1B | 4k | Poor | Too small for code review |
Qwen2.5-Coder-7B-Instruct hits the sweet spot. It understands Rust and Python well enough to spot real issues, runs fast enough to process 50 files per night, and doesn't require flash attention (which has ROCm compatibility issues on consumer hardware).
vLLM Setup
vLLM provides an OpenAI-compatible API server that makes integration trivial. Here's the startup command:
```bash
source ~/vllm-rocm7-venv/bin/activate
export HSA_OVERRIDE_GFX_VERSION=11.5.1

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85
```
The --max-model-len 16384 flag limits context to 16k tokens. My code files rarely exceed 500 lines, and they're truncated before analysis anyway, so this is plenty. The --gpu-memory-utilization 0.85 setting leaves headroom for the system.
I run this in a Python venv rather than Docker because ROCm device passthrough with Docker on Strix Halo is finicky. Sometimes you have to choose pragmatism over elegance.
Docker Configuration (When It Works)
For reference, here's the Docker Compose configuration I initially built. It works on dedicated AMD GPUs but has issues on integrated APUs:
```yaml
services:
  vllm:
    image: rocm/vllm-dev:latest
    container_name: vllm-code-scanner
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    security_opt:
      - seccomp:unconfined
    cap_add:
      - SYS_PTRACE
    ipc: host
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.5.1
      - PYTORCH_ROCM_ARCH=gfx1151
      - HIP_VISIBLE_DEVICES=0
    volumes:
      - /home/alex/models:/models
      - /home/alex/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      python -m vllm.entrypoints.openai.api_server
      --model Qwen/Qwen2.5-Coder-7B-Instruct
      --host 0.0.0.0
      --port 8000
      --trust-remote-code
      --max-model-len 16384
      --gpu-memory-utilization 0.85
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

  scanner:
    build: .
    container_name: code-scanner-agent
    depends_on:
      vllm:
        condition: service_healthy
    environment:
      - VLLM_HOST=vllm
      - VLLM_PORT=8000
      - JIRA_EMAIL=${JIRA_EMAIL}
      - JIRA_API_KEY=${JIRA_API_KEY}
    volumes:
      - /home/alex/projects:/projects:ro
      - ./config:/app/config:ro
      - /home/alex/projects/code-scanner-results:/app/results
```
The ipc: host and seccomp:unconfined are necessary for ROCm to function properly. The depends_on with service_healthy ensures the scanner waits for vLLM to be fully loaded before starting — important since model loading can take 2-3 minutes.
The scanner Dockerfile is minimal:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    git curl ripgrep \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY agent/ /app/agent/
COPY prompts/ /app/prompts/
COPY config/ /app/config/

CMD ["python", "-m", "agent.scanner"]
```
Including ripgrep in the container enables fast pattern matching when the scanner needs to search for related code.
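The scanner shells out to ripgrep rather than reimplementing search. A minimal sketch of what that call might look like — the find_usages helper is hypothetical, not the exact function in my agent:

```python
import subprocess

def find_usages(symbol: str, repo_path: str) -> list[str]:
    """Hypothetical helper: grep a repo for a symbol using ripgrep."""
    result = subprocess.run(
        ["rg", "--line-number", "--no-heading", symbol, repo_path],
        capture_output=True,
        text=True,
    )
    # rg exits with 1 when nothing matches, which is not an error for us
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr)
    return result.stdout.splitlines()
```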
The Scanner Architecture
The system has three main components:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Systemd │ │ vLLM │ │ JIRA │
│ Timer │────▶│ Server │────▶│ API │
│ (11pm daily) │ │ (Qwen 7B) │ │ (tickets) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────────┐
│ Scanner Agent │
│ - File discovery │
│ - Code analysis │
│ - Finding validation│
│ - JIRA integration │
└─────────────────────┘
Configuration
Everything is driven by a YAML configuration file:
```yaml
vllm:
  host: "10.1.1.27"
  port: 8000
  model: "Qwen/Qwen2.5-Coder-7B-Instruct"

schedule:
  start_hour: 23        # 11pm
  end_hour: 6           # 6am
  max_iterations: 50
  cooldown_seconds: 30

repositories:
  - name: "ballistics-engine"
    path: "/home/alex/projects/ballistics-engine"
    languages: ["rust"]
    scan_patterns:
      - "src/**/*.rs"
    exclude_patterns:
      - "target/"
      - "*.lock"

  - name: "ballistics-api"
    path: "/home/alex/projects/ballistics-api"
    languages: ["python", "rust"]
    scan_patterns:
      - "ballistics/**/*.py"
      - "ballistics_rust/src/**/*.rs"
    exclude_patterns:
      - "__pycache__/"
      - "target/"
      - ".venv/"

jira:
  enabled: true
  project_key: "MBA"
  confidence_threshold: 0.75
  labels: ["ai-detected", "code-scanner"]
  max_tickets_per_run: 10
  review_threshold: 5
```
The confidence_threshold: 0.75 is crucial. Without it, the model reports every minor style issue. At 75%, it focuses on things it's genuinely concerned about.
The review_threshold: 5 triggers a different behavior: if the model finds more than 5 issues, it creates a single summary ticket for manual review rather than flooding JIRA with individual tickets. This is a safety valve for when the model goes haywire.
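Putting the two thresholds together, the filing decision is essentially a small gate in front of the JIRA client. A sketch of that logic, assuming placeholder helpers (create_summary_ticket, create_jira_ticket) that stand in for the real integration shown later:

```python
def triage_findings(findings, confidence_threshold=0.75,
                    review_threshold=5, max_tickets_per_run=10):
    """Sketch of the filing gate driven by the jira config block."""
    # Drop anything the model itself isn't confident about
    confident = [f for f in findings if f.confidence >= confidence_threshold]

    if len(confident) > review_threshold:
        # Safety valve: a flood of findings usually means the model went
        # haywire, so file one summary ticket for manual review instead
        create_summary_ticket(confident)    # placeholder helper
        return

    # Otherwise file individual tickets, capped per run
    for finding in confident[:max_tickets_per_run]:
        create_jira_ticket(finding)         # placeholder helper
```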
Structured Outputs with Pydantic
LLMs are great at finding issues but terrible at formatting output consistently. Left to their own devices, they'll return findings as markdown, prose, JSON with missing fields, or creative combinations thereof.
The solution is structured outputs. I define Pydantic models for exactly what I expect:
```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"

class FindingType(str, Enum):
    BUG = "bug"
    PERFORMANCE = "performance"
    SECURITY = "security"
    CODE_QUALITY = "code_quality"
    POTENTIAL_ISSUE = "potential_issue"

class CodeFinding(BaseModel):
    file_path: str = Field(description="Path to the file")
    line_start: int = Field(description="Starting line number")
    line_end: Optional[int] = Field(default=None)
    finding_type: FindingType
    severity: Severity
    title: str = Field(max_length=100)
    description: str
    suggestion: Optional[str] = None
    confidence: float = Field(ge=0.0, le=1.0)
    code_snippet: Optional[str] = None
```
The confidence field is a float between 0 and 1. The model learns to be honest about uncertainty — "I think this might be a bug (0.6)" versus "This is definitely division by zero (0.95)."
In a perfect world, I'd use vLLM's Outlines integration for guided JSON generation. In practice, I found that prompting Qwen for JSON and parsing the response works reliably:
````python
def _analyze_code(self, file_path: str, content: str) -> List[CodeFinding]:
    messages = [
        {"role": "system", "content": self.system_prompt},
        {"role": "user", "content": f"""Analyze this code for bugs and issues.

File: {file_path}

{content}

Return a JSON array of findings. Each finding must have:
- file_path: string
- line_start: number
- finding_type: "bug" | "performance" | "security" | "code_quality"
- severity: "critical" | "high" | "medium" | "low" | "info"
- title: string (max 100 chars)
- description: string
- suggestion: string or null
- confidence: number 0-1

If no issues found, return an empty array: []"""}
    ]

    response = self._call_llm(messages)

    # Parse JSON from response (handles markdown code blocks too)
    if response.strip().startswith('['):
        findings_data = json.loads(response)
    elif '```json' in response:
        json_str = response.split('```json')[1].split('```')[0]
        findings_data = json.loads(json_str)
    elif '[' in response:
        start = response.index('[')
        end = response.rindex(']') + 1
        findings_data = json.loads(response[start:end])
    else:
        return []

    # Validate each finding with Pydantic
    findings = []
    for item in findings_data:
        try:
            finding = CodeFinding(**item)
            findings.append(finding)
        except ValidationError:
            pass  # Skip malformed findings

    return findings
````
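For reference, the guided route would mean sending a JSON Schema alongside the request; vLLM's OpenAI-compatible server accepts extra parameters such as guided_json for this, depending on the version and its structured-output backend. A sketch of what that could look like, assuming Pydantic v2's model_json_schema() and reusing the messages list built above:

```python
import json
import requests

# Sketch: constrain the model's output to the CodeFinding schema.
# guided_json is a vLLM extension to the OpenAI chat API; whether it is
# available depends on the vLLM version and structured-output backend.
schema = {"type": "array", "items": CodeFinding.model_json_schema()}

payload = {
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "messages": messages,          # same messages as in _analyze_code
    "guided_json": schema,
}
response = requests.post(
    "http://10.1.1.27:8000/v1/chat/completions",
    json=payload,
    timeout=600,
)
findings_data = json.loads(response.json()["choices"][0]["message"]["content"])
```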
The System Prompt
The system prompt is where you teach the model what you care about. Here's mine:
```text
You are an expert code reviewer specializing in Rust and Python. Your job
is to find bugs, performance issues, security vulnerabilities, and code
quality problems.

You are analyzing code from a ballistics calculation project that includes:
- A Rust physics engine for trajectory calculations
- Python Flask API with ML models
- PyO3 bindings between Rust and Python

Key areas to focus on:
1. Numerical precision issues (floating point errors, rounding)
2. Edge cases in physics calculations (division by zero, negative values)
3. Memory safety in Rust code
4. Error handling (silent failures, unwrap panics)
5. Performance bottlenecks (unnecessary allocations, redundant calculations)
6. Security issues (input validation, injection vulnerabilities)

Be conservative with findings - only report issues you are confident about.
Avoid false positives.
```
The phrase "Be conservative with findings" is doing heavy lifting. Without it, the model reports everything that looks slightly unusual. With it, it focuses on actual problems.
Timeout Handling
Large files (500+ lines) can take a while to analyze. My initial 120-second timeout caused failures on complex files. I bumped it to 600 seconds (10 minutes):
```python
response = requests.post(
    f"{self.base_url}/chat/completions",
    json=payload,
    headers={"Content-Type": "application/json"},
    timeout=600,
)
```
I also truncate files to 300 lines. For longer files, the model only sees the first 300 lines. This is a trade-off — I might miss bugs in the back half of long files — but it keeps scans predictable and prevents timeout cascades. I plan to revisit this in future iterations.
```python
lines = content.split('\n')
if len(lines) > 300:
    content = '\n'.join(lines[:300])
    logger.info("Truncated to 300 lines for analysis")
```
JIRA Integration
When the scanner finds issues, it creates JIRA tickets automatically. The API is straightforward:
```python
def create_jira_tickets(self, findings: List[CodeFinding]):
    jira_base_url = f"https://{jira_domain}/rest/api/3"

    for finding in findings:
        # Map severity to JIRA priority
        priority_map = {
            Severity.CRITICAL: "Highest",
            Severity.HIGH: "High",
            Severity.MEDIUM: "Medium",
            Severity.LOW: "Low",
            Severity.INFO: "Lowest",
        }

        payload = {
            "fields": {
                "project": {"key": "MBA"},
                "summary": f"[AI] {finding.title}",
                "description": {
                    "type": "doc",
                    "version": 1,
                    "content": [{
                        "type": "paragraph",
                        "content": [
                            {"type": "text", "text": build_description(finding)}
                        ],
                    }],
                },
                "issuetype": {"name": "Bug" if finding.finding_type == FindingType.BUG else "Task"},
                "priority": {"name": priority_map[finding.severity]},
                "labels": ["ai-detected", "code-scanner"],
            }
        }

        response = requests.post(
            f"{jira_base_url}/issue",
            json=payload,
            auth=(jira_email, jira_api_key),
            headers={"Content-Type": "application/json"},
        )
```
The [AI] prefix in the summary makes it obvious these tickets came from the scanner. The ai-detected label allows filtering.
I add a 2-second delay between ticket creation to avoid rate limiting:
time.sleep(2) # Rate limit protection
Systemd Scheduling
The scanner runs nightly via systemd timer:
```ini
# /etc/systemd/system/code-scanner.timer
[Unit]
Description=Run Code Scanner nightly at 11pm

[Timer]
OnCalendar=*-*-* 23:00:00
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target
```
The RandomizedDelaySec=300 adds up to 5 minutes of random delay. This prevents the scanner from always starting at exactly 11:00:00, which helps if multiple services share the same schedule.
The service unit is a oneshot that runs the scanner script:
```ini
# /etc/systemd/system/code-scanner.service
[Unit]
Description=Code Scanner Agent
After=docker.service

[Service]
Type=oneshot
User=alex
WorkingDirectory=/home/alex/projects/ballistics/code-scanner
ExecStart=/home/alex/projects/ballistics/code-scanner/scripts/start_scanner.sh
TimeoutStartSec=25200
```
The TimeoutStartSec=25200 (7 hours) gives the scanner enough time to complete even if it scans every file.
Sample Findings
Here's what the scanner actually finds. From a recent run:
{ "file_path": "/home/alex/projects/ballistics-engine/src/fast_trajectory.rs", "line_start": 115, "finding_type": "bug", "severity": "high", "title": "Division by zero in fast_integrate when velocity approaches zero", "description": "The division dt / velocity_magnitude could result in division by zero if the projectile stalls (velocity_magnitude = 0). This can happen at the apex of a high-angle shot.", "suggestion": "Add a check for velocity_magnitude < epsilon before division, or clamp to a minimum value.", "confidence": 0.85 }
This is a real issue. In ballistics calculations, a projectile fired at a very steep angle has near-zero velocity magnitude at the apex of its arc (exactly zero for a vertical shot). Without a guard, this causes a panic.
Not every finding is valid. The model occasionally flags intentional design decisions as "issues." But at a 75% confidence threshold, the false positive rate is manageable — maybe 1 in 10 findings needs to be closed as "not a bug."
Trade-offs and Lessons
What works well:
- Finding numerical edge cases (division by zero, overflow)
- Spotting unwrap() calls on Options that might be None
- Identifying missing error handling
- Flagging dead code and unreachable branches
What doesn't work as well:
- Understanding business logic (the model doesn't know physics)
- Spotting subtle race conditions in concurrent code
- Avoiding false positives on intentional patterns
Operational lessons:
- Start with a low iteration limit (10-20 files) to test the pipeline
- Monitor the first few runs manually before trusting it
- Keep credentials in .env files excluded from rsync
- The 300-line truncation is aggressive; consider chunking for long files
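On that last point, chunking is straightforward to bolt on: split long files into overlapping windows, analyze each window separately, and offset the reported line numbers. A minimal sketch, where the chunk size and overlap are arbitrary and analyze_chunk is a placeholder for the existing analysis call:

```python
def scan_in_chunks(content: str, chunk_lines: int = 300, overlap: int = 30):
    """Sketch: analyze a long file in overlapping 300-line windows."""
    lines = content.split('\n')
    findings = []
    start = 0
    while start < len(lines):
        window = lines[start:start + chunk_lines]
        chunk_findings = analyze_chunk('\n'.join(window))  # placeholder
        for f in chunk_findings:
            f.line_start += start   # map back to real file line numbers
        findings.extend(chunk_findings)
        if start + chunk_lines >= len(lines):
            break
        start += chunk_lines - overlap  # overlap so boundary bugs aren't missed
    return findings
```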
Handling JSON Parse Failures
Despite asking for JSON, LLMs sometimes produce malformed output. I see two failure modes:
- Truncated JSON: The model runs out of tokens mid-response, leaving an unterminated string or missing closing brackets.
- Wrapped JSON: The model adds explanatory text around the JSON, like "Here are the findings:" before the array.
My parser handles both:
````python
def parse_findings_response(response: str) -> list:
    """Extract JSON from potentially messy LLM output."""
    response = response.strip()

    # Best case: raw JSON array
    if response.startswith('['):
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            pass  # Fall through to extraction

    # Common case: JSON in markdown code block
    if '```json' in response:
        try:
            json_str = response.split('```json')[1].split('```')[0]
            return json.loads(json_str)
        except (IndexError, json.JSONDecodeError):
            pass

    # Fallback: extract JSON array from surrounding text
    if '[' in response and ']' in response:
        try:
            start = response.index('[')
            end = response.rindex(']') + 1
            return json.loads(response[start:end])
        except json.JSONDecodeError:
            pass

    # Give up
    logger.warning("Could not extract JSON from response")
    return []
````
When parsing fails, I log the error and skip that file rather than crashing the entire scan. In a typical 50-file run, I see 2-3 parse failures — annoying but acceptable.
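The skip-and-continue behavior lives in the outer scan loop, alongside the cooldown_seconds setting from the config. A simplified sketch, where read_and_truncate and the surrounding names are placeholders rather than the agent's exact identifiers:

```python
import json
import logging
import time

import requests

logger = logging.getLogger(__name__)

def scan_files(analyzer, files_to_scan, max_iterations=50, cooldown_seconds=30):
    """Sketch of the outer loop: one bad file should not sink the whole run."""
    all_findings = []
    for file_path in files_to_scan[:max_iterations]:
        try:
            content = read_and_truncate(file_path)   # placeholder: 300-line truncation
            all_findings.extend(analyzer._analyze_code(file_path, content))
        except (json.JSONDecodeError, requests.RequestException) as exc:
            # A malformed response or timeout costs us one file, not the scan
            logger.warning("Skipping %s: %s", file_path, exc)
        time.sleep(cooldown_seconds)   # 30-second cooldown between requests
    return all_findings
```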
Testing the Pipeline
Before trusting the scanner with JIRA ticket creation, I ran it in "dry run" mode:
```bash
# Set max iterations low and disable JIRA
export MAX_ITERATIONS=5
# In config: jira.enabled: false
python run_scanner_direct.py
```
This scans just 5 files and prints findings without creating tickets. I manually reviewed each finding:
- True positive: Division by zero in trajectory calculation — good catch
- False positive: Flagged an intentional unwrap() on a guaranteed-Some Option — needs better context
- True positive: Dead code path never executed — valid cleanup suggestion
- Marginal: Style suggestion about variable naming — below my quality threshold
After tuning the confidence threshold and system prompt, the true positive rate improved to roughly 90%.
Monitoring and Observability
The scanner writes detailed logs to stdout and a JSON results file. Sample log output:
```text
2025-11-26 15:48:25 - CODE SCANNER AGENT STARTING
2025-11-26 15:48:25 - Max iterations: 50
2025-11-26 15:48:25 - Model: Qwen/Qwen2.5-Coder-7B-Instruct
2025-11-26 15:48:25 - Starting scan of ballistics-engine
2025-11-26 15:48:25 - Found 35 files to scan
2025-11-26 15:48:25 - Scanning: src/trajectory_sampling.rs
2025-11-26 15:48:25 - Truncated to 300 lines for analysis
2025-11-26 15:49:24 - Found 5 findings (>= 75% confidence)
2025-11-26 15:49:24 - [LOW] Redundant check for step_m value
2025-11-26 15:49:24 - [LOW] Potential off-by-one error
```
The JSON results include full finding details:
{ "timestamp": "20251126_151136", "total_findings": 12, "repositories": [ { "repository": "ballistics-engine", "files_scanned": 35, "findings": [...], "duration_seconds": 1842.5, "iterations_used": 35 } ] }
I keep the last 30 result files (configurable) for historical comparison. Eventually I'll build a dashboard showing finding trends over time.
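The retention policy is just mtime-based pruning of the results directory. A sketch, assuming results land in code-scanner-results/ with a results_*.json naming pattern (the actual filenames may differ):

```python
from pathlib import Path

def prune_results(results_dir: str, keep: int = 30) -> None:
    """Sketch: keep only the newest `keep` result files, delete the rest."""
    files = sorted(
        Path(results_dir).glob("results_*.json"),   # assumed naming pattern
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    for old in files[keep:]:
        old.unlink()
```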
What's Next
The current system is batch-oriented: run once per night, file tickets, done. Future improvements I'm considering:
- Pre-commit integration: Run on changed files only, fast enough for CI
- Retrieval-augmented context: Include related files when analyzing (e.g., when scanning a function, include its callers)
- Learning from feedback: Track which tickets get closed as "not a bug" and use that to tune prompts
- Multi-model ensemble: Run the same code through two models, only file tickets when both agree
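Of those, the pre-commit idea is probably the easiest to prototype: ask git for the staged files and feed only those through the existing analysis path. A rough sketch of the file-selection half, not something I've implemented yet:

```python
import subprocess

def changed_files(repo_path: str, extensions=(".rs", ".py")) -> list[str]:
    """Sketch: list staged files worth scanning before a commit."""
    result = subprocess.run(
        ["git", "-C", repo_path, "diff", "--cached", "--name-only"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [f for f in result.stdout.splitlines() if f.endswith(extensions)]
```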
For now, though, the simple approach works. Every morning I check JIRA, triage the overnight findings, and fix the real bugs. The model isn't perfect, but it finds things I miss. And unlike a human reviewer, it never gets tired, never skips files, and never has a bad day.
Get the Code
I've open-sourced the complete scanner implementation on GitHub: llm-code-scanner
The project includes:
- Dual scanning modes: Fast nightly scans via vLLM and comprehensive weekly analyses through Ollama
- Smart deduplication: SQLite database prevents redundant issue tracking across runs
- JIRA integration: Automatically creates tickets for findings above your confidence threshold
- Email reports: SendGrid integration for daily/weekly summaries
- Multi-language support: Python, Rust, TypeScript, Kotlin, Swift, Go, and more
To get started, clone the repo, configure your scanner_config.yaml with your vLLM/Ollama server details, and run python -m agent.scanner. The README has full setup instructions including environment variables for JIRA and SendGrid integration.