Building a Nightly AI Code Scanner with vLLM, ROCm, and JIRA Integration
I've been running a ballistics calculation engine: a Rust physics library surrounded by several components, including a Flask API wrapper with machine learning capabilities, Python bindings, and a Ruby gem, plus Android and iOS apps. The codebase has grown to about 15,000 lines of Rust and another 10,000 lines of Python. At this scale, bugs hide in edge cases: division by zero, floating-point precision issues in transonic drag calculations, unwrap() panics on unexpected input.
What if I could run an AI code reviewer every night while I sleep? Not a cloud API with per-token billing that could run up a $500 bill scanning 50 files, but a local model running on my own hardware, grinding through the codebase and filing JIRA tickets for anything suspicious.
This is the story of building that system.
The Hardware: AMD Strix Halo on ROCm 7.0
I'm running this on a server with an AMD Radeon 8060S (Strix Halo APU) — specifically the gfx1151 architecture. This isn't a data center GPU. It's essentially an integrated GPU with 128GB of shared memory, configured to allocate 96GB as VRAM and leave the rest to system RAM. Not the 80GB of HBM3 you'd get on an H100, but enough to run a 32B parameter model comfortably.
The key insight: for batch processing where latency doesn't matter, you don't need bleeding-edge hardware. A nightly scan can take hours. I'm not serving production traffic; I'm analyzing code files one at a time with a 30-second cooldown between requests. The APU handles this fine.
Hardware Configuration:
- AMD Radeon 8060S (gfx1151 Strix Halo APU)
- 96GB shared memory
- ROCm 7.0 with HSA_OVERRIDE_GFX_VERSION=11.5.1
The HSA_OVERRIDE_GFX_VERSION environment variable is critical. Without it, ROCm doesn't recognize the Strix Halo architecture. This is the kind of sharp edge you hit running ML on AMD consumer hardware.
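A quick way to confirm the override took effect is to ask PyTorch, since the ROCm build that vLLM pulls into the venv exposes the GPU through the usual torch.cuda namespace. A minimal check, not part of the scanner itself:

```python
# Sanity check after exporting HSA_OVERRIDE_GFX_VERSION=11.5.1.
# Assumes the ROCm build of PyTorch installed in the vLLM venv.
import torch

print(torch.cuda.is_available())       # True once ROCm recognizes the APU
print(torch.cuda.get_device_name(0))   # Should mention the Radeon 8060S / gfx1151
print(torch.version.hip)               # Non-empty only on ROCm builds
```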
Model Selection: Qwen2.5-Coder-7B-Instruct
I tested several models:
| Model | Parameters | Context | Quality | Notes |
|---|---|---|---|---|
| DeepSeek-Coder-V2-Lite | 16B | 32k | Good | Requires flash_attn (ROCm issues) |
| Qwen3-Coder-30B | 30B | 32k | Excellent | Too slow on APU |
| Qwen2.5-Coder-7B-Instruct | 7B | 16k | Good | Sweet spot |
| TinyLlama-1.1B | 1.1B | 4k | Poor | Too small for code review |
Qwen2.5-Coder-7B-Instruct hits the sweet spot. It understands Rust and Python well enough to spot real issues, runs fast enough to process 50 files per night, and doesn't require flash attention (which has ROCm compatibility issues on consumer hardware).
vLLM Setup
vLLM provides an OpenAI-compatible API server that makes integration trivial. Here's the startup command:
```bash
source ~/vllm-rocm7-venv/bin/activate
export HSA_OVERRIDE_GFX_VERSION=11.5.1

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.85
```
The --max-model-len 16384 flag limits context to 16k tokens. My code files rarely exceed 500 lines, and they're truncated before analysis anyway, so this is plenty. The --gpu-memory-utilization 0.85 setting leaves headroom for the system.
I run this in a Python venv rather than Docker because ROCm device passthrough with Docker on Strix Halo is finicky. Sometimes you have to choose pragmatism over elegance.
Docker Configuration (When It Works)
For reference, here's the Docker Compose configuration I initially built. It works on dedicated AMD GPUs but has issues on integrated APUs:
```yaml
services:
  vllm:
    image: rocm/vllm-dev:latest
    container_name: vllm-code-scanner
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri:/dev/dri
    group_add:
      - video
      - render
    security_opt:
      - seccomp:unconfined
    cap_add:
      - SYS_PTRACE
    ipc: host
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.5.1
      - PYTORCH_ROCM_ARCH=gfx1151
      - HIP_VISIBLE_DEVICES=0
    volumes:
      - /home/alex/models:/models
      - /home/alex/.cache/huggingface:/root/.cache/huggingface
    ports:
      - "8000:8000"
    command: >
      python -m vllm.entrypoints.openai.api_server
      --model Qwen/Qwen2.5-Coder-7B-Instruct
      --host 0.0.0.0
      --port 8000
      --trust-remote-code
      --max-model-len 16384
      --gpu-memory-utilization 0.85
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 120s

  scanner:
    build: .
    container_name: code-scanner-agent
    depends_on:
      vllm:
        condition: service_healthy
    environment:
      - VLLM_HOST=vllm
      - VLLM_PORT=8000
      - JIRA_EMAIL=${JIRA_EMAIL}
      - JIRA_API_KEY=${JIRA_API_KEY}
    volumes:
      - /home/alex/projects:/projects:ro
      - ./config:/app/config:ro
      - /home/alex/projects/code-scanner-results:/app/results
```
The ipc: host and seccomp:unconfined are necessary for ROCm to function properly. The depends_on with service_healthy ensures the scanner waits for vLLM to be fully loaded before starting — important since model loading can take 2-3 minutes.
The scanner Dockerfile is minimal:
```dockerfile
FROM python:3.11-slim

WORKDIR /app

RUN apt-get update && apt-get install -y \
    git curl ripgrep \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY agent/ /app/agent/
COPY prompts/ /app/prompts/
COPY config/ /app/config/

CMD ["python", "-m", "agent.scanner"]
```
Including ripgrep in the container enables fast pattern matching when the scanner needs to search for related code.
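The scanner shells out to ripgrep rather than reimplementing search. A minimal sketch of what that call might look like — the find_usages helper is hypothetical, not the exact function in my agent:

```python
import subprocess

def find_usages(symbol: str, repo_path: str) -> list[str]:
    """Hypothetical helper: grep a repo for a symbol using ripgrep."""
    result = subprocess.run(
        ["rg", "--line-number", "--no-heading", symbol, repo_path],
        capture_output=True,
        text=True,
    )
    # rg exits with 1 when nothing matches, which is not an error for us
    if result.returncode not in (0, 1):
        raise RuntimeError(result.stderr)
    return result.stdout.splitlines()
```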
The Scanner Architecture
The system has three main components:
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Systemd │ │ vLLM │ │ JIRA │
│ Timer │────▶│ Server │────▶│ API │
│ (11pm daily) │ │ (Qwen 7B) │ │ (tickets) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────────┐
│ Scanner Agent │
│ - File discovery │
│ - Code analysis │
│ - Finding validation│
│ - JIRA integration │
└─────────────────────┘
Configuration
Everything is driven by a YAML configuration file:
```yaml
vllm:
  host: "10.1.1.27"
  port: 8000
  model: "Qwen/Qwen2.5-Coder-7B-Instruct"

schedule:
  start_hour: 23        # 11pm
  end_hour: 6           # 6am
  max_iterations: 50
  cooldown_seconds: 30

repositories:
  - name: "ballistics-engine"
    path: "/home/alex/projects/ballistics-engine"
    languages: ["rust"]
    scan_patterns:
      - "src/**/*.rs"
    exclude_patterns:
      - "target/"
      - "*.lock"

  - name: "ballistics-api"
    path: "/home/alex/projects/ballistics-api"
    languages: ["python", "rust"]
    scan_patterns:
      - "ballistics/**/*.py"
      - "ballistics_rust/src/**/*.rs"
    exclude_patterns:
      - "__pycache__/"
      - "target/"
      - ".venv/"

jira:
  enabled: true
  project_key: "MBA"
  confidence_threshold: 0.75
  labels: ["ai-detected", "code-scanner"]
  max_tickets_per_run: 10
  review_threshold: 5
```
The confidence_threshold: 0.75 is crucial. Without it, the model reports every minor style issue. At 75%, it focuses on things it's genuinely concerned about.
The review_threshold: 5 triggers a different behavior: if the model finds more than 5 issues, it creates a single summary ticket for manual review rather than flooding JIRA with individual tickets. This is a safety valve for when the model goes haywire.
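Putting the two thresholds together, the filing decision is essentially a small gate in front of the JIRA client. A sketch of that logic, assuming placeholder helpers (create_summary_ticket, create_jira_ticket) that stand in for the real integration shown later:

```python
def triage_findings(findings, confidence_threshold=0.75,
                    review_threshold=5, max_tickets_per_run=10):
    """Sketch of the filing gate driven by the jira config block."""
    # Drop anything the model itself isn't confident about
    confident = [f for f in findings if f.confidence >= confidence_threshold]

    if len(confident) > review_threshold:
        # Safety valve: a flood of findings usually means the model went
        # haywire, so file one summary ticket for manual review instead
        create_summary_ticket(confident)    # placeholder helper
        return

    # Otherwise file individual tickets, capped per run
    for finding in confident[:max_tickets_per_run]:
        create_jira_ticket(finding)         # placeholder helper
```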
Structured Outputs with Pydantic
LLMs are great at finding issues but terrible at formatting output consistently. Left to their own devices, they'll return findings as markdown, prose, JSON with missing fields, or creative combinations thereof.
The solution is structured outputs. I define Pydantic models for exactly what I expect:
```python
from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

class Severity(str, Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFO = "info"

class FindingType(str, Enum):
    BUG = "bug"
    PERFORMANCE = "performance"
    SECURITY = "security"
    CODE_QUALITY = "code_quality"
    POTENTIAL_ISSUE = "potential_issue"

class CodeFinding(BaseModel):
    file_path: str = Field(description="Path to the file")
    line_start: int = Field(description="Starting line number")
    line_end: Optional[int] = Field(default=None)
    finding_type: FindingType
    severity: Severity
    title: str = Field(max_length=100)
    description: str
    suggestion: Optional[str] = None
    confidence: float = Field(ge=0.0, le=1.0)
    code_snippet: Optional[str] = None
```
The confidence field is a float between 0 and 1. The model learns to be honest about uncertainty — "I think this might be a bug (0.6)" versus "This is definitely division by zero (0.95)."
In a perfect world, I'd use vLLM's Outlines integration for guided JSON generation. In practice, I found that prompting Qwen for JSON and parsing the response works reliably:
````python
def _analyze_code(self, file_path: str, content: str) -> List[CodeFinding]:
    messages = [
        {"role": "system", "content": self.system_prompt},
        {"role": "user", "content": f"""Analyze this code for bugs and issues.

File: {file_path}

{content}

Return a JSON array of findings. Each finding must have:
- file_path: string
- line_start: number
- finding_type: "bug" | "performance" | "security" | "code_quality"
- severity: "critical" | "high" | "medium" | "low" | "info"
- title: string (max 100 chars)
- description: string
- suggestion: string or null
- confidence: number 0-1

If no issues found, return an empty array: []"""}
    ]

    response = self._call_llm(messages)

    # Parse JSON from response (handles markdown code blocks too)
    if response.strip().startswith('['):
        findings_data = json.loads(response)
    elif '```json' in response:
        json_str = response.split('```json')[1].split('```')[0]
        findings_data = json.loads(json_str)
    elif '[' in response:
        start = response.index('[')
        end = response.rindex(']') + 1
        findings_data = json.loads(response[start:end])
    else:
        return []

    # Validate each finding with Pydantic
    findings = []
    for item in findings_data:
        try:
            finding = CodeFinding(**item)
            findings.append(finding)
        except ValidationError:
            pass  # Skip malformed findings

    return findings
````
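For reference, the guided route would mean sending a JSON Schema alongside the request; vLLM's OpenAI-compatible server accepts extra parameters such as guided_json for this, depending on the version and its structured-output backend. A sketch of what that could look like, assuming Pydantic v2's model_json_schema() and reusing the messages list built above:

```python
import json
import requests

# Sketch: constrain the model's output to the CodeFinding schema.
# guided_json is a vLLM extension to the OpenAI chat API; whether it is
# available depends on the vLLM version and structured-output backend.
schema = {"type": "array", "items": CodeFinding.model_json_schema()}

payload = {
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "messages": messages,          # same messages as in _analyze_code
    "guided_json": schema,
}
response = requests.post(
    "http://10.1.1.27:8000/v1/chat/completions",
    json=payload,
    timeout=600,
)
findings_data = json.loads(response.json()["choices"][0]["message"]["content"])
```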
The System Prompt
The system prompt is where you teach the model what you care about. Here's mine:
```text
You are an expert code reviewer specializing in Rust and Python. Your job
is to find bugs, performance issues, security vulnerabilities, and code
quality problems.

You are analyzing code from a ballistics calculation project that includes:
- A Rust physics engine for trajectory calculations
- Python Flask API with ML models
- PyO3 bindings between Rust and Python

Key areas to focus on:
1. Numerical precision issues (floating point errors, rounding)
2. Edge cases in physics calculations (division by zero, negative values)
3. Memory safety in Rust code
4. Error handling (silent failures, unwrap panics)
5. Performance bottlenecks (unnecessary allocations, redundant calculations)
6. Security issues (input validation, injection vulnerabilities)

Be conservative with findings - only report issues you are confident about.
Avoid false positives.
```
The phrase "Be conservative with findings" is doing heavy lifting. Without it, the model reports everything that looks slightly unusual. With it, it focuses on actual problems.
Timeout Handling
Large files (500+ lines) can take a while to analyze. My initial 120-second timeout caused failures on complex files. I bumped it to 600 seconds (10 minutes):
```python
response = requests.post(
    f"{self.base_url}/chat/completions",
    json=payload,
    headers={"Content-Type": "application/json"},
    timeout=600,
)
```
I also truncate files to 300 lines. For longer files, the model only sees the first 300 lines. This is a trade-off — I might miss bugs in the back half of long files — but it keeps scans predictable and prevents timeout cascades. I plan to revisit this in future iterations.
```python
lines = content.split('\n')
if len(lines) > 300:
    content = '\n'.join(lines[:300])
    logger.info("Truncated to 300 lines for analysis")
```
JIRA Integration
When the scanner finds issues, it creates JIRA tickets automatically. The API is straightforward:
```python
def create_jira_tickets(self, findings: List[CodeFinding]):
    jira_base_url = f"https://{jira_domain}/rest/api/3"

    for finding in findings:
        # Map severity to JIRA priority
        priority_map = {
            Severity.CRITICAL: "Highest",
            Severity.HIGH: "High",
            Severity.MEDIUM: "Medium",
            Severity.LOW: "Low",
            Severity.INFO: "Lowest",
        }

        payload = {
            "fields": {
                "project": {"key": "MBA"},
                "summary": f"[AI] {finding.title}",
                "description": {
                    "type": "doc",
                    "version": 1,
                    "content": [{
                        "type": "paragraph",
                        "content": [
                            {"type": "text", "text": build_description(finding)}
                        ],
                    }],
                },
                "issuetype": {"name": "Bug" if finding.finding_type == FindingType.BUG else "Task"},
                "priority": {"name": priority_map[finding.severity]},
                "labels": ["ai-detected", "code-scanner"],
            }
        }

        response = requests.post(
            f"{jira_base_url}/issue",
            json=payload,
            auth=(jira_email, jira_api_key),
            headers={"Content-Type": "application/json"},
        )
```
The [AI] prefix in the summary makes it obvious these tickets came from the scanner. The ai-detected label allows filtering.
I add a 2-second delay between ticket creation to avoid rate limiting:
time.sleep(2) # Rate limit protection
Systemd Scheduling
The scanner runs nightly via systemd timer:
```ini
# /etc/systemd/system/code-scanner.timer
[Unit]
Description=Run Code Scanner nightly at 11pm

[Timer]
OnCalendar=*-*-* 23:00:00
Persistent=true
RandomizedDelaySec=300

[Install]
WantedBy=timers.target
```
The RandomizedDelaySec=300 adds up to 5 minutes of random delay. This prevents the scanner from always starting at exactly 11:00:00, which helps if multiple services share the same schedule.
The service unit is a oneshot that runs the scanner script:
```ini
# /etc/systemd/system/code-scanner.service
[Unit]
Description=Code Scanner Agent
After=docker.service

[Service]
Type=oneshot
User=alex
WorkingDirectory=/home/alex/projects/ballistics/code-scanner
ExecStart=/home/alex/projects/ballistics/code-scanner/scripts/start_scanner.sh
TimeoutStartSec=25200
```
The TimeoutStartSec=25200 (7 hours) gives the scanner enough time to complete even if it scans every file.
Sample Findings
Here's what the scanner actually finds. From a recent run:
{ "file_path": "/home/alex/projects/ballistics-engine/src/fast_trajectory.rs", "line_start": 115, "finding_type": "bug", "severity": "high", "title": "Division by zero in fast_integrate when velocity approaches zero", "description": "The division dt / velocity_magnitude could result in division by zero if the projectile stalls (velocity_magnitude = 0). This can happen at the apex of a high-angle shot.", "suggestion": "Add a check for velocity_magnitude < epsilon before division, or clamp to a minimum value.", "confidence": 0.85 }
This is a real issue. In ballistics calculations, a projectile fired at a very steep angle has near-zero velocity magnitude at the apex of its arc (exactly zero for a vertical shot). Without a guard, this causes a panic.
Not every finding is valid. The model occasionally flags intentional design decisions as "issues." But at a 75% confidence threshold, the false positive rate is manageable — maybe 1 in 10 findings needs to be closed as "not a bug."
Trade-offs and Lessons
What works well:
- Finding numerical edge cases (division by zero, overflow)
- Spotting unwrap() calls on Options that might be None
- Identifying missing error handling
- Flagging dead code and unreachable branches
What doesn't work as well:
- Understanding business logic (the model doesn't know physics)
- Spotting subtle race conditions in concurrent code
- Avoiding false positives on intentional patterns
Operational lessons:
- Start with a low iteration limit (10-20 files) to test the pipeline
- Monitor the first few runs manually before trusting it
- Keep credentials in .env files excluded from rsync
- The 300-line truncation is aggressive; consider chunking for long files
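On that last point, chunking is straightforward to bolt on: split long files into overlapping windows, analyze each window separately, and offset the reported line numbers. A minimal sketch, where the chunk size and overlap are arbitrary and analyze_chunk is a placeholder for the existing analysis call:

```python
def scan_in_chunks(content: str, chunk_lines: int = 300, overlap: int = 30):
    """Sketch: analyze a long file in overlapping 300-line windows."""
    lines = content.split('\n')
    findings = []
    start = 0
    while start < len(lines):
        window = lines[start:start + chunk_lines]
        chunk_findings = analyze_chunk('\n'.join(window))  # placeholder
        for f in chunk_findings:
            f.line_start += start   # map back to real file line numbers
        findings.extend(chunk_findings)
        if start + chunk_lines >= len(lines):
            break
        start += chunk_lines - overlap  # overlap so boundary bugs aren't missed
    return findings
```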
Handling JSON Parse Failures
Despite asking for JSON, LLMs sometimes produce malformed output. I see two failure modes:
- Truncated JSON: The model runs out of tokens mid-response, leaving an unterminated string or missing closing brackets.
- Wrapped JSON: The model adds explanatory text around the JSON, like "Here are the findings:" before the array.
My parser handles both:
````python
def parse_findings_response(response: str) -> list:
    """Extract JSON from potentially messy LLM output."""
    response = response.strip()

    # Best case: raw JSON array
    if response.startswith('['):
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            pass  # Fall through to extraction

    # Common case: JSON in markdown code block
    if '```json' in response:
        try:
            json_str = response.split('```json')[1].split('```')[0]
            return json.loads(json_str)
        except (IndexError, json.JSONDecodeError):
            pass

    # Fallback: extract JSON array from surrounding text
    if '[' in response and ']' in response:
        try:
            start = response.index('[')
            end = response.rindex(']') + 1
            return json.loads(response[start:end])
        except json.JSONDecodeError:
            pass

    # Give up
    logger.warning("Could not extract JSON from response")
    return []
````
When parsing fails, I log the error and skip that file rather than crashing the entire scan. In a typical 50-file run, I see 2-3 parse failures — annoying but acceptable.
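The skip-and-continue behavior lives in the outer scan loop, alongside the cooldown_seconds setting from the config. A simplified sketch, where read_and_truncate and the surrounding names are placeholders rather than the agent's exact identifiers:

```python
import json
import logging
import time

import requests

logger = logging.getLogger(__name__)

def scan_files(analyzer, files_to_scan, max_iterations=50, cooldown_seconds=30):
    """Sketch of the outer loop: one bad file should not sink the whole run."""
    all_findings = []
    for file_path in files_to_scan[:max_iterations]:
        try:
            content = read_and_truncate(file_path)   # placeholder: 300-line truncation
            all_findings.extend(analyzer._analyze_code(file_path, content))
        except (json.JSONDecodeError, requests.RequestException) as exc:
            # A malformed response or timeout costs us one file, not the scan
            logger.warning("Skipping %s: %s", file_path, exc)
        time.sleep(cooldown_seconds)   # 30-second cooldown between requests
    return all_findings
```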
Testing the Pipeline
Before trusting the scanner with JIRA ticket creation, I ran it in "dry run" mode:
```bash
# Set max iterations low and disable JIRA
export MAX_ITERATIONS=5
# In config: jira.enabled: false
python run_scanner_direct.py
```
This scans just 5 files and prints findings without creating tickets. I manually reviewed each finding:
- True positive: Division by zero in trajectory calculation — good catch
- False positive: Flagged an intentional unwrap() on a guaranteed-Some Option — needs better context
- True positive: Dead code path never executed — valid cleanup suggestion
- Marginal: Style suggestion about variable naming — below my quality threshold
After tuning the confidence threshold and system prompt, the true positive rate improved to roughly 90%.
Monitoring and Observability
The scanner writes detailed logs to stdout and a JSON results file. Sample log output:
```text
2025-11-26 15:48:25 - CODE SCANNER AGENT STARTING
2025-11-26 15:48:25 - Max iterations: 50
2025-11-26 15:48:25 - Model: Qwen/Qwen2.5-Coder-7B-Instruct
2025-11-26 15:48:25 - Starting scan of ballistics-engine
2025-11-26 15:48:25 - Found 35 files to scan
2025-11-26 15:48:25 - Scanning: src/trajectory_sampling.rs
2025-11-26 15:48:25 - Truncated to 300 lines for analysis
2025-11-26 15:49:24 - Found 5 findings (>= 75% confidence)
2025-11-26 15:49:24 - [LOW] Redundant check for step_m value
2025-11-26 15:49:24 - [LOW] Potential off-by-one error
```
The JSON results include full finding details:
{ "timestamp": "20251126_151136", "total_findings": 12, "repositories": [ { "repository": "ballistics-engine", "files_scanned": 35, "findings": [...], "duration_seconds": 1842.5, "iterations_used": 35 } ] }
I keep the last 30 result files (configurable) for historical comparison. Eventually I'll build a dashboard showing finding trends over time.
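The retention policy is just mtime-based pruning of the results directory. A sketch, assuming results land in code-scanner-results/ with a results_*.json naming pattern (the actual filenames may differ):

```python
from pathlib import Path

def prune_results(results_dir: str, keep: int = 30) -> None:
    """Sketch: keep only the newest `keep` result files, delete the rest."""
    files = sorted(
        Path(results_dir).glob("results_*.json"),   # assumed naming pattern
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    for old in files[keep:]:
        old.unlink()
```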
What's Next
The current system is batch-oriented: run once per night, file tickets, done. Future improvements I'm considering:
- Pre-commit integration: Run on changed files only, fast enough for CI
- Retrieval-augmented context: Include related files when analyzing (e.g., when scanning a function, include its callers)
- Learning from feedback: Track which tickets get closed as "not a bug" and use that to tune prompts
- Multi-model ensemble: Run the same code through two models, only file tickets when both agree
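Of those, the pre-commit idea is probably the easiest to prototype: ask git for the staged files and feed only those through the existing analysis path. A rough sketch of the file-selection half, not something I've implemented yet:

```python
import subprocess

def changed_files(repo_path: str, extensions=(".rs", ".py")) -> list[str]:
    """Sketch: list staged files worth scanning before a commit."""
    result = subprocess.run(
        ["git", "-C", repo_path, "diff", "--cached", "--name-only"],
        capture_output=True,
        text=True,
        check=True,
    )
    return [f for f in result.stdout.splitlines() if f.endswith(extensions)]
```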
For now, though, the simple approach works. Every morning I check JIRA, triage the overnight findings, and fix the real bugs. The model isn't perfect, but it finds things I miss. And unlike a human reviewer, it never gets tired, never skips files, and never has a bad day.
Get the Code
I've open-sourced the complete scanner implementation on GitHub: llm-code-scanner
The project includes:
- Dual scanning modes: Fast nightly scans via vLLM and comprehensive weekly analyses through Ollama
- Smart deduplication: SQLite database prevents redundant issue tracking across runs
- JIRA integration: Automatically creates tickets for findings above your confidence threshold
- Email reports: SendGrid integration for daily/weekly summaries
- Multi-language support: Python, Rust, TypeScript, Kotlin, Swift, Go, and more
To get started, clone the repo, configure your scanner_config.yaml with your vLLM/Ollama server details, and run python -m agent.scanner. The README has full setup instructions including environment variables for JIRA and SendGrid integration.