<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TinyComputers.io (Posts about finance)</title><link>https://tinycomputers.io/</link><description></description><atom:link href="https://tinycomputers.io/categories/finance.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 A.C. Jokela 
&lt;a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"&gt;&lt;img alt="" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /&gt; Creative Commons Attribution-ShareAlike&lt;/a&gt;&amp;nbsp;|&amp;nbsp;
</copyright><lastBuildDate>Fri, 08 May 2026 19:39:30 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Building Stalker: A Mid-Cap Trading Bot and the Data Network That Feeds It</title><link>https://tinycomputers.io/posts/building-stalker-a-mid-cap-trading-bot-and-the-data-network-that-feeds-it.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;div class="audio-widget"&gt;
&lt;div class="audio-widget-header"&gt;
&lt;span class="audio-widget-icon"&gt;🎧&lt;/span&gt;
&lt;span class="audio-widget-label"&gt;Listen to this article&lt;/span&gt;
&lt;/div&gt;
&lt;audio controls preload="metadata"&gt;
&lt;source src="https://tinycomputers.io/building-stalker-a-mid-cap-trading-bot-and-the-data-network-that-feeds-it_tts.mp3" type="audio/mpeg"&gt;
&lt;/source&gt;&lt;/audio&gt;
&lt;div class="audio-widget-footer"&gt;60 min · AI-generated narration&lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;Three years ago I &lt;a href="https://tinycomputers.io/posts/a-little-rust-a-little-python-and-some-openai-custom-company-stock-reports.html"&gt;built a Slack bot that generated company stock reports&lt;/a&gt;. You'd type &lt;code&gt;/report TSLA&lt;/code&gt; and a Lambda fan-out would pull yfinance data, scrape recent news through BeautifulSoup, run technical indicators with the &lt;code&gt;ta&lt;/code&gt; library, and ask GPT-4 to write three paragraphs about what was happening with the stock. The output landed back in Slack and on a static S3 site. It was fun. It was a toy.&lt;/p&gt;
&lt;p&gt;It told you what one stock looked like. It didn't tell you what to do about it.&lt;/p&gt;
&lt;p&gt;Stalker is what I built after I stopped wanting toys.&lt;/p&gt;
&lt;p&gt;It's an autonomous mid-cap equity trading bot. It runs on AWS Lambda. It reads a daily macro brief, ranks a 300-name universe by factor scores, asks Claude Sonnet 4.6 to propose orders against the current portfolio, runs the proposed plan through a deterministic risk gate, and submits the survivors to Alpaca's paper trading API with deterministic client-order IDs so re-fires are idempotent. It logs every decision, every fill, every rejection. It emails me a daily report. It runs without me touching it.&lt;/p&gt;
&lt;p&gt;Right now it has 12 positions in a paper account modeled as a synthetic $1,000 seeded deposit. Inception-to-date return on that $1,000 baseline is +3.76% against SPY's +2.12% — so +1.64pp of alpha over four weeks of live operation. (To head off a misreading of the next number: the position book currently shows $1,576 of market value, which is &lt;strong&gt;not&lt;/strong&gt; $576 of gains on the seed. It's over-deployment from a bug I'll cover later — the bot was reading its cash limit as auto-replenishing every brief day instead of flowing from the order ledger, so cumulative buys ran past what a real $1,000 account could have funded. The fix is in; the bot is currently trimming positions back to within the seed. The +3.76% number is the honest baseline-relative return.) The numbers don't matter at four weeks anyway — that's noise. What matters is that the system is running, the architecture is settled, and the methodology for measuring whether any of it actually works is pre-registered.&lt;/p&gt;
&lt;p&gt;Stalker is one of six related projects. The other five are data sources that feed it:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Headwater&lt;/strong&gt; is a daily financial-newsletter aggregator that emits a structured macro brief twice each weekday, once in the morning and once in the afternoon. It reads the writers I follow, classifies what they're saying, and ships a JSON document with a regime call, sector tilts, themes, and a watchlist.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Estuary&lt;/strong&gt; does cross-source consensus detection. When five different writers all flag the same ticker in the same week, that's a cluster, and clusters end up in a daily brief at 23:30 UTC.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PrivateEye&lt;/strong&gt; decodes paywalled-newsletter teasers into ticker picks. The actual decoded stock recommendations land in a daily digest.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tributary&lt;/strong&gt; ingests SEC EDGAR 8-K filings, extracts each item-level event, and classifies materiality. NT-10K late filings, 4.02 restatements, 1.03 bankruptcies, 5.02 executive departures — Stalker reads these as risk-and-opportunity overlays.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Goldfinch&lt;/strong&gt; pulls federal prime contract awards from USAspending.gov, maps recipient legal entities to public tickers, and emits the matches as confirmatory revenue-disclosure signals.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each one is its own AWS-account-scoped project — its own Lambdas, its own DynamoDB tables, its own SES inbound endpoint. They're loosely coupled: each emits a JSON document on a known schema; Stalker reads from each one through a producer-specific loader at analyze time. Adding a new feeder is a Pydantic model plus a load function plus an SES rule. The architecture is "more sources are better, but no source is required."&lt;/p&gt;
&lt;p&gt;This is the post that explains how all of it hangs together.&lt;/p&gt;
&lt;h3&gt;The Data Network&lt;/h3&gt;
&lt;p&gt;The shape of the network matters. Each feeder is a separate project because each one solves a different problem. Trying to put all the signal extraction inside Stalker would have produced a monolith that does five things badly. The split lets each project focus on one thing — newsletter aggregation, consensus detection, teaser decoding, EDGAR ingestion, contract scraping — and lets Stalker focus on the trading layer alone.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Headwater&lt;/strong&gt; generates the macro overlay. The structured brief carries a &lt;code&gt;regime&lt;/code&gt; field (one of &lt;code&gt;risk_off&lt;/code&gt;, &lt;code&gt;transitional&lt;/code&gt;, &lt;code&gt;neutral&lt;/code&gt;, &lt;code&gt;transitional_risk_on&lt;/code&gt;, &lt;code&gt;risk_on&lt;/code&gt;), a list of &lt;code&gt;sector_tilts&lt;/code&gt; with &lt;code&gt;view&lt;/code&gt; and &lt;code&gt;strength&lt;/code&gt;, a list of &lt;code&gt;thematic_views&lt;/code&gt; with &lt;code&gt;affected_sectors&lt;/code&gt;, a &lt;code&gt;key_risks&lt;/code&gt; list with horizons, and a &lt;code&gt;watchlist&lt;/code&gt; of tickers under active discussion. Stalker doesn't trade the watchlist names directly — those are mostly megacaps that fall outside the mid-cap filter — but the macro block converts directly into multipliers on the combined factor score. A bullish-Energy tilt at high strength multiplies every Energy mid-cap's &lt;code&gt;combined_z&lt;/code&gt; by 1.20 before ranking. A bearish-Tech tilt at medium strength multiplies it by 0.90. The factor selection is sector-tilted by the macro view at the rank-construction stage, not by Claude's interpretation at the prompt stage. The macro pipe into selection is deterministic and traceable.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Estuary&lt;/strong&gt; plays a different role. It tracks what individual newsletter writers are publishing across a window, computes consensus when multiple writers converge on the same ticker within a few days, and emits the clusters along with the per-writer high-conviction picks. A cluster of five sources flagging a single name in the same week is a stronger signal than five separate calls scattered across the year. Stalker reads Estuary's output as a confirmatory soft signal — when a name in the candidate universe also has Estuary cluster support, Claude can use that as a tie-breaker between similarly-ranked candidates.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;PrivateEye&lt;/strong&gt; is the cheapest unit-economics piece in the stack. Financial-newsletter teasers are designed to make you subscribe — they're written to gesture at a recommendation without giving it away. PrivateEye reads the teasers and extracts the underlying ticker pick. The decoded picks ship in a daily digest. Most of them are megacaps that don't apply to Stalker, but the ones that do fall in the $2B–$10B band become another confirmatory soft signal alongside the Estuary clusters.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Tributary&lt;/strong&gt; is the structural-events feed. The SEC publishes 8-K filings continuously, and many of them are noise — boilerplate amendments, routine disclosures, exhibit lists. The interesting ones are the 5.02 executive departures, the 1.03 bankruptcy filings, the 4.02 audit restatements, the 5.07 favorable shareholder votes, and the 8.01 material acquisitions. Tributary classifies each item-level event by materiality and salience, summarizes the substance, and emits one record per filing. Stalker uses high-materiality Tributary events as risk overlays: a recent NT-10K on a holding is a reason to consider trimming. The architecture treats negative-materiality items as exit triggers and positive items as additional context.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Goldfinch&lt;/strong&gt; is the newest feeder. Federal contract awards land in USAspending.gov within a few days of action; they're a real fundamental disclosure that often precedes earnings discussion of the same revenue. The challenge is the recipient-to-ticker mapping — federal awards are made to legal entities, not to listed companies, and many awards go to subsidiaries or government-services divisions whose parent ticker isn't obvious. Goldfinch handles the mapping (with a confidence rating per match), filters to material awards, and emits the matched records. A $300M Department of Defense award to a $5B mid-cap defense contractor is a real near-term revenue event; the same award to a $200B megacap is rounding error. Stalker weighs the signal accordingly.&lt;/p&gt;
&lt;p&gt;Each feeder ships through SES inbound mail to its own recipient at &lt;code&gt;in.stalker.bot&lt;/code&gt;. &lt;code&gt;briefs@&lt;/code&gt; is Headwater. &lt;code&gt;estuary@&lt;/code&gt; is Estuary. &lt;code&gt;events@&lt;/code&gt; is Tributary. &lt;code&gt;picks@&lt;/code&gt; is PrivateEye. Goldfinch is a special case — it writes directly to a shared DynamoDB table because it runs in the same AWS account, but conceptually it's the same pattern: structured JSON, schema-versioned, lenient on unknown fields. SES drops each inbound message into S3, an EventBridge notification fires the Stalker ingest Lambda, the Lambda dispatches to the producer-specific parser, validates against a Pydantic model, archives the parsed payload, indexes a row in DynamoDB, and (for Headwater only) emits a &lt;code&gt;BriefIngested&lt;/code&gt; event that triggers the analysis layer.&lt;/p&gt;
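&lt;p&gt;To make "lenient on unknown fields" concrete, here's a minimal sketch of a feeder payload model and the producer dispatch. Everything beyond the fields named above (&lt;code&gt;regime&lt;/code&gt;, &lt;code&gt;sector_tilts&lt;/code&gt;, &lt;code&gt;watchlist&lt;/code&gt;) is illustrative, not Stalker's actual schema:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Hypothetical sketch of a feeder model -- not the production schema.
from pydantic import BaseModel, ConfigDict

class HeadwaterBrief(BaseModel):
    # Tolerate unknown fields so upstream schema additions don't break ingest.
    model_config = ConfigDict(extra="ignore")

    schema_version: str
    regime: str                    # "risk_off" .. "risk_on"
    sector_tilts: list[dict] = []  # [{"sector", "view", "strength"}, ...]
    watchlist: list[str] = []

# One parser per SES recipient; estuary@, events@, picks@ map to their own models.
PARSERS = {"briefs@in.stalker.bot": HeadwaterBrief}

def parse_inbound(recipient, payload):
    return PARSERS[recipient].model_validate(payload)  # raises on malformed input
&lt;/code&gt;&lt;/pre&gt;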
&lt;p&gt;That's the inbound side. The outbound side is one Lambda — the analyzer.&lt;/p&gt;
&lt;h3&gt;The Trading Core&lt;/h3&gt;
&lt;p&gt;Stalker's actual decision logic is layered. There's a deterministic factor stack at the bottom, an LLM call in the middle, and a deterministic risk gate at the top. The LLM has freedom in the middle layer; everything below and above is mechanical.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The universe.&lt;/strong&gt; Every morning at 04:00 UTC, a Lambda hits FMP's screener API for US-listed stocks with market cap between $2 billion and $10 billion that are tradable on Alpaca. The result is a partition in the &lt;code&gt;stalker-universe&lt;/code&gt; DynamoDB table keyed on &lt;code&gt;refresh_date&lt;/code&gt;. The mid-cap band is the strategy's first commitment: Stalker doesn't trade megacaps (they're efficient, no edge available) and doesn't trade microcaps (liquidity, regulatory, and size-premium hazards). Mid-cap is the band where factor strategies historically have shown the most edge over passive indexing.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Factor scoring.&lt;/strong&gt; Fifteen minutes after the universe refresh, a second Lambda computes per-name factor scores. Three factors (the raw computations are sketched just after this list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Momentum&lt;/strong&gt;: 12-month return minus the most recent month — the classical Jegadeesh-Titman 12-1 factor. The skip-month removes short-term reversal noise.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quality&lt;/strong&gt;: trailing-twelve-month ROIC plus operating margin, equally weighted. This is the post-ADR-011 production definition. (Stalker tracks load-bearing strategy decisions in numbered Architecture Decision Records — short Markdown documents in &lt;code&gt;docs/adr/&lt;/code&gt; that record the context, the evidence, the chosen option, and the reasoning. ADR-011 was the one that switched the quality factor.) The original ROE-plus-gross-margin definition turned out to be silently destructive in backtest, dragging alpha by 31 percentage points over the 3-year survivorship-bias-free window. The ADR walks through the seven candidate definitions, the alpha and Sharpe under each, and the reasoning for picking ROIC-plus-operating-margin over single-metric ROIC even though the latter showed a slightly higher headline alpha. The other ADRs cited in this post (009, 010, 013, 014) follow the same pattern.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Low-vol&lt;/strong&gt;: the negative of 60-day realized volatility. Lower-vol names get higher z-scores.&lt;/li&gt;
&lt;/ul&gt;
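&lt;p&gt;As a concrete sketch of those three computations, assuming a daily close series and roughly 21 trading days per month (the exact windows Stalker uses are my assumption):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np

def raw_factors(closes, roic_ttm, op_margin_ttm):
    """closes: at least ~13 months of daily closing prices, oldest first."""
    closes = np.asarray(closes, dtype=float)
    # Momentum: 12-month return minus the most recent month (12-1).
    momentum = closes[-21] / closes[-252] - 1.0
    # Quality: TTM ROIC plus operating margin, equally weighted (ADR-011).
    quality = 0.5 * roic_ttm + 0.5 * op_margin_ttm
    # Low-vol: negative of 60-day realized volatility of daily returns.
    rets = np.diff(closes[-61:]) / closes[-61:-1]
    low_vol = -float(np.std(rets, ddof=1))
    return momentum, quality, low_vol
&lt;/code&gt;&lt;/pre&gt;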
&lt;p&gt;Each factor is z-scored within sector — a Financials name's quality is graded against other Financials, not against Tech. The sector-neutral approach prevents structural concentration: REITs and banks naturally have high ROE and low vol, so cross-sectional z-scoring would otherwise dump the whole portfolio into one or two sectors regardless of the macro view. Sectors with fewer than five names skip scoring; their z-scores would be noise.&lt;/p&gt;
&lt;p&gt;The three factor z-scores combine into a &lt;code&gt;combined_z&lt;/code&gt; via configurable weights. The current production weights are 0.55 momentum, 0.225 quality, 0.225 low-vol — momentum-tilted because the bias-corrected backtest sweep showed monotonic alpha improvement up through 0.55 with diminishing return beyond that. The factor stack writes top-300 by &lt;code&gt;combined_z&lt;/code&gt; back to DynamoDB with &lt;code&gt;in_universe=true&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;The macro overlay applies here.&lt;/strong&gt; Before the factor scores are written, Headwater's sector tilts are pulled in and applied as multiplicative adjustments to &lt;code&gt;combined_z&lt;/code&gt; per sector. This is the "macro pipe" mentioned earlier: the sector view from the structured brief becomes a factor in the rank itself. A &lt;code&gt;bullish/high&lt;/code&gt; Energy tilt floats Energy names up; a &lt;code&gt;bearish/medium&lt;/code&gt; Tech tilt sinks Tech names. The multipliers are deterministic — defined in &lt;code&gt;macro_sizing.py&lt;/code&gt; and unit-tested — so a given brief plus a given factor snapshot produces a reproducible rank. No LLM in this layer.&lt;/p&gt;
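&lt;p&gt;Putting the last three paragraphs together, here is a condensed sketch of the rank construction: sector-neutral z-scores, the production weights, and the macro multipliers. The row layout and the tilt-to-multiplier mapping are illustrative; the production multipliers live in &lt;code&gt;macro_sizing.py&lt;/code&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from collections import defaultdict

WEIGHTS = {"momentum": 0.55, "quality": 0.225, "low_vol": 0.225}

def rank_universe(rows, sector_multipliers):
    """rows: dicts with ticker, sector, and the three raw factor values.
    sector_multipliers: e.g. {"Energy": 1.20, "Technology": 0.90}."""
    by_sector = defaultdict(list)
    for r in rows:
        by_sector[r["sector"]].append(r)
    ranked = []
    for sector, names in by_sector.items():
        if len(names) &amp;lt; 5:
            continue  # too few names; within-sector z-scores would be noise
        for factor in WEIGHTS:
            vals = np.array([r[factor] for r in names], dtype=float)
            mu, sd = vals.mean(), vals.std(ddof=1)
            for r in names:
                r[factor + "_z"] = (r[factor] - mu) / sd if sd else 0.0
        for r in names:
            z = sum(w * r[f + "_z"] for f, w in WEIGHTS.items())
            r["combined_z"] = z * sector_multipliers.get(sector, 1.0)  # macro overlay
            ranked.append(r)
    ranked.sort(key=lambda r: r["combined_z"], reverse=True)
    return ranked[:300]  # top-300 written back with in_universe=true
&lt;/code&gt;&lt;/pre&gt;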
&lt;p&gt;&lt;strong&gt;The analyze layer.&lt;/strong&gt; When a Headwater brief lands in S3 and the ingest Lambda emits &lt;code&gt;BriefIngested&lt;/code&gt;, the analyze Lambda fires. It loads the brief, fetches the current Alpaca account state, fetches today's universe partition with factor scores, queries the four producer feeder loaders for recent confirmatory signals, builds a structured user message, and calls Claude Sonnet 4.6 with a tool-use schema that constrains the output to a &lt;code&gt;propose_trade_plan&lt;/code&gt; JSON object. The system prompt explains the strategy posture, the hard rules the risk gate enforces, and the role of each feeder. The user message contains the macro block, the top-50 candidates by &lt;code&gt;combined_z&lt;/code&gt; with their factor scores and Kelly-derived suggested allocations, the current positions, the universe whitelist, and the soft-signal sections from each feeder.&lt;/p&gt;
&lt;p&gt;Claude's job is constrained selection. It picks 6–10 names from the candidate list (or the existing positions) and assigns target dollar values. It cannot trade outside the universe whitelist. It cannot propose a single trade larger than 15% of NAV. It cannot exceed 25% of NAV in any single position post-trade. It cannot buy a name with earnings inside seven days. The system prompt makes these constraints explicit, but the risk gate enforces them mechanically — the LLM is one defense layer, not the only one.&lt;/p&gt;
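&lt;p&gt;A sketch of what that tool-constrained call looks like with the Anthropic Python SDK. The schema fields, the &lt;code&gt;SYSTEM_PROMPT&lt;/code&gt; and &lt;code&gt;user_message&lt;/code&gt; variables, and the model string are stand-ins; the real &lt;code&gt;propose_trade_plan&lt;/code&gt; schema carries more structure than this:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import anthropic

client = anthropic.Anthropic()

TRADE_PLAN_TOOL = {
    "name": "propose_trade_plan",
    "description": "Propose orders against the current portfolio.",
    "input_schema": {  # illustrative subset, not the production schema
        "type": "object",
        "properties": {
            "orders": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "ticker": {"type": "string"},
                        "action": {"type": "string", "enum": ["buy", "sell"]},
                        "target_value_usd": {"type": "number"},
                    },
                    "required": ["ticker", "action", "target_value_usd"],
                },
            },
        },
        "required": ["orders"],
    },
}

resp = client.messages.create(
    model="claude-sonnet-4-6",  # model string assumed from the post's naming
    max_tokens=4096,
    system=SYSTEM_PROMPT,       # strategy posture plus the hard rules
    messages=[{"role": "user", "content": user_message}],
    tools=[TRADE_PLAN_TOOL],
    # Force a tool answer so the output is always schema-shaped JSON.
    tool_choice={"type": "tool", "name": "propose_trade_plan"},
)
plan = resp.content[0].input  # the proposed plan, headed for risk.evaluate()
&lt;/code&gt;&lt;/pre&gt;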
&lt;p&gt;The proposed plan flows into &lt;code&gt;risk.evaluate()&lt;/code&gt;, a pure-Python function that takes the state and the proposed orders and returns one of four decisions: &lt;code&gt;approved&lt;/code&gt;, &lt;code&gt;needs_human_approval&lt;/code&gt;, &lt;code&gt;rejected&lt;/code&gt;, or — under specific edge cases — a noise band that triggers a re-prompt. The risk gate enforces the following (a simplified sketch follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;25% per-position cap (post-trade)&lt;/li&gt;
&lt;li&gt;15% per-trade cap on buys&lt;/li&gt;
&lt;li&gt;2% minimum cash buffer&lt;/li&gt;
&lt;li&gt;31-day IRS wash-sale block on buys of recently sold-at-loss names&lt;/li&gt;
&lt;li&gt;Earnings veto (no buys within 7 days of next earnings)&lt;/li&gt;
&lt;li&gt;Daily drawdown halt (-5% intraday → no new buys)&lt;/li&gt;
&lt;li&gt;Total drawdown halt (-15% from inception → no new buys)&lt;/li&gt;
&lt;li&gt;Losing-streak halt (3 consecutive losing sells → no new buys)&lt;/li&gt;
&lt;li&gt;Universe whitelist (any non-whitelisted ticker is rejected)&lt;/li&gt;
&lt;/ul&gt;
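&lt;p&gt;The sketch below shows the shape of the gate: a pure function over account state and proposed orders that returns a decision plus machine-readable reasons. The field names are assumptions, and several checks (the cash buffer, the losing-streak halt, the re-prompt noise band) are omitted for brevity:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def evaluate(state, orders):
    """Simplified sketch: the real module also returns needs_human_approval
    and a re-prompt noise band; field names here are assumptions."""
    reasons = []
    nav = state["nav"]
    for o in orders:
        if o["ticker"] not in state["universe_whitelist"]:
            reasons.append(o["ticker"] + ": not in universe")
        if o["action"] == "buy":
            if o["value"] &amp;gt; 0.15 * nav:
                reasons.append(o["ticker"] + ": buy exceeds 15% of NAV")
            post = state["position_value"].get(o["ticker"], 0.0) + o["value"]
            if post &amp;gt; 0.25 * nav:
                reasons.append(o["ticker"] + ": post-trade position over 25% of NAV")
            if o["ticker"] in state["wash_sale_blocked"]:
                reasons.append(o["ticker"] + ": inside 31-day wash-sale window")
            if o.get("days_to_earnings", 99) &amp;lt;= 7:
                reasons.append(o["ticker"] + ": earnings inside 7 days")
            if state["intraday_drawdown"] &amp;lt;= -0.05 or state["total_drawdown"] &amp;lt;= -0.15:
                reasons.append(o["ticker"] + ": drawdown halt, no new buys")
    return ("rejected", reasons) if reasons else ("approved", [])
&lt;/code&gt;&lt;/pre&gt;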
&lt;p&gt;The risk gate's outputs are logged in DynamoDB regardless of whether execution proceeds. If the plan is &lt;code&gt;approved&lt;/code&gt;, the executor Lambda picks it up via EventBridge and submits each order to Alpaca with a deterministic &lt;code&gt;client_order_id&lt;/code&gt; formed from the SHA-1 of &lt;code&gt;(plan_id, ticker, action)&lt;/code&gt;. The deterministic ID makes re-fires idempotent: if the executor crashes after submitting three of five orders, the next invocation tries to submit the same orders, hits a 422 collision on the three already-submitted, fetches their existing state, and proceeds with the remaining two.&lt;/p&gt;
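&lt;p&gt;The ID derivation is small enough to show whole. The separator is a guess; the property that matters is only that the same &lt;code&gt;(plan_id, ticker, action)&lt;/code&gt; triple always hashes to the same string:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import hashlib

def client_order_id(plan_id, ticker, action):
    # Deterministic: a crashed executor re-fires with identical IDs, Alpaca
    # answers 422 for the duplicates, and the executor fetches their state
    # instead of double-submitting.
    raw = f"{plan_id}|{ticker}|{action}"  # separator is an assumption
    return hashlib.sha1(raw.encode()).hexdigest()
&lt;/code&gt;&lt;/pre&gt;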
&lt;p&gt;There's one nuance in the executor that took an incident to find. Alpaca's wash-trade detection rejects new orders on a symbol when an opposite-side order is already open — labeled "potential wash trade" in the rejection message. We discovered this when a sell got blocked because a stale stop-loss was still on the book. The fix is a preflight: before each submission, the executor lists open Alpaca orders for the symbol, cancels them, syncs the matching &lt;code&gt;stalker-orders&lt;/code&gt; rows to &lt;code&gt;cancelled&lt;/code&gt;, and then submits the new order. Belt and suspenders — Alpaca's enforcement still works, but we don't rely on it.&lt;/p&gt;
&lt;h3&gt;The Seeded Account&lt;/h3&gt;
&lt;p&gt;There's a subtle constraint baked into the architecture that took two iterations to get right: the bot is supposed to behave as if I'd actually deposited $1,000 of real money, not as if it had access to Alpaca's $100,000 paper-account default. Live trading would cap position sizes at what a $1,000 retail account can actually fund; paper testing should exercise the same sizing logic.&lt;/p&gt;
&lt;p&gt;The first cut of this used a simple &lt;code&gt;min(real_cash, NAV_CAP)&lt;/code&gt; cap on the cash field reported to the LLM and risk gate. That worked when the bot had no positions. Once positions accumulated, the cap silently broke the percentage math: a position with $230 of market value plus a proposed $40 buy is $270 against the real $2,500 NAV (a benign 10.8% of portfolio), but the cap reported the NAV as $1,000, making the same position read as 27% — over the hard 25% cap. Three plans got rejected over a 36-hour window before the bug surfaced.&lt;/p&gt;
&lt;p&gt;The first fix tightened that: cap the cash, not the NAV. NAV becomes &lt;code&gt;capped_cash + sum(position market values)&lt;/code&gt;, which is a coherent number that reflects real portfolio percentages while preserving the small-account sizing exercise. That fixed the rejection bug.&lt;/p&gt;
&lt;p&gt;But it introduced a deeper problem: the cash kept reading $1,000 every brief, replenished from Alpaca's bottomless paper-account seed. A real $1,000 account doesn't work that way. A real account spends $200 to buy a position; cash goes to $800. The bot was effectively redeploying $1,000 of fresh capital every brief day. Over four weeks the cumulative buys totaled $2,123 against $557 of sells — 2.1× the intended seed. A real account would have hit a cash wall after about ten buys.&lt;/p&gt;
&lt;p&gt;The honest fix is the second iteration: cash flows from the order ledger. The seeded cash at any moment is &lt;code&gt;$1,000 + sum(filled_sells) − sum(filled_buys)&lt;/code&gt;. The function scans &lt;code&gt;stalker-orders&lt;/code&gt; for filled and partially-filled rows, sums the &lt;code&gt;filled_qty * filled_avg_price&lt;/code&gt; per side, and returns the seeded cash. NAV is &lt;code&gt;seeded_cash + positions_mv&lt;/code&gt;. Buying power is &lt;code&gt;max(0, seeded_cash)&lt;/code&gt; — no margin in the seeded model. When seeded cash goes negative (the bot is over-deployed relative to the seed), the LLM sees the negative number with an inline note instructing trim-first behavior, and the risk gate's existing 2% cash-buffer check naturally enforces rebalance-only mode until sells refill the seed.&lt;/p&gt;
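&lt;p&gt;A sketch of that ledger replay. The table name and the &lt;code&gt;filled_qty&lt;/code&gt;/&lt;code&gt;filled_avg_price&lt;/code&gt; attributes come from the description above; the &lt;code&gt;side&lt;/code&gt; and &lt;code&gt;status&lt;/code&gt; attribute names are my assumption:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import boto3

SEED = 1000.00

def seeded_cash():
    """$1,000 seed plus filled sells minus filled buys, replayed from the ledger."""
    table = boto3.resource("dynamodb").Table("stalker-orders")
    cash, kwargs = SEED, {}
    while True:
        page = table.scan(**kwargs)
        for row in page["Items"]:
            if row.get("status") not in ("filled", "partially_filled"):
                continue
            notional = float(row["filled_qty"]) * float(row["filled_avg_price"])
            cash += notional if row["side"] == "sell" else -notional
        if "LastEvaluatedKey" not in page:
            return cash
        kwargs = {"ExclusiveStartKey": page["LastEvaluatedKey"]}

# Per the seeded model:
#   nav = seeded_cash() + positions_market_value
#   buying_power = max(0.0, seeded_cash())   # no margin on the seed
&lt;/code&gt;&lt;/pre&gt;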
&lt;p&gt;After the second fix, the bot's reported state is &lt;code&gt;cash = -$566, nav = $993, buying_power = $0&lt;/code&gt;. It will spend the next several brief cycles trimming positions back into the seed before it can buy again. That's exactly how a real $1,000 account would behave coming off a 4-week over-deployment streak. Self-healing — no manual reset, no hand-executed unwind required.&lt;/p&gt;
&lt;p&gt;The lesson here generalizes beyond Stalker: when a paper-test wrapper diverges from the live behavior it's supposed to mirror, the divergence accumulates silently. The only protection is to define the wrapper's semantics carefully and test the boundary conditions. "What happens when positions exceed the cap" was the question I should have asked at design time.&lt;/p&gt;
&lt;h3&gt;Backtesting Without Lying To Yourself&lt;/h3&gt;
&lt;p&gt;Backtesting a strategy against historical data is the easiest way to fool yourself in finance. The classic failure mode is &lt;strong&gt;survivorship bias&lt;/strong&gt;: you backtest against today's universe of public companies, applied retroactively. The names that delisted, got acquired, or went to zero aren't in your test set, because they're not in today's universe. Your universe is by construction a sample of survivors. You're testing how well the strategy would have done on the names that turned out to be successful — which is not the same as testing how well it would have done in real time.&lt;/p&gt;
&lt;p&gt;Stalker's backtest engine handles this through a point-in-time (PIT) universe archive. The bootstrap process queries FMP's &lt;code&gt;/delisted-companies&lt;/code&gt; endpoint to recover the names that left the public market, then queries &lt;code&gt;/historical-market-cap&lt;/code&gt; per ticker to determine the cap band membership at each historical date. The result is a JSON archive at &lt;code&gt;~/.cache/stalker-backtest/pit_universe/&amp;lt;window&amp;gt;.json&lt;/code&gt; that answers the question "which tickers were in the $2B–$10B mid-cap band on date X?" for any X in the bootstrap window. The current archive covers 2023-01-01 → 2026-04-30 and contains roughly 2,000 tickers ever in band, of which several hundred have a &lt;code&gt;delisted_date&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Every backtest run can take a &lt;code&gt;--pit&lt;/code&gt; flag. With it on, the rebalance candidate set at each weekly rebalance date is filtered to symbols whose market cap was actually in the \$2B–\$10B band on that date. With it off, the rebalance uses today's universe applied retroactively — the survivorship-biased path, kept around for legacy comparability.&lt;/p&gt;
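&lt;p&gt;A sketch of the PIT lookup, under an assumed archive layout of per-ticker band-membership spans (the real cache file's shape may differ). ISO date strings compare correctly as plain strings, which keeps the check simple:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import json
from pathlib import Path

def pit_universe(window, as_of):
    """Tickers whose market cap was in the $2B-$10B band on date as_of."""
    path = Path.home() / ".cache/stalker-backtest/pit_universe" / (window + ".json")
    archive = json.loads(path.read_text())
    live = set()
    for ticker, spans in archive["tickers"].items():   # layout assumed
        for span in spans:   # e.g. {"start": "2023-03-01", "end": null}
            if span["start"] &amp;lt;= as_of &amp;lt;= (span["end"] or "9999-12-31"):
                live.add(ticker)
                break
    return live

# With --pit on, each weekly rebalance intersects its candidates:
#   candidates = [t for t in candidates if t in pit_universe(window, rebal_date)]
&lt;/code&gt;&lt;/pre&gt;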
&lt;p&gt;The first thing the PIT archive did was discredit a previous result. ADR-009 had bumped the production momentum weight from a balanced 0.40 / 0.30 / 0.30 to 0.55 / 0.225 / 0.225 based on a non-PIT backtest showing +24pp alpha over a 1-year window. When I re-ran the same configuration with PIT, alpha collapsed to -12pp. The original number was a survivorship-bias artifact. The momentum tilt looked dominant because the survivors were the names with strongest momentum — by definition.&lt;/p&gt;
&lt;p&gt;ADR-010 superseded ADR-009 with the bias-corrected result: at 0.55 momentum on PIT, the strategy beats SPY by +21.6% over the 3-year window. Better than equal-weight but a fraction of what the survivorship-biased number suggested. Honest accounting hurts.&lt;/p&gt;
&lt;p&gt;The PIT archive also enabled ADR-011's quality-factor switch. The original quality definition (ROE plus gross margin) was producing essentially zero contribution in the bias-corrected backtest. The natural diagnostic question was whether quality is a noisy factor at our universe size, or whether ROE-plus-GM is the wrong measurement of quality. Adding a &lt;code&gt;quality_definition&lt;/code&gt; knob to the engine and sweeping seven definitions showed the second answer: switching to ROIC-plus-operating-margin lifted alpha by +31 percentage points at production weights and improved Sharpe from 1.23 to 1.44. ROE rewards leverage (which varies enormously across mid-cap capital structures), and gross margin is industry-structural (SaaS at 80%, distribution at 8%) which sector-neutral z-scoring only partially undoes. ROIC is leverage-neutral and operating margin captures pricing power within sector — both tighter signals at our universe size.&lt;/p&gt;
&lt;p&gt;The PIT archive turns "backtest" from a marketing exercise into a real measurement. It's not perfect — historical FMP fundamental data has its own gaps and revisions — but it removes the dominant bias.&lt;/p&gt;
&lt;h3&gt;The Meta-Experiment&lt;/h3&gt;
&lt;p&gt;The factor stack has been validated end-to-end on bias-corrected backtests. The risk gate is a pure-Python module with full test coverage. The executor is mechanical. What hasn't been validated, in any rigorous way, is the LLM layer in the middle. The brief-driven analyze step might be adding alpha by combining macro context, position-aware reasoning, and human-style synthesis the factors can't see. Or it might be a wash. Or it might be subtracting alpha by overriding good factor picks with brief-narrative picks that don't survive in the data.&lt;/p&gt;
&lt;p&gt;The cost of getting this wrong in either direction is asymmetric. A non-additive LLM layer costs roughly $200–700 a year in API spend plus an ongoing complexity tax — debugging an LLM-mediated trade path is harder than debugging a deterministic one. An additive LLM layer foregone is real alpha left on the table. Either way, the answer should come from data, not intuition.&lt;/p&gt;
&lt;p&gt;ADR-013 is the pre-registered A/B test that mechanizes the question. Two arms, identical except for the selection signal: the brief arm (current Stalker, with Claude reading the brief and picking 6–10 names) versus a factors-only arm (top-N by &lt;code&gt;combined_z&lt;/code&gt;, equal-weighted, same risk gates, same Kelly sizing). The factors-only arm runs as a daily shadow book — a separate DynamoDB table tracking what the deterministic strategy would have done each weekday close. The pairwise comparison is a paired daily-return diff t-test pooled over the elapsed window.&lt;/p&gt;
&lt;p&gt;The pre-registration document locks the methodology before any data is examined. Hypotheses are pinned. The test statistic is &lt;code&gt;mean(d) / (sd(d) / sqrt(N))&lt;/code&gt; where &lt;code&gt;d&lt;/code&gt; is the daily return difference. The threshold is 30 basis points per month of alpha at p &amp;lt; 0.05. The decision rule is mechanical (a sketch of the close-out computation follows the list):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Brief arm wins by &amp;gt;30bp/mo at p&amp;lt;0.05 → keep&lt;/li&gt;
&lt;li&gt;Brief arm loses by &amp;gt;30bp/mo at p&amp;lt;0.05 → simplify (retire the LLM)&lt;/li&gt;
&lt;li&gt;Inconclusive → simplify (default; the burden of evidence is on the LLM layer to justify itself)&lt;/li&gt;
&lt;/ul&gt;
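&lt;p&gt;A sketch of the close-out computation. The statistic matches the formula above; converting the mean daily diff into monthly basis points via 21 trading days per month is my assumption about the pre-registered definition:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;import numpy as np
from scipy import stats

def adr013_closeout(brief_daily, shadow_daily, threshold_bp=30.0):
    """brief_daily, shadow_daily: paired daily-return series over the window."""
    d = np.asarray(brief_daily) - np.asarray(shadow_daily)
    n = d.size
    t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    p = 2 * stats.t.sf(abs(t_stat), df=n - 1)   # two-sided p-value
    monthly_bp = d.mean() * 21 * 1e4            # mean daily diff, in bp/month
    if p &amp;lt; 0.05 and monthly_bp &amp;gt; threshold_bp:
        return "keep"
    # A significant loss and an inconclusive result both simplify.
    return "simplify"
&lt;/code&gt;&lt;/pre&gt;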
&lt;p&gt;The horizon is 12 months from inception (2026-05-04 to 2027-05-04). At horizon end, the test runs once on the pooled paired returns, the decision rule is applied, and the ADR's status updates from &lt;code&gt;proposed&lt;/code&gt; to one of &lt;code&gt;accepted (kept)&lt;/code&gt;, &lt;code&gt;accepted (simplified)&lt;/code&gt;, or — if the data supports a conditional-fire hybrid — &lt;code&gt;accepted (hybrid)&lt;/code&gt; with a follow-up ADR scoping the hybrid design.&lt;/p&gt;
&lt;p&gt;Mid-flight changes invalidate the pre-registration. If the factor weights change, the test restarts. If the prompt changes, the test restarts. The whole point of pre-registration is that it converts a tempting post-hoc optimization into a principled experiment, with a documented decision rule that doesn't move once data starts coming in.&lt;/p&gt;
&lt;p&gt;The shadow book runs on a daily 16:35 CT cron that ticks the factors-only portfolio through one rebalance, marks positions to today's close, and writes a row to &lt;code&gt;stalker-shadow-performance&lt;/code&gt;. A weekly Monday-morning cron joins the live and shadow performance series, computes the running paired-diff statistic, and posts a status comment to the project's tracking ticket. The cron is dormant infrastructure for the first 12 months — the running stats are informational only; the keep-versus-simplify decision fires once at horizon close, not every week.&lt;/p&gt;
&lt;p&gt;There's a sub-experiment I ran during the pre-registration drafting that's worth noting because it killed a tempting alternative. ADR-013's locked baseline is equal-weight factors-only, but the live system uses Kelly-derived suggestions to bias Claude's sizing. A reasonable concern was: if Kelly-as-binding-sizer (using the &lt;code&gt;combined_z&lt;/code&gt; to set position weights directly, not just suggest them to the LLM) beats equal-weight by a lot in backtest, then the equal-weight baseline is a weak counterfactual and the brief arm is being given an unfair advantage. ADR-014 ran the offline comparison: at top_n=30, Kelly-as-binding-sizer appeared to beat equal-weight by +47.75pp alpha. Striking number. I almost wrote it up as a win.&lt;/p&gt;
&lt;p&gt;The result didn't survive sensitivity checking. At top_n=10 (where Kelly's per-name weights actually fit within the cash budget), Kelly underperformed equal-weight by -43pp. The +47.75pp at top_n=30 was an implementation artifact: with the unnormalized Kelly weights summing to over 100% of NAV, the engine's per-buy &lt;code&gt;cost = min(delta, cash)&lt;/code&gt; clamp was eating the lower-ranked names and concentrating capital in the highest-z names. Kelly was acting as both selector and sizer at top_n=30 by exhausting cash on the top names. The apparent edge was concentration, not better relative weighting. ADR-014 was rejected — Kelly stays as advisory input to Claude rather than a binding sizer, and the equal-weight baseline for ADR-013 is defensible.&lt;/p&gt;
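&lt;p&gt;The artifact reproduces in toy form (hypothetical weights, real mechanism):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;# Unnormalized weights summing past 100% of NAV, spent in rank order
# through the engine's per-buy clamp.
nav, cash = 1000.0, 1000.0
kelly_weights = [0.30, 0.25, 0.20, 0.15, 0.15, 0.10, 0.10]  # sums to 1.25

spent = []
for w in kelly_weights:
    cost = min(w * nav, cash)   # cost = min(delta, cash)
    cash -= cost
    spent.append(cost)

print(spent)  # [300.0, 250.0, 200.0, 150.0, 100.0, 0.0, 0.0]
# The top-ranked names get full size and the tail gets nothing: the
# apparent edge was concentration in the highest-z names.
&lt;/code&gt;&lt;/pre&gt;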
&lt;p&gt;This kind of pre-flight check is the discipline pre-registration enables. The temptation to ship a +47pp number is real. The discipline of asking "but does it survive the obvious sensitivity check?" is what separates a research finding from a marketing claim.&lt;/p&gt;
&lt;h3&gt;The Data Extraction Layer&lt;/h3&gt;
&lt;p&gt;Most of what makes the data network valuable is the upstream work — the work of getting the data into a normalized, schema-versioned form that Stalker can read. Each feeder has its own extraction story. Headwater reads HTML email digests and parses them into structured records. Estuary tracks individual writer feeds and computes consensus. PrivateEye decodes paywalled teasers into ticker picks, which is its own can of worms. Tributary subscribes to SEC EDGAR's filing stream and runs a structured extraction over the 8-K text. Goldfinch hits USAspending.gov's API and runs the recipient-to-ticker mapping.&lt;/p&gt;
&lt;p&gt;The orchestration for the heavier extraction work runs on a Bosgame M5 mini PC in my basement — the same machine that handles DirtScout's tax-list PDF extraction. It's a Ryzen-class system with 128GB of RAM and decent local inference horsepower for the structured-extraction passes that don't need frontier-model quality. The split between cloud and on-prem is roughly: cloud handles the trade-path quality-sensitive work (Claude Sonnet 4.6 for analysis), and on-prem handles the batch extraction work (PDF parsing, structured field extraction, classification at volume). Same pattern I described in &lt;a href="https://tinycomputers.io/posts/the-economics-of-owning-your-own-inference.html"&gt;the economics of owning your own inference&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Cron jobs on the Bosgame run the weekly gradient sweep that probes the factor weight space, the daily shadow-book tick for the ADR-013 A/B, and the periodic ADR validation runs that check whether shipped strategy changes are tracking their predicted alpha. Each script is wrapped in a &lt;code&gt;cron_wrapper.sh&lt;/code&gt; that does a &lt;code&gt;git pull --ff-only&lt;/code&gt; before exec, so changes pushed from my laptop propagate to the on-prem cron without manual SSH. The bash wrapper is forty lines long and has saved me hours.&lt;/p&gt;
&lt;p&gt;The point of the on-prem layer is operational independence. The cloud Lambdas are for the trading path — they need to be reliable, fast, and well-observable. The on-prem cron is for the background research path — it can take twenty minutes to run a backtest sweep, and that's fine. Putting the long-running work on Lambda would burn timeout budget and money for no benefit. Putting the trading work on the on-prem machine would bind the strategy's uptime to my home internet. The split is operationally cleanest.&lt;/p&gt;
&lt;h3&gt;What This Is Actually For&lt;/h3&gt;
&lt;p&gt;Stalker is paper-trading. It hasn't moved a dollar of real money. The Alpaca account is paper, the brokerage credentials are paper-mode, and the seeded $1,000 is synthetic. The point isn't to make money in the next four weeks. The point is to validate the architecture under realistic conditions before any real money is at stake.&lt;/p&gt;
&lt;p&gt;The criteria I want to satisfy before live cutover are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The strategy beats SPY meaningfully on PIT-corrected backtest (currently +21.6% over 3 years at production weights — done).&lt;/li&gt;
&lt;li&gt;The factor definitions are documented in ADRs and the rationale is reproducible (done).&lt;/li&gt;
&lt;li&gt;The risk gate is unit-tested with full coverage of every guardrail (done).&lt;/li&gt;
&lt;li&gt;The executor handles real-world edge cases like opposite-side wash-trade rejection and partial fills (done).&lt;/li&gt;
&lt;li&gt;The seeded-account model is validated end-to-end including the over-deployment failure mode (done).&lt;/li&gt;
&lt;li&gt;Six months of paper-trading without operational incidents — no missed briefs, no failed analyses, no risk-gate false positives, no executor failures that get past idempotency.&lt;/li&gt;
&lt;li&gt;The brief-versus-factors-only A/B has run to horizon and the LLM layer is justified (12 months — in progress).&lt;/li&gt;
&lt;li&gt;The bot's behavior under drawdown halts has been observed in practice (waiting for a real correction).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That's a long list. The 6-month operational checkpoint and the 12-month ADR-013 close-out are the slow parts. The other items are mostly done.&lt;/p&gt;
&lt;p&gt;What this isn't: a money-printing scheme. The factor stack has documented edge in academic literature and bias-corrected backtest, but mid-cap factor strategies are well-known and don't have huge mispricing margins. The expected outcome at the upper end is something like SPY +5–15pp annualized at modestly higher volatility. That's a decent risk-adjusted return, not a free lunch. If the strategy underperforms in real time, the answer is to deconstruct what changed — regime shift, factor decay, prompt drift, brief-stream change — not to tweak parameters until the curve looks right.&lt;/p&gt;
&lt;p&gt;What this is: an exercise in building the smallest deployable instance of a real-money trading system, then validating its components individually before scaling. The $1,000 seed is the smallest amount that exercises live-account semantics (rounding, fractional shares, cash management) without trivial scaling. The mid-cap focus is the band where factor edges historically existed. The pre-registered A/B is the methodology that converts "I think the LLM is doing useful work" into "the data either supports keeping the LLM or it doesn't."&lt;/p&gt;
&lt;p&gt;The other five projects in the data network exist for the same reason. Each one is its own small thing that does one job well, with a published schema, with structured output, with a clean handoff to whoever's downstream. Headwater could feed any number of consumers; Stalker is just one. Tributary's 8-K events could drive any number of overlays; Stalker happens to read them as risk filters. The architecture is a sequence of clean producer-consumer interfaces with no shared state, no implicit dependencies, no monolith. Adding a seventh project tomorrow would be the same checklist: schema, ingest path, loader, prompt mention. Same pattern, different signal.&lt;/p&gt;
&lt;h3&gt;What I'd Do Differently&lt;/h3&gt;
&lt;p&gt;A few things, in retrospect.&lt;/p&gt;
&lt;p&gt;I'd start with the PIT archive sooner. ADR-009 shipped a non-PIT backtest result that turned out to be largely a survivorship-bias artifact. The corrected number from ADR-010 is still a real edge, but it's a fraction of what the original number suggested. If I'd built the PIT bootstrap as part of the initial backtest infrastructure rather than retrofitting it, I would have shipped fewer ADRs that needed superseding. The cost of the PIT bootstrap is real — the FMP &lt;code&gt;/delisted-companies&lt;/code&gt; endpoint takes a few minutes to walk and the per-ticker historical-cap queries take an hour the first time — but the cost is one-time, and the cost of being wrong about a strategy parameter is worse.&lt;/p&gt;
&lt;p&gt;I'd pre-register the LLM-versus-factors test earlier. ADR-013's pre-registration locks methodology before data examination. I drafted it after the live system had been running for three weeks, which means the locked-baseline decision was already informed by the running paper performance. That's not strictly invalidating — the locked threshold (30bp/mo at p&amp;lt;0.05) doesn't move with informed data — but a properly clean pre-registration runs before any live data is observed. The lesson is to instrument the meta-experiment before instrumenting the experiment.&lt;/p&gt;
&lt;p&gt;I'd separate the risk constants from the strategy constants more cleanly. Things like &lt;code&gt;MAX_POSITION_PCT&lt;/code&gt; and &lt;code&gt;WASH_SALE_DAYS&lt;/code&gt; live in the same module as factor weights, which conflates policy-decision parameters (where changes are sensitive and should be ADR-driven) with implementation parameters (where changes are routine). The current structure works, but a clean separation would make the policy boundary more obvious.&lt;/p&gt;
&lt;p&gt;I wouldn't change the producer-consumer pattern. That's the most reusable architectural decision in the stack. Each feeder being its own project with its own schema and its own SES inbound endpoint means I can add or remove sources without touching the consumer logic. Stalker reads from each loader independently and degrades gracefully if any loader returns empty — DDB throttle, S3 hiccup, brief stream paused, whatever. The system stays operational on partial signal. That property has been worth every minute of the architecture-discipline cost.&lt;/p&gt;
&lt;h3&gt;Cross-Project Notes&lt;/h3&gt;
&lt;p&gt;If you've read &lt;a href="https://tinycomputers.io/posts/building-dirtscout-a-land-acquisition-platform-with-claude-code.html"&gt;the DirtScout post&lt;/a&gt;, some of the patterns here will look familiar. CDK in Python for infrastructure-as-code. Python Lambdas with deterministic IDs for idempotency. DynamoDB instead of a relational database. SES inbound for the producer-mail pattern. Static Next.js export on CloudFront for the human-facing dashboard. The same architectural style that worked for a land-acquisition platform works for a trading system, because both are read-heavy event-driven workloads with bursty inbound and structured persistence.&lt;/p&gt;
&lt;p&gt;The differences are at the edges. DirtScout deals with parcels — slow-moving, geographically-bound, low-cardinality. Stalker deals with tickers — fast-moving, market-state-dependent, high-correlation. DirtScout's risk model is "did we accidentally surface a parcel that's not for sale." Stalker's risk model is "did we put 30% of NAV into one ticker right before its earnings miss." The shape of the failure modes determines the shape of the safeguards.&lt;/p&gt;
&lt;p&gt;The other shared piece is the &lt;a href="https://tinycomputers.io/posts/vibecoding-the-controversial-art-of-letting-ai-write-your-code-friend-or-foe.html"&gt;vibecoding&lt;/a&gt; approach to the codebase itself. I direct the architecture, make the load-bearing decisions, and review the diffs. The actual lines of code mostly come from conversations. Stalker is around 8,500 lines of Python plus 800 lines of CDK plus 3,400 lines of TypeScript (the dashboard). I wrote almost none of that by hand. I directed all of it.&lt;/p&gt;
&lt;p&gt;That's a real distinction. "Directing" means owning the architecture, the policy decisions, the risk constants, the methodology for evaluation, the criteria for live cutover. It means saying "no, this isn't how we should structure that" or "actually let's pull this up to its own ADR before we ship it." It's design and review, not typing. The typing is a commodity. The design isn't.&lt;/p&gt;
&lt;h3&gt;The Forward Path&lt;/h3&gt;
&lt;p&gt;The 12-month ADR-013 horizon ends 2027-05-04. Between now and then, the system runs on its own. The shadow book accumulates daily. The weekly stats post lands in the project tracker every Monday. If the strategy hits a real drawdown, the halt logic engages and I find out whether the halt criteria are calibrated correctly. If a brief stream goes down for a day, the bot still has factor signal to fall back on. If a feeder schema changes, the lenient Pydantic models tolerate the addition until I update the consumer.&lt;/p&gt;
&lt;p&gt;At horizon, the close-out runs the t-test, applies the decision rule, and updates ADR-013's status. If the brief arm wins by margin, the LLM layer keeps its place in the architecture. If it loses or is inconclusive, I retire the LLM analysis path — &lt;code&gt;analyze_handler.py&lt;/code&gt; becomes a thin "select top-N by combined_z" function and the brief stream becomes pure observability rather than the trading driver. Both outcomes are structurally fine; the point is the choice is made by data.&lt;/p&gt;
&lt;p&gt;The five upstream projects keep running regardless. Headwater publishes its briefs. Estuary computes its consensus clusters. PrivateEye decodes its teasers. Tributary classifies its 8-Ks. Goldfinch matches its contracts to tickers. Each project is independently valuable; each one feeds Stalker; none depends on Stalker for its purpose.&lt;/p&gt;
&lt;p&gt;The five-feeders-and-a-trader architecture is the part I'm most certain about. The factor stack might need to evolve. The LLM layer might get retired at horizon. The risk constants might need adjustment under live conditions. But the producer-consumer pattern, the per-project SES inbound endpoint, the lenient Pydantic schemas, the deterministic IDs, the bias-corrected backtest discipline, the pre-registered A/B for the load-bearing architectural choice — those are durable. They're the parts I'd rebuild the same way if I started over.&lt;/p&gt;
&lt;p&gt;Three years ago I built a Slack bot that generated stock reports. It told you what one stock looked like.&lt;/p&gt;
&lt;p&gt;Stalker tells you what to do about it. It runs on its own. It's written down what it expects to see and how it'll know whether it was right. And it's surrounded by five other projects that do the work of making structured data available in the first place — because the trading layer is the smallest part of the system, and the data layer is where most of the leverage lives.&lt;/p&gt;
&lt;p&gt;The work continues.&lt;/p&gt;</description><category>ai</category><category>alpaca</category><category>aws</category><category>backtest</category><category>claude code</category><category>dynamodb</category><category>estuary</category><category>factor investing</category><category>finance</category><category>goldfinch</category><category>headwater</category><category>lambda</category><category>mid-cap</category><category>paper trading</category><category>privateeye</category><category>python</category><category>stalker</category><category>trading</category><category>tributary</category><guid>https://tinycomputers.io/posts/building-stalker-a-mid-cap-trading-bot-and-the-data-network-that-feeds-it.html</guid><pubDate>Fri, 08 May 2026 18:00:00 GMT</pubDate></item></channel></rss>