<?xml version="1.0" encoding="utf-8"?>
<?xml-stylesheet type="text/xsl" href="../assets/xml/rss.xsl" media="all"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TinyComputers.io (Posts about openmythos)</title><link>https://tinycomputers.io/</link><description></description><atom:link href="https://tinycomputers.io/categories/openmythos.xml" rel="self" type="application/rss+xml"></atom:link><language>en</language><copyright>Contents © 2026 A.C. Jokela 
&lt;!-- div style="width: 100%" --&gt;
&lt;a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"&gt;&lt;img alt="" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/80x15.png" /&gt; Creative Commons Attribution-ShareAlike&lt;/a&gt;&amp;nbsp;|&amp;nbsp;
&lt;!-- /div --&gt;
</copyright><lastBuildDate>Tue, 21 Apr 2026 03:38:09 GMT</lastBuildDate><generator>Nikola (getnikola.com)</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Architecture Verified, Mythology Intact: Running OpenMythos on a Strix Halo</title><link>https://tinycomputers.io/posts/architecture-verified-mythology-intact.html?utm_source=feed&amp;utm_medium=rss&amp;utm_campaign=rss</link><dc:creator>A.C. Jokela</dc:creator><description>&lt;p&gt;Anthropic has a rumored upcoming model called Mythos. The weights are not public, the architecture is not published, and Anthropic has said nothing official about how it works. That has not stopped people from guessing.&lt;/p&gt;
&lt;p&gt;&lt;a href="https://github.com/kyegomez/OpenMythos"&gt;OpenMythos&lt;/a&gt; is one of those guesses: an open-source "theoretical reconstruction" by Kye Gomez, built from publicly available research on what Anthropic's architecture might look like. The repository's disclaimer is blunt: "an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic."&lt;/p&gt;
&lt;p&gt;The architecture Gomez bets on is called a Recurrent-Depth Transformer. That's a specific and unusual design choice. Most current language models, like GPT or Llama, are feed-forward: tokens enter at the bottom, flow through dozens of distinct layers stacked on top of each other, and exit as predicted next tokens. A Recurrent-Depth Transformer splits that stack differently. A small number of ordinary layers run once at the start and once at the end. In between, a single layer runs many times in sequence, with the output of each run fed back in as the input to the next. Same weights. More computation.&lt;/p&gt;
&lt;p&gt;You can pip install OpenMythos. It has configurations from &lt;code&gt;mythos_1b&lt;/code&gt; (one billion parameters, toy scale) up to &lt;code&gt;mythos_1t&lt;/code&gt; (one trillion, frontier scale). The README shows you how to instantiate the 1B version in about ten lines of Python and run a forward pass.&lt;/p&gt;
&lt;p&gt;I ran that 1B variant on my Strix Halo box (a Ryzen AI MAX+ 395 with an integrated Radeon 8060S GPU, 60 GB of unified memory, running PyTorch on ROCm). The question is not whether it runs. The question is what running it can tell you. The answer turns out to be interesting in both directions: more than expected about the &lt;em&gt;architecture&lt;/em&gt;, and exactly nothing about the &lt;em&gt;model&lt;/em&gt;.&lt;/p&gt;
&lt;h3&gt;The Setup&lt;/h3&gt;
&lt;p&gt;The Strix Halo has one GPU. OpenMythos targets distributed training via FSDP, but the forward and inference paths work single-GPU. Install path:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;pip install --no-deps open-mythos
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;--no-deps&lt;/code&gt; matters. The &lt;code&gt;pyproject.toml&lt;/code&gt; pins &lt;code&gt;torch = "2.11.0"&lt;/code&gt;, which does not match the torch build in my gfx1151 ROCm wheels, and the package's actual runtime requirements are satisfied by any torch &amp;gt;=2.1. Skipping dependency resolution keeps my ROCm stack intact.&lt;/p&gt;
&lt;p&gt;One ROCm-specific patch was needed. The &lt;code&gt;DepthWiseLoRA&lt;/code&gt; module's forward has this line:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That creates a 0-dim tensor and passes it to an &lt;code&gt;nn.Embedding&lt;/code&gt;. On gfx1151 this produces a hip launch failure. The fix is a one-line change to index the embedding weight directly:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Same semantics, different kernel path, no crash. Expect similar papercuts in any research code run on non-reference hardware.&lt;/p&gt;
&lt;p&gt;With that done, &lt;code&gt;mythos_1b&lt;/code&gt; instantiates cleanly. Parameter count: 1,064,028,034. A forward pass at batch 1, sequence 128, 16 loops, using bfloat16 mixed precision (a numerical format that halves memory versus regular float32 with negligible quality loss for inference): 2.07 seconds. Peak GPU memory: 6.44 GB. Well within the Strix Halo's envelope.&lt;/p&gt;
&lt;p&gt;That gives me a working model. The rest of this post is what I did with it.&lt;/p&gt;
&lt;h3&gt;Why a Looped Transformer, Briefly&lt;/h3&gt;
&lt;p&gt;Before the experiments, a quick tour of what's specifically weird about this architecture, because everything downstream depends on it.&lt;/p&gt;
&lt;p&gt;A standard transformer has roughly 32 to 100 distinct layers. Each layer has its own parameters. A prompt passes through every layer once. The parameter count grows with the layer count times the square of each layer's width.&lt;/p&gt;
&lt;p&gt;A looped transformer keeps only one "inner" layer but runs it many times. Training on 32 "effective layers" requires only 1 layer's worth of parameters. Inference with more loops is equivalent to running a deeper model, without actually storing a deeper model. The architectural bet: if you can get this to work, you get a deeper reasoning model for a fraction of the memory.&lt;/p&gt;
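&lt;p&gt;The arithmetic behind that bet is worth a sketch. Using the standard rough count of 12d² parameters per transformer layer, with an illustrative width that is not OpenMythos's actual configuration:&lt;/p&gt;

```python
# Rough parameter count for one transformer layer of width d:
# attention projections (4 * d^2) + MLP with 4x expansion (8 * d^2).
def layer_params(d):
    return 12 * d * d

d = 4096            # illustrative hidden width, not the real mythos config
effective_depth = 32

feed_forward = effective_depth * layer_params(d)  # 32 distinct layers
looped = layer_params(d)                          # one layer, run 32 times

print(f"feed-forward layers: {feed_forward / 1e9:.2f}B parameters")
print(f"looped layer:        {looped / 1e9:.2f}B parameters")
print(f"savings factor:      {feed_forward // looped}x")
```

&lt;p&gt;Embeddings and the handful of ordinary prelude and coda layers still cost full price, so the real savings factor is smaller than the layer-only ratio, but the direction is the point.&lt;/p&gt;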
&lt;p&gt;There are two reasons to care about this for a model like Claude Mythos. First, memory efficiency at scale. A trillion-parameter model is expensive to serve; a looped model with the capability of a trillion-parameter feed-forward model but 1/16th the parameters would be dramatically cheaper. Second, reasoning depth. A 2025 paper by Saunshi et al. proved mathematically that running a looped transformer for T loops is equivalent to doing T implicit steps of chain-of-thought reasoning (the now-familiar "let me think step by step" trick that makes large models better at hard problems), except the "thoughts" happen in continuous latent space inside the model rather than being emitted as visible text tokens. If Mythos is doing that, it would explain why the model seems to do multi-step reasoning without the user ever seeing intermediate "scratch" tokens.&lt;/p&gt;
&lt;p&gt;The catch is that training a looped transformer is notoriously unstable. If the single inner layer amplifies the signal each time it runs, that amplification compounds. A 5% boost per loop becomes a 63% boost after 10 loops, and a model with a 63% boost per forward pass either explodes in training or produces outputs that don't resemble language. Most attempts at looped transformers over the last decade failed for exactly this reason.&lt;/p&gt;
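&lt;p&gt;The compounding is just exponentiation, easy to check (the 5% figure is illustrative):&lt;/p&gt;

```python
gain_per_loop = 1.05   # a 5% amplification each time the inner layer runs

print(f"after 10 loops: {gain_per_loop ** 10:.2f}x")   # 1.63x
print(f"after 16 loops: {gain_per_loop ** 16:.2f}x")   # 2.18x
```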
&lt;p&gt;The fix that makes OpenMythos (and the hypothesized Mythos) workable is borrowed from a 2026 paper called Parcae (Prairie et al.). It introduces a clever parameterization of the "gain" of the recurrent update. Instead of letting the model learn arbitrary weights in the core recurrence, Parcae constrains one piece of the architecture to always have its largest amplification factor strictly less than 1. In dynamical-systems terms, the "spectral radius" of the update matrix is always less than one. That guarantee is what makes the loops stable: any signal gets damped by each repeated application, so repeated iteration converges toward a useful fixed point instead of blowing up.&lt;/p&gt;
&lt;p&gt;This is the claim I'm about to test. The spectral radius should be less than 1 by construction. Without that constraint, training should break. And the failure mode should match what Parcae predicts.&lt;/p&gt;
&lt;h3&gt;What I Verified&lt;/h3&gt;
&lt;h4&gt;1. The spectral radius at initialization&lt;/h4&gt;
&lt;p&gt;The cleanest possible check: the &lt;a href="https://arxiv.org/abs/2501.19181"&gt;Parcae paper&lt;/a&gt; claims a specific mathematical structure for the stability-guaranteeing matrix. Starting from the parameters' default initialized values of zero, the formula works out to a single number: &lt;code&gt;exp(-1) = 0.3679&lt;/code&gt;. If the code matches the paper, a fresh &lt;code&gt;mythos_1b&lt;/code&gt; should have its key matrix set to exactly that value everywhere.&lt;/p&gt;
&lt;p&gt;The parameterization in the code is:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;A = exp(-exp(log_dt + log_A))
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;log_A&lt;/code&gt; and &lt;code&gt;log_dt&lt;/code&gt; both initialized to zero, that becomes &lt;code&gt;exp(-exp(0)) = exp(-1) = 0.3679&lt;/code&gt;. The matrix in question is diagonal, meaning it's effectively a list of numbers rather than a two-dimensional grid, so the spectral radius (technical definition: the largest magnitude among the eigenvalues) reduces to the largest absolute value in that list. At initialization, every entry is the same 0.3679.&lt;/p&gt;
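&lt;p&gt;The bound is easy to sanity-check in isolation, since the construction is just composed exponentials (a standalone sketch, not the package's code):&lt;/p&gt;

```python
import math
import random

def A(log_dt, log_A):
    # exp of a negative exp: strictly between 0 and 1 for any finite inputs.
    return math.exp(-math.exp(log_dt + log_A))

print(f"A(0, 0) = {A(0.0, 0.0):.6f}")   # exp(-1) = 0.367879

# Wherever training pushes the raw parameters, A cannot reach 1:
samples = [A(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(10_000)]
print(f"min = {min(samples):.3e}, max = {max(samples):.6f}")
```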
&lt;p&gt;Measurement on the instantiated 1B model:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;log_A init value (first 5): [0. 0. 0. 0. 0.]
A min: 0.367879
A max: 0.367879
rho(A) at init: 0.367879
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Matches the theoretical prediction to six decimal places. The constraint is doing what the paper claims. This is the kind of thing you can only verify by actually running the code, because documentation and papers often drift from implementations, and subtle bugs in implementations of clever mathematical constructions are common.&lt;/p&gt;
&lt;h4&gt;2. The loops are not a no-op&lt;/h4&gt;
&lt;p&gt;Next question: do the loops actually do anything? The README claims each loop iteration is "functionally equivalent to one step of chain-of-thought." In practical terms, that means running more loops should produce different (and presumably better) output than running fewer. If the recurrent block has learned to do nothing, or if the architecture happens to be set up such that the injection of the original input drowns out everything the loop contributes, then all loop counts would produce identical outputs and the whole looped-transformer idea is moot.&lt;/p&gt;
&lt;p&gt;The cleanest test: take the same input, the same random-initialized model, and run it with different loop counts. Compare the outputs. For each token position, the model predicts a probability distribution over the next token. Two different ways to compare:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Argmax agreement&lt;/strong&gt; is just "for what fraction of positions does the most-likely-next-token come out the same?" If two runs pick the same top token 95% of the time, they mostly agree. If they agree 35% of the time, they're meaningfully different. The comparison below uses the 16-loop run as the reference.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;KL divergence&lt;/strong&gt; is a standard measure of how different two probability distributions are, expressed in nats (units of the natural logarithm). Zero means identical distributions. Higher means more different. Intuitively: how much information is lost if you model a distribution as something other than itself.&lt;/p&gt;
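&lt;p&gt;For concreteness, KL over toy three-token distributions (the numbers are invented, not model outputs):&lt;/p&gt;

```python
import math

def kl(p, q):
    # KL(p || q) in nats: expected extra surprise from using q in place of p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

reference = [0.7, 0.2, 0.1]   # stand-in for the 16-loop distribution
other = [0.5, 0.3, 0.2]       # stand-in for a 1-loop distribution

print(f"KL(ref || other) = {kl(reference, other):.3f} nats")
print(f"KL(ref || ref)   = {kl(reference, reference):.3f} nats")   # 0.000
```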
&lt;p&gt;Running a fresh, &lt;em&gt;untrained&lt;/em&gt; &lt;code&gt;mythos_1b&lt;/code&gt; with a fixed input and seed:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;n_loops= 1: argmax agreement with 16-loop run = 35.2%   KL = 0.19 nats
n_loops= 2: argmax agreement                  = 65.6%   KL = 0.10 nats
n_loops= 3: argmax agreement                  = 72.7%   KL = 0.07 nats
n_loops= 4: argmax agreement                  = 80.5%   KL = 0.06 nats
n_loops= 6: argmax agreement                  = 88.3%   KL = 0.04 nats
n_loops= 8: argmax agreement                  = 90.6%   KL = 0.02 nats
n_loops=12: argmax agreement                  = 96.1%   KL = 0.005 nats
n_loops=16: argmax agreement                  =100.0%   KL = 0 nats
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Even with random initialization, the loops do substantive work. After a single loop, only 35% of the 128 output tokens match what the model produces after 16 loops. By three loops, 73% match. By twelve, 96%. The KL divergence tells the same story from a different angle: the probability distributions converge monotonically toward the 16-loop baseline as loop count rises.&lt;/p&gt;
&lt;p&gt;This is exactly the signature of a well-behaved recurrent system settling toward a fixed point. The loops aren't a no-op. They also aren't chaotic: each successive loop gets closer to convergence, which is what the stability guarantee predicts.&lt;/p&gt;
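&lt;p&gt;That signature is what any contraction produces. A scalar caricature of the loop makes it vivid (ρ and the injected input here are illustrative, matching nothing in the real model):&lt;/p&gt;

```python
rho = 0.37          # contraction factor, standing in for the spectral radius
inject = 1.0        # the original input, re-injected every loop
fixed_point = inject / (1 - rho)   # where the recurrence must settle

h = 0.0
for t in range(1, 17):
    h = rho * h + inject
    if t in (1, 2, 4, 8, 16):
        print(f"loop {t:2d}: h = {h:.6f}  gap to fixed point = {abs(h - fixed_point):.1e}")
```

&lt;p&gt;The gap to the fixed point shrinks by a factor of ρ every loop, which is the same monotone convergence the argmax-agreement numbers show.&lt;/p&gt;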
&lt;h4&gt;3. The stability constraint does its job&lt;/h4&gt;
&lt;p&gt;The reconstruction becomes load-bearing here. The Parcae paper claims the constraint on the matrix A is not just a nice-to-have but a requirement. Without it, they say, training diverges at aggressive learning rates. With it, training is stable.&lt;/p&gt;
&lt;p&gt;The test: build three otherwise-identical small models (shrunk to a 128-dimensional hidden state for training speed while keeping the full looped architecture). The only difference is how the matrix A is parameterized:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;stable&lt;/strong&gt;: the shipped &lt;code&gt;LTIInjection&lt;/code&gt; that uses the &lt;code&gt;exp(-exp(...))&lt;/code&gt; construction to keep A in the stable range by construction&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;unstable (start at 0.368)&lt;/strong&gt;: replace the clever construction with a raw learnable parameter initialized to the same value the stable version starts at&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;unstable (start at 0.95)&lt;/strong&gt;: same raw parameter, but initialized close to the stability boundary, to see whether training pushes it over&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Trained at a deliberately high learning rate of 0.05 with 8 recurrent loops per forward pass, for 300 steps, on random next-token prediction. (The point isn't to train a good model. It's to stress-test the stability mechanism under conditions where unstable training would be expected to break.)&lt;/p&gt;
&lt;p&gt;The metric is &lt;code&gt;max|A|&lt;/code&gt;, the largest entry in the diagonal. For the stable version, this is the spectral radius and the theory guarantees it stays below 1. For the unstable versions, nothing guarantees anything; we're watching whether training happens to keep it bounded.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: right;"&gt;Step&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Stable&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Unstable (0.368)&lt;/th&gt;
&lt;th style="text-align: center;"&gt;Unstable (0.95)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;0&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.368&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.418&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1.000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;20&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.496&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.719&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1.319&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;60&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.480&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.749&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1.340&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;100&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.477&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.736&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1.315&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;200&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.474&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.700&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1.251&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;299&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.469&lt;/td&gt;
&lt;td style="text-align: center;"&gt;0.666&lt;/td&gt;
&lt;td style="text-align: center;"&gt;1.190&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Within 20 training steps, both unstable variants push at least one entry of A well above the stable version's cap. The 0.95-init case jumps past 1 immediately and stays there. Above 1 is the forbidden regime: a diagonal entry greater than 1 in magnitude means that dimension's contribution to the hidden state grows with every loop instead of shrinking. The Parcae paper says this is fatal. Does it actually kill training?&lt;/p&gt;
&lt;p&gt;Mostly, yes. The stable variant kept producing meaningful gradients the whole way through and the loss moved (noisily, because the training data was random). Both unstable variants had their gradient norm collapse to machine zero within 20 steps and stay there. Their loss froze at &lt;code&gt;log(512) = 6.238&lt;/code&gt;, which is the entropy of a uniform distribution over the 512-token vocabulary we used: the training signal became meaningless because the model was outputting a flat "I have no preference about any token" distribution regardless of input.&lt;/p&gt;
&lt;p&gt;This isn't the classic way training fails. It's not the "loss explodes to infinity and the whole job crashes" failure mode most people think of. It's subtler: the recurrent state grows large enough that the final output saturates to uniform, every possible update to the weights produces the same (wrong) uniform output, so the gradients go to zero and the optimizer stops making progress. Training is effectively dead, silently.&lt;/p&gt;
&lt;p&gt;That is a specific failure mode the Parcae paper warns about, and it is exactly what happens here when the constraint is removed.&lt;/p&gt;
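&lt;p&gt;The frozen loss value itself is a two-line check: it is exactly the cross-entropy of a model that assigns all 512 tokens equal probability.&lt;/p&gt;

```python
import math

vocab_size = 512
uniform_prob = 1.0 / vocab_size

# Cross-entropy of a perfectly flat prediction is -log(1/V) = log(V), in nats:
loss = -math.log(uniform_prob)
print(f"log({vocab_size}) = {loss:.3f} nats")   # 6.238
```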
&lt;h4&gt;4. Hidden states blow up by exactly the predicted factor per loop&lt;/h4&gt;
&lt;p&gt;The previous experiment showed that removing the stability constraint breaks training. This one looks at the mechanism underneath. What does "the recurrent state grows unboundedly" actually look like numerically?&lt;/p&gt;
&lt;p&gt;The theory predicts that if the spectral radius is ρ, then after each loop the magnitude of the hidden state grows (or shrinks) by a factor of roughly ρ. With ρ &amp;lt; 1, repeated shrinkage by less than one converges toward a fixed value. With ρ &amp;gt; 1, repeated growth by more than one goes to infinity exponentially.&lt;/p&gt;
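&lt;p&gt;A scalar caricature of the recurrence shows the law directly: each loop scales the state by ρ and adds back a bounded input contribution. (The injection value 91 is chosen to echo the loop-1 magnitudes in the table below; everything else is illustrative.)&lt;/p&gt;

```python
def run(rho, loops=8, inject=91.0):
    # One scalar standing in for the hidden-state magnitude.
    h, norms = 0.0, []
    for _ in range(loops):
        h = rho * h + inject
        norms.append(h)
    return norms

for rho in (0.37, 1.0, 2.0):
    norms = run(rho)
    print(f"rho={rho}: final magnitude {norms[-1]:.0f}, "
          f"last per-loop ratio {norms[-1] / norms[-2]:.2f}x")
```

&lt;p&gt;Even this one-dimensional linear toy lands close to the measured endpoints in the table below: roughly 144, 728, and 23205 against the instrumented 144, 726, and 23232.&lt;/p&gt;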
&lt;p&gt;Setup: instrument the model to record the magnitude of the hidden state at each loop iteration. Force A to specific values from stable (0.37) through borderline (1.0) through clearly unstable (2.0). Disable ACT halting (an early-exit mechanism explained below in experiment 5) so all 8 loops run and we can see the full trajectory. Each number below is the magnitude of the hidden state measured after that loop iteration completes.&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: right;"&gt;Loop&lt;/th&gt;
&lt;th style="text-align: right;"&gt;ρ=0.37&lt;/th&gt;
&lt;th style="text-align: right;"&gt;ρ=0.9&lt;/th&gt;
&lt;th style="text-align: right;"&gt;ρ=1.0&lt;/th&gt;
&lt;th style="text-align: right;"&gt;ρ=1.2&lt;/th&gt;
&lt;th style="text-align: right;"&gt;ρ=1.5&lt;/th&gt;
&lt;th style="text-align: right;"&gt;ρ=2.0&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;1&lt;/td&gt;
&lt;td style="text-align: right;"&gt;91&lt;/td&gt;
&lt;td style="text-align: right;"&gt;91&lt;/td&gt;
&lt;td style="text-align: right;"&gt;91&lt;/td&gt;
&lt;td style="text-align: right;"&gt;91&lt;/td&gt;
&lt;td style="text-align: right;"&gt;91&lt;/td&gt;
&lt;td style="text-align: right;"&gt;92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;2&lt;/td&gt;
&lt;td style="text-align: right;"&gt;124&lt;/td&gt;
&lt;td style="text-align: right;"&gt;172&lt;/td&gt;
&lt;td style="text-align: right;"&gt;182&lt;/td&gt;
&lt;td style="text-align: right;"&gt;200&lt;/td&gt;
&lt;td style="text-align: right;"&gt;228&lt;/td&gt;
&lt;td style="text-align: right;"&gt;274&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;3&lt;/td&gt;
&lt;td style="text-align: right;"&gt;136&lt;/td&gt;
&lt;td style="text-align: right;"&gt;246&lt;/td&gt;
&lt;td style="text-align: right;"&gt;272&lt;/td&gt;
&lt;td style="text-align: right;"&gt;330&lt;/td&gt;
&lt;td style="text-align: right;"&gt;432&lt;/td&gt;
&lt;td style="text-align: right;"&gt;638&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;4&lt;/td&gt;
&lt;td style="text-align: right;"&gt;141&lt;/td&gt;
&lt;td style="text-align: right;"&gt;312&lt;/td&gt;
&lt;td style="text-align: right;"&gt;363&lt;/td&gt;
&lt;td style="text-align: right;"&gt;487&lt;/td&gt;
&lt;td style="text-align: right;"&gt;738&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1367&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;5&lt;/td&gt;
&lt;td style="text-align: right;"&gt;143&lt;/td&gt;
&lt;td style="text-align: right;"&gt;371&lt;/td&gt;
&lt;td style="text-align: right;"&gt;454&lt;/td&gt;
&lt;td style="text-align: right;"&gt;675&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1199&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2825&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;6&lt;/td&gt;
&lt;td style="text-align: right;"&gt;143&lt;/td&gt;
&lt;td style="text-align: right;"&gt;425&lt;/td&gt;
&lt;td style="text-align: right;"&gt;544&lt;/td&gt;
&lt;td style="text-align: right;"&gt;901&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1889&lt;/td&gt;
&lt;td style="text-align: right;"&gt;5740&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;7&lt;/td&gt;
&lt;td style="text-align: right;"&gt;144&lt;/td&gt;
&lt;td style="text-align: right;"&gt;473&lt;/td&gt;
&lt;td style="text-align: right;"&gt;635&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1172&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2924&lt;/td&gt;
&lt;td style="text-align: right;"&gt;11570&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;8&lt;/td&gt;
&lt;td style="text-align: right;"&gt;144&lt;/td&gt;
&lt;td style="text-align: right;"&gt;517&lt;/td&gt;
&lt;td style="text-align: right;"&gt;726&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1498&lt;/td&gt;
&lt;td style="text-align: right;"&gt;4476&lt;/td&gt;
&lt;td style="text-align: right;"&gt;23232&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Two things to notice. First, the stable column (ρ=0.37) converges to a fixed value around 143 and stops moving. Each loop pulls the hidden state closer to an equilibrium, and then it settles there. This is the intended behavior: a useful, computation-performing recurrent system that's doing work but not running away.&lt;/p&gt;
&lt;p&gt;Second, the ρ=2.0 column grows by almost exactly 2× per loop after the first couple: 274 → 638 → 1367 → 2825 → 5740 → 11570 → 23232. The last three ratios average 2.02×, which is as close to the theoretical 2.0 as you'd expect given the transformer block itself contributes nonlinear noise on top of the linear dynamics. The prediction is tight.&lt;/p&gt;
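&lt;p&gt;The 2.02× figure comes straight from the table's last column:&lt;/p&gt;

```python
tail = [2825, 5740, 11570, 23232]   # rho=2.0 column, loops 5 through 8
ratios = [b / a for a, b in zip(tail, tail[1:])]

print("per-loop ratios:", [round(r, 3) for r in ratios])
print(f"mean growth factor: {sum(ratios) / len(ratios):.2f}x")   # 2.02x
```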
&lt;p&gt;Four loops of ρ=2.0 take the hidden state from 91 to 1367, already a 15× blowup. Sixteen loops (the designed inference depth for &lt;code&gt;mythos_1b&lt;/code&gt;) would push it to around 10^7. That is still representable in bfloat16, which shares float32's exponent range, but it is far past the point where the computation means anything: the downstream normalization and softmax saturate, the output collapses to the flat uniform distribution seen in the training experiment, and continued doubling would overflow to infinity outright after roughly a hundred more loops.&lt;/p&gt;
&lt;p&gt;That is the stability analysis verified on actual silicon. The clever &lt;code&gt;exp(-exp(...))&lt;/code&gt; construction does what the paper says it does, and removing it produces exactly the divergence the paper says it produces, at the rate the paper says it produces it.&lt;/p&gt;
&lt;h4&gt;5. Throughput on a consumer APU&lt;/h4&gt;
&lt;p&gt;The numbers for curiosity, not for the thesis. Single prompt of 128 tokens, bfloat16, running on the integrated Radeon 8060S:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th style="text-align: right;"&gt;n_loops&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Latency&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Tokens/sec&lt;/th&gt;
&lt;th style="text-align: right;"&gt;Peak GB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;1&lt;/td&gt;
&lt;td style="text-align: right;"&gt;141 ms&lt;/td&gt;
&lt;td style="text-align: right;"&gt;910&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;2&lt;/td&gt;
&lt;td style="text-align: right;"&gt;254 ms&lt;/td&gt;
&lt;td style="text-align: right;"&gt;503&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.37&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;4&lt;/td&gt;
&lt;td style="text-align: right;"&gt;501 ms&lt;/td&gt;
&lt;td style="text-align: right;"&gt;255&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.42&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;8&lt;/td&gt;
&lt;td style="text-align: right;"&gt;1021 ms&lt;/td&gt;
&lt;td style="text-align: right;"&gt;125&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.44&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td style="text-align: right;"&gt;16&lt;/td&gt;
&lt;td style="text-align: right;"&gt;2071 ms&lt;/td&gt;
&lt;td style="text-align: right;"&gt;62&lt;/td&gt;
&lt;td style="text-align: right;"&gt;6.44&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Latency is almost perfectly linear in loop count, which is what you'd expect: double the loops, double the compute, double the wall-clock time. There is no knee in the latency curve itself; any sweet spot, somewhere around 6 to 8 loops by my judgment, is where the "more reasoning" benefit stops being worth the "wait twice as long" cost.&lt;/p&gt;
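&lt;p&gt;A least-squares line over the table confirms the linearity (latencies copied from the table above):&lt;/p&gt;

```python
loops = [1, 2, 4, 8, 16]
latency_ms = [141, 254, 501, 1021, 2071]

# Ordinary least squares for latency = intercept + slope * n_loops.
n = len(loops)
mx = sum(loops) / n
my = sum(latency_ms) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(loops, latency_ms))
         / sum((x - mx) ** 2 for x in loops))
intercept = my - slope * mx

print(f"slope: {slope:.0f} ms per loop, intercept: {intercept:.0f} ms")
```

&lt;p&gt;About 129 ms per loop and essentially zero fixed overhead: the recurrent block dominates, and the ordinary prelude and coda layers are nearly free at this scale.&lt;/p&gt;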
&lt;p&gt;The architecture has a feature called Adaptive Computation Time (ACT) that is supposed to help here. ACT learns, per-position in the prompt, whether that token's representation has "converged enough" to stop looping. Easy tokens (a period, a common function word) should halt after a couple of loops; hard tokens (the key answer in a math problem) keep looping. In theory, this saves compute on easy tokens.&lt;/p&gt;
&lt;p&gt;In practice, ACT had no measurable effect on throughput in my runs. Two reasons. First, ACT only breaks the loop early if &lt;em&gt;every&lt;/em&gt; position in the batch has halted, because the GPU runs all positions in parallel and can't just skip one. With 128 positions in the prompt, the probability that every single position happens to halt simultaneously is effectively zero, so the early-exit path never fires. Second, the halting decision is made by a learned predictor, and my model was randomly initialized (not trained). An untrained model doesn't know which tokens are easy. You'd need actual training plus a mix of easy-and-hard positions for ACT to help. Neither was present in my experiments.&lt;/p&gt;
&lt;h3&gt;The Architectural Ceiling I Didn't Expect&lt;/h3&gt;
&lt;p&gt;While trying to run 24 loops for an ablation that's not in this post, I hit this error:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="n"&gt;IndexError&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;of&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;bounds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dimension&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;It turned out the architecture has a small per-loop adaptation component, a tiny learnable "tweak" that's different for each loop iteration, letting loop #1 behave slightly differently from loop #8. That component is implemented as a lookup table with exactly &lt;code&gt;max_loop_iters&lt;/code&gt; entries. If you configured the model to train on 16 loops, you have 16 entries in the table. Trying to run a 17th loop means looking up entry 16 in a 16-entry table, which fails.&lt;/p&gt;
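&lt;p&gt;The ceiling reduces to plain fixed-size-table indexing. A toy version (the names here are illustrative, not the real module's):&lt;/p&gt;

```python
max_loop_iters = 16
# Stand-in for the per-loop adaptation table: one entry per trained loop.
adapters = [f"adapter_{i}" for i in range(max_loop_iters)]

for loop_t in (15, 16):
    try:
        print(f"loop index {loop_t}: {adapters[loop_t]}")
    except IndexError:
        print(f"loop index {loop_t}: IndexError, table only has {len(adapters)} entries")
```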
&lt;p&gt;This matters because one of the headline claims about looped transformers is &lt;em&gt;depth extrapolation&lt;/em&gt;: train the model on (say) 5-step reasoning chains, then at inference time let it run 10 or 20 loops to handle harder problems than it ever saw during training. The theoretical argument is that running more loops = more reasoning depth, and this should emerge as a free capability at inference time.&lt;/p&gt;
&lt;p&gt;The OpenMythos implementation supports depth extrapolation only up to &lt;code&gt;max_loop_iters&lt;/code&gt;. Past that, the per-loop adaptation lookup fails. You can extend the table, but only by reinitializing it larger and resuming training. You cannot simply crank a knob at inference time.&lt;/p&gt;
&lt;p&gt;That's a genuine constraint on the "more loops = deeper reasoning at inference" story, and it's the kind of thing you find only by trying to cross the boundary. Nothing in the README warns you about it. It's the sort of detail that disappears when a paper's theoretical claim ("more loops at inference!") becomes a concrete implementation ("a table indexed by loop number, which has a fixed size").&lt;/p&gt;
&lt;h3&gt;What I Could Not Verify&lt;/h3&gt;
&lt;p&gt;Nothing I did tells you anything about Claude Mythos.&lt;/p&gt;
&lt;p&gt;The architecture OpenMythos implements could be exactly the Mythos architecture. It could be a reasonable guess that shares some features with Mythos. It could be entirely wrong. You and I have no way to check, because Anthropic has not published the architecture. The &lt;code&gt;mythos_1b&lt;/code&gt; I trained is a 1B-parameter looped transformer that &lt;em&gt;behaves&lt;/em&gt; consistently with published research on looped transformers. It is not Mythos.&lt;/p&gt;
&lt;p&gt;This is the epistemic limit that the repo's disclaimer is trying to name. Running a speculative reconstruction tells you whether the reconstruction is internally coherent, and whether it matches the published research it claims to match. It tells you nothing about whether the reconstruction maps to the thing it was reconstructed from. No amount of running it closes that gap. The gap is closed only by information the model's creators chose not to release, and running silicon against latent belief doesn't produce that information.&lt;/p&gt;
&lt;p&gt;So "I verified that OpenMythos's architecture works as claimed" is a real and useful statement. "I verified that Claude Mythos uses this architecture" is not something I can say, and nobody outside Anthropic can, no matter how thoroughly they run the reconstruction.&lt;/p&gt;
&lt;h3&gt;What an Open Reconstruction Is Good For&lt;/h3&gt;
&lt;p&gt;It's good for three things.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;One, as a teaching artifact.&lt;/strong&gt; There's a live research line on looped transformers. Most descriptions of it are paper-shaped: dense with notation, theorem statements, ablation tables. OpenMythos is one of the few places you can read the whole architecture as working code in a single file, with every piece named and addressable. The stability guarantee that takes several pages of the Parcae paper to motivate resolves to one line of Python:&lt;/p&gt;
&lt;div class="code"&gt;&lt;pre class="code literal-block"&gt;&lt;span class="k"&gt;def&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nf"&gt;get_A&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_dt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_A&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;That's the whole guarantee. You can read it, you can run it, you can verify it on your bench. The paper claim goes from abstract math to a concrete object you can measure. For anyone who wants to understand why looped transformers work and has been bouncing off the academic literature, that's worth the install.&lt;/p&gt;
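&lt;p&gt;The bench check is short: for any real input, &lt;code&gt;exp(-exp(x))&lt;/code&gt; lands strictly inside (0, 1), so every recurrence step multiplies the state by a contraction factor. A quick verification (input values arbitrary; float64 to avoid rounding at the boundaries, since in float32 the factor can round to exactly 0 or 1 at the extremes, which is still stable):&lt;/p&gt;

```python
import torch

# Sweep a range of inputs through the double-exponential from the snippet
# above and confirm the resulting decay factor stays strictly in (0, 1).
x = torch.linspace(-10.0, 4.0, steps=1001, dtype=torch.float64)
A = torch.exp(-torch.exp(x))
print(float(A.min()), float(A.max()))   # both inside (0, 1)
```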
&lt;p&gt;&lt;strong&gt;Two, as a testbed for your own ideas.&lt;/strong&gt; If you want to try modifying the architecture (swap which attention variant it uses, change how the experts are routed, make the loops behave differently at different depths) the code is about 1000 lines of clean PyTorch. You don't need to build a looped transformer from scratch; you can start from a working baseline and modify. The research is live and you can participate in it.&lt;/p&gt;
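&lt;p&gt;For a sense of what "start from a working baseline" looks like, here is a bare-bones looped forward pass (my own minimal skeleton, not OpenMythos's code) of the kind you would modify: swap &lt;code&gt;block&lt;/code&gt; for your attention variant, or branch on &lt;code&gt;step&lt;/code&gt; to make early and late loops behave differently:&lt;/p&gt;

```python
import torch
import torch.nn as nn

class TinyLoopedModel(nn.Module):
    """Apply one shared block repeatedly instead of a stack of distinct layers."""
    def __init__(self, d_model=64, loops=8):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU(),
                                   nn.Linear(d_model, d_model))
        self.norm = nn.LayerNorm(d_model)
        self.loops = loops

    def forward(self, h):
        for step in range(self.loops):   # same weights, applied `loops` times
            h = self.norm(h + self.block(h))
        return h

out = TinyLoopedModel()(torch.randn(2, 16, 64))
```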
&lt;p&gt;&lt;strong&gt;Three, as a way to calibrate your expectations.&lt;/strong&gt; My 1B-parameter &lt;code&gt;mythos_1b&lt;/code&gt; produces 62 tokens per second at 16 loops on the 8060S. A full-scale Mythos would be far larger and would presumably run a similar or higher loop count per token. If Mythos is actually a Recurrent-Depth Transformer, that tells you something about the real cost of running it: every token takes the full loop count of compute, regardless of how "easy" it is. That's a different cost shape from a standard transformer, which also spends fixed compute per token, but spreads it over a fixed stack of distinct layers rather than repeated passes through shared weights. You can form a rough sense of the compute-per-token ratio that a looped architecture would imply for a frontier deployment, which is useful even if you never run a frontier model yourself.&lt;/p&gt;
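&lt;p&gt;The back-of-envelope version of that ratio (all layer counts here are hypothetical, chosen only to show the shape of the arithmetic, not Mythos's real configuration):&lt;/p&gt;

```python
# Compare layer passes per token: a looped model runs its shared layers
# `loops` times; a standard model runs each distinct layer once.
shared_layers = 4        # layers inside one loop body (hypothetical)
loops = 16               # loop iterations per token
standard_layers = 32     # a typical fixed-depth transformer (hypothetical)

looped_passes = shared_layers * loops       # 64 layer passes per token
ratio = looped_passes / standard_layers     # 2.0x compute per token
print(looped_passes, ratio)
```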
&lt;p&gt;None of those three things is "I now know how Claude Mythos works." They are all "I now know things about looped transformers that I did not know before." For the blogger running an 8060S in a home lab, that's the realistic upside, and it's a larger upside than zero.&lt;/p&gt;
&lt;h3&gt;Coda: The Reconstruction as Thing&lt;/h3&gt;
&lt;p&gt;I wrote a &lt;a href="https://tinycomputers.io/posts/the-thing-and-the-endpoint.html"&gt;philosophy piece earlier this week&lt;/a&gt; about Heidegger's distinction between things and endpoints. A Z80 on a RetroShield is a thing: it gathers a world of silicon, engineers, software history, and your own hands. A cloud API is an endpoint: it offers a contract and deliberately hides everything behind it.&lt;/p&gt;
&lt;p&gt;Claude Mythos is an endpoint. You send tokens, you get tokens, the weights are not yours, the architecture is not yours, and if Anthropic swaps the backing model nothing changes for the caller by design. That's the whole value proposition. It refuses to gather.&lt;/p&gt;
&lt;p&gt;OpenMythos is a thing. I have its weights. I know the parameter count down to the last entry: 1,064,028,034. I measured its internal stability matrix at initialization and watched it move across training. I watched the hidden state blow up to 23,000 when I forced the instability and disabled the halting mechanism. I know how long a forward pass takes on my specific GPU with my specific ROCm wheel on a specific Tuesday in April 2026. It gathers a whole lineage: the 2026 Parcae paper that explained the stability trick, the older research on looped transformers that it built on, Kye Gomez's speculative synthesis of the two into a candidate architecture for Mythos, the AMD gfx1151 toolchain that lets me run any of this on a Ryzen APU at all, the one-line patch I had to apply to the code, my own debugging session, and thirty minutes of my GPU's fans running at full tilt.&lt;/p&gt;
&lt;p&gt;The thing gathers. What it gathers, though, is the &lt;em&gt;reconstruction&lt;/em&gt;. Not the reconstructed. My 1B model is a physical artifact with measurable behavior. It is not a window onto Anthropic's internals. Those internals remain an endpoint, and the endpoint remains abstract.&lt;/p&gt;
&lt;p&gt;Mythology intact. Architecture verified. That is what a home lab buys you in 2026.&lt;/p&gt;</description><category>ai</category><category>claude</category><category>looped transformer</category><category>mythos</category><category>openmythos</category><category>philosophy</category><category>reconstruction</category><category>recurrent depth transformer</category><category>rocm</category><category>strix halo</category><guid>https://tinycomputers.io/posts/architecture-verified-mythology-intact.html</guid><pubDate>Mon, 20 Apr 2026 13:00:00 GMT</pubDate></item></channel></rss>