We Keep Building Castles on the Swamp

When even the model providers can't tell when their own toolchain has regressed


Last month, Anthropic's own eval suite missed a quality regression in Claude Code that users had been reporting for weeks. If the people who built the model can't tell when their toolchain has shifted, the rest of us are building a castle in a swamp and wondering why our feet are wet.

I've spent the last few months building governance automation across a handful of LLM harnesses. Each one is its own substrate. And every substrate moves. Prompts that worked in March stop working in April. Schemas drift. Switch harnesses and you're starting over on different ground. The only way I find out is by watching outputs.

That's not a complaint about any single vendor. It's a structural property of building on LLMs. The fix is harness eval: your prompts, your schemas, your expected behaviours, gated in CI. Many ship without it. That's the swamp. If the swamp is winning, here's how to start building your foundation.

Exhibit A: the Anthropic postmortem

In April 2026, Anthropic published a postmortem explaining why Claude Code had felt off for several weeks. Three separate changes had shipped through their normal process. Each affected different users at different times. The combined effect looked like Claude was randomly getting worse. Their internal evals missed all of it.

The first was a March change that made Claude think less to respond faster. The second was a caching fix with a bug that kept wiping Claude's memory of earlier decisions. The third was an April system prompt tweak that capped output length and hurt coding quality. The bug passed code review, unit tests, end-to-end tests, and dogfooding. The prompt change passed weeks of internal testing and every eval they had.

The eval suite stayed silent through all of it. When Anthropic eventually ran broader tests, one of them showed a 3% drop. They only ran those tests because users had been complaining for weeks.

Look at the fix list. Per-model eval suites for every prompt change. Testing each prompt line on its own. Time to soak before rolling wider. Broader eval coverage. Gradual rollouts. The fix for a model provider's quality crisis was better testing.

If the team building the model needs that, every team building on top of it needs it more.

Exhibit B: a smarter model can break your harness

When Anthropic released Opus 4.7, the migration guide flagged that the model interprets prompts more literally, calibrates response length to task complexity, and tokenises text differently from Opus 4.6. The docs recommended a prompt and harness review as part of migration. A team that swaps claude-opus-4-6 for claude-opus-4-7 and ships gets a smarter model behaving differently. Without a harness, you cannot tell whether your application improved, regressed, or quietly shifted in ways your users will notice next week.

Exhibit C: silent provider updates

You can also sit still and lose ground. Most major providers offer model aliases that point to whatever the current best version is. The point is convenience: you do not have to update your code to get improvements. The cost is that the model behind the alias can change without your code noticing. Even pinning to a specific dated snapshot just buys time. Providers deprecate snapshots eventually, on a schedule you do not control. Your prompts have not moved. Your code has not moved. The substrate has.

What harness eval actually is

A harness eval has four parts. A golden dataset of inputs paired with expected outputs or expected behaviours. An output contract that says what shape the response must take. A scorer that compares actual output to expected, deterministically where possible and with a model-graded fallback where not. And a CI gate that fails the build when the score drops below threshold.

The dataset is the asset. Everything else is plumbing. A good harness fails loudly when something changes, even when the change looks like an improvement. That is the whole point. The harness is the early warning, not the verdict.

What good looks like

Start with one workflow. Pick the highest-stakes prompt in your application, the one whose output you would notice if it shifted. Write enough examples to cover the cases you actually care about catching: the happy path, the edge cases you have already seen break, and the failure modes you fear. Make the dataset small enough that you finish it in an afternoon. That is your golden dataset. It is small. It is yours. It is enough to start.

Add a scorer. Exact match where the output is structured. Model-graded with a fixed grader prompt where it is not. Wire it into CI. Fail the build if the score drops below the threshold you set today.

That is the whole foundation. A starter dataset and a CI gate. You have not solved harness eval. You have made it impossible to silently regress on the workflow you cared most about.

Then add the second workflow. Then a third. Treat the dataset as code, reviewed in pull requests, owned by the team that owns the prompt. The dataset is the asset. The asset will rot. Prevention is the next step.


The model providers will keep shipping. The substrate will keep moving. A harness is your foundation: every regression you have seen, written down so it cannot surprise you twice. Acting on what it tells you, without stopping when the ground shifts, is the next problem.