We Keep Building Castles on the Swamp

When even the model providers can't tell when their own toolchain has regressed


Last month, Anthropic's own eval suite missed a quality regression in Claude Code that users had been reporting for weeks. If the people who built the model can't tell when their toolchain has shifted, the rest of us are building a castle in a swamp and wondering why our feet are wet.

I've spent the last few months building governance automation across a handful of LLM harnesses. Each one is its own substrate. And every substrate moves. Prompts that worked in March stop working in April. Schemas drift. Switch harnesses and you're starting over on different ground. The only way I find out is by watching outputs.

That's not a complaint about any single vendor. It's a structural property of building on LLMs. The fix is harness eval: your prompts, your schemas, your expected behaviours, gated in CI. Many teams ship without it. That's the swamp. If the swamp is winning, here's how to start building your foundation.

Exhibit A: the Anthropic postmortem

In April 2026, Anthropic published a postmortem explaining why Claude Code had felt off for several weeks. Three separate changes had shipped through their normal process. Each affected different users at different times. The combined effect looked like Claude was randomly getting worse. Their internal evals missed all of it.

The first was a March change that made Claude think less to respond faster. The second was a caching fix with a bug that kept wiping Claude's memory of earlier decisions. The third was an April system prompt tweak that capped output length and hurt coding quality. The caching bug passed code review, unit tests, end-to-end tests, and dogfooding. The prompt change passed weeks of internal testing and every eval they had.

The eval suite stayed silent through all of it. When Anthropic eventually ran broader tests, one of them showed a 3% drop. They only ran those tests because users had been complaining for weeks.

Look at the fix list. Per-model eval suites for every prompt change. Testing each prompt line on its own. Time to soak before rolling wider. Broader eval coverage. Gradual rollouts. The fix for a model provider's quality crisis was better testing.

If the team building the model needs that, every team building on top of it needs it more.

Exhibit B: a smarter model can break your harness

When Anthropic released Opus 4.7, the migration guide flagged that the model interprets prompts more literally, calibrates response length to task complexity, and tokenises text differently from Opus 4.6. The docs recommended a prompt and harness review as part of migration. A team that swaps claude-opus-4-6 for claude-opus-4-7 and ships gets a smarter model behaving differently. Without a harness, you cannot tell whether your application improved, regressed, or quietly shifted in ways your users will notice next week.

Exhibit C: silent provider updates

You can also sit still and lose ground. Most major providers offer model aliases that point to whatever the current best version is. The point is convenience: you do not have to update your code to get improvements. The cost is that the model behind the alias can change without your code noticing. Even pinning to a specific dated snapshot just buys time. Providers deprecate snapshots eventually, on a schedule you do not control. Your prompts have not moved. Your code has not moved. The substrate has.

What harness eval actually is

A harness eval has four parts. A golden dataset of inputs paired with expected outputs or expected behaviours. An output contract that says what shape the response must take. A scorer that compares actual output to expected, deterministically where possible and with a model-graded fallback where not. And a CI gate that fails the build when the score drops below threshold.
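Here is a minimal sketch of those four parts in Swift. Every name in it (GoldenCase, Model, satisfiesContract, score, gatePasses) is made up for illustration, the contract is an assumed one (a single lowercase word), and the model is a stub closure standing in for a real provider call. It only covers the deterministic scoring path.

```swift
// Hypothetical names throughout; nothing here comes from a real library.

// 1. Golden dataset: inputs paired with expected outputs.
struct GoldenCase {
    let input: String
    let expected: String
}

// Stand-in for a real model call.
typealias Model = (String) -> String

// 2. Output contract (assumed for this sketch): one non-empty, all-lowercase word.
func satisfiesContract(_ output: String) -> Bool {
    return !output.isEmpty && output.allSatisfy { $0.isLowercase }
}

// 3. Scorer: fraction of cases that meet the contract and match exactly.
func score(_ model: Model, on dataset: [GoldenCase]) -> Double {
    guard !dataset.isEmpty else { return 0 }
    let passed = dataset.filter { testCase in
        let output = model(testCase.input)
        return satisfiesContract(output) && output == testCase.expected
    }.count
    return Double(passed) / Double(dataset.count)
}

// 4. Gate: the build passes only if the score meets the threshold.
func gatePasses(score: Double, threshold: Double) -> Bool {
    return score >= threshold
}

let dataset = [
    GoldenCase(input: "Refund my order", expected: "billing"),
    GoldenCase(input: "App crashes on launch", expected: "bug"),
]

// Stub model: the second answer is capitalised, so it breaks the contract.
let stubModel: Model = { input in
    input.hasPrefix("Refund") ? "billing" : "Bug"
}

let result = score(stubModel, on: dataset)                  // 0.5
let shipIt = gatePasses(score: result, threshold: 0.9)      // false
```

The stub is deliberately wrong on one case so the gate has something to catch. In a real harness the closure would wrap your actual prompt and model call.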

The dataset is the asset. Everything else is plumbing. A good harness fails loudly when something changes, even when the change looks like an improvement. That is the whole point. The harness is the early warning, not the verdict.

What good looks like

Start with one workflow. Pick the highest-stakes prompt in your application, the one whose output you would notice if it shifted. Write enough examples to cover the cases you actually care about catching: the happy path, the edge cases you have already seen break, and the failure modes you fear. Make the dataset small enough that you finish it in an afternoon. That is your golden dataset. It is small. It is yours. It is enough to start.

Add a scorer. Exact match where the output is structured. Model-graded with a fixed grader prompt where it is not. Wire it into CI. Fail the build if the score drops below the threshold you set today.
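One way to sketch the scorer choice and the CI decision, again in Swift with hypothetical names. The grader closure is an assumption: it stands in for a call to a grading model made with a fixed grader prompt, and here it is just a stub.

```swift
// Hypothetical names; the grader closure stands in for a model-graded check
// made with a fixed, version-controlled grader prompt.
enum Scorer {
    case exactMatch
    case modelGraded(grader: (_ actual: String, _ expected: String) -> Double)

    func score(actual: String, expected: String) -> Double {
        switch self {
        case .exactMatch:
            return actual == expected ? 1.0 : 0.0
        case .modelGraded(let grader):
            // Clamp so a misbehaving grader cannot push the mean outside 0...1.
            return min(max(grader(actual, expected), 0.0), 1.0)
        }
    }
}

func meanScore(_ results: [(actual: String, expected: String)], using scorer: Scorer) -> Double {
    guard !results.isEmpty else { return 0 }
    let total = results.reduce(0.0) { $0 + scorer.score(actual: $1.actual, expected: $1.expected) }
    return total / Double(results.count)
}

// CI convention: exit code 0 passes the build, anything else fails it.
func exitCode(mean: Double, threshold: Double) -> Int32 {
    return mean >= threshold ? 0 : 1
}

let results = [
    (actual: "bug", expected: "bug"),
    (actual: "billing", expected: "bug"),
]
let mean = meanScore(results, using: .exactMatch)    // 0.5
let code = exitCode(mean: mean, threshold: 0.8)      // 1, so the build fails
```

In a CI script you would pass that code to exit. Keeping the threshold as a plain number in the repository means any change to it shows up in review.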

That is the whole foundation. A starter dataset and a CI gate. You have not solved harness eval. You have made it impossible to silently regress on the workflow you cared most about.

Then add the second workflow. Then a third. Treat the dataset as code, reviewed in pull requests, owned by the team that owns the prompt. The dataset is the asset. The asset will rot. Prevention is the next step.


The model providers will keep shipping. The substrate will keep moving. A harness is your foundation: every regression you have seen, written down so it cannot surprise you twice. Acting on what it tells you, without stopping when the ground shifts, is the next problem.

Functional Programming - Getting Started

As principally an iOS developer, I’ve found that Swift has made for interesting times. Whilst the first code I wrote in Swift was heavily influenced by Obj-C patterns, I (like many before me) quickly discovered that this was not the best way. A core part of Reactive Programming is Functional Programming, so that seems like a good place to start.

Following some best practices, and advice from others (including Apple’s WWDC sessions), I found I was moving towards writing code that adhered more closely to Functional Programming patterns. That’s not to say it was intentional, or that I was accidentally discovering Functional Programming on my own, just that adopting it more formally is a smaller step than I was expecting.

In summary (and this is simplified) functional programming emphasises immutability and minimises state. The output of a function is dependent only on its input. An inherent requirement is that functions do not have side effects. A common description states:

[C]omputation as the evaluation of mathematical functions
- Functional programming - Wikipedia

I’ll be honest, it took me longer than it really should have for that statement to click with me. In the hope that I’m not the only one, here’s an example that evaluates a simple mathematical expression using imperative and functional approaches. We’ll evaluate 3 + 2 + 6 + (-1):

class ImperativeNumber {
    var value: Int

    func add(value: Int) {
        self.value += value
    }

    init(value: Int) {
        self.value = value
    }
}

let imperativeThree = ImperativeNumber(value: 3)
// A class is a reference type, so this is a second name for the same object.
let imperativeNumber = imperativeThree
imperativeNumber.add(value: 2)
imperativeNumber.add(value: 6)
imperativeNumber.add(value: -1)
print(imperativeNumber.value)   // 10
print(imperativeThree.value)    // 10: our "3" has been mutated too

struct FunctionalNumber {
    let value: Int

    func add(value: Int) -> FunctionalNumber {
        return FunctionalNumber(value: self.value + value)
    }
}

let functionalThree = FunctionalNumber(value: 3)
let functionalNumber = functionalThree
    .add(value: 2)
    .add(value: 6)
    .add(value: -1)
print(functionalNumber.value)   // 10
print(functionalThree.value)    // 3

In functional programming, like the mathematical function, the value of “3” does not change because we added “2” to it. Instead, we have a new number that we can perform a new function on. If “3” did change, as in the imperative case, then anything else that used “3” in its calculations would be affected as “3” could now be “5”, or “11”, or “10”. It’s quite common to see this chaining pattern in functional programming, and reactive programming.
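The same sum can be sketched with the standard library’s own chainable functions. This is just an illustration of the chaining style using reduce, filter and map; the constant names are mine.

```swift
// The sum from the example above, as a single reduce over the values.
let total = [3, 2, 6, -1].reduce(0, +)   // 10

// A longer chain: each step returns a new array and leaves its input untouched.
let values = [3, 2, 6, -1]
let doubledPositivesTotal = values
    .filter { $0 > 0 }      // [3, 2, 6]
    .map { $0 * 2 }         // [6, 4, 12]
    .reduce(0, +)           // 22
```

After the chain runs, values still holds its original four numbers, just as functionalThree still holds 3.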

A note on naming. According to the Swift API Design Guidelines, a function without side effects should read as a noun phrase, and a function with side effects should read as an imperative verb phrase. It’s the difference between getting an object and performing an action on it. Where a mutating method has a nonmutating counterpart, the guidelines pair the verb with its “ed” or “ing” form; see Array.sort and Array.sorted. However, whilst some Swift types offer both imperative and functional variants, the operations with their roots in functional programming typically exist only in functional form and keep their bare verb names (e.g. filter, map, flatMap), regardless of whether they could be implemented in an imperative fashion. So for a purely functional type, using the verb form should not be unexpected. When mixing functional and imperative code, however, it’s likely best to adhere to the guidelines:

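struct ScoreBoard {
    private(set) var counter = 0

    // Verb form: mutates this score board in place.
    mutating func increment() {
        counter += 1
    }

    // "ed" form: returns a new score board and leaves this one unchanged.
    func incremented() -> ScoreBoard {
        return ScoreBoard(counter: counter + 1)
    }
}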
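For comparison, here is how the Array.sort and Array.sorted pair behaves. It is a small sketch with made-up variable names, but both methods are part of the standard library.

```swift
var scores = [3, 1, 2]

// Nonmutating "ed" form: returns a new, sorted array.
let ordered = scores.sorted()
let afterSorted = scores    // still [3, 1, 2]

// Mutating verb form: reorders the array in place.
scores.sort()
let afterSort = scores      // now [1, 2, 3]
```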

What functional programming means for your code is that it is safer. If an object can’t be mutated, you aren’t at risk of it being modified by another thread whilst you read from it. It’s also easier to test, because you are guaranteed the same result for the same input.
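To make the testing point a little more concrete, here’s a small sketch of my own (RunningTotal and sum are made-up names). The first version leans on stored state, so the same call gives different answers; the second depends only on its arguments.

```swift
// Illustrative names only; not from any library.

// Stateful: the result depends on every call that came before.
final class RunningTotal {
    private(set) var total = 0

    func add(_ amount: Int) -> Int {
        total += amount
        return total
    }
}

// Pure: the result depends only on the arguments.
func sum(_ lhs: Int, _ rhs: Int) -> Int {
    return lhs + rhs
}

let counter = RunningTotal()
let firstCall = counter.add(2)    // 2
let secondCall = counter.add(2)   // 4: same input, different output

let pureOnce = sum(3, 2)          // 5
let pureTwice = sum(3, 2)         // 5: same input, same output
```

A test for sum needs no set-up at all, whereas a test for RunningTotal has to know (or reset) what happened before it ran.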

Learning Reactive Programming

I started taking a look at functional programming recently, and more particularly in the context of Reactive Programming. There’s a lot to cover and remember and I’ve found writing it down helps. To that end, I’m going to try and write a few blog posts as I go in the hope of remembering this stuff and working through some ideas as I do. Hopefully, if I make any mistakes, someone can correct me. And maybe it will help someone else.