TIA™ | July 4, 2026 | 5 min read

Where a Passing Test Still Isn't Proof of a Right Answer

Take a single, finite handful of data points. No more will ever arrive. And yet a data scientist wants to know something the data alone cannot say for certain: how much the answer would wobble if the world had handed over a slightly different sample. The trick, formalized under names like the bootstrap and the jackknife, is almost absurdly simple. Recompute the estimate on perturbed copies of the same data, resampled with replacement or with one point left out at a time, and watch how much the estimate moves. If it barely moves, call it stable. If it swings wildly, the original answer was never solid ground, just a lucky roll.
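For readers who want to see the shaking itself, here is a minimal sketch in plain Python, assuming the estimate is a simple mean. The function names `bootstrap_spread` and `jackknife_estimates` and the sample values are made up for illustration; a real analysis would more likely reach for an established statistics library.

```python
import random
import statistics


def bootstrap_spread(data, estimator, n_resamples=2000, seed=0):
    """Standard deviation of the estimator across bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n points with replacement from the one sample we have.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


def jackknife_estimates(data, estimator):
    """The estimator recomputed with each point left out in turn."""
    return [estimator(data[:i] + data[i + 1:]) for i in range(len(data))]


# Hypothetical measurements, invented for this example.
sample = [2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 2.5, 1.8]

spread = bootstrap_spread(sample, statistics.mean)
leave_one_out = jackknife_estimates(sample, statistics.mean)

print(f"estimate: {statistics.mean(sample):.3f}")
print(f"bootstrap spread: {spread:.3f}")
print(f"leave-one-out range: {min(leave_one_out):.3f} to {max(leave_one_out):.3f}")
```

A small spread relative to the estimate is what "barely moves" looks like in numbers. Note what the code never consults: anything outside `sample`.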

This idea, that "estimators behave similarly when subjected to small perturbations in the data," is what a recent paper on cross-validation calls stability, and it treats stability as close to a first principle of modern statistics: the concept behind generalization error, privacy guarantees, and the validity of adaptive inference all at once. The paper's opening complaint is the real subject, though. Black-box inference is hard precisely because the estimators doing the work "lack traditional modeling assumptions and exhibit opaque behavior," which means nobody can write down, in closed form, what the error should look like. So statisticians reach for "post-hoc randomization": permutation tests, resampling, conformal prediction. Methods that don't ask the model to explain itself. They ask it to survive being shaken.

That is one kind of test. There is an older, unrelated-looking one from software engineering: design by contract, Bertrand Meyer's framework for getting "software elements cooperating with each other" by making each one state, formally, what it requires and what it promises in return. A precondition, a postcondition, an invariant. The function that divides two numbers doesn't have to trust its caller; it just states that the caller must never hand it a zero divisor, and if that's honored, the function guarantees a correct quotient back. Neither side has to understand the other's internals. They just have to keep their word at the boundary.
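The division example can be sketched in a few lines. Python has no built-in contract syntax of the kind Meyer's Eiffel provides, so this version imitates one with explicit checks; the tolerances in the postcondition are an assumption chosen for ordinary floating-point inputs, not part of any standard.

```python
import math


def divide(numerator, denominator):
    """Divide two numbers under an explicit, hand-rolled contract."""
    # Precondition: the caller's side of the bargain.
    if denominator == 0:
        raise ValueError("precondition violated: denominator must be nonzero")

    quotient = numerator / denominator

    # Postcondition: the function's side, checked at the boundary.
    # Multiplying back should recover the numerator, up to rounding.
    if not math.isclose(quotient * denominator, numerator,
                        rel_tol=1e-9, abs_tol=1e-12):
        raise AssertionError("postcondition violated: quotient is inconsistent")
    return quotient


print(divide(6, 3))

try:
    divide(1, 0)
except ValueError as error:
    print(error)
```

The caller never sees how the quotient is computed, and the function never asks why the caller wants it. Everything that matters is stated at the two checks.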

Put these two next to each other and something clicks that neither field says out loud. A contract checks whether an exchange happened in the right shape. A stability test checks whether an answer would survive being asked again under slightly different conditions. Both are ways of verifying something without opening the box. And both, quietly, verify something much narrower than "this answer is true."

That narrowness is where the discipline of caging inference instead of trusting it lives or dies. The instinct to never use inference for what can be deterministically codified, to reach for code first and script second and let inference touch only what nothing else can reach, is really an attempt to build Meyer's contracts around a mind that cannot state its own preconditions. Give the inference call codified inputs, a codified output shape, a codified gate that checks the output is well-formed, and a codified loop around the whole thing. That gate is a postcondition in the strict Meyer sense. It can catch a malformed answer, an out-of-range score, a hallucinated field that breaks the schema. It cannot catch a wrong answer that happens to be dressed correctly, the same way a resampling test cannot catch an estimator that is stable and also biased. Stability tells you the number would not have moved much under resampling. It never tells you the number was measuring the right thing.
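A minimal sketch of such a gate makes the limit concrete. The two-field schema, the label set, and the sample outputs below are all hypothetical, invented for this example rather than taken from any real system.

```python
# Hypothetical output schema for an inference call that classifies a review.
ALLOWED_FIELDS = {"label", "score"}
ALLOWED_LABELS = {"positive", "negative"}


def gate(output):
    """Return a list of violations; an empty list means well-formed."""
    problems = []
    extra = set(output) - ALLOWED_FIELDS
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    missing = ALLOWED_FIELDS - set(output)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "label" in output:
        label = output["label"]
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            problems.append("label outside the allowed set")
    if "score" in output:
        score = output["score"]
        is_number = isinstance(score, (int, float)) and not isinstance(score, bool)
        if not is_number or not 0.0 <= score <= 1.0:
            problems.append("score must be a number in [0, 1]")
    return problems


review = "Arrived broken, and support never replied."

# Out-of-range score plus a field the schema never asked for: caught.
malformed = {"label": "positive", "score": 1.7, "rationale": "great product"}

# Perfectly shaped, and plainly wrong about the review above: not caught.
well_formed_but_wrong = {"label": "positive", "score": 0.93}

print(gate(malformed))
print(gate(well_formed_but_wrong))
```

The gate never reads `review`. It checks the shape of the answer, and a confident mislabel has exactly the right shape.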

This is the seam the cross-validation paper gestures at without naming it: post-hoc randomization works under "mild assumptions such as exchangeability," meaning the perturbed samples have to be drawn from the same well the original one came from. The whole method is only as honest as that assumption. Which means the test's authority is borrowed, not earned by the test itself. It is earned somewhere upstream, by whoever decided the sampling was exchangeable in the first place and, in Meyer's world, by whoever wrote the postcondition and actually knew what correct looked like.
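To see where that assumption enters, here is a sketch of a two-sample permutation test, one of the post-hoc randomization methods named earlier. It is a generic textbook version, not the paper's procedure, and the function name and the data are assumptions made for illustration.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Approximate p-value for a difference in means under label shuffling."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling the pool is only legitimate if, under the null hypothesis,
        # the labels are exchangeable. The code cannot check that. It assumes it.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Counting the observed arrangement once keeps the estimate above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical measurements from two conditions.
treated = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4]
control = [4.1, 3.9, 4.2, 4.0, 4.3, 3.8]

print(permutation_p_value(treated, control))
```

The comment inside the loop is the whole point. If the two groups were collected in ways that make their labels non-interchangeable even when nothing is going on, the function still returns a number, and nothing in the number says so.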

That is the exact spot where a test stops catching and starts rubber-stamping: the moment the check is built out of the same well as the answer it is checking. A gate whose validity criteria were themselves inferred, guessed at, generated by the very process under test, is not an independent check anymore. It is an echo with a signature attached. A permutation test only means something because the shuffled data sets are real data, drawn from reality, not from another guess about reality. A postcondition only means something because someone with independent knowledge of the domain wrote it before seeing the output, not after, and not by asking the same model what its own answer should look like.

Cage quality is supposed to dominate model quality, and it does, but only up to the honesty of the cage. Every gate is worth exactly as much as the independence of the ground it stands on. Build the cage from the same clay as the thing it's caging, and the fire in the fireplace stops warming the room. It just starts admiring itself in the glass.

Written by Jon Mayo
