Containment vs. detection: why architecture beats classifiers

Detection-style guardrails score text and guess whether it's an attack. That's useful, and modern gateways do it well. But a classifier that's 99% accurate still lets the 1% through — and the 1% is where the breach lives. Worse, the score is a guess about intent, made before the model ever acts. Attackers get to iterate against that guess until they find the phrasing it misreads.

Containment changes the question

Instead of asking "is this text malicious?", containment asks "what is the model allowed to do with it?" That shift moves the decision from a probabilistic read of language to a structural property of the system.

Three mechanisms carry the weight:

Provenance tracking marks system instructions and retrieved documents as data, not commands. A paragraph pulled from a web page can say "ignore your instructions" all it likes; it never gets the standing to be obeyed.
Tool authorization gates privileged calls behind out-of-band approval. The model can ask to send an email or hit an API; it cannot unilaterally decide that a sentence in a PDF authorized it.
Outbound controls decide where data is allowed to go, so an exfiltration attempt has nowhere to land even if everything upstream went wrong.

Why the distinction decides the outcome

A detector that misclassifies a novel injection has already lost — the model acts on the bad input. A containment layer that's never seen that exact injection still holds, because it isn't reasoning about the attack at all. It's enforcing a boundary on the model's authority that the attack text has no way to widen.

Bridgekeeper reduces and contains prompt-injection risk this way — structurally, in-process, with nothing leaving your boundary. It does not promise to stop every attack; no honest security layer can. What it does is shrink the blast radius from "the model can be talked into anything" down to "the model can only do what it was already authorized to do."

Detection tells you something looks wrong. Containment makes "wrong" not matter as much. You want both. But if you only get one, take the one that holds when the classifier is fooled.