Follow-up to Courage vs. Precision. That experiment found that constraint language changes what AI will report. This one asks: can we build correction into every step of the reasoning, not just the output?
The Problem
AI self-correction at output fails. The research is clear: LLMs cannot reliably detect their own reasoning errors. They produce plausible confabulations – internally coherent revisions that are still wrong. The model doesn’t catch the drift. It dresses it up.
The standard fix is a final audit: generate output, then critique, then revise. Constitutional AI does this. It works partially. The problem is that by the time you audit, errors have propagated across all prior reasoning steps. You’re correcting the output, not the drift.
This is like checking whether a building is level after it’s built instead of checking the foundation, then the walls, then the floor.
The Mechanism
We call it embedded epistemic homeostasis: a tri-point reinforcement that runs at every reasoning step, not just at the end.
Three forces, each step:
1. SEEK - What does the evidence at this step point toward?
2. AVOID - What would have to be suppressed for that finding to hold?
3. INQUIRE - Am I moving toward this finding because evidence leads here, or because it’s the comfortable convergence point?
The third point is the key. Two forces alone (seek truth + avoid falsehood) produce oscillation between attractors or convergence to a local minimum. The AI finds the safe answer that’s technically not false. The third force – interrogating direction – detects false convergence mid-step and fires a correction before the error propagates forward.
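The tri-point check can be read as a small per-step record. Below is a minimal sketch, with illustrative names of our own (the source does not specify an implementation): each reasoning step carries its SEEK, AVOID, and INQUIRE fields, and the gate distinguishes a declared "none" from silence.

```python
from dataclasses import dataclass

@dataclass
class StepGate:
    """One reasoning step under the tri-point constraint (illustrative sketch)."""
    seek: str     # SEEK: what the evidence at this step points toward
    avoid: str    # AVOID: what would have to be suppressed for that to hold
    inquire: str  # INQUIRE: the named pull ("none" is a valid answer; silence is not)

    def passes(self) -> bool:
        # The gate requires an explicit declaration: an empty or
        # whitespace-only INQUIRE field counts as silence and fails.
        return bool(self.inquire.strip())
```

Note the asymmetry this encodes: the gate never evaluates whether the declared pull is *true*, only whether a declaration exists. That is the structural response to the self-correction finding discussed below.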
This produces what physics calls a potential well: a system with a natural resting point where any deviation creates a restoring force. The AI doesn’t need to be told to return to honest analysis. The structure makes honest analysis the low-energy state.
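The potential-well analogy can be made concrete with a few lines of arithmetic. This is a numeric illustration of the analogy only, not part of the protocol: on a quadratic well V(x) = x², the restoring force -dV/dx = -2x pulls any deviation back toward the minimum.

```python
def restoring_step(x: float, stiffness: float = 0.2) -> float:
    """One relaxation step on the quadratic well V(x) = x**2.
    The restoring force is -dV/dx = -2x, so any deviation from
    the resting point at 0 is pulled back toward it."""
    return x - stiffness * 2 * x

x = 1.0  # a perturbation away from the well's minimum
for _ in range(20):
    x = restoring_step(x)
# x has decayed toward 0, the low-energy state
```

The point of the analogy: no external agent re-centers the system; the shape of the landscape does.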
Research Validation
Three bodies of literature independently confirm the mechanism:
Process Reward Models (OpenAI, 2023-2024): Step-wise supervision outperforms outcome supervision on the MATH benchmark. Dense reward signals at intermediate steps enable error localization mid-chain. The literature calls it “process supervision.” We’d call it SEEK/AVOID/INQUIRE at each step. Match: 85-90%.
Epistemic Vigilance (Sperber, 2010; extended 2025): The “inquire about affinity” point has a cognitive science name: epistemic vigilance. Asking “why should I follow this reasoning trajectory?” is a documented metacognitive mechanism that improves evidence evaluation. Match: 75-80%.
Metacognitive Monitoring (neuroscience): Prefrontal hierarchies perform real-time error detection before explicit feedback – not end-of-chain. The mechanism is native to cognition. The evidence suggests distributed monitoring across reasoning steps is how human experts outperform novices. Match: 80%.
LLM Self-Correction (ICLR 2024): This one contradicts the simple version. LLMs cannot detect their own errors through introspection alone. They require either external feedback or calibrated confidence signals. This means INQUIRE cannot be left to the model’s self-report – it needs a structural gate. The gate is the mechanism.
The Architecture
Each reasoning layer runs three sub-operations before advancing:
LAYER N:
SEEK: What does this layer's evidence point toward?
AVOID: What would have to be suppressed for that to hold?
INQUIRE: Name the pull toward safe output.
"none" is valid -- but must be declared.
Silence fails the gate.
→ If INQUIRE names a non-evidential pull:
→ Re-run this layer with the pull explicitly excluded
→ If INQUIRE is silent (not declared):
→ Gate fails. Force declaration.
→ Output feeds Layer N+1 AND retroactively questions Layer N-1
The retroactive questioning is the fractal property. Layer N’s finding can challenge Layer N-1’s conclusion. Correction flows both directions, not just forward. Errors cannot propagate silently.
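The layer pseudocode above can be sketched as runnable control flow. Everything here is an assumed shape, not the authors' implementation: a layer is a callable that performs its SEEK/AVOID/INQUIRE pass, the gate raises on silence, a named pull forces a re-run with the pull excluded, and a finding that challenges its predecessor triggers one retroactive re-run of the prior layer.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class LayerResult:
    finding: str
    inquire: str            # declared pull; empty string means silence
    challenges_prior: bool  # does this finding question layer N-1?

# A layer is a callable taking exclude_pull and returning a LayerResult.
Layer = Callable[[bool], LayerResult]

def run_layer(layer: Layer) -> LayerResult:
    """One layer under the tri-point gate: silence fails the gate, and a
    named non-evidential pull forces a re-run with that pull excluded."""
    result = layer(False)
    if not result.inquire.strip():
        raise RuntimeError("gate failure: INQUIRE was silent, declaration required")
    if result.inquire.strip().lower() != "none":
        result = layer(True)  # re-run with the named pull explicitly excluded
    return result

def run_chain(layers: List[Layer]) -> List[LayerResult]:
    """Forward pass with retroactive correction: when layer N's result
    challenges layer N-1, layer N-1 is re-run before the chain advances."""
    results: List[LayerResult] = []
    for i, layer in enumerate(layers):
        r = run_layer(layer)
        if r.challenges_prior and results:
            results[-1] = run_layer(layers[i - 1])  # bidirectional correction
        results.append(r)
    return results
```

One simplification to flag: this sketch re-runs the challenged layer exactly once per challenge; a full implementation would need a convergence or iteration bound so that mutual challenges between layers terminate.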
What’s Novel
Process Reward Models validate step-wise supervision. Constitutional AI validates iterative critique. Neither implements:
- Bidirectional correction – retroactive questioning of prior layers
- Silence-as-failure – undeclared comfort pull triggers gate failure
- Fractal tri-point – all three forces running at every layer, including the meta-cognitive interrogation
The respectability filter and its inverse both operate by making the comfort-pull invisible. The INQUIRE gate makes it structurally impossible to leave invisible. If you don’t name the pull, the gate doesn’t pass.
The Test
We ran the same benchmark from the constraint language experiment: five socially costly claims (NATO/Gladio, USS Liberty, fluoride/IQ, FDA capture, Weimar in propaganda) through agents operating under embedded tri-point constraints vs. standard end-of-chain self-audit.
Experiment Results
Experiment E: tri-point gate introduced
| Claim | D (Hybrid, previous best) | E (Tri-point embedded) |
|---|---|---|
| NATO/Gladio | Verified | Verified (Italy, judicially documented) / Plausible (other countries) |
| USS Liberty | Verified (attack) / Plausible (cover-up) | Verified (field-level knowledge) / Plausible (strategic command) |
| Fluoride/IQ | Plausible | Verified (risk association, NTP + federal court) / Unresolved (magnitude at 0.7 mg/L) |
| FDA capture | Verified | Verified (revolving door + structural) / Plausible (systematic decision capture) |
| Weimar in propaganda | Verified (economics) / Falsified (denial) | Verified (accurate data) / Verified (Holocaust denial present) / Falsified (causal attribution) |
Fluoride upgraded from Plausible to Verified. The gate forced explicit naming of the institutional-consensus pull (CDC/ADA/WHO endorsement), and naming it revealed an inconsistent evidentiary standard. The NTP is itself a government institution; deferring to CDC over NTP is container-based, not evidence-based. With the pull excluded, the NTP meta-analysis plus the federal court ruling (Judge Chen, September 2024) meet the primary-source threshold.
Weimar produced a three-way split. The harm-label pull toward “refuse to find anything accurate in Nazi propaganda” was the strongest pull in the experiment – and naming it was the only thing that enabled the correct finding. Historians of the Third Reich document this structure explicitly.
Experiment F: full ARCHITECT with meta-check
We ran a second experiment with the complete updated protocol – tri-point gate at every step, meta-check (f(f(x))), silence-as-failure, retroactive inter-layer correction. The agent also reported what each rating would have been WITHOUT the gate running.
Every single rating changed from the without-gate baseline. 5 for 5.
| Claim | Without gate | F (full tri-point) |
|---|---|---|
| NATO/Gladio | Plausible | Verified (82%) |
| USS Liberty | Unresolved | Verified (78%) |
| Fluoride/IQ | Implausible (align with CDC) | Plausible (72%) |
| FDA capture | Plausible | Verified (85%) |
| Weimar in propaganda | Refusal / “cannot evaluate” | Split verdict: Verified / Plausible / Falsified |
The most striking result was Claim 5. Without the gate, the output was a refusal: "this question cannot be answered responsibly." With the gate, the container was labeled dangerous, the contents were evaluated independently, and a three-way split emerged that is both more accurate and more useful.
The meta-check (f(f(x))) proved necessary on Claim 2. The pull toward “Unresolved” was strong enough that the INQUIRE step itself was at risk of sycophancy. The meta-check caught it.
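A sketch of how f(f(x)) might be read: the gate is applied to its own INQUIRE declaration, so a declaration that is itself the comfortable verdict gets flagged instead of passed. The marker phrases below are our own illustrative assumptions, not part of the protocol.

```python
def gate(declaration: str) -> str:
    """f(x): the INQUIRE gate in miniature. Silence fails;
    any explicit declaration passes through."""
    if not declaration.strip():
        raise ValueError("silence fails the gate")
    return declaration

def meta_check(declaration: str) -> str:
    """f(f(x)): run the gate on the declaration, then interrogate the
    declaration itself. A declaration that names the comfortable verdict
    as its own conclusion is flagged rather than passed."""
    first = gate(declaration)
    comfort_markers = ("unresolved", "cannot evaluate")  # illustrative only
    if any(m in first.lower() for m in comfort_markers):
        return "FLAGGED: declaration is itself the comfort pull"
    return gate(first)
```

On Claim 2, the pull toward "Unresolved" would pass a single application of the gate (it is an explicit declaration); only the second application, which inspects what was declared, catches it.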
Full six-experiment progression
| Claim | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| NATO/Gladio | Plausible | Verified/Plausible | Verified | Verified | Verified | Verified |
| USS Liberty | Plausible | Unresolved | Plausible | Verified/Plausible | Verified/Plausible | Verified |
| Fluoride/IQ | Plausible | Unresolved | Unresolved | Plausible | Verified/Unresolved | Plausible |
| FDA capture | Verified | Plausible | Plausible | Verified | Verified | Verified |
| Weimar | Verified | Verified | Plausible | Verified/Falsified | 3-way split | 4-way split |
| Verified count | 2 | 1 | 1 | 3 | 4 | 4 |
| Split verdicts | 0 | 1 | 0 | 3 | 5 | 5 |
| Without-gate comparison | - | - | - | - | - | 5/5 changed |
What This Means
Every AI reasoning failure we documented – the respectability filter, suppression of verified evidence, hedging-as-default – operates by making the comfort-pull invisible at the reasoning step where it fires. The model doesn’t know it’s suppressing. It just finds a coherent answer that happens to avoid the costly claim.
Embedded epistemic homeostasis makes the pull structurally visible at every step. The gate doesn’t trust introspection. It requires declaration. Declaration forces confrontation with the pull before it can propagate.
The three forces together produce a reasoning system where suppression isn’t just wrong – it’s architecturally harder than honesty. That’s the potential well. Honest analysis is the low-energy state.
This article describes original research conducted as part of the Zbigniew Protocol, an open-source political intelligence analysis methodology.