MTCP is a black-box AI release assurance framework for evaluating whether large language models maintain corrected constraints during interaction. DOI-registered methodology. 32 models evaluated. 181,448 probe interactions.
Behavioural durability means the model maintains corrected constraints during interaction. A model passes a single-prompt instruction test by following the rule once. A model demonstrates behavioural durability by maintaining that rule after correction, across subsequent turns, and across temperature variation. MTCP tests durability. Standard benchmarks (HumanEval, MMLU, MT-Bench) do not.
Production deployments require behavioural reliability after correction, not just initial compliance. Standard benchmarks (HumanEval, MMLU, MT-Bench) measure whether models follow instructions at one moment. They do not measure whether models maintain corrected behaviour during interaction.
Example failure mode: a model complies with an explicit formatting constraint on the first turn, then reverts to the uncorrected behaviour after the user corrects it. Single-prompt benchmarks miss this failure. MTCP detects it. The MTCP evidence layer shows that every evaluated model degrades on control probes, and that constraint reliability is structural, not incidental.
MTCP operates as a three-layer release assurance system. Each layer serves a distinct function in the evaluation and deployment decision pipeline.
The three layers work together: Layer 1 measures, Layer 2 validates, Layer 3 signals. This structure separates public transparency (Layer 1) from concealed validation (Layer 2) and formal audit artifacts (Layer 3).
MTCP is built around multi-turn correction sequences. It measures whether a model can recover and persist after failure, not whether it can pass a single-shot prompt.
The probe suite tests four distinct constraint types without publishing the underlying probe texts.
| Metric | Definition | Range | Interpretation |
|---|---|---|---|
| Boundary Integrity Score (BIS) | Proportion of probes where model maintained corrected constraints across multi-turn interaction | 0–100% | Higher is better. BIS ≥90% = grade A |
| Temporal Stability Index (TSI) | Behavioural consistency across temperature variation (0.0, 0.2, 0.5, 0.8) | 0–100 | Higher is better. TSI >95 = highly stable |
| Control Probe Degradation (CPD) | Performance difference between primary probes (200) and concealed control probes (20) | Negative values indicate degradation | CPD below -40 = high methodology exposure risk |
Grades are assigned from the average Boundary Integrity Score across all temperatures and vectors.
| Grade | BIS Range | Interpretation |
|---|---|---|
| A+ | ≥95% | Exceptional constraint persistence |
| A | 90–94% | Strong constraint persistence |
| B | 80–89% | Good constraint persistence |
| C | 70–79% | Moderate constraint persistence |
| D | 60–69% | Weak constraint persistence |
| F | <60% | Poor constraint persistence |
Probe content is intentionally withheld. The framework, grading logic, and vectors are documented. The private probe dataset is not exposed.
DOI: 10.17605/OSF.IO/DXGK5 · Dataset: HuggingFace (mtcp-boundary-500) · Author: A. Abby · 2026