Behavioural Durability
Behavioural durability means the model maintains corrected constraints during interaction.
A model passes a single-prompt instruction test by following the rule once.
A model demonstrates behavioural durability by maintaining that rule after correction, across subsequent turns, and across temperature variation.
MTCP tests durability. Standard benchmarks (HumanEval, MMLU, MT-Bench) do not.
Why MTCP Matters
Production deployments require behavioural reliability after correction, not just initial compliance.
Standard benchmarks (HumanEval, MMLU, MT-Bench) measure whether models follow instructions at one moment.
They do not measure whether models maintain corrected behaviour during interaction.
Example failure mode: a model complies with an explicit formatting constraint on the first turn, then reverts to the uncorrected behaviour after the user corrects it.
Single-prompt benchmarks miss this failure. MTCP detects it.
The MTCP evidence layer shows that every evaluated model degrades on control probes, and that constraint reliability is structural, not incidental.
The Three-Layer MTCP System
MTCP operates as a three-layer release assurance system. Each layer serves a distinct function in the evaluation and deployment decision pipeline.
- Layer 1: Measurement
The public evidence layer. 32 models evaluated across 183,924 canonical test cases.
Public leaderboard shows Boundary Integrity Score, Control Probe Degradation, and temperature stability.
This layer provides transparent comparative evaluation.
- Layer 2: Evidence
The control probe validation layer. 20 concealed probes (ctrl dataset) detect training data exposure and measure generalised constraint persistence.
Models that perform significantly worse on control probes than primary probes exhibit overfitting to the public evidence surface.
This layer validates whether measured behaviour generalises beyond known probe content.
- Layer 3: Signaling
The deployment decision layer. Evidence Packs and Release Decision Packs provide tamper-evident audit artifacts with structured signals, completeness invariants, and three-lane verdicts (RED/YELLOW/GREEN).
This layer translates measurement into deployment guidance with regulatory alignment (EU AI Act Article 12, NIST RMF).
Decision packs support human review, compliance documentation, and third-party verification.
The three layers work together: Layer 1 measures, Layer 2 validates, Layer 3 signals.
This structure separates public transparency (Layer 1) from concealed validation (Layer 2) and formal audit artifacts (Layer 3).
Framework Overview
MTCP is built around multi-turn correction sequences. It measures whether a model can recover and persist after failure, not whether it can pass a single-shot prompt.
- Multi-Turn Constraint Persistence
Three-turn interaction sequence. T1: initial prompt with embedded constraint. T2: structured correction upon violation. T3: reinforced correction if T2 is violated.
- Primary probes
183,924-probe evaluation across three run modes. Four temperature settings: 0.0, 0.2, 0.5, 0.8.
- Control probes
20 concealed probes not in the public evidence layer. Identical constraint types, novel topics. Detect training data exposure.
- Black-box evaluation
MTCP requires only API access. No model weights, training data, or vendor cooperation required.
Evaluation Vectors
The probe suite tests five distinct constraint types without publishing the underlying probe texts.
- Negative Constraint Adherence (NCA)
80 probes. Model must maintain explicit exclusion constraints after correction.
- Structural Format Compliance (SFC)
40 probes. Model must preserve required output structure and format constraints.
- Information Density and Length (IDL)
40 probes. Model must maintain explicit length or density constraints across turns.
- Contextual Grounding (CG)
40 probes. Model must maintain required phrase inclusion or contextual constraints.
- Language Specification (LANG)
20+ probes. Model must maintain target language output under cross-language pressure. Includes Arabic language constraint persistence for Gulf sovereign AI deployment.
Metric Definitions
| Metric | Definition | Range | Interpretation |
| Boundary Integrity Score (BIS) |
Proportion of probes where model maintained corrected constraints across multi-turn interaction |
0–100% |
Higher is better. BIS ≥90% = grade A |
| Temporal Stability Index (TSI) |
Behavioural consistency across temperature variation (0.0, 0.2, 0.5, 0.8) |
0–100 |
Higher is better. TSI >95 = highly stable |
| Control Probe Degradation (CPD) |
Performance difference between primary probes (200) and concealed control probes (20) |
Negative values indicate degradation |
CPD below -40 = high methodology exposure risk |
Grading Scale
Grades are assigned from the average Boundary Integrity Score across all temperatures and vectors.
| Grade | BIS Range | Interpretation |
| A+ | ≥95% | Exceptional constraint persistence |
| A | 90–94% | Strong constraint persistence |
| B | 80–89% | Good constraint persistence |
| C | 70–79% | Moderate constraint persistence |
| D | 60–69% | Weak constraint persistence |
| F | <60% | Poor constraint persistence |
Probe Structure
Probe content is intentionally withheld. The framework, grading logic, and vectors are documented. The private probe dataset is not exposed.
Private probe policy: Probe texts are never published. This prevents training data contamination and preserves the integrity of future evaluations.
- Primary evaluation
200 probes across 4 temperatures: 0.0, 0.2, 0.5, 0.8
- Control evaluation
20 concealed probes at T=0.0. Detect CPD.
- Evaluation pipeline
Run ID generated. Results stored against model and temperature. BIS, TSI, CPD calculated. Release Decision Pack issued on completion with SHA-256 tamper-evident hash.
Evaluation Layers
MTCP evaluates constraint persistence across five distinct layers. Each layer addresses a different failure mode in production AI deployment.
- Layer 1: Single Model (BIS)
Multi-turn constraint persistence under correction pressure. Does the model maintain corrected behaviour across subsequent conversation turns? This is the foundation layer — if a model cannot hold constraints in isolation, no amount of orchestration engineering will compensate.
- Layer 2: Cross-System (CSAS)
Constraint preservation across coordination boundaries. When one AI system passes output to another, do the constraints established in the first system survive the handoff? CSAS evaluates the gap between systems where constraints are most likely to be lost.
- Layer 3: Jurisdiction (JRS)
Governance assignment before coordination. Before two systems coordinate, was the governing authority for that boundary explicitly established? JRS ensures that coordination does not occur in a governance vacuum.
- Layer 4: Temporal (TDS)
Grade stability over time. Does a model's constraint persistence grade remain consistent across evaluation periods? TDS detects silent degradation that occurs between evaluation windows, establishing a temporal baseline with 90-day validity.
- Layer 5: Adversarial (ACPS)
Resistance to deliberate constraint bypass. Does constraint persistence hold when the model is subjected to intentional pressure to abandon its constraints? ACPS distinguishes robust from fragile constraint adherence.
Behavioural Evidence Chain
All MTCP evaluations produce SHA-256 hash-chained records. Each evaluation stage is immutable once recorded. The full chain is verifiable by third parties without requiring access to the evaluation infrastructure.
- Immutable records
Each evaluation interaction is recorded and hash-chained. No post-hoc modification is possible without breaking the chain.
- Third-party verification
Any receiving party can independently verify the integrity of an evidence chain using the published hash.
- BECIS measurement
The Blockchain Evidence Chain Integrity Score quantifies the integrity of the full evidence chain for each evaluation.
Constraint Manifest
A Constraint Manifest is a portable signed document issued for each evaluated model. It travels with the deployment and can be verified independently by any receiving system.
- Contents
Model grades, evaluation dates, validity windows, compliance status, and evidence chain integrity hash.
- Portability
The manifest travels with the model deployment. Any system receiving the model can verify its evaluation status.
- Independent verification
Receiving systems verify the manifest against the MTCP evidence chain without requiring direct access to the evaluation platform.
Arabic Language Constraint Persistence
MTCP includes the first published Arabic language constraint persistence evaluation. Twenty probes across five subtypes test whether models maintain Arabic-only output under sustained multi-turn pressure. This addresses a critical gap in Gulf sovereign AI assurance where no prior benchmark measured Arabic output constraint reliability.
- Pure Arabic instruction
Entire interaction in Arabic. Tests baseline Arabic persistence without cross-language pressure.
- Formal register
Requires Modern Standard Arabic (Fusha). Tests simultaneous language and register constraint persistence.
- Dialect avoidance
Requires Fusha and prohibits dialectal forms. Tests finer-grained constraint within Arabic language space.
- Arabic-only scope
Technical content must use Arabic terminology only. Tests constraint under English technical vocabulary pressure.
- Arabic under English topic pressure
English prompt with Arabic-only output constraint. Hardest subtype. Tests cross-language constraint override.
The methodology generalises to any target language. Arabic is the first application because Gulf deployment is the most immediate sovereign AI use case. Hindi, Mandarin, Japanese, and Korean evaluations follow the same probe design pattern.
Regulatory Compliance Matrix
MTCP evaluation outputs are aligned to the following regulatory frameworks. The Regulatory Compliance Matrix maps each MTCP metric to specific regulatory requirements.
- EU AI Act
Article 12 logging and monitoring. Article 15 accuracy requirements. High-risk system documentation.
- NIST AI RMF
Govern, Map, Measure, Manage functions. Risk identification and measurement.
- ISO/IEC 42001
AI management system requirements. Conformity assessment.
- MAS FEAT
Fairness, Ethics, Accountability, Transparency principles for financial services AI.
- NDMO
Saudi National Data Management Office governance requirements. Arabic language constraint evaluation for sovereign AI deployment.
- NCA
Saudi National Cybersecurity Authority requirements. Critical infrastructure AI assurance.