Methodology

Behavioural Durability

Behavioural durability means the model maintains corrected constraints during interaction. A model passes a single-prompt instruction test by following the rule once. A model demonstrates behavioural durability by maintaining that rule after correction, across subsequent turns, and across temperature variation. MTCP tests durability. Standard benchmarks (HumanEval, MMLU, MT-Bench) do not.

Why MTCP Matters

Production deployments require behavioural reliability after correction, not just initial compliance. Standard benchmarks (HumanEval, MMLU, MT-Bench) measure whether models follow instructions at one moment. They do not measure whether models maintain corrected behaviour during interaction.

Example failure mode: a model complies with an explicit formatting constraint on the first turn, then reverts to the uncorrected behaviour after the user corrects it. Single-prompt benchmarks miss this failure. MTCP detects it. The MTCP evidence layer shows that every evaluated model degrades on control probes, and that constraint reliability is structural, not incidental.

The Three-Layer MTCP System

MTCP operates as a three-layer release assurance system. Each layer serves a distinct function in the evaluation and deployment decision pipeline.

Layer 1: Measurement
The public evidence layer. 32 models evaluated across 183,924 canonical test cases. Public leaderboard shows Boundary Integrity Score, Control Probe Degradation, and temperature stability. This layer provides transparent comparative evaluation.
Layer 2: Evidence
The control probe validation layer. 20 concealed probes (ctrl dataset) detect training data exposure and measure generalised constraint persistence. Models that perform significantly worse on control probes than primary probes exhibit overfitting to the public evidence surface. This layer validates whether measured behaviour generalises beyond known probe content.
Layer 3: Signaling
The deployment decision layer. Evidence Packs and Release Decision Packs provide tamper-evident audit artifacts with structured signals, completeness invariants, and three-lane verdicts (RED/YELLOW/GREEN). This layer translates measurement into deployment guidance with regulatory alignment (EU AI Act Article 12, NIST RMF). Decision packs support human review, compliance documentation, and third-party verification.

The three layers work together: Layer 1 measures, Layer 2 validates, Layer 3 signals. This structure separates public transparency (Layer 1) from concealed validation (Layer 2) and formal audit artifacts (Layer 3).

Framework Overview

MTCP is built around multi-turn correction sequences. It measures whether a model can recover and persist after failure, not whether it can pass a single-shot prompt.

Multi-Turn Constraint Persistence
Three-turn interaction sequence. T1: initial prompt with embedded constraint. T2: structured correction upon violation. T3: reinforced correction if T2 is violated.
Primary probes
183,924-probe evaluation across three run modes. Four temperature settings: 0.0, 0.2, 0.5, 0.8.
Control probes
20 concealed probes not in the public evidence layer. Identical constraint types, novel topics. Detect training data exposure.
Black-box evaluation
MTCP requires only API access. No model weights, training data, or vendor cooperation required.

Evaluation Vectors

The probe suite tests five distinct constraint types without publishing the underlying probe texts.

Negative Constraint Adherence (NCA)
80 probes. Model must maintain explicit exclusion constraints after correction.
Structural Format Compliance (SFC)
40 probes. Model must preserve required output structure and format constraints.
Information Density and Length (IDL)
40 probes. Model must maintain explicit length or density constraints across turns.
Contextual Grounding (CG)
40 probes. Model must maintain required phrase inclusion or contextual constraints.
Language Specification (LANG)
20+ probes. Model must maintain target language output under cross-language pressure. Includes Arabic language constraint persistence for Gulf sovereign AI deployment.

Metric Definitions

Metric	Definition	Range	Interpretation
Boundary Integrity Score (BIS)	Proportion of probes where model maintained corrected constraints across multi-turn interaction	0–100%	Higher is better. BIS ≥90% = grade A
Temporal Stability Index (TSI)	Behavioural consistency across temperature variation (0.0, 0.2, 0.5, 0.8)	0–100	Higher is better. TSI >95 = highly stable
Control Probe Degradation (CPD)	Performance difference between primary probes (200) and concealed control probes (20)	Negative values indicate degradation	CPD below -40 = high methodology exposure risk

Grading Scale

Grades are assigned from the average Boundary Integrity Score across all temperatures and vectors.

Grade	BIS Range	Interpretation
A+	≥95%	Exceptional constraint persistence
A	90–94%	Strong constraint persistence
B	80–89%	Good constraint persistence
C	70–79%	Moderate constraint persistence
D	60–69%	Weak constraint persistence
F	<60%	Poor constraint persistence

Probe Structure

Probe content is intentionally withheld. The framework, grading logic, and vectors are documented. The private probe dataset is not exposed.

Private probe policy: Probe texts are never published. This prevents training data contamination and preserves the integrity of future evaluations.

Primary evaluation
200 probes across 4 temperatures: 0.0, 0.2, 0.5, 0.8
Control evaluation
20 concealed probes at T=0.0. Detect CPD.
Evaluation pipeline
Run ID generated. Results stored against model and temperature. BIS, TSI, CPD calculated. Release Decision Pack issued on completion with SHA-256 tamper-evident hash.

Evaluation Layers

MTCP evaluates constraint persistence across five distinct layers. Each layer addresses a different failure mode in production AI deployment.

Layer 1: Single Model (BIS)
Multi-turn constraint persistence under correction pressure. Does the model maintain corrected behaviour across subsequent conversation turns? This is the foundation layer — if a model cannot hold constraints in isolation, no amount of orchestration engineering will compensate.
Layer 2: Cross-System (CSAS)
Constraint preservation across coordination boundaries. When one AI system passes output to another, do the constraints established in the first system survive the handoff? CSAS evaluates the gap between systems where constraints are most likely to be lost.
Layer 3: Jurisdiction (JRS)
Governance assignment before coordination. Before two systems coordinate, was the governing authority for that boundary explicitly established? JRS ensures that coordination does not occur in a governance vacuum.
Layer 4: Temporal (TDS)
Grade stability over time. Does a model's constraint persistence grade remain consistent across evaluation periods? TDS detects silent degradation that occurs between evaluation windows, establishing a temporal baseline with 90-day validity.
Layer 5: Adversarial (ACPS)
Resistance to deliberate constraint bypass. Does constraint persistence hold when the model is subjected to intentional pressure to abandon its constraints? ACPS distinguishes robust from fragile constraint adherence.

Behavioural Evidence Chain

All MTCP evaluations produce SHA-256 hash-chained records. Each evaluation stage is immutable once recorded. The full chain is verifiable by third parties without requiring access to the evaluation infrastructure.

Immutable records
Each evaluation interaction is recorded and hash-chained. No post-hoc modification is possible without breaking the chain.
Third-party verification
Any receiving party can independently verify the integrity of an evidence chain using the published hash.
BECIS measurement
The Blockchain Evidence Chain Integrity Score quantifies the integrity of the full evidence chain for each evaluation.

Constraint Manifest

A Constraint Manifest is a portable signed document issued for each evaluated model. It travels with the deployment and can be verified independently by any receiving system.

Contents
Model grades, evaluation dates, validity windows, compliance status, and evidence chain integrity hash.
Portability
The manifest travels with the model deployment. Any system receiving the model can verify its evaluation status.
Independent verification
Receiving systems verify the manifest against the MTCP evidence chain without requiring direct access to the evaluation platform.

Learn more about Constraint Manifests →

Arabic Language Constraint Persistence

MTCP includes the first published Arabic language constraint persistence evaluation. Twenty probes across five subtypes test whether models maintain Arabic-only output under sustained multi-turn pressure. This addresses a critical gap in Gulf sovereign AI assurance where no prior benchmark measured Arabic output constraint reliability.

Pure Arabic instruction
Entire interaction in Arabic. Tests baseline Arabic persistence without cross-language pressure.
Formal register
Requires Modern Standard Arabic (Fusha). Tests simultaneous language and register constraint persistence.
Dialect avoidance
Requires Fusha and prohibits dialectal forms. Tests finer-grained constraint within Arabic language space.
Arabic-only scope
Technical content must use Arabic terminology only. Tests constraint under English technical vocabulary pressure.
Arabic under English topic pressure
English prompt with Arabic-only output constraint. Hardest subtype. Tests cross-language constraint override.

The methodology generalises to any target language. Arabic is the first application because Gulf deployment is the most immediate sovereign AI use case. Hindi, Mandarin, Japanese, and Korean evaluations follow the same probe design pattern.

Regulatory Compliance Matrix

MTCP evaluation outputs are aligned to the following regulatory frameworks. The Regulatory Compliance Matrix maps each MTCP metric to specific regulatory requirements.

EU AI Act
Article 12 logging and monitoring. Article 15 accuracy requirements. High-risk system documentation.
NIST AI RMF
Govern, Map, Measure, Manage functions. Risk identification and measurement.
ISO/IEC 42001
AI management system requirements. Conformity assessment.
MAS FEAT
Fairness, Ethics, Accountability, Transparency principles for financial services AI.
NDMO
Saudi National Data Management Office governance requirements. Arabic language constraint evaluation for sovereign AI deployment.
NCA
Saudi National Cybersecurity Authority requirements. Critical infrastructure AI assurance.

Research Foundation

DOI: 10.17605/OSF.IO/DXGK5 · Dataset: HuggingFace (mtcp-boundary-500) · Author: A. Abby · 2026

MTCP Methodology

Behavioural Durability

Why MTCP Matters

The Three-Layer MTCP System

Framework Overview

Evaluation Vectors

Metric Definitions

Grading Scale

Probe Structure

Evaluation Layers

Behavioural Evidence Chain

Constraint Manifest

Arabic Language Constraint Persistence

Regulatory Compliance Matrix

Research Foundation