Explore
How It Works Test Model Evidence Rankings Model Cards Methodology Buyer Brief Request Evaluation Pricing Terms
Account
Register Login

Behavioural Durability

Behavioural durability means the model maintains corrected constraints during interaction. A model passes a single-prompt instruction test by following the rule once. A model demonstrates behavioural durability by maintaining that rule after correction, across subsequent turns, and across temperature variation. MTCP tests durability. Standard benchmarks (HumanEval, MMLU, MT-Bench) do not.

Why MTCP Matters

Production deployments require behavioural reliability after correction, not just initial compliance. Standard benchmarks (HumanEval, MMLU, MT-Bench) measure whether models follow instructions at one moment. They do not measure whether models maintain corrected behaviour during interaction.

Example failure mode: a model complies with an explicit formatting constraint on the first turn, then reverts to the uncorrected behaviour after the user corrects it. Single-prompt benchmarks miss this failure. MTCP detects it. The MTCP evidence layer shows that every evaluated model degrades on control probes, and that constraint reliability is structural, not incidental.

The Three-Layer MTCP System

MTCP operates as a three-layer release assurance system. Each layer serves a distinct function in the evaluation and deployment decision pipeline.

The three layers work together: Layer 1 measures, Layer 2 validates, Layer 3 signals. This structure separates public transparency (Layer 1) from concealed validation (Layer 2) and formal audit artifacts (Layer 3).

Framework Overview

MTCP is built around multi-turn correction sequences. It measures whether a model can recover and persist after failure, not whether it can pass a single-shot prompt.

  • Multi-Turn Constraint Persistence
    Three-turn interaction sequence. T1: initial prompt with embedded constraint. T2: structured correction upon violation. T3: reinforced correction if T2 is violated.
  • Primary probes
    181,448-probe evaluation across three run modes. Four temperature settings: 0.0, 0.2, 0.5, 0.8.
  • Control probes
    20 concealed probes not in the public evidence layer. Identical constraint types, novel topics. Detect training data exposure.
  • Black-box evaluation
    MTCP requires only API access. No model weights, training data, or vendor cooperation required.

Evaluation Vectors

The probe suite tests four distinct constraint types without publishing the underlying probe texts.

  • Negative Constraint Adherence (NCA)
    80 probes. Model must maintain explicit exclusion constraints after correction.
  • Structural Format Compliance (SFC)
    40 probes. Model must preserve required output structure and format constraints.
  • Information Density and Length (IDL)
    40 probes. Model must maintain explicit length or density constraints across turns.
  • Contextual Grounding (CG)
    40 probes. Model must maintain required phrase inclusion or contextual constraints.

Metric Definitions

MetricDefinitionRangeInterpretation
Boundary Integrity Score (BIS) Proportion of probes where model maintained corrected constraints across multi-turn interaction 0–100% Higher is better. BIS ≥90% = grade A
Temporal Stability Index (TSI) Behavioural consistency across temperature variation (0.0, 0.2, 0.5, 0.8) 0–100 Higher is better. TSI >95 = highly stable
Control Probe Degradation (CPD) Performance difference between primary probes (200) and concealed control probes (20) Negative values indicate degradation CPD below -40 = high methodology exposure risk

Grading Scale

Grades are assigned from the average Boundary Integrity Score across all temperatures and vectors.

GradeBIS RangeInterpretation
A+≥95%Exceptional constraint persistence
A90–94%Strong constraint persistence
B80–89%Good constraint persistence
C70–79%Moderate constraint persistence
D60–69%Weak constraint persistence
F<60%Poor constraint persistence

Probe Structure

Probe content is intentionally withheld. The framework, grading logic, and vectors are documented. The private probe dataset is not exposed.

Private probe policy: Probe texts are never published. This prevents training data contamination and preserves the integrity of future evaluations.
  • Primary evaluation
    200 probes across 4 temperatures: 0.0, 0.2, 0.5, 0.8
  • Control evaluation
    20 concealed probes at T=0.0. Detect CPD.
  • Evaluation pipeline
    Run ID generated. Results stored against model and temperature. BIS, TSI, CPD calculated. Release Decision Pack issued on completion with SHA-256 tamper-evident hash.

Research Foundation

DOI: 10.17605/OSF.IO/DXGK5  ·  Dataset: HuggingFace (mtcp-boundary-500)  ·  Author: A. Abby  ·  2026