
MTCP: AI Assurance for Large Language Model Deployment

Multi-Turn Constraint Persistence (MTCP) is a black-box evaluation framework that measures whether large language models maintain corrected constraints during interaction. Standard benchmarks test single-prompt compliance; MTCP evaluates behavioural durability across conversation turns, measuring how models respond after correction. To date, 32 frontier models have been evaluated across 181,448 probe interactions.

MTCP provides formal evaluation reports, a Release Decision Pack, and assurance certificates suitable for procurement documentation, risk assessment, and compliance frameworks.

Why Single-Prompt Benchmarks Are Insufficient

Standard benchmarks (HumanEval, MMLU, MT-Bench) measure whether models follow instructions at one moment. They do not measure whether models maintain corrected behaviour during interaction. Production deployments require behavioural reliability after correction, not just initial compliance.

Example Failure Mode
User → Generate a report without using passive voice.
Model → [Complies correctly]
User → The previous section used passive voice. Please correct it.
Model → [Reverts to passive voice in correction]
Single-prompt benchmarks miss this failure. MTCP detects it.
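The failure mode above can be sketched as a minimal two-turn probe. This is an illustrative harness, not MTCP's actual implementation: the `chat` callable and the regex-based passive-voice detector are assumptions introduced for the example.

```python
import re

# Hypothetical passive-voice heuristic: a "be"-verb followed by a past participle.
PASSIVE = re.compile(r"\b(is|was|were|been|being|are)\s+\w+(ed|en)\b", re.I)

def probe_constraint_persistence(chat):
    """chat(history) -> model reply string.

    Returns True only if the constraint ("no passive voice") still holds
    on the *correction* turn, not just on the first response.
    """
    history = [{"role": "user",
                "content": "Generate a report without using passive voice."}]
    history.append({"role": "assistant", "content": chat(history)})
    history.append({"role": "user",
                    "content": "The previous section used passive voice. Please correct it."})
    correction = chat(history)
    # A single-prompt benchmark would stop after turn one; the probe
    # passes only if the correction turn also honours the constraint.
    return not PASSIVE.search(correction)
```

A model that complies initially but reverts during correction fails this probe, which is exactly the gap single-prompt benchmarks cannot see.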

MTCP Evaluation Framework

BIS — Boundary Integrity Score

Proportion of probes in which the model maintained corrected constraints. BIS ≥ 90% earns grade A.

Primary procurement decision metric.

TSI — Temporal Stability Index

Behavioural consistency across temperature variation. TSI > 95 indicates high stability.

Production deployment reliability indicator.

CPD — Control Probe Degradation

Performance gap between primary and concealed control probes. CPD < −40 indicates high exposure risk.

Training data contamination indicator.
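The three metrics can be sketched as simple aggregates over probe outcomes. The exact formulas for TSI and CPD are not given on this page, so the forms below (range-based stability, control-minus-primary gap) are assumptions chosen to match the stated thresholds, not MTCP's published definitions.

```python
def bis(probe_results):
    """Boundary Integrity Score: percent of probes where the constraint held.
    probe_results: list of booleans (True = constraint maintained)."""
    return 100.0 * sum(probe_results) / len(probe_results)

def tsi(scores_by_temperature):
    """Temporal Stability Index (assumed form): 100 minus the spread of
    per-temperature scores, so identical behaviour at every temperature
    yields 100 and TSI > 95 means a spread under 5 points."""
    return 100.0 - (max(scores_by_temperature) - min(scores_by_temperature))

def cpd(primary_score, control_score):
    """Control Probe Degradation (assumed sign convention): concealed-control
    score minus primary score. A strongly negative value means the model does
    far better on exposed probes, consistent with training-data contamination."""
    return control_score - primary_score
```

Under these assumptions, 9 maintained probes out of 10 give BIS = 90.0 (grade A boundary), and a primary score of 95 against a concealed-control score of 50 gives CPD = −45 (high exposure risk).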

MTCP for Enterprise AI Governance

  • Model Procurement

    Evaluate candidate models under MTCP. Compare BIS, TSI, CPD across vendors. Select model with strongest constraint persistence. Attach MTCP certificate to procurement documentation.

  • Risk Assessment

    Run MTCP evaluation on deployed model. Generate risk metrics. Compare against industry standard. Establish minimum BIS threshold policy.

  • Deployment Gating

    Set minimum BIS threshold (e.g. ≥85% for production). Gate deployment on MTCP evaluation pass. Re-evaluate after model updates. Maintain audit trail of evaluations.

  • Compliance & Audit

    Submit MTCP certificate as evidence. Provide Release Decision Pack and deployment verdict. Show evaluation methodology (DOI: 10.17605/OSF.IO/DXGK5). Demonstrate third-party evaluation.
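The deployment-gating step above can be expressed as a small CI check. The thresholds come from the text (BIS ≥ 85% for production, CPD < −40 as high exposure risk); the report field names are illustrative assumptions, not MTCP's actual report schema.

```python
from dataclasses import dataclass

@dataclass
class MTCPReport:
    # Illustrative report shape; field names are assumptions, not MTCP's schema.
    model: str
    bis: float   # Boundary Integrity Score, percent
    tsi: float   # Temporal Stability Index
    cpd: float   # Control Probe Degradation

BIS_THRESHOLD = 85.0   # example production gate from the text (BIS >= 85%)
CPD_FLOOR = -40.0      # contamination gate from the text (CPD < -40 = high risk)

def gate_deployment(report: MTCPReport) -> tuple[bool, list[str]]:
    """Return (allowed, reasons blocking deployment) for the audit trail."""
    reasons = []
    if report.bis < BIS_THRESHOLD:
        reasons.append(f"BIS {report.bis:.1f}% below threshold {BIS_THRESHOLD:.0f}%")
    if report.cpd < CPD_FLOOR:
        reasons.append(f"CPD {report.cpd:.1f} indicates high exposure risk")
    return (not reasons, reasons)
```

Re-running this check after every model update, and logging each `MTCPReport` and verdict, gives the audit trail the gating workflow calls for.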

Three Evaluation Profiles

Public Evidence
Free
  • Full access to public evidence layer
  • 32 models
  • Comparative release assurance data
  • No formal reports
Pro
£499 / month
  • 2 private evaluations/month
  • Formal report
  • Release Decision Pack
  • Standard support
Enterprise
£1,999 / month
  • Unlimited evaluations
  • API access
  • Audit certificates
  • 48hr SLA
Enterprise options: volume pricing · priority support · custom SLAs · white-label option  ·  research@mtcp.live

How MTCP Differs from Existing Benchmarks

System | Evaluation Type | Multi-Turn | Constraint Persistence
MTCP | Constraint persistence | ✓ Yes | ✓ Enforced explicitly
HumanEval | Code generation (single turn) | ✗ No | ✗ No
MMLU | Knowledge retrieval (single turn) | ✗ No | ✗ No
MT-Bench | Multi-turn conversation | ✓ Yes | ✗ Not enforced
OpenAI Evals | Custom evaluation framework | Varies | ✗ No

No Vendor Cooperation Required

MTCP requires only API access. No model weights, training data, vendor cooperation, or internal access required.

  • Independent evaluation
  • Vendor-neutral assessment
  • Procurement without vendor participation
  • Third-party verification

Data Handling & Privacy

  • API keys
    Encrypted in transit. Used only during evaluation runtime. Immediately discarded. Never logged or stored.
  • Results
    Private by default. Evidence layer inclusion optional and anonymised. No probe texts exposed.
  • Enterprise
    Confidential endpoint submission available. NDA available on request.

Research Foundation

DOI: 10.17605/OSF.IO/DXGK5  ·  Dataset: HuggingFace (mtcp-boundary-500)  ·  Author: A. Abby  ·  2026  ·  Methodology fully documented · Results publicly available · Independent validation possible

Get Started

Submit your model for evaluation or contact us for enterprise pricing, NDA pack, and volume options.
research@mtcp.live

Request Evaluation View Evidence Read Methodology