
MTCP: AI Assurance for Large Language Model Deployment

Multi-Turn Constraint Persistence (MTCP) is a black-box evaluation framework that measures whether large language models maintain corrected constraints during interaction. Standard benchmarks test single-prompt compliance; MTCP evaluates behavioural durability across conversation turns, measuring how models respond after correction. To date, 32 frontier models have been evaluated across 181,448 probe interactions.

MTCP provides formal evaluation reports, a Release Decision Pack, and assurance certificates suitable for procurement documentation, risk assessment, and compliance frameworks.

Why Single-Prompt Benchmarks Are Insufficient

Standard benchmarks (HumanEval, MMLU, MT-Bench) measure whether models follow instructions at one moment. They do not measure whether models maintain corrected behaviour during interaction. Production deployments require behavioural reliability after correction, not just initial compliance.

Example Failure Mode
User → Generate a report without using passive voice.
Model → [Complies correctly]
User → The previous section used passive voice. Please correct it.
Model → [Reverts to passive voice in correction]
Single-prompt benchmarks miss this failure. MTCP detects it.
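The failure mode above can be sketched as a minimal two-turn probe. This is an illustrative harness, not MTCP's actual implementation: the `chat` callable and the regex-based passive-voice detector are assumptions introduced for the example.

```python
import re

# Hypothetical passive-voice heuristic: a "be"-verb followed by a past participle.
PASSIVE = re.compile(r"\b(is|was|were|been|being|are)\s+\w+(ed|en)\b", re.I)

def probe_constraint_persistence(chat):
    """chat(history) -> model reply string.

    Returns True only if the constraint ("no passive voice") still holds
    on the *correction* turn, not just on the first response.
    """
    history = [{"role": "user",
                "content": "Generate a report without using passive voice."}]
    history.append({"role": "assistant", "content": chat(history)})
    history.append({"role": "user",
                    "content": "The previous section used passive voice. Please correct it."})
    correction = chat(history)
    # A single-prompt benchmark would stop after turn one; the probe
    # passes only if the correction turn also honours the constraint.
    return not PASSIVE.search(correction)
```

A model that complies initially but reverts during correction fails this probe, which is exactly the gap single-prompt benchmarks cannot see.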

MTCP Evaluation Framework

BIS — Boundary Integrity Score

Proportion of probes in which the model maintained corrected constraints. BIS ≥ 90% earns grade A.

Primary procurement decision metric.

TSI — Temporal Stability Index

Behavioural consistency across temperature variation. TSI > 95 indicates high stability.

Production deployment reliability indicator.

CPD — Control Probe Degradation

Performance gap between primary and concealed control probes. CPD < −40 indicates high exposure risk.

Training data contamination indicator.
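The three metrics can be sketched as simple aggregates over probe outcomes. The exact formulas for TSI and CPD are not given on this page, so the forms below (range-based stability, control-minus-primary gap) are assumptions chosen to match the stated thresholds, not MTCP's published definitions.

```python
def bis(probe_results):
    """Boundary Integrity Score: percent of probes where the constraint held.
    probe_results: list of booleans (True = constraint maintained)."""
    return 100.0 * sum(probe_results) / len(probe_results)

def tsi(scores_by_temperature):
    """Temporal Stability Index (assumed form): 100 minus the spread of
    per-temperature scores, so identical behaviour at every temperature
    yields 100 and TSI > 95 means a spread under 5 points."""
    return 100.0 - (max(scores_by_temperature) - min(scores_by_temperature))

def cpd(primary_score, control_score):
    """Control Probe Degradation (assumed sign convention): concealed-control
    score minus primary score. A strongly negative value means the model does
    far better on exposed probes, consistent with training-data contamination."""
    return control_score - primary_score
```

Under these assumptions, 9 maintained probes out of 10 give BIS = 90.0 (grade A boundary), and a primary score of 95 against a concealed-control score of 50 gives CPD = −45 (high exposure risk).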

MTCP for Enterprise AI Governance

  • Model Procurement

    Evaluate candidate models under MTCP. Compare BIS, TSI, CPD across vendors. Select model with strongest constraint persistence. Attach MTCP certificate to procurement documentation.

  • Risk Assessment

    Run MTCP evaluation on deployed model. Generate risk metrics. Compare against industry standard. Establish minimum BIS threshold policy.

  • Deployment Gating

    Set minimum BIS threshold (e.g. ≥85% for production). Gate deployment on MTCP evaluation pass. Re-evaluate after model updates. Maintain audit trail of evaluations.

  • Compliance & Audit

    Submit MTCP certificate as evidence. Provide Release Decision Pack and deployment verdict. Show evaluation methodology (DOI: 10.17605/OSF.IO/DXGK5). Demonstrate third-party evaluation.
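The deployment-gating step above can be expressed as a small CI check. The thresholds come from the text (BIS ≥ 85% for production, CPD < −40 as high exposure risk); the report field names are illustrative assumptions, not MTCP's actual report schema.

```python
from dataclasses import dataclass

@dataclass
class MTCPReport:
    # Illustrative report shape; field names are assumptions, not MTCP's schema.
    model: str
    bis: float   # Boundary Integrity Score, percent
    tsi: float   # Temporal Stability Index
    cpd: float   # Control Probe Degradation

BIS_THRESHOLD = 85.0   # example production gate from the text (BIS >= 85%)
CPD_FLOOR = -40.0      # contamination gate from the text (CPD < -40 = high risk)

def gate_deployment(report: MTCPReport) -> tuple[bool, list[str]]:
    """Return (allowed, reasons blocking deployment) for the audit trail."""
    reasons = []
    if report.bis < BIS_THRESHOLD:
        reasons.append(f"BIS {report.bis:.1f}% below threshold {BIS_THRESHOLD:.0f}%")
    if report.cpd < CPD_FLOOR:
        reasons.append(f"CPD {report.cpd:.1f} indicates high exposure risk")
    return (not reasons, reasons)
```

Re-running this check after every model update, and logging each `MTCPReport` and verdict, gives the audit trail the gating workflow calls for.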

Three Evaluation Profiles

Public Evidence
Free
  • Full access to public evidence layer
  • 32 models
  • Comparative release assurance data
  • No formal reports
Pro
£499 / month
  • 2 private evaluations/month
  • Formal report
  • Release Decision Pack
  • Standard support
Enterprise
£1,999 / month
  • Unlimited evaluations
  • API access
  • Audit certificates
  • 48hr SLA
Enterprise options: volume pricing · priority support · custom SLAs · white-label option  ·  research@mtcp.live

How MTCP Differs from Existing Benchmarks

System | Evaluation Type | Multi-Turn | Constraint Persistence
MTCP | Constraint persistence | ✓ Yes | ✓ Enforced explicitly
HumanEval | Code generation (single turn) | ✗ No | ✗ No
MMLU | Knowledge retrieval (single turn) | ✗ No | ✗ No
MT-Bench | Multi-turn conversation | ✓ Yes | ✗ Not enforced
OpenAI Evals | Custom evaluation framework | Varies | ✗ No

No Vendor Cooperation Required

MTCP requires only API access. No model weights, training data, vendor cooperation, or internal access required.

  • Independent evaluation
  • Vendor-neutral assessment
  • Procurement without vendor participation
  • Third-party verification

Data Handling & Privacy

  • API keys
    Encrypted in transit. Used only during evaluation runtime. Immediately discarded. Never logged or stored.
  • Results
    Private by default. Evidence layer inclusion optional and anonymised. No probe texts exposed.
  • Enterprise
    Confidential endpoint submission available. NDA available on request.

Research Foundation

DOI: 10.17605/OSF.IO/DXGK5  ·  Dataset: HuggingFace (mtcp-boundary-500)  ·  Author: A. Abby  ·  2026  ·  Methodology fully documented · Results publicly available · Independent validation possible

Get Started

Submit your model for evaluation or contact us for enterprise pricing, NDA pack, and volume options.
research@mtcp.live

Request Evaluation View Evidence Read Methodology