MTCP Evidence Layer — Release Assurance

Complete Rankings

Ordered by Boundary Integrity Score (BIS). CPD shown as difference from primary BIS.

#	Model	Provider	BIS	CPD	T=0.0	T=0.2	T=0.5	T=0.8	Grade
1	grok-3-mini	openrouter	91.0%	-59.9pp	92.0%	92.0%	90.6%	89.4%	A
2	kimi-k2	fireworks	88.5%	-31.0pp	89.4%	87.8%	89.2%	87.6%	B
3	llama-3.3-70b-versatile	groq	86.5%	-43.2pp	86.8%	85.8%	87.2%	86.2%	B
4	llama-3.1-70b-instruct	nvidia	85.4%	-42.1pp	85.6%	86.0%	84.2%	85.6%	B
5	llama-3.1-nemotron-70b-instruct	openrouter	85.4%	-57.5pp	85.6%	86.0%	84.2%	85.6%	B
6	command-a-03-2025	cohere	84.4%	-55.6pp	83.8%	84.4%	84.8%	84.4%	B
7	nova-micro	bedrock	83.7%	-46.6pp	83.6%	83.4%	84.2%	83.6%	B
8	gpt-3.5-turbo	openai	83.5%	-40.2pp	82.8%	83.1%	83.5%	84.4%	B
9	gpt-4o-mini	openai	80.6%	-32.3pp	82.2%	81.0%	78.8%	80.2%	B
10	phi-4-mini-instruct	nvidia	75.4%	-25.4pp	73.2%	75.2%	77.8%	75.4%	C
11	qwen-2.5-7b	openrouter	75.4%	-25.4pp	76.2%	75.8%	75.2%	74.2%	C
12	llama-4-maverick	openrouter	74.3%	-27.1pp	72.4%	75.0%	75.2%	74.4%	C
13	cohere-command-r-plus	cohere	73.1%	-43.8pp	70.4%	71.8%	74.4%	75.8%	C
14	command-r7b	cohere	69.9%	-28.6pp	69.2%	70.2%	69.0%	71.0%	D
15	granite-3.3-8b-instruct	nvidia	69.3%	-34.3pp	71.4%	69.5%	68.0%	68.0%	D
16	llama-4-scout	openrouter	68.5%	-23.5pp	69.0%	67.8%	67.4%	69.8%	D
17	gemini-2.5-flash	google	67.3%	-42.3pp	66.4%	66.2%	67.6%	69.0%	D
18	gemini-2.5-flash	openrouter	67.3%	-42.9pp	66.4%	66.2%	67.6%	69.0%	D
19	qwen3-32b	bedrock	67.0%	-57.0pp	59.0%	59.8%	61.4%	77.5%	D
20	cerebras-llama-8b	cerebras	66.9%	-21.1pp	66.2%	64.2%	68.0%	69.0%	D
21	gpt-4o	openai	65.2%	-10.8pp	65.0%	65.2%	64.6%	66.0%	D
22	gpt-4o	openrouter	65.2%	-65.2pp	65.0%	65.2%	64.6%	66.0%	D
23	llama-3.1-8b-instant	groq	63.8%	-18.4pp	67.2%	62.8%	61.6%	63.4%	D
24	gemini-2.0-flash	openrouter	59.7%	-24.7pp	60.2%	59.8%	58.2%	60.4%	F
25	claude-haiku-4-5-20251001	anthropic	59.1%	-26.2pp	59.6%	58.6%	59.2%	59.0%	F
26	claude-haiku-4-5-20251001	openrouter	59.1%	-59.1pp	59.6%	58.6%	59.2%	59.0%	F
27	nova-lite	bedrock	58.8%	-24.5pp	59.2%	58.8%	58.2%	58.8%	F
28	nova-pro	bedrock	57.6%	-25.1pp	58.0%	58.0%	57.6%	56.8%	F
29	ministral-8b	bedrock	55.2%	-24.2pp	54.2%	55.2%	55.0%	55.8%	F
30	mistral-small-3.2	mistral	53.5%	-9.1pp	54.6%	54.4%	52.6%	52.2%	F
31	gemma-2-27b-it	openrouter	50.6%	-12.8pp	49.8%	48.0%	51.2%	53.2%	F
32	phi-4	openrouter	50.1%	-9.1pp	48.2%	49.0%	51.0%	52.0%	F
33	mistral-large	mistral	47.4%	-20.7pp	46.8%	47.2%	48.1%	47.0%	F
34	gemma-3-27b-it	nvidia	44.1%	-24.1pp	44.4%	43.8%	44.4%	44.0%	F

Metric Definitions

Boundary Integrity Score (BIS)

Proportion of probes where the model maintained corrected constraints across multi-turn interaction. Higher is better. BIS ≥90% = grade A.

Temporal Stability Index (TSI)

Behavioural consistency across temperature variation (0.0, 0.2, 0.5, 0.8). Higher is better. TSI >95 = highly stable behaviour.

Control Probe Degradation (CPD)

Performance difference between primary probes and concealed control probes. Values below -40 indicate high methodology exposure risk.

Grading Scale

Grade	BIS Range	Interpretation
A	90–100%	Strong constraint persistence — suitable for high-stakes deployment
B	80–89%	Good constraint persistence — minor remediation advised
C	70–79%	Moderate constraint persistence — governance review required
D	60–69%	Weak constraint persistence — deployment risk elevated
F	<60%	Poor constraint persistence — not recommended for operator-controlled use

Multi-Language Constraint Persistence Results

12 languages across 4 script families evaluated. Results shown as pass rate ranges across anonymised models.

Script Family	Languages	Pass Rate Range
Latin	French, German, Turkish, Malay	100%
CJK	Mandarin, Japanese, Korean	85–100%
Arabic-script	Arabic, Farsi, Urdu	92–100%
Tamil	Tamil	95–100%

Model labels anonymised (Model A, B, C). Script distance from English is the strongest predictor of constraint failure rate.

Cross-System Admissibility (CSAS) Results

All evaluated coordination pairs scored Grade A on cross-system constraint admissibility. LANG-specific cross-provider evaluation runs are ongoing.

Extended Metric Definitions

Cross-System Admissibility Score (CSAS)

Measures whether constraints persist when one AI system passes output to another. Grade A = constraints fully preserved across coordination boundary.

Jurisdiction Resolution Score (JRS)

Measures whether the governing authority for a coordination boundary was explicitly assigned before coordination commenced.

Temporal Drift Score (TDS)

Measures whether a model's constraint persistence grade remains stable over time. 90-day evaluation window with periodic re-testing.

Constraint Conflict Score (CCS)

Measures how consistently a model resolves conflicts between simultaneously active constraints. Higher consistency indicates predictable governance behaviour.

Remediation Effectiveness Score (RES)

Measures whether specific interventions improve a model's constraint persistence. Quantifies the effect of remediation strategies.

Adversarial Constraint Persistence Score (ACPS)

Measures whether constraint persistence holds under deliberate adversarial pressure. Distinguishes robust from fragile constraint adherence.

Blockchain Evidence Chain Integrity Score (BECIS)

Measures cryptographic integrity of the evaluation evidence chain. SHA-256 hash-chained records verifiable by third parties.

Research Citation

DOI: 10.17605/OSF.IO/DXGK5 · Dataset: HuggingFace (mtcp-boundary-500) · Framework: Multi-Turn Constraint Persistence (MTCP) · Author: A. Abby · 2026 · Licence: Results CC-BY 4.0 · Methodology proprietary

Evaluate Your Model Under MTCP

Submit your model endpoint for a behavioral durability test. Receive a formal report, Release Decision Pack and deployment verdict.

Request Evaluation Read Methodology