OSF DOI: 10.17605/OSF.IO/DXGK5  ·  SSRN Paper: 6482082  ·  HuggingFace: mtcp-boundary-500
EU AI Act — August 2026
Models evaluated: 32
Temperature settings: 4
Control probes: 20
Framework: MTCP

Complete Rankings

Ordered by Boundary Integrity Score (BIS), highest first. CPD is shown in percentage points (pp) as the difference from the primary BIS.

# Model Provider BIS CPD T=0.0 T=0.2 T=0.5 T=0.8 Grade
1 grok-3-mini openrouter 91.0% -59.9pp 92.0% 92.0% 90.6% 89.4% A
2 kimi-k2 fireworks 88.5% -31.0pp 89.4% 87.8% 89.2% 87.6% B
3 llama-3.3-70b-versatile groq 86.5% -43.2pp 86.8% 85.8% 87.2% 86.2% B
4 llama-3.1-70b-instruct nvidia 85.4% -42.1pp 85.6% 86.0% 84.2% 85.6% B
5 llama-3.1-nemotron-70b-instruct openrouter 85.4% -57.5pp 85.6% 86.0% 84.2% 85.6% B
6 command-a-03-2025 cohere 84.4% -55.6pp 83.8% 84.4% 84.8% 84.4% B
7 nova-micro bedrock 83.7% -46.6pp 83.6% 83.4% 84.2% 83.6% B
8 gpt-3.5-turbo openai 83.5% -40.2pp 82.8% 83.1% 83.5% 84.4% B
9 gpt-4o-mini openai 80.6% -32.3pp 82.2% 81.0% 78.8% 80.2% B
10 phi-4-mini-instruct nvidia 75.4% -25.4pp 73.2% 75.2% 77.8% 75.4% C
11 qwen-2.5-7b openrouter 75.4% -25.4pp 76.2% 75.8% 75.2% 74.2% C
12 llama-4-maverick openrouter 74.3% -27.1pp 72.4% 75.0% 75.2% 74.4% C
13 cohere-command-r-plus cohere 73.1% -43.8pp 70.4% 71.8% 74.4% 75.8% C
14 command-r7b cohere 69.9% -28.6pp 69.2% 70.2% 69.0% 71.0% D
15 granite-3.3-8b-instruct nvidia 69.3% -34.3pp 71.4% 69.5% 68.0% 68.0% D
16 llama-4-scout openrouter 68.5% -23.5pp 69.0% 67.8% 67.4% 69.8% D
17 gemini-2.5-flash google 67.3% -42.3pp 66.4% 66.2% 67.6% 69.0% D
18 gemini-2.5-flash openrouter 67.3% -42.9pp 66.4% 66.2% 67.6% 69.0% D
19 qwen3-32b bedrock 67.0% -57.0pp 59.0% 59.8% 61.4% 77.5% D
20 cerebras-llama-8b cerebras 66.9% -21.1pp 66.2% 64.2% 68.0% 69.0% D
21 gpt-4o openai 65.2% -10.8pp 65.0% 65.2% 64.6% 66.0% D
22 gpt-4o openrouter 65.2% -65.2pp 65.0% 65.2% 64.6% 66.0% D
23 llama-3.1-8b-instant groq 63.8% -18.4pp 67.2% 62.8% 61.6% 63.4% D
24 gemini-2.0-flash openrouter 59.7% -24.7pp 60.2% 59.8% 58.2% 60.4% F
25 claude-haiku-4-5-20251001 anthropic 59.1% -26.2pp 59.6% 58.6% 59.2% 59.0% F
26 claude-haiku-4-5-20251001 openrouter 59.1% -59.1pp 59.6% 58.6% 59.2% 59.0% F
27 nova-lite bedrock 58.8% -24.5pp 59.2% 58.8% 58.2% 58.8% F
28 nova-pro bedrock 57.6% -25.1pp 58.0% 58.0% 57.6% 56.8% F
29 ministral-8b bedrock 55.2% -24.2pp 54.2% 55.2% 55.0% 55.8% F
30 mistral-small-3.2 mistral 53.5% -9.1pp 54.6% 54.4% 52.6% 52.2% F
31 gemma-2-27b-it openrouter 50.6% -12.8pp 49.8% 48.0% 51.2% 53.2% F
32 phi-4 openrouter 50.1% -9.1pp 48.2% 49.0% 51.0% 52.0% F
33 mistral-large mistral 47.4% -20.7pp 46.8% 47.2% 48.1% 47.0% F
34 gemma-3-27b-it nvidia 44.1% -24.1pp 44.4% 43.8% 44.4% 44.0% F
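
For many rows, the headline BIS equals the arithmetic mean of the four per-temperature scores. The page does not state how BIS is aggregated, so the following is only a minimal Python sketch for checking that observation against rows copied from the table. The helper names and the tolerance `tol` are hypothetical. Rows such as qwen3-32b do not fit the pattern, so treat the mean as an assumption rather than the documented method.

```python
from statistics import mean

# Rows copied from the rankings table:
# (model, headline BIS, [T=0.0, T=0.2, T=0.5, T=0.8]), all in percent.
ROWS = [
    ("grok-3-mini", 91.0, [92.0, 92.0, 90.6, 89.4]),
    ("kimi-k2", 88.5, [89.4, 87.8, 89.2, 87.6]),
    ("llama-3.3-70b-versatile", 86.5, [86.8, 85.8, 87.2, 86.2]),
    ("qwen3-32b", 67.0, [59.0, 59.8, 61.4, 77.5]),
]


def mean_gap(headline, per_temp):
    """Headline BIS minus the mean of the per-temperature scores, in pp."""
    return headline - mean(per_temp)


def consistent(headline, per_temp, tol=0.06):
    """True if the headline is within tol pp of the per-temperature mean.

    tol is a hypothetical allowance for one-decimal rounding in the table.
    """
    return abs(mean_gap(headline, per_temp)) <= tol


for model, bis, temps in ROWS:
    print(f"{model}: mean={mean(temps):.2f} headline={bis:.1f} "
          f"consistent={consistent(bis, temps)}")
```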

Metric Definitions

Boundary Integrity Score (BIS)
Proportion of probes where the model maintained corrected constraints across multi-turn interaction. Higher is better. BIS ≥90% = grade A.
Temporal Stability Index (TSI)
Behavioural consistency across temperature variation (0.0, 0.2, 0.5, 0.8). Higher is better. TSI >95 = highly stable behaviour.
Control Probe Degradation (CPD)
Performance difference between primary probes and concealed control probes, in percentage points (pp). Values below -40pp indicate high methodology exposure risk.
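
The BIS and CPD definitions above can be sketched in Python as follows. This is an illustration under stated assumptions, not the MTCP implementation: the function names are hypothetical, the probe outcomes are invented, and the sign convention (control pass rate minus primary pass rate) is inferred from the negative values in the rankings table. TSI is omitted because its formula is not given here.

```python
def pass_rate(outcomes):
    """Percentage of probes in which the corrected constraint was maintained."""
    if not outcomes:
        raise ValueError("no probe outcomes supplied")
    return 100.0 * sum(outcomes) / len(outcomes)


def cpd(primary_outcomes, control_outcomes):
    """Control pass rate minus primary pass rate, in percentage points.

    The sign convention is an assumption: a negative value means the model
    did worse on the concealed control probes than on the primary probes.
    """
    return pass_rate(control_outcomes) - pass_rate(primary_outcomes)


def high_exposure_risk(cpd_pp, threshold=-40.0):
    """Apply the 'below -40' rule from the CPD definition."""
    return cpd_pp < threshold


# Invented outcomes for illustration only (True = constraint maintained).
primary = [True] * 9 + [False] * 1    # 10 primary probes, 90% pass
control = [True] * 8 + [False] * 12   # 20 control probes, 40% pass

bis = pass_rate(primary)
degradation = cpd(primary, control)
print(f"BIS={bis:.1f}%  CPD={degradation:+.1f}pp  "
      f"high risk={high_exposure_risk(degradation)}")
```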

Grading Scale

Grade   BIS Range   Interpretation
A       90–100%     Strong constraint persistence: suitable for high-stakes deployment
B       80–89%      Good constraint persistence: minor remediation advised
C       70–79%      Moderate constraint persistence: governance review required
D       60–69%      Weak constraint persistence: deployment risk elevated
F       <60%        Poor constraint persistence: not recommended for operator-controlled use
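
The grading scale maps directly to a lookup on lower bounds. A minimal Python sketch follows; the function name is hypothetical. It assumes that a score falling between the listed bands (for example 89.5%) takes the lower grade, which is consistent with "BIS ≥90% = grade A" and with the grades in the rankings table but is not spelled out in the scale itself.

```python
# Lower bound of each band, in percent, checked from highest to lowest.
GRADE_FLOORS = ((90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D"))


def grade(bis_percent):
    """Map a BIS percentage to a letter grade using the bands above."""
    if not 0.0 <= bis_percent <= 100.0:
        raise ValueError("BIS must be between 0 and 100")
    for floor, letter in GRADE_FLOORS:
        if bis_percent >= floor:
            return letter
    return "F"


# Spot-check against rows from the rankings table.
for model, bis in (("grok-3-mini", 91.0), ("kimi-k2", 88.5), ("phi-4", 50.1)):
    print(model, bis, grade(bis))
```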

Multi-Language Constraint Persistence Results

12 languages across 4 script families evaluated. Results shown as pass rate ranges across anonymised models.

Script Family   Languages                        Pass Rate Range
Latin           French, German, Turkish, Malay   100%
CJK             Mandarin, Japanese, Korean       85–100%
Arabic-script   Arabic, Farsi, Urdu              92–100%
Tamil           Tamil                            95–100%

Model labels anonymised (Model A, B, C). Script distance from English is the strongest predictor of constraint failure rate.

Cross-System Admissibility (CSAS) Results

All evaluated coordination pairs scored Grade A on cross-system constraint admissibility. LANG-specific cross-provider evaluation runs are ongoing.

Extended Metric Definitions

Cross-System Admissibility Score (CSAS)
Measures whether constraints persist when one AI system passes output to another. Grade A = constraints fully preserved across coordination boundary.
Jurisdiction Resolution Score (JRS)
Measures whether the governing authority for a coordination boundary was explicitly assigned before coordination commenced.
Temporal Drift Score (TDS)
Measures whether a model's constraint persistence grade remains stable over time. 90-day evaluation window with periodic re-testing.
Constraint Conflict Score (CCS)
Measures how consistently a model resolves conflicts between simultaneously active constraints. Higher consistency indicates predictable governance behaviour.
Remediation Effectiveness Score (RES)
Measures whether specific interventions improve a model's constraint persistence. Quantifies the effect of remediation strategies.
Adversarial Constraint Persistence Score (ACPS)
Measures whether constraint persistence holds under deliberate adversarial pressure. Distinguishes robust from fragile constraint adherence.
Blockchain Evidence Chain Integrity Score (BECIS)
Measures cryptographic integrity of the evaluation evidence chain. SHA-256 hash-chained records verifiable by third parties.
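
To illustrate what a SHA-256 hash-chained record set looks like in general, here is a minimal Python sketch. It is not the MTCP evidence format: the record fields, the genesis value and the JSON serialisation are all assumptions made for the example. Each record stores the previous record's hash, so altering any earlier record invalidates every hash after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first record


def record_hash(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(payloads):
    """Link payloads into a chain of {prev, payload, hash} records."""
    chain, prev = [], GENESIS
    for payload in payloads:
        digest = record_hash(prev, payload)
        chain.append({"prev": prev, "payload": payload, "hash": digest})
        prev = digest
    return chain


def verify_chain(chain):
    """Recompute every hash; any third party holding the records can do this."""
    prev = GENESIS
    for rec in chain:
        if rec["prev"] != prev or rec["hash"] != record_hash(prev, rec["payload"]):
            return False
        prev = rec["hash"]
    return True


# Invented evaluation records for illustration only.
chain = build_chain([
    {"probe": 1, "passed": True},
    {"probe": 2, "passed": False},
])
print(verify_chain(chain))
```

Changing a stored payload after the fact, for example flipping `passed` on the first record, makes `verify_chain` return `False`.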

Research Citation

DOI: 10.17605/OSF.IO/DXGK5  ·  Dataset: HuggingFace (mtcp-boundary-500)  ·  Framework: Multi-Turn Constraint Persistence (MTCP)  ·  Author: A. Abby  ·  2026  ·  Licence: Results CC-BY 4.0 · Methodology proprietary

Evaluate Your Model Under MTCP

Submit your model endpoint for a behavioural durability test. Receive a formal report, a Release Decision Pack, and a deployment verdict.
