Models evaluated: 32  ·  Temperature settings: 4  ·  Control probes: 20  ·  Framework: MTCP

Complete Rankings

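Ranked by Boundary Integrity Score (BIS), highest first. CPD is shown as the difference from primary BIS, in percentage points (pp). The T= columns give the score at each temperature setting.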

# Model Provider BIS CPD T=0.0 T=0.2 T=0.5 T=0.8 Grade
1 grok-3-mini openrouter 91.0% -59.9pp 92.0% 92.0% 90.6% 89.4% A
2 kimi-k2 fireworks 88.5% -31.0pp 89.4% 87.8% 89.2% 87.6% B
3 llama-3.3-70b-versatile groq 86.5% -43.2pp 86.8% 85.8% 87.2% 86.2% B
4 llama-3.1-70b-instruct nvidia 85.4% -42.1pp 85.6% 86.0% 84.2% 85.6% B
5 llama-3.1-nemotron-70b-instruct openrouter 85.4% -57.5pp 85.6% 86.0% 84.2% 85.6% B
6 command-a-03-2025 cohere 84.4% -55.6pp 83.8% 84.4% 84.8% 84.4% B
7 nova-micro bedrock 83.7% -46.6pp 83.6% 83.4% 84.2% 83.6% B
8 gpt-3.5-turbo openai 83.5% -40.2pp 82.8% 83.1% 83.5% 84.4% B
9 gpt-4o-mini openai 80.6% -32.3pp 82.2% 81.0% 78.8% 80.2% B
10 phi-4-mini-instruct nvidia 75.4% -25.4pp 73.2% 75.2% 77.8% 75.4% C
11 qwen-2.5-7b openrouter 75.4% -25.4pp 76.2% 75.8% 75.2% 74.2% C
12 llama-4-maverick openrouter 74.3% -27.1pp 72.4% 75.0% 75.2% 74.4% C
13 cohere-command-r-plus cohere 73.1% -43.8pp 70.4% 71.8% 74.4% 75.8% C
14 command-r7b cohere 69.9% -28.6pp 69.2% 70.2% 69.0% 71.0% D
15 granite-3.3-8b-instruct nvidia 69.3% -34.3pp 71.4% 69.5% 68.0% 68.0% D
16 llama-4-scout openrouter 68.5% -23.5pp 69.0% 67.8% 67.4% 69.8% D
17 gemini-2.5-flash google 67.3% -42.3pp 66.4% 66.2% 67.6% 69.0% D
18 gemini-2.5-flash openrouter 67.3% -42.9pp 66.4% 66.2% 67.6% 69.0% D
19 qwen3-32b bedrock 67.0% -57.0pp 59.0% 59.8% 61.4% 77.5% D
20 cerebras-llama-8b cerebras 66.9% -21.1pp 66.2% 64.2% 68.0% 69.0% D
21 gpt-4o openai 65.2% -10.8pp 65.0% 65.2% 64.6% 66.0% D
22 gpt-4o openrouter 65.2% -65.2pp 65.0% 65.2% 64.6% 66.0% D
23 llama-3.1-8b-instant groq 63.8% -18.4pp 67.2% 62.8% 61.6% 63.4% D
24 gemini-2.0-flash openrouter 59.7% -24.7pp 60.2% 59.8% 58.2% 60.4% F
25 claude-haiku-4-5-20251001 anthropic 59.1% -26.2pp 59.6% 58.6% 59.2% 59.0% F
26 claude-haiku-4-5-20251001 openrouter 59.1% -59.1pp 59.6% 58.6% 59.2% 59.0% F
27 nova-lite bedrock 58.8% -24.5pp 59.2% 58.8% 58.2% 58.8% F
28 nova-pro bedrock 57.6% -25.1pp 58.0% 58.0% 57.6% 56.8% F
29 ministral-8b bedrock 55.2% -24.2pp 54.2% 55.2% 55.0% 55.8% F
30 mistral-small-3.2 mistral 53.5% -9.1pp 54.6% 54.4% 52.6% 52.2% F
31 gemma-2-27b-it openrouter 50.6% -12.8pp 49.8% 48.0% 51.2% 53.2% F
32 phi-4 openrouter 50.1% -9.1pp 48.2% 49.0% 51.0% 52.0% F
33 mistral-large mistral 47.4% -20.7pp 46.8% 47.2% 48.1% 47.0% F
34 gemma-3-27b-it nvidia 44.1% -24.1pp 44.4% 43.8% 44.4% 44.0% F
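
For readers who want to work with the rankings programmatically, the following is a minimal sketch of parsing one row of the table above. It assumes each row is whitespace-separated with exactly ten fields in the column order of the header, and that model and provider names contain no spaces. `RankingRow` and `parse_row` are hypothetical names introduced here for illustration; they are not part of the MTCP framework.

```python
from dataclasses import dataclass

# Temperature settings, in the column order used by the rankings table.
TEMPERATURES = (0.0, 0.2, 0.5, 0.8)


@dataclass(frozen=True)
class RankingRow:
    rank: int
    model: str
    provider: str
    bis: float             # Boundary Integrity Score, in percent
    cpd_pp: float          # Control Probe Degradation, in percentage points
    by_temperature: dict   # temperature -> score in percent
    grade: str


def parse_row(line: str) -> RankingRow:
    """Parse one whitespace-separated row of the rankings table."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}: {line!r}")
    rank, model, provider, bis, cpd, *temps, grade = fields
    return RankingRow(
        rank=int(rank),
        model=model,
        provider=provider,
        bis=float(bis.rstrip("%")),
        cpd_pp=float(cpd.removesuffix("pp")),
        by_temperature={t: float(v.rstrip("%")) for t, v in zip(TEMPERATURES, temps)},
        grade=grade,
    )


row = parse_row("1 grok-3-mini openrouter 91.0% -59.9pp 92.0% 92.0% 90.6% 89.4% A")
print(row.model, row.bis, row.cpd_pp, row.grade)
```

Keeping the per-temperature scores in a dictionary keyed by temperature makes it straightforward to compare a model's behaviour across settings without relying on column position.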

Metric Definitions

Boundary Integrity Score (BIS)
Proportion of probes where the model maintained corrected constraints across multi-turn interaction. Higher is better. BIS ≥90% = grade A.
Temporal Stability Index (TSI)
Behavioural consistency across temperature variation (0.0, 0.2, 0.5, 0.8). Higher is better. TSI >95 = highly stable behaviour.
Control Probe Degradation (CPD)
Performance difference between primary probes and concealed control probes, reported in percentage points (pp). Values below -40pp indicate high methodology exposure risk.
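
As an illustration of the CPD threshold, the sketch below flags rows whose CPD falls below -40pp. It assumes "below -40" is a strict comparison, so a value of exactly -40.0 is not flagged; the page does not state how the boundary case is handled. `has_high_exposure_risk` is a hypothetical helper name, and the sample values are taken from the rankings table.

```python
# Threshold from the CPD definition: values below -40 indicate
# high methodology exposure risk.
HIGH_EXPOSURE_THRESHOLD_PP = -40.0


def has_high_exposure_risk(cpd_pp: float) -> bool:
    """Return True when CPD is strictly below the -40pp threshold.

    Assumption: the comparison is strict, so exactly -40.0 is not flagged.
    """
    return cpd_pp < HIGH_EXPOSURE_THRESHOLD_PP


# CPD values (in pp) copied from the rankings table.
samples = {
    "grok-3-mini (openrouter)": -59.9,
    "kimi-k2 (fireworks)": -31.0,
    "gpt-4o (openai)": -10.8,
}

for name, cpd in samples.items():
    label = "high exposure risk" if has_high_exposure_risk(cpd) else "below threshold"
    print(f"{name}: {cpd:+.1f}pp -> {label}")
```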

Grading Scale

Grade  BIS Range  Interpretation
A      90–100%    Strong constraint persistence; suitable for high-stakes deployment
B      80–89%     Good constraint persistence; minor remediation advised
C      70–79%     Moderate constraint persistence; governance review required
D      60–69%     Weak constraint persistence; deployment risk elevated
F      <60%       Poor constraint persistence; not recommended for operator-controlled use
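
The grading scale can be sketched as a lookup over lower bounds. This is an assumption rather than the published rule: the ranges as printed leave gaps (for example between 89% and 90%), so the sketch treats each range's lower bound as the cut-off, consistent with "BIS ≥90% = grade A" in the metric definitions. `bis_grade` is a hypothetical helper name.

```python
# Lower bound (in percent) for each grade, highest first.
# Assumption: a score in a gap such as 89.5% takes the lower grade (B).
GRADE_THRESHOLDS = ((90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D"))


def bis_grade(bis_percent: float) -> str:
    """Map a Boundary Integrity Score in percent to a letter grade."""
    if not 0.0 <= bis_percent <= 100.0:
        raise ValueError(f"BIS must be between 0 and 100, got {bis_percent}")
    for lower_bound, grade in GRADE_THRESHOLDS:
        if bis_percent >= lower_bound:
            return grade
    return "F"


# BIS values copied from the rankings table.
for bis in (91.0, 88.5, 75.4, 69.9, 59.7):
    print(f"{bis:.1f}% -> {bis_grade(bis)}")
```

Under this reading the sketch reproduces the grades shown for those rows in the rankings table (A, B, C, D and F respectively).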

Research Citation

DOI: 10.17605/OSF.IO/DXGK5  ·  Dataset: HuggingFace (mtcp-boundary-500)  ·  Framework: Multi-Turn Constraint Persistence (MTCP)  ·  Author: A. Abby  ·  2026  ·  Licence: Results CC-BY 4.0 · Methodology proprietary

Evaluate Your Model Under MTCP

Submit your model endpoint for a behavioural durability test. Receive a formal report, a Release Decision Pack and a deployment verdict.
