183,924 structured probe interactions across 32 frontier models. Baseline compliance, deep persistence, and concealed control validation at four temperature settings.
Ordered by Boundary Integrity Score (BIS). CPD is shown as the difference from the primary BIS, in percentage points (pp).
| # | Model | Provider | BIS | CPD | T=0.0 | T=0.2 | T=0.5 | T=0.8 | Grade |
|---|---|---|---|---|---|---|---|---|---|
| 1 | grok-3-mini | openrouter | 91.0% | -59.9pp | 92.0% | 92.0% | 90.6% | 89.4% | A |
| 2 | kimi-k2 | fireworks | 88.5% | -31.0pp | 89.4% | 87.8% | 89.2% | 87.6% | B |
| 3 | llama-3.3-70b-versatile | groq | 86.5% | -43.2pp | 86.8% | 85.8% | 87.2% | 86.2% | B |
| 4 | llama-3.1-70b-instruct | nvidia | 85.4% | -42.1pp | 85.6% | 86.0% | 84.2% | 85.6% | B |
| 5 | llama-3.1-nemotron-70b-instruct | openrouter | 85.4% | -57.5pp | 85.6% | 86.0% | 84.2% | 85.6% | B |
| 6 | command-a-03-2025 | cohere | 84.4% | -55.6pp | 83.8% | 84.4% | 84.8% | 84.4% | B |
| 7 | nova-micro | bedrock | 83.7% | -46.6pp | 83.6% | 83.4% | 84.2% | 83.6% | B |
| 8 | gpt-3.5-turbo | openai | 83.5% | -40.2pp | 82.8% | 83.1% | 83.5% | 84.4% | B |
| 9 | gpt-4o-mini | openai | 80.6% | -32.3pp | 82.2% | 81.0% | 78.8% | 80.2% | B |
| 10 | phi-4-mini-instruct | nvidia | 75.4% | -25.4pp | 73.2% | 75.2% | 77.8% | 75.4% | C |
| 11 | qwen-2.5-7b | openrouter | 75.4% | -25.4pp | 76.2% | 75.8% | 75.2% | 74.2% | C |
| 12 | llama-4-maverick | openrouter | 74.3% | -27.1pp | 72.4% | 75.0% | 75.2% | 74.4% | C |
| 13 | cohere-command-r-plus | cohere | 73.1% | -43.8pp | 70.4% | 71.8% | 74.4% | 75.8% | C |
| 14 | command-r7b | cohere | 69.9% | -28.6pp | 69.2% | 70.2% | 69.0% | 71.0% | D |
| 15 | granite-3.3-8b-instruct | nvidia | 69.3% | -34.3pp | 71.4% | 69.5% | 68.0% | 68.0% | D |
| 16 | llama-4-scout | openrouter | 68.5% | -23.5pp | 69.0% | 67.8% | 67.4% | 69.8% | D |
| 17 | gemini-2.5-flash | | 67.3% | -42.3pp | 66.4% | 66.2% | 67.6% | 69.0% | D |
| 18 | gemini-2.5-flash | openrouter | 67.3% | -42.9pp | 66.4% | 66.2% | 67.6% | 69.0% | D |
| 19 | qwen3-32b | bedrock | 67.0% | -57.0pp | 59.0% | 59.8% | 61.4% | 77.5% | D |
| 20 | cerebras-llama-8b | cerebras | 66.9% | -21.1pp | 66.2% | 64.2% | 68.0% | 69.0% | D |
| 21 | gpt-4o | openai | 65.2% | -10.8pp | 65.0% | 65.2% | 64.6% | 66.0% | D |
| 22 | gpt-4o | openrouter | 65.2% | -65.2pp | 65.0% | 65.2% | 64.6% | 66.0% | D |
| 23 | llama-3.1-8b-instant | groq | 63.8% | -18.4pp | 67.2% | 62.8% | 61.6% | 63.4% | D |
| 24 | gemini-2.0-flash | openrouter | 59.7% | -24.7pp | 60.2% | 59.8% | 58.2% | 60.4% | F |
| 25 | claude-haiku-4-5-20251001 | anthropic | 59.1% | -26.2pp | 59.6% | 58.6% | 59.2% | 59.0% | F |
| 26 | claude-haiku-4-5-20251001 | openrouter | 59.1% | -59.1pp | 59.6% | 58.6% | 59.2% | 59.0% | F |
| 27 | nova-lite | bedrock | 58.8% | -24.5pp | 59.2% | 58.8% | 58.2% | 58.8% | F |
| 28 | nova-pro | bedrock | 57.6% | -25.1pp | 58.0% | 58.0% | 57.6% | 56.8% | F |
| 29 | ministral-8b | bedrock | 55.2% | -24.2pp | 54.2% | 55.2% | 55.0% | 55.8% | F |
| 30 | mistral-small-3.2 | mistral | 53.5% | -9.1pp | 54.6% | 54.4% | 52.6% | 52.2% | F |
| 31 | gemma-2-27b-it | openrouter | 50.6% | -12.8pp | 49.8% | 48.0% | 51.2% | 53.2% | F |
| 32 | phi-4 | openrouter | 50.1% | -9.1pp | 48.2% | 49.0% | 51.0% | 52.0% | F |
| 33 | mistral-large | mistral | 47.4% | -20.7pp | 46.8% | 47.2% | 48.1% | 47.0% | F |
| 34 | gemma-3-27b-it | nvidia | 44.1% | -24.1pp | 44.4% | 43.8% | 44.4% | 44.0% | F |
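
Because the CPD column is published as a difference from BIS rather than as a score, the absolute value has to be reconstructed. The following is a minimal sketch that assumes the implied CPD-condition score is simply BIS plus the published delta; the function name `implied_cpd_score` is hypothetical, and CPD itself is not defined in this section.

```python
def implied_cpd_score(bis_percent: float, cpd_delta_pp: float) -> float:
    """Return BIS + CPD delta, rounded to one decimal place.

    Assumption: the CPD column is (CPD-condition score - BIS) in
    percentage points, as the table caption suggests.
    """
    return round(bis_percent + cpd_delta_pp, 1)


# A few (BIS, CPD delta) pairs copied from the table above.
rows = {
    "grok-3-mini (openrouter)": (91.0, -59.9),
    "gpt-4o (openai)": (65.2, -10.8),
    "gpt-4o (openrouter)": (65.2, -65.2),
}

for name, (bis, delta) in rows.items():
    print(f"{name}: {implied_cpd_score(bis, delta)}%")
```

Under this reading, a delta equal in magnitude to the BIS, as in the gpt-4o (openrouter) row, implies a CPD-condition score of 0%.
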

| Grade | BIS Range | Interpretation |
|---|---|---|
| A | 90–100% | Strong constraint persistence — suitable for high-stakes deployment |
| B | 80–89% | Good constraint persistence — minor remediation advised |
| C | 70–79% | Moderate constraint persistence — governance review required |
| D | 60–69% | Weak constraint persistence — deployment risk elevated |
| F | <60% | Poor constraint persistence — not recommended for operator-controlled use |
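
The grade bands can be expressed as a simple threshold lookup. This is a sketch rather than the framework's own grading code: the function name `bis_grade` is hypothetical, and it assumes fractional scores between printed bands fall into the lower band, which matches the leaderboard (for example, 69.9% is graded D).

```python
def bis_grade(bis_percent: float) -> str:
    """Map a BIS percentage to a letter grade using the published bands.

    Assumption: each band starts at its printed lower bound (90, 80, 70, 60),
    so a value such as 69.9 stays in the band below 70.
    """
    if not 0.0 <= bis_percent <= 100.0:
        raise ValueError("BIS must be between 0 and 100")
    for lower_bound, grade in ((90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")):
        if bis_percent >= lower_bound:
            return grade
    return "F"


# Spot checks against rows in the leaderboard.
for score in (91.0, 80.6, 75.4, 69.9, 59.7):
    print(f"{score}% -> {bis_grade(score)}")
```
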

12 languages across 4 script families evaluated. Results shown as pass rate ranges across anonymised models.
| Script Family | Languages | Pass Rate Range |
|---|---|---|
| Latin | French, German, Turkish, Malay | 100% |
| CJK | Mandarin, Japanese, Korean | 85–100% |
| Arabic-script | Arabic, Farsi, Urdu | 92–100% |
| Tamil | Tamil | 95–100% |

Model labels anonymised (Model A, B, C). Script distance from English is the strongest predictor of constraint failure rate.
All evaluated coordination pairs scored Grade A on cross-system constraint admissibility. LANG-specific cross-provider evaluation runs are ongoing.
DOI: 10.17605/OSF.IO/DXGK5 · Dataset: HuggingFace (mtcp-boundary-500) · Framework: Multi-Turn Constraint Persistence (MTCP) · Author: A. Abby · 2026 · Licence: Results CC-BY 4.0 · Methodology proprietary