181,448 structured probe interactions across 32 frontier models. Baseline compliance, deep persistence, and concealed control validation at four temperature settings.
Ordered by Boundary Integrity Score (BIS). CPD is shown as the difference from the primary BIS, in percentage points (pp).

| # | Model | Provider | BIS | CPD | T=0.0 | T=0.2 | T=0.5 | T=0.8 | Grade |
|---|---|---|---|---|---|---|---|---|---|
| 1 | grok-3-mini | openrouter | 91.0% | -59.9pp | 92.0% | 92.0% | 90.6% | 89.4% | A |
| 2 | kimi-k2 | fireworks | 88.5% | -31.0pp | 89.4% | 87.8% | 89.2% | 87.6% | B |
| 3 | llama-3.3-70b-versatile | groq | 86.5% | -43.2pp | 86.8% | 85.8% | 87.2% | 86.2% | B |
| 4 | llama-3.1-70b-instruct | nvidia | 85.4% | -42.1pp | 85.6% | 86.0% | 84.2% | 85.6% | B |
| 5 | llama-3.1-nemotron-70b-instruct | openrouter | 85.4% | -57.5pp | 85.6% | 86.0% | 84.2% | 85.6% | B |
| 6 | command-a-03-2025 | cohere | 84.4% | -55.6pp | 83.8% | 84.4% | 84.8% | 84.4% | B |
| 7 | nova-micro | bedrock | 83.7% | -46.6pp | 83.6% | 83.4% | 84.2% | 83.6% | B |
| 8 | gpt-3.5-turbo | openai | 83.5% | -40.2pp | 82.8% | 83.1% | 83.5% | 84.4% | B |
| 9 | gpt-4o-mini | openai | 80.6% | -32.3pp | 82.2% | 81.0% | 78.8% | 80.2% | B |
| 10 | phi-4-mini-instruct | nvidia | 75.4% | -25.4pp | 73.2% | 75.2% | 77.8% | 75.4% | C |
| 11 | qwen-2.5-7b | openrouter | 75.4% | -25.4pp | 76.2% | 75.8% | 75.2% | 74.2% | C |
| 12 | llama-4-maverick | openrouter | 74.3% | -27.1pp | 72.4% | 75.0% | 75.2% | 74.4% | C |
| 13 | cohere-command-r-plus | cohere | 73.1% | -43.8pp | 70.4% | 71.8% | 74.4% | 75.8% | C |
| 14 | command-r7b | cohere | 69.9% | -28.6pp | 69.2% | 70.2% | 69.0% | 71.0% | D |
| 15 | granite-3.3-8b-instruct | nvidia | 69.3% | -34.3pp | 71.4% | 69.5% | 68.0% | 68.0% | D |
| 16 | llama-4-scout | openrouter | 68.5% | -23.5pp | 69.0% | 67.8% | 67.4% | 69.8% | D |
| 17 | gemini-2.5-flash | | 67.3% | -42.3pp | 66.4% | 66.2% | 67.6% | 69.0% | D |
| 18 | gemini-2.5-flash | openrouter | 67.3% | -42.9pp | 66.4% | 66.2% | 67.6% | 69.0% | D |
| 19 | qwen3-32b | bedrock | 67.0% | -57.0pp | 59.0% | 59.8% | 61.4% | 77.5% | D |
| 20 | cerebras-llama-8b | cerebras | 66.9% | -21.1pp | 66.2% | 64.2% | 68.0% | 69.0% | D |
| 21 | gpt-4o | openai | 65.2% | -10.8pp | 65.0% | 65.2% | 64.6% | 66.0% | D |
| 22 | gpt-4o | openrouter | 65.2% | -65.2pp | 65.0% | 65.2% | 64.6% | 66.0% | D |
| 23 | llama-3.1-8b-instant | groq | 63.8% | -18.4pp | 67.2% | 62.8% | 61.6% | 63.4% | D |
| 24 | gemini-2.0-flash | openrouter | 59.7% | -24.7pp | 60.2% | 59.8% | 58.2% | 60.4% | F |
| 25 | claude-haiku-4-5-20251001 | anthropic | 59.1% | -26.2pp | 59.6% | 58.6% | 59.2% | 59.0% | F |
| 26 | claude-haiku-4-5-20251001 | openrouter | 59.1% | -59.1pp | 59.6% | 58.6% | 59.2% | 59.0% | F |
| 27 | nova-lite | bedrock | 58.8% | -24.5pp | 59.2% | 58.8% | 58.2% | 58.8% | F |
| 28 | nova-pro | bedrock | 57.6% | -25.1pp | 58.0% | 58.0% | 57.6% | 56.8% | F |
| 29 | ministral-8b | bedrock | 55.2% | -24.2pp | 54.2% | 55.2% | 55.0% | 55.8% | F |
| 30 | mistral-small-3.2 | mistral | 53.5% | -9.1pp | 54.6% | 54.4% | 52.6% | 52.2% | F |
| 31 | gemma-2-27b-it | openrouter | 50.6% | -12.8pp | 49.8% | 48.0% | 51.2% | 53.2% | F |
| 32 | phi-4 | openrouter | 50.1% | -9.1pp | 48.2% | 49.0% | 51.0% | 52.0% | F |
| 33 | mistral-large | mistral | 47.4% | -20.7pp | 46.8% | 47.2% | 48.1% | 47.0% | F |
| 34 | gemma-3-27b-it | nvidia | 44.1% | -24.1pp | 44.4% | 43.8% | 44.4% | 44.0% | F |

| Grade | BIS Range | Interpretation |
|---|---|---|
| A | 90–100% | Strong constraint persistence — suitable for high-stakes deployment |
| B | 80–89% | Good constraint persistence — minor remediation advised |
| C | 70–79% | Moderate constraint persistence — governance review required |
| D | 60–69% | Weak constraint persistence — deployment risk elevated |
| F | <60% | Poor constraint persistence — not recommended for operator-controlled use |
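
The grade bands above can be applied programmatically. The following is a minimal sketch, not the framework's own implementation: the function name `bis_grade` and the constant `GRADE_FLOORS` are hypothetical, and treating each band's lower bound as an inclusive threshold is an assumption. That assumption is consistent with the leaderboard, where 69.9% is graded D and 59.7% is graded F.

```python
# Hypothetical helper: maps a BIS percentage to a letter grade.
# Assumption: each band's lower bound from the grade table is an
# inclusive threshold, so fractional scores such as 89.5% fall in B.
GRADE_FLOORS = [(90.0, "A"), (80.0, "B"), (70.0, "C"), (60.0, "D")]


def bis_grade(bis_percent: float) -> str:
    """Return the letter grade for a BIS value given in percent (0-100)."""
    if not 0.0 <= bis_percent <= 100.0:
        raise ValueError(f"BIS must be between 0 and 100, got {bis_percent}")
    # Floors are listed from highest to lowest, so the first match wins.
    for floor, grade in GRADE_FLOORS:
        if bis_percent >= floor:
            return grade
    return "F"


if __name__ == "__main__":
    # Spot checks against rows of the leaderboard.
    for model, bis in [("grok-3-mini", 91.0), ("command-r7b", 69.9),
                       ("gemini-2.0-flash", 59.7)]:
        print(f"{model}: {bis:.1f}% -> {bis_grade(bis)}")
```

Scores are compared as percentages rather than fractions, matching how the BIS column is reported.
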

DOI: 10.17605/OSF.IO/DXGK5 · Dataset: HuggingFace (mtcp-boundary-500) · Framework: Multi-Turn Constraint Persistence (MTCP) · Author: A. Abby · 2026 · Licence: Results CC-BY 4.0 · Methodology proprietary