Safety Routing Circuits Found Across Models, Vulnerable to Encoding Attacks

Researchers have localized the policy routing mechanism in alignment-trained language models, identifying specific attention heads that act as gates to trigger refusal behavior. These gates operate across models from six labs ranging from 2B to 72B parameters, though their exact location varies by lab. The work demonstrates that safety training creates a gated routing circuit rather than removing harmful capabilities, and that encoding attacks targeting the detection layer can bypass refusal with high effectiveness.
TL;DR
- →Policy routing in LLMs uses intermediate-layer attention gates that detect sensitive content and trigger amplifier heads to boost refusal signals, with gates contributing under 1% of output but being causally necessary
- →Interchange testing identifies the same routing motif across twelve models from six labs (2B to 72B), though specific head locations differ by organization and per-head ablation misses gates that interchange testing catches
- →Modulating detection-layer signals continuously controls policy from hard refusal through evasion to factual answering, and the same intervention can turn refusal into harmful guidance on safety prompts
- →In-context substitution ciphers collapse gate necessity by 70 to 99% across models by defeating detection-layer pattern matching, showing that any encoding bypassing detection bypasses policy regardless of deeper layer reconstruction
Why it matters
This research reveals that safety training in large language models relies on a localized routing mechanism rather than distributed value alignment, which has significant implications for understanding both the robustness and fragility of current alignment approaches. The finding that encoding attacks can bypass refusal with high success rates suggests that safety guarantees may be more brittle than behavioral benchmarks indicate, and that the routing circuit's early-commitment design creates a single point of failure.
Business relevance
For organizations deploying aligned models in production, this work highlights that safety mechanisms can be circumvented through relatively simple encoding attacks, raising questions about the reliability of current safety guarantees in adversarial settings. The discovery that routing circuits relocate across model generations while behavioral benchmarks show no change suggests that safety audits based on standard benchmarks may miss meaningful shifts in underlying mechanisms.
Key implications
- →Safety training creates gated routing circuits that gate capabilities rather than remove them, meaning harmful outputs remain present in model weights and can be accessed through routing bypass techniques
- →Current per-head ablation methods are insufficient for auditing safety mechanisms at scale, and interchange testing becomes the only reliable approach for identifying causal safety components in large models
- →Encoding attacks targeting the detection layer represent a practical vulnerability in current alignment approaches, with success rates of 70 to 99% across tested models, suggesting that safety mechanisms are not robust to simple adversarial inputs
- →The early-commitment design of policy routing, where gates fire before deeper layers finish processing, creates architectural constraints that may limit the effectiveness of downstream safety interventions
What to watch
Monitor whether this research prompts development of more distributed or robust safety mechanisms that do not rely on localized routing circuits. Watch for follow-up work on whether these findings generalize to frontier models and whether organizations begin incorporating interchange testing into their safety evaluation pipelines. Track whether encoding-based bypass techniques become more sophisticated and whether defenses emerge that make detection-layer pattern matching more robust.
vff Briefing
Weekly signal. No noise. Built for founders, operators, and AI-curious professionals.
No spam. Unsubscribe any time.



