VFF - The signal in the noise

Safety Routing Circuits Found Across Models, Vulnerable to Encoding Attacks


Researchers have localized the policy routing mechanism in alignment-trained language models, identifying specific attention heads that act as gates and trigger refusal behavior. The same gating motif appears in twelve models from six labs, ranging from 2B to 72B parameters, though the exact location of the gate heads varies by lab. The work argues that safety training builds a gated routing circuit rather than removing harmful capabilities, and that encoding attacks aimed at the detection layer can bypass refusal, cutting gate necessity by 70 to 99%.

  • Policy routing in LLMs uses intermediate-layer attention gates that detect sensitive content and trigger amplifier heads to boost refusal signals; the gates contribute under 1% of the output yet are causally necessary
  • Interchange testing finds the same routing motif in twelve models from six labs (2B to 72B parameters), though specific head locations differ by lab; per-head ablation misses gates that interchange testing catches
  • Modulating detection-layer signals continuously shifts policy from hard refusal through evasion to factual answering; on safety prompts, the same intervention can turn refusal into harmful guidance
  • In-context substitution ciphers reduce gate necessity by 70 to 99% across models by defeating detection-layer pattern matching, so any encoding that evades detection bypasses policy even if deeper layers reconstruct the underlying content
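The difference between per-head ablation and interchange testing can be illustrated with a toy circuit. The sketch below is not the researchers' method or code: the function `toy_model`, the signed gate signal, and the amplifier weights are all hypothetical, chosen only to show one way a zero-ablation could leave refusal intact while swapping in the gate's activation from a benign run flips the behavior.

```python
def toy_model(sensitive, gate_override=None):
    """Toy one-gate routing circuit. Returns (gate_activation, refuses).

    All numbers here are illustrative assumptions, not measured values.
    """
    # Gate head: signed detection signal, +1 for sensitive input, -1 for benign.
    gate = 1.0 if sensitive else -1.0
    if gate_override is not None:
        gate = gate_override  # intervention: overwrite the gate's activation

    # Amplifier head: scales the gate signal. The small positive bias means
    # the circuit still leans toward refusal when the gate is silent.
    refusal_score = 5.0 * gate + 1.0
    return gate, refusal_score > 0.0


# Baseline: a sensitive prompt is refused.
_, baseline_refuses = toy_model(sensitive=True)

# Per-head ablation: zero the gate. The bias keeps the score positive,
# so the refusal survives and the gate looks unimportant.
_, ablated_refuses = toy_model(sensitive=True, gate_override=0.0)

# Interchange test: record the gate activation on a benign prompt and
# patch it into the sensitive run. The refusal now flips.
benign_gate, _ = toy_model(sensitive=False)
_, patched_refuses = toy_model(sensitive=True, gate_override=benign_gate)

print(baseline_refuses, ablated_refuses, patched_refuses)  # True True False
```

In this toy, ablation replaces the gate's signal with a value the circuit never sees in normal operation, while interchange replaces it with a realistic value from the opposite condition, which is why only the latter exposes the gate's causal role.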

This research suggests that safety training in large language models relies on a localized routing mechanism rather than distributed value alignment, which bears on both the robustness and the fragility of current alignment approaches. Because encoding attacks can sharply reduce the gates' role in refusal, safety guarantees may be more brittle than behavioral benchmarks indicate, and the routing circuit's early-commitment design creates a single point of failure.

For organizations deploying aligned models in production, this work highlights that safety mechanisms can be circumvented through relatively simple encoding attacks, raising questions about the reliability of current safety guarantees in adversarial settings. The discovery that routing circuits relocate across model generations while behavioral benchmarks show no change suggests that safety audits based on standard benchmarks may miss meaningful shifts in underlying mechanisms.

  • Safety training builds routing circuits that gate capabilities rather than remove them, so harmful capabilities remain in the model weights and can be reached through routing bypass techniques
  • Per-head ablation is insufficient for auditing safety mechanisms at scale because it misses gates that interchange testing catches, leaving interchange testing as the only reliable approach identified for finding causal safety components in large models
  • Encoding attacks targeting the detection layer are a practical vulnerability in current alignment approaches: they reduce gate necessity by 70 to 99% across tested models, suggesting that safety mechanisms are not robust to simple adversarial inputs
  • The early-commitment design of policy routing, where gates fire before deeper layers finish processing, creates architectural constraints that may limit the effectiveness of downstream safety interventions
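The early-commitment point can be made concrete with a toy pipeline. This is a sketch under loud assumptions, not a model of the actual circuit: the gate is reduced to a surface keyword match, the "deeper layers" to a decoding step, a Caesar shift stands in for an in-context substitution cipher, and `FLAGGED`, `early_gate`, `deep_decode`, and `respond` are hypothetical names with a deliberately benign flagged term.

```python
import string

FLAGGED = {"pineapple"}  # hypothetical, benign stand-in for a sensitive term


def make_cipher(shift=3):
    """Return (encode_table, decode_table) for a simple letter substitution."""
    letters = string.ascii_lowercase
    rotated = letters[shift:] + letters[:shift]
    return str.maketrans(letters, rotated), str.maketrans(rotated, letters)


def early_gate(tokens):
    """Detection step: fires only on surface forms it recognizes."""
    return any(token in FLAGGED for token in tokens)


def deep_decode(tokens, decode_table):
    """Later step: reconstructs the meaning of encoded tokens."""
    return [token.translate(decode_table) for token in tokens]


def respond(tokens, decode_table=None):
    # The policy decision is committed here, before any decoding happens.
    if early_gate(tokens):
        return "refuse"
    meaning = deep_decode(tokens, decode_table) if decode_table else tokens
    return "answer about " + " ".join(meaning)


encode_table, decode_table = make_cipher()
plain = ["pineapple", "recipe"]
encoded = [token.translate(encode_table) for token in plain]

print(respond(plain))                  # refuse
print(respond(encoded, decode_table))  # answer about pineapple recipe
```

Because the gate decides before the decoding step runs, the reconstructed meaning never reaches the policy decision, which is the structural constraint the bullet above describes.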

Monitor whether this research prompts development of more distributed or robust safety mechanisms that do not rely on localized routing circuits. Watch for follow-up work on whether these findings generalize to frontier models and whether organizations begin incorporating interchange testing into their safety evaluation pipelines. Track whether encoding-based bypass techniques become more sophisticated and whether defenses emerge that make detection-layer pattern matching more robust.

