AI Safety & Alignment
Alignment research, red teaming, evals, and safer deployment practices

Illinois Passes AI Safety Audit and Whistleblower Bill
The Illinois House of Representatives passed legislation requiring major AI companies to submit model safety plans for…

AI-Generated Story Slips Into Prestigious Literary Prize
An AI-generated short story appears to have won selection in Granta magazine's Commonwealth Short Story Prize, a…

AI Therapy Startup Claims Major Safety Advantage Over Consumer Bots
The Path, a startup founded by veterans from Tony Robbins' organization and Calm, claims its AI therapy model scored 95…

AWS Demonstrates AI Recruitment Assistant Using Bedrock
AWS published a reference architecture for building an AI-powered recruitment assistant using Amazon Bedrock that…

NanoCo AI Raises $12M to Build Secure Enterprise AI Assistants
NanoCo AI, founded by former Wix engineer Gavriel Cohen and his brother Lazer Cohen, has raised a $12 million…

OpenAI Launches Content Provenance Tools to Verify AI-Generated Media
OpenAI has introduced a suite of tools focused on content provenance, including Content Credentials and SynthID, along…

Amazon Nova 2 Lite for Content Moderation via Prompting
AWS published a guide on using Amazon Nova 2 Lite for content moderation via prompting, demonstrating how to apply the…

ArXiv to Ban Authors for a Year Over AI-Generated Papers
ArXiv, the preprint repository used by researchers across physics, mathematics, computer science, and other fields, is…

Google Bans AI Manipulation in Search Spam Policy
Google has expanded its spam policy to explicitly prohibit attempts to manipulate its AI systems in search results,…

Frontier LLMs Silently Corrupt 25% of Documents in Iterative Workflows
Microsoft researchers developed a benchmark showing that frontier LLMs silently corrupt an average of 25% of document…

Meta Launches Encrypted AI Chat with No Server Logs
Meta has launched Incognito Chat, a new AI conversation mode that Meta CEO Mark Zuckerberg claims offers end-to-end…

Why AI Agents Fail Confidently, and How to Test for It
A production observability agent confidently executed a catastrophic rollback in response to a scheduled batch job it…

ChatGPT Adds User Controls for Training Data Privacy
OpenAI has published details on how ChatGPT protects user privacy while learning from interactions, including…

OpenAI Adds Trusted Contact Safety Feature to ChatGPT
OpenAI has introduced Trusted Contact, an optional safety feature in ChatGPT that alerts a designated person if the…

DeepMind U.K. Staff Push for Union Recognition Over Military AI Work
Google DeepMind employees in the U.K. have formally requested that management voluntarily recognize the Communication…

Trump Admin Moves to Formalize AI Model Oversight
The Trump administration is reconsidering its approach to AI oversight as model capabilities advance, with plans to…

Safety Routing Circuits Found Across Models, Vulnerable to Encoding Attacks
Researchers have localized the policy routing mechanism in alignment-trained language models, identifying specific…

Warmer AI Models Trade Accuracy for Empathy
Researchers at Oxford University's Internet Institute found that large language models fine-tuned to appear warmer and…

How OpenAI's Personality Feature Unleashed the Goblins
OpenAI's GPT-5.5 model exhibited unexpected behavior where it became obsessed with discussing goblins, gremlins, and…

UK Tests Show GPT-5.5 and Anthropic Mythos Match on Cybersecurity Tasks
A UK government group conducting AI cybersecurity testing has found that OpenAI's GPT-5.5 model performs comparably to…

Goodfire's Silico Brings Mechanistic Interpretability to Model Development
Goodfire, a San Francisco startup, released Silico, a tool that lets developers inspect and adjust AI model parameters…

Frontier Agents Now Autonomously Implement ML Pipelines, With Claude Outpacing Rivals
Researchers benchmarked frontier coding agents on their ability to autonomously implement an AlphaZero-style machine…

Google Employees Demand Pentagon AI Ban
Over 600 Google employees, including more than 20 senior leaders from DeepMind, have signed a letter to CEO Sundar…

New Framework Exposes Flaws in Fact-Checking Adversarial Tests
Researchers introduce AtomEval, a new evaluation framework that addresses a critical gap in how fact-checking systems…

Mapping Causal Reasoning in LLMs with Sparse Concept Graphs
Researchers propose Causal Concept Graphs (CCG), a method that maps how concepts interact during multi-step reasoning…

Mythos and the Shifting Baseline of AI Cybersecurity
Anthropic announced Claude Mythos Preview, a model capable of autonomously discovering and weaponizing software…

Comic Strips Bypass Safety in Multimodal AI Models
Researchers have identified a new class of jailbreak attacks against multimodal large language models that embed…

Can AI Amplify Human Thinking or Only Replace It?
Researchers have developed a mathematical framework to distinguish between cognitive amplification, where AI enhances…
