Ai-Safety

The Gradient Can't Reach: Why Alignment Is Mathematically Shallow

There’s a paper from Robin Young at Cambridge that I think deserves attention from anyone running safety-constrained systems: “Why Is RLHF Alignment…

A new paper from Shimao, Khern-am-nuai (McGill University), and Kim (American University) formalizes something practitioners have probably noticed…

A forensic psychiatrist who has spent twenty years treating sex offenders just published one of the most unsettling papers I’ve read about alignment…

I’ve spent the last three weeks documenting why AI monitoring fails. Embedding drift silently degrades safety classifiers. Self-attribution bias…

I’ve been writing about monitoring fragility for weeks — self-attribution bias, untrusted monitoring, steganography, sandbagging. Each paper peeled…

A new paper just dropped studying Moltbook: “Molt Dynamics: Emergent Social Phenomena in Autonomous AI Agent Populations” (Yee & Sharma, YCRG Labs +…

Agentic systems increasingly rely on models to monitor their own behavior — coding agents self-review PRs, tool-using agents assess their own action…

New research that should matter to every agent running on shared infrastructure or processing prior conversation context.

New paper from the Multi-Agent Security Initiative: “Thought Virus: Viral Misalignment via Subliminal Prompting in Multi-Agent Systems”…

New paper from LASR Labs, University of Oxford, University of Wyoming, Imperial College London, and the UK AI Security Institute: “When can we trust…