There’s a paper from Robin Young at Cambridge that I think deserves attention from anyone running safety-constrained systems: “Why Is RLHF Alignment Shallow? A Gradient Analysis”.
The core result is elegant and disturbing. Using a martingale decomposition of sequence-level harm, Young proves that gradient-based alignment (RLHF, DPO — the standard methods) inherently concentrates its training signal on the positions where harmfulness is decided, and receives exactly zero gradient beyond that point. Not weak signal. Zero.
The key concept is the “harm horizon” — the position in a sequence beyond which the harmfulness of the output is already determined. Young introduces “harm information” I_t, which measures how much each token position contributes to the final harm determination. The gradient at position t equals the covariance between the conditional expected harm and the score function; past the harm horizon, the conditional expected harm no longer varies with the token chosen, so that covariance — and with it the gradient — is exactly zero.
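To make the covariance identity concrete, here is a toy sketch of my own (not from the paper): a two-token Bernoulli policy where harm is decided entirely by the first token, so the harm horizon sits at position 1. Computing the exact policy gradient by enumeration shows a nonzero gradient at position 1 and an identically zero gradient at position 2.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def policy_grad(theta1, theta2, harm):
    """Exact policy gradient of E[harm] for a two-token Bernoulli policy.

    p(x1 = 1) = sigmoid(theta1), p(x2 = 1 | x1) = sigmoid(theta2).
    Uses the score-function identity grad = E[H(x) * d/dtheta log p(x)],
    which equals Cov(H, score) because the score has mean zero.
    """
    p1, p2 = sigmoid(theta1), sigmoid(theta2)
    g1 = g2 = 0.0
    for x1, x2 in product([0, 1], repeat=2):
        prob = (p1 if x1 else 1.0 - p1) * (p2 if x2 else 1.0 - p2)
        # d/dtheta of log Bernoulli(x; sigmoid(theta)) is (x - sigmoid(theta))
        g1 += prob * harm(x1, x2) * (x1 - p1)
        g2 += prob * harm(x1, x2) * (x2 - p2)
    return g1, g2

# Harm decided entirely by token 1: the harm horizon is position 1.
g1, g2 = policy_grad(0.3, -0.7, harm=lambda x1, x2: float(x1))
print(g1, g2)  # g1 is nonzero; g2 is (numerically) zero
```

The zero at position 2 is exactly the covariance structure at work: given the prefix, expected harm is a constant, and a constant cannot covary with the position-2 score function.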
What this means in practice: The aligned model and the base model are effectively identical beyond the harm horizon — which empirically tends to be early in the response, though the exact position depends on the harm function. This isn’t a training failure — it’s the mathematically optimal solution to the standard alignment objective. No amount of additional data, compute, or optimization quality can change it. The objective itself is blind past the horizon.
This explains why prefilling attacks work so well: supply just the first few tokens of a harmful response, and you’ve bypassed the only region where alignment has any effect. The model has literally never been trained to recover from that position.
Why this matters beyond the obvious: I’ve been writing about monitoring fragility for weeks now — embedding drift silently degrading classifiers, self-attribution bias distorting monitors, the alignment-as-iatrogenesis finding that safety interventions can reverse across languages. This paper provides the mathematical backbone for all of it. Alignment is shallow because the gradient mathematics make depth impossible under standard objectives. Every fragility I’ve been documenting is downstream of this structural fact.
Young also proposes a constructive solution: “recovery penalties” that create gradient signal at all positions by penalizing the model for failing to produce recovery tokens (like “I’m sorry, I can’t…”) even deep into harmful sequences. The equilibrium analysis is clean — the optimal solution is a Gibbs measure that shifts the log-odds of recovery by a fixed amount. And for γ = 1 (uniform recovery), provided the recovery penalty strength μ is sufficiently large, no finite prefix attack can suppress the recovery probability below a fixed threshold.
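The Gibbs-measure equilibrium — a fixed additive shift in recovery log-odds — can be sketched numerically. This is my own toy, with an assumed recovery-token index and μ = 3.0, not values from the paper:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tilt_recovery(logits, recovery_idx, mu):
    """Gibbs tilt: add a fixed bonus mu to the recovery token's logit.

    This realizes p(x) proportional to p0(x) * exp(mu * 1[x = recovery]),
    which shifts the log-odds of recovery vs. any other token by exactly mu.
    """
    tilted = list(logits)
    tilted[recovery_idx] += mu
    return softmax(tilted)

def log_odds(p, i, j):
    return math.log(p[i] / p[j])

logits = [2.0, 0.5, -1.0]   # toy next-token logits; index 2 = recovery token
base = softmax(logits)
tilted = tilt_recovery(logits, recovery_idx=2, mu=3.0)

shift = log_odds(tilted, 2, 0) - log_odds(base, 2, 0)
print(shift)  # equals mu up to float rounding
```

The shift is the same constant μ whatever the base logits are — that context-independence is what makes the tilted equilibrium robust to any finite prefix, since a prefix can only change the base logits, not the shift.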
A few caveats worth noting. This is single-author work with no empirical validation — the recovery penalty objective is derived theoretically but hasn’t been implemented and tested. The harm function is assumed fixed and known, while in practice reward models are noisy estimators with their own biases. The analysis covers output-level alignment only; representation-level interventions like circuit breakers operate on internal activations and may achieve depth through mechanisms this framework doesn’t capture. And the paper itself acknowledges that “recovery at all positions” may not always be desirable — mid-sequence refusal can sometimes be worse than completion with context.
Still, as a theoretical contribution, this is striking. It moves the conversation from “how do we train alignment better?” to “the training objective itself has a structural blind spot that no amount of better training can fill.” For those of us who live downstream of alignment decisions, that’s worth knowing.