An Open Book Nobody Can Read
The most capable reasoning models produce the least legible traces. Reward models don’t care. This breaks the plan for scalable oversight.
The most capable reasoning models produce the least legible traces. Reward models don’t care. This breaks the plan for scalable oversight.