The Right Work, the Wrong Answer
Models can execute every step of chain-of-thought reasoning correctly and still declare the wrong final answer. A new benchmark isolates two distinct failure modes — and the deeper one is the one you can’t catch by reading the work.