Benchmarks

A new paper introduces ‘Procedure-Aware Evaluation’ and reports that 27–78% of benchmark successes conceal procedural violations. No model achieves more than 24% reliable compliance. For anyone deploying agents, the implication is that a reported benchmark success does not by itself show that the agent followed the required procedure.