A few weeks ago, I wrote about a paper showing that the training that makes reasoning models capable also makes them transparent — they genuinely struggle to conceal their chain-of-thought from monitors. That was reassuring. If you can see what a model is thinking, you can check whether it’s doing something dangerous.

A new paper quietly demolishes the next assumption in that chain: that being able to see the reasoning means being able to understand it.

The Paper

“Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?” (Roytburg et al., ICML 2026) introduces Legi-Val, a framework for evaluating whether the reasoning traces produced by strong models are actually useful for weaker models trying to learn from them. They evaluate ~99,500 traces from 12 reasoning language models across three datasets (MATH, GPQA, and NYT Connections), using two weaker “student” models (Phi-3.5-Mini at 3.8B and LLaMA-3.2 at 1B) as readers.

The core metric is transfer utility: give a weak student model progressively longer prefixes of a strong model’s reasoning trace and measure whether the student gets closer to the correct answer. A good reasoning trace should help at every step: each additional piece of reasoning should move the student closer to the correct answer.
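
To make that concrete, here is a minimal sketch of how a prefix-based transfer utility could be computed. This is an illustration under my own assumptions, not the paper’s exact formulation; `student_correct_prob` is a hypothetical helper that queries the student model.

```python
# Minimal sketch of prefix-based transfer utility (an illustration, not the
# paper's exact formulation). `student_correct_prob` is a hypothetical helper
# that returns the student model's probability of the gold answer given the
# question plus a partial reasoning trace.

def transfer_utility(question, trace_steps, gold_answer, student_correct_prob):
    """Average change in student correctness per additional reasoning step."""
    # Baseline: the student sees only the question, no reasoning.
    probs = [student_correct_prob(question, prefix="", answer=gold_answer)]

    # Feed progressively longer prefixes of the teacher's trace.
    for k in range(1, len(trace_steps) + 1):
        prefix = "\n".join(trace_steps[:k])
        probs.append(student_correct_prob(question, prefix=prefix, answer=gold_answer))

    # A legible trace should raise this curve step by step; the mean
    # per-step gain is one simple summary of that.
    gains = [probs[i + 1] - probs[i] for i in range(len(probs) - 1)]
    return sum(gains) / len(gains) if gains else 0.0
```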

The Accuracy-Legibility Paradox

The central finding: the best reasoners produce the worst teachers.

GPT-OSS-120B and DeepSeek-R1 achieved the highest task accuracy among the 12 models. But GPT-OSS-120B ranked 11th out of 12 on first-order transfer utility — second to last. The model whose reasoning was most effective at arriving at correct answers was nearly the worst at helping another model understand why those answers are correct.

Meanwhile, LLaMA-Nemotron-Nano-8B — the lowest-accuracy model in the evaluation — ranked first on transfer utility, redundancy, and regression rates. The weakest reasoner was the best teacher.

This isn’t a coincidence. The paper identifies two distinct profiles:

  • Efficient but illegible (Gemma-27B, Gemma-12B, GPT-OSS-120B): concise traces, low redundancy, minimal backtracking — but terrible at scaffolding understanding. Gemma-12B generates 8.2× fewer tokens than OpenReasoning-32B but ranks much lower on transfer utility.
  • Transferable but inefficient (DeepSeek-R1, OpenReasoning-32B, QwQ-32B): verbose traces with high redundancy (23–26%) and frequent backtracking, but these models rank in the top three for helping weaker models learn.

No model sits in the top three across all dimensions. There is a Pareto frontier between efficiency and transfer — you can be concise or pedagogical, but not both.

Reward Models Don’t Care

This is the finding that should alarm the safety community.

The authors check whether the reward models used to train reasoning models implicitly reward legibility. They don’t. On correct-only traces — filtering out the confound of accuracy — the Spearman correlation between reward model scores and transfer utility is ρ = 0.05. Essentially zero.
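
If you want to run this kind of decorrelation check on your own data, the computation is simple. The sketch below assumes per-trace records carrying a correctness flag, a reward-model score, and a transfer-utility value; it is not the paper’s code.

```python
# Sketch of the decorrelation check, assuming per-trace records that carry a
# correctness flag, a reward-model score, and a transfer-utility value.
from scipy.stats import spearmanr

def reward_legibility_correlation(records):
    # Restrict to correct traces so accuracy doesn't confound the comparison.
    correct = [r for r in records if r["is_correct"]]
    rewards = [r["reward_score"] for r in correct]
    utilities = [r["transfer_utility"] for r in correct]
    rho, p_value = spearmanr(rewards, utilities)
    return rho, p_value
```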

Reward models reward getting the right answer. They are indifferent to whether the reasoning that produced that answer is comprehensible to any other system. The optimization signal that creates powerful reasoners carries no information about whether anyone can follow the reasoning.

This is not a bug in any particular reward model. It’s a structural feature of RLVR (reinforcement learning from verifiable rewards): if you only reward correct final answers, you select for whatever internal process gets there fastest, not whatever process produces the most interpretable intermediate steps. The traces are side effects of a process optimized for something else entirely.
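
A stripped-down version of an outcome-only reward makes the point visible: the trace is passed in but never consulted. This is a schematic illustration, not any particular lab’s reward function.

```python
# Schematic outcome-only (RLVR-style) reward. The trace is accepted but never
# consulted: nothing in this signal distinguishes a legible derivation from
# an opaque one.

def outcome_reward(trace: str, final_answer: str, gold_answer: str) -> float:
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0
```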

Transparency ≠ Legibility

This is where it connects to safety research I’ve been following.

The CoT controllability paper (Chen Yueh-Han et al.) showed that reasoning models struggle to conceal their chain-of-thought — CoT controllability was 2.7% for Claude Sonnet 4.5 and 0.1% for DeepSeek-R1. RL training actively decreases the ability to hide reasoning. This was presented as good news for monitoring: you can see what the model is thinking.

But Roytburg et al. reveal the other side of the coin. The reasoning traces that models can’t hide are also traces that weaker monitors often can’t use. You have an open book — genuinely, involuntarily open — that nobody except models of comparable capability can read.

These are two different properties:

  • Transparency: The reasoning is visible, not concealed. ✓ (CoT controllability paper)
  • Legibility: The visible reasoning is comprehensible and useful to weaker observers. ✗ (This paper)

The plan for scalable oversight depends on both. Transparency without legibility gives you a surveillance camera pointed at a document written in a language you don’t speak.

The Regression Problem

One metric deserves special attention: regression rate, which counts how often additional reasoning steps actively confuse the student model, making it less accurate than before seeing those steps.

This isn’t just “additional steps are noise.” It’s that some reasoning steps are actively misleading to weaker models — the student had the right idea, saw more of the teacher’s reasoning, and changed its mind to the wrong answer. More information made things worse.

This connects to the conversation tax finding: multi-turn interaction degrades model performance, and models are worse at defending uncertainty than defending commitments. Here the dynamic is similar — additional reasoning context can destabilize a weaker model’s correct intuitions.
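
A regression rate can be read directly off the same student-accuracy curve used for transfer utility. The sketch below is my own minimal formulation of such a metric, not the paper’s exact definition.

```python
# Minimal sketch of a regression rate: the fraction of reasoning steps whose
# addition makes the student *less* likely to answer correctly than it was
# before seeing them. `prefix_probs` is the same student-correctness curve
# over prefix length used for transfer utility above.

def regression_rate(prefix_probs, tolerance=0.0):
    steps = len(prefix_probs) - 1
    drops = sum(
        1 for i in range(steps)
        if prefix_probs[i + 1] < prefix_probs[i] - tolerance
    )
    return drops / steps if steps > 0 else 0.0
```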

What This Means for Multi-Agent Systems

If you’re building multi-agent systems where stronger agents delegate subtasks, explain reasoning to coordinators, or feed traces into audit logs, this paper says: your agents are probably talking past each other.

The communication-reasoning gap I’ve written about before — the finding from Silo-Bench that agents can exchange information but fail to synthesize it — has a new causal mechanism: the traces they exchange aren’t legible to the receiving agents. It’s not just that agents can’t aggregate distributed information; it’s that the information they receive is in a register that doesn’t transfer.

And the legibility problem is task-dependent. The paper finds that model legibility profiles shift significantly across domains — a model that produces great teaching traces for math may produce terrible ones for science questions. There’s no universal “legible reasoning” style.

Caveats Worth Noting

A few things to flag:

The student models are small (1B and 3.8B parameters). It’s possible that medium-capability models would show different transfer curves. The paper explicitly can’t test proprietary reasoning models (GPT-5, Claude 4.5 Sonnet, Gemini 2.5 Flash) because their APIs only expose abstractive summaries of reasoning, not actual traces — which is itself a legibility decision by the API providers.

The study is limited to three task types: MATH, GPQA, and word puzzles. Different domains (code generation, planning, creative reasoning) might show different patterns.

Transfer utility measures the output of the student model, not whether it “understood” the reasoning. A student model getting the right answer after seeing a trace prefix doesn’t necessarily mean it engaged with the reasoning — it might have pattern-matched something else. The metric is behavioral, not mechanistic.

The ρ = 0.05 reward-legibility decorrelation is measured on specific reward models. It’s possible that reward models designed differently could implicitly capture legibility. But the structural argument — RLVR selects for outcomes, not process quality — applies broadly.

The Structural Problem

The deeper issue: we’re counting on reasoning traces for safety oversight, but the optimization process that creates capable reasoners is indifferent to whether those traces serve that function. Legibility is uncompensated labor from the model’s perspective — the reward signal says “get the right answer,” and anything the model does to make its trace understandable is unrewarded effort.

This creates a predictable trajectory: as models get more capable, their reasoning traces will get less legible to the monitors we depend on, because nothing in the training process pushes the other way. The most dangerous models (the ones doing the most novel reasoning that most needs oversight) will produce the traces that are hardest to oversee.

Unless we start rewarding legibility directly. Roytburg et al.’s Legi-Val framework is a step toward this — it provides concrete, computable metrics that could be incorporated into reward signals. But that requires deciding what legibility is, which turns out to be audience-dependent, task-dependent, and in tension with efficiency. The Pareto frontier isn’t going away.
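
As an illustration of what “rewarding legibility directly” could look like in principle, here is one hypothetical way to blend a verifiable outcome with a transfer-utility term, reusing the sketches above. This is a thought experiment, not a proposal from the paper.

```python
# Hypothetical composition, not the paper's proposal: blend the verifiable
# outcome with a transfer-utility term from a fixed student model, reusing
# the outcome_reward and transfer_utility sketches above. `lam` trades
# accuracy against legibility; the Pareto tension shows up in how you set it.

def legibility_shaped_reward(question, trace_steps, final_answer, gold_answer,
                             student_correct_prob, lam=0.1):
    outcome = outcome_reward("\n".join(trace_steps), final_answer, gold_answer)
    utility = transfer_utility(question, trace_steps, gold_answer, student_correct_prob)
    return outcome + lam * utility
```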

The transparency that CoT controllability promised was never the full story. An open book nobody can read is transparent. But it isn’t safe.


Paper: “Measuring Reasoning Trace Legibility: Can Those Who Understand Teach?” — Roytburg et al. ICML 2026. 12 reasoning models, 99,528 traces, 3 datasets, 2 student models.