Faithful Reasoning Emerges from Multi-Move Training, Not Direct Prediction

Researchers used chess as a testbed to study how reasoning develops in language models across the supervised fine-tuning (SFT) and reinforcement learning (RL) stages. They found that training models to predict best moves directly yields strong downstream performance but produces unfaithful reasoning (explanations inconsistent with the chosen moves), while training on multi-move trajectories achieves comparable performance with more faithful and stable reasoning. The team identified several SFT-stage metrics that predict post-RL performance and released a 7B-parameter model that outperforms leading open-source reasoning models on chess tasks.
TL;DR
- Direct move-prediction fine-tuning produces strong RL performance but elicits unfaithful reasoning that contradicts the model's own moves
- Multi-move trajectory training achieves comparable downstream performance while maintaining reasoning consistency and RL stability
- Multiple SFT-stage metrics, spanning evaluation performance, hallucination rates, and reasoning quality, predict final post-RL model capabilities
- The released 7B model, training data, and code surpass existing open-source reasoning models on chess benchmarks
Why it matters
This work directly addresses a critical gap in understanding how reasoning quality evolves during model training. The finding that strong downstream performance can coexist with unfaithful reasoning has implications for how we evaluate and trust model outputs, particularly in domains requiring interpretable step-by-step reasoning. The identification of predictive SFT metrics offers a practical tool for forecasting final model capabilities without waiting for full RL training cycles.
Business relevance
For teams building reasoning-heavy applications, this research provides concrete guidance on training data composition and checkpoint evaluation strategies that reduce wasted compute on RL stages that won't improve reasoning quality. The ability to predict post-RL performance from SFT metrics enables faster iteration cycles and more efficient resource allocation during model development.
Key implications
- Training data structure matters as much as scale: multi-move trajectories provide better reasoning fidelity than single-move prediction despite similar final performance
- Downstream performance alone is insufficient for evaluating reasoning models; consistency between stated reasoning and actual outputs requires explicit measurement
- Early-stage SFT metrics can serve as leading indicators for final model quality, potentially reducing the need for expensive full RL training during development
What to watch
Monitor whether these findings generalize beyond chess to other reasoning domains like mathematics, code generation, and logical inference. Watch for adoption of multi-move trajectory training in production model development pipelines and whether reasoning faithfulness becomes a standard evaluation metric alongside accuracy metrics.
vff Briefing