VFF - The signal in the noise
Research

Faithful Reasoning Emerges from Multi-Move Training, Not Direct Prediction

Researchers studied how reasoning develops in language models across supervised fine-tuning and reinforcement learning stages using chess as a testbed. They found that training models to predict best moves directly yields strong downstream performance but produces unfaithful reasoning (outputs inconsistent with chosen moves), while training on multi-move trajectories achieves comparable performance with more faithful and stable reasoning. The team identified several SFT metrics predictive of post-RL performance and released a 7B parameter model that outperforms leading open-source reasoning models on chess tasks.

  • Direct move prediction fine-tuning produces strong RL performance but elicits unfaithful reasoning that contradicts the model's own moves
  • Multi-move trajectory training achieves comparable downstream performance while maintaining reasoning consistency and RL stability
  • Multiple SFT-stage metrics spanning evaluation performance, hallucination rates, and reasoning quality predict final post-RL model capabilities
  • Released a 7B model that surpasses existing open-source reasoning models on chess benchmarks, along with the training data and code

This work directly addresses a critical gap in understanding how reasoning quality evolves during model training. The finding that strong downstream performance can coexist with unfaithful reasoning has implications for how we evaluate and trust model outputs, particularly in domains requiring interpretable step-by-step reasoning. The identification of predictive SFT metrics offers a practical tool for forecasting final model capabilities without waiting for full RL training cycles.

For teams building reasoning-heavy applications, this research provides concrete guidance on training data composition and checkpoint evaluation strategies that reduce wasted compute on RL stages that won't improve reasoning quality. The ability to predict post-RL performance from SFT metrics enables faster iteration cycles and more efficient resource allocation during model development.
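One way to check whether an SFT-stage metric behaves as a leading indicator is to rank-correlate it with post-RL scores across several checkpoints. The sketch below is an illustration only, not the researchers' analysis: the choice of Spearman correlation, the function names, and the toy numbers are all assumptions.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up placeholder values for four hypothetical checkpoints,
# NOT results from the study.
toy_sft_hallucination_rate = [0.30, 0.22, 0.15, 0.10]
toy_post_rl_score = [0.41, 0.48, 0.55, 0.61]

print(spearman(toy_sft_hallucination_rate, toy_post_rl_score))
```

A correlation near +1 or -1 across many checkpoints would suggest the SFT metric is worth tracking before committing to RL; with only a handful of checkpoints, the estimate is too noisy to rely on.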

  • Training data structure matters as much as scale: multi-move trajectories provide better reasoning fidelity than single-move prediction despite similar final performance
  • Downstream performance alone is insufficient for evaluating reasoning models; consistency between stated reasoning and actual outputs requires explicit measurement
  • Early-stage SFT metrics can serve as leading indicators for final model quality, potentially reducing the need for expensive full RL training during development
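To make the second point concrete, here is a minimal sketch of an explicit consistency check. It is a crude proxy and not the metric used in the research: it assumes moves are written in standard algebraic notation, and it treats a response as consistent when the last move mentioned in the reasoning matches the move actually played. The regular expression, function names, and sample texts are all hypothetical.

```python
import re

# Rough pattern for moves in standard algebraic notation (SAN),
# e.g. "Nf3", "exd5", "O-O", "e8=Q+". An approximation, not a full parser.
SAN_PATTERN = re.compile(
    r"\b(?:O-O-O|O-O|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?"
)


def mentioned_moves(reasoning):
    """Return every SAN-like move in the text, without check/mate suffixes."""
    return [m.group(0).rstrip("+#") for m in SAN_PATTERN.finditer(reasoning)]


def is_consistent(reasoning, final_move):
    """Heuristic: the last move discussed should be the move played."""
    moves = mentioned_moves(reasoning)
    return bool(moves) and moves[-1] == final_move.rstrip("+#")


def faithfulness_rate(samples):
    """Fraction of (reasoning, final_move) pairs that pass the heuristic."""
    if not samples:
        return 0.0
    return sum(is_consistent(r, m) for r, m in samples) / len(samples)


# Hypothetical model outputs, written for illustration only.
samples = [
    ("I considered e4 but Nf3 develops a piece, so Nf3.", "Nf3"),
    ("Nf3 looks best here.", "e4"),
]
print(faithfulness_rate(samples))
```

Reporting a rate like this next to accuracy would expose a model that picks strong moves while arguing for different ones, which accuracy alone cannot show.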

Monitor whether these findings generalize beyond chess to other reasoning domains such as mathematics, code generation, and logical inference. Watch for adoption of multi-move trajectory training in production model development pipelines, and for whether reasoning faithfulness becomes a standard evaluation metric alongside accuracy.
