iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

¹VIT Chennai, India    ²Loughborough University, UK    ³Nagasaki University, Japan

iReasoner is a fully unsupervised self-evolving framework that improves large multimodal model (LMM) reasoning by explicitly eliciting and rewarding chain-of-thought (CoT) through trajectory-aware intrinsic supervision. Unlike prior methods that reward only final outcomes, iReasoner provides learning signals over intermediate reasoning steps, distinguishing reasoning paths that lead to the same answer without requiring ground-truth labels or external judges.

Abstract

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer–Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.

🔥 Highlights

  1. We introduce iReasoner, a fully unsupervised self-evolving framework that brings intermediate reasoning into the optimization loop for LMMs. Unlike prior work that rewards only final outcomes, iReasoner explicitly elicits chain-of-thought and provides trajectory-aware supervision over intermediate steps.

  2. We propose an intrinsic CoT agreement reward that scores step-level alignment among Solver rollouts converging to the same answer. This provides learning signals that distinguish reasoning trajectories without labeled data or external judges, addressing the key limitation that outcome-only rewards cannot differentiate between stable and unstable reasoning paths.

  3. Starting from Qwen2.5-VL-7B and training exclusively on unlabeled images, iReasoner achieves consistent improvements across eight multimodal reasoning benchmarks, with gains of up to +2.1 points. Our analysis shows that step-wise reasoning reward improves general-purpose transfer beyond answer-level agreement alone, particularly on tasks where intermediate structure matters.

Intrinsic CoT Agreement Reward

iReasoner CoT Agreement
iReasoner's intrinsic step-level CoT agreement. Given an unlabeled image, the Proposer generates a visually grounded question, and the Solver samples N reasoning rollouts, each producing a CoT with multiple intermediate steps and a final answer. Among rollouts in the dominant (majority-answer) group, we embed each step text into a vector and form per-step prototypes. Step agreement is computed via similarity and aggregated with higher weight on earlier, grounding-heavy steps to produce a scalar Intrinsic CoT Agreement Reward.
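The sketch below illustrates one way the step-level agreement described above could be computed. It is not the paper's exact implementation: `embed` stands in for any sentence-embedding model returning unit-norm vectors, rollouts are assumed to be pre-split into a common number of steps, and the geometric `decay` factor is a hypothetical stand-in for the position weighting that favors earlier, grounding-heavy steps.

```python
import numpy as np

def cot_agreement_reward(step_texts, embed, decay=0.8):
    """Intrinsic CoT agreement over majority-answer rollouts (illustrative sketch).

    step_texts: list over rollouts in the dominant (majority-answer) group;
                each rollout is a list of step strings, padded/truncated to a
                common number of steps.
    embed:      callable mapping a string to a unit-norm embedding vector.
    decay:      geometric position decay (< 1) up-weighting earlier,
                grounding-heavy steps (hypothetical stand-in for the paper's
                position weighting).
    """
    n_steps = len(step_texts[0])

    # Embed every step of every rollout: shape (n_rollouts, n_steps, dim).
    E = np.stack([[embed(s) for s in rollout] for rollout in step_texts])

    # Per-step prototypes: mean embedding across rollouts, re-normalized.
    proto = E.mean(axis=0)
    proto /= np.linalg.norm(proto, axis=-1, keepdims=True)

    # Cosine similarity of each rollout's step to its prototype, averaged
    # over rollouts -> one agreement score per step index.
    step_agreement = np.einsum('rsd,sd->rs', E, proto).mean(axis=0)

    # Aggregate with higher weight on earlier steps.
    weights = decay ** np.arange(n_steps)
    weights /= weights.sum()
    return float((weights * step_agreement).sum())
```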

iReasoner Pipeline

Overview of the iReasoner pipeline. From an unlabeled image, a Proposer generates a question, and a Solver produces N reasoning rollouts. The answer distribution entropy shapes the Proposer reward and selects the dominant answer group. The Solver reward combines answer-level self-consistency with an intrinsic step-level agreement signal computed over intermediate reasoning traces, providing trajectory-aware supervision without annotated data or external verifiers.
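As a rough illustration of the two reward signals in this loop, the sketch below computes an entropy-shaped Proposer reward and per-rollout Solver rewards. The hyperparameters `target_entropy` and `alpha` are hypothetical, and the scalar `step_agreement` is assumed to come from the CoT-agreement sketch above; the exact reward shaping in iReasoner may differ.

```python
import numpy as np
from collections import Counter

def proposer_reward(answers, target_entropy=0.7):
    """Entropy-shaped Proposer reward (sketch): highest when the Solver's
    answer distribution sits at intermediate difficulty, i.e. neither
    collapsed onto a single answer nor spread uniformly. `target_entropy`
    in [0, 1] is a hypothetical hyperparameter."""
    counts = np.array(list(Counter(answers).values()), dtype=float)
    p = counts / counts.sum()
    # Entropy of the answer distribution, normalized by its maximum over N rollouts.
    h = float(-(p * np.log(p + 1e-12)).sum() / np.log(max(len(answers), 2)))
    return 1.0 - abs(h - target_entropy)

def solver_rewards(answers, step_agreement, alpha=0.5):
    """Per-rollout Solver reward (sketch): answer-level self-consistency plus
    the intrinsic step-level signal for rollouts in the dominant answer group.
    `step_agreement` is the scalar from the CoT-agreement sketch above."""
    majority, count = Counter(answers).most_common(1)[0]
    consistency = count / len(answers)
    return [consistency + alpha * step_agreement if a == majority else 0.0
            for a in answers]
```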

Why Trajectory-Aware Supervision Matters

Outcome-only self-consistency treats distinct reasoning traces similarly when they reach the same final answer. As illustrated below, three Solver rollouts produce identical answers, but their intermediate steps differ substantially. Rollouts 1 and 3 follow a consistent signed-area decomposition, while Rollout 2 deviates via incorrect intermediate claims yet still arrives at the same answer. Since outcome-only intrinsic rewards depend only on answer agreement, these rollouts receive nearly identical learning signals despite qualitatively different reasoning traces. This motivates step-aware supervision in iReasoner, which directly evaluates and optimizes intermediate reasoning structure.

Outcome-only limitation

Main Results

Evaluation results across eight multimodal reasoning benchmarks.
| Model | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) | ChartQA | MathVista | MathVision | MathVerse |
|---|---|---|---|---|---|---|---|---|
| Vision-Zero† (CLEVR) | 80.35 | 82.64 | 88.50 | 51.44 | 84.24 | 68.43 | 23.96 | 43.86 |
| VisPlay* | – | – | – | – | – | 38.27 | 31.15 | 39.14 |
| Qwen2.5-VL-7B (Baseline) | 80.44 | 82.61 | 88.30 | 51.11 | 84.00 | 68.47 | 23.91 | 43.78 |
| Qwen2.5-VL-7B w/ Discrete Reward | 80.52 | 82.18 | 87.98 | 50.84 | 84.62 | 68.88 | 22.52 | 42.10 |
| EvoLMM | 81.06 | 83.41 | 89.50 | 52.01 | 86.70 | 70.52 | 24.81 | 44.88 |
| Qwen2.5-VL-7B w/ Discrete Reward + Step-level | 80.78 | 82.95 | 88.92 | 51.48 | 85.42 | 69.31 | 24.12 | 44.18 |
| iReasoner (Ours) | 81.56 | 83.89 | 89.92 | 52.37 | 85.78 | 69.74 | 25.29 | 45.91 |

† uses external supervision. * uses LLM-as-a-judge evaluation.


Ablation study of intrinsic reasoning supervision.
| Ablation | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) | ChartQA | MathVista | MathVision | MathVerse |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Baseline) | 80.44 | 82.61 | 88.30 | 51.11 | 84.00 | 68.47 | 23.91 | 43.78 |
| Step-level majority (Full) | 81.56 | 83.89 | 89.92 | 52.37 | 85.78 | 69.74 | 25.29 | 45.91 |
| Soft majority reward only | 81.12 | 83.36 | 89.41 | 51.92 | 86.64 | 70.41 | 24.62 | 44.71 |
| Step-level reward only | 80.61 | 82.69 | 88.44 | 50.98 | 84.38 | 68.73 | 24.18 | 43.87 |
| w/o Warmup schedule | 81.04 | 83.21 | 89.26 | 51.74 | 85.02 | 68.97 | 24.63 | 45.11 |
| w/o Position decay | 81.29 | 83.58 | 89.55 | 52.02 | 85.41 | 69.34 | 25.02 | 45.49 |
| w/o Density weighting | 81.18 | 83.46 | 89.47 | 51.88 | 85.29 | 69.19 | 24.91 | 45.32 |

Training Dynamics and Step Structure

We analyze how the self-evolution process shapes question difficulty and reasoning structure over training. The top plots show stable Proposer rewards and sustained answer entropy, indicating that questions remain at intermediate difficulty without collapsing to trivial or unsolvable extremes. The bottom plots demonstrate increasing majority-group density and step similarity, showing that Solver rollouts progressively converge on both final answers and intermediate reasoning structure.

Training Dynamics
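For readers who want to track the same quantities, a minimal per-batch sketch is given below. It assumes the N Solver answers and the scalar step-agreement value from the sketch above are already available; the metric names are illustrative, not the codebase's.

```python
import numpy as np
from collections import Counter

def training_step_metrics(answers, step_agreement):
    """Per-batch quantities shown in the training-dynamics plots
    (illustrative names): answer entropy, majority-group density, and the
    step similarity within the dominant answer group."""
    counts = np.array(list(Counter(answers).values()), dtype=float)
    p = counts / counts.sum()
    return {
        # Entropy of the Solver's answer distribution for this question.
        "answer_entropy": float(-(p * np.log(p + 1e-12)).sum()),
        # Fraction of rollouts that fall into the dominant (majority) group.
        "majority_density": float(counts.max() / counts.sum()),
        # Scalar from the CoT-agreement sketch, logged as step similarity.
        "step_similarity": float(step_agreement),
    }
```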

Within-Mode Divergence Analysis

Even when rollouts agree on the final answer, their intermediate reasoning can vary substantially. We visualize per-step agreement within the dominant-answer group using leave-one-out similarity scores. The heatmap shows that some rollouts deviate sharply at specific step indices while remaining aligned elsewhere, with disagreement concentrated in the middle of the reasoning traces (steps 2–3). This supports the claim that outcome-only rewards cannot distinguish stable from unstable reasoning within the dominant answer mode.

Within-mode analysis
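The heatmap values can be approximated with the leave-one-out computation sketched below. As before, `embed` is a generic sentence-embedding function and rollouts are assumed to be padded to a common number of steps; this is an illustration of the analysis, not the exact code used.

```python
import numpy as np

def leave_one_out_step_similarity(step_texts, embed):
    """Within-mode analysis (sketch): per-step leave-one-out similarity for
    rollouts in the dominant-answer group (at least two rollouts). Returns an
    (n_rollouts, n_steps) matrix for the heatmap, where entry (r, s) compares
    rollout r's step s to the mean embedding of the other rollouts' step s."""
    # Embed and normalize: shape (n_rollouts, n_steps, dim).
    E = np.stack([[embed(s) for s in rollout] for rollout in step_texts])
    E /= np.linalg.norm(E, axis=-1, keepdims=True)

    n_rollouts = E.shape[0]
    total = E.sum(axis=0)                                  # (n_steps, dim)
    sims = np.empty(E.shape[:2])
    for r in range(n_rollouts):
        others = (total - E[r]) / (n_rollouts - 1)         # leave-one-out mean
        others /= np.linalg.norm(others, axis=-1, keepdims=True)
        sims[r] = (E[r] * others).sum(axis=-1)             # cosine per step
    return sims
```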

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank open-source LMM projects for releasing their models, code, and templates.