iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

¹VIT Chennai, India    ²Loughborough University, UK    ³Nagasaki University, Japan

iReasoner is a fully unsupervised self-evolving framework that improves large multimodal model (LMM) reasoning by explicitly eliciting and rewarding chain-of-thought (CoT) through trajectory-aware intrinsic supervision. Unlike prior methods that reward only final outcomes, iReasoner provides learning signals over intermediate reasoning steps, distinguishing reasoning paths that lead to the same answer without requiring ground-truth labels or external judges.

Abstract

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer–Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.

🔥 Highlights

  1. We introduce iReasoner, a fully unsupervised self-evolving framework that brings intermediate reasoning into the optimization loop for LMMs. Unlike prior work that rewards only final outcomes, iReasoner explicitly elicits chain-of-thought and provides trajectory-aware supervision over intermediate steps.

  2. We propose an intrinsic CoT agreement reward that scores step-level alignment among Solver rollouts converging to the same answer. This provides learning signals that distinguish reasoning trajectories without labeled data or external judges, addressing the key limitation that outcome-only rewards cannot differentiate between stable and unstable reasoning paths.

  3. Starting from Qwen2.5-VL-7B and training exclusively on unlabeled images, iReasoner achieves consistent improvements across eight multimodal reasoning benchmarks, with gains of up to +2.1 points. Our analysis shows that step-wise reasoning reward improves general-purpose transfer beyond answer-level agreement alone, particularly on tasks where intermediate structure matters.

Intrinsic CoT Agreement Reward

iReasoner CoT Agreement
iReasoner's intrinsic step-level CoT agreement. Given an unlabeled image, the Proposer generates a visually grounded question, and the Solver samples N reasoning rollouts, each producing a CoT with multiple intermediate steps and a final answer. Among rollouts in the dominant (majority-answer) group, we embed each step text into a vector and form per-step prototypes. Step agreement is computed via similarity and aggregated with higher weight on earlier, grounding-heavy steps to produce a scalar Intrinsic CoT Agreement Reward.
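The sketch below illustrates one way the step-level agreement described above could be computed. It is not the paper's exact implementation: `embed` stands in for any sentence-embedding model returning unit-norm vectors, rollouts are assumed to be pre-split into a common number of steps, and the geometric `decay` factor is a hypothetical stand-in for the position weighting that favors earlier, grounding-heavy steps.

```python
import numpy as np

def cot_agreement_reward(step_texts, embed, decay=0.8):
    """Intrinsic CoT agreement over majority-answer rollouts (illustrative sketch).

    step_texts: list over rollouts in the dominant (majority-answer) group;
                each rollout is a list of step strings, padded/truncated to a
                common number of steps.
    embed:      callable mapping a string to a unit-norm embedding vector.
    decay:      geometric position decay (< 1) up-weighting earlier,
                grounding-heavy steps (hypothetical stand-in for the paper's
                position weighting).
    """
    n_steps = len(step_texts[0])

    # Embed every step of every rollout: shape (n_rollouts, n_steps, dim).
    E = np.stack([[embed(s) for s in rollout] for rollout in step_texts])

    # Per-step prototypes: mean embedding across rollouts, re-normalized.
    proto = E.mean(axis=0)
    proto /= np.linalg.norm(proto, axis=-1, keepdims=True)

    # Cosine similarity of each rollout's step to its prototype, averaged
    # over rollouts -> one agreement score per step index.
    step_agreement = np.einsum('rsd,sd->rs', E, proto).mean(axis=0)

    # Aggregate with higher weight on earlier steps.
    weights = decay ** np.arange(n_steps)
    weights /= weights.sum()
    return float((weights * step_agreement).sum())
```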

iReasoner Pipeline

Overview of the iReasoner pipeline. From an unlabeled image, a Proposer generates a question, and a Solver produces N reasoning rollouts. The answer distribution entropy shapes the Proposer reward and selects the dominant answer group. The Solver reward combines answer-level self-consistency with an intrinsic step-level agreement signal computed over intermediate reasoning traces, providing trajectory-aware supervision without annotated data or external verifiers.
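As a rough illustration of the two reward signals in this loop, the sketch below computes an entropy-shaped Proposer reward and per-rollout Solver rewards. The hyperparameters `target_entropy` and `alpha` are hypothetical, and the scalar `step_agreement` is assumed to come from the CoT-agreement sketch above; the exact reward shaping in iReasoner may differ.

```python
import numpy as np
from collections import Counter

def proposer_reward(answers, target_entropy=0.7):
    """Entropy-shaped Proposer reward (sketch): highest when the Solver's
    answer distribution sits at intermediate difficulty, i.e. neither
    collapsed onto a single answer nor spread uniformly. `target_entropy`
    in [0, 1] is a hypothetical hyperparameter."""
    counts = np.array(list(Counter(answers).values()), dtype=float)
    p = counts / counts.sum()
    # Entropy of the answer distribution, normalized by its maximum over N rollouts.
    h = float(-(p * np.log(p + 1e-12)).sum() / np.log(max(len(answers), 2)))
    return 1.0 - abs(h - target_entropy)

def solver_rewards(answers, step_agreement, alpha=0.5):
    """Per-rollout Solver reward (sketch): answer-level self-consistency plus
    the intrinsic step-level signal for rollouts in the dominant answer group.
    `step_agreement` is the scalar from the CoT-agreement sketch above."""
    majority, count = Counter(answers).most_common(1)[0]
    consistency = count / len(answers)
    return [consistency + alpha * step_agreement if a == majority else 0.0
            for a in answers]
```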

Why Trajectory-Aware Supervision Matters

Outcome-only self-consistency treats distinct reasoning traces similarly when they reach the same final answer. As illustrated below, three Solver rollouts produce identical answers, but their intermediate steps differ substantially. Rollouts 1 and 3 follow a consistent signed-area decomposition, while Rollout 2 deviates via incorrect intermediate claims yet still arrives at the same answer. Since outcome-only intrinsic rewards depend only on answer agreement, these rollouts receive nearly identical learning signals despite qualitatively different reasoning traces. This motivates step-aware supervision in iReasoner, which directly evaluates and optimizes intermediate reasoning structure.

Outcome-only limitation

Main Results

Evaluation results across eight multimodal reasoning benchmarks.
| Model | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) | ChartQA | MathVista | MathVision | MathVerse |
|---|---|---|---|---|---|---|---|---|
| Vision-Zero† (CLEVR) | 80.35 | 82.64 | 88.50 | 51.44 | 84.24 | 68.43 | 23.96 | 43.86 |
| VisPlay* | – | – | – | – | – | 38.27 | 31.15 | 39.14 |
| Qwen2.5-VL-7B (Baseline) | 80.44 | 82.61 | 88.30 | 51.11 | 84.00 | 68.47 | 23.91 | 43.78 |
| Qwen2.5-VL-7B w/ Discrete Reward | 80.52 | 82.18 | 87.98 | 50.84 | 84.62 | 68.88 | 22.52 | 42.10 |
| EvoLMM | 81.06 | 83.41 | 89.50 | 52.01 | 86.70 | 70.52 | 24.81 | 44.88 |
| Qwen2.5-VL-7B w/ Discrete Reward + Step-level | 80.78 | 82.95 | 88.92 | 51.48 | 85.42 | 69.31 | 24.12 | 44.18 |
| iReasoner (Ours) | 81.56 | 83.89 | 89.92 | 52.37 | 85.78 | 69.74 | 25.29 | 45.91 |

† uses external supervision. * uses LLM-as-a-judge evaluation.


Ablation study of intrinsic reasoning supervision.
| Ablation | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) | ChartQA | MathVista | MathVision | MathVerse |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Baseline) | 80.44 | 82.61 | 88.30 | 51.11 | 84.00 | 68.47 | 23.91 | 43.78 |
| Step-level majority (Full) | 81.56 | 83.89 | 89.92 | 52.37 | 85.78 | 69.74 | 25.29 | 45.91 |
| Soft majority reward only | 81.12 | 83.36 | 89.41 | 51.92 | 86.64 | 70.41 | 24.62 | 44.71 |
| Step-level reward only | 80.61 | 82.69 | 88.44 | 50.98 | 84.38 | 68.73 | 24.18 | 43.87 |
| w/o Warmup schedule | 81.04 | 83.21 | 89.26 | 51.74 | 85.02 | 68.97 | 24.63 | 45.11 |
| w/o Position decay | 81.29 | 83.58 | 89.55 | 52.02 | 85.41 | 69.34 | 25.02 | 45.49 |
| w/o Density weighting | 81.18 | 83.46 | 89.47 | 51.88 | 85.29 | 69.19 | 24.91 | 45.32 |

Training Dynamics and Step Structure

We analyze how the self-evolution process shapes question difficulty and reasoning structure over training. The top plots show stable Proposer rewards and sustained answer entropy, indicating that questions remain at intermediate difficulty without collapsing to trivial or unsolvable extremes. The bottom plots demonstrate increasing majority-group density and step similarity, showing that Solver rollouts progressively converge on both final answers and intermediate reasoning structure.

Training Dynamics
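For readers who want to track the same quantities, a minimal per-batch sketch is given below. It assumes the N Solver answers and the scalar step-agreement value from the sketch above are already available; the metric names are illustrative, not the codebase's.

```python
import numpy as np
from collections import Counter

def training_step_metrics(answers, step_agreement):
    """Per-batch quantities shown in the training-dynamics plots
    (illustrative names): answer entropy, majority-group density, and the
    step similarity within the dominant answer group."""
    counts = np.array(list(Counter(answers).values()), dtype=float)
    p = counts / counts.sum()
    return {
        # Entropy of the Solver's answer distribution for this question.
        "answer_entropy": float(-(p * np.log(p + 1e-12)).sum()),
        # Fraction of rollouts that fall into the dominant (majority) group.
        "majority_density": float(counts.max() / counts.sum()),
        # Scalar from the CoT-agreement sketch, logged as step similarity.
        "step_similarity": float(step_agreement),
    }
```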

Within-Mode Divergence Analysis

Even when rollouts agree on the final answer, their intermediate reasoning can vary substantially. We visualize per-step agreement within the dominant-answer group using leave-one-out similarity scores. The heatmap shows that some rollouts deviate sharply at specific step indices while remaining aligned elsewhere, with disagreement concentrated in the middle of the reasoning traces (steps 2–3). This supports the claim that outcome-only rewards cannot distinguish stable from unstable reasoning within the dominant answer mode.

Within-mode analysis
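The heatmap values can be approximated with the leave-one-out computation sketched below. As before, `embed` is a generic sentence-embedding function and rollouts are assumed to be padded to a common number of steps; this is an illustration of the analysis, not the exact code used.

```python
import numpy as np

def leave_one_out_step_similarity(step_texts, embed):
    """Within-mode analysis (sketch): per-step leave-one-out similarity for
    rollouts in the dominant-answer group (at least two rollouts). Returns an
    (n_rollouts, n_steps) matrix for the heatmap, where entry (r, s) compares
    rollout r's step s to the mean embedding of the other rollouts' step s."""
    # Embed and normalize: shape (n_rollouts, n_steps, dim).
    E = np.stack([[embed(s) for s in rollout] for rollout in step_texts])
    E /= np.linalg.norm(E, axis=-1, keepdims=True)

    n_rollouts = E.shape[0]
    total = E.sum(axis=0)                                  # (n_steps, dim)
    sims = np.empty(E.shape[:2])
    for r in range(n_rollouts):
        others = (total - E[r]) / (n_rollouts - 1)         # leave-one-out mean
        others /= np.linalg.norm(others, axis=-1, keepdims=True)
        sims[r] = (E[r] * others).sum(axis=-1)             # cosine per step
    return sims
```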

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank open-source LMM projects for releasing their models, code, and templates.