Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer–Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.
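A minimal sketch of how the two reward signals might combine; the exact functional form, weighting, and schedule are not stated in this summary, so $R_{\text{outcome}}$, $R_{\text{step}}$, and $\lambda$ are illustrative notation rather than the paper's definitions:

$$
R(\tau) \;=\; R_{\text{outcome}}(\tau) \;+\; \lambda \, R_{\text{step}}(\tau),
$$

where $R_{\text{outcome}}$ rewards agreement of final answers across Solver rollouts, $R_{\text{step}}$ scores agreement over intermediate reasoning steps, and $\lambda$ balances the two terms.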
Outcome-only self-consistency treats distinct reasoning traces similarly when they reach the same final answer. As illustrated below, three Solver rollouts produce identical answers, but their intermediate steps differ substantially. Rollouts 1 and 3 follow a consistent signed-area decomposition, while Rollout 2 deviates via incorrect intermediate claims yet still arrives at the same answer. Since outcome-only intrinsic rewards depend only on answer agreement, these rollouts receive nearly identical learning signals despite qualitatively different reasoning traces. This motivates step-aware supervision in iReasoner, which directly evaluates and optimizes intermediate reasoning structure.
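A minimal Python sketch of this distinction, assuming each rollout is a list of step strings plus a final answer. The soft-majority outcome reward and the step-similarity measure (difflib's `SequenceMatcher`) are illustrative stand-ins, not iReasoner's actual formulation; position decay and density weighting are omitted.

```python
from collections import Counter
from difflib import SequenceMatcher

def outcome_reward(rollouts):
    """Outcome-only intrinsic reward: score each rollout by how many other
    rollouts agree with its final answer (soft majority)."""
    answers = [r["answer"] for r in rollouts]
    counts = Counter(answers)
    n = len(rollouts)
    return [(counts[a] - 1) / (n - 1) for a in answers]

def step_reward(rollouts):
    """Illustrative step-aware reward: average each rollout's per-step
    similarity to the other rollouts that share the majority answer.
    SequenceMatcher is a stand-in for the actual step-similarity measure."""
    majority, _ = Counter(r["answer"] for r in rollouts).most_common(1)[0]
    group = [r for r in rollouts if r["answer"] == majority]
    rewards = []
    for r in rollouts:
        if r["answer"] != majority or len(group) < 2:
            rewards.append(0.0)
            continue
        others = [g for g in group if g is not r]
        sims = []
        for idx, step in enumerate(r["steps"]):
            peer_steps = [o["steps"][idx] for o in others if idx < len(o["steps"])]
            if peer_steps:
                sims.append(max(SequenceMatcher(None, step, p).ratio() for p in peer_steps))
        rewards.append(sum(sims) / len(sims) if sims else 0.0)
    return rewards

# Toy case: all three rollouts give the same answer, but the second deviates
# in its intermediate steps, so only the step-aware reward separates them.
rollouts = [
    {"steps": ["decompose into signed areas", "sum the parts"], "answer": "12"},
    {"steps": ["guess a formula", "adjust until it fits"],      "answer": "12"},
    {"steps": ["decompose into signed areas", "sum the parts"], "answer": "12"},
]
print(outcome_reward(rollouts))  # [1.0, 1.0, 1.0] -- indistinguishable
print(step_reward(rollouts))     # the second rollout gets a noticeably lower score
```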
| Model | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) | ChartQA | MathVista | MathVision | MathVerse |
|---|---|---|---|---|---|---|---|---|
| Vision-Zero† (CLEVR) | 80.35 | 82.64 | 88.50 | 51.44 | 84.24 | 68.43 | 23.96 | 43.86 |
| VisPlay* | – | – | – | 38.27 | – | – | 31.15 | 39.14 |
| Qwen2.5-VL-7B (Baseline) | 80.44 | 82.61 | 88.30 | 51.11 | 84.00 | 68.47 | 23.91 | 43.78 |
| Qwen2.5-VL-7B w/ Discrete Reward | 80.52 | 82.18 | 87.98 | 50.84 | 84.62 | 68.88 | 22.52 | 42.10 |
| EvoLMM | 81.06 | 83.41 | 89.50 | 52.01 | 86.70 | 70.52 | 24.81 | 44.88 |
| Qwen2.5-VL-7B w/ Discrete Reward + Step-level | 80.78 | 82.95 | 88.92 | 51.48 | 85.42 | 69.31 | 24.12 | 44.18 |
| iReasoner (Ours) | 81.56 | 83.89 | 89.92 | 52.37 | 85.78 | 69.74 | 25.29 | 45.91 |
† uses external supervision. * uses LLM-as-a-judge evaluation.
| Ablation | InfoGraphic-VQA (val) | AI2D | ScienceQA | MMMU (val) | ChartQA | MathVista | MathVision | MathVerse |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (Baseline) | 80.44 | 82.61 | 88.30 | 51.11 | 84.00 | 68.47 | 23.91 | 43.78 |
| Step-level majority (Full) | 81.56 | 83.89 | 89.92 | 52.37 | 85.78 | 69.74 | 25.29 | 45.91 |
| Soft majority reward only | 81.12 | 83.36 | 89.41 | 51.92 | 86.64 | 70.41 | 24.62 | 44.71 |
| Step-level reward only | 80.61 | 82.69 | 88.44 | 50.98 | 84.38 | 68.73 | 24.18 | 43.87 |
| w/o Warmup schedule | 81.04 | 83.21 | 89.26 | 51.74 | 85.02 | 68.97 | 24.63 | 45.11 |
| w/o Position decay | 81.29 | 83.58 | 89.55 | 52.02 | 85.41 | 69.34 | 25.02 | 45.49 |
| w/o Density weighting | 81.18 | 83.46 | 89.47 | 51.88 | 85.29 | 69.19 | 24.91 | 45.32 |
We analyze how the self-evolution process shapes question difficulty and reasoning structure over training. The top plots show stable Proposer rewards and sustained answer entropy, indicating that questions remain at intermediate difficulty without collapsing to trivial or unsolvable extremes. The bottom plots show increasing majority-group density and step similarity, indicating that Solver rollouts progressively converge on both final answers and intermediate reasoning structure.
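One plausible way to track the quantities these plots describe, given the Solver's final answers per question; the exact definitions of answer entropy and majority-group density used by iReasoner may differ from this sketch.

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (nats) of the final-answer distribution over rollouts.
    Sustained non-zero entropy suggests questions stay at intermediate
    difficulty rather than collapsing to a trivial (entropy -> 0) extreme."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def majority_group_density(answers):
    """Fraction of rollouts whose final answer matches the majority answer:
    one simple measure of how strongly Solver rollouts converge."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

answers = ["12", "12", "12", "9", "12", "15", "12", "12"]
print(round(answer_entropy(answers), 3))  # 0.736 nats for this toy list
print(majority_group_density(answers))    # 0.75
```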
Even when rollouts agree on the final answer, their intermediate reasoning can vary substantially. We visualize per-step agreement within the dominant-answer group using leave-one-out similarity scores. The heatmap shows that some rollouts deviate sharply at specific step indices while remaining aligned elsewhere. Disagreement concentrates in the middle of reasoning traces (steps 2–3), which supports the claim that outcome-only rewards cannot distinguish stable from unstable reasoning within the dominant answer mode.
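A sketch of the leave-one-out per-step agreement behind such a heatmap, assuming each rollout in the dominant-answer group is a list of step strings; the similarity function is again an illustrative stand-in for the measure used in the paper.

```python
from difflib import SequenceMatcher

def leave_one_out_step_similarity(group):
    """For each rollout in the dominant-answer group, score each step by its
    mean textual similarity to the same-index step of every other rollout.
    The result is the matrix a per-step agreement heatmap visualizes
    (rows = rollouts, columns = step indices)."""
    n_steps = max(len(r) for r in group)
    matrix = []
    for i, rollout in enumerate(group):
        row = []
        for s in range(n_steps):
            if s >= len(rollout):
                row.append(0.0)
                continue
            peers = [other[s] for j, other in enumerate(group)
                     if j != i and s < len(other)]
            row.append(sum(SequenceMatcher(None, rollout[s], p).ratio()
                           for p in peers) / len(peers) if peers else 0.0)
        matrix.append(row)
    return matrix

# Each rollout is a list of step strings from the dominant-answer group.
group = [
    ["split the shape", "compute signed areas", "sum the areas", "report 12"],
    ["split the shape", "compute signed areas", "sum the areas", "report 12"],
    ["split the shape", "try a bounding box",   "subtract corners", "report 12"],
]
for row in leave_one_out_step_similarity(group):
    print([round(v, 2) for v in row])  # the third rollout dips at the middle steps
```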
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. We thank the open-source LMM community for releasing models, code, and templates.