Step-R1-V-Mini is a multimodal reasoning model that supports image-text input and text output, with strong instruction-following and general capabilities. It can perceive images with high precision and carry out complex reasoning tasks.
To enhance the model's reasoning performance in scenarios that require multimodal collaboration, we introduced two key innovations in training:
Step-R1-V-Mini is trained with PPO-based reinforcement learning. We introduced verifiable rewards in the image space to address the complexity of reasoning chains and the correlational and causal reasoning errors that commonly arise in visual contexts. Compared with methods such as DPO, this approach generalizes better and is more robust when handling intricate visual reasoning paths.
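As an illustration, here is a minimal sketch of what a verifiable reward in this setting might look like. The `\boxed{}` answer format, the `extract_final_answer` helper, and the binary matching rule are our assumptions for exposition, not Step-R1-V-Mini's actual implementation:

```python
# A hypothetical verifiable reward: score 1.0 only when the model's final
# answer matches ground truth that can be checked programmatically, e.g. an
# object count extracted from the image. (Illustrative assumptions only.)
import re

def extract_final_answer(response: str) -> str | None:
    """Pull the final answer from a tagged span such as '... \\boxed{7}'."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    return match.group(1).strip() if match else None

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Binary reward for PPO: unlike a learned preference score, this signal
    cannot be satisfied by plausible-sounding but wrong reasoning chains."""
    return 1.0 if extract_final_answer(response) == ground_truth else 0.0

print(verifiable_reward(r"... so the count is \boxed{7}", "7"))  # 1.0
print(verifiable_reward(r"... the answer is \boxed{8}", "7"))    # 0.0
```

Because the check is programmatic rather than learned, the reward stays trustworthy even on long, intricate reasoning paths, which is one intuition for the robustness advantage over preference-based methods like DPO.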
Feedback signals in multimodal data are currently scarce. We therefore designed a large-scale, environment-feedback-driven multimodal data synthesis pipeline that generates reasoning datasets at scale. Leveraging PPO-based reinforcement learning on this data, we enhanced the model's textual and visual reasoning capabilities simultaneously, effectively mitigating the training "seesaw" effect, where gains in one modality come at the expense of the other.
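Below is a hedged sketch of how such an environment-driven synthesis loop could work. The counting task, the `Sample` dataclass, and all names are hypothetical; the point is only that the generating environment supplies a verifiable label for every sample, so scarce human feedback is not a bottleneck:

```python
# A hypothetical environment-driven synthesis loop: a toy "environment"
# composes scenes with ground truth known by construction, so every generated
# question ships with a verifiable label that can later feed the PPO reward.
import random
from dataclasses import dataclass

@dataclass
class Sample:
    scene: list[str]   # stand-in for a rendered image
    question: str
    answer: str        # label supplied by the environment, not by annotators

SHAPES = ["circle", "square", "triangle"]
COLORS = ["red", "blue", "green"]

def synthesize_sample(rng: random.Random) -> Sample:
    """Generate one counting task; the environment knows the true count."""
    scene = [f"{rng.choice(COLORS)} {rng.choice(SHAPES)}"
             for _ in range(rng.randint(3, 9))]
    target = rng.choice(SHAPES)
    count = sum(obj.endswith(target) for obj in scene)
    return Sample(scene, f"How many {target}s are in the image?", str(count))

rng = random.Random(0)
dataset = [synthesize_sample(rng) for _ in range(1000)]
print(dataset[0].question, "->", dataset[0].answer)
```

A production pipeline would render real images and cover far richer task families, but the design choice is the same: environment feedback replaces scarce human labels at scale.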
Step-R1-V-Mini achieves outstanding performance across several public benchmarks. It ranks first among models in China on the MathVision visual reasoning leaderboard, demonstrating strong capabilities in visual reasoning, mathematical logic, and code understanding.
Given a user-submitted photo of the SAIC Pudong Football Stadium, Step-R1-V-Mini quickly picks out elements in the image to infer the location. By analyzing visual cues such as colors, objects, and other distinctive features, the model synthesizes the evidence to identify the venue, and it also infers which teams are likely playing in the match.
When given an image of a dish, the model can precisely identify both the dish and any accompanying sauces. It then outputs a detailed recipe — down to specifics like “300g fresh shrimp, 2 stalks of white scallion,” showcasing both accurate recognition and rigorous reasoning.
Given an image featuring objects of various shapes, colors, and spatial arrangements, the model reasons step by step, from spatial perception to logical deduction, over the objects' attributes and positions. It ultimately concludes that the number of remaining objects is 8 - 1 = 7.
Step-R1-V-Mini represents a milestone in our exploration of multimodal reasoning. We welcome you to try it out and share your feedback. We will continue to innovate in the field of reasoning models — stay tuned.