Step-R1-V-Mini: A Lightweight Yet Powerful Multimodal Reasoning Model
A large multimodal reasoning model that supports image-text input and text output, capable of high-precision visual perception and complex reasoning tasks.

Step-R1-V-Mini is a multimodal reasoning model that supports image-text input and text output, with strong instruction-following and general capabilities. It can perceive images with high precision and perform complex reasoning tasks.

Technical Highlights

To enhance the model’s reasoning performance in multimodal collaborative scenarios, we implemented two key innovations in training:

1. Multimodal Joint Reinforcement Learning

Step-R1-V-Mini is trained with PPO-based reinforcement learning. We introduced verifiable rewards in the image space to address complex reasoning chains and the common errors in correlational and causal reasoning that arise in visual contexts. Compared with methods such as DPO, this approach generalizes better and is more robust when handling intricate visual reasoning paths.
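The two ingredients named above can be sketched in a few lines: a binary verifiable reward (the answer either checks out against ground truth or it does not) and PPO's clipped surrogate objective, which keeps policy updates bounded. This is a minimal illustration with assumed function names and a toy scalar formulation, not the Step-R1-V-Mini training code:

```python
def verifiable_reward(model_answer: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 if the model's final answer
    matches the checkable ground truth exactly, else 0.0."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def ppo_clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO's clipped surrogate objective for a single action.
    ratio = pi_new(a|s) / pi_old(a|s); advantage is derived from the reward.
    Clipping the ratio to [1 - eps, 1 + eps] bounds the policy update."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)

# A rollout whose verified answer is correct yields a positive signal;
# the clip prevents the update from over-rewarding a large ratio.
r = verifiable_reward("7", "7")                       # correct answer -> 1.0
objective = ppo_clipped_objective(ratio=1.5, advantage=r)
```

Because the reward is computed by a deterministic check rather than a learned reward model, it cannot be gamed by reward hacking, which is one reason verifiable rewards pair well with long reasoning chains.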

2. Extensive Use of Multimodal Synthetic Data

Currently, feedback signals in multimodal data are relatively scarce. We designed a large-scale, environment-feedback-driven multimodal data synthesis pipeline, generating scalable datasets for multimodal reasoning. Leveraging PPO-based reinforcement learning, we simultaneously enhanced the model’s textual and visual reasoning capabilities, effectively mitigating the training seesaw effect between modalities.
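One way an environment-feedback-driven synthesis pipeline can work is to generate samples whose answers are known by construction, so model outputs can be scored automatically at scale. The sketch below is illustrative only; the sample schema and function names are assumptions, not the actual pipeline:

```python
import random

def synthesize_counting_sample(rng: random.Random) -> dict:
    """Generate one synthetic visual-reasoning sample. The generator
    knows the ground-truth answer, so the environment can provide a
    feedback signal without human labeling."""
    shapes = ["circle", "square", "triangle"]
    scene = [rng.choice(shapes) for _ in range(rng.randint(3, 9))]
    target = rng.choice(shapes)
    return {
        "scene": scene,                      # would be rendered to an image
        "question": f"How many {target}s are in the scene?",
        "answer": str(scene.count(target)),  # ground truth by construction
    }

def verify(sample: dict, model_answer: str) -> bool:
    """Environment feedback: compare the model's answer to the
    generator's known ground truth."""
    return model_answer.strip() == sample["answer"]

rng = random.Random(0)
dataset = [synthesize_counting_sample(rng) for _ in range(1000)]
```

Because every sample carries its own checkable answer, the same pipeline yields both training data and the verifiable reward signal used during reinforcement learning.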

Leading Results on Visual Reasoning Leaderboards

Step-R1-V-Mini has achieved outstanding performance across several public benchmarks. It ranks first nationally on the MathVision visual reasoning leaderboard, demonstrating strong performance in visual reasoning, mathematical logic, and code understanding.

step-r1-v-mini visual reasoning leaderboard results

Use Case Highlights

Case 1: Location Recognition from Image

When provided with a real-world photo of the SAIC Pudong Football Stadium submitted by a user, Step-R1-V-Mini quickly identifies elements in the image to infer the location. By analyzing various visual cues — such as colors, objects, and other features — the model synthesizes the information to determine that the venue is SAIC Pudong Football Stadium, and it also infers the likely teams involved in the match.

address recognition
After cross-referencing with match records, the model’s reasoning proved to be accurate.

Case 2: Recipe Recognition from Food Image

When given an image of a dish, the model can precisely identify both the dish and any accompanying sauces. It then outputs a detailed recipe — down to specifics like “300g fresh shrimp, 2 stalks of white scallion,” showcasing both accurate recognition and rigorous reasoning.

recipe recognition

Case 3: Object Counting

Given an image featuring objects of various shapes, colors, and spatial arrangements, the model conducts step-by-step reasoning — from spatial perception to logic — based on the attributes and positions of the objects. It ultimately concludes that the number of remaining objects is 8 - 1 = 7.
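The final step of that chain reduces to arithmetic over the perceived objects. A hypothetical sketch of the perception-then-logic split, with made-up object names:

```python
def count_after_removal(perceived: list, removed: int) -> int:
    """Spatial perception yields a list of detected objects;
    the logical step subtracts however many the question removes."""
    return len(perceived) - removed

# Hypothetical scene matching the case: 8 perceived objects, 1 removed.
scene = [f"object_{i}" for i in range(8)]
remaining = count_after_removal(scene, 1)  # 8 - 1 = 7
```

The point of the benchmark case is that the model must get both stages right: an accurate object count from perception, and a correct arithmetic step on top of it.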

object counting

Step-R1-V-Mini represents a milestone in our exploration of multimodal reasoning. We welcome you to try it out and share your feedback. We will continue to innovate in the field of reasoning models — stay tuned.