Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model
In this blog, we introduce Open-Reasoner-Zero, the first open source implementation of large-scale reasoning-oriented RL training on the base model focusing on scalability, simplicity and accessibility.
Open-Reasoner-Zero: GitHub | Hugging Face

Introduction

Large-scale reinforcement learning (RL) training of language models on reasoning tasks has emerged as a promising paradigm for mastering complex problem-solving skills.
Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our Open-Reasoner-Zero-32B achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency—requiring only 1/10 of the training steps.
In the spirit of open-source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.

Figure 1: Benchmark performance of Open-Reasoner-Zero-{7B, 32B} during training. Using the same base model (Qwen2.5-32B base) as DeepSeek-R1-Zero-Qwen-32B, Open-Reasoner-Zero-32B achieves superior performance on the AIME2024, MATH500, and GPQA Diamond benchmarks while requiring only a tenth of the training steps.

Figure 2: Train-time scale-up of train reward and response length for ORZ-{0.5B, 1.5B, 7B, 32B}. Both train reward and response length increase steadily, demonstrating consistent scalability across model sizes. Interestingly, the ORZ-32B response length exhibits fluctuations without negatively impacting training stability, highlighting the robustness of our minimalist recipe.

| Model | AIME 2024 | AIME 2025 | MATH500 | GPQA Diamond |
| --- | --- | --- | --- | --- |
| DeepSeek-R1-Zero-Qwen-32B | 47 | – | 91.6 | 55 |
| DAPO-Qwen-32B | 50 | – | – | – |
| DAPO-Qwen-32B* | 48.3 | 37.9 | 71.8 | 16 |
| Open-Reasoner-Zero-32B | 48.1 | 36 | 92.2 | 55.5 |

Comparison of Open-Reasoner-Zero-32B with DeepSeek-R1-Zero-Qwen-32B and DAPO-Qwen-32B on reasoning-related benchmarks. DeepSeek-R1-Zero-Qwen-32B results are taken from the DeepSeek-R1 report. DAPO-Qwen-32B* results were obtained using our evaluation metric on the released checkpoint.

| Model | MMLU | MMLU_PRO |
| --- | --- | --- |
| Qwen2.5-32B-Base | 83.3 | 55.1 |
| Qwen2.5-32B-Instruct | 83.2 | 69.2 |
| DAPO-Qwen-32B | 79.7 | 64.5 |
| Open-Reasoner-Zero-32B | 84.9 | 74.4 |

Generalization performance of Open-Reasoner-Zero on the MMLU and MMLU_PRO benchmarks. ORZ achieves superior performance on both benchmarks through RL training on reasoning tasks alone, surpassing Qwen2.5-32B-Instruct without any additional instruction tuning.

| Model | AIME 2024 | AIME 2025 | MATH500 | GPQA Diamond |
| --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-14B | 69.7 | 49.1 | 93.9 | 59.1 |
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 60 | 94.3 | 62.1 |
| ORZ-R1-Distill-Qwen-14B | 75.2 | 60 | 95.6 | 60.4 |

We also apply the ORZ training recipe to reasoning-enhanced models such as DeepSeek-R1-Distill-Qwen-14B, building on the advanced reasoning patterns it distilled from stronger reasoning models and substantially boosting its performance. The resulting ORZ-R1-Distill-Qwen-14B achieves strong results on reasoning benchmarks, even surpassing the larger DeepSeek-R1-Distill-Qwen-32B model.

Scale-up Reinforcement Learning on a Base Model

In this section, we describe the strategy and critical components for scaling up reasoning-oriented RL directly on a base model, including algorithm choice and implementation, data curation, prompt design, and reward function specification. Concretely, we show that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length.

Choosing PPO over GRPO

We adopt Proximal Policy Optimization (PPO) as our RL algorithm, in contrast to the GRPO used in DeepSeek-R1-Zero:

$$\mathcal{J}_{\text{PPO}}(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}} \left[ \sum_{t=0}^{T-1} \min \left( \rho_t(\theta)\, \hat{A}_t,\ \text{clip}\big(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_t \right) \right]$$

$$\mathcal{J}_{\text{value}}(\phi) = \frac{1}{2}\, \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}} \left[ \sum_{t=0}^{T-1} \big(V_\phi(s_t) - V_t^{\text{target}}\big)^2 \right]$$

where $\rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the importance sampling ratio and $\hat{A}_t$ is the advantage estimate.

We select PPO over GRPO due to its superior value estimation enabled by a learned critic. This critic facilitates accurate token-level value estimation, effectively identifying and devaluing detrimental patterns such as repetitive behaviors, i.e., performing fine-grained credit assignment. Consequently, PPO achieves notably more robust advantage estimation than GRPO. Lacking a dedicated value network, GRPO struggles to distinguish genuinely correct responses from those occurring within negative patterns (e.g., repetitive loops). This deficiency can misdirect reinforcement, leading to training instability and eventual collapse, an observation supported by community discussions (see the OpenR1 discussion about vanilla GRPO reproduction).
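For reference, GRPO (as used in DeepSeek-R1-Zero) replaces the learned critic with a group-relative baseline: for each prompt it samples a group of $G$ responses and normalizes their rewards. A standard form of its advantage estimate (notation ours) is:

$$\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}\big(\{R_1, \ldots, R_G\}\big)}{\operatorname{std}\big(\{R_1, \ldots, R_G\}\big)}$$

which is shared by every token $t$ of response $i$. As a result, a correct answer embedded in repetitive or degenerate text receives the same per-token credit as a clean one, which is exactly the failure mode that a learned critic can penalize at the token level.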

Figure 3: Left: Advantage comparison between PPO and GRPO on repetitive tokens. PPO assigns more negative advantages to repetitive patterns than GRPO, demonstrating superior penalization of undesirable behaviors. Right: Visualization of value approximations, showing that the critic assigns lower values to repetitive patterns and higher values to coherent text, reflecting how it effectively identifies undesirable generation patterns.

Algorithm Implementations

Our empirical studies suggest that vanilla PPO already provides highly stable and robust training across different model scales and training durations.
Nonetheless, appropriate implementation details matter. Through extensive experiments, we found that the choice of GAE parameters substantially impacts performance on reasoning-oriented tasks. Specifically, the discount factor $\gamma$ controls the effective sequence length considered during training: a lower $\gamma$ assigns exponentially decreasing weights to future rewards, inducing the model to prematurely terminate generation in order to obtain rewards sooner. The GAE parameter $\lambda$, on the other hand, balances bias and variance in advantage estimation. Crucially, in large-scale training scenarios, the substantial data volume naturally mitigates variance concerns, encouraging us to adopt a bias-free configuration. Consequently, by setting $\gamma=1$ and $\lambda=1$, we fully capture the long-term dependencies critical for reasoning tasks and achieve stable training. Fortuitously, this also leads to a significant simplification of the GAE advantage computation in our case:

$$\hat{A}_t^{\text{GAE}(\gamma=1,\, \lambda=1)} = R - V_\phi(s_t), \qquad \mathcal{J}_{\text{value}}(\phi) = \frac{1}{2}\, \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}} \left[ \sum_{t=0}^{T-1} \big(V_\phi(s_t) - R\big)^2 \right]$$

where $R$ is the single terminal reward.
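To make this simplification explicit, here is a short derivation under the stated assumptions (only the final token carries a non-zero reward, and the value of the terminal state is zero): with $\gamma = \lambda = 1$, the TD errors telescope.

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t), \qquad \hat{A}_t^{\text{GAE}(\gamma,\, \lambda)} = \sum_{l=0}^{T-1-t} (\gamma\lambda)^l\, \delta_{t+l}$$

$$\hat{A}_t^{\text{GAE}(1,\,1)} = \sum_{t'=t}^{T-1} \delta_{t'} = \sum_{t'=t}^{T-1} r_{t'} + V_\phi(s_T) - V_\phi(s_t) = R - V_\phi(s_t)$$

This also yields the simplified value objective above, since the value target becomes $V_t^{\text{target}} = R$ for every timestep.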

Removing KL Regularization

We achieve stable training without relying on any KL-based regularization techniques (e.g., KL-shaped rewards or a KL loss term), departing from the de facto practice in the RLHF community and recent reasoner models. Intuitively, KL regularization constrains the policy model to remain close to the original base model distribution, potentially limiting exploration during policy optimization. By omitting KL regularization, our approach offers several practical advantages: (1) it obviates the need to navigate the large and hard-to-tune design space inherent to KL regularization, greatly simplifying the training procedure; and (2) it lowers computational overhead and memory usage by eliminating the need to load the weights of a separate reference model and compute log probabilities with it. Together, these benefits facilitate efficient and scalable large-scale RL training.
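For concreteness, the per-token KL-shaped reward commonly used in RLHF pipelines takes the standard form below, with penalty coefficient $\beta$ and a frozen reference policy $\pi_{\text{ref}}$; our recipe drops this term entirely, which is what removes the reference model from both memory and compute:

$$r_t^{\text{shaped}} = r_t - \beta\, \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$$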

Scale up Training Data

We identify that scaling up data quantity and diversity is pivotal for Reasoner-Zero training. While training on limited academic datasets such as the MATH training set leads to a quick performance plateau, our curated large-scale, diverse dataset demonstrates strong potential for continuous improvement, without signs of saturation, on both training and test sets.

Figure 4: Ablation studies for key design choices in ORZ, using reward on the training set or MATH500 as the performance metric. Left: comparison of different GAE $\lambda$ values. Middle: comparison of KL-related regularizations. Right: data scale ablation. These findings collectively inform our minimalist yet effective ORZ training recipe.

Minimal Reward Function Design

In contrast to approaches such as DeepSeek-R1, which use a dedicated format reward to enforce structured reasoning (e.g., enclosing thought processes within designated tags), we demonstrate that the simplest rule-based reward function is not only sufficient but also optimal, as the minimal design leaves no room for potential reward hacking:

$$r_t = \begin{cases} 0, & t < T_i - 1, \\ R \in \{0, 1\}, & t = T_i - 1, \end{cases}$$

where $T_i$ is the length of the $i$-th response and $R = 1$ if and only if the final answer is correct.

Notably, even unaligned base models quickly adapt to the desired format, suggesting that format compliance is a straightforward task that does not require complex reward engineering.

Figure 5: Left: Correct format ratio. Results demonstrate rapid adoption of structured reasoning patterns even by the base model trained with a simple outcome reward, suggesting that complex reward functions are unnecessary for Reasoner-Zero training. Right: Reflection patterns in generation. The average correct reflection length consistently exceeds the average response length during training, indicating that reflection patterns are naturally incentivized by our minimal reward design during large-scale training.

Use ORZ Models

You can easily start a service of ORZ-{0.5B, 1.5B, 7B, 32B} using vLLM:

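For example, here is a minimal sketch; the Hugging Face repo ID and parallelism settings below are assumptions, so substitute the released checkpoint name and a tensor-parallel size that fits your GPUs:

```shell
# Launch an OpenAI-compatible server with vLLM
# (repo ID below is assumed; pick the released ORZ checkpoint you want to serve)
vllm serve Open-Reasoner-Zero/Open-Reasoner-Zero-32B \
  --tensor-parallel-size 4 \
  --served-model-name open-reasoner-zero-32b

# Query the server through the OpenAI-compatible endpoint (default port 8000)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "open-reasoner-zero-32b", "messages": [{"role": "user", "content": "Solve 2x + 3 = 11 and show your reasoning."}]}'
```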

You can also try out ORZ-R1-Distill-Qwen-14B:

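A similar sketch applies here; again, the repo ID is an assumption, so check the project's Hugging Face page for the exact released name:

```shell
# Serve the RL-trained distilled model with vLLM (repo ID assumed)
vllm serve Open-Reasoner-Zero/ORZ-R1-Distill-Qwen-14B --tensor-parallel-size 2
```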

Next Steps

While Open-Reasoner-Zero (ORZ) marks a significant step forward in accessible and scalable reasoning-oriented RL, we’re excited about several avenues for future exploration:

Continuing Training-Time Scaling. Building upon the strong training-time scaling already demonstrated by ORZ, one immediate direction is to further amplify these capabilities. We plan to push these boundaries by harnessing substantially larger quantities of high-quality, diverse training data and by extending model sequence lengths to accommodate more complex reasoning chains.

Advancing Test-Time Compute Scaling. A crucial next step is to investigate new training paradigms that can further scale up test-time compute. We will explore techniques such as multi-turn interactions in multi-agent learning systems for more sophisticated contextual reasoning to tackle genuinely challenging tasks. We believe these directions will be critical for developing artificial general intelligence, and we invite the community to join us in exploring these exciting frontiers with Open-Reasoner-Zero.