Commit 8472da8

[algo] support RLOO algorithm (#6325)

* rloo init
* nits
* fix script
* fix script & batch norm
* update script
* typo

1 parent 414bb08 commit 8472da8

File tree

10 files changed: +410, -68 lines
Lines changed: 96 additions & 0 deletions
@@ -0,0 +1,96 @@
# REINFORCE Leave-One-Out (RLOO)

**Version requirement**: ms-swift>=3.10

[REINFORCE Leave-One-Out (RLOO)](https://arxiv.org/abs/2402.14740) builds on the classic REINFORCE policy-gradient method and constructs an unbiased advantage baseline via the leave-one-out technique.

## Algorithm Overview

For clarity, we explain RLOO by contrasting it with GRPO (Group Relative Policy Optimization).

### Key Differences Between GRPO and RLOO

Both GRPO and RLOO estimate advantages through intra-group comparisons, avoiding the high variance that comes with estimating a global baseline. Their core differences lie in the following two aspects:

#### Difference 1: How the Advantage Baseline Is Constructed

**1. GRPO (Group Relative Policy Optimization)**

For each prompt, GRPO generates $G$ response samples and normalizes rewards using the **mean and standard deviation of all samples in the group**:

$$
\hat{A}_{i} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}
$$

where:
- $R_i$ is the reward of the $i$-th sample
- $\text{mean}(\{R_j\}_{j=1}^G) = \frac{1}{G}\sum_{j=1}^G R_j$ is the group mean
- $\text{std}(\{R_j\}_{j=1}^G)$ is the group standard deviation

**2. RLOO (REINFORCE Leave-One-Out)**

For each prompt, RLOO generates $K$ response samples and constructs the baseline via **leave-one-out**: the baseline for the $i$-th sample is the mean of the other $K-1$ samples:

$$
\hat{A}_{i} = R_i - \frac{1}{K-1}\sum_{j \neq i} R_j
$$

This can be rewritten equivalently as:

$$
\hat{A}_{i} = \frac{K}{K-1} \left(R_i - \bar{R}\right)
$$

where $\bar{R} = \frac{1}{K}\sum_{j=1}^K R_j$ is the mean reward over all samples in the group.

> **Note**: We use $K$ here to match the paper's notation; it has the same meaning as $G$ in GRPO, and both correspond to the configuration parameter `num_generations`.

**Why leave-one-out?**

The key advantage of leave-one-out is **unbiasedness**. For the $i$-th sample, its reward $R_i$ and the baseline $\frac{1}{K-1}\sum_{j \neq i} R_j$ are independent, so the advantage estimate is unbiased. In contrast, using a mean that includes the sample itself as the baseline introduces bias.
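To make the leave-one-out computation and the equivalence above concrete, here is a minimal NumPy sketch (an illustration of the formulas only, not the ms-swift implementation):

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages for one group of K rewards sampled from the same prompt."""
    k = rewards.shape[0]
    # Baseline for sample i: mean of the other K-1 rewards.
    loo_baseline = (rewards.sum() - rewards) / (k - 1)
    return rewards - loo_baseline

rewards = np.array([1.0, 0.0, 0.5, 1.0])  # K = 4 sampled rewards for one prompt
adv_loo = rloo_advantages(rewards)
adv_scaled = len(rewards) / (len(rewards) - 1) * (rewards - rewards.mean())
assert np.allclose(adv_loo, adv_scaled)   # both forms give identical advantages
```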
#### Difference 2: How the KL Regularization Term Is Applied

To keep the policy from drifting too far from the reference policy, both algorithms introduce KL-divergence regularization, but they apply it differently:

**GRPO**: adds the KL divergence as a separate regularization term in the [loss function](../GetStarted/GRPO.md#算法原理):

$$
\mathcal{L}(\theta) = -\mathbb{E}\left[\hat{A}_i \log \pi_\theta(a_i|s_i)\right] + \beta \cdot \text{KL}(\pi_\theta \Vert \pi_{\text{ref}})
$$

**RLOO**: folds the KL divergence directly into the reward, forming a modified reward:

$$
R'_i = R_i - \beta \cdot \text{KL}(\pi_\theta \Vert \pi_{\text{ref}})
$$

where $\beta$ is the KL coefficient (parameter `beta`) and $\pi_{\text{ref}}$ is the reference policy (typically the SFT model or the initial policy).
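The following short sketch contrasts the two placements of the KL term, assuming scalar per-sample rewards and a precomputed per-sample KL estimate (the variable names are illustrative, not taken from the trainer code):

```python
import numpy as np

beta = 0.001                                # KL coefficient (`beta`)
rewards = np.array([1.0, 0.0, 0.5, 1.0])    # raw rewards for one prompt group
kl = np.array([0.02, 0.05, 0.01, 0.03])     # per-sample KL(pi_theta || pi_ref) estimates

# RLOO style (kl_in_reward=true): subtract the KL penalty from the reward first,
# then compute leave-one-out advantages on the modified reward R'.
shaped = rewards - beta * kl
k = len(shaped)
advantages = k / (k - 1) * (shaped - shaped.mean())

# GRPO style (kl_in_reward=false): advantages are computed from the raw rewards,
# and beta * KL enters the loss as a separate regularization term instead.
```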
## Parameter Settings

RLOO training can be enabled on top of `GRPOTrainer` by setting the following parameters:

```bash
# Basic RLOO configuration
--advantage_estimator rloo   # use RLOO's leave-one-out advantage estimation
--kl_in_reward true          # fold the KL term into the reward (RLOO's default treatment)
```

For training, refer to this [script](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/rloo.sh).

### Key Parameters

- **`--advantage_estimator`**: selects the advantage estimation method
  - `grpo` (default): normalize with the group mean and standard deviation
  - `rloo`: construct the baseline via leave-one-out

- **`--kl_in_reward`**: controls where the KL regularization term is applied
  - `false`: KL is a separate regularization term in the loss (GRPO style)
  - `true`: KL is subtracted directly from the reward, forming a modified reward (RLOO style)

- **`--num_generations`**: number of samples generated per prompt, i.e., $K$

- **`--beta`**: KL regularization coefficient $\beta$
  - controls how conservative the policy update is

Other parameters are the same as the [GRPO arguments](../../命令行参数.md#grpo参数).

docs/source/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ GRPOTrainer was refactored in ms-swift 3.5. If the swift version you are using is <3

 [GRPO (Group Relative Policy Optimization)](https://arxiv.org/abs/2402.03300) replaces the separate value model of PPO with intra-group relative advantage computation and adds a KL-divergence penalty directly to the loss function to improve training stability.

+## Algorithm Overview

 GRPO objective function

docs/source/Instruction/命令行参数.md

Lines changed: 8 additions & 5 deletions
@@ -551,22 +551,25 @@ Reward model parameters are used in PPO and GRPO.
 - offload_model: Whether to offload the model during vLLM inference. Default is False.
 - completion_length_limit_scope: The scope of the `max_completion_length` limit in multi-turn conversations.
 `total` limits the total output length across all turns to at most `max_completion_length`; `per_round` limits the output length of each turn.
-- num_iterations: Number of updates per batch. Default is 1.
+- num_iterations: Number of updates per data sample, the $\mu$ value in the [GRPO paper](https://arxiv.org/abs/2402.03300). Default is 1.
 - epsilon: Clip coefficient. Default is 0.2.
 - epsilon_high: Upper clip coefficient. Default is None; when set, it forms the clipping range [epsilon, epsilon_high] together with epsilon.
+- dynamic_sample: Filter out data whose in-group reward standard deviation is 0 and sample additional new data. Default is False.
+- max_resample_times: Under dynamic_sample, limit the number of resampling attempts. Default is 3.
+- overlong_filter: Skip samples truncated for excessive length so they do not participate in loss computation. Default is False.
 - delta: Two-sided GRPO upper clipping bound from the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291). If set, a value greater than 1 + epsilon is recommended. Default is None.
+- importance_sampling_level: Controls how the importance-sampling ratio is computed; options are `token` and `sequence`. `token` mode keeps the original per-token log-probability ratios, while `sequence` mode averages the log-probability ratios of all valid tokens in the sequence. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level computation to stabilize training. Default is `token`.
+- advantage_estimator: Advantage estimation function. Default is `grpo`, i.e., group-relative advantage; options are `grpo` and [`rloo`](./GRPO/AdvancedResearch/RLOO.md).
+- kl_in_reward: Controls where the KL regularization term is applied. `false` (default) keeps it as a separate regularization term in the loss; `true` folds the KL directly into the reward (subtracts it from the reward).
+- scale_rewards: Reward scaling strategy. Default is `group`, which scales rewards by the in-group standard deviation; `batch` scales by the standard deviation over the whole batch; `none` disables scaling. In ms-swift<3.10 this was a boolean: true corresponds to group, false to none.
 - sync_ref_model: Whether to periodically synchronize the ref_model. Default is False.
 - ref_model_mixup_alpha: Controls the mixing of the model and the previous ref_model during updates. Update rule: $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
 - ref_model_sync_steps: Synchronization frequency. Default is 512.
 - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM, how many batches to split the layers into. Default is None, meaning the model is not split; otherwise it is split into move_model_batches + 1 (non-layer parameters) + 1 (multimodal parameters) batches.
 - multi_turn_scheduler: Multi-turn GRPO parameter; pass the corresponding plugin name and add the implementation in plugin/multi_turn.py.
 - max_turns: Upper limit on the number of turns in multi-turn GRPO. Default is None (no limit).
-- dynamic_sample: Filter out data whose in-group reward standard deviation is 0 and sample additional new data. Default is False.
-- max_resample_times: Under dynamic_sample, limit the number of resampling attempts. Default is 3.
-- overlong_filter: Skip samples truncated for excessive length so they do not participate in loss computation. Default is False.
 - top_entropy_quantile: Only tokens whose entropy falls within the specified top quantile participate in the loss computation. Default is 1.0, i.e., low-entropy tokens are not filtered. See the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
 - log_entropy: Log the entropy dynamics during training. Default is False. See the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
-- importance_sampling_level: Controls how the importance-sampling ratio is computed; options are `token` and `sequence`. `token` mode keeps the original per-token log-probability ratios, while `sequence` mode averages the log-probability ratios of all valid tokens in the sequence. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level computation to stabilize training. Default is `token`.

 Cosine reward parameters
 - cosine_min_len_value_wrong: Cosine reward function parameter; the reward value at the minimum length when a wrong answer is generated. Default is -0.5.

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 11 additions & 10 deletions
@@ -563,26 +563,27 @@ The meanings of the following parameters can be referenced [here](https://huggin
 When set to `total`, the total output length across all turns must not exceed `max_completion_length`.
 When set to `per_round`, each individual turn's output length is limited separately.
 Defaults to `per_round`. Currently only takes effect in colocate mode.
-- top_k: Default is 50.
-- top_p: Default is 0.9.
-- repetition_penalty: Repetition penalty term. Default is 1.
-- num_iterations: number of iterations per batch. Default is 1.
+- num_iterations: The number of updates per data sample, corresponding to the $\mu$ value in the GRPO paper. Default is 1.
 - epsilon: epsilon value for clipping. Default is 0.2.
 - epsilon_high: Upper clip coefficient, default is None. When set, it forms a clipping range of [epsilon, epsilon_high] together with epsilon.
+- dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data. Default is False.
+- max_resample_times: Under the dynamic_sample setting, limit the number of resampling attempts to a maximum of 3. Default is 3 times.
+- overlong_filter: Skip overlong truncated samples, which will not be included in loss calculation. Default is False.
+The hyperparameters for the reward function can be found in the [Built-in Reward Functions section](#built-in-reward-functions).
 - delta: Delta value for the upper clipping bound in two-sided GRPO. Recommended to be > 1 + epsilon. This method was introduced in the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291).
-- sync_ref_model: Whether to synchronize the reference model. Default is False。
+- importance_sampling_level: Controls how the importance sampling ratio is computed. Options are `token` and `sequence`. In `token` mode, the raw per-token log-probability ratios are used. In `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged to produce a single ratio per sequence. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level importance sampling to stabilize training. The default is `token`.
+- advantage_estimator: Advantage estimator. Default is `grpo` (group-relative advantage). Options: `grpo`, [`rloo`](./GRPO/AdvancedResearch/RLOO.md).
+- kl_in_reward: Controls where the KL regularization is applied. `false` (default): KL is added as a separate term in the loss. `true`: KL is subtracted directly from the reward (integrated into the reward).
+- scale_rewards: Reward scaling strategy. Default is `group` (scale by standard deviation within each group). `batch` scales across the entire batch; `none` disables scaling. In ms-swift<3.10, this was a boolean: `true` means `group`, `false` means `none`.
+- sync_ref_model: Whether to synchronize the reference model. Default is False.
 - ref_model_mixup_alpha: The Parameter controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to the equation: $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
 - ref_model_sync_steps: The parameter determines how frequently the current policy is synchronized with the reference policy. Default is 512.
 - move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
 - multi_turn_scheduler: Multi-turn GRPO parameter; pass the corresponding plugin name, and make sure to implement it in plugin/multi_turn.py.
 - max_turns: Maximum number of rounds for multi-turn GRPO. The default is None, which means there is no limit.
-- dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data. Default is False.
-- max_resample_times: Under the dynamic_sample setting, limit the number of resampling attempts to a maximum of 3. Default is 3 times.
-- overlong_filter: Skip overlong truncated samples, which will not be included in loss calculation. Default is False.
-The hyperparameters for the reward function can be found in the [Built-in Reward Functions section](#built-in-reward-functions).
 - top_entropy_quantile: Only tokens whose entropy ranks within the specified top quantile are included in the loss calculation. The default is 1.0, which means low-entropy tokens are not filtered. For details, refer to the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
 - log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
-- importance_sampling_level: Controls how the importance sampling ratio is computed. Options are `token` and `sequence`. In `token` mode, the raw per-token log-probability ratios are used. In `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged to produce a single ratio per sequence. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level importance sampling to stabilize training. The default is `token`.
+


 cosine reward function arguments
Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
# REINFORCE Leave-One-Out (RLOO)

**Version requirement**: ms-swift>=3.10

[REINFORCE Leave-One-Out (RLOO)](https://arxiv.org/abs/2402.14740) is a reinforcement learning algorithm based on the classic REINFORCE policy-gradient method. It constructs an unbiased advantage baseline via the Leave-One-Out (LOO) technique.

## Algorithm Overview

For clarity, we explain RLOO by contrasting it with GRPO (Group Relative Policy Optimization).

### Key Differences Between GRPO and RLOO

Both GRPO and RLOO estimate advantages via intra-group comparisons to avoid the high variance of a global baseline. Their core differences are mainly in the following aspects:

#### Difference 1: How the Advantage Baseline Is Constructed

**1. GRPO (Group Relative Policy Optimization)**

For each prompt, GRPO generates $G$ response samples and normalizes rewards using the group mean and standard deviation:

$$
\hat{A}_{i} = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}
$$

Where:
- $R_i$ is the reward of the $i$-th sample
- $\text{mean}(\{R_j\}_{j=1}^G) = \frac{1}{G}\sum_{j=1}^G R_j$ is the group mean
- $\text{std}(\{R_j\}_{j=1}^G)$ is the group standard deviation

**2. RLOO (REINFORCE Leave-One-Out)**

For each prompt, RLOO generates $K$ response samples and constructs the baseline via Leave-One-Out, i.e., for the $i$-th sample, the baseline is the mean of the other $K-1$ samples:

$$
\hat{A}_{i} = R_i - \frac{1}{K-1}\sum_{j \neq i} R_j
$$

This can be equivalently rewritten as:

$$
\hat{A}_{i} = \frac{K}{K-1} \left(R_i - \bar{R}\right)
$$

where $\bar{R} = \frac{1}{K}\sum_{j=1}^K R_j$ is the group mean reward.
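The rewrite follows from one line of algebra, using $\sum_{j \neq i} R_j = K\bar{R} - R_i$:

$$
R_i - \frac{1}{K-1}\sum_{j \neq i} R_j
= \frac{(K-1)R_i - (K\bar{R} - R_i)}{K-1}
= \frac{K R_i - K\bar{R}}{K-1}
= \frac{K}{K-1}\left(R_i - \bar{R}\right).
$$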
> Note: We use $K$ here to match the notation in the paper. It has the same meaning as $G$ in GRPO and corresponds to the configuration parameter `num_generations`.

**Why Leave-One-Out?**

The key advantage is unbiasedness. For the $i$-th sample, its reward $R_i$ is independent of the baseline $\frac{1}{K-1}\sum_{j \neq i} R_j$, hence the advantage estimate is unbiased. In contrast, using the mean including itself as the baseline introduces bias.

#### Difference 2: How KL Regularization Is Applied

To prevent the policy from drifting too far from the reference policy, both algorithms introduce KL divergence regularization, but in different ways:

**GRPO**: Adds KL divergence as an independent regularization term to the [loss](../GetStarted/GRPO.md#algorithm-overview):

$$
\mathcal{L}(\theta) = -\mathbb{E}\left[\hat{A}_i \log \pi_\theta(a_i|s_i)\right] + \beta \cdot \text{KL}(\pi_\theta \Vert \pi_{\text{ref}})
$$

**RLOO**: Integrates KL divergence directly into the reward, constructing a modified reward:

$$
R'_i = R_i - \beta \cdot \text{KL}(\pi_\theta \Vert \pi_{\text{ref}})
$$

where $\beta$ is the KL coefficient (parameter `beta`), and $\pi_{\text{ref}}$ is the reference policy (typically an SFT model or the initial policy).
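Putting the two differences together, the per-completion advantage under RLOO with `kl_in_reward` enabled can be sketched as follows (plain NumPy for illustration; the array shapes and names are assumptions, not the `GRPOTrainer` internals):

```python
import numpy as np

def rloo_advantages(rewards: np.ndarray, kl: np.ndarray,
                    num_generations: int, beta: float) -> np.ndarray:
    """rewards, kl: shape (num_prompts * num_generations,), grouped by prompt."""
    shaped = rewards - beta * kl                      # kl_in_reward=true: R' = R - beta * KL
    groups = shaped.reshape(-1, num_generations)      # one row per prompt
    k = num_generations
    # Leave-one-out baseline: mean of the other K-1 shaped rewards in the same group.
    baseline = (groups.sum(axis=1, keepdims=True) - groups) / (k - 1)
    return (groups - baseline).reshape(-1)

# Example: 2 prompts, num_generations=4, beta=0.001
rewards = np.array([1.0, 0.0, 0.5, 1.0,  0.0, 0.0, 1.0, 0.5])
kl = np.array([0.02, 0.05, 0.01, 0.03,  0.02, 0.04, 0.02, 0.01])
advantages = rloo_advantages(rewards, kl, num_generations=4, beta=0.001)
```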
## Parameter Configuration

RLOO training can be enabled on top of `GRPOTrainer` by setting the following parameters:

```bash
# Basic RLOO configuration
--advantage_estimator rloo   # Use RLOO's leave-one-out advantage estimator
--kl_in_reward true          # Integrate KL divergence into the reward (default for RLOO)
```

You can refer to this [script](https://github.com/modelscope/ms-swift/tree/main/examples/train/grpo/internal/rloo.sh) for training.

### Important Parameters

- **`--advantage_estimator`**: Choose the advantage estimator
  - `grpo` (default): standardize using group mean and standard deviation
  - `rloo`: construct the baseline via Leave-One-Out

- **`--kl_in_reward`**: Controls where the KL term is applied
  - `false`: KL as a separate regularization term in the loss (GRPO style)
  - `true`: subtract KL directly from the reward to form a modified reward (RLOO style)

- **`--num_generations`**: Number of samples per prompt, i.e., $K$

- **`--beta`**: KL regularization coefficient $\beta$
  - Controls how conservatively the policy updates

Other parameters are consistent with the [GRPO arguments](../../Command-line-parameters.md#grpo-arguments).

docs/source_en/Instruction/GRPO/GetStarted/GRPO.md

Lines changed: 3 additions & 1 deletion
@@ -4,7 +4,9 @@ GRPOTrainer underwent a code refactoring in ms-swift3.5. If you are using a swif

 [GRPO (Group Relative Policy Optimization)](https://arxiv.org/abs/2402.03300) leverages intra-group relative advantage calculations to replace the independent value model in the PPO algorithm and directly incorporates KL divergence penalties into the loss function to improve training stability.

-### GRPO Objective Function
+## Algorithm Overview
+
+GRPO Objective Function is defined as
 $
 {\scriptstyle
 \begin{aligned}
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --advantage_estimator rloo \
    --kl_in_reward true \
    --model Qwen/Qwen2.5-VL-3B-Instruct \
    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_r1v_acc format \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.4 \
    --vllm_tensor_parallel_size 1 \
    --vllm_max_model_len 16384 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --dataset 'AI-ModelScope/clevr_cogen_a_train' \
    --overlong_filter false \
    --epsilon 3e-4 \
    --epsilon_high 4e-4 \
    --max_completion_length 1024 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 4 \
    --eval_steps 1000 \
    --save_steps 1000 \
    --save_total_limit 10 \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --logging_steps 1 \
    --dataloader_num_workers 4 \
    --num_generations 16 \
    --temperature 1.0 \
    --system 'examples/train/grpo/prompt.txt' \
    --deepspeed zero2 \
    --log_completions true \
    --report_to tensorboard swanlab \
    --num_iterations 1 \
    --async_generate false \
    --beta 0.001 \
    --attn_impl flash_attention_2 \
    --padding_free true \
    --loss_type grpo
