**docs/source_en/Instruction/Command-line-parameters.md** (11 additions, 10 deletions)

When set to `total`, the total output length across all turns must not exceed `max_completion_length`. When set to `per_round`, each individual turn's output length is limited separately. Defaults to `per_round`. Currently only takes effect in colocate mode.
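
As a rough illustration of the two scopes described above (a sketch only, not the ms-swift implementation; the helper and its arguments are invented for this example), the remaining generation budget for a turn could be computed like this:

```python
def remaining_budget(scope: str, max_completion_length: int, tokens_used_so_far: int) -> int:
    """Illustrative only: how many new tokens the current turn may generate."""
    if scope == "total":
        # One shared budget across all turns of the conversation.
        return max(max_completion_length - tokens_used_so_far, 0)
    if scope == "per_round":
        # Every turn gets the full budget, regardless of earlier turns.
        return max_completion_length
    raise ValueError(f"unknown scope: {scope}")
```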
- top_k: Default is 50.
- top_p: Default is 0.9.
- repetition_penalty: Repetition penalty term. Default is 1.
- num_iterations: The number of updates per data sample, corresponding to the $\mu$ value in the GRPO paper. Default is 1.
- epsilon: epsilon value for clipping. Default is 0.2.
- epsilon_high: Upper clip coefficient. Default is None. When set, it forms the clipping range [1 - epsilon, 1 + epsilon_high] together with epsilon.
- dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data (illustrated in the sketch below). Default is False.
- max_resample_times: Under the dynamic_sample setting, the maximum number of resampling attempts. Default is 3.
- overlong_filter: Skip overlong truncated samples, which will not be included in the loss calculation. Default is False.

The hyperparameters for the reward function can be found in the [Built-in Reward Functions section](#built-in-reward-functions).
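
A minimal sketch of what `dynamic_sample` and `max_resample_times` describe (not the ms-swift implementation; `sample_group_rewards` is a hypothetical placeholder for regenerating and scoring a group of completions):

```python
import numpy as np

def filter_and_resample(group_rewards, sample_group_rewards, max_resample_times=3):
    """Illustrative only: replace groups whose rewards have zero standard deviation.

    A group in which every completion receives the same reward yields all-zero
    advantages and therefore no learning signal, so it is resampled.
    """
    kept = []
    for rewards in group_rewards:
        rewards = np.asarray(rewards, dtype=float)
        attempts = 0
        while rewards.std() == 0 and attempts < max_resample_times:
            rewards = np.asarray(sample_group_rewards(), dtype=float)
            attempts += 1
        if rewards.std() > 0:
            kept.append(rewards)  # groups that remain degenerate are dropped
    return kept
```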
- delta: Delta value for the upper clipping bound in two-sided GRPO. Recommended to be > 1 + epsilon. This method was introduced in the [INTELLECT-2 tech report](https://huggingface.co/papers/2505.07291) (see the clipping sketch after this parameter list).
- importance_sampling_level: Controls how the importance sampling ratio is computed. Options are `token` and `sequence`. In `token` mode, the raw per-token log-probability ratios are used. In `sequence` mode, the log-probability ratios of all valid tokens in the sequence are averaged to produce a single ratio per sequence. The [GSPO paper](https://www.arxiv.org/abs/2507.18071) uses sequence-level importance sampling to stabilize training. The default is `token` (a sketch contrasting the two modes follows this parameter list).
- kl_in_reward: Controls where the KL regularization is applied. `false` (default): KL is added as a separate term in the loss. `true`: KL is subtracted directly from the reward (integrated into the reward).
- scale_rewards: Reward scaling strategy. Default is `group` (scale by standard deviation within each group). `batch` scales across the entire batch; `none` disables scaling. In ms-swift<3.10, this was a boolean: `true` means `group`, `false` means `none`.
- sync_ref_model: Whether to synchronize the reference model. Default is False.
- ref_model_mixup_alpha: Controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to the equation: $\pi_{ref} = \alpha \pi_\theta + (1 - \alpha) \pi_{ref_{prev}}$. Default is 0.6.
- ref_model_sync_steps: Determines how frequently the current policy is synchronized with the reference policy. Default is 512.
- move_model_batches: When moving model parameters to fast inference frameworks such as vLLM/LMDeploy, determines how many batches to divide the layers into. The default is `None`, which means the entire model is not split. Otherwise, the model is split into `move_model_batches + 1` (non-layer parameters) + `1` (multi-modal component parameters) batches.
- multi_turn_scheduler: Multi-turn GRPO parameter; pass the corresponding plugin name, and make sure to implement it in plugin/multi_turn.py.
- max_turns: Maximum number of rounds for multi-turn GRPO. The default is None, which means there is no limit.
- top_entropy_quantile: Only tokens whose entropy ranks within the specified top quantile are included in the loss calculation. The default is 1.0, which means low-entropy tokens are not filtered. For details, refer to the [documentation](./GRPO/AdvancedResearch/entropy_mask.md).
- log_entropy: Logs the entropy values during training. The default is False. For more information, refer to the [documentation](./GRPO/GetStarted/GRPO.md#logged-metrics).
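
As a rough sketch of `importance_sampling_level` (illustrative only, not the ms-swift implementation), the difference between `token` and `sequence` mode comes down to where the log-probability ratios are averaged:

```python
import torch

def importance_ratio(logps, old_logps, mask, level="token"):
    """Illustrative only: per-token vs. per-sequence importance ratios.

    logps, old_logps: [batch, seq_len] token log-probabilities under the current
    and the sampling policy; mask marks valid (non-padding) completion tokens.
    """
    log_ratio = (logps - old_logps) * mask
    if level == "token":
        # One ratio per token (standard GRPO / PPO-style).
        return torch.exp(log_ratio)
    if level == "sequence":
        # Average the log-ratios over valid tokens, giving one ratio per sequence
        # (GSPO-style); broadcast back to token shape for a uniform interface.
        seq_log_ratio = log_ratio.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
        return torch.exp(seq_log_ratio).unsqueeze(-1).expand_as(log_ratio)
    raise ValueError(f"unknown importance_sampling_level: {level}")
```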
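
The clipping-related parameters above (`epsilon`, `epsilon_high`, `delta`) interact roughly as in the following sketch (illustrative only; the actual loss additionally handles advantage broadcasting, masking, and the KL term controlled by `kl_in_reward`). With `delta` > 1 + `epsilon`, the extra cap only binds for unusually large ratios, which matches the recommendation above.

```python
import torch

def grpo_clipped_loss(ratio, advantages, epsilon=0.2, epsilon_high=None, delta=None):
    """Illustrative sketch of the clipping controlled by epsilon / epsilon_high / delta."""
    eps_high = epsilon_high if epsilon_high is not None else epsilon
    if delta is not None:
        # Two-sided GRPO (INTELLECT-2): also cap the unclipped ratio from above.
        unclipped = torch.clamp(ratio, max=delta) * advantages
    else:
        unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - epsilon, 1 + eps_high) * advantages
    return -torch.min(unclipped, clipped).mean()  # negated because we minimize
```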

[REINFORCE Leave-One-Out (RLOO)](https://arxiv.org/abs/2402.14740) is a reinforcement learning algorithm based on the classic REINFORCE policy-gradient method. It constructs an unbiased advantage baseline via the Leave-One-Out (LOO) technique.

## Algorithm Overview

For clarity, we explain RLOO by contrasting it with GRPO (Group Relative Policy Optimization).

### Key Differences Between GRPO and RLOO

Both GRPO and RLOO estimate advantages via intra-group comparisons to avoid the high variance of a global baseline. Their core differences are mainly in the following aspects:

#### Difference 1: How the Advantage Baseline Is Constructed

**1. GRPO (Group Relative Policy Optimization)**

For each prompt, GRPO generates $G$ response samples and normalizes rewards using the group mean and standard deviation:

$$
A_i = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}
$$

where:

- $\text{mean}(\{R_j\}_{j=1}^G) = \frac{1}{G}\sum_{j=1}^G R_j$ is the group mean
- $\text{std}(\{R_j\}_{j=1}^G)$ is the group standard deviation
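
As a quick numerical illustration of the normalization above (toy reward values, not taken from the paper):

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.5, 1.0])  # G = 4 sampled responses for one prompt
# Implementations typically add a small epsilon to the std to avoid division by zero.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
print(advantages)  # approximately [ 0.90, -1.51, -0.30,  0.90]
```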

**2. RLOO (REINFORCE Leave-One-Out)**

For each prompt, RLOO generates $K$ response samples and constructs the baseline via Leave-One-Out, i.e., for the $i$-th sample, the baseline is the mean of the other $K-1$ samples:

$$
A_i = R_i - \frac{1}{K-1}\sum_{j \neq i} R_j = \frac{K}{K-1}\left(R_i - \bar{R}\right)
$$

where $\bar{R} = \frac{1}{K}\sum_{j=1}^K R_j$ is the group mean reward.

> Note: We use $K$ here to match the notation in the paper. It has the same meaning as $G$ in GRPO and corresponds to the configuration parameter `num_generations`.
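
Using the same toy rewards as in the GRPO example above, the leave-one-out baseline and the equivalent closed form can be checked directly (illustrative values only):

```python
import numpy as np

rewards = np.array([1.0, 0.0, 0.5, 1.0])  # K = 4 sampled responses for one prompt
K = len(rewards)

# Leave-one-out baseline: for each sample, the mean of the other K-1 rewards.
loo_baseline = (rewards.sum() - rewards) / (K - 1)
adv_loo = rewards - loo_baseline

# Equivalent closed form: K/(K-1) * (R_i - group mean).
adv_closed = K / (K - 1) * (rewards - rewards.mean())

print(np.allclose(adv_loo, adv_closed))  # True
print(adv_loo)                            # approximately [ 0.50, -0.83, -0.17,  0.50]
```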

**Why Leave-One-Out?**

The key advantage is unbiasedness. For the $i$-th sample, its reward $R_i$ is independent of the baseline $\frac{1}{K-1}\sum_{j \neq i} R_j$, hence the advantage estimate is unbiased. In contrast, using the mean that includes the sample itself as the baseline introduces bias.
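
This is the standard baseline argument for REINFORCE: any baseline $b$ that does not depend on the sampled response $y_i$ (the leave-one-out baseline is computed only from the other $K-1$ samples) leaves the policy gradient unbiased, because

$$
\mathbb{E}_{y_i \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(y_i)\, b\big]
= b \sum_{y} \pi_\theta(y)\, \nabla_\theta \log \pi_\theta(y)
= b\, \nabla_\theta \sum_{y} \pi_\theta(y)
= b\, \nabla_\theta 1
= 0.
$$

The include-self mean violates this condition because it contains $R_i$, which depends on $y_i$.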

#### Difference 2: How KL Regularization Is Applied

To prevent the policy from drifting too far from the reference policy, both algorithms introduce KL divergence regularization, but in different ways:

**GRPO**: Adds KL divergence as an independent regularization term to the [loss](../GetStarted/GRPO.md#algorithm-overview):
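
Schematically, the referenced loss combines the clipped policy-gradient objective with a weighted KL penalty against the reference policy (a sketch of the standard form, not necessarily the exact equation on the linked page):

$$
\mathcal{L}_{\text{GRPO}}(\theta) = \mathcal{L}_{\text{clip}}(\theta) + \beta\, \mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right]
$$
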

**docs/source_en/Instruction/GRPO/GetStarted/GRPO.md** (3 additions, 1 deletion)

GRPOTrainer underwent a code refactoring in ms-swift3.5.

[GRPO (Group Relative Policy Optimization)](https://arxiv.org/abs/2402.03300) leverages intra-group relative advantage calculations to replace the independent value model in the PPO algorithm and directly incorporates KL divergence penalties into the loss function to improve training stability.