Merged · 74 commits
562f643
from sft
qgallouedec Sep 15, 2025
24a27db
remove vision and dft
qgallouedec Sep 15, 2025
91e546f
sft to reward
qgallouedec Sep 15, 2025
ce6d95d
`DataCollatorForLanguageModeling` to `DataCollatorForPreference`
qgallouedec Sep 15, 2025
de984d5
remove support for TrainingArguments
qgallouedec Sep 15, 2025
78a1ee9
properly load model
qgallouedec Sep 15, 2025
bfd2006
remove position_ids, packing, padding-free, seq_length
qgallouedec Sep 15, 2025
c12b4cc
remove completion_only_loss, mompletion mask, formatting_func, assist…
qgallouedec Sep 15, 2025
59b955a
update config
qgallouedec Sep 15, 2025
e718f38
now it looks good
qgallouedec Sep 15, 2025
4a8e579
template for reward
qgallouedec Sep 16, 2025
156bff9
new tiny model + fix tiny reward model
qgallouedec Sep 16, 2025
978dad3
fix template name
qgallouedec Sep 16, 2025
1478879
rm promptencoder; rm non-chatml support ; fix padding token; fix proc…
qgallouedec Sep 16, 2025
2a9b8bd
fix tiny models
qgallouedec Sep 16, 2025
67de45a
fix eval
qgallouedec Sep 16, 2025
f92041c
test!!!
qgallouedec Sep 16, 2025
fd4b0a0
tiny GPTNoeX
qgallouedec Sep 16, 2025
59e75f3
fix peft target modules
qgallouedec Sep 16, 2025
c43b633
add indication peft_config
qgallouedec Sep 16, 2025
09f5bff
move `remove_none_values`
qgallouedec Sep 16, 2025
37a15a8
remove compute_loss_func; fix model docstring; allow is_processed; fi…
qgallouedec Sep 16, 2025
7f3f4fd
support SequenceClassification models in clone_chat_template
qgallouedec Sep 16, 2025
c2be046
Merge branch 'main' into reward-refactor
qgallouedec Sep 16, 2025
15e886a
Merge branch 'support-seq-cls-clone-chat' into reward-refactor
qgallouedec Sep 16, 2025
82f238a
fix test
qgallouedec Sep 16, 2025
41cbaa0
fix sft docstring
qgallouedec Sep 16, 2025
0ee9fc0
docstring
qgallouedec Sep 16, 2025
d84b850
simplify example in readme
qgallouedec Sep 16, 2025
59b4ff0
two papers
qgallouedec Sep 16, 2025
aff5097
margin and center_rewards_coefficient
qgallouedec Sep 16, 2025
ce17355
cli + documentation
qgallouedec Sep 16, 2025
6f0fb09
fix iframe
qgallouedec Sep 16, 2025
6a57693
fix doc
qgallouedec Sep 16, 2025
f241474
focus dude
qgallouedec Sep 16, 2025
c76499d
nits
qgallouedec Sep 16, 2025
a341077
fix space
qgallouedec Sep 16, 2025
302e4c9
nit
qgallouedec Sep 16, 2025
8431801
fix layer_types
qgallouedec Sep 16, 2025
b7f8776
update section header for dataset mixtures in CLI documentation
qgallouedec Sep 16, 2025
f5f1b2d
fix some iframes
qgallouedec Sep 16, 2025
3bf8155
add disable_dropout parameter to RewardConfig and implement in Reward…
qgallouedec Sep 17, 2025
bf77b59
deprecate RewardDataCollatorWithPadding and decode_and_strip_padding …
qgallouedec Sep 17, 2025
4eda563
Update trl/trainer/reward_trainer.py
qgallouedec Sep 22, 2025
291ba98
filter
qgallouedec Sep 22, 2025
92a5555
Merge branch 'main' into reward-refactor
qgallouedec Sep 22, 2025
b63bc89
Merge branch 'main' into support-seq-cls-clone-chat
qgallouedec Sep 22, 2025
d3d3414
Merge branch 'support-seq-cls-clone-chat' into reward-refactor
qgallouedec Sep 22, 2025
4a5bf82
Merge branch 'main' into support-seq-cls-clone-chat
qgallouedec Sep 23, 2025
6ee1aed
Merge branch 'support-seq-cls-clone-chat' into reward-refactor
qgallouedec Sep 23, 2025
1323901
🐯 fix: use_liger_kernel with IterableDataset (#4087)
jue-jue-zi Sep 23, 2025
7d10daa
📤 Fix a dataset loading bug in scripts (#4124)
singing-cat Sep 23, 2025
814f97f
⚓ [vllm] ensure MASTER_ADDR/MASTER_PORT are set safely (#4057)
kashif Sep 23, 2025
05fd402
📤 Fix a dataset loading bug in scripts
qgallouedec Sep 23, 2025
7c174e0
📌 Pin vLLM version (#4122)
qgallouedec Sep 23, 2025
8843b7b
👋 Remove `backend` parameter from `GuidedDecodingParams` (#4123)
qgallouedec Sep 23, 2025
95fe6b8
🧹 Remove `max_batch_tokens`, `num_blocks` and `block_size` from gener…
qgallouedec Sep 23, 2025
362af13
Merge branch 'main' into support-seq-cls-clone-chat
qgallouedec Sep 23, 2025
424d50d
Merge branch 'main' into support-seq-cls-clone-chat
qgallouedec Sep 30, 2025
ad554c6
Merge branch 'support-seq-cls-clone-chat' into reward-refactor
qgallouedec Sep 30, 2025
4e29651
Merge branch 'main' into reward-refactor
qgallouedec Sep 30, 2025
d6e1245
Merge branch 'main' into reward-refactor
qgallouedec Sep 30, 2025
0641915
fix template file
qgallouedec Sep 30, 2025
adcb80b
fix imports
qgallouedec Sep 30, 2025
1c4b295
revert modif online dpo
qgallouedec Sep 30, 2025
cb61502
#4048 and #4124
qgallouedec Sep 30, 2025
72b4ad3
#4178
qgallouedec Sep 30, 2025
8e4f332
#4161
qgallouedec Sep 30, 2025
0f3b4f8
#4007
qgallouedec Sep 30, 2025
ff1175b
#4080
qgallouedec Sep 30, 2025
1151232
#4006
qgallouedec Sep 30, 2025
b7ee764
fix: correct spelling of 'recommended' in reward_trainer.md
qgallouedec Sep 30, 2025
b828100
update tiny generation script
qgallouedec Sep 30, 2025
f16282b
rm force
qgallouedec Sep 30, 2025
14 changes: 2 additions & 12 deletions README.md
````diff
@@ -136,23 +136,13 @@ trainer.train()
 Here is a basic example of how to use the [`RewardTrainer`](https://huggingface.co/docs/trl/reward_trainer):
 
 ```python
-from trl import RewardConfig, RewardTrainer
+from trl import RewardTrainer
 from datasets import load_dataset
-from transformers import AutoModelForSequenceClassification, AutoTokenizer
 
-tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
-model = AutoModelForSequenceClassification.from_pretrained(
-    "Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
-)
-model.config.pad_token_id = tokenizer.pad_token_id
-
 dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
 
-training_args = RewardConfig(output_dir="Qwen2.5-0.5B-Reward", per_device_train_batch_size=2)
 trainer = RewardTrainer(
-    args=training_args,
-    model=model,
-    processing_class=tokenizer,
+    model="Qwen/Qwen2.5-0.5B-Instruct",
     train_dataset=dataset,
 )
 trainer.train()
````
98 changes: 97 additions & 1 deletion docs/source/clis.md
@@ -9,6 +9,7 @@ Currently supported commands are:
- `trl dpo`: fine-tune a LLM with DPO
- `trl grpo`: fine-tune a LLM with GRPO
- `trl kto`: fine-tune a LLM with KTO
- `trl reward`: train a Reward Model
- `trl rloo`: fine-tune a LLM with RLOO
- `trl sft`: fine-tune a LLM with SFT

@@ -41,6 +42,15 @@ trl dpo \
--dataset_name anthropic/hh-rlhf
```

</hfoption>
<hfoption id="Reward">

```bash
trl reward \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/ultrafeedback_binarized
```

</hfoption>
</hfoptions>

@@ -78,6 +88,21 @@ Launch with:
trl dpo --config dpo_config.yaml
```

</hfoption>
<hfoption id="Reward">

```yaml
# reward_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/ultrafeedback_binarized
```

Launch with:

```bash
trl reward --config reward_config.yaml
```

</hfoption>
</hfoptions>

@@ -138,6 +163,33 @@ Launch with:
```bash
trl dpo --config dpo_config.yaml
```

</hfoption>
<hfoption id="Reward inline">

```bash
trl reward \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/ultrafeedback_binarized \
--num_processes 4
```

</hfoption>
<hfoption id="Reward w/ config file">

```yaml
# reward_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/ultrafeedback_binarized
num_processes: 4
```

Launch with:

```bash
trl reward --config reward_config.yaml
```

</hfoption>
</hfoptions>

@@ -217,14 +269,41 @@ Launch with:
```bash
trl dpo --config dpo_config.yaml
```

</hfoption>
<hfoption id="Reward inline">

```bash
trl reward \
--model_name_or_path Qwen/Qwen2.5-0.5B \
--dataset_name trl-lib/ultrafeedback_binarized \
--accelerate_config zero2 # or path/to/my/accelerate/config.yaml
```

</hfoption>
<hfoption id="Reward w/ config file">

```yaml
# reward_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
dataset_name: trl-lib/ultrafeedback_binarized
accelerate_config: zero2 # or path/to/my/accelerate/config.yaml
```

Launch with:

```bash
trl reward --config reward_config.yaml
```

</hfoption>
</hfoptions>

### Using dataset mixtures

You can use dataset mixtures to combine multiple datasets into a single training dataset. This is useful for training on diverse data sources or when you want to mix different types of data.

<hfoptions id="accelerate_config">
<hfoptions id="dataset_mixtures">
<hfoption id="SFT">

```yaml
@@ -258,6 +337,23 @@ Launch with:
trl dpo --config dpo_config.yaml
```

</hfoption>
<hfoption id="Reward">

```yaml
# reward_config.yaml
model_name_or_path: Qwen/Qwen2.5-0.5B
datasets:
- path: trl-lib/tldr-preference
- path: trl-lib/lm-human-preferences-sentiment
```

Launch with:

```bash
trl reward --config reward_config.yaml
```

</hfoption>
</hfoptions>
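
Conceptually, a mixture like the Reward example above boils down to loading each listed dataset and combining them into a single training dataset before it is passed to the trainer. Here is a rough Python sketch of that idea, assuming the datasets share a compatible schema (the actual CLI implementation may differ in details such as split selection and column handling):

```python
from datasets import concatenate_datasets, load_dataset

# Load each dataset from the mixture and concatenate them into one training set
paths = ["trl-lib/tldr-preference", "trl-lib/lm-human-preferences-sentiment"]
mixture = concatenate_datasets([load_dataset(path, split="train") for path in paths])
print(mixture)
```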

50 changes: 50 additions & 0 deletions docs/source/paper_index.md
@@ -533,3 +533,53 @@ training_args = CPOConfig(
...
)
```

## Reward Modeling

Papers relating to the [`RewardTrainer`].

### Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking

**📜 Paper**: https://huggingface.co/papers/2312.09244

This paper proposes an auxiliary loss designed to directly learn a centered reward model. The auxiliary term minimizes the squared sum of the paired rewards, encouraging the model to produce mean-zero outputs and thereby resolving the underdetermination of the reward scale (the pairwise loss alone only constrains reward differences, not their absolute values).

$$
\mathcal{L}(\theta) = - \mathbb{E}_{(x,y^+,y^-) \sim \mathcal{D}} \left[ \log \sigma(r_\theta(x, y^+) - r_\theta(x, y^-)) \textcolor{red}{- \eta \cdot (r_\theta(x, y^+) + r_\theta(x, y^-))^2} \right].
$$

To use this auxiliary loss with [`RewardTrainer`], you can use the `center_rewards_coefficient` argument in [`RewardConfig`] as follows:

```python
from trl import RewardConfig

training_args = RewardConfig(
    center_rewards_coefficient=0.01,  # η in the paper
    ...
)
```
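
For intuition, here is a minimal sketch of how such a centered pairwise loss could be computed, assuming `rewards_chosen` and `rewards_rejected` hold the scalar reward-model outputs for a batch of preference pairs (hypothetical names; this illustrates the formula above rather than TRL's internal implementation):

```python
import torch
import torch.nn.functional as F

def centered_pairwise_loss(rewards_chosen, rewards_rejected, eta=0.01):
    # Standard Bradley-Terry preference loss: -log(sigmoid(r_chosen - r_rejected))
    preference_loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
    # Auxiliary centering term: penalize the squared sum of each pair's rewards,
    # nudging the model toward mean-zero outputs
    centering_loss = eta * (rewards_chosen + rewards_rejected).pow(2).mean()
    return preference_loss + centering_loss

# Dummy rewards for a batch of three preference pairs
chosen = torch.tensor([1.2, 0.4, 2.1])
rejected = torch.tensor([0.3, -0.5, 1.9])
print(centered_pairwise_loss(chosen, rejected))
```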

### Llama 2: Open Foundation and Fine-Tuned Chat Models

**📜 Paper**: https://huggingface.co/papers/2307.09288

In this paper, the authors leverage the fact that their preference ratings are collected on a four-point scale (e.g., _significantly better_) to provide more informative feedback to the reward model. This is done by adding a margin to the loss function, which encourages the reward model to assign larger score gaps to pairs with stronger preference ratings.

$$
\mathcal{L}(\theta) = - \mathbb{E}_{(x,y^+,y^-,\textcolor{red}{m}) \sim \mathcal{D}} \left[ \log \sigma(r_\theta(x, y^+) - r_\theta(x, y^-) \textcolor{red}{- m}) \right].
$$

You can add a margin to the loss by adding a `margin` column to the dataset. The following example shows how to set up the "Margin Small" setting of the paper.

```python
def add_margin(example):
    preference_to_margin = {
        "significantly better": 1.0,
        "better": 2.0/3.0,
        "slightly better": 1.0/3.0,
        "negligibly better / unsure": 0.0,
    }
    return {"margin": preference_to_margin[example["preference_label"]]}

dataset = dataset.map(add_margin)
```
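
For reference, here is a minimal sketch of how the margin enters the pairwise loss from the formula above, assuming `rewards_chosen`, `rewards_rejected`, and `margin` are per-pair tensors (hypothetical names, not TRL's internal code):

```python
import torch
import torch.nn.functional as F

def pairwise_loss_with_margin(rewards_chosen, rewards_rejected, margin=None):
    diff = rewards_chosen - rewards_rejected
    if margin is not None:
        # Subtracting the margin means the chosen completion must beat the
        # rejected one by at least `margin` before the loss becomes small
        diff = diff - margin
    return -F.logsigmoid(diff).mean()

# Dummy batch of two pairs with "Margin Small" margins
chosen = torch.tensor([0.8, 1.5])
rejected = torch.tensor([0.2, 1.4])
margin = torch.tensor([1.0, 1.0 / 3.0])  # "significantly better", "slightly better"
print(pairwise_loss_with_margin(chosen, rejected, margin))
```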
21 changes: 20 additions & 1 deletion docs/source/quickstart.md
@@ -1,6 +1,6 @@
# Quickstart

TRL is a comprehensive library for post-training foundation models using techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO).

## Quick Examples

@@ -51,6 +51,21 @@ trainer = DPOTrainer(
trainer.train()
```

### Reward Modeling

```python
from trl import RewardTrainer
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = RewardTrainer(
model="Qwen/Qwen2.5-0.5B-Instruct",
train_dataset=dataset,
)
trainer.train()
```

## Command Line Interface

Skip the code entirely - train directly from your terminal:
@@ -63,6 +78,10 @@ trl sft --model_name_or_path Qwen/Qwen2.5-0.5B \
# DPO: Align with preferences
trl dpo --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized

# Reward: Train a reward model
trl reward --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
--dataset_name trl-lib/ultrafeedback_binarized
```

## What's Next?