Conversation

@qgallouedec (Member) commented on Sep 15, 2025

This PR refactors RewardTrainer

closes #3780 #3101 #4104 #2633 #2758

  • Aims for an implementation that better aligns with SFTTrainer, improving long-term maintainability
  • Streamlines usage for end-users (see the README diff)
  • Better test coverage: expanded from 8 to 29 cases
  • No regressions or breaking changes expected
  • Better documentation: expanded from 94 to 245 lines
  • Now works seamlessly with the CLI: trl reward ...
  • Now includes its own model card
  • Now supports activation offloading
  • While caution is always good, this trainer isn’t heavily used, so the risk of major disruption is minimal (and we’re 99% confident nothing’s broken)
  • The goal is to apply the same refactor to the other trainers (KTO, DPO, PRM, ...)

blue: before refactor
red: after refactor

[Screenshot: training curves comparing the two implementations]

To ensure both snippets start from the same base model, first save a shared checkpoint:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Build a sequence-classification model with a single output (the scalar reward) and save it locally
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForSequenceClassification.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", num_labels=1)
tokenizer.save_pretrained("Qwen2.5-0.5B-Reward-Base")
model.save_pretrained("Qwen2.5-0.5B-Reward-Base")

# Code used before the refactor
from trl import RewardConfig, RewardTrainer
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-Reward-Base")
model = AutoModelForSequenceClassification.from_pretrained("Qwen2.5-0.5B-Reward-Base", num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:2000]")

training_args = RewardConfig(output_dir="Qwen2.5-0.5B-Reward", learning_rate=0.0001, logging_steps=10)
trainer = RewardTrainer(
    args=training_args,
    model=model,
    processing_class=tokenizer,
    train_dataset=dataset,
)
trainer.train()  # in addition, accuracy logging had to be implemented manually
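
For reference, the "accuracy" mentioned above is pairwise preference accuracy: the fraction of preference pairs for which the model assigns a higher reward to the chosen completion than to the rejected one. A minimal sketch of such a metric (the function name and inputs are illustrative, not part of the RewardTrainer API):

import numpy as np

def pairwise_accuracy(rewards_chosen: np.ndarray, rewards_rejected: np.ndarray) -> float:
    # Count how often the chosen completion outscores the rejected one
    return float(np.mean(rewards_chosen > rewards_rejected))

# Example: 3 of 4 pairs are ranked correctly -> 0.75
pairwise_accuracy(np.array([1.2, 0.3, 2.0, -0.1]), np.array([0.5, 0.9, 1.1, -0.4]))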


# Code used after the refactor
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train[:2000]")

trainer = RewardTrainer(
    model="Qwen2.5-0.5B-Reward-Base",
    train_dataset=dataset,
    args=RewardConfig(output_dir="Qwen2.5-0.5B-Reward", learning_rate=0.0001, logging_steps=10),
)
trainer.train()
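
After training, the result is a regular transformers sequence-classification checkpoint, so it can be used to score completions directly. A minimal sketch, assuming the trained model has been saved to the Qwen2.5-0.5B-Reward output directory used above (the prompt and completion are illustrative):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-Reward")
model = AutoModelForSequenceClassification.from_pretrained("Qwen2.5-0.5B-Reward")

def reward(prompt: str, completion: str) -> float:
    # Render the conversation with the chat template and read the single logit as the reward
    messages = [{"role": "user", "content": prompt}, {"role": "assistant", "content": completion}]
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).logits[0, 0].item()

# A higher score should indicate the completion the model prefers
print(reward("What is the capital of France?", "The capital of France is Paris."))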

@qgallouedec qgallouedec changed the base branch from main to support-reward-refactor September 15, 2025 21:31
@qgallouedec qgallouedec changed the base branch from support-reward-refactor to main September 16, 2025 04:34
@qgallouedec qgallouedec changed the base branch from main to support-seq-cls-clone-chat September 16, 2025 17:30
Base automatically changed from support-seq-cls-clone-chat to main September 30, 2025 17:42
@kashif (Collaborator) left a comment

1 typo but all else is good from my side! great!

@qgallouedec qgallouedec merged commit da209f8 into main Sep 30, 2025
11 of 12 checks passed
@qgallouedec qgallouedec deleted the reward-refactor branch September 30, 2025 21:13
@qgallouedec qgallouedec linked an issue on Oct 5, 2025 that may be closed by this pull request: [Feature Request] Add support for padding-free reward modeling training in TRL Reward