@le1nux le1nux commented Jul 3, 2025

What does this PR do?

This PR ..

General Changes

  • ..

Breaking Changes

  • ..

Checklist before submitting final PR

  • My PR is minimal and addresses one issue in isolation
  • I have merged the latest version of the target branch into this feature branch
  • I have reviewed my own code w.r.t. correct implementation, missing type hints, proper documentation, etc.
  • I have run a sample config for model training
  • I have checked that all tests run through (python tests/tests.py)
  • I have updated the internal changelog (CHANGELOG_DEV.md)

@le1nux le1nux left a comment

This is a draft.

What has been done so far:

  • Refactored the collator design to be composable, meaning multiple collate_fns can be called sequentially. This is helpful, for instance, for instruction tuning, where we first shift the targets for the autoregressive objective and then mask certain tokens (such as input from the user role) so they are disregarded in the loss. In this case, two collate_fns are called one after the other (see the first sketch below).

  • We can now specify which components to load into the app_state from a DCP checkpoint. This is important for continued pretraining or finetuning, where we don't want to reuse the previous optimizer and LR scheduler states (see the second sketch below).

  • Added an iterative memory-mapped dataset implementation. Instead of packing the samples, we use the index stored in the pbin file to iterate over each sample individually (see the third sketch below).
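
A minimal sketch of the composable collator idea (the names CollateFnIF, ShiftTargetsCollateFn, and MaskRoleTokensCollateFn are illustrative placeholders, not the actual Modalities classes):

    from typing import Protocol

    class CollateFnIF(Protocol):
        def __call__(self, batch: dict) -> dict: ...

    class SequentialCollator:
        """Applies a list of collate_fns one after the other."""

        def __init__(self, collate_fns: list[CollateFnIF]):
            self.collate_fns = collate_fns

        def __call__(self, batch: dict) -> dict:
            for fn in self.collate_fns:
                batch = fn(batch)
            return batch

    # For instruction tuning: first shift the targets, then mask user-role tokens.
    # collator = SequentialCollator(
    #     [ShiftTargetsCollateFn(), MaskRoleTokensCollateFn(ignore_index=-100)]
    # )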
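
For the selective app_state loading, a rough sketch using PyTorch's torch.distributed.checkpoint (the component keys and app_state layout here are assumptions, not the actual Modalities implementation):

    import torch.nn as nn
    import torch.distributed.checkpoint as dcp

    model = nn.Linear(8, 8)  # stand-in for the actual model

    # The full app_state contains model, optimizer, and lr_scheduler states.
    # dcp.load only populates the keys present in the passed state_dict, so
    # handing it just the model entry skips the optimizer/scheduler restore.
    state_dict = {"model": model.state_dict()}
    dcp.load(state_dict, checkpoint_id="/path/to/dcp_checkpoint")
    model.load_state_dict(state_dict["model"])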
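
And a sketch of the iterative memory-mapped dataset; the pbin layout assumed here (a flat token array plus an (offset, length) index per sample) is a simplification:

    import numpy as np
    from torch.utils.data import Dataset

    class IterativeMemMapDataset(Dataset):
        """Returns one (unpacked) sample per index instead of packing samples."""

        def __init__(self, token_file: str, index: list[tuple[int, int]]):
            self.tokens = np.memmap(token_file, dtype=np.uint16, mode="r")
            self.index = index  # (offset, length) per sample, from the pbin index

        def __len__(self) -> int:
            return len(self.index)

        def __getitem__(self, i: int) -> np.ndarray:
            offset, length = self.index[i]
            return np.asarray(self.tokens[offset : offset + length])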

Todos:

  • Add tests for the new functionality
  • Fix failing tests
  • Verify correctness of the code
  • Sometimes we skip samples by masking out all of their tokens in the loss (setting the target token IDs to -100). When a whole batch consists only of such "skipped" samples, the loss becomes NaN; with batch size 1 this happens quite frequently. We need a way to avoid running backprop on such batches (see the sketch below).
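
One possible guard, as a sketch (whether to skip the step entirely or zero the loss is an open design question): check the number of unmasked targets before computing the loss and skip backprop when it is zero.

    import torch
    import torch.nn.functional as F

    def safe_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor | None:
        """Returns None when every target in the batch is masked out (-100)."""
        num_valid = (targets != -100).sum()
        if num_valid == 0:
            return None  # caller skips backprop / optimizer step for this batch
        return F.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100
        )

    # loss = safe_loss(logits, targets)
    # if loss is not None:
    #     loss.backward()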

@le1nux le1nux changed the title from "App state refactoring" to "Instruction Tuning Improvements" on Jul 6, 2025
lllAlexanderlll commented Aug 18, 2025

The current plan is to make the instruction-tuning data preparation independent of special tokens, as #383 persists.
This can be done using Jinja's Extension feature, which Hugging Face uses in its newest iteration of chat template handling:
https://github.com/huggingface/transformers/blob/v4.53.3/src/transformers/utils/chat_template_utils.py#L375

This works by registering a new tag with the Jinja template rendering environment (see the linked code above for the full implementation):

    from jinja2.ext import Extension

    class AssistantTracker(Extension):
        # This extension tracks the indices of assistant-generated tokens in the rendered chat
        tags = {"generation"}

Within the chat template, the tag is opened with {% generation %}
https://huggingface.co/HuggingFaceTB/SmolLM3-3B/blob/main/chat_template.jinja#L76
and closed with {% endgeneration %}
https://huggingface.co/HuggingFaceTB/SmolLM3-3B/blob/main/chat_template.jinja#L82
around each assistant turn.

Then, when rendering the chat template with tokenize=True and return_assistant_tokens_mask=True, the indices at which the "generation" tag was active are returned:
https://github.com/huggingface/transformers/blob/v4.53.3/src/transformers/utils/chat_template_utils.py#L442
https://github.com/huggingface/transformers/blob/v4.53.3/src/transformers/processing_utils.py#L1629
and added as assistant_masks:
https://github.com/huggingface/transformers/blob/v4.53.3/src/transformers/processing_utils.py#L1678
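
For reference, a minimal usage sketch of the Hugging Face side (the mask comes back under the key "assistant_masks"; this assumes the model's chat template contains the generation tags, as SmolLM3-3B's does):

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
    messages = [
        {"role": "user", "content": "Hi!"},
        {"role": "assistant", "content": "Hello, how can I help?"},
    ]
    out = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_assistant_tokens_mask=True,
    )
    # out["assistant_masks"][i] == 1 exactly where token i belongs to an assistant turn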

We can do the same and remove the old special-token-based approach. However, this is not compatible with the current pbin file format, since we only store the token IDs and, as of now, no other fields (attention_mask, assistant_mask). We would need on-the-fly tokenization and packing for this Jinja-based assistant-turn tracking (see the sketch below).

Packing on the fly in TRL: https://github.com/huggingface/trl/blob/main/trl/data_utils.py#L495
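
A greedy on-the-fly packing sketch (simplified relative to the TRL implementation linked above, which additionally supports splitting overlong samples across sequences):

    def pack_on_the_fly(samples: list[list[int]], seq_len: int) -> list[list[int]]:
        """Greedily concatenates tokenized samples into sequences of at most seq_len."""
        packed, current = [], []
        for sample in samples:
            # Flush the current sequence if the next sample would overflow it.
            if current and len(current) + len(sample) > seq_len:
                packed.append(current)
                current = []
            current.extend(sample[:seq_len])  # truncate overlong samples
        if current:
            packed.append(current)
        return packed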
