
Commit 002a00c

Timm Ruland committed
chore: Merge remote-tracking branch 'origin/main' into packed_dataset_filtering
2 parents: 76fbae0 + 1e4d28e

92 files changed (+24852 -552 lines)


.github/workflows/tests_full.yml

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ jobs:
       python -m pip install --upgrade pip setuptools wheel
       export FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE
       python -m pip install ninja # Lowers compilation time of flash attention significantly
-      python -m pip install flash-attn --no-build-isolation
+      python -m pip install flash-attn==2.7.4.post1 --no-build-isolation
       python -m pip install -e .[tests]
   - name: Run tests
     run: |

.gitignore

Lines changed: 6 additions & 2 deletions
@@ -9,6 +9,7 @@ logs/
 core.*
 checkpoint
 wandb
+artifacts
 
 # Byte-compiled / optimized / DLL files
 __pycache__/
@@ -163,5 +164,8 @@ tests/tmp/*
 *wandb_storage*
 .coverage/*
 *.pbin
-
-tutorials/profiling/experiments
+tutorials/scaling_up/experiments
+tutorials/profiling/experiments
+tutorials/instruction_tuning/prepared_data
+config_files/instruction_tuning
+data/lorem_ipsum_instruct.jsonl

CHANGELOG_DEV.md

Lines changed: 23 additions & 1 deletion
@@ -163,4 +163,26 @@ Some HF tokenisers such as `xlm-roberta-large` add special tokens (e.g., eod tok
 This side-effect in the transformers library has led to the eod token being appended twice when tokenizing / packing our data. We added a check for this and only append the eod token once now:
 https://github.com/Modalities/modalities/blob/1c1ccdc973283c45bc8c9fadf4d20f03e435cd04/src/modalities/dataloader/create_packed_data.py#L327-L330
 
-Additionally, I added a script that verifies the consistency of the indexation and tokenization of a given JSONL file. We run the indexation and tokenization routines in modalities and compare the result to a tokenized JSONL file to which we applied the HF tokenizer directly.
+Additionally, I added a script that verifies the consistency of the indexation and tokenization of a given JSONL file. We run the indexation and tokenization routines in modalities and compare the result to a tokenized JSONL file to which we applied the HF tokenizer directly.
+
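
The linked check in `create_packed_data.py` guards against appending the eod token a second time. A minimal sketch of that kind of guard, assuming a generic Hugging Face tokenizer and an `eod_token` string; the function name and signature are illustrative, not Modalities' actual API:

```python
from transformers import AutoTokenizer

def tokenize_with_single_eod(text: str, tokenizer, eod_token: str) -> list[int]:
    """Append the eod token exactly once, even if the tokenizer already adds it."""
    eod_token_id = tokenizer.convert_tokens_to_ids(eod_token)
    token_ids = tokenizer(text)["input_ids"]
    # Some tokenizers (e.g. xlm-roberta-large) already append special tokens,
    # so only add the eod token if it is not already the last token.
    if not token_ids or token_ids[-1] != eod_token_id:
        token_ids.append(eod_token_id)
    return token_ids

# usage sketch
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
ids = tokenize_with_single_eod("Hello world.", tokenizer, eod_token="</s>")
```
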
+## PR #379 Instruction Tuning Support
+
+* New entry point `apply_chat_template` to form chats and create index and pbin files from them
+* A wrapper for collate functions that includes in the loss only those tokens which appear between indicator tokens
+* A new parameter for PackedMemMapDatasetContinuous that allows not re-using the last target token
+* A tutorial on how to apply instruction tuning to a Hugging Face model
+
+
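
The collate-function wrapper from PR #379 decides which target tokens contribute to the loss. A rough sketch of that masking idea, assuming begin/end indicator token ids; this illustrates the technique only and is not the wrapper's actual implementation:

```python
import torch

def mask_targets_outside_indicators(
    target_ids: torch.Tensor, b_include_id: int, e_include_id: int, ignore_index: int = -100
) -> torch.Tensor:
    """Keep only the tokens between the begin/end indicator tokens in the loss; mask the rest."""
    masked = target_ids.clone()
    include = False
    for i, token_id in enumerate(target_ids.tolist()):
        if token_id == b_include_id:
            include = True
            masked[i] = ignore_index  # masking the indicator itself is an assumption of this sketch
        elif token_id == e_include_id:
            include = False
            masked[i] = ignore_index
        elif not include:
            masked[i] = ignore_index  # ignored by torch.nn.CrossEntropyLoss
    return masked

# usage sketch: 10 marks the begin indicator, 11 the end indicator
targets = torch.tensor([4, 10, 7, 7, 5, 11, 9])
print(mask_targets_outside_indicators(targets, b_include_id=10, e_include_id=11))
# tensor([-100, -100, 7, 7, 5, -100, -100])
```
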
+## PR #359 Activation Checkpoint with FSDP2
+
+This PR adds activation checkpointing (AC) support for FSDP2.
+There are now three AC variants:
+* Full AC (same as before, where entire modules get ACed, leading to the largest memory footprint reduction)
+* Selective Layer AC (only every nth layer or module is ACed)
+* Selective OP AC (only certain ops, typically low-memory but compute-intensive ones, are checkpointed)
+
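
For the Selective Layer AC variant above, the general idea can be sketched with plain torch.utils.checkpoint, checkpointing only every nth block. This is a simplified, generic illustration, not Modalities' FSDP2 integration:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class SelectiveLayerAC(nn.Module):
    """Apply activation checkpointing to every nth layer only."""

    def __init__(self, layers: nn.ModuleList, ac_every_nth: int = 2):
        super().__init__()
        self.layers = layers
        self.ac_every_nth = ac_every_nth

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for idx, layer in enumerate(self.layers):
            if idx % self.ac_every_nth == 0:
                # this layer's activations are recomputed during the backward pass
                x = checkpoint(layer, x, use_reentrant=False)
            else:
                x = layer(x)
        return x

# usage sketch: 8 small blocks, checkpoint every 2nd one
blocks = nn.ModuleList([nn.Sequential(nn.Linear(64, 64), nn.GELU()) for _ in range(8)])
model = SelectiveLayerAC(blocks, ac_every_nth=2)
out = model(torch.randn(4, 64, requires_grad=True))
out.sum().backward()
```
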
+## PR #374 Tensor Parallelism Support
+
+* adds support for Tensor Parallelism (including Sequence Parallelism).
+* adds a debugging toolkit to track the input and output tensors during a forward pass, gradients during the backward pass and weight tensors.
+Tensors can be either normal Tensors or DTensors.
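
The debugging toolkit from PR #374 tracks tensors flowing through the model. A generic sketch of the forward-pass part using PyTorch hooks; the DTensor handling via `to_local()` and all names here are assumptions for illustration, not the toolkit's API:

```python
import torch
import torch.nn as nn

def describe(t: torch.Tensor) -> str:
    # DTensors expose their local shard via to_local(); plain tensors are reported directly.
    if hasattr(t, "to_local"):
        return f"DTensor(global_shape={tuple(t.shape)}, local_shape={tuple(t.to_local().shape)})"
    return f"Tensor(shape={tuple(t.shape)}, dtype={t.dtype})"

def attach_forward_debug_hooks(model: nn.Module) -> None:
    """Print the input and output tensors of every submodule during the forward pass."""
    for name, module in model.named_modules():
        def hook(mod, inputs, output, name=name):
            ins = [describe(i) for i in inputs if isinstance(i, torch.Tensor)]
            outs = describe(output) if isinstance(output, torch.Tensor) else type(output).__name__
            print(f"{name or 'root'}: inputs={ins} output={outs}")
        module.register_forward_hook(hook)

# usage sketch
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
attach_forward_debug_hooks(model)
_ = model(torch.randn(2, 16))
```

Gradients during the backward pass could be tracked analogously with backward hooks, and weight tensors inspected via `named_parameters()`.
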

README.md

Lines changed: 15 additions & 1 deletion
@@ -44,7 +44,7 @@ conda activate modalities
 # install PyTorch, Ninja and Flash Attention (mandatory)
 pip install torch==2.6.0
 pip install ninja # Lowers compilation time of flash attention significantly
-pip install flash-attn --no-build-isolation
+pip install flash-attn==2.7.4.post1 --no-build-isolation
 ```
 
 ### Option 1: Installation from source
@@ -74,6 +74,20 @@ To install Modalities via pip, run
 pip install modalities
 ```
 
+### Option 3: Feature Complete via UV
+
+```sh
+curl -LsSf https://astral.sh/uv/install.sh | sh
+uv venv --seed --python 3.11 --prompt modalities
+source .venv/bin/activate
+uv pip install torch
+uv pip install ninja
+uv pip install --no-build-isolation flash-attn==2.7.4.post1
+# for developer: use [tests,linting] and install pre-commit hooks
+uv pip install -e .[tests,linting]
+pre-commit install --install-hooks
+```
+
 ## Usage
 Modalities provides several entry points to interact with the framework. The following section lists the available entry points and their respective functionalities.
 
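
After either install path, a quick sanity check along these lines can confirm that the pinned versions resolved correctly (a sketch; it assumes a CUDA machine and the pins above):

```python
# Post-install sanity check (illustrative; version expectations follow the pins above).
import flash_attn
import torch

import modalities  # installed via `pip install -e .` or `pip install modalities`

print("torch:", torch.__version__)            # expected: 2.6.x
print("flash-attn:", flash_attn.__version__)  # expected: 2.7.4.post1
print("CUDA available:", torch.cuda.is_available())
```
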

config_files/text_generation/text_generation_config_torch.yaml

Lines changed: 0 additions & 87 deletions
This file was deleted.

config_files/training/config_lorem_ipsum_long_fsdp2.yaml

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ settings:
   enforce_last_step_evaluated: false
   enforce_last_step_checkpointed: false
   step_profile:
-    gradient_accumulation_steps: 2
+    gradient_accumulation_steps: 1
     local_train_micro_batch_size: 1
     sequence_length: 256
   training_target:
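
For context on the step_profile values above: the number of tokens consumed per optimizer step is the product of these settings and the data-parallel world size. A small worked example (the world size of 8 is an assumed value, not part of this config):

```python
# Tokens per optimizer step for the step_profile above.
gradient_accumulation_steps = 1   # was 2 before this change
local_train_micro_batch_size = 1
sequence_length = 256
world_size = 8                    # assumed number of data-parallel ranks (not in the diff)

tokens_per_step = (
    gradient_accumulation_steps * local_train_micro_batch_size * sequence_length * world_size
)
print(tokens_per_step)  # 2048 with these settings; 4096 with the previous value of 2
```
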

config_files/training/config_lorem_ipsum_long_fsdp2_warmstart.yaml

Lines changed: 1 addition & 1 deletion
@@ -68,7 +68,7 @@ settings:
   config:
     checkpoint_path: ${settings.warmstart_checkpoint_paths.checkpoint_folder_path}
 warmstart_checkpoint_paths: # ${warmstart_env:checkpoint_paths}
-  checkpoint_folder_path: /raid/fromm/modalities/data/checkpoints/2025-04-16__12-40-51_6dcbb1a0/eid_2025-04-16__12-40-51_6dcbb1a0-seen_steps_32-seen_tokens_65536-target_steps_162-target_tokens_331776
+  checkpoint_folder_path: /raid/s3/opengptx/max_lue/repositories/modalities/data/checkpoints/2025-03-14__15-25-59_970fedec/eid_2025-03-14__15-25-59_970fedec-seen_steps_96-seen_tokens_196608-target_steps_162-target_tokens_331776
 
 collate_fn:
   component_key: collate_fn
