
Commit b3744ca

Merge pull request #494 from dushyantbehl/cleanup_data_handling
feat: Data Handling v3 (Breaking change for data config interface)
2 parents 8821791 + cd90fb4 commit b3744ca


45 files changed: +1245, -975 lines changed

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -8,6 +8,8 @@ coverage*.xml
 dist
 htmlcov
 test
+error.log
+tmp/

 # IDEs
 .vscode/
@@ -45,3 +47,4 @@ mlruns/

 # Auto-generated file
 /tuning/_version.py
+fms_hf_tuning-*/

.pylintrc

Lines changed: 3 additions & 1 deletion
@@ -445,7 +445,9 @@ disable=raw-checker-failed,
 pointless-statement,
 wrong-import-order,
 duplicate-code,
-unbalanced-tuple-unpacking
+unbalanced-tuple-unpacking,
+unspecified-encoding,
+too-many-lines

 # Enable the message, report, category or checker with the given id(s). You can
 # either give multiple identifier separated by comma (,) or put this option

docs/advanced-data-handlers.md

Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@

# Data Handlers

Please note, this document is intended for advanced users who want to customize data handler arguments and use data handler functions to perform complex operations through data configs.

Data handlers are routines which process a dataset using the [HF process frameworks](https://huggingface.co/docs/datasets/en/process), including map, filter, remove, select, and rename.
All data handler routines are registered with our data preprocessor as a `k:func` object where `k` is the name (`str`) of the data handler and `func` (`callable`) is the function which is called.

In the data config, users can request which data handler to apply by specifying the `name` with which the data handler was registered and the appropriate `arguments`. Each data handler accepts two types of arguments via `DataHandlerArguments` (as defined in the data preprocessor [schema](./advanced-data-preprocessing.md#what-is-data-config-schema)), as shown below.

Arguments to the data handlers are of two types.

First, each data handler is a routine passed to an underlying HF API, so the `kwargs` supported by that API can be passed via the `arguments` section of the data handler config. In our pre-existing handlers the underlying API is one of:
- [Map](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map)
- [Filter](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.filter)
- [Rename](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.rename_columns)
- [Select](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.select)
- [Remove](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.remove_columns)

The operations performed by the pre-existing handlers are described in the [Preexisting data handlers](#preexisting-data-handlers) section.

For example, users can pass `batched` through `arguments` to enable [batched processing](https://huggingface.co/docs/datasets/en/about_map_batch) by the data handler.

Second, users can pass any number of keyword arguments required by the data handling `routine` function itself as [`fn_kwargs`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map.fn_kwargs) inside `arguments`.

A typical YAML snippet where you'd specify arguments to the handlers:

```yaml
datapreprocessor:
  ...
datasets:
  - name: my_dataset
    data_paths:
      - /path/to/my_dataset
    data_handlers:
      - name: tokenize
        arguments:
          # Additional kwargs passed directly to the underlying HF API call
          batched: false
          num_proc: 10

          fn_kwargs:
            # Any arguments specific to the tokenize handler itself
            truncation: true
            max_length: 1280
```

In the snippet above, `num_proc` and `batched` are passed straight to `datasets.Dataset.map(...)`, while `truncation` and `max_length` directly control how the handler performs tokenization.

For native handlers like `REMOVE`, `RENAME`, and `SELECT` (see below) you don't need to pass `fn_kwargs`; their arguments are provided directly in `arguments`.
### Default Arguments

Each data handler supports many arguments, and some of them are provided to the data handler automatically by the data processor framework.
The framework makes these arguments available to the data handlers via `kwargs`:

1. `tokenizer`: The `AutoTokenizer` representation of the `tokenizer_name_or_path` or `model_name_or_path` argument passed to the library.
2. `column_names`: The names of the columns of the current dataset being processed.

**The data preprocessor also provides one piece of special handling: passing `remove_columns` as `all` is internally translated into the full list of column names before being handed to the `Map` or `Filter` data handler routines**, as in the sketch below.
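A minimal sketch of this, using the `tokenize` handler described later in this document; the `text` column name is hypothetical, and `remove_columns: all` drops every original column once the handler has produced its tokenized output:

```yaml
data_handlers:
  - name: tokenize
    arguments:
      remove_columns: all
      fn_kwargs:
        text_column_name: "text"  # hypothetical column name
```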
## Preexisting data handlers
This library currently supports the following preexisting data handlers. These handlers can be requested by name, and users can look up the function arguments in the [data handlers source code](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py):
### `tokenize_and_apply_input_masking`:
Tokenizes input text and applies masking to the labels for causal language modeling tasks; good for input/output datasets.
By default this handler adds the `EOS_TOKEN`, which can be disabled via a handler argument; see [this config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/tokenize_and_apply_input_masking.yaml) or the `add_eos_token` argument below.

Users don't need to pass any extra `response` or `instruction` templates here.

**Type: MAP**

**arguments**
- Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
- `element`: the HF Dataset element.
- `input_column_name`: Name of the input (instruction) field in the dataset.
- `output_column_name`: Name of the output field in the dataset.
- `add_eos_token`: whether to add `tokenizer.eos_token` to the text; defaults to `True`.

**Returns:**
- Tokenized Dataset element with `input_ids`, `labels` and `attention_mask` columns, where `labels` mask out the `input` section of the dataset.
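A minimal sketch of requesting this handler; the `input` and `output` column names are hypothetical and should match your dataset:

```yaml
data_handlers:
  - name: tokenize_and_apply_input_masking
    arguments:
      fn_kwargs:
        input_column_name: "input"    # hypothetical column name
        output_column_name: "output"  # hypothetical column name
        add_eos_token: true
```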
### `apply_custom_jinja_template`:
Applies a custom Jinja template (e.g., Alpaca style) to format dataset elements.
Returns a dataset which contains the column `formatted_text_column_name` holding the string formatted with the provided template.

Users need to pass in an appropriate `response_template` if they specify this handler as the final handler, so that the
`DataCollatorForCompletionOnlyLM` used underneath can apply proper masking and ensure the model learns only on the responses.

**Type: MAP**

**arguments**
- Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
- `element`: the HF Dataset element.
- `formatted_text_column_name`: the field in which to store the formatted text.
- `template`: Jinja template to format data with. Features of the Dataset should be referred to by their key.

**Returns:**
- Formatted HF Dataset element, produced by formatting the dataset with the provided Jinja template and saving the result to the `formatted_text_column_name` column.
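A minimal sketch of requesting this handler; the template and the `input`/`output` keys inside it are hypothetical and should be adapted to your dataset:

```yaml
data_handlers:
  - name: apply_custom_jinja_template
    arguments:
      fn_kwargs:
        formatted_text_column_name: "formatted_text"
        # hypothetical template referring to dataset features by key
        template: "### Input: {{ input }}\n\n### Response: {{ output }}"
```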
### `apply_tokenizer_chat_template`:
Uses the tokenizer's chat template to preprocess dataset elements; good for single/multi turn chat templates.
Returns a dataset which contains the column `formatted_text_column_name` holding the chat-template-formatted string.

Since this handler does not tokenize the dataset, users need to provide an appropriate `response_template` and `instruction_template` for the
`DataCollatorForCompletionOnlyLM` used underneath to apply proper masking and ensure the model learns only on assistant responses.

**Type: MAP**

**arguments**
- Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
- `element`: the HF Dataset element.
- `formatted_text_column_name`: the field in which to store the rendered text.
- `conversation_column`: column name where the chat template expects the conversation.

**Returns:**
- Formatted HF Dataset element, produced by formatting the dataset with the tokenizer's chat template and saving the result to the `formatted_text_column_name` column.
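A minimal sketch of requesting this handler; the `messages` conversation column name is hypothetical:

```yaml
data_handlers:
  - name: apply_tokenizer_chat_template
    arguments:
      fn_kwargs:
        formatted_text_column_name: "formatted_chat"
        conversation_column: "messages"  # hypothetical column name
```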
### `tokenize_and_apply_chat_template_with_masking`:
Uses the tokenizer's chat template to preprocess dataset elements; good for single/multi turn chat templates.
It then tokenizes the dataset while masking all user and system conversations, ensuring the model learns only on assistant responses.
Because this handler tokenizes the dataset, you don't need to pass any extra arguments for the data collator.

**Type: MAP**

**arguments**
- Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

*Note: it is always recommended to use this handler with `remove_columns: all` as an argument, because retaining text columns alongside tokenized columns while training can cause a crash.*

**fn_args:**
- `element`: the HF Dataset element.
- `formatted_text_column_name`: the field in which to store the rendered text.
- `conversation_column`: column name where the chat template expects the conversation.

**Returns:**
- Tokenized Dataset element containing `input_ids`, `labels` and `attention_mask`.
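A minimal sketch of requesting this handler, following the `remove_columns: all` recommendation above; the `messages` conversation column name is hypothetical:

```yaml
data_handlers:
  - name: tokenize_and_apply_chat_template_with_masking
    arguments:
      remove_columns: all
      fn_kwargs:
        conversation_column: "messages"  # hypothetical column name
```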
### `tokenize`:
Tokenizes one column of the dataset, passed as the input `text_column_name`. See the YAML snippet near the top of this document for an example of requesting this handler.

**Type: MAP**

**arguments**
- Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_kwargs:**
- `element`: the HF Dataset element.
- `text_column_name`: The dataset field to tokenize.
- `truncation`: Truncation strategy to use; see [padding and truncation](https://huggingface.co/docs/transformers/en/pad_truncation).
- `max_length`: Max length to truncate the samples to.

**Return:**
- Tokenized dataset element with `input_ids` and `labels` produced from the field `text_column_name`.
### `duplicate_columns`:
Duplicates one column of a dataset into another, new column.

**Type: MAP**

**arguments**
- Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
- `element`: the HF Dataset element.
- `existing_column_name`: Name of the column to be duplicated.
- `new_column_name`: Name of the new column where the duplicated column is saved.

**Return:**
- Formatted HF dataset element with `new_column_name` into which the `existing_column_name` content is copied.
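A minimal sketch of requesting this handler; both column names are hypothetical:

```yaml
data_handlers:
  - name: duplicate_columns
    arguments:
      fn_kwargs:
        existing_column_name: "input"     # hypothetical column name
        new_column_name: "input_copy"     # hypothetical column name
```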
### `skip_samples_with_large_columns`:
Skips dataset elements in which the specified column is larger than the passed maximum length.

**Type: FILTER**

**arguments**
- Any argument supported by the [HF Filter API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.filter)

**fn_args**:
- `element`: HF dataset element.
- `column_name`: Name of the column to filter on.
- `max_allowed_length`: Max allowed length of the column, in either characters or tokens.

**Return:**
- A filtered dataset containing only elements where the length of column `column_name` is shorter than the max allowed length.
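A minimal sketch of requesting this handler, assuming it runs after tokenization so that an `input_ids` column exists and the length is counted in tokens:

```yaml
data_handlers:
  - name: skip_samples_with_large_columns
    arguments:
      fn_kwargs:
        column_name: "input_ids"     # assumes a prior tokenization handler
        max_allowed_length: 4096
```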
### `remove_columns`:
Directly calls [remove_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.remove_columns) from the HF API on the dataset.

**Type: REMOVE**

**arguments**:
- `column_names`: Names of the columns to be removed from the dataset.

**fn_args**:
- None, as this is a native API.

**Returns:**
- Dataset with the specified `column_names` removed.
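A minimal sketch of requesting this native handler; note that its arguments go directly under `arguments` with no `fn_kwargs`, the column name is hypothetical, and a list of names is assumed to be accepted:

```yaml
data_handlers:
  - name: remove_columns
    arguments:
      column_names:
        - "metadata"   # hypothetical column name
```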
### `select_columns`:
Directly calls [select](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.select) from the HF API.

**Type: SELECT**

**arguments**:
- `column_names`: Names of the columns to be retained in the new dataset.

**fn_args**:
- None, as this is a native API.

**Returns:**
- Dataset where only the columns specified in `column_names` are retained.
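A minimal sketch of requesting this native handler, assuming only the tokenized columns should be kept for training:

```yaml
data_handlers:
  - name: select_columns
    arguments:
      column_names:
        - "input_ids"
        - "labels"
        - "attention_mask"
```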
### `rename_columns`:
Directly calls [rename_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.rename_columns) from the HF API.

**Type: RENAME**

**arguments**:
- `column_mapping`: Column names passed as a `str:str` mapping from `old_name:new_name`.

**fn_args**:
- None, as this is a native API.

**Returns:**
- Dataset where columns are renamed according to the provided column mapping.
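A minimal sketch of requesting this native handler; the old and new column names are hypothetical:

```yaml
data_handlers:
  - name: rename_columns
    arguments:
      column_mapping:
        "question": "input"   # hypothetical old:new names
        "answer": "output"
```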
## Additional arguments
Please note that the choice of extra arguments needed by a handler depends on how the dataset looks after processing, which is the combined result of applying the full DAG of data handlers. Choose them by referring to our other documentation [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/README.md) and [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/docs/advanced-data-preprocessing.md), and to the reference templates provided [here](https://github.com/foundation-model-stack/fms-hf-tuning/tree/main/tests/artifacts/predefined_data_configs).


## Extra data handlers
Users are also allowed to pass custom data handlers to the [`sft_trainer.py::train()`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L71) API call via the [`additional_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L89) argument.

The argument expects users to pass a map similar to the existing data handlers, `k(str):func(callable)`, which will be registered with the data preprocessor via its [`register_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/data/data_processors.py#L65) API.

docs/advanced-data-preprocessing.md

Lines changed: 3 additions & 76 deletions
@@ -164,84 +164,11 @@ Probably something like this:
 Additionally while loading the dataset, users can specify which columns to rename via `rename_columns` and which to retain via `retain_columns` arguments above.
 The order of application of these operations is *strictly rename followed by retain* so users should note that an old column name which is renamed will not be available in retain and hence should be careful while applying these operations. The code will throw a `ValueError` in case user specified a column requested to be renamed via rename argument in retain argument as well.
 
-### How can users specify data handlers.
+### Data Handlers
 
-Data handlers, as explained above, are routines which process the dataset using [HF map framework](https://huggingface.co/docs/datasets/en/process#map).
-All data handler routines are registered with our data preprocessor as a `k:func` object where
-`k` is the name (`str`) of the data handler and `func` (`callable`) is the function which is called.
+Data handlers, as explained above, are routines which process the dataset using [HF process frameworks](https://huggingface.co/docs/datasets/en/process) including map, filter, remove, select, and rename.
 
-In the data config, users can request which data handler to apply by requesting the corresponding `name`
-with which the data handler was registered and specifying the appropriate `arguments`. Each data handler accepts two types of arguments via `DataHandlerArguments` (as defined in the above [schema](#what-is-data-config-schema)), as shown below.
-
-```yaml
-DataHandler:
-  type: object
-  additionalProperties: false
-  properties:
-    name:
-      type: string
-    arguments:
-      $ref: '#/definitions/DataHandlerArguments'
-  required:
-    - arguments
-    - name
-  title: DataHandler
-DataHandlerArguments:
-  type: object
-  additionalProperties: false
-  properties:
-    remove_columns:
-      type: string
-    batched:
-      type: boolean
-    fn_kwargs:
-      $ref: '#/definitions/DataHandlerFnKwargs'
-  required:
-    - fn_kwargs
-    - remove_columns
-  title: DataHandlerArguments
-DataHandlerFnKwargs:
-  type: object
-  properties:
-    str:
-      type: str
-  title: DataHandlerFnKwargs
-```
-
-Arguments to the data handlers are of two types,
-
-Each data handler is a routine passed to the underlying [HF Map API]((https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map)) so the `kwargs` supported by the underlying API can be passed via the `arguments` section of the data handler config.
-
-For example, users can pass `remove_columns` to remove any columns from the dataset when executing the particular handler or they can use `batched` to ensure [batched processing](https://huggingface.co/docs/datasets/en/about_map_batch) of the data handler.
-
-Users can also pass any number of `kwargs` arguments required for each data handling `routine` function as [`fn_kwargs`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map.fn_kwargs) inside the arguments.
-
-#### Preexisting data handlers
-This library currently supports the following [preexisting data handlers](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py#L156):
-- `add_tokenizer_eos_token`:
-  Appends the tokenizer's EOS token to a specified dataset field.
-- `apply_custom_data_formatting_template`:
-  Applies a custom template (e.g., Alpaca style) to format dataset elements.
-  By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_template.yaml)
-- `tokenize_and_apply_input_masking`:
-  Tokenizes input text and applies masking to the labels for causal language modeling tasks, good for input/output datasets.
-  By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/tokenize_and_apply_input_masking.yaml)
-- `apply_custom_jinja_template`:
-  Applies a custom jinja template (e.g., Alpaca style) to format dataset elements.
-  By default this handler adds `EOS_TOKEN` which can be disabled by a handler argument, [see](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/apply_custom_jinja_template.yaml)
-- `apply_tokenizer_chat_template`:
-  Uses a tokenizer's chat template to preprocess dataset elements, good for single/multi turn chat templates.
-- `duplicate_columns`:
-  Duplicates one column of the dataset to another column.
-- `tokenize`:
-  Tokenizes one column of the dataset passed as input `dataset_text_field`.
-
-These handlers could be requested by their same name and users can lookup the function args from [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py)
-
-#### Extra data handlers
-Users are also allowed to pass custom data handlers using [`sft_trainer.py::train()`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L71) API call via the [`additional_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L89) argument.
-
-The argument expects users to pass a map similar to the existing data handlers `k(str):func(callable)` which will be registered with the data preprocessor via its [`register_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/data/data_processors.py#L65) api
+For a thorough explanation of data handlers and how to use them, see the [data handlers document](./advanced-data-handlers.md)
 
 ### Data Mixing
 Dataset mixing allows users to mix multiple datasets often with different `sampling ratios` to ensure the model is trained on a mix of some datasets in specific proportion.

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ dependencies = [
 "trl>=0.13,<0.18",
 "peft>=0.8.0,<0.14",
 "protobuf>=5.28.0,<6.0.0",
-"datasets>=2.15.0,<4.0",
+"datasets>=3.5.0,<4.0",
 "simpleeval>=0.9.13,<2.0",
 "pillow>=11.0.0,<12.0",
 ]
