# Data Handlers
Please note that this document is intended for advanced users who want to customize data handler arguments and use data handler functions to perform
complex operations on the data configs.

Data handlers are routines which process a dataset using the [HF process framework](https://huggingface.co/docs/datasets/en/process), including map, filter, remove, select, and rename.
All data handler routines are registered with our data preprocessor as a `k:func` object where
`k` is the name (`str`) of the data handler and `func` (`callable`) is the function which is called.

In the data config, users request a data handler by the `name` with which it was registered and specify the appropriate `arguments`. Each data handler accepts two types of arguments via `DataHandlerArguments` (as defined in the data preprocessor [schema](./advanced-data-preprocessing.md#what-is-data-config-schema)), as shown below.

Arguments to the data handlers are of two types:

1. **Arguments for the underlying HF API.** Each data handler is a routine passed to an underlying HF API, so the `kwargs` supported by that API can be passed via the `arguments` section of the data handler config. In our pre-existing handlers the underlying API is one of:
   - [Map](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map)
   - [Filter](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.filter)
   - [Rename](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.rename_columns)
   - [Select](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.select)
   - [Remove](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.remove_columns)

   For example, users can pass `batched` through `arguments` to ensure [batched processing](https://huggingface.co/docs/datasets/en/about_map_batch) of the data handler.

2. **Arguments for the handler routine itself.** Users can pass any number of `kwargs` required by each data handling `routine` function as [`fn_kwargs`](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.map.fn_kwargs) inside `arguments`.

The operations performed by the pre-existing handlers are described in the [Preexisting data handlers](#preexisting-data-handlers) section.
A typical YAML snippet where you'd specify arguments to the handlers:
```yaml
datapreprocessor:
    ...
    datasets:
      - name: my_dataset
        data_paths:
          - /path/to/my_dataset
        data_handlers:
          - name: tokenize
            arguments:
              # Additional kwargs passed directly to the underlying HF API call
              batched: false
              num_proc: 10

              fn_kwargs:
                # Any arguments specific to the tokenize handler itself
                truncation: true
                max_length: 1280
```

For example, `num_proc` and `batched` in the snippet above are passed straight to
`datasets.Dataset.map(...)`, while the `truncation` and `max_length` arguments
directly control how the handler performs tokenization.

For native handlers like `REMOVE`, `RENAME`, and `SELECT` (see below) you don't need to pass `fn_kwargs`; their arguments are provided directly in `arguments`, as in the sketch below.
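
For instance, a minimal sketch of the `data_handlers` portion of a dataset entry that uses the native `rename_columns` handler; the column names here are illustrative:
```yaml
data_handlers:
  - name: rename_columns
    arguments:
      # Native handlers take their arguments directly, with no fn_kwargs
      column_mapping:
        prompt: input          # old_name: new_name (illustrative column names)
        completion: output
```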

### Default Arguments
Each data handler supports many arguments, and some of them are automatically provided to the data handler by the data preprocessor framework.
The framework makes these arguments available to the data handlers via `kwargs`.

1. `tokenizer`: The `AutoTokenizer` representation of the tokenizer, loaded from the `tokenizer_name_or_path` or `model_name_or_path` argument passed to the library.
2. `column_names`: The names of the columns of the current dataset being processed.

**The data preprocessor also provides one special handling: passing `remove_columns` as `all` is internally translated to the full list of column names before being forwarded to the `Map` or `Filter` data handler routines, as sketched below.**
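
A minimal sketch of a handler using this special value; the handler choice and column name are just illustrative:
```yaml
data_handlers:
  - name: tokenize
    arguments:
      # "all" is expanded internally to every column of the dataset being processed
      remove_columns: all
      fn_kwargs:
        text_column_name: text   # illustrative column name
```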

## Preexisting data handlers
This library currently supports the following preexisting data handlers. These handlers can be requested by the same name, and users can look up the function arguments in the [data handlers source code](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tuning/data/data_handlers.py):

### `tokenize_and_apply_input_masking`:
Tokenizes input text and applies masking to the labels for causal language modeling tasks; good for input/output datasets.
By default this handler adds `EOS_TOKEN`, which can be disabled via a handler argument; see [this config](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/tests/artifacts/predefined_data_configs/tokenize_and_apply_input_masking.yaml) or the `add_eos_token` argument below.

Users don't need to pass any extra `response` or `instruction` templates here.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
 - `element`: the HF Dataset element.
 - `input_column_name`: Name of the input (instruction) field in the dataset.
 - `output_column_name`: Name of the output field in the dataset.
 - `add_eos_token`: whether to add `tokenizer.eos_token` to the text; defaults to `True`.

**Returns:**
- Tokenized Dataset element with `input_ids`, `labels`, and `attention_mask` columns, where `labels` masks the `input` section of the dataset.
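
A minimal sketch of requesting this handler for an input/output style dataset; the column names are illustrative:
```yaml
data_handlers:
  - name: tokenize_and_apply_input_masking
    arguments:
      # Optionally drop the raw text columns once the data is tokenized
      remove_columns: all
      fn_kwargs:
        input_column_name: input      # illustrative column names
        output_column_name: output
        add_eos_token: true
```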

### `apply_custom_jinja_template`:
Applies a custom Jinja template (e.g., Alpaca style) to format dataset elements.
Returns a dataset which contains the column `formatted_text_column_name` containing the string formatted using the provided template.

Users need to pass an appropriate `response_template` if they specify this handler as the final handler, so that the
`DataCollatorForCompletionOnlyLM` used underneath can apply proper masking and ensure the model learns only on responses.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
 - `element`: the HF Dataset element.
 - `formatted_text_column_name`: the column in which to store the formatted text.
 - `template`: Jinja template to format data with. Features of the Dataset should be referred to by their key.

**Returns:**
- Formatted HF Dataset element produced by formatting the dataset with the provided Jinja template, saving the result to the `formatted_text_column_name` column.
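
A minimal sketch of requesting this handler with an Alpaca-style template; the template string and column names are illustrative assumptions:
```yaml
data_handlers:
  - name: apply_custom_jinja_template
    arguments:
      fn_kwargs:
        formatted_text_column_name: formatted_text
        # Dataset features are referenced by their key inside the template
        template: "### Instruction: {{ input }}\n\n### Response: {{ output }}"
```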

### `apply_tokenizer_chat_template`:
Uses the tokenizer's chat template to preprocess dataset elements; good for single/multi turn chat templates.
Returns a dataset which contains the column `formatted_text_column_name` containing the chat-template-formatted string.

Since this handler does not tokenize the dataset, users need to provide an appropriate `response_template` and `instruction_template` for the
`DataCollatorForCompletionOnlyLM` used underneath to apply proper masking and ensure the model learns only on assistant responses.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
 - `element`: the HF Dataset element.
 - `formatted_text_column_name`: the field in which to store the rendered text.
 - `conversation_column`: column name where the chat template expects the conversation.

**Returns:**
- Formatted HF Dataset element produced by formatting the dataset with the tokenizer's chat template, saving the result to the `formatted_text_column_name` column.
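
A minimal sketch of requesting this handler; the column names are illustrative:
```yaml
data_handlers:
  - name: apply_tokenizer_chat_template
    arguments:
      fn_kwargs:
        formatted_text_column_name: formatted_chat
        conversation_column: messages   # illustrative: column holding the list of chat turns
```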

### `tokenize_and_apply_chat_template_with_masking`:
Uses the tokenizer's chat template to preprocess dataset elements; good for single/multi turn chat templates.
It then tokenizes the dataset while masking all user and system conversations, ensuring the model learns only on assistant responses.
Since this handler tokenizes the dataset, you don't need to pass any extra arguments for the data collator.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

   *Note: it is always recommended to use this handler with `remove_columns: all`, because retaining the text columns alongside the tokenized columns while training can cause a crash. See the sketch below.*

**fn_args:**
 - `element`: the HF Dataset element.
 - `formatted_text_column_name`: the field in which to store the rendered text.
 - `conversation_column`: column name where the chat template expects the conversation.

**Returns:**
- Tokenized Dataset element containing `input_ids`, `labels`, and `attention_mask`.
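
A minimal sketch of requesting this handler with the recommended `remove_columns: all`; the column name is illustrative:
```yaml
data_handlers:
  - name: tokenize_and_apply_chat_template_with_masking
    arguments:
      # Recommended: drop the raw text columns once the data is tokenized
      remove_columns: all
      fn_kwargs:
        conversation_column: messages   # illustrative: column holding the chat turns
```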

### `tokenize`:
Tokenizes one column of the dataset, whose name is passed as the `text_column_name` argument. See the YAML snippet near the top of this document for an example configuration.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
 - `element`: the HF Dataset element.
 - `text_column_name`: The dataset field to tokenize.
 - `truncation`: Truncation strategy to use; see the [padding and truncation docs](https://huggingface.co/docs/transformers/en/pad_truncation).
 - `max_length`: Max length to truncate the samples to.

**Returns:**
- Dataset element where the `text_column_name` field is tokenized into `input_ids` and `labels`.

### `duplicate_columns`:
Duplicates one column of a dataset into a new column.

**Type: MAP**

**arguments**
 - Any argument supported by the [HF MAP API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map)

**fn_args:**
 - `element`: the HF Dataset element.
 - `existing_column_name`: Name of the column to be duplicated.
 - `new_column_name`: Name of the new column where the duplicated column is saved.

**Returns:**
- Formatted HF Dataset element with `new_column_name` containing a copy of the `existing_column_name` content.
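
A minimal sketch of requesting this handler; the column names are illustrative:
```yaml
data_handlers:
  - name: duplicate_columns
    arguments:
      fn_kwargs:
        existing_column_name: output      # illustrative column names
        new_column_name: output_copy
```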

### `skip_samples_with_large_columns`:
Skips elements of the dataset whose specified column is larger than the passed max length.

**Type: FILTER**

**arguments**
 - Any argument supported by the [HF Filter API](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.filter)

**fn_args:**
 - `element`: the HF Dataset element.
 - `column_name`: Name of the column to filter on.
 - `max_allowed_length`: Max allowed length of the column, in either characters or tokens.

**Returns:**
- A filtered dataset which contains only elements where the length of the column `column_name` is shorter than the max allowed length.
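
A minimal sketch of requesting this handler to drop overly long samples; the column name and limit are illustrative:
```yaml
data_handlers:
  - name: skip_samples_with_large_columns
    arguments:
      fn_kwargs:
        column_name: input_ids        # illustrative: filter on tokenized length
        max_allowed_length: 4096
```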

### `remove_columns`:
Directly calls [remove_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.remove_columns) from the HF API on the dataset.

**Type: REMOVE**

**arguments:**
 - `column_names`: Names of the columns to be removed from the dataset.

**fn_args:**
 - None, as this is a native API.

**Returns:**
- Dataset with the specified `column_names` removed.
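
Note this is the standalone native handler, distinct from passing `remove_columns` as a kwarg inside a `MAP` handler's `arguments`. A minimal sketch; the column names are illustrative:
```yaml
data_handlers:
  - name: remove_columns
    arguments:
      column_names:
        - metadata      # illustrative columns to drop
        - source
```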

### `select_columns`:
Directly calls [select](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.select) from the HF API.

**Type: SELECT**

**arguments:**
 - `column_names`: Names of the columns to be retained in the new dataset.

**fn_args:**
 - None, as this is a native API.

**Returns:**
- Dataset where only the columns specified in `column_names` are retained.

### `rename_columns`:
Directly calls [rename_columns](https://huggingface.co/docs/datasets/v3.2.0/en/package_reference/main_classes#datasets.Dataset.rename_columns) from the HF API.

**Type: RENAME**

**arguments:**
 - `column_mapping`: A mapping of column names passed as `str: str` pairs in the form `old_name: new_name`.

**fn_args:**
 - None, as this is a native API.

**Returns:**
- Dataset where columns are renamed using the provided column mapping.

## Additional arguments
Please note that the extra arguments needed with each handler depend on how the dataset looks after processing, which is the combined result of applying the full DAG of data handlers.
Choose them by referring to our other documentation [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/README.md) and [here](https://github.com/foundation-model-stack/fms-hf-tuning/blob/main/docs/advanced-data-preprocessing.md), and to the reference templates provided [here](https://github.com/foundation-model-stack/fms-hf-tuning/tree/main/tests/artifacts/predefined_data_configs).


## Extra data handlers
Users are also allowed to pass custom data handlers to the [`sft_trainer.py::train()`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L71) API call via the [`additional_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/sft_trainer.py#L89) argument.

The argument expects users to pass a map similar to the existing data handlers, `k(str):func(callable)`, which will be registered with the data preprocessor via its [`register_data_handlers`](https://github.com/foundation-model-stack/fms-hf-tuning/blob/d7f06f5fc898eb700a9e89f08793b2735d97889c/tuning/data/data_processors.py#L65) API.
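
Once registered, such a handler can be requested from the data config by the name it was registered under, just like the preexisting handlers. A minimal sketch, assuming a custom handler registered as `my_custom_handler` that accepts a `suffix` kwarg (both the name and the kwarg are hypothetical):
```yaml
data_handlers:
  - name: my_custom_handler      # hypothetical name registered via additional_data_handlers
    arguments:
      fn_kwargs:
        suffix: "<END>"          # hypothetical kwarg accepted by the custom handler
```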