
Commit 4673643

Merge pull request #2301 from AI-Hypercomputer:aireen/padding_batch
PiperOrigin-RevId: 807767938
2 parents: 7b33d57 + 0a65506

17 files changed (+95, -90 lines)

docs/guides/data_input_grain.md
Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ Grain ensures determinism in data input pipelines by saving the pipeline's state
 * **Debug training anomalies**: When troubleshooting training spikes or anomalies, the ability to replay the exact data sequence helps distinguish between bad data batches and underlying hardware or software issues.

 ## Data shuffling
-* **Global shuffle**: This feature is only available when using Grain with [ArrayRecord](https://github.com/google/array_record) (random access) format, achieved by shuffling indices globally at the beginning of each epoch and then reading the elements according to the random order. This is usually fast enough, even when using hard drives and distributed file systems.
+* **Global shuffle**: This feature is only available when using Grain with [ArrayRecord](https://github.com/google/array_record) (random access) format, achieved by shuffling indices globally at the beginning of each epoch and then reading the elements according to the random order. This shuffle method effectively prevents local overfitting, leading to better training results.
 * **Hierarchical shuffle**: For sequential access format [Parquet](https://arrow.apache.org/docs/python/parquet.html), shuffle is performed by these steps: file shuffling, interleave from files, and window shuffle using a fixed size buffer.

 ## Using Grain
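The global shuffle described in this doc change amounts to drawing a fresh permutation of record indices each epoch and reading records in that order. Below is a minimal sketch of that idea in plain NumPy; it is illustrative only (not Grain's actual implementation), and the `reader` name in the comment is a hypothetical random-access source such as an ArrayRecord file.

```python
import numpy as np

def global_shuffle_order(num_records: int, epoch: int, seed: int = 0) -> np.ndarray:
  """Return a fresh global permutation of record indices for the given epoch."""
  rng = np.random.default_rng([seed, epoch])  # different permutation every epoch
  return rng.permutation(num_records)

# With a random-access format such as ArrayRecord, records can then be read
# directly in this order, e.g. [reader[i] for i in global_shuffle_order(n, epoch)].
```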

docs/guides/data_input_pipeline.md
Lines changed: 2 additions & 2 deletions

@@ -37,11 +37,11 @@ The approaches to solve these challenges depend on whether your dataset supports
 Random-access formats are highly recommended for multi-host training because they allow any part of the file to be read directly by its index.<br>
 In MaxText, this is best supported by the ArrayRecord format using the Grain input pipeline. This approach gracefully handles the key challenges:
 * **Concurrent access and uniqueness**: Grain assigns a unique set of indices to each host. ArrayRecord allows different hosts to read from different indices in the same file.
-* **Uneven completion**: Data indices are distributed evenly among hosts. Without packing, the data imbalance between hosts will be at most one batch. To handle the final steps where some hosts run out of data, you can enable the `generate_padding_example` flag. This directs hosts to generate empty "padding" batches until the training or evaluation steps are met. **Note**: When sequence packing is enabled, the difference in the number of packed examples per host can be larger. The `generate_padding_example` flag still solves this. However, as more hosts begin generating padding, you will observe a decrease in total_weights and a slower change in the training loss. If all hosts exhaust their data before the target step count is reached, both total_weights and loss will drop to 0.
+* **Uneven completion**: Data indices are distributed evenly among hosts. Without packing, the data imbalance between hosts will be at most one batch. To handle the final steps where some hosts run out of data, you can enable the `generate_padding_batch_train`/`generate_padding_batch_eval` flag. This directs hosts to generate empty "padding" batches until the training or evaluation steps are met. **Note**: When sequence packing is enabled, the difference in the number of packed examples per host can be larger. The `generate_padding_batch_train`/`generate_padding_batch_eval` flag still solves this. However, as more hosts begin generating padding, you will observe a decrease in total_weights and a slower change in the training loss. If all hosts exhaust their data before the target step count is reached, both total_weights and loss will drop to 0.

 ### Sequential access dataset
 * **Concurrent access and uniqueness**: Sequential-access datasets (e.g., Parquet, JSON, TFRecord) cannot be accessed by index, requiring a different strategy -- file-based sharding, where each host is given exclusive access to a specific subset of data files. **Key requirement**: `(Number of data files) % (Number of data-loading hosts) == 0`. If the file count isn't a multiple of the host count, the files will be distributed unevenly. For example, with 10 files and 8 hosts, some hosts will get two files while others get one, significantly worsening the "uneven completion" problem. If you have fewer files than hosts, performance will be severely degraded as all hosts are concurrently accessing all the files.
-* **Uneven completion**: Similar to random-access datasets, you can use the `generate_padding_example` flag to handle hosts that finish their file shards early (currently only supported in Hugging Face pipeline, not available in TFDS pipeline).
+* **Uneven completion**: Similar to random-access datasets, you can use the `generate_padding_batch_train`/`generate_padding_batch_eval` flag to handle hosts that finish their file shards early.

 ```{toctree}
 :hidden:
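To make the padding behaviour described in the updated "Uneven completion" bullets concrete, here is a minimal, self-contained sketch of the idea. It is illustrative only and not MaxText's actual `MultiHostDataLoadIterator`; the batch shape and column names are assumptions. Once a host exhausts its real data, it keeps emitting all-zero batches so every host can reach the same target step count, and those batches contribute zero weight to the loss.

```python
import numpy as np

def make_padding_batch(batch_size: int = 4, seq_len: int = 8) -> dict:
  """All-zero batch; zero targets/segmentation mean it adds no weight to the loss."""
  columns = ("inputs", "targets", "inputs_segmentation", "targets_segmentation")
  return {col: np.zeros((batch_size, seq_len), dtype=np.int32) for col in columns}

def with_padding_batches(batch_iter, total_steps: int):
  """Yield real batches until the underlying iterator is exhausted, then padding
  batches until total_steps is reached, keeping all hosts stepping in lockstep."""
  emitted = 0
  for batch in batch_iter:
    yield batch
    emitted += 1
    if emitted == total_steps:
      return
  while emitted < total_steps:
    yield make_padding_batch()
    emitted += 1
```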

requirements_with_jax_ai_image.txt
Lines changed: 3 additions & 3 deletions

@@ -4,7 +4,7 @@ datasets
 flax>=0.11.0
 google-api-python-client
 google-jetstream@git+https://github.com/AI-Hypercomputer/JetStream.git
-grain[parquet]>=0.2.6
+grain[parquet]>=0.2.12
 jaxtyping
 jsonlines
 mlperf-logging@git+https://github.com/mlperf/logging.git
@@ -13,12 +13,12 @@ orbax-checkpoint>=0.11.22
 pathwaysutils>=0.1.1
 pillow>=11.1.0
 pre-commit
-protobuf==3.20.3
+protobuf>=5.29.5
 pyink
 pylint
 pytest
 pytype
-sentencepiece==0.1.97
+sentencepiece>=0.2.0
 tensorflow-datasets
 tensorflow-text>=2.17.0
 tiktoken

requirements_with_jax_stable_stack_0_6_1_pipreqs.txt
Lines changed: 1 addition & 1 deletion

@@ -8,7 +8,7 @@ datasets==3.6.0
 etils==1.12.2
 evaluate==0.4.4
 flax==0.11.0
-grain==0.2.10
+grain==0.2.12
 grpcio==1.72.0rc1
 huggingface_hub==0.33.0
 jax==0.6.0

src/MaxText/assets/tokenizer
Binary file (2.37 KB) not shown.

src/MaxText/configs/base.yml
Lines changed: 2 additions & 0 deletions

@@ -470,6 +470,8 @@ eval_data_columns: ['text'] # for DPO dataset containing "chosen" and "rejected"
 eval_image_column: 'image'
 packing: True
 num_epoch: 1 # only grain and tfds pipeline supports num_epoch > 1
+generate_padding_batch_train: False
+generate_padding_batch_eval: False

 # direct preference optimization (DPO)
 use_dpo: False

src/MaxText/data_loader.py
Lines changed: 3 additions & 4 deletions

@@ -26,7 +26,6 @@
     maybe_record_goodput,
 )

-
 class DataLoader:
   """
   Loads preprocessed data for training.
@@ -51,10 +50,10 @@ def load_next_batch(self):
       self.last_batch = jax.lax.with_sharding_constraint(example_batch, self.input_data_shardings)
       self.check_example_batch()
     except Exception as e:  # pylint: disable=broad-except
-      if "StopIteration" in str(e):
-        raise exceptions.StopTraining("You may have run out of training data.")
+      if isinstance(e, StopIteration):
+        raise exceptions.StopTraining(f"You may have run out of training data. Received {type(e)} exception: ({e})")
       else:
-        raise exceptions.StopTraining(f"`load_next_batch()` failed ({e}).")
+        raise exceptions.StopTraining(f"`load_next_batch()` failed with {type(e)} exception: ({e}).")
     return self.last_batch

   def check_example_batch(self):

src/MaxText/input_pipeline/_grain_data_processing.py
Lines changed: 11 additions & 6 deletions

@@ -120,11 +120,11 @@ def pretrain_preprocessing_pipeline(dataset, config, data_columns, tokenize, gra
           data_columns, config.max_target_length, config.add_bos, config.add_eos, tokenizer_model
       )
   )
-
   # Pack and Batch examples.
+  batch_size = config.global_batch_size_to_load // jax.process_count()
   if config.packing:
     length_struct = {col: config.max_target_length for col in data_columns}
-    dataset = grain.experimental.FirstFitPackIterDataset(dataset, length_struct=length_struct, num_packing_bins=30)
+    dataset = grain.experimental.FirstFitPackIterDataset(dataset, length_struct=length_struct, num_packing_bins=batch_size)
     rekey_dict = {
         "targets_segmentation": "targets_segment_ids",
         "inputs_segmentation": "inputs_segment_ids",
@@ -134,7 +134,8 @@ def pretrain_preprocessing_pipeline(dataset, config, data_columns, tokenize, gra
     dataset = dataset.map(_input_pipeline_utils.Rekey(rekey_dict))
   else:
     dataset = dataset.map(_input_pipeline_utils.PadToMaxLength(config.max_target_length, pad_id))
-  dataset = dataset.batch(batch_size=config.global_batch_size_to_load // jax.process_count(), drop_remainder=False)
+  batch_fn = functools.partial(grain.experimental.batch_and_pad, batch_size=batch_size, pad_value=pad_id)
+  dataset = dataset.batch(batch_size, batch_fn=batch_fn)

   # Shift inputs for teacher-forced training
   dataset = dataset.map(
@@ -175,7 +176,9 @@ def dpo_preprocessing_pipeline(dataset, config, data_columns, tokenize, grain_wo
   )

   dataset = dataset.map(_input_pipeline_utils.PadToMaxLength(config.max_target_length, pad_id))
-  dataset = dataset.batch(batch_size=config.global_batch_size_to_load // jax.process_count(), drop_remainder=False)
+  batch_size = config.global_batch_size_to_load // jax.process_count()
+  batch_fn = functools.partial(grain.experimental.batch_and_pad, batch_size=batch_size, pad_value=pad_id)
+  dataset = dataset.batch(batch_size, batch_fn=batch_fn)
   dataset = dataset.mp_prefetch(grain.MultiprocessingOptions(num_workers=grain_worker_count))
   return dataset

@@ -216,7 +219,9 @@ def make_grain_train_iterator(
         tokenize=config.tokenize_train_data,
         grain_worker_count=config.grain_worker_count,
     )
-    return multihost_dataloading.MultiHostDataLoadIterator(train_dataloader, global_mesh)
+    return multihost_dataloading.MultiHostDataLoadIterator(
+        train_dataloader, global_mesh, config.generate_padding_batch_train
+    )
   else:
     get_ds_fn = functools.partial(
         get_datasets,
@@ -283,7 +288,7 @@ def make_grain_eval_iterator(
         tokenize=config.tokenize_eval_data,
        grain_worker_count=config.grain_worker_count_eval,
    )
-    return multihost_dataloading.MultiHostDataLoadIterator(eval_dataloader, global_mesh)
+    return multihost_dataloading.MultiHostDataLoadIterator(eval_dataloader, global_mesh, config.generate_padding_batch_eval)
   else:
     get_ds_fn = functools.partial(
        get_datasets,
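The batching change above swaps `drop_remainder=False` for grain's `batch_and_pad`, so the final short batch is padded up to the full per-host batch size instead of arriving smaller. A small sketch of the call pattern, mirroring the diff; the toy `examples` data and the `grain.MapDataset.source(...).to_iter_dataset()` construction are assumptions for illustration and not part of this change, and the module alias is assumed to match the `grain` import already used in this file.

```python
import functools
import numpy as np
import grain  # assumed to be the same `grain` module alias used in this file

# Ten toy examples of a single already-padded column, batched 4 at a time.
examples = [{"inputs": np.full(8, i, dtype=np.int32)} for i in range(10)]
batch_size, pad_id = 4, 0

dataset = grain.MapDataset.source(examples).to_iter_dataset()
batch_fn = functools.partial(grain.experimental.batch_and_pad, batch_size=batch_size, pad_value=pad_id)
dataset = dataset.batch(batch_size, batch_fn=batch_fn)

for batch in dataset:
  # Each batch keeps the full leading dimension of 4; the last one holds the two
  # remaining real examples plus rows filled with pad_id rather than being dropped.
  print(batch["inputs"].shape)  # expected: (4, 8)
```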

src/MaxText/input_pipeline/_hf_data_processing.py
Lines changed: 5 additions & 9 deletions

@@ -102,7 +102,6 @@ def vision_sft_preprocessing_pipeline(
       dataloading_host_index=dataloading_host_index,
       dataloading_host_count=dataloading_host_count,
       num_threads=1,
-      generate_padding_example=True,
       max_target_length=config.max_target_length,
       data_column_names=text_columns,
   )
@@ -162,8 +161,8 @@ def preprocessing_pipeline(
     packing=True,
     shift=True,
     num_threads=1,
-    drop_remainder=False,
-    generate_padding_example=False,
+    drop_remainder=True,
+    generate_padding_batch=False,
     use_dpo=None,
     use_sft=None,
     sft_train_on_completion_only=True,
@@ -239,7 +238,6 @@
       dataloading_host_index,
       dataloading_host_count,
       num_threads,
-      generate_padding_example,
       max_target_length,
       data_column_names,
   )
@@ -304,7 +302,7 @@ def lists2array(x):
       read_options=grain.ReadOptions(num_threads=num_threads, prefetch_buffer_size=128),
   )

-  multihost_gen = multihost_dataloading.MultiHostDataLoadIterator(dataloader, global_mesh)
+  multihost_gen = multihost_dataloading.MultiHostDataLoadIterator(dataloader, global_mesh, generate_padding_batch)

   # Return multi-host jax.Array prep iterator
   return multihost_gen
@@ -352,7 +350,7 @@ def make_hf_train_iterator(
       add_bos=config.add_bos,
       add_eos=config.add_eos,
       packing=config.packing,
-      generate_padding_example=False,
+      generate_padding_batch=config.generate_padding_batch_train,
       use_dpo=config.use_dpo,
       use_sft=config.use_sft,
       sft_train_on_completion_only=config.sft_train_on_completion_only,
@@ -374,8 +372,6 @@ def make_hf_eval_iterator(
       streaming=True,
       token=config.hf_access_token,
   )
-
-  eval_generate_padding_example = config.eval_steps > 0
   if config.use_sft and config.use_multimodal:
     eval_iter = vision_sft_preprocessing_pipeline(
         dataset=eval_ds,
@@ -404,7 +400,7 @@
       add_bos=config.add_bos,
       add_eos=config.add_eos,
       packing=config.packing,
-      generate_padding_example=eval_generate_padding_example,
+      generate_padding_batch=config.generate_padding_batch_eval,
       use_dpo=config.use_dpo,
       use_sft=config.use_sft,
       sft_train_on_completion_only=config.sft_train_on_completion_only,

src/MaxText/input_pipeline/_input_pipeline_utils.py
Lines changed: 5 additions & 27 deletions

@@ -157,9 +157,7 @@ def apply_chat_template(example, tokenizer_model, data_column_name):
   for message in example[data_column_name]:
     if message["role"] == "user":
       prompt = message
-      prompt_in_chat_template = tokenizer_model.apply_chat_template(
-          [prompt], add_generation_prompt=False, tokenize=False
-      )
+      prompt_in_chat_template = tokenizer_model.apply_chat_template([prompt], add_generation_prompt=False, tokenize=False)
       messages.append(prompt_in_chat_template)
       is_prompt.append(True)
     elif message["role"] == "assistant":
@@ -266,15 +264,13 @@ def __init__(
       dataloading_host_index: int,
       dataloading_host_count: int,
       num_threads: int,
-      generate_padding_example: bool,
       max_target_length: int,
       data_column_names: list[str],
   ):
     self.dataset = dataset
     self.num_threads = num_threads
     self.dataloading_host_count = dataloading_host_count
     self.dataloading_host_index = dataloading_host_index
-    self.generate_padding_example = generate_padding_example
     self.max_target_lenth = max_target_length
     self.data_column_names = data_column_names
     if hasattr(dataset, "n_shards"):
@@ -285,7 +281,6 @@ def __init__(
     self.dataset_shards = [dataloading_host_index * self.num_threads + i for i in range(self.num_threads)]
     self.datasets = [split_dataset_by_node(dataset, world_size=self.n_shards, rank=x) for x in self.dataset_shards]
     self.data_iters = []
-    self.out_of_data = False

   def _check_shard_count(self):
     if self.n_shards < (self.dataloading_host_count * self.num_threads):
@@ -300,20 +295,13 @@ def _update_shard(self, idx):
     """update shard"""
     new_shard = self.dataset_shards[idx] + self.dataloading_host_count * self.num_threads
     if new_shard < self.n_shards:
-      max_logging.log(
-          f"Updating host {self.dataloading_host_index} dataset {idx}, was on shard {self.dataset_shards[idx]}"
-      )
+      max_logging.log(f"Updating host {self.dataloading_host_index} dataset {idx}, was on shard {self.dataset_shards[idx]}")
      max_logging.log(f"New shard is {new_shard}")
      self.dataset_shards[idx] = new_shard
      self.datasets[idx] = split_dataset_by_node(self.dataset, world_size=self.n_shards, rank=self.dataset_shards[idx])
      self.data_iters[idx] = iter(self.datasets[idx])
    else:
-      max_logging.log(f"Run out of shards on host {self.dataloading_host_index}, shard {new_shard} is not available")
-      self.out_of_data = True
-      if self.generate_padding_example:
-        max_logging.log(
-            f"Host {self.dataloading_host_index} will start generating all-0 padding examples until step number is met."
-        )
+      raise StopIteration(f"Run out of shards on host {self.dataloading_host_index}, shard {new_shard} is not available")

   def __len__(self):
     """Return length of the HF dataset. Since HuggingFace IterableDataset does not have length,
@@ -329,20 +317,10 @@ def __getitem__(self, index):

     while True:
       try:
-        if self.out_of_data:
-          if self.generate_padding_example:
-            return {
-                column_name: np.zeros(self.max_target_lenth, dtype=np.int32) for column_name in self.data_column_names
-            }
-          else:
-            raise StopIteration("Running out of data")
        data = next(self.data_iters[idx])
        return data
-      except StopIteration as e:
-        if not self.out_of_data:
-          self._update_shard(idx)
-        else:
-          raise e
+      except StopIteration:
+        self._update_shard(idx)


########## Functions used by Grain pipeline
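Taken together with the `data_loader.py` change above, shard exhaustion now surfaces as a plain `StopIteration` from the input pipeline, which `DataLoader.load_next_batch` detects by type and converts into a stop-training signal. A condensed, self-contained sketch of that control flow follows; the `StopTraining` stand-in class and the toy batch data are hypothetical stand-ins, not MaxText code (the real class lives in MaxText's `exceptions` module).

```python
class StopTraining(Exception):
  """Stand-in for MaxText's exceptions.StopTraining."""

def load_next_batch(iterator):
  """Mirrors the new control flow: detect StopIteration by type, not by string matching."""
  try:
    return next(iterator)
  except Exception as e:  # pylint: disable=broad-except
    if isinstance(e, StopIteration):
      raise StopTraining(f"You may have run out of training data. Received {type(e)} exception: ({e})") from e
    raise StopTraining(f"`load_next_batch()` failed with {type(e)} exception: ({e}).") from e

# Toy usage: two real batches, then exhaustion on the third call.
batches = iter([{"inputs": [1, 2]}, {"inputs": [3, 4]}])
print(load_next_batch(batches))
print(load_next_batch(batches))
# A third call would raise StopTraining, because next() raises StopIteration here.
```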
