# Pretrain Llama-3.1-8B workloads on A4X GKE node pools with NVIDIA NeMo Framework

This recipe outlines the steps for running a Llama-3.1-8B pretraining workload
on [A4X GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy
  the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster: follow the Cluster Toolkit
  [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x)
  to create your A4X GKE cluster.

## Training dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.

## Docker container images

This recipe uses the following container images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.6`

The first image is based on NVIDIA NeMo 25.07. The second provides the NCCL gIB
plugin v1.0.6, which bundles the NCCL binaries validated for use with A4X GPUs.
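
Optionally, to confirm that you can access these images from your workstation, you can pull
them ahead of time. This assumes Docker is installed locally and that you are authenticated
to the registries if required; note that the NCCL plugin image is built for `arm64`, the host
CPU architecture of A4X nodes.

```bash
# Optional: pre-pull the container images to verify access.
docker pull nvcr.io/nvidia/nemo:25.07
docker pull us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.6
```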

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

 ```bash
 export PROJECT_ID=<PROJECT_ID>
 export CLUSTER_REGION=<CLUSTER_REGION>
 export CLUSTER_NAME=<CLUSTER_NAME>
 export GCS_BUCKET=<GCS_BUCKET> # Don't include the gs:// prefix
 export KUEUE_NAME=<KUEUE_NAME>
 ```

Replace the following values:

 - `<PROJECT_ID>`: your Google Cloud project ID.
 - `<CLUSTER_REGION>`: the region where your cluster is located.
 - `<CLUSTER_NAME>`: the name of your GKE cluster.
 - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
 - `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4x`. Make sure to verify the name of the local queue in your cluster.
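
For example, a filled-in configuration might look like the following. All values below are
hypothetical placeholders except `a4x`, the default local queue name; substitute your own values.

```bash
# Example only - replace these hypothetical values with your own.
export PROJECT_ID=my-gcp-project
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a4x-cluster
export GCS_BUCKET=my-training-logs-bucket   # no gs:// prefix
export KUEUE_NAME=a4x
```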

Set the default project:

 ```bash
 gcloud config set project $PROJECT_ID
 ```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3-1-8b/nemo-pretraining-gke/1node
cd $RECIPE_ROOT
```
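
You can quickly confirm that you are in the right folder. The Helm commands below assume that
`values.yaml`, `launcher.sh`, and `llama31-8b.py` are present here:

```bash
# The recipe folder should contain the Helm chart files referenced below.
ls $RECIPE_ROOT
# Expected to include: values.yaml, launcher.sh, llama31-8b.py
```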

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
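
To verify that the credentials work and to check the name of the Kueue local queue that
`KUEUE_NAME` should point to, you can run the following commands. This sketch assumes Kueue
was installed as part of the Cluster Toolkit setup:

```bash
# Confirm the cluster is reachable.
kubectl get nodes

# List the Kueue local queues; use the matching name for KUEUE_NAME.
kubectl get localqueues --all-namespaces
```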

### Configure and submit a pretraining job

#### Using 1 node (4 GPUs) with BF16 precision

The default job setting is 20 training steps and BF16 precision. To execute the
job with the default settings, run the following command from your client:

 ```bash
 cd $RECIPE_ROOT
 export WORKLOAD_NAME=$USER-a4x-llama31-8b-bf16-1node
 helm install $WORKLOAD_NAME . -f values.yaml \
   --set-file workload_launcher=launcher.sh \
   --set-file workload_config=llama31-8b.py \
   --set workload.image=nvcr.io/nvidia/nemo:25.07 \
   --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
   --set volumes.gcsMounts[0].mountPath=/job-logs \
   --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
   --set queue=${KUEUE_NAME}
 ```
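
After submitting, you can confirm that the release was created and that the JobSet and its pods
exist. This is a minimal check and assumes the JobSet CRDs are installed on the cluster, which
is the case for clusters created with the Cluster Toolkit instructions above:

```bash
# Confirm the Helm release, the JobSet, and the pods were created.
helm list | grep $WORKLOAD_NAME
kubectl get jobsets | grep $WORKLOAD_NAME
kubectl get pods | grep $WORKLOAD_NAME
```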

**Examples**

- To set the number of training steps to 100, run the following command from
  your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4x-llama31-8b-bf16-1node-100steps
  helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama31-8b.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="trainer.max_steps=100"
  ```
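
If you want to inspect the rendered Kubernetes manifests before creating any resources, you can
render the chart locally with the same flags. This is a sketch using the default configuration;
`helm template` renders templates client-side and may differ slightly from a real install if the
chart uses cluster lookups:

```bash
# Render the chart locally without creating any cluster resources.
helm template $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama31-8b.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME} | less
```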

### Monitor the job

To check the status of the pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-a4x-llama31-8b-bf16-1node`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of a pod returned by the previous command.

Information about the training job's progress, including crucial details such as loss,
step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`.
For example: `user-a4x-llama31-8b-bf16-1node-workload-0-0-s9zrv`.
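
For example, a minimal sketch for streaming the rank 0 logs, assuming the default workload name
used above and the `-workload-0-0` pod naming described here:

```bash
# Find the rank 0 pod and stream its logs.
RANK0_POD=$(kubectl get pods -o name | grep "$USER-a4x-llama31-8b-bf16-1node-workload-0-0" | head -n 1)
kubectl logs -f "$RANK0_POD"
```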

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4x-llama31-8b-bf16-1node
```
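
Optionally, confirm that the release and its pods are gone; the pods may take a short time
to terminate:

```bash
# Neither command should return the workload once cleanup completes.
helm list | grep $USER-a4x-llama31-8b-bf16-1node
kubectl get pods | grep $USER-a4x-llama31-8b-bf16-1node
```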