
Commit 81713ab

Merge pull request #19 from AI-Hypercomputer/a4x-tony
Add llama3.1-8b on 1 node A4X as a recipe example.
2 parents 1e4f05f + d18a613 commit 81713ab

11 files changed (+791 -0 lines)

README.md

Lines changed: 5 additions & 0 deletions
@@ -47,6 +47,11 @@ Models | GPU Machine Type

**Mixtral-8-7B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo | Pre-training | GKE | [Link](./training/a4/mixtral-8x7b/nemo-pretraining-gke/README.md)
**PaliGemma2** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | Hugging Face Accelerate | Finetuning | GKE | [Link](./training/a4/paligemma2/README.md)

### Training benchmarks A4X

Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe
------------------ | ---------------------------------------------------------------------------------------------------- | --------- | ------------- | ------------ | ------------------
**Llama-3.1-8B** | [A4X (NVIDIA GB200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4x-vms) | NeMo | Pre-training | GKE | [Link](./training/a4x/llama3-1-8b/nemo-pretraining-gke/1node/README.md)

### Inference benchmarks A3 Mega

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4x_jobset_workload
description: a4x_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"
Lines changed: 155 additions & 0 deletions
@@ -0,0 +1,155 @@
# Pretrain Llama-3.1-8B workloads on A4X GKE node pools with NVIDIA NeMo Framework

This recipe outlines the steps for running a Llama-3.1-8B pretraining workload
on [A4X GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy
  the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster
  Please follow the Cluster Toolkit [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x) to create your A4X GKE cluster.

## Training dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.

## Docker container images

This recipe uses the following Docker images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.6`

The first image is based on NVIDIA NeMo 25.07; the second provides the NCCL gIB plugin
v1.0.6, bundling all NCCL binaries validated for use with A4X GPUs.

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET> # You don't need to add gs://
export KUEUE_NAME=<KUEUE_NAME>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4x`. Make sure to verify the name of the local queue in your cluster.

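For example, a filled-in configuration might look like the following. All values here are purely illustrative placeholders, not real resources:

```bash
export PROJECT_ID=my-gpu-project      # illustrative project ID
export CLUSTER_REGION=us-central1     # illustrative region
export CLUSTER_NAME=a4x-demo-cluster  # illustrative cluster name
export GCS_BUCKET=a4x-demo-logs       # bucket name only, no gs:// prefix
export KUEUE_NAME=a4x                 # Cluster Toolkit default local queue
```
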
Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3-1-8b/nemo-pretraining-gke/1node
cd $RECIPE_ROOT
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

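Optionally, confirm that you can reach the cluster and that the Kueue local queue you plan to use exists. This is a minimal sanity check; it assumes Kueue is already installed in the cluster (it is when the cluster is created with the Cluster Toolkit blueprint), and the queue name may differ from the `a4x` default:

```bash
# List the cluster nodes to confirm credentials work.
kubectl get nodes

# List Kueue local queues in all namespaces and confirm $KUEUE_NAME is present.
kubectl get localqueues --all-namespaces
```
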
### Configure and submit a pretraining job

#### Using 1 node (4 GPUs) BF16 precision

The default job setting is 20 training steps and bf16 precision. To execute the
job with the default settings, run the following command from your client:

```bash
cd $RECIPE_ROOT
export WORKLOAD_NAME=$USER-a4x-llama31-8b-bf16-1node
helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama31-8b.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME}
```

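After `helm install` returns, you can check that the release was created and that the workload has been admitted. This is a hedged sketch: the exact names of the JobSet and Kueue Workload objects are generated by the chart, so the commands below simply list them rather than assuming a specific name:

```bash
# Confirm the Helm release exists.
helm list --filter "$WORKLOAD_NAME"

# List JobSets and Kueue workloads; the entries for this release should
# include $WORKLOAD_NAME in their names once the job has been admitted.
kubectl get jobsets
kubectl get workloads
```
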
**Examples**

- To set the number of training steps to 100, run the following command from
  your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4x-llama31-8b-bf16-1node33
  helm install $WORKLOAD_NAME . -f values.yaml \
      --set-file workload_launcher=launcher.sh \
      --set-file workload_config=llama31-8b.py \
      --set workload.image=nvcr.io/nvidia/nemo:25.07 \
      --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
      --set volumes.gcsMounts[0].mountPath=/job-logs \
      --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
      --set queue=${KUEUE_NAME} \
      --set workload.arguments[0]="trainer.max_steps=100"
  ```

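Other NeMo CLI overrides can, in principle, be passed the same way. The sketch below assumes the chart forwards every entry of the `workload.arguments` list to the launcher (only the first entry is exercised in this recipe), so treat it as illustrative rather than a documented interface:

```bash
helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama31-8b.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="trainer.max_steps=50" \
    --set workload.arguments[1]="trainer.log_every_n_steps=5"
```
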
### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:
- `JOB_NAME_PREFIX` - your job name prefix. For example, `$USER-a4x-llama31-8b-bf16-1node`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of a pod returned by the previous command.

Information about the training job's progress, including crucial details such as loss,
step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`.
For example: `$USER-a4x-llama31-8b-bf16-1node-workload-0-0-s9zrv`.

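For convenience, you can locate the rank 0 pod and stream its logs in one step. A minimal sketch, assuming `WORKLOAD_NAME` is still set from the submit step:

```bash
# Find the rank 0 pod for this workload and follow its logs.
RANK0_POD=$(kubectl get pods -o name | grep "${WORKLOAD_NAME}-workload-0-0" | head -n 1)
kubectl logs -f "$RANK0_POD"
```
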
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4x-llama31-8b-bf16-1node
```

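After the run completes, and independently of whether the release has been uninstalled, the logs and Nsight Systems profiles copied by the launcher should be available in the Cloud Storage bucket mounted at `/job-logs`. A minimal sketch for listing them, assuming `workload.envs[0]` points the launcher's artifact directory at `/job-logs/$WORKLOAD_NAME` as in the submit commands above:

```bash
# List the artifacts copied to the bucket by the rank 0 pod.
gcloud storage ls -r gs://${GCS_BUCKET}/${WORKLOAD_NAME}/
```
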
Lines changed: 75 additions & 0 deletions
@@ -0,0 +1,75 @@
#!/bin/bash
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
ldconfig $LD_LIBRARY_PATH
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/ /'
echo ""

if [[ -n "${EXPLICIT_LOG_DIR}" ]]; then
  explicit_log_dir=${EXPLICIT_LOG_DIR}
else
  explicit_log_dir=workload_logs
fi
echo "Logging to ${explicit_log_dir}"

if [[ -n "${TOKENIZER_PATH}" ]]; then
  echo "Getting tokenizer files"
  cp ${TOKENIZER_PATH}/* .
  echo ""
fi

echo "Launching Torch distributed on the node rank $JOB_COMPLETION_INDEX out of $NNODES nodes"

# Update nemo run so we can export the config.
pip install git+https://github.com/NVIDIA/NeMo-Run.git@6550ff68204e5095452098eed3765ed765de5d33

# Export the nemo2 config to yaml.
python ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
  trainer.num_nodes="$NNODES" \
  log.explicit_log_dir="${explicit_log_dir}" \
  trainer.max_steps=10 trainer.num_nodes=1 trainer.devices=4 \
  --to-yaml exported_nemo_config.yaml

# Create the nsys directory.
mkdir -p ${explicit_log_dir}/nsys

OMP_NUM_THREADS=12 NSYS_CONFIG_DIRECTIVES="AgentLaunchTimeoutSec=240;AppLaunchTimeoutSec=240" TORCH_NCCL_ENABLE_MONITORING=0 \
  /usr/local/bin/nsys profile -s none -t nvtx,cuda --capture-range=cudaProfilerApi --capture-range-end=stop \
  -o ${explicit_log_dir}/nsys/noderank-${JOB_COMPLETION_INDEX} \
  --session-new "nemo-rank${JOB_COMPLETION_INDEX}"-$RANDOM \
  --wait all \
  torchrun \
  --nproc-per-node="${GPUS_PER_NODE}" \
  --nnodes="${NNODES}" \
  --node_rank="${JOB_COMPLETION_INDEX}" \
  --rdzv_id="${JOB_IDENTIFIER}" \
  --master_addr="${MASTER_ADDR}" \
  --master_port="${MASTER_PORT}" \
  ${NEMO_LAUNCH_SCRIPT} --factory "recipe()" \
  trainer.num_nodes="$NNODES" \
  log.explicit_log_dir="${explicit_log_dir}" \
  trainer.max_steps=10 trainer.num_nodes=1 trainer.devices=4

if [[ "$JOB_COMPLETION_INDEX" == "0" ]]; then
  mkdir -p ${ARTIFACT_DIR}
  cp -r ${explicit_log_dir}/* ${ARTIFACT_DIR}/
  cp ${NEMO_LAUNCH_SCRIPT} ${ARTIFACT_DIR}/run-cli.py
  cp exported_nemo_config.yaml ${ARTIFACT_DIR}/nemo-configuration.yaml
  env > ${ARTIFACT_DIR}/environ.txt
  ls ${ARTIFACT_DIR}
fi

echo "Training completed"
echo "Pod on $(hostname --fqdn) is exiting"

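The launcher reads its runtime parameters from environment variables that the Helm chart and the JobSet controller normally inject into each pod, and it relies on tools (nsys, torchrun, the NeMo recipe) that exist only inside the NeMo container. The following sketch is purely illustrative, to document which variables the script expects; the values, and in particular the two paths, are hypothetical and not part of the chart:

```bash
# Hypothetical single-node values; in the recipe these are set by the Helm chart/JobSet.
export NNODES=1 GPUS_PER_NODE=4 JOB_COMPLETION_INDEX=0
export MASTER_ADDR=localhost MASTER_PORT=29500 JOB_IDENTIFIER=llama31-8b-demo
export NEMO_LAUNCH_SCRIPT=/workload/llama31-8b.py   # hypothetical path to the NeMo recipe script
export NCCL_PLUGIN_PATH=/usr/local/gib/lib64        # hypothetical path to the NCCL gIB plugin libraries
export EXPLICIT_LOG_DIR=/job-logs/demo ARTIFACT_DIR=/job-logs/demo
bash launcher.sh
```
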
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Nemo2 pretraining recipe for Llama 3.1 8B model."""

from nemo.collections import llm
from nemo.collections.llm.recipes import llama31_8b
from nemo.lightning.pytorch.callbacks import NsysCallback
from nemo.lightning.pytorch.callbacks.flops_callback import FLOPsMeasurementCallback
import nemo_run as run


def recipe(
    profile_enabled: bool = False,
    profile_start_step: int = 0,
    profile_end_step: int = 0,
    profile_ranks: str = "0",
) -> run.Partial:
  """Returns a Nemo2 training recipe for Llama 3.1 8B model.

  Args:
    profile_enabled: Whether to enable Nsys profiling.
    profile_start_step: The step to start profiling.
    profile_end_step: The step to end profiling.
    profile_ranks: The ranks to profile, comma separated.

  Returns:
    A Nemo2 training recipe.
  """
  # Start from the Nemo standard recipe.
  pretrain = llama31_8b.pretrain_recipe(performance_mode=True)

  # Set the number of steps to 20 for a quicker benchmark.
  pretrain.trainer.max_steps = 20

  # Disable validation batches.
  pretrain.trainer.limit_val_batches = 0.0
  pretrain.trainer.val_check_interval = 100

  # Add the Nsys profiling callback if enabled.
  if profile_enabled:
    pretrain.trainer.callbacks.append(
        run.Config(
            NsysCallback,
            start_step=profile_start_step,
            end_step=profile_end_step,
            ranks=[int(x) for x in profile_ranks.split(",")],
            gen_shape=False,
        )
    )

  # Add the FLOPs measurement callback.
  pretrain.trainer.callbacks.append(
      run.Config(
          FLOPsMeasurementCallback,
          model_name="llama31-8b",
          model_config=pretrain.model.config,
          data_config=pretrain.data,
      )
  )

  # Disable checkpointing.
  pretrain.log.ckpt = None
  pretrain.trainer.enable_checkpointing = False

  # Log every step.
  pretrain.trainer.log_every_n_steps = 1

  return pretrain


if __name__ == "__main__":
  run.cli.main(llm.pretrain, default_factory=recipe)

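To enable Nsight Systems profiling, the `recipe()` factory can be invoked with its profiling arguments. A minimal sketch using the same NeMo-Run CLI pattern as `launcher.sh`, typically run inside the NeMo container; the step range and rank list are only examples:

```bash
python llama31-8b.py --factory "recipe(profile_enabled=True, profile_start_step=5, profile_end_step=8, profile_ranks='0')" \
  trainer.max_steps=20 trainer.num_nodes=1 trainer.devices=4 \
  --to-yaml exported_nemo_config.yaml
```
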
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
helm install a4x-llama31-8b-1node . -f values.yaml --set-file workload_launcher=launcher.sh --set-file workload_config=llama31-8b.py --set workload.image=nvcr.io/nvidia/nemo:25.07 --set volumes.gcsMounts[0].bucketName=ubench-logs --set volumes.gcsMounts[0].mountPath=/job-logs --set workload.envs[0].value=/job-logs/a4x-llama31-8b-1node --set queue=a4x
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
# yamllint disable
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v1
kind: ConfigMap
metadata:
  name: "{{ .Release.Name }}-config"
data:
  workload-configuration: |-
    {{- if .Values.workload_config }}
    {{ .Values.workload_config | nindent 4 }}
    {{- else }}
    {{ "config: null" | nindent 4 }}
    {{- end }}
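
To see how this template renders the workload configuration into the ConfigMap, you can render the chart locally without installing anything. A minimal sketch, run from the recipe folder; the `--set` flags mirror the submit command above and `my-bucket` is only a placeholder, so add any further values your `values.yaml` requires:

```bash
# Render the chart locally and show the generated ConfigMap.
helm template preview . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama31-8b.py \
    --set volumes.gcsMounts[0].bucketName=my-bucket \
    --set queue=a4x \
  | grep -A 20 "kind: ConfigMap"
```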
