# Pretrain Llama-3.1-8B workloads on A4X GKE node pools with NVIDIA NeMo Framework

This recipe outlines the steps for running a Llama-3.1-8B pretraining workload
on [A4X GKE node pools](https://cloud.google.com/kubernetes-engine) by using the
[NVIDIA NeMo framework](https://github.com/NVIDIA/nemo).

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Pretraining job configuration and deployment - A Helm chart is used to configure and deploy
  the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource, which manages the execution of the
  [NeMo pretraining workload](https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/megatron_gpt_pretraining.py).

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster: follow the Cluster Toolkit
  [instructions](https://github.com/GoogleCloudPlatform/cluster-toolkit/tree/main/examples/gke-a4x)
  to create your A4X GKE cluster.

## Training dataset

This recipe uses a mock pretraining dataset provided by the NeMo framework.

## Docker container images

This recipe uses the following container images:

- `nvcr.io/nvidia/nemo:25.07`
- `us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.6`

The first image is based on NVIDIA NeMo 25.07. The second provides the NCCL gIB
plugin v1.0.6, which bundles the NCCL binaries validated for use with A4X GPUs.
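
Optionally, to confirm that you can access these images from your workstation, you can pull
them ahead of time. This assumes Docker is installed locally and that you are authenticated
to the registries if required; note that the NCCL plugin image is built for `arm64`, the host
CPU architecture of A4X nodes.

```bash
# Optional: pre-pull the container images to verify access.
docker pull nvcr.io/nvidia/nemo:25.07
docker pull us-docker.pkg.dev/gce-ai-infra/gpudirect-gib/nccl-plugin-gib-arm64:v1.0.6
```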

## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

 ```bash
 export PROJECT_ID=<PROJECT_ID>
 export CLUSTER_REGION=<CLUSTER_REGION>
 export CLUSTER_NAME=<CLUSTER_NAME>
 export GCS_BUCKET=<GCS_BUCKET> # Don't include the gs:// prefix
 export KUEUE_NAME=<KUEUE_NAME>
 ```

Replace the following values:

 - `<PROJECT_ID>`: your Google Cloud project ID.
 - `<CLUSTER_REGION>`: the region where your cluster is located.
 - `<CLUSTER_NAME>`: the name of your GKE cluster.
 - `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
 - `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4x`. Make sure to verify the name of the local queue in your cluster.
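
For example, a filled-in configuration might look like the following. All values below are
hypothetical placeholders except `a4x`, the default local queue name; substitute your own values.

```bash
# Example only - replace these hypothetical values with your own.
export PROJECT_ID=my-gcp-project
export CLUSTER_REGION=us-central1
export CLUSTER_NAME=my-a4x-cluster
export GCS_BUCKET=my-training-logs-bucket   # no gs:// prefix
export KUEUE_NAME=a4x
```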

Set the default project:

 ```bash
 gcloud config set project $PROJECT_ID
 ```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4x/llama3-1-8b/nemo-pretraining-gke/1node
cd $RECIPE_ROOT
```
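
You can quickly confirm that you are in the right folder. The Helm commands below assume that
`values.yaml`, `launcher.sh`, and `llama31-8b.py` are present here:

```bash
# The recipe folder should contain the Helm chart files referenced below.
ls $RECIPE_ROOT
# Expected to include: values.yaml, launcher.sh, llama31-8b.py
```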

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```
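
To verify that the credentials work and to check the name of the Kueue local queue that
`KUEUE_NAME` should point to, you can run the following commands. This sketch assumes Kueue
was installed as part of the Cluster Toolkit setup:

```bash
# Confirm the cluster is reachable.
kubectl get nodes

# List the Kueue local queues; use the matching name for KUEUE_NAME.
kubectl get localqueues --all-namespaces
```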

### Configure and submit a pretraining job

#### Using 1 node (4 GPUs) with BF16 precision

The default job setting is 20 training steps and BF16 precision. To execute the
job with the default settings, run the following command from your client:

 ```bash
 cd $RECIPE_ROOT
 export WORKLOAD_NAME=$USER-a4x-llama31-8b-bf16-1node
 helm install $WORKLOAD_NAME . -f values.yaml \
   --set-file workload_launcher=launcher.sh \
   --set-file workload_config=llama31-8b.py \
   --set workload.image=nvcr.io/nvidia/nemo:25.07 \
   --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
   --set volumes.gcsMounts[0].mountPath=/job-logs \
   --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
   --set queue=${KUEUE_NAME}
 ```
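
After submitting, you can confirm that the release was created and that the JobSet and its pods
exist. This is a minimal check and assumes the JobSet CRDs are installed on the cluster, which
is the case for clusters created with the Cluster Toolkit instructions above:

```bash
# Confirm the Helm release, the JobSet, and the pods were created.
helm list | grep $WORKLOAD_NAME
kubectl get jobsets | grep $WORKLOAD_NAME
kubectl get pods | grep $WORKLOAD_NAME
```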

**Examples**

- To set the number of training steps to 100, run the following command from
  your client:

  ```bash
  cd $RECIPE_ROOT
  export WORKLOAD_NAME=$USER-a4x-llama31-8b-bf16-1node-100steps
  helm install $WORKLOAD_NAME . -f values.yaml \
    --set-file workload_launcher=launcher.sh \
    --set-file workload_config=llama31-8b.py \
    --set workload.image=nvcr.io/nvidia/nemo:25.07 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
    --set queue=${KUEUE_NAME} \
    --set workload.arguments[0]="trainer.max_steps=100"
  ```
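
If you want to inspect the rendered Kubernetes manifests before creating any resources, you can
render the chart locally with the same flags. This is a sketch using the default configuration;
`helm template` renders templates client-side and may differ slightly from a real install if the
chart uses cluster lookups:

```bash
# Render the chart locally without creating any cluster resources.
helm template $WORKLOAD_NAME . -f values.yaml \
  --set-file workload_launcher=launcher.sh \
  --set-file workload_config=llama31-8b.py \
  --set workload.image=nvcr.io/nvidia/nemo:25.07 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/$WORKLOAD_NAME \
  --set queue=${KUEUE_NAME} | less
```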

### Monitor the job

To check the status of the pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:

- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-a4x-llama31-8b-bf16-1node`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of a pod returned by the previous command.

Information about the training job's progress, including crucial details such as loss,
step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`.
For example: `user-a4x-llama31-8b-bf16-1node-workload-0-0-s9zrv`.
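
For example, a minimal sketch for streaming the rank 0 logs, assuming the default workload name
used above and the `-workload-0-0` pod naming described here:

```bash
# Find the rank 0 pod and stream its logs.
RANK0_POD=$(kubectl get pods -o name | grep "$USER-a4x-llama31-8b-bf16-1node-workload-0-0" | head -n 1)
kubectl logs -f "$RANK0_POD"
```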

### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart. To
uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-a4x-llama31-8b-bf16-1node
```
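
Optionally, confirm that the release and its pods are gone; the pods may take a short time
to terminate:

```bash
# Neither command should return the workload once cleanup completes.
helm list | grep $USER-a4x-llama31-8b-bf16-1node
kubectl get pods | grep $USER-a4x-llama31-8b-bf16-1node
```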