# Finetune PaliGemma2 workloads on A4 GKE Node pools with Hugging Face Accelerate

This recipe outlines the steps for running a PaliGemma2 finetune workload on
[A4 GKE Node pools](https://cloud.google.com/kubernetes-engine) using the
[Hugging Face Accelerate](https://huggingface.co/docs/accelerate/en/index) library.

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Finetuning job configuration and deployment - A Helm chart is used to configure and deploy
  the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource.

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster
  - [A regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version 1.32.4-gke.1236000 or later.
  - A GPU node pool with 1, 2, or 4
    [a4-highgpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-high-vms) nodes provisioned using the DENSE deployment type.
  - [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
  - [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
  - [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
  - [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
  - Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
- A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.

To prepare the required environment, see the
[GKE environment setup guide](../../../../docs/configuring-environment-gke-a4.md).

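If you already have a cluster and want to confirm that it meets the version requirement, one way to check is with `gcloud` (a quick check; replace the placeholders with your cluster name and region):

```bash
# Prints the control plane version; it should be 1.32.4-gke.1236000 or later.
gcloud container clusters describe CLUSTER_NAME \
  --region CLUSTER_REGION \
  --format="value(currentMasterVersion)"
```
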
## Training dataset

This recipe uses the [merve/vqav2-small](https://huggingface.co/datasets/merve/vqav2-small) dataset.

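The dataset is downloaded from the Hugging Face Hub when the job runs. If you want to inspect it locally first, one option is the Hugging Face CLI (a sketch; it assumes the `huggingface_hub` CLI is installed and that your token grants access):

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli download merve/vqav2-small --repo-type dataset --local-dir ./vqav2-small
```
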
## Docker container image

This recipe uses the following [Deep Learning Software Layer](https://cloud.google.com/ai-hypercomputer/docs/software-stack#cluster_images) container image:

`nvcr.io/nvidia/pytorch:25.01-py3`

This image is based on the NVIDIA NGC PyTorch 25.01 release. The workload uses the NCCL gIB plugin v1.1.0, which bundles all NCCL binaries validated for use with A4 GPUs.

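GKE pulls this image automatically when the job starts. If you want to inspect it locally first, you can pull and probe it with Docker (optional; assumes Docker is installed and you have sufficient disk space, since the image is large):

```bash
docker pull nvcr.io/nvidia/pytorch:25.01-py3
docker run --rm nvcr.io/nvidia/pytorch:25.01-py3 python -c "import torch; print(torch.__version__)"
```
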
## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET>
export KUEUE_NAME=<KUEUE_NAME>
export HF_TOKEN=<HF_TOKEN>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster, for example with the command shown after this list.
- `<HF_TOKEN>`: your Hugging Face token. You can create one [here](https://huggingface.co/settings/tokens).

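To verify the name of the local queue, you can list the Kueue `LocalQueue` resources in your cluster (this assumes the Kueue APIs are installed, as described in the test environment section; the `NAME` column shows the value to use for `<KUEUE_NAME>`):

```bash
kubectl get localqueues
```
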
Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/paligemma2
cd $RECIPE_ROOT
```

### Get cluster credentials

```
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

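To confirm that the credentials work, you can list the cluster nodes (a quick sanity check; node names and counts depend on your cluster):

```bash
# Lists all nodes in the cluster.
kubectl get nodes
# Optionally, show only GPU nodes (GKE sets this label on accelerator node pools).
kubectl get nodes -l cloud.google.com/gke-accelerator
```
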
### Configure and submit a finetuning job

_Update `<HF_TOKEN>` (your Hugging Face [token](https://huggingface.co/settings/tokens)) in [launcher.sh](./launcher.sh)._

#### Using 4 nodes (32 GPUs) BF16

The default job setting is 50 training steps and bf16 precision. To execute the job with the
default settings, run the following command from your client:

```bash
helm install $USER-paligemma2 ${RECIPE_ROOT} -f ${RECIPE_ROOT}/values.yaml \
  --set-file workload_launcher=${RECIPE_ROOT}/launcher.sh \
  --set-file workload_config=${RECIPE_ROOT}/main.py \
  --set workload.image=nvcr.io/nvidia/pytorch:25.01-py3 \
  --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
  --set volumes.gcsMounts[0].mountPath=/job-logs \
  --set workload.envs[0].value=/job-logs/${USER}-paligemma2
```

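After the command completes, you can check that the release and its JobSet were created (a quick check; the names below assume the default `$USER-paligemma2` release name used above):

```bash
helm list | grep $USER-paligemma2
kubectl get jobsets | grep $USER-paligemma2
```
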
#### Configure job settings

You can override any of the default
[training configuration envs](./main.py)
for this job. To do this, set the new env values in [launcher.sh](./launcher.sh).

**Examples**

- To set `PER_DEVICE_TRAIN_BATCH_SIZE` to 64, update the following in [launcher.sh](./launcher.sh):

  ```bash
  export PER_DEVICE_TRAIN_BATCH_SIZE=64
  ```

Then rerun the previous `helm install` command from your client.

### Monitor the job

To check the status of pods in your job, run the following command:

```
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:
- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-paligemma2`.

To get the logs for one of the pods, run the following command:

```
kubectl logs POD_NAME
```

Information about the training job's progress, including crucial details such as loss,
step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`.
For example: `user-paligemma2-workload-0-0-s9zrv`.

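To stream the rank 0 logs without copying the full pod name, you can combine the two commands above (a convenience sketch; it assumes a single running job whose rank 0 pod name starts with `$USER-paligemma2-workload-0-0`):

```bash
# Find the rank 0 pod and follow its log output.
RANK0_POD=$(kubectl get pods -o name | grep "$USER-paligemma2-workload-0-0" | head -n 1)
kubectl logs -f "$RANK0_POD"
```
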
### Troubleshooting

This section provides guidance on troubleshooting issues with the training job.

To check the status of the job's pods, use the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace `JOB_NAME_PREFIX` with the prefix of your job name. For example, `$USER-paligemma2`. This command lists all pods associated with the specified job, along with their current status.

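If a pod stays in `Pending` or fails to start, describing it shows recent events such as scheduling or image-pull problems (replace `POD_NAME` with the name of the affected pod):

```bash
kubectl describe pod POD_NAME
```
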
To get the logs from a specific pod, use the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of the pod you want to inspect.

In this recipe, the training job is orchestrated by the [Kubernetes JobSet](https://jobset.sigs.k8s.io/docs/overview/). If the JobSet encounters a fatal failure, it removes all pods, making it impossible to inspect their logs directly. To analyze logs from a failed job, retrieve them from Cloud Logging using the following filter:

```
resource.type="k8s_container"
resource.labels.project_id="PROJECT_ID"
resource.labels.location="CLUSTER_REGION"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.namespace_name="default"
resource.labels.pod_name=~"^JOB_NAME_PREFIX.*"
severity>=DEFAULT
```

Replace the following:
- `PROJECT_ID`: your Google Cloud project ID.
- `CLUSTER_REGION`: the region where your cluster is located.
- `CLUSTER_NAME`: the name of your GKE cluster.
- `JOB_NAME_PREFIX`: the prefix of your job name. For example, `$USER-paligemma2`.

This filter retrieves logs from all containers within pods that match the job with the specified name prefix.

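You can run the same query from the command line with `gcloud logging read` (a sketch; substitute your values for the placeholders and adjust `--freshness` and `--limit` as needed):

```bash
gcloud logging read '
  resource.type="k8s_container"
  resource.labels.project_id="PROJECT_ID"
  resource.labels.location="CLUSTER_REGION"
  resource.labels.cluster_name="CLUSTER_NAME"
  resource.labels.namespace_name="default"
  resource.labels.pod_name=~"^JOB_NAME_PREFIX.*"
  severity>=DEFAULT' \
  --project=PROJECT_ID --order=asc --freshness=1d --limit=1000
```
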
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart.
To uninstall Helm, run the following command from your client:

```bash
helm uninstall $USER-paligemma2
```
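
Uninstalling the release removes the JobSet and its pods, but it does not delete the logs already written to your Cloud Storage bucket. To confirm the release is gone, you can list the remaining releases (a quick check):

```bash
helm list | grep $USER-paligemma2 || echo "release removed"
```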