
Commit 1e4f05f

Merge pull request #18 from AI-Hypercomputer/junjieqian/paligemma2
PaliGemma2 recipe on A4 VM
2 parents a487884 + 2d38a80 commit 1e4f05f

10 files changed (+824 −1 lines changed)

README.md

Lines changed: 3 additions & 1 deletion
@@ -38,13 +38,15 @@ Models | GPU Machine Type

### Training benchmarks A4

-Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe
+Models | GPU Machine Type | Framework / Library | Workload Type | Orchestrator | Link to the recipe
 ------------------ | ---------------------------------------------------------------------------------------------------- | --------- | ------------- | ------------ | ------------------
 **Llama-3.1-70B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | MaxText | Pre-training | GKE | [Link](./training/a4/llama3-1-70b/maxtext-pretraining-gke/README.md)
 **Llama-3.1-70B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo | Pre-training | GKE | [Link](./training/a4/llama3-1-70b/nemo-pretraining-gke/README.md)
 **Llama-3.1-405B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | MaxText | Pre-training | GKE | [Link](./training/a4/llama3-1-405b/maxtext-pretraining-gke/README.md)
 **Llama-3.1-405B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo | Pre-training | GKE | [Link](./training/a4/llama3-1-405b/nemo-pretraining-gke/README.md)
 **Mixtral-8-7B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo | Pre-training | GKE | [Link](./training/a4/mixtral-8x7b/nemo-pretraining-gke/README.md)
+**PaliGemma2** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | Hugging Face Accelerate | Finetuning | GKE | [Link](./training/a4/paligemma2/README.md)

### Inference benchmarks A3 Mega

training/a4/paligemma2/Chart.yaml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"

training/a4/paligemma2/README.md

Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
# Finetune PaliGemma2 workloads on A4 GKE Node pools with Hugging Face Accelerate

This recipe outlines the steps for running a PaliGemma2 finetuning workload on
[A4 GKE Node pools](https://cloud.google.com/kubernetes-engine) using the
[Hugging Face Accelerate](https://huggingface.co/docs/accelerate/en/index) library.

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Finetuning job configuration and deployment - A Helm chart is used to configure and deploy
  the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource.

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster
  - [A regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version 1.32.4-gke.1236000 or later.
  - A GPU node pool with 1, 2, or 4 [a4-highgpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-high-vms) machines, provisioned using the DENSE deployment type.
  - [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
  - [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
  - [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
  - [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
  - Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
- A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.

To prepare the required environment, see the
[GKE environment setup guide](../../../../docs/configuring-environment-gke-a4.md). A quick way to confirm that the Kueue and JobSet APIs are installed is shown below.

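As a quick sanity check of these prerequisites, you can look for the Kueue and JobSet CRDs and for GPU nodes directly. This is a minimal sketch, assuming `kubectl` already has credentials for the cluster (see "Get cluster credentials" below):

```bash
# Verify that the Kueue and JobSet APIs are installed in the cluster
kubectl get crds | grep -E 'kueue|jobset'

# Verify that GPU nodes are visible (the accelerator label value depends on your node pool)
kubectl get nodes -L cloud.google.com/gke-accelerator
```
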
## Training dataset

This recipe uses the [merve/vqav2-small](https://huggingface.co/datasets/merve/vqav2-small) dataset.

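If you want a quick look at the dataset before launching the job, you can load it locally with the Hugging Face `datasets` library. This is an optional sketch, assuming a local Python environment; it only downloads and prints the dataset's splits and columns:

```bash
# Optional: download the dataset locally and print its structure
pip install datasets
python -c "
from datasets import load_dataset
ds = load_dataset('merve/vqav2-small')
print(ds)
"
```
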
## Docker container image

This recipe uses the following [Deep Learning Software Layer](https://cloud.google.com/ai-hypercomputer/docs/software-stack#cluster_images) container image:

`nvcr.io/nvidia/pytorch:25.01-py3`

This image is based on the NVIDIA PyTorch 25.01 NGC release and contains the NCCL gIB plugin v1.1.0, which bundles all NCCL binaries validated for use with A4 GPUs.

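If you want to inspect the container image before GKE pulls it, you can pull it on your workstation. A minimal sketch, assuming Docker is installed locally (no GPU is needed just to check the bundled versions):

```bash
# Optional: pull the image and print the PyTorch and CUDA versions it ships
docker pull nvcr.io/nvidia/pytorch:25.01-py3
docker run --rm nvcr.io/nvidia/pytorch:25.01-py3 \
  python -c "import torch; print(torch.__version__, torch.version.cuda)"
```
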
## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET>
export KUEUE_NAME=<KUEUE_NAME>
export HF_TOKEN=<HF_TOKEN>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster, as shown below.
- `<HF_TOKEN>`: your Hugging Face token. You can create one [here](https://huggingface.co/settings/tokens).

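To verify the name of the local queue, you can list the Kueue LocalQueue resources in the cluster. A minimal sketch, assuming `kubectl` is already configured for your cluster and the queue lives in the `default` namespace:

```bash
# List Kueue local queues; use the NAME column value as <KUEUE_NAME>
kubectl get localqueues -n default
```
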
Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/paligemma2
cd $RECIPE_ROOT
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Configure and submit a finetuning job

_Update `<HF_TOKEN>` (your Hugging Face [token](https://huggingface.co/settings/tokens)) in [launcher.sh](./launcher.sh)._

#### Using 4 nodes (32 GPUs) with BF16 precision

The default job settings are defined in [launcher.sh](./launcher.sh): 1 training epoch with a per-device batch size of 8. To execute the job with the default settings, run the following command from your client:

```bash
helm install $USER-paligemma2 ${RECIPE_ROOT} -f ${RECIPE_ROOT}/values.yaml \
    --set-file workload_launcher=${RECIPE_ROOT}/launcher.sh \
    --set-file workload_config=${RECIPE_ROOT}/main.py \
    --set workload.image=nvcr.io/nvidia/pytorch:25.01-py3 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/${USER}-paligemma2
```

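After submitting the job, you can confirm that the Helm release and the underlying JobSet were created. A minimal sketch, assuming the release name used in the command above:

```bash
# Confirm the Helm release exists
helm list --filter "$USER-paligemma2"

# Confirm the JobSet and its pods were created
kubectl get jobsets | grep "$USER-paligemma2"
kubectl get pods | grep "$USER-paligemma2"
```
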
#### Configure job settings

You can override any of the default
[training configuration environment variables](./main.py)
for this job. To do this, set the new values in [launcher.sh](./launcher.sh).

**Examples**

- To set `PER_DEVICE_TRAIN_BATCH_SIZE` to 64, update the following in [launcher.sh](./launcher.sh):

  ```bash
  export PER_DEVICE_TRAIN_BATCH_SIZE=64
  ```

Then run the previous `helm install` command from your client. If you want to inspect the manifests that will be generated before submitting, see the preview sketch below.

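To preview the manifests that the chart will generate, without submitting anything to the cluster, you can render it locally with `helm template`. A sketch that mirrors the flags of the install command above:

```bash
# Render the chart locally and page through the generated JobSet manifest
helm template $USER-paligemma2 ${RECIPE_ROOT} -f ${RECIPE_ROOT}/values.yaml \
    --set-file workload_launcher=${RECIPE_ROOT}/launcher.sh \
    --set-file workload_config=${RECIPE_ROOT}/main.py \
    --set workload.image=nvcr.io/nvidia/pytorch:25.01-py3 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/${USER}-paligemma2 \
    | less
```
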
### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:
- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-paligemma2`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Information about the training job's progress, including crucial details such as loss,
step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`.
For example: `user-paligemma2-workload-0-0-s9zrv`.

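To follow training progress continuously, you can resolve the rank 0 pod name and stream its logs. A minimal sketch, assuming the pod naming pattern described above and the default release name:

```bash
# Stream logs from the rank 0 pod; adjust the prefix if you used a different release name
RANK0_POD=$(kubectl get pods --no-headers -o custom-columns=":metadata.name" \
  | grep "^${USER}-paligemma2-workload-0-0" | head -n 1)
kubectl logs -f "$RANK0_POD"
```
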
### Troubleshooting

This section provides guidance on troubleshooting issues with the training job.

To check the status of the job's pods, use the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace `JOB_NAME_PREFIX` with the prefix of your job name, for example `$USER-paligemma2`. This command lists all pods associated with the specified job, along with their current status.

To get the logs from a specific pod, use the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of the pod you want to inspect.

In this recipe, the training job is orchestrated by the [Kubernetes JobSet](https://jobset.sigs.k8s.io/docs/overview/). If the JobSet encounters a fatal failure, it removes all pods, making it impossible to inspect their logs directly. To analyze logs from a failed job, retrieve them from Cloud Logging using the following filter:

```
resource.type="k8s_container"
resource.labels.project_id="PROJECT_ID"
resource.labels.location="CLUSTER_REGION"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.namespace_name="default"
resource.labels.pod_name=~"^JOB_NAME_PREFIX.*"
severity>=DEFAULT
```

Replace the following:
- `PROJECT_ID`: your Google Cloud project ID.
- `CLUSTER_REGION`: the region where your cluster is located.
- `CLUSTER_NAME`: the name of your GKE cluster.
- `JOB_NAME_PREFIX`: the prefix of your job name (for example, `$USER-paligemma2`).

This filter retrieves logs from all containers within pods whose names match the specified job name prefix. You can paste it into the Logs Explorer or query it from the command line, as shown below.

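The same filter can also be queried from your workstation with the `gcloud` CLI instead of the Logs Explorer. A minimal sketch, using the same placeholder values described above:

```bash
# Read the failed job's container logs from Cloud Logging (newest entries first)
gcloud logging read '
resource.type="k8s_container"
resource.labels.location="CLUSTER_REGION"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.namespace_name="default"
resource.labels.pod_name=~"^JOB_NAME_PREFIX.*"
severity>=DEFAULT
' --project=PROJECT_ID --limit=200 --order=desc
```
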
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart.
To uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-paligemma2
```
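To confirm that the release and its resources are gone, you can check Helm and the cluster afterwards. A minimal sketch, assuming the default release name:

```bash
# Verify that the release and its pods have been removed
helm list --filter "$USER-paligemma2"
kubectl get pods | grep "$USER-paligemma2" || echo "no pods remaining"
```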

training/a4/paligemma2/launcher.sh

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
#!/bin/bash
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

export HF_TOKEN=
export PYTHONUNBUFFERED=1

# Install the Python dependencies required by the finetuning script.
pip3 install \
    transformers==4.46.3 \
    datasets \
    accelerate \
    peft \
    bitsandbytes \
    pillow \
    tensorboard

# Make the NCCL plugin libraries visible to the dynamic linker.
export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
ldconfig "$LD_LIBRARY_PATH"
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/  /'
echo ""

# Each pod derives its node rank from the Job completion index.
export NODE_RANK=$JOB_COMPLETION_INDEX
export HYDRA_FULL_ERROR=1
export NVIDIA_VISIBLE_DEVICES=0

echo "Launching Torch distributed as node rank $NODE_RANK out of $NNODES nodes"

# Wrap the training entry point so Accelerate can launch it with --no_python.
mkdir /app
cat > /app/train.sh << 'EOF'
#!/bin/bash

python "$PYTHON_MAIN"
EOF

export TOKENIZERS_PARALLELISM=false
export NVTE_UB_SOCKET_IFNAME="eth1"

# Training parameters
export NUM_TRAIN_EPOCHS=1
export PER_DEVICE_TRAIN_BATCH_SIZE=8
export GRADIENT_ACCUMULATION_STEPS=2

chmod +x /app/train.sh

accelerate launch --no_python /app/train.sh
