
Commit 1e4f05f

Merge pull request #18 from AI-Hypercomputer/junjieqian/paligemma2
PaliGemma2 recipe on A4 VM
2 parents a487884 + 2d38a80 commit 1e4f05f

10 files changed (+824 −1 lines changed)

README.md

Lines changed: 3 additions & 1 deletion
@@ -38,13 +38,15 @@ Models | GPU Machine Type

### Training benchmarks A4

-Models | GPU Machine Type | Framework | Workload Type | Orchestrator | Link to the recipe
+Models | GPU Machine Type | Framework / Library | Workload Type | Orchestrator | Link to the recipe
 ------------------ | ---------------------------------------------------------------------------------------------------- | --------- | ------------- | ------------ | ------------------
 **Llama-3.1-70B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | MaxText | Pre-training | GKE | [Link](./training/a4/llama3-1-70b/maxtext-pretraining-gke/README.md)
 **Llama-3.1-70B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo | Pre-training | GKE | [Link](./training/a4/llama3-1-70b/nemo-pretraining-gke/README.md)
 **Llama-3.1-405B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | MaxText | Pre-training | GKE | [Link](./training/a4/llama3-1-405b/maxtext-pretraining-gke/README.md)
 **Llama-3.1-405B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo | Pre-training | GKE | [Link](./training/a4/llama3-1-405b/nemo-pretraining-gke/README.md)
 **Mixtral-8-7B** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | NeMo | Pre-training | GKE | [Link](./training/a4/mixtral-8x7b/nemo-pretraining-gke/README.md)
+**PaliGemma2** | [A4 (NVIDIA B200)](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-vms) | Hugging Face Accelerate | Finetuning | GKE | [Link](./training/a4/paligemma2/README.md)

### Inference benchmarks A3 Mega

training/a4/paligemma2/Chart.yaml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: v2
name: a4_jobset_workload
description: a4_jobset_workload
type: application
version: 0.1.0
appVersion: "1.16.0"

training/a4/paligemma2/README.md

Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
# Finetune PaliGemma2 workloads on A4 GKE Node pools with Hugging Face Accelerate

This recipe outlines the steps for running a PaliGemma2 finetuning workload on
[A4 GKE Node pools](https://cloud.google.com/kubernetes-engine) using the
[Hugging Face Accelerate](https://huggingface.co/docs/accelerate/en/index) library.

## Orchestration and deployment tools

For this recipe, the following setup is used:

- Orchestration - [Google Kubernetes Engine (GKE)](https://cloud.google.com/kubernetes-engine)
- Finetuning job configuration and deployment - A Helm chart is used to configure and deploy
  the [Kubernetes JobSet](https://kubernetes.io/blog/2025/03/23/introducing-jobset)
  resource.

## Test environment

This recipe has been optimized for and tested with the following configuration:

- GKE cluster
  - [A regional standard cluster](https://cloud.google.com/kubernetes-engine/docs/concepts/configuration-overview) version 1.32.4-gke.1236000 or later.
  - A GPU node pool with 1, 2, or 4 [a4-highgpu-8g](https://cloud.google.com/compute/docs/accelerator-optimized-machines#a4-high-vms) machines, provisioned using the DENSE deployment type.
  - [Workload Identity Federation for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity) enabled.
  - [Cloud Storage FUSE CSI driver for GKE](https://cloud.google.com/kubernetes-engine/docs/concepts/cloud-storage-fuse-csi-driver) enabled.
  - [DCGM metrics](https://cloud.google.com/kubernetes-engine/docs/how-to/dcgm-metrics) enabled.
  - [Kueue](https://kueue.sigs.k8s.io/docs/reference/kueue.v1beta1/) and [JobSet](https://jobset.sigs.k8s.io/docs/overview/) APIs installed.
  - Kueue configured to support [Topology Aware Scheduling](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/).
- A regional Google Cloud Storage (GCS) bucket to store logs generated by the recipe runs.

To prepare the required environment, see the
[GKE environment setup guide](../../../../docs/configuring-environment-gke-a4.md). A quick way to confirm that the Kueue and JobSet APIs are installed is shown below.

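As a quick sanity check of these prerequisites, you can look for the Kueue and JobSet CRDs and for GPU nodes directly. This is a minimal sketch, assuming `kubectl` already has credentials for the cluster (see "Get cluster credentials" below):

```bash
# Verify that the Kueue and JobSet APIs are installed in the cluster
kubectl get crds | grep -E 'kueue|jobset'

# Verify that GPU nodes are visible (the accelerator label value depends on your node pool)
kubectl get nodes -L cloud.google.com/gke-accelerator
```
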
## Training dataset

This recipe uses the [merve/vqav2-small](https://huggingface.co/datasets/merve/vqav2-small) dataset.

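If you want a quick look at the dataset before launching the job, you can load it locally with the Hugging Face `datasets` library. This is an optional sketch, assuming a local Python environment; it only downloads and prints the dataset's splits and columns:

```bash
# Optional: download the dataset locally and print its structure
pip install datasets
python -c "
from datasets import load_dataset
ds = load_dataset('merve/vqav2-small')
print(ds)
"
```
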
## Docker container image

This recipe uses the following [Deep Learning Software Layer](https://cloud.google.com/ai-hypercomputer/docs/software-stack#cluster_images) container image:

`nvcr.io/nvidia/pytorch:25.01-py3`

This image is based on the NVIDIA PyTorch 25.01 NGC release and contains the NCCL gIB plugin v1.1.0, which bundles all NCCL binaries validated for use with A4 GPUs.

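If you want to inspect the container image before GKE pulls it, you can pull it on your workstation. A minimal sketch, assuming Docker is installed locally (no GPU is needed just to check the bundled versions):

```bash
# Optional: pull the image and print the PyTorch and CUDA versions it ships
docker pull nvcr.io/nvidia/pytorch:25.01-py3
docker run --rm nvcr.io/nvidia/pytorch:25.01-py3 \
  python -c "import torch; print(torch.__version__, torch.version.cuda)"
```
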
## Run the recipe

From your client workstation, complete the following steps:

### Configure environment settings

Set the environment variables to match your environment:

```bash
export PROJECT_ID=<PROJECT_ID>
export CLUSTER_REGION=<CLUSTER_REGION>
export CLUSTER_NAME=<CLUSTER_NAME>
export GCS_BUCKET=<GCS_BUCKET>
export KUEUE_NAME=<KUEUE_NAME>
export HF_TOKEN=<HF_TOKEN>
```

Replace the following values:

- `<PROJECT_ID>`: your Google Cloud project ID.
- `<CLUSTER_REGION>`: the region where your cluster is located.
- `<CLUSTER_NAME>`: the name of your GKE cluster.
- `<GCS_BUCKET>`: the name of your Cloud Storage bucket. Don't include the `gs://` prefix.
- `<KUEUE_NAME>`: the name of the Kueue local queue. The default queue created by the Cluster Toolkit is `a4`. Make sure to verify the name of the local queue in your cluster, as shown below.
- `<HF_TOKEN>`: your Hugging Face token. You can create one [here](https://huggingface.co/settings/tokens).

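To verify the name of the local queue, you can list the Kueue LocalQueue resources in the cluster. A minimal sketch, assuming `kubectl` is already configured for your cluster and the queue lives in the `default` namespace:

```bash
# List Kueue local queues; use the NAME column value as <KUEUE_NAME>
kubectl get localqueues -n default
```
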
Set the default project:

```bash
gcloud config set project $PROJECT_ID
```

### Get the recipe

Clone the `gpu-recipes` repository and set a reference to the recipe folder.

```bash
git clone https://github.com/ai-hypercomputer/gpu-recipes.git
cd gpu-recipes
export REPO_ROOT=`git rev-parse --show-toplevel`
export RECIPE_ROOT=$REPO_ROOT/training/a4/paligemma2
cd $RECIPE_ROOT
```

### Get cluster credentials

```bash
gcloud container clusters get-credentials $CLUSTER_NAME --region $CLUSTER_REGION
```

### Configure and submit a finetuning job

_Update `<HF_TOKEN>` (your Hugging Face [token](https://huggingface.co/settings/tokens)) in [launcher.sh](./launcher.sh)._

#### Using 4 nodes (32 GPUs) with BF16 precision

The default job settings are defined in [launcher.sh](./launcher.sh): 1 training epoch with a per-device batch size of 8. To execute the job with the default settings, run the following command from your client:

```bash
helm install $USER-paligemma2 ${RECIPE_ROOT} -f ${RECIPE_ROOT}/values.yaml \
    --set-file workload_launcher=${RECIPE_ROOT}/launcher.sh \
    --set-file workload_config=${RECIPE_ROOT}/main.py \
    --set workload.image=nvcr.io/nvidia/pytorch:25.01-py3 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/${USER}-paligemma2
```

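After submitting the job, you can confirm that the Helm release and the underlying JobSet were created. A minimal sketch, assuming the release name used in the command above:

```bash
# Confirm the Helm release exists
helm list --filter "$USER-paligemma2"

# Confirm the JobSet and its pods were created
kubectl get jobsets | grep "$USER-paligemma2"
kubectl get pods | grep "$USER-paligemma2"
```
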
#### Configure job settings

You can override any of the default
[training configuration environment variables](./main.py)
for this job. To do this, set the new values in [launcher.sh](./launcher.sh).

**Examples**

- To set `PER_DEVICE_TRAIN_BATCH_SIZE` to 64, update the following in [launcher.sh](./launcher.sh):

  ```bash
  export PER_DEVICE_TRAIN_BATCH_SIZE=64
  ```

Then run the previous `helm install` command from your client. If you want to inspect the manifests that will be generated before submitting, see the preview sketch below.

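To preview the manifests that the chart will generate, without submitting anything to the cluster, you can render it locally with `helm template`. A sketch that mirrors the flags of the install command above:

```bash
# Render the chart locally and page through the generated JobSet manifest
helm template $USER-paligemma2 ${RECIPE_ROOT} -f ${RECIPE_ROOT}/values.yaml \
    --set-file workload_launcher=${RECIPE_ROOT}/launcher.sh \
    --set-file workload_config=${RECIPE_ROOT}/main.py \
    --set workload.image=nvcr.io/nvidia/pytorch:25.01-py3 \
    --set volumes.gcsMounts[0].bucketName=${GCS_BUCKET} \
    --set volumes.gcsMounts[0].mountPath=/job-logs \
    --set workload.envs[0].value=/job-logs/${USER}-paligemma2 \
    | less
```
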
### Monitor the job

To check the status of pods in your job, run the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace the following:
- `JOB_NAME_PREFIX`: your job name prefix. For example, `$USER-paligemma2`.

To get the logs for one of the pods, run the following command:

```bash
kubectl logs POD_NAME
```

Information about the training job's progress, including crucial details such as loss,
step count, and step time, is generated by the rank 0 process.
This process runs on the pod whose name begins with `JOB_NAME_PREFIX-workload-0-0`.
For example: `user-paligemma2-workload-0-0-s9zrv`.

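To follow training progress continuously, you can resolve the rank 0 pod name and stream its logs. A minimal sketch, assuming the pod naming pattern described above and the default release name:

```bash
# Stream logs from the rank 0 pod; adjust the prefix if you used a different release name
RANK0_POD=$(kubectl get pods --no-headers -o custom-columns=":metadata.name" \
  | grep "^${USER}-paligemma2-workload-0-0" | head -n 1)
kubectl logs -f "$RANK0_POD"
```
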
### Troubleshooting

This section provides guidance on troubleshooting issues with the training job.

To check the status of the job's pods, use the following command:

```bash
kubectl get pods | grep JOB_NAME_PREFIX
```

Replace `JOB_NAME_PREFIX` with the prefix of your job name, for example `$USER-paligemma2`. This command lists all pods associated with the specified job, along with their current status.

To get the logs from a specific pod, use the following command:

```bash
kubectl logs POD_NAME
```

Replace `POD_NAME` with the name of the pod you want to inspect.

In this recipe, the training job is orchestrated by the [Kubernetes JobSet](https://jobset.sigs.k8s.io/docs/overview/). If the JobSet encounters a fatal failure, it removes all pods, making it impossible to inspect their logs directly. To analyze logs from a failed job, retrieve them from Cloud Logging using the following filter:

```
resource.type="k8s_container"
resource.labels.project_id="PROJECT_ID"
resource.labels.location="CLUSTER_REGION"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.namespace_name="default"
resource.labels.pod_name=~"^JOB_NAME_PREFIX.*"
severity>=DEFAULT
```

Replace the following:
- `PROJECT_ID`: your Google Cloud project ID.
- `CLUSTER_REGION`: the region where your cluster is located.
- `CLUSTER_NAME`: the name of your GKE cluster.
- `JOB_NAME_PREFIX`: the prefix of your job name (for example, `$USER-paligemma2`).

This filter retrieves logs from all containers within pods whose names match the specified job name prefix. You can paste it into the Logs Explorer or query it from the command line, as shown below.

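The same filter can also be queried from your workstation with the `gcloud` CLI instead of the Logs Explorer. A minimal sketch, using the same placeholder values described above:

```bash
# Read the failed job's container logs from Cloud Logging (newest entries first)
gcloud logging read '
resource.type="k8s_container"
resource.labels.location="CLUSTER_REGION"
resource.labels.cluster_name="CLUSTER_NAME"
resource.labels.namespace_name="default"
resource.labels.pod_name=~"^JOB_NAME_PREFIX.*"
severity>=DEFAULT
' --project=PROJECT_ID --limit=200 --order=desc
```
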
### Uninstall the Helm release

You can delete the job and other resources created by the Helm chart.
To uninstall the Helm release, run the following command from your client:

```bash
helm uninstall $USER-paligemma2
```
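To confirm that the release and its resources are gone, you can check Helm and the cluster afterwards. A minimal sketch, assuming the default release name:

```bash
# Verify that the release and its pods have been removed
helm list --filter "$USER-paligemma2"
kubectl get pods | grep "$USER-paligemma2" || echo "no pods remaining"
```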

training/a4/paligemma2/launcher.sh

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
#!/bin/bash
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

export HF_TOKEN=
export PYTHONUNBUFFERED=1

# Install the Python dependencies required by the finetuning script.
pip3 install \
    transformers==4.46.3 \
    datasets \
    accelerate \
    peft \
    bitsandbytes \
    pillow \
    tensorboard

# Make the NCCL plugin libraries visible to the dynamic linker.
export LD_LIBRARY_PATH="$NCCL_PLUGIN_PATH"
ldconfig "$LD_LIBRARY_PATH"
echo "Added $LD_LIBRARY_PATH to ldconfig:"
ldconfig -p | grep libcuda | sed 's/^/  /'
echo ""

# Each pod derives its node rank from the Job completion index.
export NODE_RANK=$JOB_COMPLETION_INDEX
export HYDRA_FULL_ERROR=1
export NVIDIA_VISIBLE_DEVICES=0

echo "Launching Torch distributed as node rank $NODE_RANK out of $NNODES nodes"

# Wrap the training entry point so Accelerate can launch it with --no_python.
mkdir /app
cat > /app/train.sh << 'EOF'
#!/bin/bash

python "$PYTHON_MAIN"
EOF

export TOKENIZERS_PARALLELISM=false
export NVTE_UB_SOCKET_IFNAME="eth1"

# Training parameters
export NUM_TRAIN_EPOCHS=1
export PER_DEVICE_TRAIN_BATCH_SIZE=8
export GRADIENT_ACCUMULATION_STEPS=2

chmod +x /app/train.sh

accelerate launch --no_python /app/train.sh
