Conversation

Contributor

@kalantar kalantar commented Oct 8, 2025

See tekton_poc/README.md

Signed-off-by: Michael Kalantar <[email protected]>
@kalantar kalantar marked this pull request as draft October 8, 2025 14:14
mkdir -p "${MOUNT_PATH}/${MODEL_PATH}";
python -m pip install huggingface_hub;
hf auth login --token "${HF_TOKEN}";
hf download "${HF_MODEL_ID}" --local-dir "/cache/${MODEL_PATH}"

Does this skip the model download if the model is already present at the path? I believe this is the entire point of using the PVC to reuse models locally ... right?

Collaborator

I believe `hf download` detects whether or not the model has already been downloaded. Furthermore, `mkdir -p` accepts folders that already exist, so I think this is good.

Contributor Author

`hf download` is intelligent and will skip the work. In the current implementation, the PVC is created locally and used only once; each parallel experiment runs in a different namespace. In principle, we could define a PVC bound to an existing PV that holds a single copy of the model. If we do that, we probably want to add a task before the parallel stage to download the model, along the lines of the sketch below.
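A minimal sketch of such a guard, assuming `MOUNT_PATH` is the PVC mount point (the snippet above mixes `${MOUNT_PATH}` and `/cache`, which are presumably the same path); the emptiness check is an illustration, not part of the PR:

```shell
# Hedged sketch: only download when the model directory is empty, so a shared
# PV can be reused across runs. The guard itself is an assumption, not PR code.
mkdir -p "${MOUNT_PATH}/${MODEL_PATH}"
if [ -z "$(ls -A "${MOUNT_PATH}/${MODEL_PATH}")" ]; then
  python -m pip install huggingface_hub
  hf auth login --token "${HF_TOKEN}"
  hf download "${HF_MODEL_ID}" --local-dir "${MOUNT_PATH}/${MODEL_PATH}"
else
  echo "model already present at ${MOUNT_PATH}/${MODEL_PATH}; skipping download"
fi
```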

@@ -0,0 +1,40 @@
apiVersion: v2

Is this chart needed?

Can we simply use the inference-perf container as part of the StepAction, and configure it so that the step directly sends the inference request?

This approach eliminates the chart, along with the need for a separate harness pod (the Tekton step will run as part of some pod anyway).

Contributor Author

The step has now been updated to use the llm-d-benchmark image directly. This isn't as clean as using the inference-perf image directly would be; however, it should also be possible to modify the image to make it easier to use.
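For illustration, a minimal sketch of a StepAction along those lines, assuming Tekton's v1beta1 StepAction API; the image, parameter names, and request body are assumptions, not the PR's actual definitions:

```shell
# Hedged sketch: a StepAction that sends the inference request directly from
# the step's pod, avoiding a separate harness pod. All names are illustrative.
kubectl apply -f - <<'EOF'
apiVersion: tekton.dev/v1beta1
kind: StepAction
metadata:
  name: send-inference-request
spec:
  params:
    - name: endpoint
      type: string
    - name: model
      type: string
  image: curlimages/curl:latest  # or the inference-perf / llm-d-benchmark image
  script: |
    #!/bin/sh
    # send a single completion request to the deployed stack
    curl -sf "$(params.endpoint)/v1/completions" \
      -H 'Content-Type: application/json' \
      -d "{\"model\": \"$(params.model)\", \"prompt\": \"hello\", \"max_tokens\": 8}"
EOF
```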


echo "✅ workload completed"

- name: upload-results

This section needs to be completed.

Let us start by implementing HTTP(S)-based results upload to S3-compatible storage.

See example here: https://www.ibm.com/docs/en/storage-scale/5.2.1?topic=storage-connectivity-cloud-object

Contributor Author

This has been done; the results folder is tarred and uploaded to an S3-compatible bucket.
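For reference, a hedged sketch of that step, assuming the standard AWS CLI; `RESULTS_DIR`, `S3_BUCKET`, and `S3_ENDPOINT_URL` are illustrative names:

```shell
# Hedged sketch: tar the results folder and push it to an S3-compatible endpoint.
# Credentials are assumed to come from the secret referenced by the pipeline.
tar -czf results.tar.gz -C "${RESULTS_DIR}" .
aws s3 cp results.tar.gz \
  "s3://${S3_BUCKET}/results/$(date +%Y%m%d-%H%M%S).tar.gz" \
  --endpoint-url "${S3_ENDPOINT_URL}"
```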

mkdir -p "${MOUNT_PATH}/${MODEL_PATH}";
python -m pip install huggingface_hub;
hf auth login --token "${HF_TOKEN}";
hf download "${HF_MODEL_ID}" --local-dir "/cache/${MODEL_PATH}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe hf download detects whether or not the model has been downloaded. Furthermore mkdir -p accepts folders that already exist, so I think this is good.

@@ -0,0 +1,150 @@
inferenceExtension:
Collaborator

I am guessing the user would create these values YAML files for the experiment pipeline?

Contributor Author

Today, the pipeline takes the location (URL) of the values file as input. It can take a stringified values file as well. When sweeping through values, the base values file is overridden using `--set`, as sketched below.

An alternative is to use a (YAML) description of the desired environment to generate the values files. This seems to assume we can express it more simply than today's values files, and it is not clear to me that this is the case.
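For example, a sweep step might look roughly like this (a sketch: the chart path, release names, and the swept key are illustrative; only the `inferenceExtension` prefix comes from the values file above):

```shell
# Hedged sketch: install the base values file, overriding one swept key per run.
# The chart path and the replicas key are assumptions, not the PR's actual names.
for replicas in 1 2 4; do
  helm upgrade --install "exp-replicas-${replicas}" ./charts/gaie \
    -f gaie-values.yaml \
    --set inferenceExtension.replicas="${replicas}" \
    --namespace "exp-replicas-${replicas}" --create-namespace
done
```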

A _matrix_-based `Task` can be unrolled into multiple tasks to reduce the parallelism.
The utility script `utility/transform-pr-parallel.py` does this as follows:

1. Unroll a single parameter into one `Task` per value. Each resulting Task defines a matrix over the remaining parameters.
Collaborator

Curious what the "unrolled" output looks like here.
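For illustration, a hedged sketch of the transformation; the script's actual command line is not shown in this excerpt, so the invocation is an assumption:

```shell
# Hedged sketch: unroll one parameter of a matrix PipelineRun.
# The exact arguments are assumptions; see utility/transform-pr-parallel.py.
python utility/transform-pr-parallel.py pipeline/pipelinerun-matrix.yaml \
  > pipeline/pipelinerun-unrolled.yaml
# Conceptually, one Task with matrix {replicas: [1, 2], qps: [10, 20]} becomes
# two Tasks, one per replicas value, each keeping matrix {qps: [10, 20]}; the
# generated Tasks can then be chained with runAfter to cap parallelism.
```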

Signed-off-by: Michael Kalantar <[email protected]>

1. Create a namespace where the Tekton pipeline will execute.
```shell
export $NAMESPACE=your_namespace
Collaborator

Suggested change
export $NAMESPACE=your_namespace
export NAMESPACE=your_namespace

- the namespace (where the PipelineRun executes)
- s3 details: secret name, bucket name and endpoint URL

Run by creating the PipelineRun:
Collaborator

This appears as a one-liner after rendering. May need to un-indent.
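For reference, starting the run would look something like the following; the file name appears elsewhere in this conversation, and whether the PipelineRun uses a fixed name or `generateName` is an assumption:

```shell
# Hedged sketch: create the PipelineRun in the namespace prepared earlier.
kubectl create -f pipeline/pipelinerun-matrix.yaml -n "${NAMESPACE}"
```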

Comment on lines +102 to +103
```shell
kubectl apply -f pipeline/stepactions.yaml
Collaborator

Suggested change
```shell
kubectl apply -f pipeline/stepactions.yaml
```shell
cd tekton-poc
kubectl apply -f pipeline/stepactions.yaml

Signed-off-by: Michael Kalantar <[email protected]>
Collaborator

@jgchn jgchn left a comment


Tested and looks good! I was able to execute three experiments in parallel asynchronously, and saw the results uploaded to IBM COS. Let's get this merged!

Comment on lines +28 to +29
This proof of concept currently implements a variation of the inference-scheduling [scenario](https://github.com/llm-d/llm-d-benchmark/blob/main/scenarios/guides/inference-scheduling.sh)/[experiment](https://github.com/llm-d/llm-d-benchmark/blob/main/experiments/inference-scheduling.yaml).

Collaborator

Suggested change
This proof of concept currently implements a variation of the inference-scheduling [scenario](https://github.com/llm-d/llm-d-benchmark/blob/main/scenarios/guides/inference-scheduling.sh)/[experiment](https://github.com/llm-d/llm-d-benchmark/blob/main/experiments/inference-scheduling.yaml).
This proof of concept currently implements a variation of the inference-scheduling [scenario](https://github.com/llm-d/llm-d-benchmark/blob/main/scenarios/guides/inference-scheduling.sh)/[experiment](https://github.com/llm-d/llm-d-benchmark/blob/main/experiments/inference-scheduling.yaml).
To change the Inference Scheduling configs for the experiment, update `tekton-poc/examples/inference-scheduling/gaie-values.yaml`, then `git push` to your fork, and supply the new URL to `inference-scheduling/` for the `experimentBaseUrl` value in the [pipeline run yaml](./pipeline/pipelinerun-matrix.yaml#L46).

Signed-off-by: Michael Kalantar <[email protected]>