# Benchmarking Harness

Author: @biswapanda

## Problem statement:

In its current state, benchmarking has a few problems:

1. UX: experimentation, debugging, and iteration are hard.
   Use case: As a user, I want to easily experiment with different configs, get results quickly, and compare them.

2. Reproducibility is hard: we don't store the input configs and results.
   Use case: As a user, I want to be able to reproduce my experiments and share them with others.

3. Benchmarking steps are tightly coupled: if a single step or benchmark config fails, the entire process is aborted or retried.

4. Port-forwarding into the cluster gives the benchmark non-deterministic latency characteristics.

## Proposed plan:

1. Decouple all steps and then compose them together: prepare a model, deploy the Kubernetes CR, benchmark, collect data.

2. Capture the configs for the experiment: deploy (the config or a reference to the deployment), benchmark, model, etc.

3. Run benchmarks inside the Kubernetes cluster in a Kubernetes-native way, so traffic stays on in-cluster networking instead of going through a port-forward (see the readiness sketch below).

> **Reviewer comment:** Do we still run into k8s DNS issues when trying to solve the "non-deterministic" problem? I think we have this same issue in Slurm, where we are not guaranteed the nodes are in the same topology. What are your tolerances for "reproducibility"?

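To make the Kubernetes-native approach concrete, here is a minimal sketch of the readiness wait (step 1b below) running from inside the cluster and hitting the model's Service DNS name directly, so no port-forward sits in the measurement path. The Service name, namespace, port, and the OpenAI-compatible `/v1/models` route are assumptions for illustration, not fixed by this design.

```python
# A minimal sketch, assuming an OpenAI-compatible frontend exposed as a
# Kubernetes Service. The Service DNS name, namespace, port, and the
# /v1/models route are illustrative placeholders.
import time

import requests

# Assumed in-cluster address of the deployed model's frontend Service.
BASE_URL = "http://model-frontend.benchmark.svc.cluster.local:8000"


def wait_for_model_ready(base_url: str = BASE_URL,
                         timeout_s: int = 1800,
                         poll_s: int = 10) -> None:
    """Block until the endpoint lists at least one served model."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            resp = requests.get(f"{base_url}/v1/models", timeout=5)
            if resp.status_code == 200 and resp.json().get("data"):
                return  # at least one model is registered and servable
        except requests.RequestException:
            pass  # frontend not reachable yet; keep polling
        time.sleep(poll_s)
    raise TimeoutError(f"model endpoint {base_url} not ready after {timeout_s}s")


if __name__ == "__main__":
    wait_for_model_ready()
```

Because both the readiness probe and the benchmark traffic stay on the cluster network, the port-forwarding latency noise from problem 4 is kept out of the results.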

## Steps:
The following steps are executed by the harness:

Note: these steps are reusable across different tests (LLM benchmarking, accuracy testing, functional testing, etc.).

Since the steps are reusable across different tests, we can swap the container used for each step.

1. Initialize the experiment

   a. (Optional) Deploy the model

   b. Wait for the model to be ready

2. Run the benchmarking test using the configs and a benchmark container (genai-perf, AIPerf, or a third-party tool)

> **Reviewer comment:** I think this is hand-waving a lot of other configuration which needs to happen for the particular setup. If the user wants aggregated vs. disaggregated, and if disaggregated, then how many workers in each? What is the GPU allocation for each of the workers, etc.?
>
> How does this proposal address this, or are you making assumptions about the environment or use case? If so, that's fine, and I would like that explanation here.



   a. Prepare the configs (a matrix of params: ISL/OSL, concurrency, etc.) and pass them as a config file to the harness container

   b. Run the test for each config

3. Teardown

   a. (Optional) Collect artifacts: push files to upstream storage (S3/MinIO)

   b. Collect output results: push the benchmark metrics to a data storage layer (S3/MinIO/database) using a CLI tool (see the sketch after this list)

4. Analytics

   a. Generate charts, graphs, and tables from the benchmark metrics
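For step 3, here is a minimal sketch of the artifact/metrics push to S3 or MinIO using `boto3`. The bucket name, key prefix, and the environment variables used for the endpoint and credentials are assumptions; the CLI tool mentioned above would wrap something like this.

```python
# A minimal sketch, assuming results are written to a local directory and an
# S3-compatible endpoint (AWS S3 or in-cluster MinIO). The bucket name, key
# prefix, and environment variable names are illustrative placeholders.
import os
from pathlib import Path

import boto3


def push_results(results_dir: str, bucket: str, prefix: str) -> None:
    """Upload every file under results_dir to s3://<bucket>/<prefix>/..."""
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ.get("S3_ENDPOINT_URL"),  # None falls back to AWS S3
        aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    )
    root = Path(results_dir)
    for path in root.rglob("*"):
        if path.is_file():
            key = f"{prefix}/{path.relative_to(root)}"
            s3.upload_file(str(path), bucket, key)


if __name__ == "__main__":
    push_results("artifacts/", bucket="benchmark-results", prefix="experiments/exp-001")
```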


The LLM benchmarking container can use the `python3 -m benchmarks.utils.*` utilities to generate the config and run the benchmark.
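The internals of `benchmarks.utils.*` are not specified here, so the following is only a hypothetical sketch of running a single matrix cell by shelling out to genai-perf. The exact flag names should be verified against the installed genai-perf version, and the model name, URL, and output directory are taken from the example config below or invented for illustration.

```python
# Hypothetical sketch: run one benchmark cell with genai-perf via subprocess.
# The 'profile' subcommand and flag names follow recent genai-perf releases;
# verify them against the version installed in the benchmark container.
import subprocess


def run_genai_perf(model: str, url: str, concurrency: int, isl: int, osl: int, out_dir: str) -> None:
    cmd = [
        "genai-perf", "profile",
        "--model", model,
        "--url", url,
        "--endpoint-type", "chat",
        "--concurrency", str(concurrency),
        "--synthetic-input-tokens-mean", str(isl),
        "--output-tokens-mean", str(osl),
        "--artifact-dir", out_dir,
    ]
    subprocess.run(cmd, check=True)  # raise if the benchmark run fails


if __name__ == "__main__":
    run_genai_perf(
        model="RedHat/Llama-3.3-70B-Instruct",
        url="http://localhost:8000",  # assumed in-cluster or local endpoint
        concurrency=8,
        isl=1024,
        osl=1024,
        out_dir="artifacts/c8_1024x1024",
    )
```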

## Config

Benchmarking config file:
```yaml
name: "blueprint-name"
model:
  name: "RedHat/Llama-3.3-70B-Instruct"
  path: "/path/to/model"
concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
endpoint: "/v1/chat/completions"
endpoint_type: "chat"
benchmark:
  isl_osl:
    - [8192, 1024]
    - [1024, 1024]
    - [1024, 8192]
```

> **Reviewer comment:** You mention you want a specific version of the model to be used. Can you add that field here to be explicit?
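As a sketch of how the harness could expand this file into individual runs, the snippet below loads the YAML (assumed to be saved as `benchmark_config.yaml` and parsed with PyYAML) and takes the cross product of `concurrency` and `benchmark.isl_osl`, which is 11 × 3 = 33 runs for the values shown.

```python
# A minimal sketch: expand the benchmark config's parameter matrix into
# individual runs. The config filename is an assumption.
import itertools

import yaml

with open("benchmark_config.yaml") as f:
    cfg = yaml.safe_load(f)

# One run per (concurrency, [isl, osl]) combination.
runs = [
    {"concurrency": c, "isl": isl, "osl": osl}
    for c, (isl, osl) in itertools.product(cfg["concurrency"], cfg["benchmark"]["isl_osl"])
]
print(f"{len(runs)} benchmark runs to execute, e.g. {runs[0]}")
```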

## Alternatives:

### Alternative 1: Benchmarking as a first-class citizen in Dynamo

```yaml
kind: DynamoBenchmark
metadata:
  name: vllm-agg-benchmark
spec:
  model:
    modelRef: llama-3-70b-instruct-v1
  config:
    model: "RedHat/Llama-3.3-70B-Instruct"
    path: "/path/to/model"
    concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
    endpoint: "/v1/chat/completions"
    endpoint_type: "chat"
    benchmark:
      isl_osl:
        - [8192, 1024]
        - [1024, 1024]
        - [1024, 8192]
```

### Alternative 2: Benchmarking Helm chart + workflow manager

This is simpler to manage and deploy.
Reuse Argo Workflows as the workflow manager to orchestrate dependencies and the workflow steps.