# Benchmarking Harness

Author: @biswapanda

## Problem statement:

In its current state, benchmarking has a few problems:

1. UX: experimentation, debugging, and iteration are hard.
   Use case: as a user, I want to easily experiment with different configs, get results quickly, and compare them.

2. Reproducibility is hard: we don't store the input configs and results.
   Use case: as a user, I want to be able to reproduce my experiments and share them with others.

3. Benchmarking steps are tightly coupled: if a single step or benchmark config fails, the entire process is aborted and retried.

4. Benchmarking through port-forwarding has non-deterministic latency characteristics.
## Proposed plan:

1. Decouple all steps and then compose them together: prep a model, deploy the k8s CR, benchmark, collect data.

2. Capture the configs for the experiment: deploy (a config or a reference to a deployment), benchmark, model, etc.

3. Run benchmarks inside the k8s cluster, in a k8s-native way.
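To make the decoupling concrete, here is a minimal Python sketch. The names (`Step`, `StepResult`, `compose`) are illustrative, not an existing harness API: each step runs independently and is composed by the harness, so one failing step is recorded rather than aborting everything.

```python
# Sketch only: all names here (Step, StepResult, compose) are illustrative,
# not an existing harness API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StepResult:
    name: str
    succeeded: bool
    outputs: Dict[str, str] = field(default_factory=dict)


# A step is any callable that takes the experiment config and returns a result,
# so the container/implementation behind each step can be swapped per test type.
Step = Callable[[dict], StepResult]


def compose(steps: List[Step], config: dict) -> List[StepResult]:
    """Run each step in order; a failure is recorded instead of aborting the rest."""
    results: List[StepResult] = []
    for step in steps:
        try:
            results.append(step(config))
        except Exception as exc:
            results.append(StepResult(step.__name__, False, {"error": str(exc)}))
    return results


def prep_model(config: dict) -> StepResult:
    # Placeholder: download/convert the model referenced by the config.
    return StepResult("prep_model", True, {"path": config["model"]["path"]})


def deploy_cr(config: dict) -> StepResult:
    # Placeholder: apply the k8s custom resource for the deployment.
    return StepResult("deploy_cr", True)


if __name__ == "__main__":
    cfg = {"model": {"name": "RedHat/Llama-3.3-70B-Instruct", "path": "/path/to/model"}}
    for result in compose([prep_model, deploy_cr], cfg):
        print(result)
```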
## Steps:

The harness executes the following steps:

Note: these steps are reusable across different test types (LLM benchmarking, accuracy testing, functional testing, etc.). Because the steps are reusable, the container used for each step can be swapped.

1. Initialize the experiment

   a. (Optional) Deploy the model

   b. Wait for the model to be ready
2. Run the benchmarking test using the configs and a benchmark container (genai-perf, AI Perf, or a third-party tool)

   > **Reviewer comment:** I think this is hand-waving a lot of other configurations which need to happen for the particular setup. If the user wants aggregated vs. disaggregated, and if disaggregated, then how many workers in each? What is the GPU allocation for each of the workers, etc.? How does this proposal address this, or are you making assumptions about the environment or use case? If so, that's fine, and I would like that explanation here.
   a. Prepare the configs (a matrix of params: ISL/OSL, concurrency, etc.) and pass them as a config file to the harness container

   b. Run the test for each config
3. Teardown

   a. (Optional) Collect artifacts: push files to upstream storage (S3/MinIO)

   b. Collect output results: push benchmark metrics to a data storage layer (S3/MinIO/database) using a CLI tool
4. Analytics

   a. Generate charts, graphs, and tables from the benchmark metrics

The LLM benchmarking container can use the `python3 -m benchmarks.utils.*` utilities to generate the config and run the benchmark; a rough sketch of how these steps could fit together follows below.
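The sketch below shows, under stated assumptions, what steps 1b through 3 might look like in Python: the `/v1/models` readiness probe, the `benchmarks.utils.run` module name and its CLI flags, and the S3/MinIO bucket layout are all placeholders for illustration; the real step containers would supply their own tools.

```python
# Sketch only: the health-check URL, runner module/flags, and bucket layout
# are assumptions for illustration, not part of this proposal.
import os
import subprocess
import time

import boto3     # works with MinIO as well, via endpoint_url
import requests


def wait_for_model(base_url: str, timeout_s: int = 600) -> None:
    """Step 1b: poll the serving endpoint until it responds (assumed /v1/models probe)."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/v1/models", timeout=5).ok:
                return
        except requests.RequestException:
            pass
        time.sleep(10)
    raise TimeoutError(f"model at {base_url} not ready after {timeout_s}s")


def run_case(case: dict, results_dir: str) -> None:
    """Step 2b: run one benchmark config via the step's container CLI.

    The module name and flags below are placeholders for whatever tool the
    benchmark container provides (genai-perf, AI Perf, or a third-party tool).
    """
    subprocess.run(
        [
            "python3", "-m", "benchmarks.utils.run",   # hypothetical module name
            "--isl", str(case["isl"]),
            "--osl", str(case["osl"]),
            "--concurrency", str(case["concurrency"]),
            "--output-dir", results_dir,
        ],
        check=True,
    )


def push_artifacts(results_dir: str, bucket: str, prefix: str) -> None:
    """Step 3: upload collected metrics/artifacts to S3/MinIO."""
    s3 = boto3.client("s3")
    for root, _, files in os.walk(results_dir):
        for name in files:
            path = os.path.join(root, name)
            key = f"{prefix}/{os.path.relpath(path, results_dir)}"
            s3.upload_file(path, bucket, key)
```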
## Config

Benchmarking config file:

```yaml
name: "blueprint-name"
model:
  name: "RedHat/Llama-3.3-70B-Instruct"
  path: "/path/to/model"
concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
endpoint: "/v1/chat/completions"
endpoint_type: "chat"
benchmark:
  isl_osl:
    - [8192, 1024]
    - [1024, 1024]
    - [1024, 8192]
```

> **Reviewer comment:** You mention you want a specific version of the model to be used. Can you add that field here to be explicit?
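As a rough illustration of how the harness could consume this file, the sketch below loads it with PyYAML and expands the `isl_osl` × `concurrency` matrix into individual benchmark cases; the loader itself is hypothetical, though the field names follow the example config above.

```python
# Sketch only: loads a config file shaped like the example above and expands
# the isl_osl x concurrency matrix into individual benchmark cases.
import itertools

import yaml  # PyYAML


def load_cases(path: str) -> list:
    with open(path) as f:
        cfg = yaml.safe_load(f)

    cases = []
    for (isl, osl), concurrency in itertools.product(
        cfg["benchmark"]["isl_osl"], cfg["concurrency"]
    ):
        cases.append(
            {
                "model": cfg["model"]["name"],
                "endpoint": cfg["endpoint"],
                "endpoint_type": cfg["endpoint_type"],
                "isl": isl,
                "osl": osl,
                "concurrency": concurrency,
            }
        )
    return cases


if __name__ == "__main__":
    for case in load_cases("benchmark_config.yaml"):  # hypothetical file name
        print(case)
```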
## Alternatives:

### Alternative 1: Benchmarking as a first-class citizen in Dynamo
```yaml
kind: DynamoBenchmark
metadata:
  name: vllm-agg-benchmark
spec:
  model:
    modelRef: llama-3-70b-instruct-v1
  config:
    model: "RedHat/Llama-3.3-70B-Instruct"
    path: "/path/to/model"
    concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
    endpoint: "/v1/chat/completions"
    endpoint_type: "chat"
    benchmark:
      isl_osl:
        - [8192, 1024]
        - [1024, 1024]
        - [1024, 8192]
```
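If this alternative were pursued, such a resource could be submitted with the official Kubernetes Python client, as sketched below. The API group, version, and plural are placeholders, since no `DynamoBenchmark` CRD is defined yet, and the manifest would also need a matching `apiVersion` field.

```python
# Sketch only: submitting a hypothetical DynamoBenchmark CR with the official
# Kubernetes Python client. Group/version/plural are placeholders; a real CRD
# and controller would define them.
import yaml
from kubernetes import client, config


def submit_benchmark(manifest_path: str, namespace: str = "default") -> None:
    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    with open(manifest_path) as f:
        body = yaml.safe_load(f)  # manifest would need apiVersion matching group/version

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="dynamo.example.com",   # placeholder API group
        version="v1alpha1",           # placeholder version
        namespace=namespace,
        plural="dynamobenchmarks",    # placeholder plural
        body=body,
    )


if __name__ == "__main__":
    submit_benchmark("vllm-agg-benchmark.yaml")  # hypothetical manifest file
```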
### Alternative 2: Benchmarking helm chart + workflow manager

Simpler to manage and deploy. Reuse Argo Workflows as the workflow manager to orchestrate dependencies and the workflow.

> **Reviewer comment:** Do we still run into k8s DNS issues when trying to solve the "non-deterministic" problem? I think we have this same issue in Slurm where we are not guaranteed the nodes are in the same topology. What are your tolerances for "reproducibility"?