# Benchmarking Harness

In its current state, benchmarking has a few problems:

1. UX: experimentation, debugging, and iteration are hard.
   Use case: As a user, I want to easily experiment with different configs, get results quickly, and compare them.

2. Reproducibility is hard: we don't store the input configs and results.
   Use case: As a user, I want to be able to reproduce my experiments and share them with others.

3. Benchmarking steps are tightly coupled: if a single step/benchmark config fails, the entire process is aborted/retried.

4. Benchmarking over port-forwarding has non-deterministic latency characteristics.

## Proposed plan

1. Decouple all the steps and then compose them together: prep a model, deploy the k8s CR, benchmark, collect data.

2. Capture the configs for the experiment: deploy (the config or a reference to the deployment), benchmark, model, etc.

3. Run benchmarks inside the k8s cluster, in a k8s-native way (see the Job sketch below).

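To make the k8s-native approach concrete, here is a minimal sketch of what an in-cluster benchmark Job could look like. The image names, the model Service name and port, and the ConfigMap are placeholders, not part of this proposal; the point is that the benchmark container talks to the model Service directly (no port-forward), and an init container handles the wait-for-ready step.

```yaml
# Minimal sketch only. Image names, the model Service name/port, and the ConfigMap
# are placeholders; the real harness would template these from the experiment config.
apiVersion: batch/v1
kind: Job
metadata:
  name: benchmark-llama-3-3-70b
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      initContainers:
        - name: wait-for-model
          image: curlimages/curl:latest
          # Poll the in-cluster model Service until it responds; no port-forwarding involved.
          command: ["sh", "-c", "until curl -sf http://llama-frontend:8000/v1/models; do sleep 5; done"]
      containers:
        - name: benchmark
          image: benchmark-harness:latest   # placeholder: genai-perf/aiperf/3rd-party wrapper
          args: ["--config", "/config/benchmark.yaml"]
          volumeMounts:
            - name: config
              mountPath: /config
            - name: results
              mountPath: /results
      volumes:
        - name: config
          configMap:
            name: benchmark-config          # holds the config file shown in the Config section
        - name: results
          emptyDir: {}
```
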
## Steps

The following steps are executed by the harness:

Note: these steps are reusable across different tests (LLM benchmarking, accuracy testing, functional testing, etc.).

Since the steps are reusable across different tests, we can swap the container used for each step.
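For illustration only (this is not an existing schema), the mapping from test type to container could look like the following; the image names are placeholders:

```yaml
# Illustrative sketch, not an existing schema; images are placeholders.
test_types:
  llm_benchmark:
    run_image: "benchmark-harness:latest"   # e.g. wraps genai-perf / aiperf
  accuracy:
    run_image: "accuracy-harness:latest"    # e.g. wraps an eval harness
  functional:
    run_image: "functional-tests:latest"
```
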
1. Initialize the experiment

   a. (Optional) Deploy the model

   b. Wait for the model to be ready

2. Run the benchmarking test using the configs and a benchmark container (genai-perf, aiperf, or a 3rd-party tool)

   a. Prepare the configs (a matrix of params: ISL/OSL, concurrency, etc.) and pass them as a config file to the harness container

   b. Run the test for each config

3. Teardown (see the collection sketch after this list)

   a. (Optional) Collect artifacts: push files to upstream storage (S3/MinIO)

   b. Collect output results: push benchmark metrics to a data storage layer (S3/MinIO/database) using a CLI tool

4. Analytics

   a. Generate charts, graphs, and tables from the benchmark metrics

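As a sketch of the collection step (3a/3b), the artifact push could be just another swappable container. The fragment below would slot into a pod spec as a follow-up step (for example a separate Job, or the final step of a workflow); the bucket path, alias, and credentials Secret are placeholders, and `mc` (the MinIO client) is only one option for an S3-compatible CLI.

```yaml
# Sketch of a collection container; bucket path, alias, and credentials Secret are placeholders.
containers:
  - name: collect-results
    image: minio/mc:latest
    command:
      - sh
      - -c
      - |
        mc alias set store "$S3_ENDPOINT" "$S3_ACCESS_KEY" "$S3_SECRET_KEY"
        mc cp --recursive /results/ store/benchmarks/blueprint-name/
    envFrom:
      - secretRef:
          name: s3-credentials   # placeholder Secret providing S3_ENDPOINT/S3_ACCESS_KEY/S3_SECRET_KEY
    volumeMounts:
      - name: results
        mountPath: /results
```
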
## Config

Benchmarking config file:

```yaml
name: "blueprint-name"
model:
  name: "RedHat/Llama-3.3-70B-Instruct"
  path: "/path/to/model"
concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
endpoint: "/v1/chat/completions"
endpoint_type: "chat"
benchmark:
  isl_osl:
    - [8192, 1024]
    - [1024, 1024]
    - [1024, 8192]
```
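
For illustration, the harness would expand this config into one run per (ISL/OSL, concurrency) combination: three ISL/OSL pairs times eleven concurrency values gives 33 runs, each executed and recorded independently so that one failing combination does not abort the rest. The field names below are illustrative, not a defined schema:

```yaml
# Illustrative expansion of the config above (3 isl/osl pairs x 11 concurrency values = 33 runs).
runs:
  - {isl: 8192, osl: 1024, concurrency: 1}
  - {isl: 8192, osl: 1024, concurrency: 2}
  # ...
  - {isl: 1024, osl: 8192, concurrency: 1024}
```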
|
## Alternatives

### Alternative 1: Benchmarking as a first-class citizen in Dynamo
|
```yaml
# The apiVersion/group is a placeholder for the proposed CRD.
apiVersion: nvidia.com/v1alpha1
kind: DynamoBenchmark
metadata:
  name: vllm-agg-benchmark
spec:
  model:
    modelRef: llama-3-70b-instruct-v1
  config:
    model: "RedHat/Llama-3.3-70B-Instruct"
    path: "/path/to/model"
    concurrency: [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
    endpoint: "/v1/chat/completions"
    endpoint_type: "chat"
    benchmark:
      isl_osl:
        - [8192, 1024]
        - [1024, 1024]
        - [1024, 8192]
```
|
### Alternative 2: Benchmarking Helm chart + workflow manager
|
This is simpler to manage and deploy. Reuse Argo Workflows as the workflow manager to orchestrate dependencies and the overall flow.
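A rough sketch of what the Argo-based composition could look like is shown below. Template names, images, and endpoints are placeholders; the point is only the ordering of steps (deploy, wait, benchmark, collect), each of which can fail or be retried independently.

```yaml
# Sketch only; names, images, and endpoints are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: llm-benchmark-
spec:
  entrypoint: benchmark-experiment
  templates:
    - name: benchmark-experiment
      steps:
        - - name: deploy-model
            template: deploy-model
        - - name: wait-for-ready
            template: wait-for-ready
        - - name: run-benchmarks
            template: run-benchmarks
        - - name: collect-results
            template: collect-results
    - name: deploy-model
      container:
        image: bitnami/kubectl:latest
        command: [sh, -c, "kubectl apply -f /config/deployment.yaml"]
    - name: wait-for-ready
      container:
        image: curlimages/curl:latest
        command: [sh, -c, "until curl -sf http://llama-frontend:8000/v1/models; do sleep 5; done"]
    - name: run-benchmarks
      container:
        image: benchmark-harness:latest   # placeholder benchmark container
        args: ["--config", "/config/benchmark.yaml"]
    - name: collect-results
      container:
        image: minio/mc:latest
        # Assumes an object-store alias has been configured (see the collection sketch above).
        command: [sh, -c, "mc cp --recursive /results/ store/benchmarks/"]
```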