Remove double baseline calculations for CI microbenchmarks #2613
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2613
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit 28f3f6a with merge base ebfe173. The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
# uncompiled base model so that quantized versions can be derived
# without mutating the cached copy.

_BASELINE_CACHE: Dict[Tuple, Tuple[float, float]] = {}
nit: add comment for what key is and maybe give an example
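A minimal sketch of what the reviewer asks for; the key layout shown here is a hypothetical example, not taken from the PR:

from typing import Dict, Tuple

# Maps a benchmark configuration to its baseline timings.
# Key: (model_name, quantization, shape, torch_compile_mode), e.g.
#   ("linear", "baseline", (1, 1024, 1024), "default")
# Value: (eager_baseline_time_in_ms, compile_baseline_time_in_ms)
_BASELINE_CACHE: Dict[Tuple, Tuple[float, float]] = {}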
result.eager_baseline_inference_time_in_ms = cached_eager_time
result.compile_baseline_inference_time_in_ms = cached_compile_time

# At this point, ``base_model`` is an uncompiled model ready for quantization,
base_model could be compiled in L124 right?
result.eager_speedup_on_baseline = round(
    result.eager_baseline_inference_time_in_ms
    / result.eager_model_inference_time_in_ms,
    2,
nit: pass by keyword arg to show what this is
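A minimal sketch of the suggestion: Python's built-in round() accepts ndigits as a keyword argument, which makes the bare 2 self-explanatory:

result.eager_speedup_on_baseline = round(
    result.eager_baseline_inference_time_in_ms
    / result.eager_model_inference_time_in_ms,
    ndigits=2,
)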
# Benchmark time to run an inference call for quantized model
# Measure inference time for quantized model
print("Benchmarking eager quantized model.....")
result.eager_model_inference_time_in_ms = model_inference_time_in_ms(
nit: add quantized somewhere in the name?
print("Benchmarking quantized model.....") | ||
result.model_inference_time_in_ms = model_inference_time_in_ms( | ||
m_copy = torch.compile(m_copy, mode=config.torch_compile_mode, fullgraph=True) | ||
result.compile_model_inference_time_in_ms = model_inference_time_in_ms( |
same for this one
    / result.compile_model_inference_time_in_ms,
    2,
)
# Compute compile speedup for quantized model relative to eager quantized model
do we need to do this comparison? I think it might be more useful to just compare eager quantized vs. eager baseline and compile quantized vs. compile baseline, since these show the speedup in different serving environments
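A short sketch of the two comparisons the reviewer suggests keeping; the variable names here are illustrative, not the repo's actual fields:

# Speedup in an eager serving environment
eager_speedup = eager_baseline_time_in_ms / eager_quantized_time_in_ms
# Speedup in a compiled serving environment
compile_speedup = compile_baseline_time_in_ms / compile_quantized_time_in_ms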
this seems ok, I think it would be even better if the code was refactored to only measure the baseline once and compare each experiment against it. This way, complexity is lower and there is no need for a cache. High level:

baseline_metrics = calc_baseline_metrics(...)
for experiment_config in experiment_configs:
    experiment_metrics = calc_experiment_metrics(...)
    speedup_vs_baseline = calc_speedup(experiment_metrics, baseline_metrics)

non-blocking comment, up to you
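A runnable sketch of the cache-free structure the reviewer describes, assuming hypothetical helper names (calc_metrics, calc_speedup, _time_ms) and a toy model; this is not the repo's actual benchmarking API:

import time
from dataclasses import dataclass

import torch


@dataclass
class Metrics:
    eager_time_ms: float
    compile_time_ms: float


def _time_ms(fn, x, iters=10):
    # Crude wall-clock timing; real benchmark utilities would add warmup
    # and CUDA synchronization.
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    return (time.perf_counter() - start) * 1e3 / iters


def calc_metrics(model, x):
    eager_ms = _time_ms(model, x)
    compiled = torch.compile(model)
    compile_ms = _time_ms(compiled, x)
    return Metrics(eager_time_ms=eager_ms, compile_time_ms=compile_ms)


def calc_speedup(experiment, baseline):
    return {
        "eager_speedup_on_baseline": round(baseline.eager_time_ms / experiment.eager_time_ms, ndigits=2),
        "compile_speedup_on_baseline": round(baseline.compile_time_ms / experiment.compile_time_ms, ndigits=2),
    }


# Measure the baseline exactly once, then reuse it for every experiment;
# the numbers stay in scope, so no cache is needed.
base_model = torch.nn.Linear(1024, 1024)
x = torch.randn(1, 1024)
baseline_metrics = calc_metrics(base_model, x)
for quantize in (lambda m: m,):  # placeholder for the real quantization configs
    experiment_metrics = calc_metrics(quantize(base_model), x)
    print(calc_speedup(experiment_metrics, baseline_metrics))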
This pull request introduces significant updates to the benchmarking framework, focusing on measuring latency for both eager and compile modes and adding a caching mechanism for baseline performance.
- Introduced _BASELINE_CACHE to store eager and compile baseline inference times, reducing redundant computations; this substantially reduced the CI runtime.
- The use_torch_compile parameter is no longer needed, as compile and eager performance are now calculated by default.
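A hedged sketch of how such a baseline cache is typically consulted; the key fields and the make_model/time_fn helpers are assumptions for illustration, not this PR's actual code:

from typing import Callable, Dict, Tuple

import torch

_BASELINE_CACHE: Dict[Tuple, Tuple[float, float]] = {}


def get_baseline_times(
    model_name: str,
    shape: Tuple[int, ...],
    compile_mode: str,
    make_model: Callable[[], torch.nn.Module],
    time_fn: Callable,
) -> Tuple[float, float]:
    # Return (eager_ms, compile_ms) for the unquantized baseline, computing
    # and caching them only the first time this configuration is seen.
    key = (model_name, shape, compile_mode)
    if key not in _BASELINE_CACHE:
        base_model = make_model()
        x = torch.randn(*shape)
        eager_ms = time_fn(base_model, x)
        compiled = torch.compile(base_model, mode=compile_mode, fullgraph=True)
        compile_ms = time_fn(compiled, x)
        _BASELINE_CACHE[key] = (eager_ms, compile_ms)
    return _BASELINE_CACHE[key]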