
Remove double baseline calculations for CI microbenchmarks #2613


Open · wants to merge 9 commits into main

Conversation

@jainapurva (Contributor) commented Jul 28, 2025

This pull request updates the microbenchmarking framework to measure latency in both eager and compile modes and to cache baseline performance results.

  • Introduced _BASELINE_CACHE to store eager and compile baseline inference times, avoiding redundant baseline runs and substantially reducing CI runtime.
  • Calculates performance results for both compile and eager modes.
  • Removed the use_torch_compile parameter, since compile and eager performance are now measured by default.
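
For illustration, a minimal sketch of how a baseline cache along these lines can work; the key fields and the measure_baselines callback are assumptions for this sketch, not the PR's exact implementation:

from typing import Callable, Dict, Tuple

# Maps a hashable benchmark-config key to (eager_baseline_ms, compile_baseline_ms).
_BASELINE_CACHE: Dict[Tuple, Tuple[float, float]] = {}


def get_baseline_times(
    key: Tuple, measure_baselines: Callable[[], Tuple[float, float]]
) -> Tuple[float, float]:
    """Return cached baseline timings, running the measurement only on a cache miss."""
    if key not in _BASELINE_CACHE:
        _BASELINE_CACHE[key] = measure_baselines()
    return _BASELINE_CACHE[key]

Benchmark configs that share the same key reuse the cached timings instead of re-running the baseline, which is where the CI-time saving comes from.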

pytorch-bot bot commented Jul 28, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2613

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 28f3f6a with merge base ebfe173:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jul 28, 2025
@jainapurva changed the title from "Remove double calculation of baseline" to "Update Microbenchmarks CI run" on Jul 28, 2025
@jainapurva added the ciflow/benchmark, topic: performance, and topic: for developers labels on Jul 28, 2025
@jainapurva marked this pull request as ready for review on July 28, 2025 16:18
@jainapurva changed the title from "Update Microbenchmarks CI run" to "Remove double baseline calculations for CI microbenchmarks" on Aug 1, 2025
# uncompiled base model so that quantized versions can be derived
# without mutating the cached copy.

_BASELINE_CACHE: Dict[Tuple, Tuple[float, float]] = {}
Contributor:

nit: add a comment explaining what the key is, and maybe give an example
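
One way to address this nit; the key fields in this example are assumptions for illustration, not necessarily the key actually used in the PR:

# _BASELINE_CACHE maps a benchmark-config key to baseline inference times:
#   key:   e.g. (model_type, m, k, n, high_precision_dtype, device, compile_mode)
#   value: (eager_baseline_time_in_ms, compile_baseline_time_in_ms)
# Example:
#   _BASELINE_CACHE[("linear", 1024, 1024, 1024, torch.bfloat16, "cuda", "default")]
#   == (1.92, 0.87)
_BASELINE_CACHE: Dict[Tuple, Tuple[float, float]] = {}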

result.eager_baseline_inference_time_in_ms = cached_eager_time
result.compile_baseline_inference_time_in_ms = cached_compile_time

# At this point, ``base_model`` is an uncompiled model ready for quantization,
Contributor:

base_model could be compiled in L124 right?

result.eager_speedup_on_baseline = round(
    result.eager_baseline_inference_time_in_ms
    / result.eager_model_inference_time_in_ms,
    2,
Contributor:

nit: pass by keyword arg to show what this is
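
Concretely, the suggestion presumably means passing round's second argument by keyword, for example:

result.eager_speedup_on_baseline = round(
    result.eager_baseline_inference_time_in_ms
    / result.eager_model_inference_time_in_ms,
    ndigits=2,  # keyword makes it clear the 2 is the rounding precision
)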

# Benchmark time to run an inference call for quantized model
# Measure inference time for quantized model
print("Benchmarking eager quantized model.....")
result.eager_model_inference_time_in_ms = model_inference_time_in_ms(
Contributor:

nit: add quantized somewhere in the name?

print("Benchmarking quantized model.....")
result.model_inference_time_in_ms = model_inference_time_in_ms(
m_copy = torch.compile(m_copy, mode=config.torch_compile_mode, fullgraph=True)
result.compile_model_inference_time_in_ms = model_inference_time_in_ms(
Contributor:

same for this one

    / result.compile_model_inference_time_in_ms,
    2,
)
# Compute compile speedup for quantized model relative to eager quantized model
Contributor:

do we need to do this comparison? I think it might be more useful to just compare eager quantized vs. eager baseline and compile quantized vs. compile baseline, since those show the speedup in different serving environments
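
A sketch of the two comparisons the reviewer describes; the compile_speedup_on_baseline field name is an assumption, while the other fields follow the result attributes quoted above:

# Eager serving: quantized eager vs. eager baseline.
result.eager_speedup_on_baseline = round(
    result.eager_baseline_inference_time_in_ms
    / result.eager_model_inference_time_in_ms,
    ndigits=2,
)
# Compiled serving: quantized compiled vs. compiled baseline.
result.compile_speedup_on_baseline = round(
    result.compile_baseline_inference_time_in_ms
    / result.compile_model_inference_time_in_ms,
    ndigits=2,
)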

@vkuzo (Contributor) commented Aug 4, 2025

this seems ok, I think it would be even better if the code was refactored to only measure the baseline once and compare each experiment against it. This way, complexity is lower and there is no need for a cache.

high level:

baseline_metrics = calc_baseline_metrics(...)
for experiment_config in experiment_configs:
    experiment_metrics = calc_experiment_metrics(...)
    speedup_vs_baseline = calc_speedup(experiment_metrics, baseline_metrics)

non-blocking comment, up to you
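
For context, a runnable sketch of the structure vkuzo outlines, with the baseline measured once up front; the helper names and the Metrics fields are placeholders, not the actual torchao benchmarking API:

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Metrics:
    eager_ms: float
    compile_ms: float


def calc_speedup(experiment: Metrics, baseline: Metrics) -> Dict[str, float]:
    # Higher is better: baseline time divided by experiment time.
    return {
        "eager_speedup": round(baseline.eager_ms / experiment.eager_ms, ndigits=2),
        "compile_speedup": round(baseline.compile_ms / experiment.compile_ms, ndigits=2),
    }


def run_benchmarks(
    experiment_configs: List[dict],
    calc_baseline_metrics: Callable[[], Metrics],
    calc_experiment_metrics: Callable[[dict], Metrics],
) -> List[Dict[str, float]]:
    # Measure the baseline exactly once, then compare every experiment
    # against it, so no cache is needed.
    baseline_metrics = calc_baseline_metrics()
    results = []
    for experiment_config in experiment_configs:
        experiment_metrics = calc_experiment_metrics(experiment_config)
        results.append(calc_speedup(experiment_metrics, baseline_metrics))
    return results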

Labels
ciflow/benchmark, CLA Signed, topic: for developers, topic: performance

4 participants