-
-
Notifications
You must be signed in to change notification settings - Fork 11.8k
[WIP] Calculate MFU/MBU for whole model forward pass #29853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| mn = metric_name | ||
| if "sm__throughput" in mn and "pct_of_peak" in mn: | ||
| metrics["sm_pct_of_peak"] = val | ||
| elif "dram__throughput" in mn and "pct_of_peak" in mn: | ||
| metrics["dram_pct_of_peak"] = val |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DRAM throughput metric never parsed
The NCU summary rows always leave ncu_dram_pct_of_peak as None because parse_metrics_csv searches for the substring dram__throughput while the requested metric name is gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed. Since that string lacks the double underscore pattern, the elif branch never fires and MBU/DRAM utilization is never captured for any run using the default metrics list (--metrics gpu__dram_throughput...).
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new hardware utilization toolkit with three scripts to automate profiling with Nsight Compute and calculating MFU/MBU. The scripts are well-structured for their purpose. My main feedback is a critical fix in calculate_mfu_mbu.py to ensure the MFU/MBU calculations are correct by properly handling kernels that may have failed profiling.
| def compute_mfu_mbu(rows: List[Dict]) -> Dict[str, float]: | ||
| """ | ||
| rows: from ncu_kernel_summary_*.csv for a single workload. | ||
| Uses NSys time fraction + NCU MFU/MBU per kernel. | ||
| """ | ||
| # Sum time across all kernels (ms) | ||
| total_time_ms = 0.0 | ||
| for r in rows: | ||
| t = float(r["nsys_time_ms"]) | ||
| total_time_ms += t | ||
|
|
||
| if total_time_ms == 0.0: | ||
| return {"MFU": 0.0, "MBU": 0.0} | ||
|
|
||
| mfu_num = 0.0 | ||
| mbu_num = 0.0 | ||
|
|
||
| for r in rows: | ||
| t = float(r["nsys_time_ms"]) | ||
| sm = r["ncu_sm_pct_of_peak"] | ||
| dram = r["ncu_dram_pct_of_peak"] | ||
| if sm == "" or dram == "": | ||
| continue | ||
| sm = float(sm) / 100.0 | ||
| dram = float(dram) / 100.0 | ||
|
|
||
| weight = t / total_time_ms | ||
| mfu_num += sm * weight | ||
| mbu_num += dram * weight | ||
|
|
||
| return {"MFU": mfu_num, "MBU": mbu_num} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation for calculating MFU and MBU is incorrect. It calculates total_time_ms by summing up the time of all kernels, but then it only includes kernels with valid SM and DRAM utilization metrics in the weighted sum. If some kernels failed NCU profiling and have missing metrics, their time is still in the denominator (total_time_ms), but they don't contribute to the numerator (mfu_num, mbu_num). This will lead to an incorrect and underestimated MFU/MBU.
The calculation should be a weighted average over only the kernels for which utilization data is available. I've suggested a refactoring to perform this calculation in a single pass, which is both more correct and slightly more efficient.
def compute_mfu_mbu(rows: List[Dict]) -> Dict[str, float]:
"""
rows: from ncu_kernel_summary_*.csv for a single workload.
Uses NSys time fraction + NCU MFU/MBU per kernel.
"""
total_valid_time_ms = 0.0
mfu_numerator = 0.0
mbu_numerator = 0.0
for r in rows:
sm_str = r.get("ncu_sm_pct_of_peak")
dram_str = r.get("ncu_dram_pct_of_peak")
# Skip kernels that were not successfully profiled by NCU.
if not sm_str or not dram_str:
continue
try:
t_ms = float(r["nsys_time_ms"])
sm_pct = float(sm_str)
dram_pct = float(dram_str)
except (ValueError, TypeError):
# Handle cases where data is malformed.
continue
total_valid_time_ms += t_ms
mfu_numerator += sm_pct * t_ms
mbu_numerator += dram_pct * t_ms
if total_valid_time_ms == 0.0:
return {"MFU": 0.0, "MBU": 0.0}
# The metrics are percentages, so divide by 100 at the end.
mfu = mfu_numerator / total_valid_time_ms / 100.0
mbu = mbu_numerator / total_valid_time_ms / 100.0
return {"MFU": mfu, "MBU": mbu}|
Hi @LinWang-avivia, the pre-commit checks have failed. Please run: uv pip install pre-commit
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
Purpose
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.