Conversation


@linwang-aviva linwang-aviva commented Dec 2, 2025

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after results comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft in the Google Doc.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +374 to +378
mn = metric_name
if "sm__throughput" in mn and "pct_of_peak" in mn:
    metrics["sm_pct_of_peak"] = val
elif "dram__throughput" in mn and "pct_of_peak" in mn:
    metrics["dram_pct_of_peak"] = val


P1: DRAM throughput metric never parsed

The NCU summary rows always leave ncu_dram_pct_of_peak as None because parse_metrics_csv searches for the substring dram__throughput, while the requested metric name is gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed. That name contains dram_throughput with only a single underscore (the double underscore precedes dram), so the elif branch never fires and MBU/DRAM utilization is never captured for any run using the default metrics list (--metrics gpu__dram_throughput...).
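A minimal sketch of the fix this finding implies, shown here as a standalone helper for illustration (in the PR the check lives inline in parse_metrics_csv, and the helper name is invented): match the single-underscore dram_throughput substring so the default gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed metric is captured.

```python
def classify_metric(metric_name: str, val: float, metrics: dict) -> None:
    """Route one NCU metric value into the per-kernel summary dict.

    Note the single underscore in "dram_throughput": the requested metric,
    gpu__dram_throughput.avg.pct_of_peak_sustained_elapsed, never contains
    the "dram__throughput" substring the original code looked for.
    """
    mn = metric_name
    if "sm__throughput" in mn and "pct_of_peak" in mn:
        metrics["sm_pct_of_peak"] = val
    elif "dram_throughput" in mn and "pct_of_peak" in mn:
        metrics["dram_pct_of_peak"] = val
```

Matching on the full requested metric names instead of substrings would be stricter still, at the cost of updating the parser whenever the --metrics list changes.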



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new hardware utilization toolkit with three scripts to automate profiling with Nsight Compute and calculating MFU/MBU. The scripts are well-structured for their purpose. My main feedback is a critical fix in calculate_mfu_mbu.py to ensure the MFU/MBU calculations are correct by properly handling kernels that may have failed profiling.

Comment on lines +17 to +47
def compute_mfu_mbu(rows: List[Dict]) -> Dict[str, float]:
    """
    rows: from ncu_kernel_summary_*.csv for a single workload.
    Uses NSys time fraction + NCU MFU/MBU per kernel.
    """
    # Sum time across all kernels (ms)
    total_time_ms = 0.0
    for r in rows:
        t = float(r["nsys_time_ms"])
        total_time_ms += t

    if total_time_ms == 0.0:
        return {"MFU": 0.0, "MBU": 0.0}

    mfu_num = 0.0
    mbu_num = 0.0

    for r in rows:
        t = float(r["nsys_time_ms"])
        sm = r["ncu_sm_pct_of_peak"]
        dram = r["ncu_dram_pct_of_peak"]
        if sm == "" or dram == "":
            continue
        sm = float(sm) / 100.0
        dram = float(dram) / 100.0

        weight = t / total_time_ms
        mfu_num += sm * weight
        mbu_num += dram * weight

    return {"MFU": mfu_num, "MBU": mbu_num}


critical

The current implementation for calculating MFU and MBU is incorrect. It calculates total_time_ms by summing up the time of all kernels, but then it only includes kernels with valid SM and DRAM utilization metrics in the weighted sum. If some kernels failed NCU profiling and have missing metrics, their time is still in the denominator (total_time_ms), but they don't contribute to the numerator (mfu_num, mbu_num). This will lead to an incorrect and underestimated MFU/MBU.

The calculation should be a weighted average over only the kernels for which utilization data is available. I've suggested a refactoring to perform this calculation in a single pass, which is both more correct and slightly more efficient.

def compute_mfu_mbu(rows: List[Dict]) -> Dict[str, float]:
    """
    rows: from ncu_kernel_summary_*.csv for a single workload.
    Uses NSys time fraction + NCU MFU/MBU per kernel.
    """
    total_valid_time_ms = 0.0
    mfu_numerator = 0.0
    mbu_numerator = 0.0

    for r in rows:
        sm_str = r.get("ncu_sm_pct_of_peak")
        dram_str = r.get("ncu_dram_pct_of_peak")

        # Skip kernels that were not successfully profiled by NCU.
        if not sm_str or not dram_str:
            continue

        try:
            t_ms = float(r["nsys_time_ms"])
            sm_pct = float(sm_str)
            dram_pct = float(dram_str)
        except (ValueError, TypeError):
            # Handle cases where data is malformed.
            continue

        total_valid_time_ms += t_ms
        mfu_numerator += sm_pct * t_ms
        mbu_numerator += dram_pct * t_ms

    if total_valid_time_ms == 0.0:
        return {"MFU": 0.0, "MBU": 0.0}

    # The metrics are percentages, so divide by 100 at the end.
    mfu = mfu_numerator / total_valid_time_ms / 100.0
    mbu = mbu_numerator / total_valid_time_ms / 100.0

    return {"MFU": mfu, "MBU": mbu}
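To see the difference the suggested fix makes, here is a condensed, runnable restatement of the same weighted average with invented sample rows (field names follow the summary CSV; the numbers are illustrative only). A kernel that failed NCU profiling is excluded from both numerator and denominator, so it no longer drags the averages toward zero.

```python
from typing import Dict, List

def compute_mfu_mbu(rows: List[Dict]) -> Dict[str, float]:
    # Weighted average over only the kernels with both utilization metrics.
    total_valid_time_ms = 0.0
    mfu_num = 0.0
    mbu_num = 0.0
    for r in rows:
        sm = r.get("ncu_sm_pct_of_peak")
        dram = r.get("ncu_dram_pct_of_peak")
        if not sm or not dram:
            continue  # kernel was not successfully profiled by NCU
        t = float(r["nsys_time_ms"])
        total_valid_time_ms += t
        mfu_num += float(sm) * t
        mbu_num += float(dram) * t
    if total_valid_time_ms == 0.0:
        return {"MFU": 0.0, "MBU": 0.0}
    # Metrics are percentages, so divide by 100 at the end.
    return {"MFU": mfu_num / total_valid_time_ms / 100.0,
            "MBU": mbu_num / total_valid_time_ms / 100.0}

rows = [
    # profiled kernel: 80% SM, 40% DRAM, 6 ms
    {"nsys_time_ms": "6.0", "ncu_sm_pct_of_peak": "80", "ncu_dram_pct_of_peak": "40"},
    # profiled kernel: 20% SM, 10% DRAM, 2 ms
    {"nsys_time_ms": "2.0", "ncu_sm_pct_of_peak": "20", "ncu_dram_pct_of_peak": "10"},
    # failed profiling: excluded entirely under the fix
    {"nsys_time_ms": "100.0", "ncu_sm_pct_of_peak": "", "ncu_dram_pct_of_peak": ""},
]
out = compute_mfu_mbu(rows)
# Over the 8 ms of valid time: MFU = (80*6 + 20*2) / 8 / 100 = 0.65
```

Under the original code the 100 ms unprofiled kernel would sit in the denominator, collapsing MFU to roughly 0.05.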


mergify bot commented Dec 5, 2025

Hi @linwang-aviva, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.
