@chunhuanMeng chunhuanMeng commented Aug 5, 2025

Part 2 of #1861
On PVC, scoreboard stalls drop from 101,628 to 75,976. Instruction-fetch and distance stalls also fall significantly, enabling higher effective bandwidth to HBM.

| shape | device | before opt | after opt |
| --- | --- | --- | --- |
| [4096, 64, 27, 27] | pvc | 27.10 ms | 12.70 ms |
| [4096, 192, 13, 13] | pvc | 17.97 ms | 8.51 ms |
| [4096, 256, 6, 6] | pvc | 5.10 ms | 2.47 ms |

@Copilot Copilot AI review requested due to automatic review settings August 5, 2025 07:10
@Copilot Copilot AI left a comment

Pull Request Overview

This pull request adds a vectorized path for the max-pool backward operation on channel-last (NHWC) tensors to improve performance. It introduces a new templated kernel that processes several elements at once using vector operations.

  • Refactors existing backward kernel to accumulate gradients locally before writing
  • Adds new vectorized kernel implementation for channel-last memory layout
  • Includes vectorization logic (currently commented out) with macro for launching vectorized kernels
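
The two ideas in the list above (local accumulation and per-work-item vectors over the channel dimension) can be illustrated with a minimal CPU reference sketch. This is not the SYCL kernel from the PR: `PoolParams`, `window_start`, `window_end`, and `max_pool2d_backward_nhwc` are hypothetical names, and the sketch assumes dilation 1, a channel count divisible by `VEC`, and indices stored as the flat offset `h * W + w` within each input plane. Each simulated work item owns `VEC` contiguous channels of one input pixel, sums the matching gradients in a local array, and writes the result once, so no atomic updates are needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical pooling geometry (dilation is assumed to be 1).
struct PoolParams {
  int N, C, H, W;  // input shape, stored as NHWC
  int OH, OW;      // output spatial size
  int kH, kW;      // kernel size
  int sH, sW;      // stride
  int pH, pW;      // padding
};

// First output index whose window can cover input coordinate i.
inline int window_start(int i, int pad, int kernel, int stride) {
  const int x = i + pad;
  return x < kernel ? 0 : (x - kernel) / stride + 1;
}

// One past the last output index whose window can cover input coordinate i.
inline int window_end(int i, int pad, int stride, int out_size) {
  return std::min((i + pad) / stride + 1, out_size);
}

// Each loop iteration plays the role of one work item handling VEC channels.
template <int VEC>
void max_pool2d_backward_nhwc(const std::vector<float>& grad_output,
                              const std::vector<int64_t>& indices,
                              std::vector<float>& grad_input,
                              const PoolParams& p) {
  assert(p.C % VEC == 0);  // assumption: no scalar tail handling in this sketch
  const int c_groups = p.C / VEC;
  const int64_t work_items = int64_t(p.N) * p.H * p.W * c_groups;

  for (int64_t item = 0; item < work_items; ++item) {
    // Decode (n, h, w, channel group) from the linear work-item id.
    const int cg = static_cast<int>(item % c_groups);
    int64_t rest = item / c_groups;
    const int w = static_cast<int>(rest % p.W);
    rest /= p.W;
    const int h = static_cast<int>(rest % p.H);
    const int n = static_cast<int>(rest / p.H);

    float acc[VEC] = {};  // local accumulator, one slot per channel lane
    const int64_t target = int64_t(h) * p.W + w;

    const int oh0 = window_start(h, p.pH, p.kH, p.sH);
    const int oh1 = window_end(h, p.pH, p.sH, p.OH);
    const int ow0 = window_start(w, p.pW, p.kW, p.sW);
    const int ow1 = window_end(w, p.pW, p.sW, p.OW);

    for (int oh = oh0; oh < oh1; ++oh) {
      for (int ow = ow0; ow < ow1; ++ow) {
        // Channels are contiguous in NHWC, so these VEC reads are adjacent.
        const int64_t off =
            ((int64_t(n) * p.OH + oh) * p.OW + ow) * p.C + int64_t(cg) * VEC;
        for (int v = 0; v < VEC; ++v) {
          if (indices[off + v] == target) acc[v] += grad_output[off + v];
        }
      }
    }

    // Single write per element after all covering windows are visited.
    const int64_t in_off =
        ((int64_t(n) * p.H + h) * p.W + w) * p.C + int64_t(cg) * VEC;
    for (int v = 0; v < VEC; ++v) grad_input[in_off + v] = acc[v];
  }
}
```

On a device, the inner `VEC` loops would typically map to a single vector load and store, which is where a channel-last layout helps: adjacent channels of the same pixel sit next to each other in memory.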

@chunhuanMeng chunhuanMeng requested review from jianyizh and toyxu August 11, 2025 05:23
@jianyizh jianyizh requested a review from liangan1 August 13, 2025 02:39
@chuanqi129 chuanqi129 linked an issue Aug 13, 2025 that may be closed by this pull request
Successfully merging this pull request may close these issues.

Maxpooling takes too long on BMG