add vectorization path on maxpool forward channel last #1883
Pull Request Overview
This PR adds a vectorized code path for the max pooling forward operation when using channel-last memory layout, providing significant performance improvements on Intel GPU architectures. The optimization uses vectorized memory operations and SYCL kernels to improve throughput.
Key changes:
- Introduces a new vectorized kernel, `MaxPool2dChannelLastVec`, that processes multiple channels simultaneously
- Adds automatic vector size selection (8, 4, 2, or 1) based on data alignment and hardware capabilities
- Implements dynamic work group sizing based on hardware thread availability
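As a rough illustration of the vector size selection described above, the sketch below picks the widest width from 8, 4, 2, or 1 that both divides the channel count and matches the alignment of the data pointer. The function name `pick_vec_size` and its parameters are hypothetical and are not taken from this PR; the hardware-capability check mentioned in the overview is not modeled here.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the PR's actual code): choose the widest vector
// width, in elements, that is safe for a channel-last tensor.
//   base_addr    - address of the first element, as an integer
//   elem_bytes   - size of one element in bytes (e.g. 4 for float)
//   num_channels - size of the innermost (channel) dimension
inline int pick_vec_size(std::uintptr_t base_addr,
                         std::size_t elem_bytes,
                         std::int64_t num_channels) {
  constexpr int candidates[] = {8, 4, 2};
  for (int vec : candidates) {
    // A vector load of `vec` elements needs vec * elem_bytes alignment.
    const bool aligned = base_addr % (vec * elem_bytes) == 0;
    // Every pixel's channel run must split evenly into whole vectors,
    // otherwise a vector would straddle two pixels.
    const bool divisible = num_channels % vec == 0;
    if (aligned && divisible) {
      return vec;
    }
  }
  return 1;  // Scalar fallback: always valid.
}
```

With 256 channels of `float` and a 32-byte-aligned pointer this returns 8; an odd channel count always falls back to 1, which corresponds to a non-vectorized path.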
Co-authored-by: Copilot <[email protected]>
Follows #1883. For the channel-last input shape [4096,256,6,6] with output shape [6,6] from TorchBench AlexNet, this gives a ~4x improvement on BMG.
---------
Co-authored-by: Copilot <[email protected]>
Part 1 of #1861
Tested on shapes from AlexNet training.
On BMG, scoreboard stalls decrease from 831,719 to 497,098. Instruction fetch and distance stalls also improve.