add vectorization path on maxpool forward channel last #1883
Pull Request Overview
This PR adds a vectorized code path for the max pooling forward operation when using channel-last memory layout, providing significant performance improvements on Intel GPU architectures. The optimization uses vectorized memory operations and SYCL kernels to improve throughput.
Key changes:
- Introduces a new vectorized kernel, MaxPool2dChannelLastVec, that processes multiple channels simultaneously
- Adds automatic vector size selection (8, 4, 2, or 1) based on data alignment and hardware capabilities
- Implements dynamic work group sizing based on hardware thread availability
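The vector size selection described in the list above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the PR's code: the helper name pick_vec_size and its parameters are hypothetical, the alignment-permitted width is assumed to be a power of two, and a width is assumed usable only when it divides the channel count evenly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the PR): choose the widest vector size from
// {8, 4, 2, 1} that the alignment allows and that divides the channel count.
// `max_aligned_vec` stands in for whatever an alignment probe reports and is
// assumed to be a power of two.
inline int pick_vec_size(int max_aligned_vec, std::int64_t num_channels) {
  assert(max_aligned_vec >= 1 && num_channels >= 1);
  int vec_size = std::min(8, max_aligned_vec);
  // Halve until the channels of every pixel split into whole vectors.
  while (vec_size > 1 && num_channels % vec_size != 0) {
    vec_size /= 2;
  }
  return vec_size;
}
```

For example, 12 channels with 8-wide alignment would fall back to a width of 4 in this sketch, and an odd channel count always ends at the scalar width of 1.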
load_offset = batch * inputSizeH_*inputSizeW_*numPlane_ / vec_size + plane +
    h * inputSizeW_ * numPlane_ / vec_size + w * numPlane_ / vec_size;
[nitpick] The complex offset calculation spans multiple lines and lacks spacing around operators. Consider breaking this into intermediate variables or adding consistent spacing (e.g., 'inputSizeH_ * inputSizeW_ * numPlane_').
Suggested change
- load_offset = batch * inputSizeH_*inputSizeW_*numPlane_ / vec_size + plane +
-     h * inputSizeW_ * numPlane_ / vec_size + w * numPlane_ / vec_size;
+ int64_t batch_offset = batch * inputSizeH_ * inputSizeW_ * numPlane_ / vec_size;
+ int64_t plane_offset = plane;
+ int64_t height_offset = h * inputSizeW_ * numPlane_ / vec_size;
+ int64_t width_offset = w * numPlane_ / vec_size;
+ load_offset = batch_offset + plane_offset + height_offset + width_offset;
Copilot uses AI. Check for mistakes.
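To make the offset arithmetic in this thread easier to check, here is a standalone sketch of the same channels-last (NHWC) indexing in vector units. The function names and parameters are hypothetical and not from the PR. It assumes the channel count is an exact multiple of vec_size, in which case dividing numPlane_ once up front gives the same result as dividing each term.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical standalone version of the offset math for an NHWC tensor
// viewed as an array of vec_size-wide vectors. `plane_vec` is the channel
// index already expressed in vector units.
inline std::int64_t vec_load_offset(
    std::int64_t batch, std::int64_t h, std::int64_t w, std::int64_t plane_vec,
    std::int64_t size_h, std::int64_t size_w, std::int64_t num_plane,
    std::int64_t vec_size) {
  // Assumption: channels divide evenly into vectors, so the division is exact.
  assert(vec_size > 0 && num_plane % vec_size == 0);
  const std::int64_t planes_vec = num_plane / vec_size;  // vectors per pixel
  const std::int64_t batch_offset = batch * size_h * size_w * planes_vec;
  const std::int64_t height_offset = h * size_w * planes_vec;
  const std::int64_t width_offset = w * planes_vec;
  return batch_offset + height_offset + width_offset + plane_vec;
}

// Scalar NHWC element offset, for comparison with the vector offset.
inline std::int64_t scalar_nhwc_offset(
    std::int64_t batch, std::int64_t h, std::int64_t w, std::int64_t c,
    std::int64_t size_h, std::int64_t size_w, std::int64_t num_plane) {
  return ((batch * size_h + h) * size_w + w) * num_plane + c;
}
```

Under the divisibility assumption, the vector offset times vec_size equals the scalar offset of the first channel in that vector, which is a quick sanity check for any refactoring of the expression.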
#pragma unroll
for (int i = 0; i < vec_size; i++) {
  if ((static_cast<scalar_t>(val_vec[i]) > maxVal_vec[i]) ||
      at::_isnan(val_vec[i])) {
This code uses the private 'at::_isnan' function instead of the standard 'std::isnan'. Consider using 'std::isnan' for better portability and standards compliance.
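For context on why the condition carries an explicit NaN check at all, the following host-side sketch shows a NaN-propagating lane-wise max update. It is an illustration only: update_max is a hypothetical name, and std::isnan on plain float stands in for whichever NaN test the kernel uses for its scalar types.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical host-side sketch of a NaN-propagating lane-wise max update.
template <std::size_t N>
inline void update_max(std::array<float, N>& max_val,
                       const std::array<float, N>& val) {
  for (std::size_t i = 0; i < N; ++i) {
    // `val > max` is false whenever val is NaN, so the explicit check is
    // what lets a NaN input overwrite the running maximum.
    if (val[i] > max_val[i] || std::isnan(val[i])) {
      max_val[i] = val[i];
    }
  }
}
```

Once a lane holds NaN, no later finite value replaces it, because any comparison against NaN is false.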
int64_t num_wg;
if constexpr (is_channels_last) {
  for (vec_size =
           std::min(8, memory::can_vectorize_up_to<scalar_t>((char*)input));
C-style cast to 'char*' should be replaced with a safer C++ cast like 'reinterpret_cast<char*>(input)' for better type safety and clarity.
Suggested change
- std::min(8, memory::can_vectorize_up_to<scalar_t>((char*)input));
+ std::min(8, memory::can_vectorize_up_to<scalar_t>(reinterpret_cast<char*>(input)));
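As a rough picture of what an alignment probe of this kind might check, and of how reinterpret_cast makes the pointer conversion explicit, consider the sketch below. It is a hypothetical stand-in, not the implementation of memory::can_vectorize_up_to: the name aligned_vec_width, its signature, and the power-of-two search are all assumptions.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an alignment probe (not the library's code):
// report the widest power-of-two element count, capped at `max_vec`, for
// which `ptr` is aligned to a full vector of T.
template <typename T>
inline int aligned_vec_width(const void* ptr, int max_vec = 8) {
  // reinterpret_cast states the pointer-to-integer conversion explicitly,
  // where a C-style cast would hide which conversion is being performed.
  const auto address = reinterpret_cast<std::uintptr_t>(ptr);
  for (int width = max_vec; width > 1; width /= 2) {
    if (address % (sizeof(T) * static_cast<std::size_t>(width)) == 0) {
      return width;
    }
  }
  return 1;  // scalar access is always possible
}
```

With a 32-byte-aligned float buffer, this sketch reports 8 for the base pointer and a smaller width for pointers offset by a few elements.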
Co-authored-by: Copilot <[email protected]>
Part 1 of #1861
Tested on shapes from AlexNet training.
On BMG, scoreboard stalls decrease from 831,719 to 497,098 (about 40% fewer). Instruction fetch and distance stalls also improve.