Skip to content

add vectorization path on maxpool forward channel last #1883

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jianyizh
Copy link
Contributor

@jianyizh jianyizh commented Jul 28, 2025

Part 1 of #1861
tested on shapes from alexnet training
on BMG, 831719 Scoreboard stalls decrease to 497,098. instruction fetch and distance stall also get better.

shape device before opt after opt
[4096, 64, 55, 55] pvc 8.02 ms 5.44 ms
[4096, 64, 55, 55] bmg 12.45 ms 8.89 ms
[4096, 192, 27, 27] pvc 5.72 ms 3.85 ms
[4096, 192, 27, 27] bmg 9.00 ms 5.06 ms
[4096, 256, 13, 13] pvc 1.68 ms 1.12 ms
[4096, 256, 13, 13] bmg 2.83 ms 1.35 ms

@Copilot Copilot AI review requested due to automatic review settings July 28, 2025 08:56
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a vectorized code path for the max pooling forward operation when using channel-last memory layout, providing significant performance improvements on Intel GPU architectures. The optimization uses vectorized memory operations and SYCL kernels to improve throughput.

Key changes:

  • Introduces a new vectorized kernel MaxPool2dChannelLastVec that processes multiple channels simultaneously
  • Adds automatic vector size selection (8, 4, 2, or 1) based on data alignment and hardware capabilities
  • Implements dynamic work group sizing based on hardware thread availability

Comment on lines +187 to +188
load_offset = batch * inputSizeH_*inputSizeW_*numPlane_ / vec_size + plane +
h * inputSizeW_ * numPlane_ / vec_size + w * numPlane_ / vec_size;
Copy link
Preview

Copilot AI Jul 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] The complex offset calculation spans multiple lines and lacks spacing around operators. Consider breaking this into intermediate variables or adding consistent spacing (e.g., 'inputSizeH_ * inputSizeW_ * numPlane_').

Suggested change
load_offset = batch * inputSizeH_*inputSizeW_*numPlane_ / vec_size + plane +
h * inputSizeW_ * numPlane_ / vec_size + w * numPlane_ / vec_size;
int64_t batch_offset = batch * inputSizeH_ * inputSizeW_ * numPlane_ / vec_size;
int64_t plane_offset = plane;
int64_t height_offset = h * inputSizeW_ * numPlane_ / vec_size;
int64_t width_offset = w * numPlane_ / vec_size;
load_offset = batch_offset + plane_offset + height_offset + width_offset;

Copilot uses AI. Check for mistakes.

#pragma unroll
for (int i = 0; i < vec_size; i++) {
if ((static_cast<scalar_t>(val_vec[i]) > maxVal_vec[i]) ||
at::_isnan(val_vec[i])) {
Copy link
Preview

Copilot AI Jul 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the private '_isnan' function instead of the standard 'std::isnan'. Consider using 'std::isnan' for better portability and standards compliance.

Copilot uses AI. Check for mistakes.

int64_t num_wg;
if constexpr (is_channels_last) {
for (vec_size =
std::min(8, memory::can_vectorize_up_to<scalar_t>((char*)input));
Copy link
Preview

Copilot AI Jul 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C-style cast to 'char*' should be replaced with a safer C++ cast like 'reinterpret_cast<char*>(input)' for better type safety and clarity.

Suggested change
std::min(8, memory::can_vectorize_up_to<scalar_t>((char*)input));
std::min(8, memory::can_vectorize_up_to<scalar_t>(reinterpret_cast<char*>(input)));

Copilot uses AI. Check for mistakes.

Co-authored-by: Copilot <[email protected]>
@jianyizh jianyizh requested review from xytintel and EikanWang July 28, 2025 09:15
@jianyizh jianyizh requested a review from liangan1 August 13, 2025 02:35
@chuanqi129 chuanqi129 linked an issue Aug 13, 2025 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Maxpooling takes too long on BMG
2 participants