vulkan: optimizations for direct convolution #14933
Conversation
- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat (see the sketch after this list).
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply the fastdiv optimization.
- Disable shuffles for NV.
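For context on the bank-conflict point, here is a minimal standalone sketch of the arithmetic (not the shader code itself): in a row-major shared-memory tile whose row stride is a multiple of the bank count, every element of a column lands in the same bank, so threads reading down a column serialize; padding the stride spreads the column across banks. It assumes the common layout of 32 banks of 4-byte words, and the BS_CRS/SHMEM_PAD values below are illustrative defaults, not the exact layout of conv2d_mm.comp.

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

// Bank of a 4-byte shared-memory element, assuming 32 banks of 4-byte words.
static uint32_t bank_of(uint32_t elem_index) { return elem_index % 32; }

// How many distinct banks one column of a row-major tile touches across 32 rows.
static size_t distinct_banks(uint32_t row_stride, uint32_t col) {
    std::set<uint32_t> banks;
    for (uint32_t row = 0; row < 32; ++row) {
        banks.insert(bank_of(row * row_stride + col));
    }
    return banks.size();
}

int main() {
    const uint32_t BS_CRS    = 32; // unpadded row stride in floats (a multiple of 32)
    const uint32_t SHMEM_PAD = 4;  // 4 floats = 16 bytes of padding per row
    // 1 distinct bank means a full 32-way conflict; more distinct banks means less serialization.
    printf("stride %2u: column 0 touches %zu bank(s)\n", BS_CRS, distinct_banks(BS_CRS, 0));
    printf("stride %2u: column 0 touches %zu bank(s)\n", BS_CRS + SHMEM_PAD, distinct_banks(BS_CRS + SHMEM_PAD, 0));
    return 0;
}
```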
On my rtx 2070 mobile:
sd1 vae 512x768 before:
pr:
This does not look good in practice with sd.cpp. I did not run the sampling with this patch. |
Please share the command line you used for sd. Maybe this is the same as the 4070 regression, I'll look into it. |
should be the same with the base f16.safetensors file. |
Another example using sd2 https://huggingface.co/Green-Sky/SD-Turbo-GGUF/blob/main/sd_turbo-f16-q8_0.gguf before:
after:
edit: im2col+mat_mul for comparison
|
Did you use coopmat/coopmat2 for the im2col comparison? The 5090 reaches 163.77 TFLOPS, but Wikipedia says it can only do 104.8 TFLOPS fp16/fp32. Otherwise there is a bug in the perf code for graphs. |
Yes, I kept coopmat2 enabled for the im2col test. |
I decorated the conv calls to log the tensor shapes: For sd2 (same gguf as above). UNET
VAE
|
@Green-Sky do I need to make any manual edits to use direct convolution? I'm using your |
Right. You need leejet/stable-diffusion.cpp#744 currently. |
Perf numbers from my side:
master:
pr:
|
I can see a similar 25%-ish slowdown on 4070:
I'll start by looking at the test-backend-ops test. |
I looked at the backend tests. The larger tile size is better for very large convolutions (like the first test). I think I need to allow for multiple tile sizes, and doing so should allow for further gains for some of the other shapes. I'll do some more experimentation. |
I pushed another commit that has three tile sizes - the original, the one I had used in the previous commit, and a third for smaller K values. It chooses between them based on K/NPQ sizes and number of SMs. This restores the performance in the directed tests:
This also restores the performance in sd.cpp according to the debug message, which doesn't seem to be very precise. The GGML_VK_PERF_LOGGER output is more clear, for example this is previous TOT ggml vs this updated PR on 5090:
Many of the buckets show a similar improvement to the first directed test (30 TFLOPS -> 43 TFLOPS), and the two buckets with small K show even bigger improvements. The gains on 4070 are smaller. The tile selection logic is no longer NV-specific. I'd appreciate some testing on other hardware. If it doesn't help, I can restrict the changes again. I think we still don't have a way to query number of SMs for Intel, but the logic is written in a way where it'll then just base it on K/NPQ. |
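The selection heuristic described above amounts to something like the following sketch. It is simplified from the host-code diff that appears later in this thread; the names, thresholds, and fallback core count are illustrative, not the exact ggml-vulkan code.

```cpp
#include <cstdint>
#include <cstdio>

// K = output channels (rows of the implicit matmul), NPQ = batch * output pixels
// (columns). A tile shape is only worth using if it still yields enough workgroups
// to occupy all SMs/CUs.
enum conv_shape { CONV_SHAPE_128x128, CONV_SHAPE_64x32, CONV_SHAPE_32x256 };

static uint32_t ceil_div(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

static conv_shape select_conv_shape(uint32_t K, uint32_t NPQ, uint32_t shader_core_count) {
    // Workgroups needed to cover a K x NPQ output with each candidate tile.
    const uint32_t tiles_128x128 = ceil_div(K, 128) * ceil_div(NPQ, 128);
    const uint32_t tiles_32x256  = ceil_div(K, 32)  * ceil_div(NPQ, 256);
    // If the SM/CU count cannot be queried (e.g. Intel), assume a default
    // (the thread later settles on 32).
    const uint32_t cores = shader_core_count > 0 ? shader_core_count : 32;

    if (K > 64 && tiles_128x128 >= cores * 2) {
        return CONV_SHAPE_128x128;   // large K and plenty of workgroups: big tile
    }
    if (K <= 32 && tiles_32x256 >= cores * 2) {
        return CONV_SHAPE_32x256;    // small K: tall-and-skinny tile
    }
    return CONV_SHAPE_64x32;         // otherwise the small default tile fills the GPU better
}

int main() {
    printf("%d\n", select_conv_shape(4096, 4096, 128));  // expect the 128x128 tile (prints 0)
    printf("%d\n", select_conv_shape(32, 65536, 128));   // expect the 32x256 tile (prints 2)
    return 0;
}
```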
New numbers time.
master:
pr:
Most are now faster. sd2 vae decode (same as above): master:
pr:
Looks better now. The difference between the master numbers now and before is pretty high, so I should probably bench this on a clean reboot with nothing else open. Will try different models tomorrow. I'm a fan of your work btw, on vulkan generally, but also the llama.cpp/ggml stuff specifically (: |
I tested it with my ancient GTX 1060 GPU, and surprisingly, this PR is much faster in some cases (except for the 4096 by 4096 case). I did not test which specific improvements are responsible for the performance gains, but I expect that fastdiv is a better option than shuffle. I was also considering adding fastdiv, but since the warp shuffle allowed computing divisions only once per thread per blocktile, I did not expect much gain from it. Disadvantages of shuffle:
I suspect this behavior is not unique to Nvidia, and fastdiv performs better than shuffle on old non-Nvidia devices as well. If this is confirmed, then it would be a good idea to drop shuffle for fastdiv.
|
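For reference, the fastdiv optimization discussed here replaces a division by a divisor that is constant per dispatch with a multiply-high and a shift, using magic constants precomputed on the host (the classic Granlund-Montgomery trick). A standalone sketch of the arithmetic, assuming 32-bit operands; this is not the actual shader code.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Precomputed "magic" constants for dividing by a fixed d (1 <= d < 2^31).
struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift amount, ceil(log2(d))
};

// Host side: compute mp and L once per divisor and pass them to the shader
// (e.g. as push constants), instead of doing an integer division per element.
static fastdiv_vals init_fastdiv_values(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint32_t(1) << L) < d) {
        ++L;
    }
    uint32_t mp = uint32_t(((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
    return { mp, L };
}

// Shader side (written here in C++): n / d == (mulhi(n, mp) + n) >> L.
static uint32_t fastdiv(uint32_t n, fastdiv_vals f) {
    uint64_t hi = (uint64_t(n) * f.mp) >> 32;    // high 32 bits of the 64-bit product
    return uint32_t((hi + n) >> f.L);            // add and shift, done in 64-bit here
}

int main() {
    for (uint32_t d : { 3u, 7u, 640u, 4096u }) {
        const fastdiv_vals f = init_fastdiv_values(d);
        for (uint32_t n : { 0u, 1u, d - 1, d, 123456789u, 4294967295u }) {
            assert(fastdiv(n, f) == n / d);
        }
    }
    return 0;
}
```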
Performance numbers on my laptop look really nice with this PR. Ryzen 7 4700U (Radeon Vega 7 iGPU), 16 GB RAM
Master:
PR:
|
Here's a run on my RX 470, with everything being faster than master except for the PR:
PR with no collectives:
Master:
I also have a 512x512 SDXL run with all three:
- PR: 1.03 s/it sampling, 1.33 s decoding
- PR with no collectives: 1.05 s/it sampling, 1.38 s decoding
- Master: 1.09 s/it sampling, 1.41 s decoding |
Motivated by #14933 (comment), I tested this PR with collectives on/off on a GTX 1060 (Notebook). It seems that the effects of collectives (warp shuffle) and the contributions of this PR can add up, even for the seemingly already-peaking 4096x4096 test case, where the combination of the two reaches 2.08-2.10 TFLOPS, which I could not achieve before. PS: collectives also helped on this Nvidia card before this PR, I guess because of the poor idiv throughput. llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp, line 3103 at 136ecfb
|
Integer math was slower before Turing; I'll make a change to keep collectives enabled for older NVIDIA GPUs. |
Latest commit reenables collectives for pre-Turing. |
Does it hurt the performance on recent devices, or why is it better to selectively enable/disable? |
On my 5090 it's like 25% slower to enable collectives. |
Do you know why that is? Shouldn't subgroup ops be pretty quick? |
Shuffle is not nearly as fast as integer math on recent GPUs, and to some extent competes with shared/global memory accesses (e.g. see https://forums.developer.nvidia.com/t/whats-the-difference-between-mio-and-lsu-instruction-queue-in-volta-architecture/124749) which are also a bottleneck in this shader. |
Here are results from my hardware: Nvidia RTX 3090
AMD Radeon Pro VII
Intel A770
There's an issue with Intel: the new results are worse than the old ones for large inputs. Any ideas? Everything else looks fine. |
Can you try forcing CONV_SHAPE_128x128 for intel? That should be closer to the old behavior. |
Oh wait, the large convolutions are probably already using the large tile size. But maybe still worth verifying that's happening. |
Other possibilities might be that the shared memory stride padding or loop unrolling are harmful on Intel. |
Then it definitely makes sense to disable. Now it would be good to know if this also holds for recent AMD/Intel/Adreno devices, and disable it there as well. |
Did you try with disabled collectives on AMD? |
#ifdef USE_COLLECTIVES
#extension GL_KHR_shader_subgroup_shuffle : enable
#endif

#include "types.comp"

// Make spec constant
-#define SHMEM_PAD 0
+#define SHMEM_PAD 4
If we use padding, wouldn't it be better if we converted this to spec constant to make sure that its value is consistent with the host code?
Yeah, it's a good idea. I'll change it.
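For context, a minimal host-side sketch of what a specialization constant for the pad could look like, so the shader value can never drift from the host value. The constant_id, struct, and function names are illustrative, not the actual ggml-vulkan plumbing (which routes its parameters through its own pipeline-creation helpers).

```cpp
#include <cstdint>
#include <cstdio>
#include <vulkan/vulkan.h>

// GLSL side, for reference (constant_id 0 is an illustrative choice):
//   layout(constant_id = 0) const uint SHMEM_PAD = 4;

// Host-side storage for the specialization data. The struct must outlive
// pipeline creation, because VkSpecializationInfo stores pointers into it.
struct shmem_pad_spec {
    uint32_t                 value;
    VkSpecializationMapEntry entry;
    VkSpecializationInfo     info;
};

static void init_shmem_pad_spec(shmem_pad_spec & s, uint32_t pad) {
    s.value = pad;
    s.entry = {};
    s.entry.constantID = 0;                 // matches constant_id in the shader
    s.entry.offset     = 0;                 // offset of `value` inside pData
    s.entry.size       = sizeof(uint32_t);
    s.info = {};
    s.info.mapEntryCount = 1;
    s.info.pMapEntries   = &s.entry;
    s.info.dataSize      = sizeof(uint32_t);
    s.info.pData         = &s.value;
}

int main() {
    shmem_pad_spec s;
    init_shmem_pad_spec(s, 4);
    // When creating the compute pipeline, point the shader stage at it:
    //   VkPipelineShaderStageCreateInfo stage = { ... };
    //   stage.pSpecializationInfo = &s.info;
    printf("constant_id %u will be specialized to %u\n", s.entry.constantID, s.value);
    return 0;
}
```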
When I worked on mat-vec optimizations, shared memory was actually faster than subgroup shuffles on GCN cards, even though in theory it should be the opposite. These things don't always make sense, and I guess the only way to find out is to get the hardware and test it. |
Hmm, GCN5 has single-cycle integer multiplication, so there's a chance it'll be faster with no collectives. |
#14982 is a follow-on change to use coopmat2 in this shader. |
Here's another example where the first test is negatively affected. I'm doing some more tests on Intel and AMD. AMD RX 6800 XT
Edit: This one is easily solved by disabling subgroup shuffle. I added the results from that to the details. That means that they should also be disabled at least on AMD RDNA. I'll check GCN again. |
Here's a diff that gets good performance on Intel:
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 6647b1cc2..9b3feb9ce 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -3093,7 +3093,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
uint32_t use_collectives = 0; // Enables subgroup ops for preventing the re-calculation of indices.
uint32_t conv2d_BS_NPQ = 128;
uint32_t conv2d_TS_K = 8;
- uint32_t conv2d_SHMEM_PAD = 4;
+ uint32_t conv2d_SHMEM_PAD = 0;
switch (s) {
default:
@@ -7060,9 +7061,10 @@ static vk_pipeline ggml_vk_op_get_pipeline(ggml_backend_vk_context * ctx, const
for (uint32_t i = 0; i < CONV_SHAPE_COUNT; ++i) {
tiles[i] = CEIL_DIV(elements[0], ctx->device->pipeline_conv2d_f32[i]->wg_denoms[0]) * CEIL_DIV(elements[1], ctx->device->pipeline_conv2d_f32[i]->wg_denoms[1]);
}
- if (elements[0] > 64 && tiles[CONV_SHAPE_128x128] >= ctx->device->shader_core_count * 2) {
+ const uint32_t shader_core_count = ctx->device->shader_core_count > 0 ? ctx->device->shader_core_count : 32;
+ if (elements[0] > 64 && tiles[CONV_SHAPE_128x128] >= shader_core_count * 2) {
shape = CONV_SHAPE_128x128;
- } else if (elements[0] <= 32 && tiles[CONV_SHAPE_32x256] >= ctx->device->shader_core_count * 2) {
+ } else if (elements[0] <= 32 && tiles[CONV_SHAPE_32x256] >= shader_core_count * 2) {
shape = CONV_SHAPE_32x256;
} else {
shape = CONV_SHAPE_64x32;
diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp b/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp
index 32bd9d4d6..69494c119 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp
@@ -202,7 +202,7 @@ void main() {
Ash[B_ly * Ash_stride + B_lx] = val;
}
/* Load input to B_block: (BS_CRS x BS_NPQ) */
- [[unroll]] for (uint32_t r_offset = 0; r_offset < BS_CRS; r_offset += BrpWg) {
+ for (uint32_t r_offset = 0; r_offset < BS_CRS; r_offset += BrpWg) {
uint32_t B_ly = r_offset + Br; /* Row index of B block */
uint32_t B_lx = Bc;
uint32_t NPQ_idx = B_idx_NPQ * BS_NPQ + B_lx; /* Global NPQ index (column index of B) */
@@ -248,7 +248,7 @@ void main() {
}
barrier();
if (T_y * TS_K < K) {
- [[unroll]] for (uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++) {
+ for (uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++) {
for (uint32_t T_ly = 0; T_ly < TS_K; T_ly++) {
regA[T_ly] = Ash[(T_y * TS_K + T_ly) * Ash_stride + CRS_lidx];
                     }
The problems are:
- the 16B shared memory padding (conv2d_SHMEM_PAD = 4) hurts performance on Intel,
- the forced loop unrolling hurts performance on Intel,
- shader_core_count is 0 on Intel, so the tile-selection heuristic degenerates; substituting a default (32 here) restores the intended selection.
|
Does AMD benefit from the unrolling? I retested and it seems like the first unroll doesn't make a difference on NV, and the second one is in code that will be replaced by coopmat2. So we could potentially just remove the forced unrolling if AMD performance is OK.
I had tried to write it in a way where it will continue to use the original tile size on Intel. But it sounds like there are benefits to the other tile sizes on Intel, so I think a hardcoded constant like 32 is OK for now. |
Actually, it looks like there's a small gain. I'm going to try multiple variants and only do unrolling for smaller loop counts; we'll see how that works across all HW. |
This didn't work out as cleanly as I had hoped, even on NV. So I just did variants with/without unrolling selected based on device type. I think we just need a decision on whether to enable unrolling and/or collectives on AMD. |
Thank you, now it works well on Intel. I ran further benchmarks on AMD Radeon Pro VII and RX 6800 XT; this is what I came up with:
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 5898bf8df..2cd32fbb5 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -3099,6 +3099,8 @@ static void ggml_vk_load_shaders(vk_device& device) {
if (device->vendor_id == VK_VENDOR_ID_INTEL) {
conv2d_SHMEM_PAD = 0;
conv2d_UNROLL = false;
+ } else if (device->vendor_id == VK_VENDOR_ID_AMD) {
+ conv2d_SHMEM_PAD = device->architecture == vk_device_architecture::AMD_GCN ? 1 : 4;
}
switch (s) {
@@ -3107,6 +3109,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
conv2d_BS_K = 128;
conv2d_BS_NPQ = 128;
conv2d_BS_CRS = 16;
+ if (device->vendor_id == VK_VENDOR_ID_AMD && device->architecture != vk_device_architecture::AMD_GCN) {
+ conv2d_UNROLL = false;
+ }
break;
case CONV_SHAPE_64x32:
conv2d_BS_K = 64;
@@ -3121,13 +3126,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
break;
}
- // Use collectives on pre-Turing NVIDIA GPUs, which had slower integer math.
+ // Use collectives on pre-Turing NVIDIA GPUs and GCN AMD cards, which had slower integer math.
bool allow_collectives_nv = device->vendor_id != VK_VENDOR_ID_NVIDIA ||
device->architecture == vk_device_architecture::NVIDIA_PRE_TURING;
+ bool allow_collectives_amd = device->vendor_id != VK_VENDOR_ID_AMD ||
+ device->architecture == vk_device_architecture::AMD_GCN;
if (device->subgroup_shuffle &&
device->vendor_id != VK_VENDOR_ID_INTEL && // Do not enable collectives on Intel, see PR 14316.
- allow_collectives_nv) {
+ allow_collectives_nv &&
+ allow_collectives_amd) {
use_collectives = 1;
conv2d_BS_CRS = std::min(
             device->subgroup_size,
There seems to be a difference in whether unrolling helps on RDNA2 depending on the shader size. It helps with the small variants and slows the large tests slightly, which is why I added the if statement in the switch case. I'm not sure what the reason for this behaviour is. |
Co-authored-by: 0cc4m <[email protected]>
LGTM, all my GPUs get a boost from this now.
I haven't looked into why the first test case (// K=CRS=NPQ=4096 conv2d matmul performance) is slower on 4070. That's the one that seems most likely to benefit from coopmat, so I'd prefer to wait until we add coopmat support to worry about that.
Here's a comparison to the im2col path using #14833. All test cases except the first are faster than the im2col path.
cc @etasnadi