ggml-cuda: Vulkan direct conv 2D ported to CUDA #16088
base: master
Conversation
* Extra: reduces bank conflicts
@Green-Sky Can you check it?
sd.cpp relies exclusively on f16 kernels.
May I suggest using |
This would make things easy for me, yes. BTW, forgot to thank you @etasnadi for working on this :) ... even though we now have 3 competing PRs, more or less.
I'll close my PR. This one is way better :)
Same, closing mine too.
Maybe you can add a parallel PR based on this for f16?
I'm new to CUDA but I'd love to give this a shot, @Green-Sky @etasnadi. If the fp16 isn't super urgent, I can take a crack at it in the next week or two.
Can you give me a list of what parts of the code you changed relative to the Vulkan version, if any? Some things like fastdiv and how to retrieve the SM count have equivalents in the CUDA backend. But if this is just a copy-paste of the Vulkan code I would preferably change as little as possible.
Can you give any reference for doing fastdiv/sm_count() in the proper ggml-cuda way? I will refactor then. Only the necessary things are changed compared to Vulkan, but they are significant.
IMO it already changes as little as possible compared to Vulkan.
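(For readers following along: below is a rough sketch of the usual magic-multiplier fastdiv pattern and of querying the SM count through the plain CUDA runtime. The names and exact formulas are illustrative assumptions and are not claimed to match the actual ggml-cuda helpers.)

```cuda
// Illustrative fastdiv sketch (hypothetical names, not the exact ggml-cuda API).
// Integer division by a divisor known on the host is replaced by a multiply-high
// and a shift on the device (Granlund-Montgomery style).
#include <cuda_runtime.h>
#include <cstdint>

struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift = ceil(log2(d))
    uint32_t d;  // original divisor, kept for computing remainders
};

// Host side: precompute the magic values for a divisor d >= 1.
static fastdiv_vals make_fastdiv(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint32_t(1) << L) < d) {
        L++;
    }
    const uint32_t mp = (uint32_t) (((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
    return { mp, L, d };
}

// Device side: n / d. The add-and-shift form assumes n < 2^31,
// which comfortably covers tensor index ranges.
__device__ __forceinline__ uint32_t fastdiv(uint32_t n, fastdiv_vals v) {
    return (__umulhi(n, v.mp) + n) >> v.L;
}

__device__ __forceinline__ uint32_t fastmod(uint32_t n, fastdiv_vals v) {
    return n - fastdiv(n, v) * v.d;
}

// SM count via the CUDA runtime; the backend can cache this per device.
static int sm_count(int device) {
    int nsm = 0;
    cudaDeviceGetAttribute(&nsm, cudaDevAttrMultiProcessorCount, device);
    return nsm;
}
```

The point of the trick is that divisors such as OW, OH or C are known on the host, so the per-element index math on the device avoids hardware integer division.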
For a |
@bssrdf Do you want to contribute the ggml-cuda-conformant fastdiv as a patch to my branch, or in a separate PR, so everyone gets authorship credit for their conv2d work?
@etasnadi, I can give it a try. Will do a patch on your branch.
In the f16 PR by @bssrdf, we found that sd.cpp crashes with this PR. I double checked by forcing sd.cpp to use f32 for the kernel without @bssrdf's PR.
@Green-Sky, it may be due to my changes. I'll investigate.
I redid the test without your changes, and the issue was the same, as I state right there.
@Green-Sky, without my change, it will fall back to using the slow direct version. Did it even fail there?
I patched sd.cpp to cast the kernel to f32, so it would fall back:
- x = ggml_conv_2d_direct(ctx, w, x, s0, s1, p0, p1, d0, d1);
+ x = ggml_conv_2d_direct(ctx, ggml_cast(ctx, w, GGML_TYPE_F32), x, s0, s1, p0, p1, d0, d1);
I guess I should have made my report more wordy (:
edit: fun side note, it seems like with the current naive fallback and f32 kernel, sd.cpp vae decode is ever so slightly faster (~35s vs ~33s)
So what is the current state of convolution? Is this PR in its current state something that the three of you can agree should be reviewed and merged?
@bssrdf added the missing parts for this PR in etasnadi#1. We had an issue that seems to be fixed now. If it is ready, I will update the branch behind this PR to be merged (so it will contain commits authored by both of us). Are you going to squash into two commits to preserve @bssrdf's commit, or do you plan to create one commit with Co-authored-by when merging? I think @mnehete32 will re-open his PR and rebase their code onto this to add Tensor Cores support.
Pull requests in llama.cpp/ggml are always squashed to a single commit.
My testing so far showed this PR is still lagging behind im2col+gemm in terms of performance, by several times in some cases. The only benefit is memory savings. We still have a memory access issue to be resolved. Down the road, I think applications using the conv_2d op, like sd.cpp, should choose either im2col or implicit depending on problem size. For small activation dimensions, im2col is still the way to go.
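(To make the memory-savings point concrete, here is a back-of-the-envelope, host-only sketch of the temporary matrix that im2col materializes before the GEMM and that the direct/implicit kernels avoid. The shapes are made up, roughly VAE-like, and are not taken from sd.cpp.)

```cuda
// Rough size of the buffer im2col materializes before the GEMM, versus the
// input it unrolls. Shapes below are hypothetical.
#include <cstdint>
#include <cstdio>

int main() {
    const int64_t N = 1, C = 256, H = 128, W = 128;   // NCHW input
    const int64_t KH = 3, KW = 3, OH = 128, OW = 128; // 3x3 kernel, stride 1, pad 1

    const int64_t input_elems  = N * C * H * W;
    const int64_t im2col_elems = N * (C * KH * KW) * (OH * OW); // unrolled matrix

    printf("input : %.1f MiB (f32)\n", input_elems  * 4.0 / (1024.0 * 1024.0));
    printf("im2col: %.1f MiB (f32)\n", im2col_elems * 4.0 / (1024.0 * 1024.0));
    // For a 3x3 kernel the unrolled matrix is ~9x the input; the implicit/direct
    // kernels read the input in place and skip this allocation entirely.
    return 0;
}
```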
Now I think that the implicit GEMM is faster, so I suggest merging that one. When it was tested with stable-diffusion, the code wasn't actually activated, so the previous, less efficient implementation was always used. When I properly merged the implicit GEMM proposed by @bssrdf into my tree, it showed that it performs somewhat better than this algorithm, so I suggest continuing to work on that PR. See my comment for details: etasnadi#1 (comment)
Regarding the suggestion to merge the implicit GEMM: I do not see any license in the repository @bssrdf forked the code from (https://github.com/Qwesh157/conv_op_optimization). @JohannesGaessler, @bssrdf, I suggest making sure that we can legally use all the code pulled from that repo in llama.cpp.
@etasnadi, my PR may be slightly faster than yours, but it is based on https://github.com/Qwesh157/conv_op_optimization which has no license (I am too lazy to write it from scratch). I reached out to the author and asked them to add a license, but who knows whether/when they will respond. Pending the license issue, we may continue working on your PR and improve it further (I have some ideas). Plus your code is more in line with ggml's style. What do you think?
That's great. It depends on how much you want to work on this. You can either wait for approval from their side or integrate their optimizations into this PR. I think the optimizations are complementary, so merging the two would make sense, and since you don't use their code directly, just the ideas, you don't need any license.
I can give some of my suggestions. I don't know how Vulkan implements conv2d, but the code in my repo is mainly intended to push the GPU's CUDA cores to their performance limits for conv2d parameters that are as large as possible (larger h, w, number of filters, etc.), rather than being optimized for one specific conv2d configuration. So you can observe that the implicit GEMM implementation is generally closer to the theoretical performance at large sizes. If the goal is to achieve ideal performance across the board, I suggest adding as many tile shapes as possible. For example, in my repo I used 128x128, but you could also try 32x128, 64x128, and so on to generate enough CTAs to fully occupy the GPU.
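(A hedged sketch of the kind of tile-shape selection described above; all names and the wave threshold are illustrative placeholders, not code from any of the PRs. The idea is simply to prefer the largest tile that still produces enough CTAs to cover the SMs.)

```cuda
// Illustrative host-side heuristic: pick a (BM, BN) output tile so that the
// resulting CTA grid occupies every SM a few times over.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

struct tile_shape { int bm, bn; };

static tile_shape pick_tile(int64_t M, int64_t N, int sm_count) {
    // Candidate tiles, largest first (BM x BN output elements per CTA).
    const tile_shape candidates[] = { {128, 128}, {64, 128}, {32, 128}, {32, 64} };
    for (const tile_shape & t : candidates) {
        const int64_t ctas = ((M + t.bm - 1) / t.bm) * ((N + t.bn - 1) / t.bn);
        if (ctas >= 4 * (int64_t) sm_count) { // "a few waves" -- threshold is arbitrary
            return t;
        }
    }
    return candidates[3]; // small problem: fall back to the smallest tile
}

int main() {
    int sm_count = 0;
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, 0);
    // GEMM view of conv2d: M = OH*OW (per image), N = number of filters.
    const tile_shape t = pick_tile(/*M=*/64 * 64, /*N=*/512, sm_count);
    printf("picked tile %dx%d for %d SMs\n", t.bm, t.bn, sm_count);
    return 0;
}
```

A real implementation would also weigh register and shared-memory usage per tile variant; the four-waves-per-SM threshold is just a stand-in.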
@Qwesh157, thanks for the suggestions and for offering the generous license. I agree. To achieve higher performance for all input sizes, there have to be multiple block/tile shapes. This PR has 3 tile shapes and I already see the differences over limited test cases. I am going to implement other shapes in my PR. I am also exploring other optimizations, e.g. vectorized load, split-k, etc.
Is there now consensus that the implicit GEMM kernel is the one that should be reviewed and merged? @Qwesh157 would you be fine with licensing your code to us under the MIT license? We already have a copy of the MIT license at the project root, so my suggestion would be to simply add a copyright notice and a link to your repository in the file containing the copied CUDA code.
Currently, @Qwesh157's implicit implementation, forked and proposed by @bssrdf in #15805, is significantly faster (mostly a 20% improvement, but there is a test case where it is 100%), but that code uses optimizations missing from this PR. However, this PR also has optimizations missing from the other. I will add warp tiling and double buffering asap and you can decide based on the numbers. I expect that once I have updated the code with these optimizations, this will be the fastest. (Adding the optimizations of this PR to #15805 is also possible, but I know this code better, so I will work on this.) With stable diffusion, the results are mixed. It was shown on one device that this PR is already marginally better than the implicit conv (etasnadi#1 (comment)), but on my device the implicit conv is faster (etasnadi#1 (comment)).
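(Since double buffering comes up repeatedly: a minimal, generic sketch of the pattern on a toy reduction, not the conv2d kernel from either PR. It only shows how the next tile's global loads are issued while the current shared-memory tile is consumed.)

```cuda
// Generic double-buffering sketch (toy example). Launch with one block of
// TILE threads; x holds n_tiles contiguous tiles of TILE floats.
#define TILE 256

__global__ void sum_tiles_double_buffered(const float * __restrict__ x, float * __restrict__ out, int n_tiles) {
    __shared__ float buf[2][TILE];
    const int tid = threadIdx.x;

    buf[0][tid] = x[tid]; // preload tile 0
    __syncthreads();

    float acc = 0.0f;
    for (int t = 0; t < n_tiles; ++t) {
        const int cur = t & 1;
        if (t + 1 < n_tiles) {
            buf[cur ^ 1][tid] = x[(t + 1) * TILE + tid]; // prefetch the next tile
        }
        acc += buf[cur][tid]; // stand-in for the real per-tile computation
        __syncthreads();      // the next iteration overwrites buf[cur]; reads must finish first
    }
    out[tid] = acc;
}
```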
OK, I changed my license.
Since the current implementation does not specifically take FP16 scenarios into account, my repo may not fully exploit the CUDA core FP16 units, as certain techniques are not (yet) implemented. Examples include vectorized load/store, swizzling, the use of hfma2 instructions, larger tiles (e.g. 256x128) to increase computational intensity, multi-stage pipelining (>2 stages) on devices with larger shared memory, cp.async on post-Ampere GPUs, and iterator-based mechanisms (as in CUTLASS), among others.
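(For two of the FP16 techniques mentioned above, a hedged illustration, not code from any of the PRs: 128-bit vectorized loads of half data and packed half2 FMAs via __hfma2.)

```cuda
// Vectorized FP16 load + hfma2 sketch. Assumes a and b are 16-byte aligned,
// n is a multiple of 8, out is zero-initialized, and compute capability >= 5.3.
#include <cuda_fp16.h>

__global__ void dot_f16(const half * __restrict__ a, const half * __restrict__ b,
                        float * out, int n) {
    half2 acc = __float2half2_rn(0.0f);
    // Each step handles 8 halves per thread via one 128-bit load (4x half2).
    for (int i = (blockIdx.x * blockDim.x + threadIdx.x) * 8; i < n; i += gridDim.x * blockDim.x * 8) {
        const float4 av = *reinterpret_cast<const float4 *>(a + i); // 8 halves
        const float4 bv = *reinterpret_cast<const float4 *>(b + i);
        const half2 * a2 = reinterpret_cast<const half2 *>(&av);
        const half2 * b2 = reinterpret_cast<const half2 *>(&bv);
#pragma unroll
        for (int k = 0; k < 4; ++k) {
            acc = __hfma2(a2[k], b2[k], acc); // two FP16 FMAs per instruction
        }
    }
    const float sum = __low2float(acc) + __high2float(acc);
    atomicAdd(out, sum);
}
```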
I added warp tiling to this PR and your kernel is 3.75% faster than this PR on average on my device (I removed variants from this PR for a fair comparison). Here are the results with warp tiling (https://github.com/etasnadi/llama.cppxx.git). The format is:
This PR is faster on memory-bound problems and your kernel is faster when the input is compute-bound. I checked your code to see where the difference comes from and realized that your tiling strategy is heavily optimized for Nvidia cards. Now I am really curious whether your optimizations would also be effective on non-Nvidia devices. @bssrdf are you motivated enough to also add the Vulkan port of the kernel? In summary, I suggest reopening #15805 and merging the features of this PR (variants and fastdiv) so the algorithm could be even faster!
@etasnadi, thank you for the detailed benchmarking. I am working on adding multiple blocking strategies to my PR. If it works on memory-bound cases as well, I will reopen it. BTW, fastdiv has been added. I also added vectorized load, which seems to give a speed bump as well. As for that specific optimization, I wonder whether double buffering helped by hiding the latency. Unfortunately I don't know Vulkan, so I cannot help optimize it.
I am adding this because the current conv2d algorithm (#15635) seems to underutilize the GPU: the Vulkan version (#14316 & #14933) is 8-10 times faster on my device. Additionally, the Tensor Cores extension (#15813) of the previous algorithm also seems to be slower than this one.
There is another CUDA conv2d proposal that could be related: #15805.
Furthermore, this version introduces a bank conflict reduction that has not been added to Vulkan yet. It seems to be effective on large problems. I expect that this version will be even more efficient than the Vulkan backend.
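(For readers unfamiliar with the bank-conflict trick, a generic illustration of shared-memory padding on the classic transpose example; this is not the PR's actual tile layout.)

```cuda
// Shared-memory padding to avoid bank conflicts. Assumes a square width x width
// matrix, width a multiple of TILE_DIM, launched with dim3(TILE_DIM, TILE_DIM) blocks.
#define TILE_DIM 32

__global__ void transpose_padded(const float * __restrict__ in, float * __restrict__ out, int width) {
    // +1 padding: the column-wise reads below map to different 4-byte banks,
    // whereas a [32][32] tile would put an entire column into a single bank.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    const int x = blockIdx.x * TILE_DIM + threadIdx.x;
    const int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    const int tx = blockIdx.y * TILE_DIM + threadIdx.x;
    const int ty = blockIdx.x * TILE_DIM + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y]; // column read, conflict-free
}
```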
I do not support f16 yet; a future contribution might add that. Currently this algorithm is used for f32 inputs; otherwise it falls back to the previous implementation.
GGML_CUDA_USE_LEGACY_CONV forces the previous (probably slower) implementation to be used.
Perf of previous on RTX 2060:
Perf of proposed: