vulkan: optimizations for direct convolution #14933


Merged (7 commits) on Aug 2, 2025

Conversation

jeffbolznv
Collaborator

  • Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too.
  • Fix shmem bank conflicts. 16B padding should work with coopmat.
  • Some explicit loop unrolling.
  • Skip math/stores work for parts of the tile that are OOB.
  • Apply fastdiv opt.
  • Disable shuffles for NV.
5090 before:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    220 runs -  4554.01 us/run - 137.42 GFLOP/run -  30.18 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24684 runs -    40.52 us/run - 133.69 MFLOP/run -   3.30 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27269 runs -    37.20 us/run - 135.78 MFLOP/run -   3.65 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                106496 runs -    10.03 us/run - 642.82 kFLOP/run -  64.06 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                33502 runs -    32.84 us/run -  20.90 MFLOP/run - 636.32 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                40960 runs -    24.82 us/run -   2.78 MFLOP/run - 112.22 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8978 runs -   128.47 us/run -  22.28 MFLOP/run - 173.41 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               14739 runs -    70.51 us/run - 115.40 MFLOP/run -   1.64 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10246 runs -    98.46 us/run - 923.24 MFLOP/run -   9.38 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     3630 runs -   277.22 us/run -   1.85 GFLOP/run -   6.67 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    223 runs -  4493.81 us/run - 137.42 GFLOP/run -  30.58 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24684 runs -    40.55 us/run - 133.69 MFLOP/run -   3.30 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27269 runs -    37.32 us/run - 135.78 MFLOP/run -   3.64 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                106496 runs -     9.96 us/run - 642.82 kFLOP/run -  64.54 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                33502 runs -    32.90 us/run -  20.90 MFLOP/run - 635.08 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                40960 runs -    24.85 us/run -   2.78 MFLOP/run - 112.08 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8978 runs -   128.29 us/run -  22.28 MFLOP/run - 173.66 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               14739 runs -    70.36 us/run - 115.40 MFLOP/run -   1.64 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10137 runs -    99.29 us/run - 923.24 MFLOP/run -   9.30 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     3685 runs -   275.26 us/run -   1.85 GFLOP/run -   6.72 TFLOPS

5090 after:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    212 runs -  4720.67 us/run - 137.42 GFLOP/run -  29.11 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              133144 runs -     7.52 us/run - 133.69 MFLOP/run -  17.78 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               99495 runs -    10.12 us/run - 135.78 MFLOP/run -  13.42 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                491520 runs -     2.05 us/run - 642.82 kFLOP/run - 312.83 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               181868 runs -     5.61 us/run -  20.90 MFLOP/run -   3.72 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               188416 runs -     5.52 us/run -   2.78 MFLOP/run - 504.48 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                35912 runs -    31.51 us/run -  22.28 MFLOP/run - 706.99 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               99705 runs -    10.06 us/run - 115.40 MFLOP/run -  11.47 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               26705 runs -    37.50 us/run - 923.24 MFLOP/run -  24.62 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    10670 runs -    94.18 us/run -   1.85 GFLOP/run -  19.63 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    217 runs -  4612.13 us/run - 137.42 GFLOP/run -  29.80 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              133892 runs -     7.50 us/run - 133.69 MFLOP/run -  17.82 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               98021 runs -    10.21 us/run - 135.78 MFLOP/run -  13.29 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                516096 runs -     1.95 us/run - 642.82 kFLOP/run - 329.59 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               177082 runs -     5.67 us/run -  20.90 MFLOP/run -   3.68 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               180224 runs -     5.65 us/run -   2.78 MFLOP/run - 492.74 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                31423 runs -    32.23 us/run -  22.28 MFLOP/run - 691.18 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              102306 runs -     9.82 us/run - 115.40 MFLOP/run -  11.75 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27032 runs -    37.03 us/run - 923.24 MFLOP/run -  24.93 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    11440 runs -    87.54 us/run -   1.85 GFLOP/run -  21.12 TFLOPS

4070 before:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     95 runs - 10632.43 us/run - 137.42 GFLOP/run -  12.92 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27676 runs -    36.27 us/run - 133.69 MFLOP/run -   3.69 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               25058 runs -    40.70 us/run - 135.78 MFLOP/run -   3.34 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                163840 runs -     6.28 us/run - 642.82 kFLOP/run - 102.38 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                19144 runs -    58.79 us/run -  20.90 MFLOP/run - 355.42 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    45.52 us/run -   2.78 MFLOP/run -  61.18 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   314.76 us/run -  22.28 MFLOP/run -  70.78 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24276 runs -    41.63 us/run - 115.40 MFLOP/run -   2.77 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                6104 runs -   166.49 us/run - 923.24 MFLOP/run -   5.55 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     3960 runs -   253.72 us/run -   1.85 GFLOP/run -   7.29 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     99 runs - 10197.10 us/run - 137.42 GFLOP/run -  13.48 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27676 runs -    36.33 us/run - 133.69 MFLOP/run -   3.68 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24321 runs -    41.20 us/run - 135.78 MFLOP/run -   3.30 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                163840 runs -     6.36 us/run - 642.82 kFLOP/run - 101.03 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                19144 runs -    59.09 us/run -  20.90 MFLOP/run - 353.67 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    45.46 us/run -   2.78 MFLOP/run -  61.25 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   316.72 us/run -  22.28 MFLOP/run -  70.34 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24276 runs -    42.07 us/run - 115.40 MFLOP/run -   2.74 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5995 runs -   169.17 us/run - 923.24 MFLOP/run -   5.46 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     3960 runs -   255.64 us/run -   1.85 GFLOP/run -   7.23 TFLOPS

4070 after:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     55 runs - 18398.33 us/run - 137.42 GFLOP/run -   7.47 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               54604 runs -    18.35 us/run - 133.69 MFLOP/run -   7.28 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               38324 runs -    26.10 us/run - 135.78 MFLOP/run -   5.20 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                368640 runs -     2.73 us/run - 642.82 kFLOP/run - 235.85 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                76576 runs -    13.21 us/run -  20.90 MFLOP/run -   1.58 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                81920 runs -    12.98 us/run -   2.78 MFLOP/run - 214.49 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -    95.47 us/run -  22.28 MFLOP/run - 233.36 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               50286 runs -    20.09 us/run - 115.40 MFLOP/run -   5.74 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8829 runs -   114.18 us/run - 923.24 MFLOP/run -   8.09 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     4510 runs -   224.02 us/run -   1.85 GFLOP/run -   8.25 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     68 runs - 14908.06 us/run - 137.42 GFLOP/run -   9.22 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               53856 runs -    18.68 us/run - 133.69 MFLOP/run -   7.16 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               39061 runs -    26.01 us/run - 135.78 MFLOP/run -   5.22 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                368640 runs -     2.75 us/run - 642.82 kFLOP/run - 233.38 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                76576 runs -    13.33 us/run -  20.90 MFLOP/run -   1.57 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                81920 runs -    13.06 us/run -   2.78 MFLOP/run - 213.28 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -    96.22 us/run -  22.28 MFLOP/run - 231.53 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               49419 runs -    20.45 us/run - 115.40 MFLOP/run -   5.64 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8829 runs -   113.43 us/run - 923.24 MFLOP/run -   8.14 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     4510 runs -   222.58 us/run -   1.85 GFLOP/run -   8.31 TFLOPS

I haven't looked into why the first test case ("K=CRS=NPQ=4096 conv2d matmul performance") is slower on the 4070. That's the one that seems most likely to benefit from coopmat, so I'd prefer to wait until we add coopmat support before worrying about it.

Here's a comparison to the im2col path using #14833. All test cases except the first are faster than the im2col path.

5090
  CONV_2D_IM2COL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):            1192 runs -   839.13 us/run - 137.42 GFLOP/run - 163.77 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                61336 runs -    16.50 us/run - 133.69 MFLOP/run -   8.10 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                52327 runs -    19.17 us/run - 135.78 MFLOP/run -   7.08 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 100352 runs -     9.98 us/run - 642.82 kFLOP/run -  64.43 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 19456 runs -    53.39 us/run -  20.90 MFLOP/run - 391.40 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 20480 runs -    51.00 us/run -   2.78 MFLOP/run -  54.60 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  4096 runs -   281.73 us/run -  22.28 MFLOP/run -  79.08 GFLOPS
  CONV_2D_IM2COL(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                52887 runs -    19.01 us/run - 115.40 MFLOP/run -   6.07 TFLOPS
  CONV_2D_IM2COL(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                15478 runs -    64.64 us/run - 923.24 MFLOP/run -  14.28 TFLOPS
  CONV_2D_IM2COL(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):             12210 runs -    82.24 us/run -   1.85 GFLOP/run -  22.48 TFLOPS

4070
  CONV_2D_IM2COL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):             350 runs -  2858.74 us/run - 137.42 GFLOP/run -  48.07 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                37400 runs -    26.77 us/run - 133.69 MFLOP/run -   4.99 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                37587 runs -    26.95 us/run - 135.78 MFLOP/run -   5.04 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  69632 runs -    14.42 us/run - 642.82 kFLOP/run -  44.58 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  6144 runs -   193.36 us/run -  20.90 MFLOP/run - 108.07 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  8192 runs -   135.03 us/run -   2.78 MFLOP/run -  20.62 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  1024 runs -  1010.80 us/run -  22.28 MFLOP/run -  22.04 GFLOPS
  CONV_2D_IM2COL(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                27744 runs -    36.08 us/run - 115.40 MFLOP/run -   3.20 TFLOPS
  CONV_2D_IM2COL(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4469 runs -   225.93 us/run - 923.24 MFLOP/run -   4.09 TFLOPS
  CONV_2D_IM2COL(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              6105 runs -   165.17 us/run -   1.85 GFLOP/run -  11.19 TFLOPS

cc @etasnadi

@jeffbolznv jeffbolznv requested a review from 0cc4m July 29, 2025 02:36
@github-actions github-actions bot added the Vulkan (Issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jul 29, 2025
@Green-Sky
Collaborator

Green-Sky commented Jul 29, 2025

On my rtx 2070 mobile:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat

sd1 vae 512x768

before:
computing vae [mode: DECODE] graph completed, taking 1.16s

computing vae [mode: DECODE] graph completed, taking 1.13s

pr:
computing vae [mode: DECODE] graph completed, taking 1.56s

computing vae [mode: DECODE] graph completed, taking 1.55s

This does not look good in practice with sd.cpp.


I did not run the sampling with this patch.
I applied this patch to ggml: ggml-org/ggml@b96890f

@jeffbolznv
Collaborator Author

Please share the command line you used for sd. Maybe this is the same as the 4070 regression, I'll look into it.

@Green-Sky
Collaborator

Green-Sky commented Jul 29, 2025

Please share the command line you used for sd. Maybe this is the same as the 4070 regression, I'll look into it.

bin/sd -m ../models/CyberRealistic_V9-q8_0.gguf --sampling-method dpm++2m --schedule karras -W 512 -H 768 --cfg-scale 5 --steps 30 -p "a lovely cat" -v

should be the same with the base f16.safetensors file.

@Green-Sky
Collaborator

Green-Sky commented Jul 29, 2025

Another example using sd2 https://huggingface.co/Green-Sky/SD-Turbo-GGUF/blob/main/sd_turbo-f16-q8_0.gguf

before:

$ bin/sd -m ../models/sd_turbo-f16-q8_0.gguf --cfg-scale 1 --steps 8 -p "a lovely cat" -v

[DEBUG] ggml_extend.hpp:1187 - vae compute buffer size: 640.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 0.73s

after:

[DEBUG] ggml_extend.hpp:1187 - vae compute buffer size: 640.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 1.00s

edit: im2col+mat_mul for comparison

[DEBUG] ggml_extend.hpp:1187 - vae compute buffer size: 1664.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 1.32s

@etasnadi
Contributor

  • Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too.
  • Fix shmem bank conflicts. 16B padding should work with coopmat.
  • Some explicit loop unrolling.
  • Skip math/stores work for parts of the tile that are OOB.
  • Apply fastdiv opt.
  • Disable shuffles for NV.
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                516096 runs -     1.95 us/run - 642.82 kFLOP/run - 329.59 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               177082 runs -     5.67 us/run -  20.90 MFLOP/run -   3.68 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               180224 runs -     5.65 us/run -   2.78 MFLOP/run - 492.74 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                31423 runs -    32.23 us/run -  22.28 MFLOP/run - 691.18 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              102306 runs -     9.82 us/run - 115.40 MFLOP/run -  11.75 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27032 runs -    37.03 us/run - 923.24 MFLOP/run -  24.93 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    11440 runs -    87.54 us/run -   1.85 GFLOP/run -  21.12 TFLOPS

4070 before:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     95 runs - 10632.43 us/run - 137.42 GFLOP/run -  12.92 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27676 runs -    36.27 us/run - 133.69 MFLOP/run -   3.69 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               25058 runs -    40.70 us/run - 135.78 MFLOP/run -   3.34 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                163840 runs -     6.28 us/run - 642.82 kFLOP/run - 102.38 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                19144 runs -    58.79 us/run -  20.90 MFLOP/run - 355.42 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    45.52 us/run -   2.78 MFLOP/run -  61.18 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   314.76 us/run -  22.28 MFLOP/run -  70.78 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24276 runs -    41.63 us/run - 115.40 MFLOP/run -   2.77 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                6104 runs -   166.49 us/run - 923.24 MFLOP/run -   5.55 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     3960 runs -   253.72 us/run -   1.85 GFLOP/run -   7.29 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     99 runs - 10197.10 us/run - 137.42 GFLOP/run -  13.48 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27676 runs -    36.33 us/run - 133.69 MFLOP/run -   3.68 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24321 runs -    41.20 us/run - 135.78 MFLOP/run -   3.30 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                163840 runs -     6.36 us/run - 642.82 kFLOP/run - 101.03 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                19144 runs -    59.09 us/run -  20.90 MFLOP/run - 353.67 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    45.46 us/run -   2.78 MFLOP/run -  61.25 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   316.72 us/run -  22.28 MFLOP/run -  70.34 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24276 runs -    42.07 us/run - 115.40 MFLOP/run -   2.74 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5995 runs -   169.17 us/run - 923.24 MFLOP/run -   5.46 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     3960 runs -   255.64 us/run -   1.85 GFLOP/run -   7.23 TFLOPS

4070 after:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     55 runs - 18398.33 us/run - 137.42 GFLOP/run -   7.47 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               54604 runs -    18.35 us/run - 133.69 MFLOP/run -   7.28 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               38324 runs -    26.10 us/run - 135.78 MFLOP/run -   5.20 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                368640 runs -     2.73 us/run - 642.82 kFLOP/run - 235.85 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                76576 runs -    13.21 us/run -  20.90 MFLOP/run -   1.58 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                81920 runs -    12.98 us/run -   2.78 MFLOP/run - 214.49 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -    95.47 us/run -  22.28 MFLOP/run - 233.36 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               50286 runs -    20.09 us/run - 115.40 MFLOP/run -   5.74 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8829 runs -   114.18 us/run - 923.24 MFLOP/run -   8.09 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     4510 runs -   224.02 us/run -   1.85 GFLOP/run -   8.25 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     68 runs - 14908.06 us/run - 137.42 GFLOP/run -   9.22 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               53856 runs -    18.68 us/run - 133.69 MFLOP/run -   7.16 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               39061 runs -    26.01 us/run - 135.78 MFLOP/run -   5.22 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                368640 runs -     2.75 us/run - 642.82 kFLOP/run - 233.38 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                76576 runs -    13.33 us/run -  20.90 MFLOP/run -   1.57 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                81920 runs -    13.06 us/run -   2.78 MFLOP/run - 213.28 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -    96.22 us/run -  22.28 MFLOP/run - 231.53 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               49419 runs -    20.45 us/run - 115.40 MFLOP/run -   5.64 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8829 runs -   113.43 us/run - 923.24 MFLOP/run -   8.14 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     4510 runs -   222.58 us/run -   1.85 GFLOP/run -   8.31 TFLOPS

I haven't looked into why the first test case (the `// K=CRS=NPQ=4096 conv2d matmul performance` case) is slower on the 4070. It's the case most likely to benefit from coopmat, so I'd rather wait until we add coopmat support before worrying about it.
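As a side note on why that case is called "K=CRS=NPQ=4096": a minimal sketch (assuming the usual convolution-as-GEMM mapping, where K = output channels, CRS = Cin·KH·KW, and NPQ = batch·OH·OW) shows the first test's shapes collapse to a square 4096x4096x4096 matmul:

```python
# Hypothetical illustration of the conv -> GEMM dimension mapping for
# CONV_2D(ne_input=[19,19,256,16], ne_kernel=[4,4,256,4096], stride=1, pad=0, dil=1).
W, H, Cin, N = 19, 19, 256, 16      # ne_input is (w, h, c, n)
KW, KH, _, Cout = 4, 4, 256, 4096   # ne_kernel is (kw, kh, c, cout)
stride, pad, dil = 1, 0, 1

# Standard output-size formula for a 2D convolution.
OW = (W + 2 * pad - dil * (KW - 1) - 1) // stride + 1   # 16
OH = (H + 2 * pad - dil * (KH - 1) - 1) // stride + 1   # 16

K = Cout                 # GEMM rows: output channels
CRS = Cin * KH * KW      # GEMM inner dim: flattened filter
NPQ = N * OH * OW        # GEMM cols: flattened output positions
print(K, CRS, NPQ)       # 4096 4096 4096
```

Large square GEMMs like this are exactly the shape where cooperative-matrix hardware pays off, which is why that case is expected to benefit most from coopmat.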

Here's a comparison to the im2col path using #14833. All test cases except the first are faster with direct convolution than with the im2col path.

5090
  CONV_2D_IM2COL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):            1192 runs -   839.13 us/run - 137.42 GFLOP/run - 163.77 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                61336 runs -    16.50 us/run - 133.69 MFLOP/run -   8.10 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                52327 runs -    19.17 us/run - 135.78 MFLOP/run -   7.08 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 100352 runs -     9.98 us/run - 642.82 kFLOP/run -  64.43 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 19456 runs -    53.39 us/run -  20.90 MFLOP/run - 391.40 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 20480 runs -    51.00 us/run -   2.78 MFLOP/run -  54.60 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  4096 runs -   281.73 us/run -  22.28 MFLOP/run -  79.08 GFLOPS
  CONV_2D_IM2COL(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                52887 runs -    19.01 us/run - 115.40 MFLOP/run -   6.07 TFLOPS
  CONV_2D_IM2COL(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                15478 runs -    64.64 us/run - 923.24 MFLOP/run -  14.28 TFLOPS
  CONV_2D_IM2COL(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):             12210 runs -    82.24 us/run -   1.85 GFLOP/run -  22.48 TFLOPS

4070
  CONV_2D_IM2COL(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):             350 runs -  2858.74 us/run - 137.42 GFLOP/run -  48.07 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                37400 runs -    26.77 us/run - 133.69 MFLOP/run -   4.99 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                37587 runs -    26.95 us/run - 135.78 MFLOP/run -   5.04 TFLOPS
  CONV_2D_IM2COL(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  69632 runs -    14.42 us/run - 642.82 kFLOP/run -  44.58 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  6144 runs -   193.36 us/run -  20.90 MFLOP/run - 108.07 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  8192 runs -   135.03 us/run -   2.78 MFLOP/run -  20.62 GFLOPS
  CONV_2D_IM2COL(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                  1024 runs -  1010.80 us/run -  22.28 MFLOP/run -  22.04 GFLOPS
  CONV_2D_IM2COL(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                27744 runs -    36.08 us/run - 115.40 MFLOP/run -   3.20 TFLOPS
  CONV_2D_IM2COL(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4469 runs -   225.93 us/run - 923.24 MFLOP/run -   4.09 TFLOPS
  CONV_2D_IM2COL(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              6105 runs -   165.17 us/run -   1.85 GFLOP/run -  11.19 TFLOPS

cc @etasnadi

Did you use coopmat/coopmat2 for the im2col comparison? The 5090 reaches 163.77 TFLOPS, but Wikipedia says it can only do 104.8 TFLOPS fp16/fp32. If not, there is a bug in the perf code for graphs.
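For reference, the 163.77 TFLOPS figure does follow directly from the reported per-run numbers, so the question is really about the hardware peak rather than the perf math. A quick sanity check (assuming test-backend-ops computes throughput as FLOP per run divided by time per run):

```python
# Recompute the throughput for the first im2col result on the 5090:
# 137.42 GFLOP/run at 839.13 us/run.
flop_per_run = 137.42e9
seconds_per_run = 839.13e-6
tflops = flop_per_run / seconds_per_run / 1e12
print(round(tflops, 2))  # 163.77
```

163.77 TFLOPS exceeds the 104.8 TFLOPS fp16-with-fp32-accumulate shader rate, which is what motivates asking whether tensor cores (coopmat/coopmat2) were in use for the im2col path.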

@jeffbolznv
Copy link
Collaborator Author

Yes, I kept coopmat2 enabled for the im2col test.

@Green-Sky
Copy link
Collaborator

I decorated the conv calls to log the tensor shapes:

For sd2 (same gguf as above).

UNET
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 1280, 1280)
  input (f32): shape(16, 16, 1280, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 1280, 1280)
  input (f32): shape(32, 32, 1280, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 640, 640)
  input (f32): shape(64, 64, 640, 1)
VAE
[DEBUG] ggml_extend.hpp:1191 - vae compute buffer size: 640.00 MB(VRAM)
Conv2d(s0:1, s1:1, p0:0, p1:0, d0:1, d1:1)
  kernel (f16): shape(1, 1, 4, 4)
  input (f32): shape(64, 64, 4, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 4, 512)
  input (f32): shape(64, 64, 4, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:0, p1:0, d0:1, d1:1)
  kernel (f16): shape(1, 1, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:0, p1:0, d0:1, d1:1)
  kernel (f16): shape(1, 1, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:0, p1:0, d0:1, d1:1)
  kernel (f16): shape(1, 1, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:0, p1:0, d0:1, d1:1)
  kernel (f16): shape(1, 1, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(64, 64, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(128, 128, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(128, 128, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(128, 128, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(128, 128, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(128, 128, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(128, 128, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(128, 128, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 512)
  input (f32): shape(256, 256, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 512, 256)
  input (f32): shape(256, 256, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 256, 256)
  input (f32): shape(256, 256, 256, 1)
Conv2d(s0:1, s1:1, p0:0, p1:0, d0:1, d1:1)
  kernel (f16): shape(1, 1, 512, 256)
  input (f32): shape(256, 256, 512, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 256, 256)
  input (f32): shape(256, 256, 256, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 256, 256)
  input (f32): shape(256, 256, 256, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 256, 256)
  input (f32): shape(256, 256, 256, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 256, 256)
  input (f32): shape(256, 256, 256, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 256, 256)
  input (f32): shape(512, 512, 256, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 256, 128)
  input (f32): shape(512, 512, 256, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 128, 128)
  input (f32): shape(512, 512, 128, 1)
Conv2d(s0:1, s1:1, p0:0, p1:0, d0:1, d1:1)
  kernel (f16): shape(1, 1, 256, 128)
  input (f32): shape(512, 512, 256, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 128, 128)
  input (f32): shape(512, 512, 128, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 128, 128)
  input (f32): shape(512, 512, 128, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 128, 128)
  input (f32): shape(512, 512, 128, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 128, 128)
  input (f32): shape(512, 512, 128, 1)
Conv2d(s0:1, s1:1, p0:1, p1:1, d0:1, d1:1)
  kernel (f16): shape(3, 3, 128, 3)
  input (f32): shape(512, 512, 128, 1)
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 1.04s

@jeffbolznv
Copy link
Collaborator Author

@Green-Sky do I need to make any manual edits to use direct convolution? I'm using your `sd -m ../models/sd_turbo-f16-q8_0.gguf --cfg-scale 1 --steps 8 -p "a lovely cat" -v` command line and seeing the 1664 MB number.

@Green-Sky
Copy link
Collaborator

> @Green-Sky do I need to make any manual edits to use direct convolution? I'm using your `sd -m ../models/sd_turbo-f16-q8_0.gguf --cfg-scale 1 --steps 8 -p "a lovely cat" -v` command line and seeing the 1664 MB number.

Right. You currently need leejet/stable-diffusion.cpp#744.
Sorry, I forgot to mention it.

@Green-Sky
Copy link
Collaborator

Green-Sky commented Jul 29, 2025

Perf numbers from my side:

$ bin/test-backend-ops perf -o CONV_2D
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2070
  Device memory: 8192 MB (8192 MB free)

master:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     31 runs - 33125.48 us/run - 137.42 GFLOP/run -   4.15 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               16456 runs -    61.75 us/run - 133.69 MFLOP/run -   2.17 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9581 runs -   104.84 us/run - 135.78 MFLOP/run -   1.30 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 65536 runs -    16.25 us/run - 642.82 kFLOP/run -  39.56 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 9572 runs -   130.98 us/run -  20.90 MFLOP/run - 159.54 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                16384 runs -    97.61 us/run -   2.78 MFLOP/run -  28.53 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   732.50 us/run -  22.28 MFLOP/run -  30.41 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               12138 runs -    87.13 us/run - 115.40 MFLOP/run -   1.32 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2071 runs -   483.85 us/run - 923.24 MFLOP/run -   1.91 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1760 runs -   574.47 us/run -   1.85 GFLOP/run -   3.22 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     31 runs - 32706.81 us/run - 137.42 GFLOP/run -   4.20 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               16456 runs -    62.66 us/run - 133.69 MFLOP/run -   2.13 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9581 runs -   106.36 us/run - 135.78 MFLOP/run -   1.28 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 65536 runs -    16.06 us/run - 642.82 kFLOP/run -  40.03 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 9572 runs -   133.07 us/run -  20.90 MFLOP/run - 157.04 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                16384 runs -    98.99 us/run -   2.78 MFLOP/run -  28.13 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   741.31 us/run -  22.28 MFLOP/run -  30.05 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               12138 runs -    87.74 us/run - 115.40 MFLOP/run -   1.32 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2071 runs -   487.95 us/run - 923.24 MFLOP/run -   1.89 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1760 runs -   575.22 us/run -   1.85 GFLOP/run -   3.21 TFLOPS

pr:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     20 runs - 51761.65 us/run - 137.42 GFLOP/run -   2.65 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               18700 runs -    54.20 us/run - 133.69 MFLOP/run -   2.47 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               14003 runs -    74.78 us/run - 135.78 MFLOP/run -   1.82 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                196608 runs -     5.09 us/run - 642.82 kFLOP/run - 126.30 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                38288 runs -    28.90 us/run -  20.90 MFLOP/run - 723.08 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                40960 runs -    26.75 us/run -   2.78 MFLOP/run - 104.10 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8978 runs -   211.34 us/run -  22.28 MFLOP/run - 105.41 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               20808 runs -    49.10 us/run - 115.40 MFLOP/run -   2.35 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3052 runs -   335.65 us/run - 923.24 MFLOP/run -   2.75 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1540 runs -   660.23 us/run -   1.85 GFLOP/run -   2.80 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     21 runs - 49715.43 us/run - 137.42 GFLOP/run -   2.76 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               18700 runs -    54.80 us/run - 133.69 MFLOP/run -   2.44 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               13266 runs -    75.49 us/run - 135.78 MFLOP/run -   1.80 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                196608 runs -     5.14 us/run - 642.82 kFLOP/run - 124.99 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                38288 runs -    29.69 us/run -  20.90 MFLOP/run - 703.75 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                40960 runs -    28.57 us/run -   2.78 MFLOP/run -  97.47 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8978 runs -   218.45 us/run -  22.28 MFLOP/run - 101.99 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               20808 runs -    48.68 us/run - 115.40 MFLOP/run -   2.37 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3052 runs -   335.69 us/run - 923.24 MFLOP/run -   2.75 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1540 runs -   656.95 us/run -   1.85 GFLOP/run -   2.81 TFLOPS

@jeffbolznv
Copy link
Collaborator Author

I can see a similar 25%-ish slowdown on 4070:


4070 before:

  |==================================================| 8/8 - 5.17it/s
[INFO ] stable-diffusion.cpp:1806 - sampling completed, taking 1.77s
[INFO ] stable-diffusion.cpp:1814 - generating 1 latent images completed, taking 1.82s
[INFO ] stable-diffusion.cpp:1817 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1190 - vae compute buffer size: 640.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 0.44s

4070 after:

  |==================================================| 8/8 - 5.13it/s
[INFO ] stable-diffusion.cpp:1806 - sampling completed, taking 1.78s
[INFO ] stable-diffusion.cpp:1814 - generating 1 latent images completed, taking 1.84s
[INFO ] stable-diffusion.cpp:1817 - decoding 1 latents
[DEBUG] ggml_extend.hpp:1190 - vae compute buffer size: 640.00 MB(VRAM)
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 0.56s

I'll start by looking at the test-backend-ops test.

@jeffbolznv
Copy link
Collaborator Author

I looked at the backend tests. The larger tile size is better for very large convolutions (like the first test). I think I need to support multiple tile sizes, which should also allow further gains for some of the other shapes. I'll do some more experimentation.

@jeffbolznv
Copy link
Collaborator Author

I pushed another commit that has three tile sizes: the original, the one I used in the previous commit, and a third for smaller K values. It chooses between them based on the K/NPQ sizes and the number of SMs. This restores the performance in the directed tests:

5090:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):            308 runs -  3253.41 us/run - 137.42 GFLOP/run -  42.24 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              136884 runs -     7.33 us/run - 133.69 MFLOP/run -  18.24 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              100969 runs -     9.92 us/run - 135.78 MFLOP/run -  13.69 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                507904 runs -     1.99 us/run - 642.82 kFLOP/run - 323.05 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               181868 runs -     5.54 us/run -  20.90 MFLOP/run -   3.77 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               188416 runs -     5.52 us/run -   2.78 MFLOP/run - 504.83 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                58357 runs -    18.53 us/run -  22.28 MFLOP/run -   1.20 TFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              103173 runs -     9.69 us/run - 115.40 MFLOP/run -  11.91 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27686 runs -    36.24 us/run - 923.24 MFLOP/run -  25.48 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):            11055 runs -    90.61 us/run -   1.85 GFLOP/run -  20.41 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):            315 runs -  3183.73 us/run - 137.42 GFLOP/run -  43.16 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              139128 runs -     7.20 us/run - 133.69 MFLOP/run -  18.58 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              102443 runs -     9.83 us/run - 135.78 MFLOP/run -  13.82 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                540672 runs -     1.86 us/run - 642.82 kFLOP/run - 345.80 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               181868 runs -     5.54 us/run -  20.90 MFLOP/run -   3.77 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               188416 runs -     5.52 us/run -   2.78 MFLOP/run - 504.95 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                58357 runs -    18.38 us/run -  22.28 MFLOP/run -   1.21 TFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):              105774 runs -     9.50 us/run - 115.40 MFLOP/run -  12.15 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               28122 runs -    35.60 us/run - 923.24 MFLOP/run -  25.93 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):            11880 runs -    84.47 us/run -   1.85 GFLOP/run -  21.89 TFLOPS

4070:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):             98 runs - 10299.26 us/run - 137.42 GFLOP/run -  13.34 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               55352 runs -    18.26 us/run - 133.69 MFLOP/run -   7.32 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               39061 runs -    25.78 us/run - 135.78 MFLOP/run -   5.27 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                385024 runs -     2.65 us/run - 642.82 kFLOP/run - 242.29 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                71790 runs -    14.73 us/run -  20.90 MFLOP/run -   1.42 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               106496 runs -    10.07 us/run -   2.78 MFLOP/run - 276.50 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                22445 runs -    53.70 us/run -  22.28 MFLOP/run - 414.89 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               51153 runs -    19.72 us/run - 115.40 MFLOP/run -   5.85 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9047 runs -   111.47 us/run - 923.24 MFLOP/run -   8.28 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):             4510 runs -   222.19 us/run -   1.85 GFLOP/run -   8.32 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):            103 runs -  9750.78 us/run - 137.42 GFLOP/run -  14.09 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               55352 runs -    18.20 us/run - 133.69 MFLOP/run -   7.35 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               39061 runs -    25.63 us/run - 135.78 MFLOP/run -   5.30 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                376832 runs -     2.69 us/run - 642.82 kFLOP/run - 238.60 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                86148 runs -    12.24 us/run -  20.90 MFLOP/run -   1.71 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               131072 runs -     7.80 us/run -   2.78 MFLOP/run - 357.25 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                22445 runs -    45.85 us/run -  22.28 MFLOP/run - 485.89 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               51153 runs -    19.78 us/run - 115.40 MFLOP/run -   5.84 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9047 runs -   110.81 us/run - 923.24 MFLOP/run -   8.33 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):             4565 runs -   219.34 us/run -   1.85 GFLOP/run -   8.43 TFLOPS

This also restores the performance in sd.cpp according to the debug message, though that timer doesn't seem very precise. The GGML_VK_PERF_LOGGER output is clearer; for example, here is previous TOT ggml vs this updated PR on 5090:

5090 before:
Vulkan Timings:
ADD: 85 x 67.653 us
CONT: 3 x 22.346 us
CONV_2D M=Cout=128, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 5 x 2442.45 us (31638.6 GFLOPS/s)
CONV_2D M=Cout=128, K=Cin*KW*KH=2304, N=N*OW*OH=262144: 1 x 4761.98 us (32462.4 GFLOPS/s)
CONV_2D M=Cout=128, K=Cin*KW*KH=256, N=N*OW*OH=262144: 1 x 896.448 us (19126.9 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=2304, N=N*OW*OH=262144: 1 x 8825.86 us (35030.1 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=2304, N=N*OW*OH=65536: 5 x 2553.93 us (30264.2 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=4608, N=N*OW*OH=65536: 1 x 5143.94 us (30055.2 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=512, N=N*OW*OH=65536: 1 x 845.824 us (20291.6 GFLOPS/s)
CONV_2D M=Cout=3, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 1 x 2267.14 us (798.873 GFLOPS/s)
CONV_2D M=Cout=4, K=Cin*KW*KH=4, N=N*OW*OH=4096: 1 x 13.152 us (8.72019 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=36, N=N*OW*OH=4096: 1 x 32.768 us (4544 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=16384: 7 x 2659.37 us (29067.4 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=4096: 10 x 1043.53 us (18519.1 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=65536: 1 x 9121.79 us (33897.3 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=512, N=N*OW*OH=4096: 4 x 144.576 us (14839.2 GFLOPS/s)
GROUP_NORM: 30 x 831.536 us
MUL: 30 x 59.307 us
MUL_MAT m=4096 n=4096 k=512: 1 x 92 us (186555 GFLOPS/s)
MUL_MAT m=512 n=4096 k=4096: 1 x 100.352 us (171175 GFLOPS/s)
SCALE: 1 x 26.624 us
SILU: 29 x 63.772 us
SOFT_MAX: 1 x 43.008 us
UPSCALE: 3 x 165.898 us
Total time: 121672 us.

5090 after:
Vulkan Timings:
ADD: 85 x 65.703 us
CONT: 3 x 24.629 us
CONV_2D M=Cout=128, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 5 x 1969.77 us (39230.8 GFLOPS/s)
CONV_2D M=Cout=128, K=Cin*KW*KH=2304, N=N*OW*OH=262144: 1 x 3259.42 us (47427.2 GFLOPS/s)
CONV_2D M=Cout=128, K=Cin*KW*KH=256, N=N*OW*OH=262144: 1 x 661.504 us (25920.2 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=2304, N=N*OW*OH=262144: 1 x 7219.71 us (42823.1 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=2304, N=N*OW*OH=65536: 5 x 1820.77 us (42450.6 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=4608, N=N*OW*OH=65536: 1 x 3440.64 us (44934.1 GFLOPS/s)
CONV_2D M=Cout=256, K=Cin*KW*KH=512, N=N*OW*OH=65536: 1 x 557.088 us (30808.6 GFLOPS/s)
CONV_2D M=Cout=3, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 1 x 485.408 us (3731.2 GFLOPS/s)
CONV_2D M=Cout=4, K=Cin*KW*KH=4, N=N*OW*OH=4096: 1 x 3.648 us (31.4386 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=36, N=N*OW*OH=4096: 1 x 12.704 us (11720.5 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=16384: 7 x 1882.62 us (41060.4 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=4096: 10 x 695.929 us (27769 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=4608, N=N*OW*OH=65536: 1 x 6290.3 us (49155.7 GFLOPS/s)
CONV_2D M=Cout=512, K=Cin*KW*KH=512, N=N*OW*OH=4096: 4 x 89.04 us (24094.6 GFLOPS/s)
GROUP_NORM: 30 x 853.14 us
MUL: 30 x 59.335 us
MUL_MAT m=4096 n=4096 k=512: 1 x 95.424 us (179861 GFLOPS/s)
MUL_MAT m=512 n=4096 k=4096: 1 x 100.8 us (170414 GFLOPS/s)
SCALE: 1 x 26.624 us
SILU: 29 x 64.54 us
SOFT_MAX: 1 x 43.008 us
UPSCALE: 3 x 166.613 us
Total time: 97047.3 us.

Many of the buckets show a similar improvement to the first directed test (30 TFLOPS -> 43 TFLOPS), and the two buckets with small K show even bigger improvements. The gains on 4070 are smaller.
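The GFLOPS figures in these buckets follow directly from the implicit-GEMM flop count, 2*M*K*N, divided by the per-run time. As a quick sanity check against the first bucket in the "after" log (a sketch for illustration, not code from the PR):

```python
# Reproduce the perf logger's GFLOPS figure for the bucket
#   CONV_2D M=Cout=128, K=Cin*KW*KH=1152, N=N*OW*OH=262144: 1969.77 us
# A CONV_2D lowered to GEMM does 2*M*K*N flops (one multiply + one add per MAC).
M, K, N = 128, 1152, 262144
time_us = 1969.77

flops = 2 * M * K * N
gflops_per_s = flops / (time_us * 1e-6) / 1e9
print(f"{gflops_per_s:.1f} GFLOPS/s")  # close to the logged 39230.8
```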

The tile selection logic is no longer NV-specific, so I'd appreciate some testing on other hardware. If it doesn't help, I can restrict the changes again. I think we still don't have a way to query the number of SMs on Intel, but the logic is written so that it falls back to basing the choice on K/NPQ alone.
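To illustrate the kind of heuristic described above, here is a hypothetical sketch; the function name, tile candidates, and occupancy target are all illustrative assumptions, not the PR's actual code:

```python
def pick_tile(K, NPQ, num_sms=None):
    """Pick the largest (BS_K, BS_NPQ) tile that still yields enough
    workgroups to fill the GPU; otherwise fall back to the smallest tile.
    Hypothetical sketch only -- candidates and threshold are made up."""
    tiles = [(128, 128), (64, 32), (32, 32)]  # ordered largest to smallest
    # Without a queryable SM count (e.g. on Intel), assume a fixed target
    # so the choice is driven purely by the K/NPQ sizes.
    target_workgroups = 4 * (num_sms if num_sms is not None else 32)
    for bs_k, bs_npq in tiles:
        wgs = ((K + bs_k - 1) // bs_k) * ((NPQ + bs_npq - 1) // bs_npq)
        if wgs >= target_workgroups:
            return (bs_k, bs_npq)
    return tiles[-1]

# Large conv (K=4096, NPQ=4096): plenty of work, the big tile wins.
print(pick_tile(4096, 4096, num_sms=170))  # -> (128, 128)
# Small K: not enough workgroups for the big tiles, smallest tile wins.
print(pick_tile(128, 4096, num_sms=170))   # -> (32, 32)
```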

@Green-Sky
Copy link
Collaborator

New numbers time.

$ bin/test-backend-ops perf -o CONV_2D
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 2070 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 0 | matrix cores: KHR_coopmat
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce RTX 2070
  Device memory: 8192 MB (8192 MB free)

master:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     26 runs - 38736.73 us/run - 137.42 GFLOP/run -   3.55 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               14960 runs -    68.96 us/run - 133.69 MFLOP/run -   1.94 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8844 runs -   117.11 us/run - 135.78 MFLOP/run -   1.16 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    18.05 us/run - 642.82 kFLOP/run -  35.61 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 9572 runs -   146.66 us/run -  20.90 MFLOP/run - 142.49 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                16384 runs -   109.64 us/run -   2.78 MFLOP/run -  25.40 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   836.26 us/run -  22.28 MFLOP/run -  26.64 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10404 runs -    97.61 us/run - 115.40 MFLOP/run -   1.18 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1853 runs -   558.98 us/run - 923.24 MFLOP/run -   1.65 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1540 runs -   655.81 us/run -   1.85 GFLOP/run -   2.82 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     27 runs - 37843.30 us/run - 137.42 GFLOP/run -   3.63 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               14212 runs -    70.64 us/run - 133.69 MFLOP/run -   1.89 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8844 runs -   119.07 us/run - 135.78 MFLOP/run -   1.14 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    18.00 us/run - 642.82 kFLOP/run -  35.71 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 9572 runs -   149.78 us/run -  20.90 MFLOP/run - 139.51 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                16384 runs -   111.55 us/run -   2.78 MFLOP/run -  24.97 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   840.77 us/run -  22.28 MFLOP/run -  26.50 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10404 runs -    98.16 us/run - 115.40 MFLOP/run -   1.18 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1853 runs -   549.29 us/run - 923.24 MFLOP/run -   1.68 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1540 runs -   655.39 us/run -   1.85 GFLOP/run -   2.82 TFLOPS

pr:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     28 runs - 36300.93 us/run - 137.42 GFLOP/run -   3.79 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               15708 runs -    63.89 us/run - 133.69 MFLOP/run -   2.09 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               12529 runs -    82.23 us/run - 135.78 MFLOP/run -   1.65 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                172032 runs -     5.92 us/run - 642.82 kFLOP/run - 108.56 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                38288 runs -    27.44 us/run -  20.90 MFLOP/run - 761.40 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                65536 runs -    16.94 us/run -   2.78 MFLOP/run - 164.37 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8978 runs -   120.99 us/run -  22.28 MFLOP/run - 184.14 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               17340 runs -    59.98 us/run - 115.40 MFLOP/run -   1.92 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2507 runs -   403.82 us/run - 923.24 MFLOP/run -   2.29 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1265 runs -   805.03 us/run -   1.85 GFLOP/run -   2.30 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     27 runs - 37721.93 us/run - 137.42 GFLOP/run -   3.64 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               14960 runs -    69.47 us/run - 133.69 MFLOP/run -   1.92 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               11792 runs -    86.13 us/run - 135.78 MFLOP/run -   1.58 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                172032 runs -     5.89 us/run - 642.82 kFLOP/run - 109.10 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                38288 runs -    27.96 us/run -  20.90 MFLOP/run - 747.26 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                65536 runs -    17.12 us/run -   2.78 MFLOP/run - 162.71 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8978 runs -   119.52 us/run -  22.28 MFLOP/run - 186.40 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               19074 runs -    54.06 us/run - 115.40 MFLOP/run -   2.13 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2725 runs -   374.94 us/run - 923.24 MFLOP/run -   2.46 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1375 runs -   741.01 us/run -   1.85 GFLOP/run -   2.50 TFLOPS

Most are now faster.


sd2 vae decode (same as above):

master:

[INFO ] stable-diffusion.cpp:1806 - sampling completed, taking 2.36s
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 0.85s

pr:

[INFO ] stable-diffusion.cpp:1806 - sampling completed, taking 2.37s
[DEBUG] stable-diffusion.cpp:1182 - computing vae [mode: DECODE] graph completed, taking 0.80s

Looks better now. The difference between these numbers and my earlier master numbers is pretty large, so I should probably rebench after a clean reboot with nothing else running. Will try different models tomorrow.


I'm a fan of your work btw, on vulkan generally, but also the llama.cpp/ggml stuff specifically (:

@etasnadi
Copy link
Contributor

etasnadi commented Jul 29, 2025

I tested it with my ancient GTX 1060 GPU, and surprisingly, this PR is much faster in some cases (except for the 4096 by 4096 case).

I did not test which specific improvements are responsible for the performance gains, but I expect that fastdiv is a better option than shuffle. I was also considering adding fastdiv, but since the warp shuffle allowed computing divisions only once per thread per blocktile, I did not expect much gain from it.

Disadvantages of shuffle:

  • It is not supported on pre-Kepler GPUs, as far as I know, so the fastdiv trick can boost performance on such really old devices as well. Edit: pre-Kepler GPUs might not have Vulkan support either.
  • With warp shuffle, we need to limit the blocktile dot dimension to the warp size.

I suspect this behavior is not unique to Nvidia, and that fastdiv outperforms shuffle on older non-Nvidia devices as well. If that is confirmed, it would be a good idea to drop shuffle in favor of fastdiv.
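For reference, the fastdiv trick replaces an integer division by a fixed runtime divisor with a multiply-high and a shift, using a precomputed magic constant. A minimal sketch of the standard construction (not the PR's actual GLSL code):

```python
def init_fastdiv(d):
    """Precompute (multiplier, shift) so that for any 32-bit unsigned n,
    (n * multiplier) >> (32 + shift) == n // d."""
    assert 1 <= d < 1 << 32
    shift = (d - 1).bit_length()               # ceil(log2(d))
    mult = ((1 << (32 + shift)) + d - 1) // d  # ceil(2^(32+shift) / d)
    return mult, shift

def fastdiv(n, mult, shift):
    # On a GPU this is a 32x32->64-bit mul-high plus a shift, far cheaper
    # than an integer divide inside an inner loop.
    return (n * mult) >> (32 + shift)

# e.g. dividing flattened output indices by a tensor dimension
mult, shift = init_fastdiv(19)
print(fastdiv(12345, mult, shift))  # same result as 12345 // 19
```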

Master:
=======

./bin/test-backend-ops -o CONV_2D -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1060 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce GTX 1060
  Device memory: 6144 MB (6144 MB free)

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     15 runs - 69740.80 us/run - 137.42 GFLOP/run -   1.97 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8228 runs -   133.83 us/run - 133.69 MFLOP/run - 998.99 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5159 runs -   201.81 us/run - 135.78 MFLOP/run - 672.83 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 24576 runs -    53.14 us/run - 642.82 kFLOP/run -  12.10 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   488.59 us/run -  20.90 MFLOP/run -  42.77 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   415.18 us/run -   2.78 MFLOP/run -   6.71 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  3141.48 us/run -  22.28 MFLOP/run -   7.09 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                6069 runs -   187.87 us/run - 115.40 MFLOP/run - 614.26 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 981 runs -  1050.38 us/run - 923.24 MFLOP/run - 878.96 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      880 runs -  1165.10 us/run -   1.85 GFLOP/run -   1.59 TFLOPS

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     15 runs - 69245.20 us/run - 137.42 GFLOP/run -   1.98 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                7480 runs -   142.16 us/run - 133.69 MFLOP/run - 940.47 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5159 runs -   217.08 us/run - 135.78 MFLOP/run - 625.48 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 24576 runs -    53.31 us/run - 642.82 kFLOP/run -  12.06 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   478.72 us/run -  20.90 MFLOP/run -  43.65 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   403.03 us/run -   2.78 MFLOP/run -   6.91 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  3131.36 us/run -  22.28 MFLOP/run -   7.11 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5202 runs -   201.01 us/run - 115.40 MFLOP/run - 574.12 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 981 runs -  1054.63 us/run - 923.24 MFLOP/run - 875.41 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      880 runs -  1162.61 us/run -   1.85 GFLOP/run -   1.59 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

PR:
===

./bin/test-backend-ops -o CONV_2D -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1060 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce GTX 1060
  Device memory: 6144 MB (6144 MB free)

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     15 runs - 69762.27 us/run - 137.42 GFLOP/run -   1.97 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8228 runs -   126.56 us/run - 133.69 MFLOP/run -   1.06 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   190.03 us/run - 135.78 MFLOP/run - 714.53 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                106496 runs -     9.97 us/run - 642.82 kFLOP/run -  64.44 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    92.13 us/run -  20.90 MFLOP/run - 226.81 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    57.55 us/run -   2.78 MFLOP/run -  48.39 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   416.38 us/run -  22.28 MFLOP/run -  53.51 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10404 runs -   102.83 us/run - 115.40 MFLOP/run -   1.12 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1417 runs -   758.62 us/run - 923.24 MFLOP/run -   1.22 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      880 runs -  1181.37 us/run -   1.85 GFLOP/run -   1.57 TFLOPS

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     15 runs - 70923.33 us/run - 137.42 GFLOP/run -   1.94 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                7480 runs -   141.42 us/run - 133.69 MFLOP/run - 945.40 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   192.64 us/run - 135.78 MFLOP/run - 704.84 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 98304 runs -    10.65 us/run - 642.82 kFLOP/run -  60.34 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    85.48 us/run -  20.90 MFLOP/run - 244.47 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    56.87 us/run -   2.78 MFLOP/run -  48.97 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   467.03 us/run -  22.28 MFLOP/run -  47.70 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9537 runs -   105.21 us/run - 115.40 MFLOP/run -   1.10 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1417 runs -   763.36 us/run - 923.24 MFLOP/run -   1.21 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      880 runs -  1176.98 us/run -   1.85 GFLOP/run -   1.57 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

@daniandtheweb
Contributor

daniandtheweb commented Jul 29, 2025

Performance numbers on my laptop look really nice with this PR.

Ryzen 7 4700U (Radeon Vega 7 iGPU), 16 GB RAM

./test-backend-ops perf -o CONV_2D
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: AMD Radeon Graphics (RADV RENOIR)
  Device memory: 5452 MB (5452 MB free)

Master:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      2 runs - 537110.00 us/run - 137.42 GFLOP/run - 255.85 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1496 runs -   687.21 us/run - 133.69 MFLOP/run - 194.55 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1474 runs -  1119.22 us/run - 135.78 MFLOP/run - 121.32 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 24576 runs -    56.07 us/run - 642.82 kFLOP/run -  11.46 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   799.36 us/run -  20.90 MFLOP/run -  26.14 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   509.41 us/run -   2.78 MFLOP/run -   5.47 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  3909.65 us/run -  22.28 MFLOP/run -   5.70 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2601 runs -   436.11 us/run - 115.40 MFLOP/run - 264.62 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 436 runs -  2907.87 us/run - 923.24 MFLOP/run - 317.50 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      330 runs -  3331.33 us/run -   1.85 GFLOP/run - 555.00 GFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      5 runs - 215140.40 us/run - 137.42 GFLOP/run - 638.76 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3740 runs -   273.53 us/run - 133.69 MFLOP/run - 488.77 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2211 runs -   488.62 us/run - 135.78 MFLOP/run - 277.89 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 24576 runs -    55.79 us/run - 642.82 kFLOP/run -  11.52 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   795.08 us/run -  20.90 MFLOP/run -  26.28 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   506.09 us/run -   2.78 MFLOP/run -   5.50 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  3885.25 us/run -  22.28 MFLOP/run -   5.73 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2601 runs -   435.96 us/run - 115.40 MFLOP/run - 264.72 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 436 runs -  2903.64 us/run - 923.24 MFLOP/run - 317.96 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      330 runs -  3295.16 us/run -   1.85 GFLOP/run - 561.09 GFLOPS

PR:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      2 runs - 529981.00 us/run - 137.42 GFLOP/run - 259.30 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2244 runs -   656.40 us/run - 133.69 MFLOP/run - 203.68 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1474 runs -   946.40 us/run - 135.78 MFLOP/run - 143.47 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    18.31 us/run - 642.82 kFLOP/run -  35.12 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   224.87 us/run -  20.90 MFLOP/run -  92.93 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   139.94 us/run -   2.78 MFLOP/run -  19.90 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   990.73 us/run -  22.28 MFLOP/run -  22.49 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3468 runs -   352.27 us/run - 115.40 MFLOP/run - 327.60 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 436 runs -  2687.49 us/run - 923.24 MFLOP/run - 343.53 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      330 runs -  3267.60 us/run -   1.85 GFLOP/run - 565.82 GFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      5 runs - 220690.00 us/run - 137.42 GFLOP/run - 622.69 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                4488 runs -   264.14 us/run - 133.69 MFLOP/run - 506.15 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2211 runs -   475.02 us/run - 135.78 MFLOP/run - 285.85 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    18.77 us/run - 642.82 kFLOP/run -  34.25 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   225.11 us/run -  20.90 MFLOP/run -  92.83 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   139.96 us/run -   2.78 MFLOP/run -  19.90 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   989.46 us/run -  22.28 MFLOP/run -  22.52 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3468 runs -   353.70 us/run - 115.40 MFLOP/run - 326.28 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 436 runs -  2691.24 us/run - 923.24 MFLOP/run - 343.05 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      330 runs -  3242.42 us/run -   1.85 GFLOP/run - 570.22 GFLOPS
stable-diffusion.cpp im2col, sd 1.5, 512x512, 20 step: diffusion 3.83 s/it, vae decode 20s
stable-diffusion.cpp conv2d master, sd 1.5, 512x512, 20 step: diffusion 4.39 s/it, vae decode 6.18s
stable-diffusion.cpp conv2d PR, sd 1.5, 512x512, 20 step: diffusion 4.08 s/it, vae decode 5.96s

@netrunnereve
Collaborator

Here's a run on my RX 470, with everything faster than master except the ne_input=[16,16,128,8],ne_kernel=[3,3,128,512] test. If I turn collectives off, that test runs faster but everything else runs slower!

PR:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     17 runs - 62334.35 us/run - 137.42 GFLOP/run -   2.20 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10472 runs -    96.89 us/run - 133.69 MFLOP/run -   1.38 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                6633 runs -   164.13 us/run - 135.78 MFLOP/run - 827.27 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    19.07 us/run - 642.82 kFLOP/run -  33.72 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    78.54 us/run -  20.90 MFLOP/run - 266.05 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    48.72 us/run -   2.78 MFLOP/run -  57.16 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   270.09 us/run -  22.28 MFLOP/run -  82.49 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10404 runs -    99.76 us/run - 115.40 MFLOP/run -   1.16 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1635 runs -   613.66 us/run - 923.24 MFLOP/run -   1.50 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      825 runs -  1251.48 us/run -   1.85 GFLOP/run -   1.48 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     17 runs - 59013.53 us/run - 137.42 GFLOP/run -   2.33 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               11220 runs -    94.72 us/run - 133.69 MFLOP/run -   1.41 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                6633 runs -   162.89 us/run - 135.78 MFLOP/run - 833.60 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 57344 runs -    19.03 us/run - 642.82 kFLOP/run -  33.77 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    78.29 us/run -  20.90 MFLOP/run - 266.90 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    48.71 us/run -   2.78 MFLOP/run -  57.18 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   270.79 us/run -  22.28 MFLOP/run -  82.27 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10404 runs -   100.71 us/run - 115.40 MFLOP/run -   1.15 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1744 runs -   587.41 us/run - 923.24 MFLOP/run -   1.57 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      880 runs -  1157.66 us/run -   1.85 GFLOP/run -   1.60 TFLOPS

PR with no collectives:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     16 runs - 64289.75 us/run - 137.42 GFLOP/run -   2.14 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                7480 runs -   143.74 us/run - 133.69 MFLOP/run - 930.12 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   170.86 us/run - 135.78 MFLOP/run - 794.71 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 40960 runs -    27.54 us/run - 642.82 kFLOP/run -  23.34 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   231.00 us/run -  20.90 MFLOP/run -  90.46 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   145.13 us/run -   2.78 MFLOP/run -  19.19 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   999.29 us/run -  22.28 MFLOP/run -  22.29 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5202 runs -   211.30 us/run - 115.40 MFLOP/run - 546.17 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1090 runs -   926.03 us/run - 923.24 MFLOP/run - 996.99 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      880 runs -  1152.03 us/run -   1.85 GFLOP/run -   1.60 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     17 runs - 61158.71 us/run - 137.42 GFLOP/run -   2.25 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                7480 runs -   144.01 us/run - 133.69 MFLOP/run - 928.37 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   170.21 us/run - 135.78 MFLOP/run - 797.75 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 40960 runs -    27.53 us/run - 642.82 kFLOP/run -  23.35 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   230.81 us/run -  20.90 MFLOP/run -  90.54 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   144.69 us/run -   2.78 MFLOP/run -  19.25 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  1002.49 us/run -  22.28 MFLOP/run -  22.22 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5202 runs -   210.98 us/run - 115.40 MFLOP/run - 546.98 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1090 runs -   925.57 us/run - 923.24 MFLOP/run - 997.48 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      935 runs -  1123.53 us/run -   1.85 GFLOP/run -   1.65 TFLOPS

Master:

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     16 runs - 64523.81 us/run - 137.42 GFLOP/run -   2.13 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                7480 runs -   144.09 us/run - 133.69 MFLOP/run - 927.84 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   170.97 us/run - 135.78 MFLOP/run - 794.18 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 40960 runs -    27.60 us/run - 642.82 kFLOP/run -  23.29 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   230.83 us/run -  20.90 MFLOP/run -  90.53 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   144.84 us/run -   2.78 MFLOP/run -  19.23 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   999.31 us/run -  22.28 MFLOP/run -  22.29 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5202 runs -   211.70 us/run - 115.40 MFLOP/run - 545.13 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1090 runs -   926.70 us/run - 923.24 MFLOP/run - 996.26 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      880 runs -  1154.92 us/run -   1.85 GFLOP/run -   1.60 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     17 runs - 61533.29 us/run - 137.42 GFLOP/run -   2.23 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                7480 runs -   143.74 us/run - 133.69 MFLOP/run - 930.09 GFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   170.53 us/run - 135.78 MFLOP/run - 796.26 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 40960 runs -    27.47 us/run - 642.82 kFLOP/run -  23.40 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4786 runs -   230.49 us/run -  20.90 MFLOP/run -  90.66 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 8192 runs -   144.64 us/run -   2.78 MFLOP/run -  19.25 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -  1002.43 us/run -  22.28 MFLOP/run -  22.22 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5202 runs -   211.31 us/run - 115.40 MFLOP/run - 546.13 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1090 runs -   926.93 us/run - 923.24 MFLOP/run - 996.01 GFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      935 runs -  1121.26 us/run -   1.85 GFLOP/run -   1.65 TFLOPS

I also did a 512x512 SDXL run with all ggml_conv_2d calls switched to ggml_conv_2d_direct.

PR: 1.03 s/it sampling, 1.33s decoding

PR with no collectives: 1.05 s/it sampling, 1.38s decoding

Master: 1.09 s/it sampling, 1.41s decoding

@etasnadi
Contributor

Motivated by #14933 (comment), I tested this PR with collectives on/off on a GTX 1060 (Notebook).

It seems that the effects of collectives (warp shuffles) and the contributions of this PR can add up even for the apparently saturated 4096x4096 test case, where the combination of the two reaches 2.08-2.10 TFLOPS, which I could not achieve before.

PS: collectives also helped on this Nvidia card before this PR, I guess because of the poor idiv throughput.
My 2060 desktop device is far less sensitive to enabling/disabling collectives, so I am not sure we should disable collectives on all Nvidia devices.

device->vendor_id != VK_VENDOR_ID_NVIDIA) { // Collectives no faster on NVIDIA.

PR, Collectives=1

./bin/test-backend-ops -o CONV_2D -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1060 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce GTX 1060
  Device memory: 6144 MB (6144 MB free)

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     16 runs - 66094.44 us/run - 137.42 GFLOP/run -   2.08 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8976 runs -   115.36 us/run - 133.69 MFLOP/run -   1.16 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   173.05 us/run - 135.78 MFLOP/run - 784.62 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                131072 runs -     7.74 us/run - 642.82 kFLOP/run -  83.02 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                19144 runs -    67.39 us/run -  20.90 MFLOP/run - 310.07 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    44.97 us/run -   2.78 MFLOP/run -  61.93 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   334.86 us/run -  22.28 MFLOP/run -  66.53 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10404 runs -    97.66 us/run - 115.40 MFLOP/run -   1.18 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1417 runs -   745.92 us/run - 923.24 MFLOP/run -   1.24 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      935 runs -  1090.30 us/run -   1.85 GFLOP/run -   1.70 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     16 runs - 64889.12 us/run - 137.42 GFLOP/run -   2.12 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8976 runs -   116.30 us/run - 133.69 MFLOP/run -   1.15 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   174.92 us/run - 135.78 MFLOP/run - 776.26 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                131072 runs -     7.67 us/run - 642.82 kFLOP/run -  83.86 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                19144 runs -    68.07 us/run -  20.90 MFLOP/run - 306.97 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    45.40 us/run -   2.78 MFLOP/run -  61.34 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   336.03 us/run -  22.28 MFLOP/run -  66.30 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               11271 runs -    95.98 us/run - 115.40 MFLOP/run -   1.20 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1417 runs -   740.06 us/run - 923.24 MFLOP/run -   1.25 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      935 runs -  1070.72 us/run -   1.85 GFLOP/run -   1.73 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

PR, Collectives=0

./bin/test-backend-ops -o CONV_2D -b Vulkan0 perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce GTX 1060 (NVIDIA) | uma: 0 | fp16: 0 | bf16: 0 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
  Device description: NVIDIA GeForce GTX 1060
  Device memory: 6144 MB (6144 MB free)

  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     15 runs - 70782.33 us/run - 137.42 GFLOP/run -   1.94 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8228 runs -   126.55 us/run - 133.69 MFLOP/run -   1.06 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   188.36 us/run - 135.78 MFLOP/run - 720.87 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 98304 runs -    10.62 us/run - 642.82 kFLOP/run -  60.55 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    84.49 us/run -  20.90 MFLOP/run - 247.33 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    54.87 us/run -   2.78 MFLOP/run -  50.75 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   416.30 us/run -  22.28 MFLOP/run -  53.52 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10404 runs -   100.85 us/run - 115.40 MFLOP/run -   1.14 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1308 runs -   769.67 us/run - 923.24 MFLOP/run -   1.20 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      880 runs -  1167.87 us/run -   1.85 GFLOP/run -   1.58 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     15 runs - 71122.67 us/run - 137.42 GFLOP/run -   1.93 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8228 runs -   124.90 us/run - 133.69 MFLOP/run -   1.07 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5896 runs -   188.71 us/run - 135.78 MFLOP/run - 719.52 GFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 98304 runs -    10.22 us/run - 642.82 kFLOP/run -  62.90 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    85.33 us/run -  20.90 MFLOP/run - 244.90 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    55.35 us/run -   2.78 MFLOP/run -  50.31 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   420.99 us/run -  22.28 MFLOP/run -  52.92 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10404 runs -   100.87 us/run - 115.40 MFLOP/run -   1.14 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                1417 runs -   758.10 us/run - 923.24 MFLOP/run -   1.22 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                      880 runs -  1149.91 us/run -   1.85 GFLOP/run -   1.61 TFLOPS
  Backend Vulkan0: OK
Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

@jeffbolznv
Collaborator Author

Integer math was slower before Turing; I'll make a change to keep collectives enabled for older NVIDIA GPUs.

@jeffbolznv
Collaborator Author

Latest commit reenables collectives for pre-Turing.

@etasnadi
Contributor

Latest commit reenables collectives for pre-Turing.

Does it hurt the performance on recent devices, or why is it better to selectively enable/disable?

@jeffbolznv
Collaborator Author

On my 5090 it's like 25% slower to enable collectives.

@0cc4m
Collaborator

0cc4m commented Jul 30, 2025

Do you know why that is? Shouldn't subgroup ops be pretty quick?

@jeffbolznv
Collaborator Author

Shuffle is not nearly as fast as integer math on recent GPUs, and to some extent it competes with shared/global memory accesses (e.g. see https://forums.developer.nvidia.com/t/whats-the-difference-between-mio-and-lsu-instruction-queue-in-volta-architecture/124749), which are also a bottleneck in this shader.

@0cc4m
Collaborator

0cc4m commented Jul 30, 2025

Here are results from my hardware:

Nvidia RTX 3090
ggml_vulkan: 0 = NVIDIA GeForce RTX 3090 (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2

Master:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    113 runs -  8884.29 us/run - 137.42 GFLOP/run -  15.47 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               30668 runs -    32.82 us/run - 133.69 MFLOP/run -   4.07 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               30217 runs -    33.18 us/run - 135.78 MFLOP/run -   4.09 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                106496 runs -     9.72 us/run - 642.82 kFLOP/run -  66.13 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                23930 runs -    50.15 us/run -  20.90 MFLOP/run - 416.64 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                32768 runs -    39.62 us/run -   2.78 MFLOP/run -  70.28 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   254.07 us/run -  22.28 MFLOP/run -  87.69 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               19074 runs -    54.53 us/run - 115.40 MFLOP/run -   2.12 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                6540 runs -   154.69 us/run - 923.24 MFLOP/run -   5.97 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     4675 runs -   215.08 us/run -   1.85 GFLOP/run -   8.60 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    114 runs -  8787.96 us/run - 137.42 GFLOP/run -  15.64 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               30668 runs -    33.13 us/run - 133.69 MFLOP/run -   4.04 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               30217 runs -    33.41 us/run - 135.78 MFLOP/run -   4.06 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                106496 runs -     9.79 us/run - 642.82 kFLOP/run -  65.63 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                23930 runs -    50.27 us/run -  20.90 MFLOP/run - 415.72 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                32768 runs -    39.40 us/run -   2.78 MFLOP/run -  70.69 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   253.50 us/run -  22.28 MFLOP/run -  87.89 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               19074 runs -    54.89 us/run - 115.40 MFLOP/run -   2.10 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                6322 runs -   159.80 us/run - 923.24 MFLOP/run -   5.78 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     4730 runs -   212.61 us/run -   1.85 GFLOP/run -   8.70 TFLOPS

PR:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    117 runs -  8577.18 us/run - 137.42 GFLOP/run -  16.02 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               51612 runs -    19.51 us/run - 133.69 MFLOP/run -   6.85 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               42746 runs -    23.76 us/run - 135.78 MFLOP/run -   5.72 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                262144 runs -     3.89 us/run - 642.82 kFLOP/run - 165.44 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                67004 runs -    15.32 us/run -  20.90 MFLOP/run -   1.36 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                98304 runs -    10.66 us/run -   2.78 MFLOP/run - 261.36 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                22445 runs -    47.42 us/run -  22.28 MFLOP/run - 469.85 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               47685 runs -    21.01 us/run - 115.40 MFLOP/run -   5.49 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9919 runs -   101.64 us/run - 923.24 MFLOP/run -   9.08 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     5060 runs -   199.04 us/run -   1.85 GFLOP/run -   9.29 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                    119 runs -  8416.67 us/run - 137.42 GFLOP/run -  16.33 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               51612 runs -    19.44 us/run - 133.69 MFLOP/run -   6.88 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               42009 runs -    24.08 us/run - 135.78 MFLOP/run -   5.64 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                262144 runs -     3.86 us/run - 642.82 kFLOP/run - 166.50 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                86148 runs -    12.07 us/run -  20.90 MFLOP/run -   1.73 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               131072 runs -     8.03 us/run -   2.78 MFLOP/run - 346.80 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                26934 runs -    41.18 us/run -  22.28 MFLOP/run - 541.02 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               48552 runs -    20.78 us/run - 115.40 MFLOP/run -   5.55 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9919 runs -   101.32 us/run - 923.24 MFLOP/run -   9.11 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     5060 runs -   199.78 us/run -   1.85 GFLOP/run -   9.25 TFLOPS
AMD Radeon Pro VII
ggml_vulkan: 0 = AMD Radeon (TM) Pro VII (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none

Master:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     43 runs - 23266.07 us/run - 137.42 GFLOP/run -   5.91 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10472 runs -   100.17 us/run - 133.69 MFLOP/run -   1.33 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9581 runs -   112.47 us/run - 135.78 MFLOP/run -   1.21 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 81920 runs -    12.84 us/run - 642.82 kFLOP/run -  50.07 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    85.14 us/run -  20.90 MFLOP/run - 245.44 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    52.24 us/run -   2.78 MFLOP/run -  53.31 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   343.30 us/run -  22.28 MFLOP/run -  64.90 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8670 runs -   127.16 us/run - 115.40 MFLOP/run - 907.56 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2725 runs -   374.25 us/run - 923.24 MFLOP/run -   2.47 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1980 runs -   514.82 us/run -   1.85 GFLOP/run -   3.59 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     45 runs - 22607.02 us/run - 137.42 GFLOP/run -   6.08 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               10472 runs -   101.88 us/run - 133.69 MFLOP/run -   1.31 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9581 runs -   111.62 us/run - 135.78 MFLOP/run -   1.22 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 81920 runs -    12.83 us/run - 642.82 kFLOP/run -  50.10 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    84.91 us/run -  20.90 MFLOP/run - 246.10 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    52.10 us/run -   2.78 MFLOP/run -  53.45 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   343.44 us/run -  22.28 MFLOP/run -  64.87 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                8670 runs -   127.24 us/run - 115.40 MFLOP/run - 906.97 GFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                2725 runs -   374.90 us/run - 923.24 MFLOP/run -   2.46 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1980 runs -   513.75 us/run -   1.85 GFLOP/run -   3.60 TFLOPS


PR:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     45 runs - 22288.98 us/run - 137.42 GFLOP/run -   6.17 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24684 runs -    41.15 us/run - 133.69 MFLOP/run -   3.25 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               19162 runs -    54.04 us/run - 135.78 MFLOP/run -   2.51 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                131072 runs -     8.06 us/run - 642.82 kFLOP/run -  79.73 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                33502 runs -    30.35 us/run -  20.90 MFLOP/run - 688.44 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                57344 runs -    19.13 us/run -   2.78 MFLOP/run - 145.56 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -    91.74 us/run -  22.28 MFLOP/run - 242.83 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               26010 runs -    39.41 us/run - 115.40 MFLOP/run -   2.93 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                4360 runs -   230.02 us/run - 923.24 MFLOP/run -   4.01 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     2200 runs -   461.85 us/run -   1.85 GFLOP/run -   4.00 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     46 runs - 22113.48 us/run - 137.42 GFLOP/run -   6.21 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               24684 runs -    41.25 us/run - 133.69 MFLOP/run -   3.24 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               19162 runs -    54.01 us/run - 135.78 MFLOP/run -   2.51 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                131072 runs -     8.13 us/run - 642.82 kFLOP/run -  79.09 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                33502 runs -    30.41 us/run -  20.90 MFLOP/run - 687.21 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                57344 runs -    19.14 us/run -   2.78 MFLOP/run - 145.50 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -    92.45 us/run -  22.28 MFLOP/run - 240.97 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               26010 runs -    39.45 us/run - 115.40 MFLOP/run -   2.93 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                4469 runs -   228.62 us/run - 923.24 MFLOP/run -   4.04 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     2200 runs -   457.11 us/run -   1.85 GFLOP/run -   4.04 TFLOPS
Intel A770
ggml_vulkan: 0 = Intel(R) Arc(tm) A770 Graphics (DG2) (Intel open-source Mesa driver) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

Master:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     43 runs - 23399.37 us/run - 137.42 GFLOP/run -   5.87 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               14960 runs -    67.74 us/run - 133.69 MFLOP/run -   1.97 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               13266 runs -    78.16 us/run - 135.78 MFLOP/run -   1.74 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 81920 runs -    13.09 us/run - 642.82 kFLOP/run -  49.10 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    75.27 us/run -  20.90 MFLOP/run - 277.61 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    45.95 us/run -   2.78 MFLOP/run -  60.61 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   294.13 us/run -  22.28 MFLOP/run -  75.74 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9537 runs -   115.09 us/run - 115.40 MFLOP/run -   1.00 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3161 runs -   326.29 us/run - 923.24 MFLOP/run -   2.83 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1925 runs -   524.95 us/run -   1.85 GFLOP/run -   3.52 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     47 runs - 21516.98 us/run - 137.42 GFLOP/run -   6.39 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               15708 runs -    64.86 us/run - 133.69 MFLOP/run -   2.06 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               14003 runs -    74.20 us/run - 135.78 MFLOP/run -   1.83 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 81920 runs -    12.95 us/run - 642.82 kFLOP/run -  49.66 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                14358 runs -    74.18 us/run -  20.90 MFLOP/run - 281.71 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                24576 runs -    44.97 us/run -   2.78 MFLOP/run -  61.92 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   289.04 us/run -  22.28 MFLOP/run -  77.08 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                9537 runs -   109.57 us/run - 115.40 MFLOP/run -   1.05 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3161 runs -   317.33 us/run - 923.24 MFLOP/run -   2.91 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     2090 runs -   490.53 us/run -   1.85 GFLOP/run -   3.77 TFLOPS


PR:
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     32 runs - 31778.94 us/run - 137.42 GFLOP/run -   4.32 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               12716 runs -    79.90 us/run - 133.69 MFLOP/run -   1.67 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               11055 runs -    94.89 us/run - 135.78 MFLOP/run -   1.43 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                106496 runs -     9.69 us/run - 642.82 kFLOP/run -  66.35 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                33502 runs -    34.12 us/run -  20.90 MFLOP/run - 612.50 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                49152 runs -    21.51 us/run -   2.78 MFLOP/run - 129.44 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -   107.69 us/run -  22.28 MFLOP/run - 206.88 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               21675 runs -    46.68 us/run - 115.40 MFLOP/run -   2.47 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                4687 runs -   216.70 us/run - 923.24 MFLOP/run -   4.26 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1595 runs -   641.41 us/run -   1.85 GFLOP/run -   2.88 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     34 runs - 29641.18 us/run - 137.42 GFLOP/run -   4.64 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               13464 runs -    77.44 us/run - 133.69 MFLOP/run -   1.73 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               11055 runs -    94.57 us/run - 135.78 MFLOP/run -   1.44 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                106496 runs -     9.47 us/run - 642.82 kFLOP/run -  67.91 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                33502 runs -    33.67 us/run -  20.90 MFLOP/run - 620.61 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                49152 runs -    21.26 us/run -   2.78 MFLOP/run - 131.01 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -   106.60 us/run -  22.28 MFLOP/run - 208.99 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               21675 runs -    46.29 us/run - 115.40 MFLOP/run -   2.49 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                4578 runs -   218.96 us/run - 923.24 MFLOP/run -   4.22 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     1595 runs -   641.38 us/run -   1.85 GFLOP/run -   2.88 TFLOPS

There's an issue with Intel: the new results are worse than the old ones for large inputs. Any ideas? Everything else looks fine.

@jeffbolznv
Collaborator Author

There's an issue with Intel, the new results are worse than the old ones for large inputs. Any ideas?

Can you try forcing CONV_SHAPE_128x128 for Intel? That should be closer to the old behavior.

@jeffbolznv
Collaborator Author

Oh wait, the large convolutions are probably already using the large tile size. But maybe still worth verifying that's happening.

@jeffbolznv
Collaborator Author

Other possibilities might be that the shared memory stride padding or loop unrolling are harmful on Intel.

@etasnadi
Contributor

etasnadi commented Jul 30, 2025

Shuffle is not nearly as fast as integer math on recent GPUs, and to some extent competes with shared/global memory accesses (e.g. see https://forums.developer.nvidia.com/t/whats-the-difference-between-mio-and-lsu-instruction-queue-in-volta-architecture/124749) which are also a bottleneck in this shader.

Then it definitely makes sense to disable.

Now it would be good to know if this also holds for recent AMD/Intel/Adreno devices, and disable subgroup collectives there as well.

@etasnadi
Contributor

Here are results from my hardware:
Nvidia RTX 3090

AMD Radeon Pro VII
Intel A770

There's an issue with Intel, the new results are worse than the old ones for large inputs. Any ideas? Everything else looks fine.

Did you try with disabled collectives on AMD?

#ifdef USE_COLLECTIVES
# extension GL_KHR_shader_subgroup_shuffle : enable
#endif

#include "types.comp"

// Make spec constant
#define SHMEM_PAD 0
#define SHMEM_PAD 4
Contributor

@etasnadi Jul 30, 2025

If we use padding, wouldn't it be better to convert this to a spec constant, to make sure its value is consistent with the host code?

Collaborator Author

Yeah, it's a good idea. I'll change it.

@netrunnereve
Collaborator

Now it would be good to know if this also holds for recent AMD/Intel/Adreno devices and disable subgroups collectives.

When I worked on mat-vec optimizations, shared memory was actually faster than subgroup shuffles on GCN cards, even though in theory it should be the opposite. These things don't always make sense, and I guess the only way to find out is to get the hardware and test it.

@netrunnereve
Collaborator

Did you try with disabled collectives on AMD?

Hmm, GCN5 has single-cycle integer multiplication, so there's a chance it'll be faster with no collectives.

@jeffbolznv
Collaborator Author

#14982 is a follow-on change to use coopmat2 in this shader.

@0cc4m
Collaborator

0cc4m commented Jul 31, 2025

Here's another example where the first test is negatively affected. I'm doing some more tests on Intel and AMD.

AMD RX 6800 XT
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none

Master:
CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     58 runs - 17524.78 us/run - 137.42 GFLOP/run -   7.84 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               17204 runs -    59.02 us/run - 133.69 MFLOP/run -   2.27 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               13266 runs -    78.62 us/run - 135.78 MFLOP/run -   1.73 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                114688 runs -     9.18 us/run - 642.82 kFLOP/run -  70.02 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                19144 runs -    60.40 us/run -  20.90 MFLOP/run - 345.97 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                32768 runs -    34.52 us/run -   2.78 MFLOP/run -  80.68 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   234.14 us/run -  22.28 MFLOP/run -  95.15 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               11271 runs -    95.81 us/run - 115.40 MFLOP/run -   1.20 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3488 runs -   294.00 us/run - 923.24 MFLOP/run -   3.14 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     2255 runs -   453.73 us/run -   1.85 GFLOP/run -   4.07 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     57 runs - 17729.82 us/run - 137.42 GFLOP/run -   7.75 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               17204 runs -    59.20 us/run - 133.69 MFLOP/run -   2.26 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               13266 runs -    78.68 us/run - 135.78 MFLOP/run -   1.73 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                114688 runs -     9.20 us/run - 642.82 kFLOP/run -  69.84 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                19144 runs -    60.43 us/run -  20.90 MFLOP/run - 345.79 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                32768 runs -    34.57 us/run -   2.78 MFLOP/run -  80.54 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                 4489 runs -   233.03 us/run -  22.28 MFLOP/run -  95.60 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               11271 runs -    93.59 us/run - 115.40 MFLOP/run -   1.23 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                3488 runs -   291.75 us/run - 923.24 MFLOP/run -   3.16 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     2255 runs -   454.05 us/run -   1.85 GFLOP/run -   4.07 TFLOPS

PR:
CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     54 runs - 18532.46 us/run - 137.42 GFLOP/run -   7.42 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               26928 runs -    37.43 us/run - 133.69 MFLOP/run -   3.57 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               19899 runs -    51.08 us/run - 135.78 MFLOP/run -   2.66 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                212992 runs -     4.73 us/run - 642.82 kFLOP/run - 136.00 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                38288 runs -    29.79 us/run -  20.90 MFLOP/run - 701.48 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                65536 runs -    17.41 us/run -   2.78 MFLOP/run - 159.98 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -    90.22 us/run -  22.28 MFLOP/run - 246.94 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27744 runs -    36.60 us/run - 115.40 MFLOP/run -   3.15 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                4796 runs -   211.58 us/run - 923.24 MFLOP/run -   4.36 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     2365 runs -   428.54 us/run -   1.85 GFLOP/run -   4.31 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     54 runs - 18822.93 us/run - 137.42 GFLOP/run -   7.30 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               26928 runs -    37.74 us/run - 133.69 MFLOP/run -   3.54 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               19899 runs -    50.96 us/run - 135.78 MFLOP/run -   2.66 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                212992 runs -     4.73 us/run - 642.82 kFLOP/run - 135.89 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                38288 runs -    29.81 us/run -  20.90 MFLOP/run - 701.01 GFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                65536 runs -    17.40 us/run -   2.78 MFLOP/run - 160.05 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                13467 runs -    90.34 us/run -  22.28 MFLOP/run - 246.60 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               27744 runs -    36.20 us/run - 115.40 MFLOP/run -   3.19 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                4905 runs -   205.30 us/run - 923.24 MFLOP/run -   4.50 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     2365 runs -   426.08 us/run -   1.85 GFLOP/run -   4.34 TFLOPS


PR without collectives:
CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     67 runs - 15121.81 us/run - 137.42 GFLOP/run -   9.09 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               32912 runs -    31.04 us/run - 133.69 MFLOP/run -   4.31 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               25058 runs -    40.66 us/run - 135.78 MFLOP/run -   3.34 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                237568 runs -     4.30 us/run - 642.82 kFLOP/run - 149.66 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                52646 runs -    19.04 us/run -  20.90 MFLOP/run -   1.10 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                90112 runs -    11.26 us/run -   2.78 MFLOP/run - 247.28 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                17956 runs -    60.62 us/run -  22.28 MFLOP/run - 367.50 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               35547 runs -    28.74 us/run - 115.40 MFLOP/run -   4.02 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5886 runs -   172.87 us/run - 923.24 MFLOP/run -   5.34 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f32,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     2915 runs -   349.64 us/run -   1.85 GFLOP/run -   5.29 TFLOPS
  CONV_2D(ne_input=[19,19,256,16],ne_kernel=[4,4,256,4096],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     66 runs - 15295.30 us/run - 137.42 GFLOP/run -   8.98 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,128],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               32912 runs -    31.01 us/run - 133.69 MFLOP/run -   4.31 TFLOPS
  CONV_2D(ne_input=[19,19,8,16],ne_kernel=[4,4,8,130],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               25058 runs -    40.72 us/run - 135.78 MFLOP/run -   3.33 TFLOPS
  CONV_2D(ne_input=[19,19,4,16],ne_kernel=[2,2,4,4],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                237568 runs -     4.33 us/run - 642.82 kFLOP/run - 148.43 GFLOPS
  CONV_2D(ne_input=[224,224,3,1],ne_kernel=[3,3,3,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                52646 runs -    19.19 us/run -  20.90 MFLOP/run -   1.09 TFLOPS
  CONV_2D(ne_input=[224,224,1,1],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                90112 runs -    11.30 us/run -   2.78 MFLOP/run - 246.47 GFLOPS
  CONV_2D(ne_input=[224,224,1,8],ne_kernel=[2,2,1,8],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                17956 runs -    60.40 us/run -  22.28 MFLOP/run - 368.84 GFLOPS
  CONV_2D(ne_input=[58,58,32,1],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):               34680 runs -    28.86 us/run - 115.40 MFLOP/run -   4.00 TFLOPS
  CONV_2D(ne_input=[58,58,32,8],ne_kernel=[3,3,32,64],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                5995 runs -   169.32 us/run - 923.24 MFLOP/run -   5.45 TFLOPS
  CONV_2D(ne_input=[16,16,128,8],ne_kernel=[3,3,128,512],type_kernel=f16,stride0=1,stride1=1,padding0=0,padding1=0,dilation0=1,dilation1=1,cwhn=0):                     2915 runs -   347.27 us/run -   1.85 GFLOP/run -   5.32 TFLOPS

Edit: This one is easily solved by disabling subgroup shuffle. I added the results from that to the details. That means that they should also be disabled at least on AMD RDNA. I'll check GCN again.

@0cc4m
Collaborator

0cc4m commented Jul 31, 2025

Here's a diff that gets good performance on Intel:

diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 6647b1cc2..9b3feb9ce 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -3093,7 +3093,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
         uint32_t use_collectives = 0;  // Enables subgroup ops for preventing the re-calculation of indices.
         uint32_t conv2d_BS_NPQ = 128;
         uint32_t conv2d_TS_K   = 8;
-        uint32_t conv2d_SHMEM_PAD = 4;
+        uint32_t conv2d_SHMEM_PAD = 0;

         switch (s) {
         default:
@@ -7060,9 +7061,10 @@ static vk_pipeline ggml_vk_op_get_pipeline(ggml_backend_vk_context * ctx, const
             for (uint32_t i = 0; i < CONV_SHAPE_COUNT; ++i) {
                 tiles[i] = CEIL_DIV(elements[0], ctx->device->pipeline_conv2d_f32[i]->wg_denoms[0]) * CEIL_DIV(elements[1], ctx->device->pipeline_conv2d_f32[i]->wg_denoms[1]);
             }
-            if (elements[0] > 64 && tiles[CONV_SHAPE_128x128] >= ctx->device->shader_core_count * 2) {
+            const uint32_t shader_core_count = ctx->device->shader_core_count > 0 ? ctx->device->shader_core_count : 32;
+            if (elements[0] > 64 && tiles[CONV_SHAPE_128x128] >= shader_core_count * 2) {
                 shape = CONV_SHAPE_128x128;
-            } else if (elements[0] <= 32 && tiles[CONV_SHAPE_32x256] >= ctx->device->shader_core_count * 2) {
+            } else if (elements[0] <= 32 && tiles[CONV_SHAPE_32x256] >= shader_core_count * 2) {
                 shape = CONV_SHAPE_32x256;
             } else {
                 shape = CONV_SHAPE_64x32;
diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp b/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp
index 32bd9d4d6..69494c119 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp
@@ -202,7 +202,7 @@ void main() {
             Ash[B_ly * Ash_stride + B_lx] = val;
         }
         /* Load input to B_block: (BS_CRS x BS_NPQ) */
-        [[unroll]] for (uint32_t r_offset = 0; r_offset < BS_CRS; r_offset += BrpWg) {
+        for (uint32_t r_offset = 0; r_offset < BS_CRS; r_offset += BrpWg) {
             uint32_t B_ly          = r_offset + Br;             /* Row index of B block */
             uint32_t B_lx          = Bc;
             uint32_t NPQ_idx       = B_idx_NPQ * BS_NPQ + B_lx; /* Global NPQ index (column index of B) */
@@ -248,7 +248,7 @@ void main() {
         }
         barrier();
         if (T_y * TS_K < K) {
-            [[unroll]] for (uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++) {
+            for (uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++) {
                 for (uint32_t T_ly = 0; T_ly < TS_K; T_ly++) {
                     regA[T_ly] = Ash[(T_y * TS_K + T_ly) * Ash_stride + CRS_lidx];
                 }

The problems are:

  • Intel does not like the shmem padding
    • easy to fix
  • Your shader selection logic does not take into account that shader_core_count is 0 on non-Nvidia and non-AMD devices
    • Not sure how to fix this. My A770 has 32 Xe units, so I hardcoded that value and it seems to work alright, but that's not a general solution
  • Intel does not like your unrolling
    • fixable, but annoying if that means more shader variants; maybe it can be solved with another specialization constant?

@jeffbolznv
Collaborator Author

 Intel does not like your unrolling

fixable but annoying if that means more shader variants, maybe it can be solved with another specialization constant?

Does AMD benefit from the unrolling? I retested, and it seems like the first unroll doesn't make a difference on NV, and the second one is in code that will be replaced by coopmat2. So we could potentially just remove the forced unrolling if AMD performance is OK.

Your shader selection logic does not take into account that shader_core_count is 0 on non-Nvidia and non-AMD devices

I had tried to write it in a way where it would continue to use the original tile size on Intel. But it sounds like there are benefits to the other tile sizes on Intel, so I think a hardcoded constant like 32 is OK for now.

@jeffbolznv
Collaborator Author

the first unroll doesn't make a difference on NV

Actually, it looks like there's a small gain. I'm going to try multiple variants and only do unrolling for smaller loop counts, we'll see how that works across all HW.

@jeffbolznv
Collaborator Author

only do unrolling for smaller loop counts

This didn't work out as cleanly as I had hoped, even on NV. So I just did variants with/without unrolling selected based on device type.

I think we just need a decision on whether to enable unrolling and/or collectives on AMD.

@0cc4m
Collaborator

0cc4m commented Aug 1, 2025

Thank you, now it works well on Intel. I ran further benchmarks on AMD Radeon Pro VII and RX 6800 XT, this is what I came up with:

diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 5898bf8df..2cd32fbb5 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -3099,6 +3099,8 @@ static void ggml_vk_load_shaders(vk_device& device) {
         if (device->vendor_id == VK_VENDOR_ID_INTEL) {
             conv2d_SHMEM_PAD = 0;
             conv2d_UNROLL = false;
+        } else if (device->vendor_id == VK_VENDOR_ID_AMD) {
+            conv2d_SHMEM_PAD = device->architecture == vk_device_architecture::AMD_GCN ? 1 : 4;
         }

         switch (s) {
@@ -3107,6 +3109,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
             conv2d_BS_K = 128;
             conv2d_BS_NPQ = 128;
             conv2d_BS_CRS = 16;
+            if (device->vendor_id == VK_VENDOR_ID_AMD && device->architecture != vk_device_architecture::AMD_GCN) {
+                conv2d_UNROLL = false;
+            }
             break;
         case CONV_SHAPE_64x32:
             conv2d_BS_K = 64;
@@ -3121,13 +3126,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
             break;
         }

-        // Use collectives on pre-Turing NVIDIA GPUs, which had slower integer math.
+        // Use collectives on pre-Turing NVIDIA GPUs and GCN AMD cards, which had slower integer math.
         bool allow_collectives_nv = device->vendor_id != VK_VENDOR_ID_NVIDIA ||
                                     device->architecture == vk_device_architecture::NVIDIA_PRE_TURING;
+        bool allow_collectives_amd = device->vendor_id != VK_VENDOR_ID_AMD ||
+                                     device->architecture == vk_device_architecture::AMD_GCN;

         if (device->subgroup_shuffle &&
             device->vendor_id != VK_VENDOR_ID_INTEL &&   // Do not enable collectives on Intel, see PR 14316.
-            allow_collectives_nv) {
+            allow_collectives_nv &&
+            allow_collectives_amd) {
             use_collectives = 1;
             conv2d_BS_CRS   = std::min(
                 device->subgroup_size,

There seems to be a difference in whether unrolling helps on RDNA2 depending on the shader size. It helps with the small variants and slows the large tests slightly, that's why I added the if statement in the switch case. Not sure what the reason for this behaviour is.

Collaborator

@0cc4m left a comment

LGTM, all my GPUs get a boost from this now.

@0cc4m 0cc4m merged commit a9f7541 into ggml-org:master Aug 2, 2025
46 of 47 checks passed
Labels
ggml changes relating to the ggml tensor library for machine learning Vulkan Issues specific to the Vulkan backend