vulkan: optimizations for direct convolution #14933
Conversation
- Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too.
- Fix shmem bank conflicts. 16B padding should work with coopmat (see the sketch after this list).
- Some explicit loop unrolling.
- Skip math/stores work for parts of the tile that are OOB.
- Apply the fastdiv optimization.
- Disable shuffles for NV.
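For context on the bank-conflict point, here is a minimal standalone sketch of the arithmetic (not the shader code itself): in a row-major shared-memory tile whose row stride is a multiple of the bank count, every element of a column lands in the same bank, so threads reading down a column serialize; padding the stride spreads the column across banks. It assumes the common layout of 32 banks of 4-byte words, and the BS_CRS/SHMEM_PAD values below are illustrative defaults, not the exact layout of conv2d_mm.comp.

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

// Bank of a 4-byte shared-memory element, assuming 32 banks of 4-byte words.
static uint32_t bank_of(uint32_t elem_index) { return elem_index % 32; }

// How many distinct banks one column of a row-major tile touches across 32 rows.
static size_t distinct_banks(uint32_t row_stride, uint32_t col) {
    std::set<uint32_t> banks;
    for (uint32_t row = 0; row < 32; ++row) {
        banks.insert(bank_of(row * row_stride + col));
    }
    return banks.size();
}

int main() {
    const uint32_t BS_CRS    = 32; // unpadded row stride in floats (a multiple of 32)
    const uint32_t SHMEM_PAD = 4;  // 4 floats = 16 bytes of padding per row
    // 1 distinct bank means a full 32-way conflict; more distinct banks means less serialization.
    printf("stride %2u: column 0 touches %zu bank(s)\n", BS_CRS, distinct_banks(BS_CRS, 0));
    printf("stride %2u: column 0 touches %zu bank(s)\n", BS_CRS + SHMEM_PAD, distinct_banks(BS_CRS + SHMEM_PAD, 0));
    return 0;
}
```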
On my rtx 2070 mobile:
sd1 vae 512x768 before:
pr:
This does not look good in practice with sd.cpp. I did not run the sampling with this patch. |
Please share the command line you used for sd. Maybe this is the same as the 4070 regression, I'll look into it. |
should be the same with the base f16.safetensors file. |
Another example using sd2 https://huggingface.co/Green-Sky/SD-Turbo-GGUF/blob/main/sd_turbo-f16-q8_0.gguf before:
after:
edit: im2col+mat_mul for comparison
|
Did you use coopmat/coopmat2 for the im2col comparison? The 5090 reaches 163.77 TFLOPS, but Wikipedia says it can only do 104.8 TFLOPS fp16/fp32. Otherwise there is a bug in the perf code for graphs. |
Yes, I kept coopmat2 enabled for the im2col test. |
I decorated the conv calls to log the tensor shapes: For sd2 (same gguf as above). UNET
VAE
|
@Green-Sky do I need to make any manual edits to use direct convolution? I'm using your |
Right. You need leejet/stable-diffusion.cpp#744 currently. |
Perf numbers from my side:
master:
pr:
|
I can see a similar 25%-ish slowdown on 4070:
I'll start by looking at the test-backend-ops test. |
I looked at the backend tests. The larger tile size is better for very large convolutions (like the first test). I think I need to allow for multiple tile sizes, and doing so should allow for further gains for some of the other shapes. I'll do some more experimentation. |
I pushed another commit that has three tile sizes - the original, the one I had used in the previous commit, and a third for smaller K values. It chooses between them based on K/NPQ sizes and number of SMs. This restores the performance in the directed tests:
This also restores the performance in sd.cpp according to the debug message, which doesn't seem to be very precise. The GGML_VK_PERF_LOGGER output is more clear, for example this is previous TOT ggml vs this updated PR on 5090:
Many of the buckets show a similar improvement to the first directed test (30 TFLOPS -> 43 TFLOPS), and the two buckets with small K show even bigger improvements. The gains on 4070 are smaller. The tile selection logic is no longer NV-specific. I'd appreciate some testing on other hardware. If it doesn't help, I can restrict the changes again. I think we still don't have a way to query number of SMs for Intel, but the logic is written in a way where it'll then just base it on K/NPQ. |
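The selection heuristic described above amounts to something like the following sketch. It is simplified from the host-code diff that appears later in this thread; the names, thresholds, and fallback core count are illustrative, not the exact ggml-vulkan code.

```cpp
#include <cstdint>
#include <cstdio>

// K = output channels (rows of the implicit matmul), NPQ = batch * output pixels
// (columns). A tile shape is only worth using if it still yields enough workgroups
// to occupy all SMs/CUs.
enum conv_shape { CONV_SHAPE_128x128, CONV_SHAPE_64x32, CONV_SHAPE_32x256 };

static uint32_t ceil_div(uint32_t a, uint32_t b) { return (a + b - 1) / b; }

static conv_shape select_conv_shape(uint32_t K, uint32_t NPQ, uint32_t shader_core_count) {
    // Workgroups needed to cover a K x NPQ output with each candidate tile.
    const uint32_t tiles_128x128 = ceil_div(K, 128) * ceil_div(NPQ, 128);
    const uint32_t tiles_32x256  = ceil_div(K, 32)  * ceil_div(NPQ, 256);
    // If the SM/CU count cannot be queried (e.g. Intel), assume a default
    // (the thread later settles on 32).
    const uint32_t cores = shader_core_count > 0 ? shader_core_count : 32;

    if (K > 64 && tiles_128x128 >= cores * 2) {
        return CONV_SHAPE_128x128;   // large K and plenty of workgroups: big tile
    }
    if (K <= 32 && tiles_32x256 >= cores * 2) {
        return CONV_SHAPE_32x256;    // small K: tall-and-skinny tile
    }
    return CONV_SHAPE_64x32;         // otherwise the small default tile fills the GPU better
}

int main() {
    printf("%d\n", select_conv_shape(4096, 4096, 128));  // expect the 128x128 tile (prints 0)
    printf("%d\n", select_conv_shape(32, 65536, 128));   // expect the 32x256 tile (prints 2)
    return 0;
}
```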
New numbers time.
master:
pr:
Most are now faster. sd2 vae decode (same as above): master:
pr:
Looks better now. The difference between the master numbers now and before is pretty high, so I should probably bench this on a clean reboot with nothing else open. Will try different models tomorrow. I'm a fan of your work btw, on vulkan generally, but also the llama.cpp/ggml stuff specifically (: |
I tested it with my ancient GTX 1060 GPU, and surprisingly, this PR is much faster in some cases (except for the 4096 by 4096 case). I did not test which specific improvements are responsible for the performance gains, but I expect that fastdiv is a better option than shuffle. I was also considering adding fastdiv, but since the warp shuffle allowed computing divisions only once per thread per blocktile, I did not expect much gain from it. Disadvantages of shuffle:
I suspect this behavior is not unique to Nvidia, and fastdiv performs better than shuffle on old non-Nvidia devices as well. If this is confirmed, then it would be a good idea to drop shuffle for fastdiv.
|
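For reference, the fastdiv optimization discussed here replaces a division by a divisor that is constant per dispatch with a multiply-high and a shift, using magic constants precomputed on the host (the classic Granlund-Montgomery trick). A standalone sketch of the arithmetic, assuming 32-bit operands; this is not the actual shader code.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Precomputed "magic" constants for dividing by a fixed d (1 <= d < 2^31).
struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t L;  // shift amount, ceil(log2(d))
};

// Host side: compute mp and L once per divisor and pass them to the shader
// (e.g. as push constants), instead of doing an integer division per element.
static fastdiv_vals init_fastdiv_values(uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint32_t(1) << L) < d) {
        ++L;
    }
    uint32_t mp = uint32_t(((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
    return { mp, L };
}

// Shader side (written here in C++): n / d == (mulhi(n, mp) + n) >> L.
static uint32_t fastdiv(uint32_t n, fastdiv_vals f) {
    uint64_t hi = (uint64_t(n) * f.mp) >> 32;    // high 32 bits of the 64-bit product
    return uint32_t((hi + n) >> f.L);            // add and shift, done in 64-bit here
}

int main() {
    for (uint32_t d : { 3u, 7u, 640u, 4096u }) {
        const fastdiv_vals f = init_fastdiv_values(d);
        for (uint32_t n : { 0u, 1u, d - 1, d, 123456789u, 4294967295u }) {
            assert(fastdiv(n, f) == n / d);
        }
    }
    return 0;
}
```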
Performance numbers on my laptop look really nice with this PR. Ryzen 7 4700U (Radeon Vega 7 iGPU), 16 GB RAM
Master:
PR:
|
Here's a run on my RX 470, with everything being faster than master except for the PR:
PR with no collectives:
Master:
I also have a 512x512 SDXL run with all three:
- PR: 1.03 s/it sampling, 1.33 s decoding
- PR with no collectives: 1.05 s/it sampling, 1.38 s decoding
- Master: 1.09 s/it sampling, 1.41 s decoding |
Motivated by #14933 (comment), I tested this PR with collectives on/off on a GTX 1060 (Notebook). It seems that the effects of collectives (warp shuffle) and the contributions of this PR can add up, even for the seemingly already-peaking 4096x4096 test case, where the combination of the two reaches 2.08-2.10 TFLOPS, which I could not achieve before. PS: collectives also helped on this Nvidia card before this PR, I guess because of the poor idiv throughput. llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp, line 3103 at 136ecfb
|
Integer math was slower before Turing; I'll make a change to keep collectives enabled for older NVIDIA GPUs. |
Latest commit reenables collectives for pre-Turing. |
Does it hurt the performance on recent devices, or why is it better to selectively enable/disable? |
On my 5090 it's like 25% slower to enable collectives. |
Do you know why that is? Shouldn't subgroup ops be pretty quick? |
Shuffle is not nearly as fast as integer math on recent GPUs, and to some extent competes with shared/global memory accesses (e.g. see https://forums.developer.nvidia.com/t/whats-the-difference-between-mio-and-lsu-instruction-queue-in-volta-architecture/124749) which are also a bottleneck in this shader. |
Here are results from my hardware: Nvidia RTX 3090
AMD Radeon Pro VII
Intel A770
There's an issue with Intel: the new results are worse than the old ones for large inputs. Any ideas? Everything else looks fine. |
Can you try forcing CONV_SHAPE_128x128 for intel? That should be closer to the old behavior. |
Oh wait, the large convolutions are probably already using the large tile size. But maybe still worth verifying that's happening. |
Other possibilities might be that the shared memory stride padding or loop unrolling are harmful on Intel. |
Then it definitely makes sense to disable. Now it would be good to know if this also holds for recent AMD/Intel/Adreno devices, and disable it there as well. |
Did you try with disabled collectives on AMD? |
#ifdef USE_COLLECTIVES
#extension GL_KHR_shader_subgroup_shuffle : enable
#endif

#include "types.comp"

// Make spec constant
-#define SHMEM_PAD 0
+#define SHMEM_PAD 4
If we use padding, wouldn't it be better if we converted this to spec constant to make sure that its value is consistent with the host code?
Yeah, it's a good idea. I'll change it.
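For context, a minimal host-side sketch of what a specialization constant for the pad could look like, so the shader value can never drift from the host value. The constant_id, struct, and function names are illustrative, not the actual ggml-vulkan plumbing (which routes its parameters through its own pipeline-creation helpers).

```cpp
#include <cstdint>
#include <cstdio>
#include <vulkan/vulkan.h>

// GLSL side, for reference (constant_id 0 is an illustrative choice):
//   layout(constant_id = 0) const uint SHMEM_PAD = 4;

// Host-side storage for the specialization data. The struct must outlive
// pipeline creation, because VkSpecializationInfo stores pointers into it.
struct shmem_pad_spec {
    uint32_t                 value;
    VkSpecializationMapEntry entry;
    VkSpecializationInfo     info;
};

static void init_shmem_pad_spec(shmem_pad_spec & s, uint32_t pad) {
    s.value = pad;
    s.entry = {};
    s.entry.constantID = 0;                 // matches constant_id in the shader
    s.entry.offset     = 0;                 // offset of `value` inside pData
    s.entry.size       = sizeof(uint32_t);
    s.info = {};
    s.info.mapEntryCount = 1;
    s.info.pMapEntries   = &s.entry;
    s.info.dataSize      = sizeof(uint32_t);
    s.info.pData         = &s.value;
}

int main() {
    shmem_pad_spec s;
    init_shmem_pad_spec(s, 4);
    // When creating the compute pipeline, point the shader stage at it:
    //   VkPipelineShaderStageCreateInfo stage = { ... };
    //   stage.pSpecializationInfo = &s.info;
    printf("constant_id %u will be specialized to %u\n", s.entry.constantID, s.value);
    return 0;
}
```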
When I worked on mat-vec optimizations, shared memory was actually faster than subgroup shuffles on GCN cards, even though in theory it should be the opposite. These things don't always make sense, and I guess the only way to find out is to get the hardware and test it. |
Hmm, GCN5 has single-cycle integer multiplication, so there's a chance it'll be faster with no collectives. |
#14982 is a follow-on change to use coopmat2 in this shader. |
Here's another example where the first test is negatively affected. I'm doing some more tests on Intel and AMD. AMD RX 6800 XT
Edit: This one is easily solved by disabling subgroup shuffle. I added the results from that to the details. That means that they should also be disabled at least on AMD RDNA. I'll check GCN again. |
Here's a diff that gets good performance on Intel:
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 6647b1cc2..9b3feb9ce 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -3093,7 +3093,7 @@ static void ggml_vk_load_shaders(vk_device& device) {
uint32_t use_collectives = 0; // Enables subgroup ops for preventing the re-calculation of indices.
uint32_t conv2d_BS_NPQ = 128;
uint32_t conv2d_TS_K = 8;
- uint32_t conv2d_SHMEM_PAD = 4;
+ uint32_t conv2d_SHMEM_PAD = 0;
switch (s) {
default:
@@ -7060,9 +7061,10 @@ static vk_pipeline ggml_vk_op_get_pipeline(ggml_backend_vk_context * ctx, const
for (uint32_t i = 0; i < CONV_SHAPE_COUNT; ++i) {
tiles[i] = CEIL_DIV(elements[0], ctx->device->pipeline_conv2d_f32[i]->wg_denoms[0]) * CEIL_DIV(elements[1], ctx->device->pipeline_conv2d_f32[i]->wg_denoms[1]);
}
- if (elements[0] > 64 && tiles[CONV_SHAPE_128x128] >= ctx->device->shader_core_count * 2) {
+ const uint32_t shader_core_count = ctx->device->shader_core_count > 0 ? ctx->device->shader_core_count : 32;
+ if (elements[0] > 64 && tiles[CONV_SHAPE_128x128] >= shader_core_count * 2) {
shape = CONV_SHAPE_128x128;
- } else if (elements[0] <= 32 && tiles[CONV_SHAPE_32x256] >= ctx->device->shader_core_count * 2) {
+ } else if (elements[0] <= 32 && tiles[CONV_SHAPE_32x256] >= shader_core_count * 2) {
shape = CONV_SHAPE_32x256;
} else {
shape = CONV_SHAPE_64x32;
diff --git a/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp b/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp
index 32bd9d4d6..69494c119 100644
--- a/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp
+++ b/ggml/src/ggml-vulkan/vulkan-shaders/conv2d_mm.comp
@@ -202,7 +202,7 @@ void main() {
Ash[B_ly * Ash_stride + B_lx] = val;
}
/* Load input to B_block: (BS_CRS x BS_NPQ) */
- [[unroll]] for (uint32_t r_offset = 0; r_offset < BS_CRS; r_offset += BrpWg) {
+ for (uint32_t r_offset = 0; r_offset < BS_CRS; r_offset += BrpWg) {
uint32_t B_ly = r_offset + Br; /* Row index of B block */
uint32_t B_lx = Bc;
uint32_t NPQ_idx = B_idx_NPQ * BS_NPQ + B_lx; /* Global NPQ index (column index of B) */
@@ -248,7 +248,7 @@ void main() {
}
barrier();
if (T_y * TS_K < K) {
- [[unroll]] for (uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++) {
+ for (uint32_t CRS_lidx = 0; CRS_lidx < BS_CRS; CRS_lidx++) {
for (uint32_t T_ly = 0; T_ly < TS_K; T_ly++) {
regA[T_ly] = Ash[(T_y * TS_K + T_ly) * Ash_stride + CRS_lidx];
                     }
The problems are:
- the 16B shared memory padding (conv2d_SHMEM_PAD = 4) hurts performance on Intel,
- the forced loop unrolling hurts performance on Intel,
- shader_core_count is 0 on Intel, so the tile-selection heuristic degenerates; substituting a default (32 here) restores the intended selection.
|
Does AMD benefit from the unrolling? I retested and it seems like the first unroll doesn't make a difference on NV, and the second one is in code that will be replaced by coopmat2. So we could potentially just remove the forced unrolling if AMD performance is OK.
I had tried to write it in a way where it will continue to use the original tile size on Intel. But it sounds like there are benefits to the other tile sizes on Intel, so I think a hardcoded constant like 32 is OK for now. |
Actually, it looks like there's a small gain. I'm going to try multiple variants and only do unrolling for smaller loop counts; we'll see how that works across all HW. |
This didn't work out as cleanly as I had hoped, even on NV. So I just did variants with/without unrolling selected based on device type. I think we just need a decision on whether to enable unrolling and/or collectives on AMD. |
Thank you, now it works well on Intel. I ran further benchmarks on AMD Radeon Pro VII and RX 6800 XT; this is what I came up with:
diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 5898bf8df..2cd32fbb5 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -3099,6 +3099,8 @@ static void ggml_vk_load_shaders(vk_device& device) {
if (device->vendor_id == VK_VENDOR_ID_INTEL) {
conv2d_SHMEM_PAD = 0;
conv2d_UNROLL = false;
+ } else if (device->vendor_id == VK_VENDOR_ID_AMD) {
+ conv2d_SHMEM_PAD = device->architecture == vk_device_architecture::AMD_GCN ? 1 : 4;
}
switch (s) {
@@ -3107,6 +3109,9 @@ static void ggml_vk_load_shaders(vk_device& device) {
conv2d_BS_K = 128;
conv2d_BS_NPQ = 128;
conv2d_BS_CRS = 16;
+ if (device->vendor_id == VK_VENDOR_ID_AMD && device->architecture != vk_device_architecture::AMD_GCN) {
+ conv2d_UNROLL = false;
+ }
break;
case CONV_SHAPE_64x32:
conv2d_BS_K = 64;
@@ -3121,13 +3126,16 @@ static void ggml_vk_load_shaders(vk_device& device) {
break;
}
- // Use collectives on pre-Turing NVIDIA GPUs, which had slower integer math.
+ // Use collectives on pre-Turing NVIDIA GPUs and GCN AMD cards, which had slower integer math.
bool allow_collectives_nv = device->vendor_id != VK_VENDOR_ID_NVIDIA ||
device->architecture == vk_device_architecture::NVIDIA_PRE_TURING;
+ bool allow_collectives_amd = device->vendor_id != VK_VENDOR_ID_AMD ||
+ device->architecture == vk_device_architecture::AMD_GCN;
if (device->subgroup_shuffle &&
device->vendor_id != VK_VENDOR_ID_INTEL && // Do not enable collectives on Intel, see PR 14316.
- allow_collectives_nv) {
+ allow_collectives_nv &&
+ allow_collectives_amd) {
use_collectives = 1;
conv2d_BS_CRS = std::min(
             device->subgroup_size,
There seems to be a difference in whether unrolling helps on RDNA2 depending on the shader size. It helps with the small variants and slows the large tests slightly, which is why I added the if statement in the switch case. I'm not sure what the reason for this behaviour is. |
Co-authored-by: 0cc4m <[email protected]>
LGTM, all my GPUs get a boost from this now.
I haven't looked into why the first test case (// K=CRS=NPQ=4096 conv2d matmul performance) is slower on 4070. That's the one that seems most likely to benefit from coopmat, so I'd prefer to wait until we add coopmat support to worry about that.
Here's a comparison to the im2col path using #14833. All test cases except the first are faster than the im2col path.
cc @etasnadi