HIP: Enable Matrix cores for MMQ Kernels, Enable stream-K for CDNA 3 #14624
Conversation
I would be happy to get on a call with you to discuss AMD hardware support; my email address can be found on my GitHub page.
@deepsek Thanks for the contribution and for reaching out. On topics related to the CUDA backend, @JohannesGaessler is the best person to consult with. For additional backends, @slaren can provide guidelines and advice. I'll be happy to provide input on any matters as well. I am also available for a call - feel free to contact me.
Very nice to see the initiative. I assume improvements made for CDNA will also carry over to the consumer side next year when UDNA releases. So this is exciting news for the future of AMD products!
This certainly is good news.
Sorry, I wanted to ask: @IMbackK, since you've been working on AMD support, are you interested in joining the discussion?
Yes, certainly. It would help to avoid duplication of effort. I can be reached via email at uvos.xyz, user carl.
Hi @JohannesGaessler, is there any blocker for merging this PR into the main branch?
@deepsek There are a few small things as discussed. One is better naming for this MFMA path, so that an RDNA WMMA solution can be added later without the naming being strange. Another is the use of two V_MFMA_I32_16X16X16I8 instructions on gfx908 and gfx90a, even if this path is not chosen for those, to ease maintainability. I would also like to try this myself on gfx94x somehow, and I am not sure what the state is with regard to access to AMD's cloud for maintenance of a gfx94x-specific code path; maybe @ggerganov can also comment on that. A problem here is that after CDNA2/gfx90a/MI210 AMD has not made any further CDNA devices in a PCIe add-in board form factor, so outside of acquiring an entire MI300 OAM machine no one can simply add a CDNA3/gfx94x/MI3xx compatible card to their system.
@deepsek Oops, sorry, I accidentally edited your post instead of quoting it, please repost. From my side there is nothing further missing; I'd just like to give it another spin to test for regressions, and I will approve after.
No worries haha. I was just saying. Based on all the comments so far, looks like
Is there actual guidance as to when performance is preferred over memory usage? There seem to be conflicting viewpoints. It would be great to have this documented, so that when we contribute and add other architectures there is a common design principle. P.S.,
```cpp
if constexpr (I == 64 && J == 2) { // Special tile size to load <16, 4> as <16, 8>
#pragma unroll
    for (int l = 0; l < t.ne; ++l) {
        t.x[l] = xs0[t.get_i(l)*stride + t.get_j(l)];
    }
} else {
    int64_t * xi = (int64_t *) t.x;
    const int64_t * xs = (int64_t *) ((const int *) xs0 + (threadIdx.x % t.I) * stride + 2 * (threadIdx.x / t.I));
    xi[0] = xs[0];
}
```
```diff
-if constexpr (I == 64 && J == 2) { // Special tile size to load <16, 4> as <16, 8>
-#pragma unroll
-    for (int l = 0; l < t.ne; ++l) {
-        t.x[l] = xs0[t.get_i(l)*stride + t.get_j(l)];
-    }
-} else {
-    int64_t * xi = (int64_t *) t.x;
-    const int64_t * xs = (int64_t *) ((const int *) xs0 + (threadIdx.x % t.I) * stride + 2 * (threadIdx.x / t.I));
-    xi[0] = xs[0];
-}
+if constexpr (I != 64 || J != 2) {
+    int64_t * xi = (int64_t *) t.x;
+    const int64_t * xs = (int64_t *) ((const int *) xs0 + (threadIdx.x % t.I) * stride + 2 * (threadIdx.x / t.I));
+    xi[0] = xs[0];
+    return;
+}
```
I think this would be simpler.
Do you mean without the preprocessor directives?
This would affect the NV code path when we call load_generic though? I see some instances where load_generic is called
```cpp
static __device__ __forceinline__ void load_generic(...) {
    if constexpr (I != 64 || J != 2) {
        int64_t * xi = (int64_t *) t.x;
        const int64_t * xs = (int64_t *) ((const int *) xs0 + (threadIdx.x % t.I) * stride + 2 * (threadIdx.x / t.I));
        xi[0] = xs[0];
        return;
    }

#pragma unroll
    for (int l = 0; l < t.ne; ++l) {
        t.x[l] = xs0[t.get_i(l)*stride + t.get_j(l)];
    }
}
```
I basically meant to have the instructions for loading data as 64 bit encapsulated in an `#ifdef AMD_MFMA_AVAILABLE ... #endif` block,
and to use the generic implementation if the preconditions aren't met. But if this is going to be refactored anyway it doesn't matter.
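For illustration, a minimal sketch of that structure, reusing the (elided) signature and tile members from the snippet above; this is not the actual patch, just one way the `#ifdef` could be arranged:

```cpp
// Illustrative sketch, not the actual patch: the 64-bit wide load only exists
// on the AMD MFMA path; every other build (including the NVIDIA path) falls
// through to the generic per-element loop.
static __device__ __forceinline__ void load_generic(...) {
#if defined(AMD_MFMA_AVAILABLE)
    if constexpr (I != 64 || J != 2) { // regular MFMA tiles: load 64 bits per thread
        int64_t * xi = (int64_t *) t.x;
        const int64_t * xs = (int64_t *) ((const int *) xs0 + (threadIdx.x % t.I) * stride + 2 * (threadIdx.x / t.I));
        xi[0] = xs[0];
        return;
    }
#endif // AMD_MFMA_AVAILABLE

#pragma unroll
    for (int l = 0; l < t.ne; ++l) { // generic fallback: one element at a time
        t.x[l] = xs0[t.get_i(l)*stride + t.get_j(l)];
    }
}
```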
I don't understand what you're doing with
It's very hard to have a single design principle for every situation. Historically, some models would need as much as 6 GB of VRAM for the dequantization buffer. This would cause a lot of problems for people using consumer GPUs who would not understand why they could not offload to the GPU as many layers of the model as they would expect. This was one of the reasons why MMQ was made the default, even though it was not faster than cuBLAS in every situation. Ultimately, it doesn't matter if cuBLAS is 10% faster if using it means that you need to keep a large portion of the model on a CPU that is 10 times slower. For a data center GPU where VRAM is not so limited, the calculus may be different.
I ran the following tests on my RTX 3090/4090:
With
On my RTX 4090, MMQ is faster for all quantization formats except q2_K at batch sizes <= 2048; for batch sizes <= 1024 MMQ is always faster. On my RTX 3090, MMQ is faster for all quantization formats except q2_K at batch sizes <= 512.
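Restating those numbers as a hypothetical selection helper, purely for illustration; the names, structure, and thresholds below are not the actual llama.cpp dispatch logic (master enables MMQ unconditionally on these GPUs):

```cpp
// Purely illustrative: the batch-size observations above encoded as a helper.
enum class gpu_arch { turing, ampere, ada_lovelace };

static bool prefer_mmq_over_cublas(gpu_arch arch, int batch_size, bool is_q2_K) {
    switch (arch) {
        case gpu_arch::ada_lovelace: // RTX 4090: always faster up to 1024, all but q2_K up to 2048
            return batch_size <= 1024 || (!is_q2_K && batch_size <= 2048);
        case gpu_arch::ampere:       // RTX 3090: faster for everything except q2_K up to 512
            return !is_q2_K && batch_size <= 512;
        default:                     // elsewhere the memory savings usually dominate
            return true;
    }
}
```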
The MMQ code is designed around tensor core instructions that were introduced with Ampere. Hopper can also make use of these instructions, but it has additional tensor core instructions that are, to my knowledge, only found on Hopper and no earlier or later generation. Presumably the cuBLAS code makes use of these instructions; the MMQ code definitely does not. I have never tested the code on or tuned it for Hopper.
The tensor core instructions that MMQ was written around are not available on Turing. However, there are similar tensor core instructions which work on tiles that are exactly half as large, so the same numerical result can be obtained by executing two tensor core instructions instead. I do not own any Turing hardware and have not tuned the code specifically for this architecture.
No one wrote down the exact criteria for when to use cuBLAS vs. MMQ. My general opinion is that if you use anything below q8_0 you are already trading quality for lower memory usage. The hardware that I have been focusing on is the RTX 3090/4090 because those are, in my opinion, the best "cheap" GPUs with 24 GB VRAM; on those, MMQ performs well enough that I think using it unconditionally is the correct choice. On an RTX 2080 Ti with only 11 GB VRAM it's even more important to keep memory usage low, so I decided to enable MMQ unconditionally as well, under the assumption that the tradeoffs would be similar to Ampere/Ada Lovelace. The logic on master for AMD was written by @IMbackK. I don't remember whether he posted the performance numbers upon which his decisions were based in the relevant PRs. In any case, I don't have a good overview of the AMD hardware stack and decided to go with his judgement. For CDNA3 in particular, my understanding is that all GPUs using that architecture have at least 128 GB of memory. For that particular hardware I therefore think that the correct choice is to simply maximize speed.
I have documented the design decisions that seemed unintuitive to me at the time. However, I think it is generally difficult to judge which parts of your own work need to be documented in order to make them understandable to third parties. Since you have already gone through the trouble of understanding the code as-is, I would be grateful if you could add comments in those places where they would have helped your understanding.
Sounds good. Great to hear the reasoning behind these decisions.
The purpose of this is to leverage the 16x16x32 MFMA instruction (16x8 tile) over 32x32x16 (32x4 tile). This gives some performance increase and also fixes nwarps to 8 for all quants. I use this specific 'placeholder' tile to load the same <16, 4> tile twice as a <16, 8> tile, since the current architecture's matrix cores don't support a 16x16x16 instruction. With this compute, the result is basically double the value needed, hence in these cases the scale value is halved in the code. load_tile is set to <64, 2> only because the tile type calculates the number of elements per thread and I need it to stay at 2 (== 64*2/64), since <16, 8> and <32, 4> are already taken. Hence the tag as a special tile used to achieve this.
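For illustration, the element-count arithmetic being described, as a small self-contained sketch; `tile_shape` and `warp_size` are hypothetical names, not the actual MMQ tile template, and the 64-lane wavefront size is assumed for CDNA:

```cpp
// Illustrative sketch of the per-thread element count for a tile<I, J> on a
// 64-lane wavefront: ne = I*J / 64. The <64, 2> "placeholder" keeps ne == 2,
// matching <16, 8> and <32, 4>, which are already taken for other layouts.
template <int I, int J>
struct tile_shape {
    static constexpr int warp_size = 64;                 // CDNA wavefront size (assumption)
    static constexpr int ne        = I * J / warp_size;  // elements held per thread
};

static_assert(tile_shape<16, 8>::ne == 2, "regular 16x8 tile");
static_assert(tile_shape<32, 4>::ne == 2, "regular 32x4 tile");
static_assert(tile_shape<64, 2>::ne == 2, "special placeholder tile for loading <16, 4> as <16, 8>");
```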
If I understand you correctly, you are solving the problem where some of the quantization types have one scale per four 32-bit integers, so the 16x8 tiles are too long and don't yield the correct results. Have you considered loading the data as 16x8 tiles as you do in e.g.
and a corresponding function
Yeah. I was initially going down that same road when I started removing the larger (32x4) tiles. I was going to use a bitmask to simply clear unused threads. But that would require quite a few more changes to the code and to how the loops are written right now. In the interest of saving time and preventing this PR from dangling too long, I just chose the other route to achieve the same with minimal changes. We can revisit this as part of a larger redesign at a later date if you'd like.
I don't think the current solution is good, but I'm willing to approve the PR as-is as long as you promise to refactor the code in the near future.
Needed a manual rebase, but works great, and on multiple AMD Radeon RX 7900 XT cards it resolves the "Page not present or supervisor privilege" GPU crash on K-shift.
This PR should not affect gfx11 in any way; that's also the first I've heard of this problem.
I have tested this PR one more time and am satisfied that there are no further regressions, except the small regression in the dp4a code path in Q4_0 and Q4_1 on gfx9xx, which I find acceptable given the performance benefit in other areas.
@deepsek Given the overall similarity between MFMA and WMMA int8 instructions, is there a plan to enable int8 MMQ for RDNA3/4? It could greatly improve performance, at least on RDNA4, due to its doubled int8 MMA throughput over fp16.
@hjc4869 Yes, there is currently an investigation going on to do exactly that! Assuming everything goes smoothly, you should expect something to drop from the team in the near future. @jiachengjason
To be clear, the WMMA interface as provided by e.g. NVIDIA is useless for MMQ because you don't get a defined data layout and therefore cannot apply the per-block scales correctly unless you go through shared memory (slow). That is why I went to the trouble of writing my own primitives in
WMMA in this case means the gfx11+ WMMA instructions, not rocWMMA; slightly confusing naming.
Hi, before PR merge:
after PR merge:
While bisecting, I also ran into one build where FA is either completely broken, or extremely well optimized, but for now I did not examine it further:
EDIT: Tracked down the FA regression to a86f52b (tag b5973), so that seems unrelated to this PR.
I can reproduce the regression:
I don't at all understand why this is happening, though, since the dp4a code path for a warp size of 32 should be unchanged.
The performance regression should be fixed with #15014.
Good catch! Looks like the last line of the
Added Matrix core support (MFMA instructions) for MMQ kernels.
Enabled stream-K for CDNA3 to work with MMQ kernels.
Removed usage of the hardcoded WARP_SIZE constant in MMQ kernels.
NOTE: Thoughts on removing all uses of hardcoded constants that are specific to NVIDIA (like WARP_SIZE) in order to support other GPUs?
@JohannesGaessler @ggerganov
P.S. I am part of an AMD team actively working on enabling the AMD feature set in llama.cpp. We would like to get on a call to discuss some future PR plans for additional backends, flash attention changes, etc.
EDIT:
Updated to add some performance charts for the DeepSeekV3 model.
Upstream vs ROCm Fork Development

MI300X vs H100 Throughput Test
