Update base for Update on "[ET-VK][qlinear] Faster weight only quantized linear gemv kernel"
## Changes
* Introduce a new compute shader for int4 linear's gemv cases that performs much better than the existing shader. This shader is inspired by MNN's gemv_1x1_conv_buf.cl shader.

With this compute kernel, transformer models' text generation executes much faster than before.
On Samsung Galaxy S24, running Llama 3.2 1B and generating 128 tokens:
* Before: ~25 tok/s
* After: ~49 tok/s
## Why this new shader is faster
The biggest factor is vectorized loading of the uint4 weight buffer. The new shader loads the weight buffer as a buffer/image of `uvec4`, whereas the old shader loads it as a buffer/image of `u8vec4`. Using the Adreno Offline Compiler, I found that the former uses only one load instruction to load from the weight tensor, whereas the latter uses 16 load instructions. It appears that the data loading was not being vectorized at the assembly level; this is potentially behaviour that could be improved in the SPIR-V shader compiler.
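As a rough illustration, here is a minimal GLSL sketch of the two load widths; the buffer names, bindings, and nibble unpacking are assumptions for illustration, not the shipped shader source:

```glsl
#version 450
#extension GL_EXT_shader_8bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types_int8 : require

layout(local_size_x = 64) in;

// Old layout: each u8vec4 load fetches only 4 bytes; per the Adreno
// compiler output, 16 separate load instructions were emitted.
layout(set = 0, binding = 0) readonly buffer WeightsNarrow {
  u8vec4 data[];
} w8;

// New layout: one uvec4 load fetches 16 bytes (32 packed int4 values)
// with a single load instruction.
layout(set = 0, binding = 1) readonly buffer WeightsWide {
  uvec4 data[];
} w32;

void main() {
  uvec4 packed_w = w32.data[gl_GlobalInvocationID.x];
  // Illustrative nibble unpack: low and high int4 of every byte.
  uvec4 lo = packed_w & 0x0F0F0F0Fu;
  uvec4 hi = (packed_w >> 4) & 0x0F0F0F0Fu;
  // ... dequantize with scales/zeros and accumulate gemv partials ...
}
```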
An additional factor is a better weight packing layout: the new prepacking routine results in better memory coalescing between threads in a work group.
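A minimal sketch of the access pattern this enables, with illustrative names and an assumed packing of 32 int4 values per `uvec4`:

```glsl
#version 450
layout(local_size_x = 64) in;

layout(set = 0, binding = 0) readonly buffer WeightsWide {
  uvec4 data[];
} w32;

// Assumed: K input channels packed 32-per-uvec4 along each output row.
layout(constant_id = 0) const uint BLOCKS_PER_ROW = 64; // K / 32

void main() {
  uint row = gl_GlobalInvocationID.y;
  // Consecutive invocations read consecutive uvec4 blocks, so a work
  // group's loads fall on contiguous 16-byte-aligned memory and
  // coalesce into wide memory transactions.
  for (uint b = gl_LocalInvocationID.x; b < BLOCKS_PER_ROW;
       b += gl_WorkGroupSize.x) {
    uvec4 wblk = w32.data[row * BLOCKS_PER_ROW + b];
    // ... unpack, dequantize, accumulate partial dot product ...
  }
}
```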
The final major factor is the use of tree-based reduction to cooperatively reduce partial results into the final output. Previously, a single thread was responsible for the entire final reduction.
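A minimal sketch of the technique, assuming a work group of 64 invocations that each hold a partial gemv sum; this shows the general shared-memory tree reduction, not the exact shipped shader:

```glsl
#version 450
layout(local_size_x = 64) in;

layout(set = 0, binding = 0) buffer Out {
  float data[];
} t_out;

shared float partial_sums[64];

void main() {
  uint lid = gl_LocalInvocationID.x;
  float partial = 0.0; // ... each invocation's accumulated partial ...
  partial_sums[lid] = partial;
  barrier();
  // Halve the active lanes each step: 64 -> 32 -> ... -> 1. Each step
  // runs in parallel, instead of one thread summing all 64 partials.
  for (uint stride = gl_WorkGroupSize.x / 2u; stride > 0u; stride /= 2u) {
    if (lid < stride) {
      partial_sums[lid] += partial_sums[lid + stride];
    }
    barrier();
  }
  if (lid == 0u) {
    t_out.data[gl_WorkGroupID.x] = partial_sums[0];
  }
}
```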
## Future Work
* Introduce a faster shader for int4 linear gemm cases
* Update QCSNW to also use these updated shaders
Differential Revision: [D78275584](https://our.internmc.facebook.com/intern/diff/D78275584/)