Replies: 2 comments
-
@hipudding is there something we can help you with to make it happen?
-
Thank you for your interest in Ascend. If you want to enable quantized formats, I believe q8 (q8_0, q8_1, q8_k_m) and q4 (q4_0, q4_1, q4_k_m) are feasible. It would only require implementing the quantized versions of GGML_OP_GET_ROWS, GGML_OP_MUL_MAT, and GGML_OP_MUL_MAT_ID.
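To make the shape of that change concrete: a ggml backend advertises which (op, tensor type) pairs it can run, and the scheduler falls back to the CPU for the rest. The sketch below is a standalone mock with invented names (`mock_type`, `mock_op`, `backend_supports_op`), not the actual ggml/CANN API; it only illustrates that enabling q4/q8 means returning true for those ops on quantized types and then providing the matching kernels.

```c
#include <stdbool.h>

/* Illustrative mock only; the real enums/functions live in ggml and the
 * CANN backend sources. Names here are invented for this sketch. */
typedef enum {
    MOCK_TYPE_F32,
    MOCK_TYPE_F16,
    MOCK_TYPE_Q4_0,
    MOCK_TYPE_Q4_1,
    MOCK_TYPE_Q8_0,
    MOCK_TYPE_Q8_1,
} mock_type;

typedef enum {
    MOCK_OP_GET_ROWS,
    MOCK_OP_MUL_MAT,
    MOCK_OP_MUL_MAT_ID,
    MOCK_OP_OTHER,
} mock_op;

/* The backend reports support per (op, source type). Adding quantized
 * formats means returning true here for the three ops named above and
 * implementing the corresponding dequantize/matmul kernels. */
static bool backend_supports_op(mock_op op, mock_type src_type) {
    switch (op) {
        case MOCK_OP_GET_ROWS:
        case MOCK_OP_MUL_MAT:
        case MOCK_OP_MUL_MAT_ID:
            switch (src_type) {
                case MOCK_TYPE_F32:
                case MOCK_TYPE_F16:
                case MOCK_TYPE_Q4_0:
                case MOCK_TYPE_Q8_0:
                    return true;   /* kernels implemented */
                default:
                    return false;  /* e.g. q4_1/q8_1 kernels still missing */
            }
        default:
            /* everything else stays fp-only in this sketch */
            return src_type == MOCK_TYPE_F32 || src_type == MOCK_TYPE_F16;
    }
}
```

Ops the backend declines are scheduled elsewhere (typically the CPU backend), which is why a model "runs" even before quantized kernels exist, just slowly.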
-
Ascend NPUs seem to be a great alternative (to a Mac Studio or EPYC build) for running quantized R1.
For example, the Atlas 300I Duo offers 140 TFLOPS FP16, 408 GB/s memory bandwidth, and 96 GB of VRAM.
Two of these cards in a PC could run the quantized 671B R1 relatively well, I would say.
However, as shown in https://github.com/ggerganov/llama.cpp/blob/master/docs/backend/CANN.md, there is no DeepSeek architecture support yet, and low-bit quantization does not seem to be validated yet.
@hipudding Do you have plans to port low-bit quantized R1 to Ascend cards via the gguf-cann backend?
That seems a pretty valid use case to me...