Comparison with Qualcomm AI Hub model #8194
Replies: 4 comments
-
Good question. I found the same issue with Llama 3.1 8B via the ExecuTorch QNN backend: performance is below half of what Qualcomm's AI Hub claims.
-
cc: @cccclai
-
Yeah, we found out the model definition in […]. It's still WIP, so please expect some rough edges if you try it out, or maybe wait until it's more settled.
-
Converting to discussion
-
🐛 Describe the bug
I ran Llama-v3.2-3B-Chat (W4A16 precision) from ai-hub-models on a Snapdragon 8 Gen 3 device and achieved 20 tokens/s.
For comparison, I ran inference with the Llama 3.2 3B model quantized to W4A16 using ExecuTorch with the QNN backend on the same device, and observed 10 tokens/s.
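Both numbers are decode throughput measured the same way. A minimal sketch of the measurement, where `measure_decode_tps` and `step_fn` are hypothetical stand-ins (not APIs from either stack) for timing one autoregressive decode step of whichever runner is under test:

```python
import time
from typing import Callable

def measure_decode_tps(step_fn: Callable[[], int], num_tokens: int = 128) -> float:
    """Decode throughput in tokens/s for a single-token decode step.

    `step_fn` is a hypothetical stand-in for one autoregressive decode
    step of the runner under test (the AI Hub runtime or the ExecuTorch
    QNN llama runner). Run prefill before calling this, so both stacks
    are compared on decode alone.
    """
    start = time.perf_counter()
    for _ in range(num_tokens):
        step_fn()
    elapsed = time.perf_counter() - start
    return num_tokens / elapsed
```

With the numbers above, generating 128 tokens in about 6.4 s corresponds to ~20 tokens/s, and about 12.8 s to ~10 tokens/s.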
Could you provide insight into what might be causing this performance difference? Are there issues with how ExecuTorch handles quantized models that could explain the gap?
Any guidance or suggestions would be greatly appreciated!
cc @cccclai @winskuo-quic @shewu-quic