v0.1.4.2
Local Inference Support
The major update in this release is an upgrade to the ExecuTorch v0.5.0 framework for on-device inference. Notable improvements include:
- Support for KleidiAI blockwise kernels in XNNPACK, giving a 20%+ gain in Llama prefill
- Support for models quantized via torchao’s quantize_ API (see the sketch after this list)
- Stable lowering into XNNPACK
- Features and fixes for the Qualcomm and MediaTek backends (support to come in a future release)
- Bug fixes
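Before lowering, a model can be quantized with torchao’s quantize_ API and then handed to the ExecuTorch export flow. A minimal sketch of the quantization step, assuming a recent torchao where int8_weight_only is one of the available schemes; the toy Sequential model here is purely illustrative:

```python
import torch
from torchao.quantization.quant_api import quantize_, int8_weight_only

# Toy module standing in for a real checkpoint (illustrative only).
model = torch.nn.Sequential(torch.nn.Linear(64, 64)).eval()

# quantize_ rewrites eligible layers in place according to the chosen
# scheme, so the model can be exported as usual afterwards.
quantize_(model, int8_weight_only())
```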
This release remains compatible with models (.pte files) exported with the previous ExecuTorch 0.4 release; a sketch of the export flow follows.
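For reference, exporting a .pte file generally means tracing the PyTorch module, lowering it (here into XNNPACK), and serializing the result. A minimal sketch, assuming the standard torch.export and executorch.exir APIs; TinyModel and the output path are placeholders:

```python
import torch
from torch.export import export
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(32, 32)

    def forward(self, x):
        return self.linear(x)

model = TinyModel().eval()
example_inputs = (torch.randn(1, 32),)

# Trace to an exported program, then lower eligible subgraphs
# to the XNNPACK backend.
program = to_edge_transform_and_lower(
    export(model, example_inputs),
    partitioner=[XnnpackPartitioner()],
).to_executorch()

# Serialize the program; the .pte file is what gets loaded on device.
with open("tiny_model.pte", "wb") as f:
    f.write(program.buffer)
```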
Demo App Location
To help consolidate reference material, we’ve moved the demo apps from llama-stack-apps to llama-stack-client-kotlin.
Contributors
@ashwinb, @cmodi-meta, @dltn, @Riandy, @WuhanMonkey, @yanxi0830