Releases: ggml-org/llama.cpp
b7191
server: fix: /metrics endpoint returning JSON-escaped Prometheus form…
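For context on what the fix addresses: llama-server exposes a /metrics endpoint (enabled with --metrics) that should return plain Prometheus text exposition format rather than a JSON-escaped string. A minimal sketch of scraping it in Python, assuming the default 127.0.0.1:8080 address:

```python
# Minimal sketch: scrape llama-server's Prometheus metrics endpoint.
# Assumes the server was started with `llama-server -m model.gguf --metrics`
# and is listening on the default 127.0.0.1:8080.
from urllib.request import urlopen

with urlopen("http://127.0.0.1:8080/metrics") as resp:
    body = resp.read().decode("utf-8")

# Prometheus text exposition format: "# HELP"/"# TYPE" comment lines followed
# by `metric_name value` lines, not a JSON-escaped string.
for line in body.splitlines():
    if not line.startswith("#"):
        print(line)
```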
b7190
ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in g…
b7189
[MUSA] enable fp16/fast_fp16/bf16_mma on PH1 (#17551)
* [MUSA] enable fp16/fast_fp16/bf16_mma on PH1
* Update ggml/src/ggml-cuda/fattn-vec.cuh
* Update ggml/src/ggml-cuda/fattn-vec.cuh
* Update ggml/src/ggml-cuda/fattn-tile.cuh
* Address review comments
Signed-off-by: Xiaodong Ye
Co-authored-by: Johannes Gäßler
b7188
ggml-cuda: add stricter checking for fusion (#17568)
* ggml-cuda: make conditions for fusion more explicit
* ggml-cuda: remove size check as std::equal already does it
b7187
server : add Anthropic Messages API support (#17570)
* server : add Anthropic Messages API support
* remove [email protected] from tool calling/jinja tests
* server : remove unused code and mark test_anthropic_vision_base64_with_multimodal_model as slow/skip in test_anthropic_api.py
* server : remove redundant n field logic in anthropic_params_from_json
* server : use a single error object instead of error_array in the streaming response handler for /v1/chat/completions, and use unordered_set instead of set in to_json_anthropic_stream()
* server : refactor Anthropic API to use OAI conversion
* make sure the basic test always goes first
* clean up
* clean up api key check, add test
Co-authored-by: Xuan Son Nguyen
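A minimal sketch of calling the new Anthropic-style endpoint from Python. The /v1/messages route, the x-api-key header, and the response shape are assumptions modeled on the upstream Anthropic Messages API; check the server documentation for this release for the exact route and auth behavior:

```python
# Minimal sketch: send an Anthropic Messages-style request to llama-server.
# Endpoint path, headers, and response shape are assumptions based on the
# Anthropic Messages API, not taken from the release notes above.
import json
from urllib.request import Request, urlopen

payload = {
    "model": "any",          # llama-server serves whichever model it loaded
    "max_tokens": 128,
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
}

req = Request(
    "http://127.0.0.1:8080/v1/messages",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "x-api-key": "unused"},
)
with urlopen(req) as resp:
    reply = json.loads(resp.read())

# Anthropic-style responses carry content blocks: [{"type": "text", "text": ...}]
print(reply["content"][0]["text"])
```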
b7186
model : Qwen3 Next (#16095)
* Qwen3 Next - cleaned up version
* Whitespaces and stuff
* Correct minor errors
* Update src/llama-model.cpp
* Misc. fixes.
* Clean up code, add missing hybrid qualifier
* Did someone transpose the SOLVE_TRI result matrix? Perhaps...
* Whitespace
* Proper tensors for cb calls
* Use llama-graph.h vertical alignment
* BROKEN: chunking
* Set new tensors as inputs.
* Proper chunk logic
* It's the circle of life...
* More shenanigans for n_seq > 1
* Nail in the coffin?
* Fix Windows build
* Eh, one fails on Windows, the other fails on Mac... just use general capture.
* quant : cleanup
* model : cleanup
* qwen3 : cleanup
* cont : cleanup
* cont : cleanup
* ggml : revert change
* qwen3 : cleanup
* cont : cleanup
* Readd cmath
* qwen3 : fix typo
* Update convert_hf_to_gguf.py
* Usual suspects
* fix my bad suggestion
Co-authored-by: Sigbjørn Skjæret
Co-authored-by: Georgi Gerganov
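Since the change list above touches convert_hf_to_gguf.py, a local Qwen3 Next checkpoint would typically be converted with that script before loading. A minimal sketch, with a placeholder model directory and the script's standard --outfile/--outtype options:

```python
# Minimal sketch: convert a local Qwen3 Next checkpoint to GGUF with the
# repository's convert_hf_to_gguf.py. The model directory is a placeholder;
# point it at a downloaded Hugging Face checkpoint.
import subprocess

subprocess.run(
    [
        "python", "convert_hf_to_gguf.py",
        "./models/qwen3-next",              # placeholder: local HF checkpoint dir
        "--outfile", "qwen3-next-f16.gguf",
        "--outtype", "f16",
    ],
    check=True,
)
```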
b7185
CUDA: no FP16 arithmetic for vector FA kernel (#17558)
b7184
vulkan: Implement GGML_OP_TRI (#17503)
* vulkan: Implement GGML_OP_TRI
* check types match
b7183
rpc : cache and reuse compute graphs (#15405)
Store the last computed graph and reuse it when possible. Also, do not return a response from GRAPH_COMPUTE and assume it always completes successfully; if it does not, the server closes the connection. This saves us a network round trip to the server.
b7182
HIP: enable mul_mat_f for RDNA4 (#17437)
* enable mmf for rdna4
* move some mmvf to mmf
* revert lds128 for wmma loading
* Revert "revert lds128 for wmma loading" (reverts commit db9ae8b6b4738a5def5b393caa1611d52133e9b5)
* Revert "enable mmf for rdna4" (reverts commit 698c9f24187b990e35c3b73a8067e5387e6ddbd4)
* Revert "move some mmvf to mmf" (reverts commit 99b92bd6653cc8593607f641e44606391691792f)
* enable mul_mat for rdna4
Co-authored-by: zhang hui