Throughputs of Long Sequences #12608 #1985
Unanswered · simmonssong asked this question in Q&A · Replies: 1 comment
-
Interesting catch. Yeah, some infrastructure optimizations in llama.cpp (like tiled KV-cache handling) do boost throughput on long sequences. We ran into this when stress-testing long-form prompts with nested reasoning: throughput went up, but logic fidelity went sideways. Anyway, great question. If you're testing semantic drift under load too, happy to swap notes; some of those failure patterns are... spooky.
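Beyond any specific kernel optimization, part of the effect can be plain amortization: a fixed per-request setup cost gets spread over more tokens as the prompt grows. A toy model of that (the overhead and rate numbers below are hypothetical, not measured from llama.cpp):

```python
# Toy model: fixed per-call overhead (e.g. graph setup, dispatch) amortized
# over more tokens makes measured tokens/sec rise with prompt length.
# OVERHEAD_S and RATE_TOK_S are illustrative assumptions.

OVERHEAD_S = 0.5      # fixed cost per request, in seconds (hypothetical)
RATE_TOK_S = 1000.0   # steady-state prompt-processing rate (hypothetical)

def throughput(n_tokens: int) -> float:
    """Measured tokens/sec for a prompt of n_tokens under the toy model."""
    total_time = OVERHEAD_S + n_tokens / RATE_TOK_S
    return n_tokens / total_time

for n in (128, 512, 2048, 8192):
    print(f"{n:5d} tokens -> {throughput(n):7.1f} tok/s")
```

Measured throughput climbs toward the steady-state rate as the prompt lengthens, which matches the qualitative trend in the question even before any cache-layout effects kick in.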
-
Hi, I am testing the throughput of input sequences of different lengths. I found that throughput increases with sequence length across several models and quantizations. Is this caused by built-in infrastructure optimizations in llama.cpp?
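For comparison, one way to run this kind of sweep is llama.cpp's bundled `llama-bench` tool, which reports prompt-processing throughput per prompt length (the model path below is a placeholder; adjust flags to your build):

```shell
# Benchmark prompt processing at several prompt lengths.
# -p takes a comma-separated list of prompt lengths,
# -n 0 skips token generation, -r 3 averages over 3 repetitions.
# ./models/model.gguf is a placeholder path, not a real file.
llama-bench -m ./models/model.gguf -p 128,512,2048,8192 -n 0 -r 3
```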