Open
Feature request
A new cache class that lets different layers share all or part of the KV cache, to improve cache efficiency.
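To make the request concrete, here is a minimal sketch of what such a class could look like. It is only an illustration under stated assumptions: the name `SharedLayerKVCache`, the `layer_to_group` mapping, and the `update` signature are all hypothetical and are not an existing library API. Plain Python lists stand in for key/value tensors, and the lowest layer index in each group is assumed to be the only layer that writes to the shared store.

```python
from typing import Dict, List, Optional, Tuple

Rows = List[List[float]]  # one inner list per cached token (stand-in for a tensor)


class SharedLayerKVCache:
    """Hypothetical cache in which several layers share one KV store."""

    def __init__(self, layer_to_group: Dict[int, int]) -> None:
        # layer_to_group maps a layer index to the id of its sharing group.
        self.layer_to_group = layer_to_group
        # Assumption: the lowest layer index in each group is the only writer.
        self.producer: Dict[int, int] = {}
        for layer, group in layer_to_group.items():
            self.producer[group] = min(layer, self.producer.get(group, layer))
        self._keys: Dict[int, Rows] = {}
        self._values: Dict[int, Rows] = {}

    def update(
        self,
        layer_idx: int,
        new_keys: Optional[Rows] = None,
        new_values: Optional[Rows] = None,
    ) -> Tuple[Rows, Rows]:
        """Append new states if this layer is its group's producer,
        then return the group's shared keys and values."""
        group = self.layer_to_group[layer_idx]
        keys = self._keys.setdefault(group, [])
        values = self._values.setdefault(group, [])
        is_producer = layer_idx == self.producer[group]
        if is_producer and new_keys is not None and new_values is not None:
            keys.extend(new_keys)
            values.extend(new_values)
        return keys, values

    def num_stored_groups(self) -> int:
        """Number of KV stores actually held in memory."""
        return len(self._keys)


# Four layers, two sharing groups: layers 0-1 share, layers 2-3 share.
cache = SharedLayerKVCache({0: 0, 1: 0, 2: 1, 3: 1})
cache.update(0, [[0.1, 0.2]], [[0.3, 0.4]])  # layer 0 writes one token
keys, values = cache.update(1)               # layer 1 reuses layer 0's entries
```

In this sketch the memory saving comes from storing one KV store per group rather than one per layer. Sharing only part of the cache (for example, only some heads or only the keys) would need a finer-grained mapping than the one assumed here.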
Motivation
Many studies have shown that attention weights are often similar across attention layers, and that KV cache sharing
causes only a small quality degradation while improving throughput (tokens/sec) by 2 to 3 times.
Your contribution
I will try to submit a PR.
zucchini-nlp