🚀 Feature Description and Motivation
In many large language model (LLM) scenarios, especially multi-turn conversations or sessions where the user interacts repeatedly with the same context (e.g. chatbots, agents, assistant-like use cases), it’s critical to efficiently reuse past prompt/history information without repeatedly resending the entire conversation to the model.
Several popular APIs already support explicit context caching or context handles:
- Anthropic Claude’s prompt caching uses cache identifiers to rehydrate previous contexts.
- Google Gemini context caching provides a context_cache_id to continue conversations.
- Moonshot Kimi context caching allows explicit reuse of context handles.
- Volcengine also offers a conversation_id for session reuse.
 
We’d like to introduce an optional context caching interface in AIBrix, so that:
- Clients can pass a conversation/session ID or similar handle when making requests.
- AIBrix can reuse the already-processed KV cache / embedding context for that session, reducing repeated computation.
- The interface would expose:
  - a way to create a new context handle (first request),
  - a way to continue using an existing handle (subsequent requests),
  - a way to explicitly clear / expire handles (or auto-timeout); see the request-flow sketch after this list.
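
As a rough illustration of the request flow, here is a minimal client-side sketch. The endpoint URL and the `context_id` / `clear_context` fields are hypothetical and only mirror the proposal above; they are not an existing AIBrix API.

```python
import requests

# Hypothetical OpenAI-compatible endpoint exposed by the AIBrix gateway.
BASE_URL = "http://aibrix-gateway:8000/v1/chat/completions"

# First request: no handle yet. The server would mint a context handle,
# populate the KV cache for this session, and echo the handle back.
resp = requests.post(BASE_URL, json={
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}],
}).json()
context_id = resp.get("context_id")  # assumed response field

# Subsequent request: pass the handle so the server can rehydrate the
# cached KV state instead of re-prefilling the whole history.
resp = requests.post(BASE_URL, json={
    "model": "my-model",
    "context_id": context_id,
    "messages": [{"role": "user", "content": "A follow-up question."}],
}).json()

# End of session: explicitly release the handle (otherwise it would
# expire via the auto-timeout mentioned above).
requests.post(BASE_URL, json={
    "model": "my-model",
    "context_id": context_id,
    "clear_context": True,
})
```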
 
 
This would likely require:
- Storing partial KV cache (or references) indexed by conversation/session IDs; a minimal registry sketch follows this list.
- Coordinating with AIBrix’s current GPU memory management and eviction mechanisms.
- Ensuring multi-tenant isolation and cleanup on failures.
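
To make the storage side concrete, here is a minimal sketch of a session-indexed registry, assuming entries hold opaque references to engine-owned KV blocks. Class and field names are illustrative; a real implementation would defer eviction to the engine’s existing GPU memory manager rather than drop references on its own.

```python
import time
import threading
from dataclasses import dataclass, field

@dataclass
class ContextEntry:
    tenant_id: str
    kv_blocks: list  # opaque references to engine-owned KV cache blocks
    last_used: float = field(default_factory=time.monotonic)

class ContextRegistry:
    """Maps (tenant_id, context_id) -> KV cache references, with TTL expiry."""

    def __init__(self, ttl_seconds: float = 600.0):
        self._entries: dict[tuple[str, str], ContextEntry] = {}
        self._ttl = ttl_seconds
        self._lock = threading.Lock()

    def put(self, tenant_id: str, context_id: str, kv_blocks: list) -> None:
        with self._lock:
            self._entries[(tenant_id, context_id)] = ContextEntry(tenant_id, kv_blocks)

    def get(self, tenant_id: str, context_id: str):
        # Keying lookups by (tenant_id, context_id) keeps tenants from
        # rehydrating each other's contexts.
        with self._lock:
            entry = self._entries.get((tenant_id, context_id))
            if entry is None:
                return None
            entry.last_used = time.monotonic()
            return entry.kv_blocks

    def clear(self, tenant_id: str, context_id: str) -> None:
        # Explicit release; also the cleanup path on request failure.
        with self._lock:
            self._entries.pop((tenant_id, context_id), None)

    def evict_expired(self) -> None:
        # Auto-timeout: drop handles idle longer than the TTL.
        now = time.monotonic()
        with self._lock:
            stale = [key for key, entry in self._entries.items()
                     if now - entry.last_used > self._ttl]
            for key in stale:
                del self._entries[key]
```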
 
Use Case
- New API fields (e.g. context_id, clear_context).
- Internal engine / scheduler support to associate a context ID with existing KV cache.
- Metrics to track the cache hit/miss rate and the memory usage of stored contexts; see the metrics sketch after this list.
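
For the metrics item, a minimal sketch using prometheus_client; the metric names are illustrative, not AIBrix’s actual naming conventions.

```python
from prometheus_client import Counter, Gauge

# Illustrative metric names, assuming a Prometheus-style metrics pipeline.
CONTEXT_CACHE_HITS = Counter(
    "aibrix_context_cache_hits_total",
    "Requests that reused an existing context handle's KV cache")
CONTEXT_CACHE_MISSES = Counter(
    "aibrix_context_cache_misses_total",
    "Requests whose context handle was missing or expired")
CONTEXT_CACHE_BYTES = Gauge(
    "aibrix_context_cache_bytes",
    "Approximate memory held by stored contexts")

def record_lookup(hit: bool, delta_bytes: int = 0) -> None:
    # Called once per request after the registry lookup resolves.
    (CONTEXT_CACHE_HITS if hit else CONTEXT_CACHE_MISSES).inc()
    if delta_bytes:
        CONTEXT_CACHE_BYTES.inc(delta_bytes)
```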
 
Proposed Solution
No response