Description
Who is this feature for?
Teams with a sub-5 ms latency SLO on fresh events (trading, ride dispatch, gaming, ad tech) who run two or more consumer groups and still want AutoMQ’s stateless S3 cost model.
What problem are they facing today?
Problem statement
Today any consumer fetch that targets offsets already flushed to S3 must perform an object-store GET. S3 p99 latency is 8–12 ms, so:
- Hot paths with sub-5 ms SLO (pricing, HFT, gaming) routinely violate latency budgets.
- When two or more consumer groups read the same tail segment, each pays the full S3 latency and GET cost for the same bytes.
- Rebalances, task migrations, or `seek()` retries trigger “thundering-herd” S3 bursts, causing lag spikes and SlowDown throttling.
- The existing in-memory LogCache/BlockCache typically holds only a few hundred MB (seconds of traffic), so recently flushed segments age out of fast storage almost immediately.
Because brokers are stateless, there is no on-node cache to serve these repeat reads at disk speed; the only choices are to accept the latency or to shorten retention (losing replay).
Why is solving this impactful?
Impact
- Unblocks new use-cases – drops tail-read p99 from 8-12 ms to ≈ 2-4 ms, so latency-critical workloads (pricing, gaming, ad-tech) can stay on AutoMQ instead of migrating to stateful SSD-tiered forks.
- Cuts S3 bill & throttling risk – one cache hit avoids every subsequent GET for that segment, typically halving request cost and smoothing “thundering-herd” spikes during rebalances.
- Preserves AutoMQ’s differentiator – keeps brokers stateless; cache is disposable and off by default, so existing low-cost users aren’t forced to pay for SSDs.
- Small, opt-in change – implemented behind a `cache.type` flag; no impact on the producer ACK flow or durability model.
- Community value – adds a missing performance knob requested by several users, showcasing AutoMQ’s extensibility and encouraging further open-source contributions around storage-layer plugins.
Proposed solution
Proposed solution — “automq-hot-tail-cache” (Phase I)
- Lazy read-through NVMe cache – on the first S3 miss, the broker writes the fetched segment to a bounded NVMe LRU cache; every later read of that segment is served locally (≈ 0.3 ms).
- Pluggable SPI – new interface `SegmentCache` (`lookup`, `put`, `invalidate`, `stats`). Reference implementations: `InMemoryLRUCache` (unit tests / CI) and `NVMeFileCache` (direct I/O).
- CacheManager & eviction loop – size and age caps (`cache.max.size.bytes`, `cache.max.segment.age.ms`) enforced by a background thread; ref-counting protects in-flight reads.
- Zero write-path changes – the producer ACK still happens right after the WAL `fsync`; cache population is post-read, so brokers remain stateless.
- Config & metrics (feature flag OFF by default) – `cache.type=none|nvme` (default `none`) plus sizing knobs. Micrometer metrics: `cache_hit_rate`, `cache_evictions_total`, `cache_size_bytes`, `cache_read_latency_ms{quantile}`.
- Incremental delivery:
  - PR-1 – SPI + in-memory cache + unit tests.
  - PR-2 – `NVMeFileCache`, wiring in `FetchService`, metrics, integration test.
  - A later RFC can add an eager write-through `CacheReplicator` and SPDK optimisations if the lazy mode proves valuable.
This keeps current clusters untouched, gives latency-sensitive users a drop-in speed boost, and preserves AutoMQ’s stateless scaling model.
Additional notes
RFC: Hot-Tier NVMe Cache for AutoMQ Tail Segments (Keeping Brokers Stateless)
1 Motivation
AutoMQ’s shared‑storage architecture achieves massive cost savings by persisting log segments directly to object storage (e.g. Amazon S3) after a small fsync to a local WAL. While this design keeps brokers stateless and eliminates cross‑AZ replication, it introduces a latency gap for tail reads after data scrolls out of the existing in‑memory LogCache/BlockCache (which typically holds only a few hundred MB and seconds of traffic):
- p99 end‑to‑end latency ≈ 8–12 ms for consumers that fetch offsets already flushed to object storage (GET latency + deserialisation).
- Workloads such as financial tick feeds, gaming leaderboards, or micro‑services with sub‑5 ms SLOs cannot tolerate that penalty.
The goal of this RFC is to add an optional hot‑tier NVMe cache that preserves AutoMQ’s statelessness while targeting a tail‑read p99 below 3 ms (stretch goal < 2 ms), subject to empirical benchmarking.
1.1 Why a lazy cache adds value even with “consume‑once” semantics
Real‑world Kafka (and AutoMQ) deployments seldom read each record exactly once at the cluster level. Scenarios that trigger multiple tail reads — where a hot‑tier cache prevents S3 RTT on every replay — include:
Scenario | Why the same bytes are fetched again |
---|---|
Multiple consumer‑groups | ETL, monitoring, ML‑feature pipelines each have their own group ⇒ the newest 1‑2 segments are re‑fetched by N groups within seconds. |
Rebalance / fail‑over | On partition reassignment the new consumer seeks to an older offset to avoid gaps; it re‑reads the tail segments it just saw on the old instance. |
ksqlDB / Streams warm‑up | Stateful tasks replay the tail to rebuild in‑memory windows when they migrate between brokers. |
Retry storms | An exception in processing triggers seek(offset‑1) on a batch of messages → immediate re‑reads of the same segment. |
Analytics side‑loads | Ad‑hoc queries, Spark/Flink batch jobs or ML backfills often rewind a few minutes to gather context. |
Example
Ride-sharing platform: topic `trip-events` (1 KB msgs, 20 k msg/s), consumer groups `pricing`, `fraud`, `driver-incentives`. A broker restart triggers a rebalance → each group re-fetches the latest two log segments (~64 MiB). With the lazy cache:
- The first group incurs the S3 GETs; the segments enter NVMe.
- The second and third groups hit NVMe, avoiding an 8–12 ms S3 round-trip on every fetch of those segments and roughly 800 GET requests during the rebalance.
Even when individual records are processed once per group, the same segment may be read multiple times across groups or recovery events. Caching the segment after the first miss converts all subsequent reads into sub‑millisecond NVMe hits while keeping the write path untouched.
2 Goals & Non‑Goals
Goals | Non‑Goals |
---|---|
* Provide a pluggable, bounded‑size cache for the most recent log segments. | * Re‑introduce follower replicas or ISR quorums. |
* Keep brokers stateless (cache treated as disposable). | * Replace existing WAL mechanism. |
* Maintain exactly‑once semantics and current producer ACK flow. | * Cache optimisation for random historical reads (cold scans). |
3 Design Overview
Lazy read-through (Phase I):

    Producer ──> WAL (fsync, ACK) ──> S3Uploader ──> S3 Object Store
                                                           │
                                       HotTierCache (NVMe) <── populated on first consumer miss

    Consumer Fetch
       │
       ├─> LogCache (hit, ~30 µs)        -- DRAM
       │
       ├─> NVMe cache (hit, ~0.3 ms)     -- new layer
       │
       ├─> WAL tail (if still resident, ~0.5 ms)
       │
       └─> S3 GET (miss, 8–12 ms)
             └─> put() into NVMe

Consumer fetch flow (lazy mode):
1. Check HotTierCache → on a hit, return the bytes.
2. On a miss, read from S3 → return the bytes and populate HotTierCache.
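To make the flow concrete, here is a minimal sketch of the lazy read-through path, assuming a simple lookup/put cache interface and an asynchronous S3 reader; all type and method names are illustrative, not the actual AutoMQ classes.

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Illustrative lazy read-through fetch path. HotTierCache and S3StreamReader stand in
// for the real AutoMQ components; their methods are assumptions made for this sketch.
public final class LazyFetchPath {

    public interface HotTierCache {
        Optional<ByteBuffer> lookup(long baseOffset);
        void put(long baseOffset, ByteBuffer segmentBytes);
    }

    public interface S3StreamReader {
        CompletableFuture<ByteBuffer> fetch(long baseOffset);
    }

    private final HotTierCache cache;
    private final S3StreamReader s3;

    public LazyFetchPath(HotTierCache cache, S3StreamReader s3) {
        this.cache = cache;
        this.s3 = s3;
    }

    /** Step 1: serve from the hot tier if present; step 2: otherwise fetch from S3 and populate the cache. */
    public CompletableFuture<ByteBuffer> fetch(long baseOffset) {
        return cache.lookup(baseOffset)
                .map(CompletableFuture::completedFuture)              // hit: ~0.3 ms local read
                .orElseGet(() -> s3.fetch(baseOffset)                 // miss: 8–12 ms S3 GET
                        .thenApply(bytes -> {
                            cache.put(baseOffset, bytes.duplicate()); // lazy fill for later readers
                            return bytes;
                        }));
    }
}
```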
3.1 Key Components
Component | Responsibility |
---|---|
`SegmentCache` SPI | Simple interface (`lookup`, `put`, `invalidate`, `stats`). |
`InMemoryLRUCache` | Default implementation for unit tests & CI. |
`NVMeFileCache` | Production implementation using direct-I/O mapped files on local NVMe / EBS gp3. |
`CacheManager` | Orchestrates eviction (size-/age-based) and populates the cache on the read path. |
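As a starting point for discussion, the `SegmentCache` SPI could be as small as the sketch below; only the four operation names come from this RFC, while the parameter and return types are assumptions.

```java
import java.nio.ByteBuffer;
import java.util.Optional;

// Sketch of the proposed SegmentCache SPI. Only the four operation names come from
// this RFC; parameter and return types are illustrative and open for discussion.
public interface SegmentCache extends AutoCloseable {

    /** Returns the cached bytes for a segment (or slice), or empty on a miss. */
    Optional<ByteBuffer> lookup(long streamId, long baseOffset);

    /** Stores segment bytes, e.g. right after they were fetched from S3. */
    void put(long streamId, long baseOffset, ByteBuffer segmentBytes);

    /** Drops a segment, e.g. on corruption, topic deletion, or forced eviction. */
    void invalidate(long streamId, long baseOffset);

    /** Hit/miss/eviction counters and the current on-disk size. */
    CacheStats stats();

    /** Hypothetical immutable snapshot of cache statistics. */
    record CacheStats(long hits, long misses, long evictions, long sizeBytes) {}
}
```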
4 Data to Cache
- Complete log segments once they reach `segment.bytes` but before S3 upload.
- Optionally the current active segment tail via a memory-mapped view of the WAL.
- Metadata: base-offset → local-file-path, length, lastAccessNanos.
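For illustration only, the per-segment metadata could be tracked with an index as small as the one below; the class and field names mirror the list above but are assumptions, not a committed schema.

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory index for the metadata listed above: base offset -> cached file.
public final class CacheIndex {

    /** One entry per cached segment: its local file, length, and last access time. */
    public record Entry(Path localFile, long lengthBytes, long lastAccessNanos) {}

    private final Map<Long, Entry> byBaseOffset = new ConcurrentHashMap<>();

    /** Registers a freshly cached segment. */
    public void add(long baseOffset, Path localFile, long lengthBytes) {
        byBaseOffset.put(baseOffset, new Entry(localFile, lengthBytes, System.nanoTime()));
    }

    /** Looks up a segment and refreshes its last-access timestamp for LRU eviction. */
    public Entry touch(long baseOffset) {
        return byBaseOffset.computeIfPresent(baseOffset,
                (offset, e) -> new Entry(e.localFile(), e.lengthBytes(), System.nanoTime()));
    }

    /** Removes a segment after eviction or invalidation. */
    public Entry remove(long baseOffset) {
        return byBaseOffset.remove(baseOffset);
    }
}
```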
5 Interaction with Existing Storage Layer (lazy read‑through first)
- Write path (unchanged): the producer ACKs immediately after the WAL `fsync`; only the existing `S3Uploader` streams segments to object storage. No second copy is created at write time.
- Read path (lazy cache fill):
  - Consumer fetch → `CacheManager.lookup(segmentId)`.
  - Hit ⇒ zero-copy read via `FileChannel.transferTo` (NVMe or memory file).
  - Miss ⇒ delegate to `S3StreamReader`; once bytes are returned, `CacheManager.put()` stores the segment (or segment slice) in the cache asynchronously if `cache.warmup=true`.
- Eviction: a background thread runs every `evict.interval.ms` to enforce `cache.max.size.bytes` and `cache.max.segment.age.ms` (a sketch of this loop follows below).

Future option: a follow-up RFC could introduce an eager write-through (`CacheReplicator`) path that copies segments to NVMe in parallel with the S3 upload. That optimisation is out of scope for Phase I to keep risk and complexity low.
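A minimal sketch of the eviction loop referenced above, assuming a single scheduled thread enforcing `cache.max.size.bytes` and `cache.max.segment.age.ms`; the `CachedSegment` shape and the ref-count handling are simplified assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative background eviction loop enforcing cache.max.size.bytes and
// cache.max.segment.age.ms. The CachedSegment shape and ref-count handling are assumptions.
public final class EvictionLoop implements AutoCloseable {

    record CachedSegment(long baseOffset, long lengthBytes, long lastAccessNanos, int refCount) {}

    private final Map<Long, CachedSegment> segments = new ConcurrentHashMap<>();
    private final AtomicLong totalBytes = new AtomicLong();
    private final long maxSizeBytes;        // cache.max.size.bytes
    private final long maxSegmentAgeNanos;  // cache.max.segment.age.ms, converted to nanos
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    EvictionLoop(long maxSizeBytes, long maxSegmentAgeMs, long evictIntervalMs) {
        this.maxSizeBytes = maxSizeBytes;
        this.maxSegmentAgeNanos = TimeUnit.MILLISECONDS.toNanos(maxSegmentAgeMs);
        scheduler.scheduleAtFixedRate(this::evictOnce, evictIntervalMs, evictIntervalMs,
                TimeUnit.MILLISECONDS);
    }

    private void evictOnce() {
        long now = System.nanoTime();
        // 1. Age-based eviction: drop segments older than the TTL, skipping in-flight reads.
        for (CachedSegment s : segments.values()) {
            if (s.refCount() == 0 && now - s.lastAccessNanos() > maxSegmentAgeNanos) {
                remove(s.baseOffset());
            }
        }
        // 2. Size-based eviction: walk segments oldest-first until back under the size cap.
        List<CachedSegment> oldestFirst = new ArrayList<>(segments.values());
        oldestFirst.sort(Comparator.comparingLong(CachedSegment::lastAccessNanos));
        for (CachedSegment s : oldestFirst) {
            if (totalBytes.get() <= maxSizeBytes) {
                break;
            }
            if (s.refCount() == 0) {
                remove(s.baseOffset());
            }
        }
    }

    private void remove(long baseOffset) {
        CachedSegment removed = segments.remove(baseOffset);
        if (removed != null) {
            totalBytes.addAndGet(-removed.lengthBytes());
            // A real implementation would also delete the backing NVMe file here.
        }
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```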
6 Configuration Parameters (new)
Property | Default | Description |
---|---|---|
`cache.type` | `none` | Global switch; one of `none`, `memory`, `nvme`. |
`cache.max.size.bytes` | `20g` | Soft cap; eviction brings the size ≤ cap. |
`cache.max.segment.age.ms` | `600000` | TTL for a segment in the cache. |
`cache.warmup` | `false` | If `true`, populate the cache on S3 read misses; otherwise only writes populate it. |
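For a concrete feel of the sizing knobs, a hypothetical broker-side configuration enabling the cache might look like the sketch below (expressed as Java `Properties` to stay in one language with the other sketches; the property names are the proposed ones above and remain subject to change).

```java
import java.util.Properties;

// Hypothetical broker-side configuration enabling the hot-tier cache with the
// proposed (not yet final) property names from the table above.
public class CacheConfigExample {
    public static void main(String[] args) {
        Properties brokerConfig = new Properties();
        brokerConfig.setProperty("cache.type", "nvme");                  // default is "none"
        brokerConfig.setProperty("cache.max.size.bytes", "21474836480"); // 20 GiB soft cap
        brokerConfig.setProperty("cache.max.segment.age.ms", "600000");  // 10-minute TTL
        brokerConfig.setProperty("cache.warmup", "true");                // populate on S3 read misses
        brokerConfig.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```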
7 Metrics & Observability
Metric | Type | Purpose |
---|---|---|
`cache_hit_rate` | gauge | Success ratio of cache lookups. |
`cache_evictions_total` | counter | Tracks forced evictions. |
`cache_size_bytes` | gauge | Current bytes on disk. |
`cache_read_latency_ms` (p95/p99) | timer | Compare against S3 read latency. |
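A rough sketch of how these meters could be wired up with Micrometer; the metric names come from the table above, while the stats accessor interface is an assumption made for illustration.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

// Illustrative metric wiring; the stats accessor methods (hitRate(), sizeBytes())
// are assumptions about the eventual cache stats API, not an existing interface.
public final class CacheMetrics {

    private final Counter evictions;
    private final Timer readLatency;

    public CacheMetrics(MeterRegistry registry, CacheStatsView stats) {
        Gauge.builder("cache_hit_rate", stats, CacheStatsView::hitRate)
             .register(registry);
        Gauge.builder("cache_size_bytes", stats, CacheStatsView::sizeBytes)
             .register(registry);
        evictions = registry.counter("cache_evictions_total");
        readLatency = Timer.builder("cache_read_latency_ms")
                           .publishPercentiles(0.95, 0.99) // exported as p95/p99
                           .register(registry);
    }

    public Counter evictions()  { return evictions; }
    public Timer readLatency()  { return readLatency; }

    /** Hypothetical stats view exposed by the cache implementation. */
    public interface CacheStatsView {
        double hitRate();
        double sizeBytes();
    }
}
```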
8 Failure Handling
- Cache corruption ⇒ auto‑invalidate segment & fall back to S3.
- NVMe volume loss ⇒ broker restarts with empty cache (stateless guarantee). No data loss.
- Low-disk-space alert ⇒ force eviction down to 80 % of `cache.max.size.bytes`.
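A small sketch of the corruption fallback described above, assuming per-segment CRC32C checksums; the cache and S3 reader interfaces here are hypothetical minimal views, not existing AutoMQ APIs.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32C;

// Illustrative read-with-fallback: a corrupt cache entry is invalidated and the
// read transparently falls back to S3. The nested interfaces are hypothetical.
public final class SafeSegmentReader {

    public ByteBuffer read(SegmentCacheView cache, S3Reader s3, long baseOffset) throws IOException {
        ByteBuffer cached = cache.readIfPresent(baseOffset);
        if (cached != null && checksumMatches(cached, cache.expectedCrc(baseOffset))) {
            return cached;                        // fast path: healthy NVMe hit
        }
        if (cached != null) {
            cache.invalidate(baseOffset);         // corrupt entry: drop it, keep serving
        }
        ByteBuffer fromS3 = s3.fetch(baseOffset); // authoritative copy in object storage
        cache.put(baseOffset, fromS3.duplicate());
        return fromS3;
    }

    private boolean checksumMatches(ByteBuffer data, long expectedCrc) {
        CRC32C crc = new CRC32C();
        crc.update(data.duplicate());
        return crc.getValue() == expectedCrc;
    }

    /** Hypothetical minimal views of the cache and the S3 reader used by this sketch. */
    public interface SegmentCacheView {
        ByteBuffer readIfPresent(long baseOffset);
        long expectedCrc(long baseOffset);
        void invalidate(long baseOffset);
        void put(long baseOffset, ByteBuffer data);
    }

    public interface S3Reader {
        ByteBuffer fetch(long baseOffset) throws IOException;
    }
}
```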
9 Implementation Plan
- Phase I (this RFC):
  - Introduce the `SegmentCache` SPI + `InMemoryLRUCache`.
  - Implement `NVMeFileCache` using direct-I/O mapped files.
  - Integrate lazy read-through fill logic on the consumer fetch path.
  - Ship behind the feature flag `cache.type=nvme` (defaults to `none`).
- Phase II (future RFC):
  - Optional eager write-through (`CacheReplicator`) pipeline.
  - Advanced eviction heuristics (LFU, adaptive age).
  - NUMA/SPDK optimisations.
- Phase III: empirical benchmarks (JMH, 128 partitions, 1 KB msgs, 10 Ki msgs/s).
- Phase IV: production-grade metrics, docs, Helm/Compose examples.
10 Trade‑offs
Benefit | Drawback |
---|---|
p99 read latency drops from ~10 ms → ~2 ms. | Slight compute overhead (cache lookup, eviction). |
Keeps brokers stateless (cache == disposable). | Adds local‑disk requirement; can’t run on truly disk‑less Fargate. |
Zero data‑copy on rebalance (stateless property intact). | Cache warms up after scale‑in/out; first fetches may still hit S3. |
11 Alternatives Considered
- Increase WAL size to 20 GiB – wastes EBS and does not help segments already flushed.
- Client‑side cache – complicated, per‑consumer, does not help fan‑out.
- Classic tiered storage (SSD + S3) – re‑introduces stateful brokers; violates AutoMQ core principle.
12 Open Questions
- Should we use direct I/O or the page cache for `NVMeFileCache` reads?
- Eviction policy – pure LRU vs LFU? Mixed-age LFUDA?
- Config‑driven S3 prefetch on cache miss – is warm‑up desirable or too costly?
- Any security implications of mapping files outside data directory?
- Benchmarks: What latency target constitutes “success” for merge?
13 References
- AutoMQ documentation: Shared Storage architecture (2024)
- Confluent Kora paper: “Tiered Storage in Cloud‑Native Kafka” (SoCC 2022)
- Linux kernel mailing-list discussion: `O_DIRECT` & `posix_fadvise` best practices
14 Call for Discussion & Next Steps
- API surface – is the `SegmentCache` SPI minimal yet future-proof (e.g., does it support adaptive LFU)?
- Eviction policy – stick with LRU or add an LFU/age hybrid before merge?
- Latency target – agree that p99 < 3 ms is “good enough”, or push for the sub-2 ms stretch goal?
- Default state – should `cache.type` default to `none` (current proposal) or `nvme` in cloud-native charts?
- Benchmark acceptance criteria – what traffic pattern and hit-rate threshold must Phase I demonstrate?
Please leave comments on each point or suggest additional concerns. Once consensus forms, we’ll convert this RFC into Phase I PRs (SPI + in-memory cache) and a follow-up PR for `NVMeFileCache`. The pull-request series is tracked under the milestone “Hot-Tier Cache”.