
[Feature Request] Optional NVMe Hot-Tier Tail Cache (Lazy Read-Through) for Sub-3 ms Fetches #2520

Open
@pintu4146

Description


Who is this feature for?

Teams with a sub-5 ms latency SLO on fresh events (trading, ride-dispatch, gaming, ad-tech) that run two or more consumer groups on the same topics and still want AutoMQ’s stateless S3 cost model.

What problem are they facing today?

Problem statement

Today any consumer fetch that targets offsets already flushed to S3 must perform an object-store GET. S3 p99 latency is 8–12 ms, so:

  • Hot paths with sub-5 ms SLO (pricing, HFT, gaming) routinely violate latency budgets.
  • When two or more consumer groups read the same tail segment, each pays the full S3 cost, multiplying both latency and GET charges.
  • Rebalances, task migrations, or seek() retries trigger “thundering-herd” S3 bursts, causing lag spikes and SlowDown throttling.
  • The existing in‑memory LogCache/BlockCache typically holds only a few hundred MB (seconds of traffic), so tail data quickly scrolls out of DRAM and repeat reads fall through to S3.

Because brokers are stateless, there is no on-node cache to serve these repeat reads at disk speed; the only choices are accept latency or shorten retention (losing replay).

Why is solving this impactful?

Impact

  • Unblocks new use-cases – drops tail-read p99 from 8-12 ms to ≈ 2-4 ms, so latency-critical workloads (pricing, gaming, ad-tech) can stay on AutoMQ instead of migrating to stateful SSD-tiered forks.
  • Cuts S3 bill & throttling risk – a single cache fill avoids every subsequent GET for that segment, typically halving request cost and smoothing “thundering-herd” spikes during rebalances.
  • Preserves AutoMQ’s differentiator – keeps brokers stateless; cache is disposable and off by default, so existing low-cost users aren’t forced to pay for SSDs.
  • Small, opt-in change – implemented behind cache.type flag; no impact on producer ACK flow or durability model.
  • Community value – adds a missing performance knob requested by several users, showcasing AutoMQ’s extensibility and encouraging further open-source contributions around storage-layer plugins.

Proposed solution

Proposed solution — “automq-hot-tail-cache” (Phase I)

  • Lazy read-through NVMe cache

    • On the first S3 miss, broker writes the fetched segment to a bounded NVMe LRU cache; every later read of that segment is served locally (≈ 0.3 ms).
  • Pluggable SPI

    • New interface SegmentCache (lookup, put, invalidate, stats).
    • Reference impls: InMemoryLRUCache (unit-test / CI) and NVMeFileCache (direct-I/O).
  • CacheManager & eviction loop

    • Size + age caps (cache.max.size.bytes, cache.max.segment.age.ms) enforced by a background thread; ref-count protects in-flight reads.
  • Zero write-path changes

    • Producer ACK still happens right after WAL fsync; cache population is post-read, so brokers remain stateless.
  • Config & metrics (feature-flag OFF by default)

    • cache.type=none|memory|nvme (default none) plus sizing knobs.
    • Micrometer metrics: cache_hit_rate, cache_evictions_total, cache_size_bytes, cache_read_latency_ms{quantile}.
  • Incremental delivery

    1. PR-1 — SPI + in-mem cache + unit tests.
    2. PR-2 — NVMeFileCache, wiring in FetchService, metrics, integration test.
    3. Later RFC can add an eager write-through CacheReplicator and SPDK optimisations if the lazy mode proves valuable.

This keeps current clusters untouched, gives latency-sensitive users a drop-in speed boost, and preserves AutoMQ’s stateless scaling model.

Additional notes

RFC: Hot‑Tier NVMe Cache for AutoMQ Tail Segments (Keeping Brokers Stateless)

1  Motivation

AutoMQ’s shared‑storage architecture achieves massive cost savings by persisting log segments directly to object storage (e.g. Amazon S3) after a small fsync to a local WAL. While this design keeps brokers stateless and eliminates cross‑AZ replication, it introduces a latency gap for tail reads after data scrolls out of the existing in‑memory LogCache/BlockCache (which typically holds only a few hundred MB and seconds of traffic):

  • p99 end‑to‑end latency ≈ 8–12 ms for consumers that fetch offsets already flushed to object storage (GET latency + deserialisation).
  • Workloads such as financial tick feeds, gaming leaderboards, or micro‑services with sub‑5 ms SLOs cannot tolerate that penalty.

The goal of this RFC is to add an optional hot‑tier NVMe cache that preserves AutoMQ’s statelessness while targeting a tail‑read p99 below 3 ms (stretch goal < 2 ms), subject to empirical benchmarking.

1.1  Why a lazy cache adds value even with “consume‑once” semantics

Real‑world Kafka (and AutoMQ) deployments seldom read each record exactly once at the cluster level. Scenarios that trigger multiple tail reads — where a hot‑tier cache prevents S3 RTT on every replay — include:

| Scenario | Why the same bytes are fetched again |
| --- | --- |
| Multiple consumer‑groups | ETL, monitoring, and ML‑feature pipelines each have their own group ⇒ the newest 1‑2 segments are re‑fetched by N groups within seconds. |
| Rebalance / fail‑over | On partition reassignment the new consumer seeks to an older offset to avoid gaps; it re‑reads the tail segments it just saw on the old instance. |
| ksqlDB / Streams warm‑up | Stateful tasks replay the tail to rebuild in‑memory windows when they migrate between brokers. |
| Retry storms | An exception in processing triggers seek(offset‑1) on a batch of messages → immediate re‑reads of the same segment. |
| Analytics side‑loads | Ad‑hoc queries, Spark/Flink batch jobs, or ML backfills often rewind a few minutes to gather context. |

Example

Ride‑sharing platform: topic trip-events (1 KB msgs, 20 k msg/s). Consumer‑groups: pricing, fraud, driver‑incentives. A broker restart triggers a rebalance → each group re‑fetches the latest two log segments (~64 MiB). With lazy cache:

  • First group incurs the S3 GET; segments enter NVMe.
  • Second and third groups hit NVMe ⇒ each fetch saves an ≈ 8 ms S3 round trip; at 20 k msg/s with ~1,000‑message fetch batches that is ≈ 320 ms of cumulative fetch latency saved per second across the two groups, plus roughly 800 GET requests avoided.

Even when individual records are processed once per group, the same segment may be read multiple times across groups or recovery events. Caching the segment after the first miss converts all subsequent reads into sub‑millisecond NVMe hits while keeping the write path untouched.

2  Goals & Non‑Goals

| Goals | Non‑Goals |
| --- | --- |
| Provide a pluggable, bounded‑size cache for the most recent log segments. | Re‑introduce follower replicas or ISR quorums. |
| Keep brokers stateless (cache treated as disposable). | Replace the existing WAL mechanism. |
| Maintain exactly‑once semantics and the current producer ACK flow. | Optimise the cache for random historical reads (cold scans). |

3 Design Overview


# Lazy read‑through (Phase I)
Producer ──> WAL (fsync, ACK) ──> S3Uploader ──> S3 Object Store
                   │
              
           HotTierCache (NVMe) <─ populated on first consumer miss


Consumer Fetch
   │
   ├─> LogCache      (hit? 30 µs)          -- DRAM
   │
   ├─> NVMe Cache    (hit? 0.3 ms)         -- new layer
   │
   ├─> WAL tail      (if still resident)    0.5 ms
   │
   └─> S3 GET        (miss)                8–12 ms
           └─> put() into NVMe



Consumer Fetch Flow (lazy mode), sketched in the Java snippet below:
1. Check HotTierCache --> hit --> return bytes
2. else read from S3 --> return bytes to the consumer and populate HotTierCache asynchronously
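
A minimal Java sketch of this lazy read‑through decision, assuming hypothetical HotTierCache and S3SegmentReader interfaces (names mirror the diagram above; the real AutoMQ fetch path and types will differ):

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical shapes for illustration only; names follow the diagram, not existing AutoMQ classes.
interface HotTierCache {
    Optional<ByteBuffer> lookup(long segmentBaseOffset);
    void put(long segmentBaseOffset, ByteBuffer segmentBytes);
}

interface S3SegmentReader {
    CompletableFuture<ByteBuffer> read(long segmentBaseOffset);
}

final class LazyReadThroughFetcher {
    private final HotTierCache cache;
    private final S3SegmentReader s3;
    private final Executor cacheFillExecutor; // cache population stays off the fetch hot path
    private final boolean warmup;             // mirrors the proposed cache.warmup flag

    LazyReadThroughFetcher(HotTierCache cache, S3SegmentReader s3,
                           Executor cacheFillExecutor, boolean warmup) {
        this.cache = cache;
        this.s3 = s3;
        this.cacheFillExecutor = cacheFillExecutor;
        this.warmup = warmup;
    }

    CompletableFuture<ByteBuffer> fetch(long segmentBaseOffset) {
        // 1. Cache hit: serve locally (DRAM or NVMe), no S3 round trip.
        Optional<ByteBuffer> hit = cache.lookup(segmentBaseOffset);
        if (hit.isPresent()) {
            return CompletableFuture.completedFuture(hit.get());
        }
        // 2. Miss: delegate to S3, return bytes immediately, populate the cache asynchronously.
        return s3.read(segmentBaseOffset).thenApply(bytes -> {
            if (warmup) {
                ByteBuffer copy = bytes.duplicate();
                cacheFillExecutor.execute(() -> cache.put(segmentBaseOffset, copy));
            }
            return bytes;
        });
    }
}
```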

3.1 Key Components

| Component | Responsibility |
| --- | --- |
| SegmentCache SPI | Simple interface (lookup, put, invalidate, stats). |
| InMemoryLRUCache | Default implementation for unit tests & CI. |
| NVMeFileCache | Production implementation using direct‑I/O mapped files on local NVMe/EBS gp3. |
| CacheManager | Orchestrates eviction (size‑ and age‑based) and populates the cache on the read path. |
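
One possible shape for the SegmentCache SPI together with the test‑only InMemoryLRUCache, offered as a discussion starting point. The method names come from the table above, while the ByteBuffer‑based signatures and the CacheStats record are assumptions; a production SPI would more likely hand out file regions for zero‑copy transfer:

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative SPI sketch; real signatures to be settled during review.
interface SegmentCache {
    Optional<ByteBuffer> lookup(long segmentBaseOffset);
    void put(long segmentBaseOffset, ByteBuffer segmentBytes);
    void invalidate(long segmentBaseOffset);
    CacheStats stats();

    record CacheStats(long hits, long misses, long evictions, long sizeBytes) {}
}

// Minimal LRU reference implementation for unit tests / CI, bounded by total bytes.
final class InMemoryLRUCache implements SegmentCache {
    private final long maxSizeBytes;
    private long sizeBytes;
    private long hits, misses, evictions;
    private final LinkedHashMap<Long, ByteBuffer> entries =
            new LinkedHashMap<>(16, 0.75f, true); // access order => iteration starts at the LRU entry

    InMemoryLRUCache(long maxSizeBytes) {
        this.maxSizeBytes = maxSizeBytes;
    }

    @Override
    public synchronized Optional<ByteBuffer> lookup(long baseOffset) {
        ByteBuffer buf = entries.get(baseOffset);
        if (buf == null) { misses++; return Optional.empty(); }
        hits++;
        return Optional.of(buf.duplicate());
    }

    @Override
    public synchronized void put(long baseOffset, ByteBuffer bytes) {
        ByteBuffer copy = bytes.duplicate();
        ByteBuffer previous = entries.put(baseOffset, copy);
        if (previous != null) sizeBytes -= previous.remaining();
        sizeBytes += copy.remaining();
        // Evict least-recently-used entries until we are back under the soft cap.
        var it = entries.entrySet().iterator();
        while (sizeBytes > maxSizeBytes && it.hasNext()) {
            Map.Entry<Long, ByteBuffer> eldest = it.next();
            if (eldest.getKey().equals(baseOffset)) continue; // never evict the entry just added
            sizeBytes -= eldest.getValue().remaining();
            it.remove();
            evictions++;
        }
    }

    @Override
    public synchronized void invalidate(long baseOffset) {
        ByteBuffer removed = entries.remove(baseOffset);
        if (removed != null) sizeBytes -= removed.remaining();
    }

    @Override
    public synchronized CacheStats stats() {
        return new CacheStats(hits, misses, evictions, sizeBytes);
    }
}
```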

4 Data to Cache

  • Complete log segments, fetched back from object storage on the first consumer miss (Phase I); populating the cache before or alongside S3 upload is the eager write‑through path deferred to a later RFC.
  • Optionally the current active segment tail via a memory‑mapped view of the WAL.
  • Metadata: base‑offset → local‑file‑path, length, lastAccessNanos.
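
As a concrete illustration of that metadata, the CacheManager might track one small record per cached segment (names are hypothetical):

```java
import java.nio.file.Path;

// Hypothetical per-segment bookkeeping: base offset -> local file, length, last access time.
record CachedSegment(long baseOffset, Path localFile, long lengthBytes, long lastAccessNanos) {}
```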

5  Interaction with Existing Storage Layer (lazy read‑through first)

  • Write path (unchanged). Producer ACKs immediately after WAL fsync; only the existing S3Uploader streams segments to object storage. No second copy is created at write time.

  • Read path (lazy cache fill):

    1. Consumer fetch → CacheManager.lookup(segmentId).
    2. Hit ⇒ zero‑copy read via FileChannel.transferTo (NVMe or memory file).
    3. Miss ⇒ delegate to S3StreamReader; once bytes are returned, CacheManager.put() stores the segment (or segment slice) in the cache asynchronously if cache.warmup=true.
  • Eviction: a background thread runs every evict.interval.ms to enforce cache.max.size.bytes and cache.max.segment.age.ms (a sketch follows below).
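
A sketch of what that eviction loop could look like, reusing the hypothetical CachedSegment record from section 4; the ref‑count protection for in‑flight reads mentioned in the proposal is elided for brevity:

```java
import java.nio.file.Files;
import java.util.Comparator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative eviction loop enforcing the size and age caps proposed in section 6.
final class CacheEvictor {
    private final ConcurrentHashMap<Long, CachedSegment> index; // baseOffset -> metadata
    private final long maxSizeBytes;      // cache.max.size.bytes
    private final long maxAgeMs;          // cache.max.segment.age.ms
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    CacheEvictor(ConcurrentHashMap<Long, CachedSegment> index, long maxSizeBytes, long maxAgeMs) {
        this.index = index;
        this.maxSizeBytes = maxSizeBytes;
        this.maxAgeMs = maxAgeMs;
    }

    void start(long evictIntervalMs) {
        scheduler.scheduleAtFixedRate(this::evictOnce, evictIntervalMs, evictIntervalMs, TimeUnit.MILLISECONDS);
    }

    private void evictOnce() {
        long now = System.nanoTime();
        // Age-based eviction first: drop anything older than the TTL.
        index.values().removeIf(seg ->
                TimeUnit.NANOSECONDS.toMillis(now - seg.lastAccessNanos()) > maxAgeMs && delete(seg));

        // Size-based eviction: remove least-recently-accessed segments until under the soft cap.
        long total = index.values().stream().mapToLong(CachedSegment::lengthBytes).sum();
        var byAge = index.values().stream()
                .sorted(Comparator.comparingLong(CachedSegment::lastAccessNanos))
                .iterator();
        while (total > maxSizeBytes && byAge.hasNext()) {
            CachedSegment victim = byAge.next();
            if (delete(victim)) {
                index.remove(victim.baseOffset());
                total -= victim.lengthBytes();
            }
        }
    }

    private boolean delete(CachedSegment seg) {
        try {
            Files.deleteIfExists(seg.localFile());
            return true;
        } catch (Exception e) {
            return false; // leave the entry; the next cycle (or an explicit invalidation) will retry
        }
    }
}
```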

Future option: A follow‑up RFC could introduce an eager write‑through (CacheReplicator) path that copies segments to NVMe in parallel with the S3 upload. That optimisation is out of scope for Phase I to keep risk and complexity low.

6  Configuration Parameters (new)

| Property | Default | Description |
| --- | --- | --- |
| cache.type | none | Global switch; one of none, memory, nvme. |
| cache.max.size.bytes | 20g | Soft cap; eviction brings the cache size back under the cap. |
| cache.max.segment.age.ms | 600000 | TTL for a segment in the cache. |
| cache.warmup | false | If true, populate the cache asynchronously after an S3 miss; if false, only explicit writes populate it. |
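
For instance, a latency-sensitive cluster could opt in with broker overrides along these lines (property names as proposed above; values are illustrative):

```properties
# Opt-in hot-tier tail cache (proposed settings; the feature is off by default)
cache.type=nvme
cache.max.size.bytes=20g
cache.max.segment.age.ms=600000
cache.warmup=true
```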

7  Metrics & Observability

| Metric | Type | Purpose |
| --- | --- | --- |
| cache_hit_rate | gauge | Success ratio of cache lookups. |
| cache_evictions_total | counter | Tracks forced evictions. |
| cache_size_bytes | gauge | Current bytes on disk. |
| cache_read_latency_ms (p95/p99) | timer | Compare against S3 read latency. |
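
Since Micrometer is already named in the proposal, registration could look roughly like the following; the SegmentCache/CacheStats types refer to the SPI sketch in section 3.1 and are assumptions, not existing AutoMQ code:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

final class CacheMetrics {
    private final Counter evictions;
    private final Timer readLatency;

    CacheMetrics(MeterRegistry registry, SegmentCache cache) {
        // Hit rate and size are read lazily from the cache's stats snapshot.
        Gauge.builder("cache_hit_rate", cache, c -> {
            SegmentCache.CacheStats s = c.stats();
            long lookups = s.hits() + s.misses();
            return lookups == 0 ? 0.0 : (double) s.hits() / lookups;
        }).register(registry);
        Gauge.builder("cache_size_bytes", cache, c -> (double) c.stats().sizeBytes()).register(registry);

        evictions = Counter.builder("cache_evictions_total").register(registry);
        readLatency = Timer.builder("cache_read_latency_ms")
                .publishPercentiles(0.95, 0.99)   // exposes p95/p99 for comparison against S3 reads
                .register(registry);
    }

    Counter evictions() { return evictions; }
    Timer readLatency() { return readLatency; }
}
```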

8  Failure Handling

  • Cache corruption ⇒ auto‑invalidate segment & fall back to S3.
  • NVMe volume loss ⇒ broker restarts with empty cache (stateless guarantee). No data loss.
  • Low‑disk‑space alert ⇒ force eviction down to 80 % of cache.max.size.bytes.
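
The corruption fallback could be as simple as a checksum check that invalidates the entry and falls back to S3; the SegmentCache and S3SegmentReader interfaces are the hypothetical ones sketched earlier, and the CRC32C scheme here is only a placeholder:

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.zip.CRC32C;

// Illustrative "cache corruption => invalidate and fall back to S3" path.
final class SafeCacheReader {
    private final SegmentCache cache;
    private final S3SegmentReader s3; // stands in for the existing S3StreamReader path

    SafeCacheReader(SegmentCache cache, S3SegmentReader s3) {
        this.cache = cache;
        this.s3 = s3;
    }

    CompletableFuture<ByteBuffer> read(long baseOffset, long expectedCrc32c) {
        Optional<ByteBuffer> cached = cache.lookup(baseOffset);
        if (cached.isPresent() && crc32c(cached.get()) == expectedCrc32c) {
            return CompletableFuture.completedFuture(cached.get());
        }
        if (cached.isPresent()) {
            cache.invalidate(baseOffset); // checksum mismatch: treat as corruption, drop the entry
        }
        return s3.read(baseOffset);       // stateless guarantee: S3 remains the source of truth
    }

    private static long crc32c(ByteBuffer buf) {
        CRC32C crc = new CRC32C();
        crc.update(buf.duplicate());      // duplicate so the caller's position is untouched
        return crc.getValue();
    }
}
```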

9  Implementation Plan

  1. Phase I (this RFC):

    • Introduce SegmentCache SPI + InMemoryLRUCache.
    • Implement NVMeFileCache using direct‑I/O mapped files.
    • Integrate lazy read‑through fill logic on the consumer fetch path.
    • Ship behind feature flag cache.type=nvme (defaults to none).
  2. Phase II (future RFC):

    • Optional eager write‑through (CacheReplicator) pipeline.
    • Advanced eviction heuristics (LFU, adaptive age).
    • NUMA/SPDK optimisations.
  3. Phase III: Empirical benchmarks (JMH, 128 partitions, 1 KB msgs, 10 k msgs/s).

  4. Phase IV: Production‑grade metrics, docs, Helm/Compose examples.

10  Trade‑offs

| Benefit | Drawback |
| --- | --- |
| p99 read latency drops from ~10 ms to ~2 ms. | Slight compute overhead (cache lookup, eviction). |
| Keeps brokers stateless (cache is disposable). | Adds a local‑disk requirement; cannot run on truly disk‑less Fargate. |
| Zero data copy on rebalance (stateless property intact). | Cache warms up after scale‑in/out; first fetches may still hit S3. |

11  Alternatives Considered

  1. Increase WAL size to 20 GiB – wastes EBS and does not help segments already flushed.
  2. Client‑side cache – complicated, per‑consumer, does not help fan‑out.
  3. Classic tiered storage (SSD + S3) – re‑introduces stateful brokers; violates AutoMQ core principle.

12  Open Questions

  1. Should we use Direct‑I/O or page‑cache for NVMeFileCache reads?
  2. Eviction policy – pure LRU vs LFU? Mixed‑age LFUDA?
  3. Config‑driven S3 prefetch on cache miss – is warm‑up desirable or too costly?
  4. Any security implications of mapping files outside data directory?
  5. Benchmarks: What latency target constitutes “success” for merge?

13  References

14  Call for Discussion & Next Steps

  1. API surface — is the SegmentCache SPI minimal yet future‑proof (e.g., supports adaptive LFU)?
  2. Eviction policy — stick with LRU or add LFU/age hybrid before merge?
  3. Latency target — agree that p99 < 3 ms is “good enough,” or push for sub‑2 ms stretch goal?
  4. Default state — should cache.type default to none (current proposal) or nvme in cloud‑native charts?
  5. Benchmark acceptance criteria — what traffic pattern & hit‑rate threshold must Phase I demonstrate?

Please leave comments on each point or suggest additional concerns. Once consensus forms, we’ll convert this RFC into Phase I PRs (SPI + in‑mem cache) and a follow‑up PR for NVMeFileCache. Pull‑request series tracked under milestone “Hot‑Tier Cache”.
