
[Feature Request] Optional NVMe Hot-Tier Tail Cache (Lazy Read-Through) for Sub-3 ms Fetches #2520

Open
@pintu4146

Description


Who is this feature for?

Teams with a sub-5 ms latency SLO on fresh events (trading, ride-dispatch, gaming, ad-tech) that run two or more consumer groups on the same topics and still want AutoMQ’s stateless S3 cost model.

What problem are they facing today?

Problem statement

Today any consumer fetch that targets offsets already flushed to S3 must perform an object-store GET. S3 p99 latency is 8–12 ms, so:

  • Hot paths with sub-5 ms SLO (pricing, HFT, gaming) routinely violate latency budgets.
  • When two or more consumer groups read the same tail segment, each pays the full S3 cost, multiplying both latency and GET charges.
  • Rebalances, task migrations, or seek() retries trigger “thundering-herd” S3 bursts, causing lag spikes and SlowDown throttling.
  • The existing in‑memory LogCache/BlockCache typically holds only a few hundred MB (seconds of traffic), so tail data quickly scrolls out of DRAM and repeat reads fall through to S3.

Because brokers are stateless, there is no on-node cache to serve these repeat reads at disk speed; the only choices are accept latency or shorten retention (losing replay).

Why is solving this impactful?

Impact

  • Unblocks new use-cases – drops tail-read p99 from 8-12 ms to ≈ 2-4 ms, so latency-critical workloads (pricing, gaming, ad-tech) can stay on AutoMQ instead of migrating to stateful SSD-tiered forks.
  • Cuts S3 bill & throttling risk – a single cache fill avoids every subsequent GET for that segment, typically halving request cost and smoothing “thundering-herd” spikes during rebalances.
  • Preserves AutoMQ’s differentiator – keeps brokers stateless; cache is disposable and off by default, so existing low-cost users aren’t forced to pay for SSDs.
  • Small, opt-in change – implemented behind cache.type flag; no impact on producer ACK flow or durability model.
  • Community value – adds a missing performance knob requested by several users, showcasing AutoMQ’s extensibility and encouraging further open-source contributions around storage-layer plugins.

Proposed solution

Proposed solution — “automq-hot-tail-cache” (Phase I)

  • Lazy read-through NVMe cache

    • On the first S3 miss, broker writes the fetched segment to a bounded NVMe LRU cache; every later read of that segment is served locally (≈ 0.3 ms).
  • Pluggable SPI

    • New interface SegmentCache (lookup, put, invalidate, stats).
    • Reference impls: InMemoryLRUCache (unit-test / CI) and NVMeFileCache (direct-I/O).
  • CacheManager & eviction loop

    • Size + age caps (cache.max.size.bytes, cache.max.segment.age.ms) enforced by a background thread; ref-count protects in-flight reads.
  • Zero write-path changes

    • Producer ACK still happens right after WAL fsync; cache population is post-read, so brokers remain stateless.
  • Config & metrics (feature-flag OFF by default)

    • cache.type=none|memory|nvme (default none) plus sizing knobs.
    • Micrometer metrics: cache_hit_rate, cache_evictions_total, cache_size_bytes, cache_read_latency_ms{quantile}.
  • Incremental delivery

    1. PR-1 — SPI + in-mem cache + unit tests.
    2. PR-2 — NVMeFileCache, wiring in FetchService, metrics, integration test.
    3. Later RFC can add an eager write-through CacheReplicator and SPDK optimisations if the lazy mode proves valuable.

This keeps current clusters untouched, gives latency-sensitive users a drop-in speed boost, and preserves AutoMQ’s stateless scaling model.

Additional notes

RFC: Hot‑Tier NVMe Cache for AutoMQ Tail Segments (Keeping Brokers Stateless)

1  Motivation

AutoMQ’s shared‑storage architecture achieves massive cost savings by persisting log segments directly to object storage (e.g. Amazon S3) after a small fsync to a local WAL. While this design keeps brokers stateless and eliminates cross‑AZ replication, it introduces a latency gap for tail reads after data scrolls out of the existing in‑memory LogCache/BlockCache (which typically holds only a few hundred MB and seconds of traffic):

  • p99 end‑to‑end latency ≈ 8–12 ms for consumers that fetch offsets already flushed to object storage (GET latency + deserialisation).
  • Workloads such as financial tick feeds, gaming leaderboards, or micro‑services with sub‑5 ms SLOs cannot tolerate that penalty.

The goal of this RFC is to add an optional hot‑tier NVMe cache that preserves AutoMQ’s statelessness while targeting a tail‑read p99 below 3 ms (stretch goal < 2 ms), subject to empirical benchmarking.

1.1  Why a lazy cache adds value even with “consume‑once” semantics

Real‑world Kafka (and AutoMQ) deployments seldom read each record exactly once at the cluster level. Scenarios that trigger multiple tail reads — where a hot‑tier cache prevents S3 RTT on every replay — include:

| Scenario | Why the same bytes are fetched again |
| --- | --- |
| Multiple consumer‑groups | ETL, monitoring, and ML‑feature pipelines each have their own group ⇒ the newest 1‑2 segments are re‑fetched by N groups within seconds. |
| Rebalance / fail‑over | On partition reassignment the new consumer seeks to an older offset to avoid gaps; it re‑reads the tail segments it just saw on the old instance. |
| ksqlDB / Streams warm‑up | Stateful tasks replay the tail to rebuild in‑memory windows when they migrate between brokers. |
| Retry storms | An exception in processing triggers seek(offset‑1) on a batch of messages → immediate re‑reads of the same segment. |
| Analytics side‑loads | Ad‑hoc queries, Spark/Flink batch jobs, or ML backfills often rewind a few minutes to gather context. |

Example

Ride‑sharing platform: topic trip-events (1 KB msgs, 20 k msg/s). Consumer‑groups: pricing, fraud, driver‑incentives. A broker restart triggers a rebalance → each group re‑fetches the latest two log segments (~64 MiB). With lazy cache:

  • First group incurs the S3 GET; segments enter NVMe.
  • Second and third groups hit NVMe ⇒ each fetch saves an ≈ 8 ms S3 round trip; at 20 k msg/s with ~1,000‑message fetch batches that is ≈ 320 ms of cumulative fetch latency saved per second across the two groups, plus roughly 800 GET requests avoided.

Even when individual records are processed once per group, the same segment may be read multiple times across groups or recovery events. Caching the segment after the first miss converts all subsequent reads into sub‑millisecond NVMe hits while keeping the write path untouched.

2  Goals & Non‑Goals

| Goals | Non‑Goals |
| --- | --- |
| Provide a pluggable, bounded‑size cache for the most recent log segments. | Re‑introduce follower replicas or ISR quorums. |
| Keep brokers stateless (cache treated as disposable). | Replace the existing WAL mechanism. |
| Maintain exactly‑once semantics and the current producer ACK flow. | Optimise the cache for random historical reads (cold scans). |

3 Design Overview


# Lazy read‑through (Phase I)
Producer ──> WAL (fsync, ACK) ──> S3Uploader ──> S3 Object Store
                   │
              
           HotTierCache (NVMe) <─ populated on first consumer miss


Consumer Fetch
   │
   ├─> LogCache      (hit? 30 µs)          -- DRAM
   │
   ├─> NVMe Cache    (hit? 0.3 ms)         -- new layer
   │
   ├─> WAL tail      (if still resident)    0.5 ms
   │
   └─> S3 GET        (miss)                8–12 ms
           └─> put() into NVMe



Consumer Fetch Flow (lazy mode), sketched in the Java snippet below:
1. Check HotTierCache --> hit --> return bytes
2. else read from S3 --> return bytes to the consumer and populate HotTierCache asynchronously
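
A minimal Java sketch of this lazy read‑through decision, assuming hypothetical HotTierCache and S3SegmentReader interfaces (names mirror the diagram above; the real AutoMQ fetch path and types will differ):

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Hypothetical shapes for illustration only; names follow the diagram, not existing AutoMQ classes.
interface HotTierCache {
    Optional<ByteBuffer> lookup(long segmentBaseOffset);
    void put(long segmentBaseOffset, ByteBuffer segmentBytes);
}

interface S3SegmentReader {
    CompletableFuture<ByteBuffer> read(long segmentBaseOffset);
}

final class LazyReadThroughFetcher {
    private final HotTierCache cache;
    private final S3SegmentReader s3;
    private final Executor cacheFillExecutor; // cache population stays off the fetch hot path
    private final boolean warmup;             // mirrors the proposed cache.warmup flag

    LazyReadThroughFetcher(HotTierCache cache, S3SegmentReader s3,
                           Executor cacheFillExecutor, boolean warmup) {
        this.cache = cache;
        this.s3 = s3;
        this.cacheFillExecutor = cacheFillExecutor;
        this.warmup = warmup;
    }

    CompletableFuture<ByteBuffer> fetch(long segmentBaseOffset) {
        // 1. Cache hit: serve locally (DRAM or NVMe), no S3 round trip.
        Optional<ByteBuffer> hit = cache.lookup(segmentBaseOffset);
        if (hit.isPresent()) {
            return CompletableFuture.completedFuture(hit.get());
        }
        // 2. Miss: delegate to S3, return bytes immediately, populate the cache asynchronously.
        return s3.read(segmentBaseOffset).thenApply(bytes -> {
            if (warmup) {
                ByteBuffer copy = bytes.duplicate();
                cacheFillExecutor.execute(() -> cache.put(segmentBaseOffset, copy));
            }
            return bytes;
        });
    }
}
```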

3.1 Key Components

| Component | Responsibility |
| --- | --- |
| SegmentCache SPI | Simple interface (lookup, put, invalidate, stats). |
| InMemoryLRUCache | Default implementation for unit tests & CI. |
| NVMeFileCache | Production implementation using direct‑I/O mapped files on local NVMe/EBS gp3. |
| CacheManager | Orchestrates eviction (size‑ and age‑based) and populates the cache on the read path. |
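
One possible shape for the SegmentCache SPI together with the test‑only InMemoryLRUCache, offered as a discussion starting point. The method names come from the table above, while the ByteBuffer‑based signatures and the CacheStats record are assumptions; a production SPI would more likely hand out file regions for zero‑copy transfer:

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative SPI sketch; real signatures to be settled during review.
interface SegmentCache {
    Optional<ByteBuffer> lookup(long segmentBaseOffset);
    void put(long segmentBaseOffset, ByteBuffer segmentBytes);
    void invalidate(long segmentBaseOffset);
    CacheStats stats();

    record CacheStats(long hits, long misses, long evictions, long sizeBytes) {}
}

// Minimal LRU reference implementation for unit tests / CI, bounded by total bytes.
final class InMemoryLRUCache implements SegmentCache {
    private final long maxSizeBytes;
    private long sizeBytes;
    private long hits, misses, evictions;
    private final LinkedHashMap<Long, ByteBuffer> entries =
            new LinkedHashMap<>(16, 0.75f, true); // access order => iteration starts at the LRU entry

    InMemoryLRUCache(long maxSizeBytes) {
        this.maxSizeBytes = maxSizeBytes;
    }

    @Override
    public synchronized Optional<ByteBuffer> lookup(long baseOffset) {
        ByteBuffer buf = entries.get(baseOffset);
        if (buf == null) { misses++; return Optional.empty(); }
        hits++;
        return Optional.of(buf.duplicate());
    }

    @Override
    public synchronized void put(long baseOffset, ByteBuffer bytes) {
        ByteBuffer copy = bytes.duplicate();
        ByteBuffer previous = entries.put(baseOffset, copy);
        if (previous != null) sizeBytes -= previous.remaining();
        sizeBytes += copy.remaining();
        // Evict least-recently-used entries until we are back under the soft cap.
        var it = entries.entrySet().iterator();
        while (sizeBytes > maxSizeBytes && it.hasNext()) {
            Map.Entry<Long, ByteBuffer> eldest = it.next();
            if (eldest.getKey().equals(baseOffset)) continue; // never evict the entry just added
            sizeBytes -= eldest.getValue().remaining();
            it.remove();
            evictions++;
        }
    }

    @Override
    public synchronized void invalidate(long baseOffset) {
        ByteBuffer removed = entries.remove(baseOffset);
        if (removed != null) sizeBytes -= removed.remaining();
    }

    @Override
    public synchronized CacheStats stats() {
        return new CacheStats(hits, misses, evictions, sizeBytes);
    }
}
```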

4 Data to Cache

  • Complete log segments, fetched back from object storage on the first consumer miss (Phase I); populating the cache before or alongside S3 upload is the eager write‑through path deferred to a later RFC.
  • Optionally the current active segment tail via a memory‑mapped view of the WAL.
  • Metadata: base‑offset → local‑file‑path, length, lastAccessNanos.
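
As a concrete illustration of that metadata, the CacheManager might track one small record per cached segment (names are hypothetical):

```java
import java.nio.file.Path;

// Hypothetical per-segment bookkeeping: base offset -> local file, length, last access time.
record CachedSegment(long baseOffset, Path localFile, long lengthBytes, long lastAccessNanos) {}
```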

5  Interaction with Existing Storage Layer (lazy read‑through first)

  • Write path (unchanged). Producer ACKs immediately after WAL fsync; only the existing S3Uploader streams segments to object storage. No second copy is created at write time.

  • Read path (lazy cache fill):

    1. Consumer fetch → CacheManager.lookup(segmentId).
    2. Hit ⇒ zero‑copy read via FileChannel.transferTo (NVMe or memory file).
    3. Miss ⇒ delegate to S3StreamReader; once bytes are returned, CacheManager.put() stores the segment (or segment slice) in the cache asynchronously if cache.warmup=true.
  • Eviction: a background thread runs every evict.interval.ms to enforce cache.max.size.bytes and cache.max.segment.age.ms (a sketch follows below).
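
A sketch of what that eviction loop could look like, reusing the hypothetical CachedSegment record from section 4; the ref‑count protection for in‑flight reads mentioned in the proposal is elided for brevity:

```java
import java.nio.file.Files;
import java.util.Comparator;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative eviction loop enforcing the size and age caps proposed in section 6.
final class CacheEvictor {
    private final ConcurrentHashMap<Long, CachedSegment> index; // baseOffset -> metadata
    private final long maxSizeBytes;      // cache.max.size.bytes
    private final long maxAgeMs;          // cache.max.segment.age.ms
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    CacheEvictor(ConcurrentHashMap<Long, CachedSegment> index, long maxSizeBytes, long maxAgeMs) {
        this.index = index;
        this.maxSizeBytes = maxSizeBytes;
        this.maxAgeMs = maxAgeMs;
    }

    void start(long evictIntervalMs) {
        scheduler.scheduleAtFixedRate(this::evictOnce, evictIntervalMs, evictIntervalMs, TimeUnit.MILLISECONDS);
    }

    private void evictOnce() {
        long now = System.nanoTime();
        // Age-based eviction first: drop anything older than the TTL.
        index.values().removeIf(seg ->
                TimeUnit.NANOSECONDS.toMillis(now - seg.lastAccessNanos()) > maxAgeMs && delete(seg));

        // Size-based eviction: remove least-recently-accessed segments until under the soft cap.
        long total = index.values().stream().mapToLong(CachedSegment::lengthBytes).sum();
        var byAge = index.values().stream()
                .sorted(Comparator.comparingLong(CachedSegment::lastAccessNanos))
                .iterator();
        while (total > maxSizeBytes && byAge.hasNext()) {
            CachedSegment victim = byAge.next();
            if (delete(victim)) {
                index.remove(victim.baseOffset());
                total -= victim.lengthBytes();
            }
        }
    }

    private boolean delete(CachedSegment seg) {
        try {
            Files.deleteIfExists(seg.localFile());
            return true;
        } catch (Exception e) {
            return false; // leave the entry; the next cycle (or an explicit invalidation) will retry
        }
    }
}
```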

Future option: A follow‑up RFC could introduce an eager write‑through (CacheReplicator) path that copies segments to NVMe in parallel with the S3 upload. That optimisation is out of scope for Phase I to keep risk and complexity low.

6  Configuration Parameters (new)

| Property | Default | Description |
| --- | --- | --- |
| cache.type | none | Global switch; one of none, memory, nvme. |
| cache.max.size.bytes | 20g | Soft cap; eviction brings the cache size back under the cap. |
| cache.max.segment.age.ms | 600000 | TTL for a segment in the cache. |
| cache.warmup | false | If true, populate the cache asynchronously after an S3 miss; if false, only explicit writes populate it. |
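
For instance, a latency-sensitive cluster could opt in with broker overrides along these lines (property names as proposed above; values are illustrative):

```properties
# Opt-in hot-tier tail cache (proposed settings; the feature is off by default)
cache.type=nvme
cache.max.size.bytes=20g
cache.max.segment.age.ms=600000
cache.warmup=true
```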

7  Metrics & Observability

| Metric | Type | Purpose |
| --- | --- | --- |
| cache_hit_rate | gauge | Success ratio of cache lookups. |
| cache_evictions_total | counter | Tracks forced evictions. |
| cache_size_bytes | gauge | Current bytes on disk. |
| cache_read_latency_ms (p95/p99) | timer | Compare against S3 read latency. |
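
Since Micrometer is already named in the proposal, registration could look roughly like the following; the SegmentCache/CacheStats types refer to the SPI sketch in section 3.1 and are assumptions, not existing AutoMQ code:

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

final class CacheMetrics {
    private final Counter evictions;
    private final Timer readLatency;

    CacheMetrics(MeterRegistry registry, SegmentCache cache) {
        // Hit rate and size are read lazily from the cache's stats snapshot.
        Gauge.builder("cache_hit_rate", cache, c -> {
            SegmentCache.CacheStats s = c.stats();
            long lookups = s.hits() + s.misses();
            return lookups == 0 ? 0.0 : (double) s.hits() / lookups;
        }).register(registry);
        Gauge.builder("cache_size_bytes", cache, c -> (double) c.stats().sizeBytes()).register(registry);

        evictions = Counter.builder("cache_evictions_total").register(registry);
        readLatency = Timer.builder("cache_read_latency_ms")
                .publishPercentiles(0.95, 0.99)   // exposes p95/p99 for comparison against S3 reads
                .register(registry);
    }

    Counter evictions() { return evictions; }
    Timer readLatency() { return readLatency; }
}
```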

8  Failure Handling

  • Cache corruption ⇒ auto‑invalidate segment & fall back to S3.
  • NVMe volume loss ⇒ broker restarts with empty cache (stateless guarantee). No data loss.
  • Low‑disk‑space alert ⇒ force eviction down to 80 % of cache.max.size.bytes.
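
The corruption fallback could be as simple as a checksum check that invalidates the entry and falls back to S3; the SegmentCache and S3SegmentReader interfaces are the hypothetical ones sketched earlier, and the CRC32C scheme here is only a placeholder:

```java
import java.nio.ByteBuffer;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.zip.CRC32C;

// Illustrative "cache corruption => invalidate and fall back to S3" path.
final class SafeCacheReader {
    private final SegmentCache cache;
    private final S3SegmentReader s3; // stands in for the existing S3StreamReader path

    SafeCacheReader(SegmentCache cache, S3SegmentReader s3) {
        this.cache = cache;
        this.s3 = s3;
    }

    CompletableFuture<ByteBuffer> read(long baseOffset, long expectedCrc32c) {
        Optional<ByteBuffer> cached = cache.lookup(baseOffset);
        if (cached.isPresent() && crc32c(cached.get()) == expectedCrc32c) {
            return CompletableFuture.completedFuture(cached.get());
        }
        if (cached.isPresent()) {
            cache.invalidate(baseOffset); // checksum mismatch: treat as corruption, drop the entry
        }
        return s3.read(baseOffset);       // stateless guarantee: S3 remains the source of truth
    }

    private static long crc32c(ByteBuffer buf) {
        CRC32C crc = new CRC32C();
        crc.update(buf.duplicate());      // duplicate so the caller's position is untouched
        return crc.getValue();
    }
}
```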

9  Implementation Plan

  1. Phase I (this RFC):

    • Introduce SegmentCache SPI + InMemoryLRUCache.
    • Implement NVMeFileCache using direct‑I/O mapped files.
    • Integrate lazy read‑through fill logic on the consumer fetch path.
    • Ship behind feature flag cache.type=nvme (defaults to none).
  2. Phase II (future RFC):

    • Optional eager write‑through (CacheReplicator) pipeline.
    • Advanced eviction heuristics (LFU, adaptive age).
    • NUMA/SPDK optimisations.
  3. Phase III: Empirical benchmarks (JMH, 128 partitions, 1 KB msgs, 10 k msgs/s).

  4. Phase IV: Production‑grade metrics, docs, Helm/Compose examples.

10  Trade‑offs

| Benefit | Drawback |
| --- | --- |
| p99 read latency drops from ~10 ms to ~2 ms. | Slight compute overhead (cache lookup, eviction). |
| Keeps brokers stateless (cache is disposable). | Adds a local‑disk requirement; cannot run on truly disk‑less Fargate. |
| Zero data copy on rebalance (stateless property intact). | Cache warms up after scale‑in/out; first fetches may still hit S3. |

11  Alternatives Considered

  1. Increase WAL size to 20 GiB – wastes EBS and does not help segments already flushed.
  2. Client‑side cache – complicated, per‑consumer, does not help fan‑out.
  3. Classic tiered storage (SSD + S3) – re‑introduces stateful brokers; violates AutoMQ core principle.

12  Open Questions

  1. Should we use Direct‑I/O or page‑cache for NVMeFileCache reads?
  2. Eviction policy – pure LRU vs LFU? Mixed‑age LFUDA?
  3. Config‑driven S3 prefetch on cache miss – is warm‑up desirable or too costly?
  4. Any security implications of mapping files outside data directory?
  5. Benchmarks: What latency target constitutes “success” for merge?

13  References

14  Call for Discussion & Next Steps

  1. API surface — is the SegmentCache SPI minimal yet future‑proof (e.g., supports adaptive LFU)?
  2. Eviction policy — stick with LRU or add LFU/age hybrid before merge?
  3. Latency target — agree that p99 < 3 ms is “good enough,” or push for sub‑2 ms stretch goal?
  4. Default state — should cache.type default to none (current proposal) or nvme in cloud‑native charts?
  5. Benchmark acceptance criteria — what traffic pattern & hit‑rate threshold must Phase I demonstrate?

Please leave comments on each point or suggest additional concerns. Once consensus forms, we’ll convert this RFC into Phase I PRs (SPI + in‑mem cache) and a follow‑up PR for NVMeFileCache. Pull‑request series tracked under milestone “Hot‑Tier Cache”.
