
Commit 3b6dbef

qimcis and PeaBrane authored
feat: Update docs to indicate need to use consistent hashing for KV events in backend engines (#2981)
Signed-off-by: PeaBrane <[email protected]> Co-authored-by: Yan Ru Pei <[email protected]>
1 parent 0006106 commit 3b6dbef

File tree

4 files changed: +19 −3 lines changed


benchmarks/router/run_engines.sh

Lines changed: 2 additions & 2 deletions
```diff
@@ -125,8 +125,8 @@ for i in $(seq 1 $NUM_WORKERS); do
            "${EXTRA_ARGS[@]}"
    else
        echo "[Worker-$i] Using GPUs: $GPU_DEVICES"
-       # Run vLLM engine (exec with env for proper syntax)
-       exec env CUDA_VISIBLE_DEVICES=$GPU_DEVICES python -m dynamo.vllm \
+       # Run vLLM engine with PYTHONHASHSEED=0 for deterministic event IDs in KV-aware routing
+       exec env PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=$GPU_DEVICES python -m dynamo.vllm \
            --model "$MODEL_PATH" \
            --endpoint dyn://test.vllm.generate \
            --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
```
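
For readers launching workers outside this benchmark script, a minimal sketch of the same idea (flags trimmed to those visible in the hunk above; GPU IDs and the model path are placeholders): every worker process needs the same hash seed so their Python-derived KV event IDs agree.

```bash
# Hypothetical manual launch mirroring the change above: each worker exports
# the same PYTHONHASHSEED so hash()-derived KV event IDs match across processes.
PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=0 python -m dynamo.vllm \
    --model "$MODEL_PATH" --endpoint dyn://test.vllm.generate &
PYTHONHASHSEED=0 CUDA_VISIBLE_DEVICES=1 python -m dynamo.vllm \
    --model "$MODEL_PATH" --endpoint dyn://test.vllm.generate &
wait
```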

components/backends/sglang/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -237,4 +237,4 @@ We currently provide deployment examples for Kubernetes and SLURM.
 - **[Deploying Dynamo with SGLang on Kubernetes](deploy/README.md)**
 
 ## SLURM
-- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
+- **[Deploying Dynamo with SGLang on SLURM](slurm_jobs/README.md)**
```

components/backends/vllm/README.md

Lines changed: 12 additions & 0 deletions
````diff
@@ -168,6 +168,18 @@ See `args.py` for the full list of configuration options and their defaults.
 
 The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.
 
+### Hashing Consistency for KV Events
+
+When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
+
+- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
+- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:
+
+```bash
+vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
+```
+
+See the high-level notes in [KV Cache Routing](../../../docs/architecture/kv_cache_routing.md) on deterministic event IDs.
+
 ## Request Migration
 
 You can enable [request migration](../../../docs/architecture/request_migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
````
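
As a quick illustration of why the first option matters (this shows plain CPython behavior and is not part of the commit): without a fixed seed, CPython randomizes string hashes per process, so two workers would disagree on any `hash()`-derived ID.

```bash
# Each invocation is a fresh process; without a fixed seed the values differ.
python3 -c 'print(hash("same prefix tokens"))'
python3 -c 'print(hash("same prefix tokens"))'   # almost certainly a different number

# With the seed pinned, every process produces the same value.
PYTHONHASHSEED=0 python3 -c 'print(hash("same prefix tokens"))'
PYTHONHASHSEED=0 python3 -c 'print(hash("same prefix tokens"))'   # identical output
```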

docs/architecture/kv_cache_routing.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -203,6 +203,10 @@ The two types of events are:
 
 The publisher can be initialized and used through C bindings or Python bindings.
 
+### Deterministic Event IDs
+
+For KV-aware routing to work across multiple workers and restarts, engines must emit deterministic block identifiers in KV events. Ensure all workers use identical engine versions/configuration so that block IDs for the same token content remain consistent. If your engine relies on Python's builtin `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect. The router recomputes local block hashes from tokens for matching, but parent/child links and removals depend on engine-provided IDs being stable.
+
 ### KVIndexer
 The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.
 
```
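
The paragraph added above leans on block IDs that any process can recompute. As an illustrative sketch only (not the engine's actual ID scheme), a chained content digest has exactly that property: the same tokens and the same parent always yield the same IDs, so parent/child links in KV events stay stable across workers and restarts.

```bash
# Illustrative only: a deterministic block-ID chain, where each block's ID is a
# digest over its parent's ID plus its own tokens. Any process recomputes the
# same chain, unlike values derived from an unseeded Python hash().
PARENT=$(echo -n "root:token_1,token_2,token_3,token_4" | sha256sum | cut -d' ' -f1)
CHILD=$(echo -n "${PARENT}:token_5,token_6,token_7,token_8" | sha256sum | cut -d' ' -f1)
echo "parent block id: $PARENT"
echo "child block id:  $CHILD"
```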
