KyleSDeveloper/rag_service

Production RAG Service — Starter Kit

Python 3.11 · FastAPI · Docker · CI

Hybrid retrieval (BM25 baseline today; vectors/reranking optional) with evals, auth, rate limiting, and latency metrics.

Acceptance Criteria (edit targets as needed)

  • Recall@10 ≥ 0.80; Answer F1 ≥ 0.70 (or EM ≥ 0.60)
  • p95 latency ≤ 800 ms (≥100 queries); p50 ≤ 300 ms
  • Cost/1k queries within budget; cache hit-rate ≥ 30%
  • API-key auth + rate limiting
  • Docker + one-click deploy (Render/Fly/Cloud Run)
  • README benchmarks table + Loom demo

Quickstart

Option A — conda

conda create -n rag_env python=3.11 -y
conda activate rag_env
pip install -r requirements.txt

# build the BM25 index
python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json

# run (auth + rate limit)
API_KEY=dev-key RATE_LIMIT_PER_MIN=30 \
python -m uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

Option B — venv

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# build the BM25 index
python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json

# run (auth + rate limit)
API_KEY=dev-key RATE_LIMIT_PER_MIN=30 \
uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

One-liner — build index & run

# env vars go on the uvicorn command: a VAR=value prefix applies only to the command it precedes
python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json && \
API_KEY=dev-key RATE_LIMIT_PER_MIN=30 \
python -m uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

Endpoints

  • GET /health → {"ok": true, "version": "..."}
  • GET /version → {"version": "..."}
  • GET /metrics → {"requests": n, "latency_ms_p50": ..., "latency_ms_p95": ..., "window": n}
  • POST /ask → {"answer": "...", "latency_ms": 0.0, "docs": [{"doc_id": "...", "text": "...", "score": ...}]}

Usage (auth required)

Set your base URL and API key (local example shown):

export BASE_URL=http://localhost:8010
export API_KEY=dev-key

Authorized request (200):

curl -s -X POST "$BASE_URL/ask" \
  -H "x-api-key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}' | python -m json.tool

Unauthorized example (should be 401):

curl -i -s -X POST "$BASE_URL/ask" \
  -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}'

Metrics:

curl -s "$BASE_URL/metrics" | python -m json.tool
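The same authorized call can be made from Python. The sketch below uses only the standard library and assumes the request body and x-api-key header shown in the curl examples above; the helper names build_ask_request and ask are illustrative and not part of the repository.

```python
import json
import urllib.request


def build_ask_request(base_url: str, api_key: str, question: str, k: int = 5) -> urllib.request.Request:
    """Build (but do not send) a POST /ask request with the API-key header."""
    body = json.dumps({"question": question, "k": k}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/ask",
        data=body,
        method="POST",
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
    )


def ask(base_url: str, api_key: str, question: str, k: int = 5, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response.

    urllib raises urllib.error.HTTPError for 401 and 429 responses.
    """
    request = build_ask_request(base_url, api_key, question, k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, ask("http://localhost:8010", "dev-key", "What is coinsurance?")["answer"] should return the answer string from the response.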

Evaluate

Place 50 Q/A pairs in eval/gold.jsonl, one JSON object per line:

{"question":"...", "answer":"..."}

Run the evaluator:

python -m eval.evaluate --gold ./eval/gold.jsonl --api http://localhost:8010/ask --k 5
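The scoring code in eval/evaluate.py is not shown here, so the following is only a sketch of how such metrics are commonly computed. It assumes SQuAD-style token-overlap F1 and exact match, and it assumes a retrieval hit means the normalized gold answer appears in one of the top-k document texts. All function names are hypothetical.

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (multiset overlap)."""
    pred, ref = _tokens(prediction), _tokens(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(prediction: str, gold: str) -> bool:
    """True when both strings have the same token sequence."""
    return _tokens(prediction) == _tokens(gold)


def answer_in_top_k(doc_texts: list[str], gold: str, k: int) -> bool:
    """Assumed retrieval hit: the normalized gold answer occurs in a top-k doc."""
    needle = " ".join(_tokens(gold))
    return bool(needle) and any(
        needle in " ".join(_tokens(text)) for text in doc_texts[:k]
    )
```

Averaging token_f1 over all gold pairs gives Answer F1; the fraction of questions where answer_in_top_k is true gives a Recall@k figure under this assumption.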

Benchmarks (local demo)

| Metric | Value |
| --- | --- |
| Answer F1 | 1.00 (toy) |
| Recall@10 | 1.00 (toy) |
| p50 latency | 0.199 ms (local) |
| p95 latency | 0.355 ms (local) |

Architecture

flowchart TD
  A[Client] -->|HTTP| S[FastAPI Service]

  subgraph Routes
    S --> R1["POST /ask"]
    S --> R2["GET /health"]
    S --> R3["GET /version"]
    S --> R4["GET /metrics"]
  end

  R1 --> G{API key valid?}
  G -- No --> E401[401 Unauthorized]
  G -- Yes --> L{Within rate limit?}
  L -- No --> E429[429 Rate Limited]
  L -- Yes --> Q[BM25 query k]

  Q --> IDX[Index JSON]
  Q --> B[Stopword-aware boost]
  B --> SSEL[Best sentence]
  SSEL --> CAN[Canonical phrasing]
  CAN --> RESP[Response: answer/docs/latency_ms]

  R1 -. on success .-> MREC[Record latency]
  MREC --> R4
 
  subgraph Build
    C1[corpus txt files] --> IDX
    C2[python -m rag_app.index] --> IDX
  end

Components

  • API layer: rag_app/main.py (FastAPI app, routes, request/response models).
  • Auth: Simple API-key via x-api-key header. Disabled if API_KEY env is unset/empty.
  • Rate limiting: In-memory token bucket per key (RATE_LIMIT_PER_MIN), thread-safe.
  • Retrieval: rag_app/retrieval.py with BM25Retriever over a JSON index.
  • Index build: rag_app/index.py splits corpus/*.txt into snippets → writes rag_app/index.json.
  • Answering: Stopword-aware boost, choose best sentence from top snippet, then optional canonical phrasing for known intents.
  • Metrics: In-memory deque of recent latencies (p50/p95) + request count, exposed at /metrics.
  • (Optional) Cache: Small in-memory LRU for repeated (question,k) lookups.

Request lifecycle (POST /ask)

  1. Guard: Check x-api-key (if API_KEY is set) and rate limit the caller.
  2. Retrieve: Query BM25 over rag_app/index.json (top-k).
  3. Re-rank: Apply stopword-aware term-match boost to prioritize relevant snippets.
  4. Answer pick: Choose the best sentence from the top snippet; if the question matches a known intent, apply canonical phrasing.
  5. Metrics: Record latency (ms) into a rolling window (default 5k requests).
  6. Respond: Return {answer, latency_ms, docs}.
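Steps 3 and 4 can be illustrated with a small sketch. This is not the code in rag_app/main.py: the stopword list, the boost weight, and the function names are assumptions chosen for illustration, and the docs are assumed to be dicts with the doc_id, text, and score fields from the /ask response.

```python
import re

# Hypothetical stopword list; the service's real list is not shown in this README.
STOPWORDS = frozenset(
    {"a", "an", "the", "is", "are", "what", "of", "to", "in", "and", "or", "for", "how", "does", "do"}
)


def content_terms(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def boost(question: str, docs: list[dict], weight: float = 0.5) -> list[dict]:
    """Add a bonus proportional to the share of question terms found in each doc."""
    q_terms = content_terms(question)
    rescored = []
    for doc in docs:
        matched = len(q_terms & content_terms(doc["text"]))
        bonus = weight * matched / len(q_terms) if q_terms else 0.0
        rescored.append({**doc, "score": doc["score"] + bonus})
    return sorted(rescored, key=lambda d: d["score"], reverse=True)


def best_sentence(question: str, text: str) -> str:
    """Pick the sentence sharing the most content terms with the question."""
    q_terms = content_terms(question)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return ""
    # max() keeps the earliest sentence when several tie.
    return max(sentences, key=lambda s: len(q_terms & content_terms(s)))
```

Ignoring stopwords keeps words such as "what" and "is" from inflating the match count, so the boost reflects only the terms that carry the question's meaning.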

Data & storage

  • Corpus: Plain text files under corpus/. Edit or replace for your domain.
  • Index artifact: rag_app/index.json (generated). Treat as a build artifact; ignore in git.
    • Build at image build time (Docker) or at container start if missing.

Configuration (env)

  • API_KEY – enables auth when set (e.g., dev-key for local).
  • RATE_LIMIT_PER_MIN – integer per-key budget (default 60).
  • (If you add caching) CACHE_TTL_S, CACHE_MAX.
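One way to read these variables at startup is sketched below. The Settings class and load_settings function are hypothetical names; the defaults follow the values listed above (auth disabled when API_KEY is unset or empty, 60 requests per minute).

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    api_key: Optional[str]      # None means auth is disabled
    rate_limit_per_min: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read configuration from an environment-like mapping."""
    api_key = env.get("API_KEY", "") or None
    limit = int(env.get("RATE_LIMIT_PER_MIN", "60"))
    if limit <= 0:
        raise ValueError("RATE_LIMIT_PER_MIN must be a positive integer")
    return Settings(api_key=api_key, rate_limit_per_min=limit)
```

Passing the mapping as a parameter makes the loader easy to test without touching the real process environment.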

Module layout (key files)

rag_app/
├─ main.py         # FastAPI app, routes, auth, limiter, metrics, answering
├─ retrieval.py    # BM25Retriever (loads/snaps index)
├─ index.py        # builds JSON index from corpus/*.txt
└─ index.json      # generated artifact (ignored in VCS)
eval/
└─ evaluate.py     # computes F1/Recall@k via API calls
corpus/
└─ *.txt           # domain text

Deployment (Docker & one-click)

One-click (Render)

Deploy to Render

Uses Dockerfile. Set API_KEY in Render env vars after deploy.

Docker (local)

# build
docker build -t rag-service .

# run (maps host port 8000 to container port 8000)
docker run --rm -p 8000:8000 \
  -e API_KEY=dev-key \
  -e RATE_LIMIT_PER_MIN=60 \
  rag-service

Set BASE_URL=http://localhost:8000 when testing the container.

Monitoring & Metrics

What’s exposed

  • GET /metrics → JSON:
    {
      "requests": 42,
      "latency_ms_p50": 1.23,
      "latency_ms_p95": 3.45,
      "window": 42,
      "version": "..."
    }
  • window = number of recent requests kept in memory (rolling window).
  • Values reset on process restart (in-memory).
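A rolling window like this can be sketched with a bounded deque. The class below is illustrative rather than the service's own code: the LatencyWindow name and the nearest-rank percentile method are assumptions, and the 5,000-sample default follows the window size mentioned in the request lifecycle.

```python
import math
from collections import deque


class LatencyWindow:
    """Keeps the most recent latencies and a lifetime request count."""

    def __init__(self, maxlen: int = 5000) -> None:
        self._samples: deque[float] = deque(maxlen=maxlen)  # oldest samples drop off
        self.requests = 0

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)
        self.requests += 1

    def percentile(self, pct: int) -> float:
        """Nearest-rank percentile over the current window (0.0 when empty)."""
        if not self._samples:
            return 0.0
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return ordered[rank - 1]

    def snapshot(self) -> dict:
        return {
            "requests": self.requests,
            "latency_ms_p50": self.percentile(50),
            "latency_ms_p95": self.percentile(95),
            "window": len(self._samples),
        }
```

Sorting on every read is fine for a window of a few thousand samples; a shared instance would also need a lock if handlers run in multiple threads.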

Quick checks

# Pretty print
curl -s "$BASE_URL/metrics" | python -m json.tool

# Print just key numbers (quote-safe)
curl -s "$BASE_URL/metrics" \
| python -c 'import sys,json; d=json.load(sys.stdin); print("requests={}  p50={} ms  p95={} ms".format(d["requests"], d["latency_ms_p50"], d["latency_ms_p95"]))'

Optional: Prometheus endpoint

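# install (shell)
pip install prometheus-fastapi-instrumentator

# rag_app/main.py
from prometheus_fastapi_instrumentator import Instrumentator

# Instrument at import time: middleware cannot be added once the app has started.
instrumentator = Instrumentator().instrument(app)

@app.on_event("startup")
def _startup():
    # Expose the Prometheus scrape route next to the JSON /metrics endpoint.
    instrumentator.expose(app, endpoint="/metrics/prom")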

Security (API key & rate limiting)

API key

  • Header: x-api-key: <YOUR_KEY>
  • Enabled when API_KEY env var is set (any non-empty string).
  • Disabled in dev if API_KEY is empty.
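The rules above reduce to a small check. The sketch below is a hypothetical helper, not the service's actual code; the constant-time comparison via hmac.compare_digest is a hardening assumption that the README does not confirm.

```python
import hmac
from typing import Optional


def is_authorized(provided: Optional[str], expected: Optional[str]) -> bool:
    """Return True when the request may proceed.

    provided: value of the x-api-key header (None if absent).
    expected: value of the API_KEY env var (None or empty disables auth).
    """
    if not expected:
        return True  # auth disabled
    if not provided:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(provided.encode("utf-8"), expected.encode("utf-8"))
```

A route guard would respond with 401 whenever this returns False.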

Examples

# Authorized (200)
curl -s -X POST "$BASE_URL/ask" \
  -H "x-api-key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}'

# Unauthorized (401)
curl -i -s -X POST "$BASE_URL/ask" \
  -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":5}'

Rate limiting

  • In-memory token bucket per API key.
  • Budget per minute: RATE_LIMIT_PER_MIN (default 60).
  • Exceeds budget → 429 Too Many Requests.
  • For multi-replica deployments, move buckets to Redis (shared state).
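A per-key token bucket can be sketched as follows. This is an illustration under assumptions, not the limiter in rag_app/main.py: the TokenBucket class, its injectable clock, and the choice to start each key with a full bucket are all hypothetical.

```python
import threading
import time
from typing import Callable


class TokenBucket:
    """In-memory, thread-safe token bucket keyed by API key."""

    def __init__(self, per_minute: int, clock: Callable[[], float] = time.monotonic) -> None:
        self.per_minute = per_minute
        self._clock = clock
        self._lock = threading.Lock()
        self._buckets: dict[str, tuple[float, float]] = {}  # key -> (tokens, last_seen)

    def allow(self, key: str) -> bool:
        """Consume one token for key; False means the caller should get a 429."""
        now = self._clock()
        with self._lock:
            tokens, last = self._buckets.get(key, (float(self.per_minute), now))
            # Refill in proportion to elapsed time, capped at the per-minute budget.
            tokens = min(float(self.per_minute), tokens + (now - last) * self.per_minute / 60.0)
            allowed = tokens >= 1.0
            if allowed:
                tokens -= 1.0
            self._buckets[key] = (tokens, now)
            return allowed
```

Because the state lives in one process, each replica would enforce its own budget, which is why shared storage such as Redis is suggested for multi-replica deployments.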

Set limits

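# replace <strong-secret> with a real key; the quotes stop the shell from reading < and > as redirections
API_KEY='<strong-secret>' RATE_LIMIT_PER_MIN=60 \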
uvicorn rag_app.main:app --host 0.0.0.0 --port 8010

Best practices

  • Use different keys per environment (dev/stage/prod).
  • Rotate keys; never commit them.
  • Front with a gateway/WAF if exposed publicly.
  • Add CORS policy if you’ll call from a browser app.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| 401 Unauthorized on /ask | Missing x-api-key header, or a key that does not match the server's API_KEY | Send an x-api-key header that matches the API_KEY set on the server. Check that the server is up with curl -s $BASE_URL/health. |
| 429 Too Many Requests | Rate limit exceeded | Lower request rate, increase RATE_LIMIT_PER_MIN, or use separate keys for tests. |
| 404 Not Found on /version or /ask | Wrong app path or port | Ensure you run rag_app.main:app and target the right port. List paths via /openapi.json. |
| Port already in use | Old server still running | Run ss -lptn 'sport = :8010', then kill the PID, or change --port. |
| uvicorn: command not found | Not installed in current env | pip install "uvicorn[standard]"; confirm with which python / which uvicorn. |
| ModuleNotFoundError: rag_app | Wrong cwd / PYTHONPATH | Run from repo root or set PYTHONPATH=. before uvicorn rag_app.main:app .... |
| Index missing at startup | rag_app/index.json not built | Run python -m rag_app.index --corpus ./corpus --out ./rag_app/index.json. |
| /metrics shows zeros | Fresh process or no traffic | Send a few /ask requests, then recheck. |
| JSON errors in CLI snippets | F-string quoting | Use the .format() example in the Monitoring section. |
| Docker healthcheck failing | Wrong port or env | Container listens on $PORT (default 8000). Map that port and set API_KEY. |

Diagnostics

# List routes
curl -s "$BASE_URL/openapi.json" | python -m json.tool

# Health/version
curl -s "$BASE_URL/health"; curl -s "$BASE_URL/version"

# Minimal POST
curl -s -X POST "$BASE_URL/ask" \
  -H "x-api-key: $API_KEY" -H "Content-Type: application/json" \
  -d '{"question":"What is coinsurance?","k":3}' | python -m json.tool

Notes

  • Start with BM25 baseline (rank_bm25), then add vectors + reranker as needed.
  • Consider a small LRU cache for repeated queries and structured logging for observability.
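A small LRU cache with expiry could look like the sketch below. It is not part of the repository: the TTLCache class is hypothetical, and its max_items and ttl_s parameters are assumed to correspond to the CACHE_MAX and CACHE_TTL_S settings mentioned under Configuration.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class TTLCache:
    """LRU cache whose entries also expire after ttl_s seconds."""

    def __init__(self, max_items: int = 256, ttl_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_items = max_items
        self._ttl_s = ttl_s
        self._clock = clock
        self._data: OrderedDict[Hashable, tuple[float, Any]] = OrderedDict()

    def get(self, key: Hashable) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: drop and report a miss
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._data[key] = (self._clock() + self._ttl_s, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_items:
            self._data.popitem(last=False)  # evict least recently used
```

Keying on the tuple (question, k) matches the repeated-lookup case described under Components. The class is not thread-safe on its own, so a shared instance would need a lock.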

License

MIT — see LICENSE.

Contact

Questions? Open an issue or ping me on LinkedIn.
