feat: add rate limiter logic to dynamo's openai api compatible http service (v1) #1949
Conversation
👋 Hi jorgeantonio21! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough

This update introduces a comprehensive, configurable rate limiting system to the HTTP service, based on time-weighted exponential moving averages (EMA) of time to first token (TTFT) and inter-token latency (ITL). The rate limiter is integrated into service configuration, Python bindings, metrics, and OpenAI-compatible endpoints, with extensive documentation, benchmarks, and tests for correctness, performance, and recovery.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTP_Service
    participant RateLimiter
    participant Metrics
    participant Engine
    Client->>HTTP_Service: Send request (e.g., completions)
    HTTP_Service->>RateLimiter: should_reject(model, endpoint, type)
    RateLimiter-->>HTTP_Service: Allow/Reject decision + EMA metrics
    alt Rejected
        HTTP_Service-->>Client: HTTP 429 Rate Limit Exceeded
        HTTP_Service->>Metrics: Record rejection + EMA
    else Allowed
        HTTP_Service->>Engine: Process request
        Engine-->>HTTP_Service: Response (stream/unary)
        HTTP_Service->>RateLimiter: record_ttft/itl
        HTTP_Service->>Metrics: Record success + latency
        HTTP_Service-->>Client: Return response
    end
```
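The allow/reject decision in the diagram can be sketched as a small standalone model. This is illustrative only: the names (`Decision`, `should_reject`) and the simple threshold comparison are assumptions simplified from the PR, which tracks time-weighted EMAs per model or globally.

```rust
// Simplified model of the admission decision shown in the sequence diagram.
// All names are hypothetical; the real RateLimiter compares time-weighted
// EMAs of TTFT and ITL against configured thresholds.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Reject, // surfaced to the client as HTTP 429
}

struct RateLimiter {
    ttft_threshold_secs: f64,
    itl_threshold_secs: f64,
}

impl RateLimiter {
    // Reject when either tracked average exceeds its threshold.
    fn should_reject(&self, ttft_ema: f64, itl_ema: f64) -> Decision {
        if ttft_ema > self.ttft_threshold_secs || itl_ema > self.itl_threshold_secs {
            Decision::Reject
        } else {
            Decision::Allow
        }
    }
}

fn main() {
    let rl = RateLimiter { ttft_threshold_secs: 1.0, itl_threshold_secs: 0.03 };
    // Healthy latencies: admit the request.
    assert_eq!(rl.should_reject(0.4, 0.01), Decision::Allow);
    // Degraded TTFT: reject early, before the request reaches the engine.
    assert_eq!(rl.should_reject(1.5, 0.01), Decision::Reject);
}
```

Rejecting before the request reaches the engine is what keeps the limiter cheap: the only per-request cost on the hot path is a comparison against the cached averages.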
Actionable comments posted: 5
🔭 Outside diff range comments (1)
lib/llm/tests/http-service.rs (1)

`166-180`: Update `compare_counters` to include the new `Rejected` status.

The function only checks `Success` and `Error` statuses but not the newly added `Rejected` status. This creates a gap in test coverage for rate limiting metrics.

```diff
 fn compare_counters(metrics: &Metrics, model: &str, expected: &[u64; 8]) {
     for endpoint in &[Endpoint::Completions, Endpoint::ChatCompletions] {
         for request_type in &[RequestType::Unary, RequestType::Stream] {
-            for status in &[Status::Success, Status::Error] {
+            for status in &[Status::Success, Status::Error, Status::Rejected] {
                 let index = compute_index(endpoint, request_type, status);
                 compare_counter(
                     metrics,
                     model,
                     endpoint,
                     request_type,
                     status,
                     expected[index],
                 );
             }
         }
     }
 }
```

Note: You'll also need to update the expected array size from 8 to 12 to accommodate the additional Rejected status combinations (2 endpoints × 2 request types × 3 statuses = 12).
🧹 Nitpick comments (2)
lib/llm/src/http/service/rate_limiter.rs (1)

`103-187`: Mathematically correct and efficient EMA implementation.

The recursive formula implementation is correct and provides O(1) update complexity. The decay calculation properly models system recovery during idle periods.

One minor optimization opportunity: consider caching `Instant::now()` in `get_decayed_time_weighted_average` if it's called multiple times in quick succession.

lib/llm/benches/rate_limiter.rs (1)

`42-42`: Consider removing the throughput setting for a more accurate benchmark comparison.

Setting throughput to 1 element makes it harder to compare efficiency across different sample sizes. Either remove this line to let Criterion use iteration time, or set it to reflect the sample size for better comparison.

```diff
- group.throughput(Throughput::Elements(1)); // One calculation per iteration
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)

- `Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (13)

- components/http/src/main.rs (4 hunks)
- docs/guides/rate_limiting.md (1 hunks)
- lib/bindings/python/rust/http.rs (2 hunks)
- lib/bindings/python/rust/lib.rs (1 hunks)
- lib/bindings/python/src/dynamo/_core.pyi (2 hunks)
- lib/llm/Cargo.toml (2 hunks)
- lib/llm/benches/rate_limiter.rs (1 hunks)
- lib/llm/src/http/service.rs (1 hunks)
- lib/llm/src/http/service/metrics.rs (11 hunks)
- lib/llm/src/http/service/openai.rs (9 hunks)
- lib/llm/src/http/service/rate_limiter.rs (1 hunks)
- lib/llm/src/http/service/service_v2.rs (5 hunks)
- lib/llm/tests/http-service.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
lib/llm/src/http/service.rs (1)
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:07:24.843Z
Learning: The codebase uses async-nats version 0.40, not the older nats crate. Error handling should use async_nats::error::Error variants, not nats::Error variants.
lib/llm/src/http/service/openai.rs (1)
Learnt from: kthui
PR: ai-dynamo/dynamo#1424
File: lib/runtime/src/pipeline/network/egress/push_router.rs:204-209
Timestamp: 2025-06-13T22:32:05.022Z
Learning: In async-nats, the "no responders" error is represented as async_nats::client::RequestErrorKind::NoResponders, not async_nats::Error::NoResponders. Use err.downcast_ref::<async_nats::client::RequestError>() and then check request_err.kind() against RequestErrorKind::NoResponders.
lib/llm/tests/http-service.rs (1)
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1919
File: lib/runtime/src/engine.rs:168-168
Timestamp: 2025-07-14T21:25:56.898Z
Learning: The AsyncEngineContextProvider trait in lib/runtime/src/engine.rs was intentionally changed from `Send + Sync + Debug` to `Send + Debug` because the Sync bound was overly constraining. The trait should only require Send + Debug as designed.
lib/llm/src/http/service/rate_limiter.rs (2)
Learnt from: jthomson04
PR: ai-dynamo/dynamo#1429
File: lib/runtime/src/utils/leader_worker_barrier.rs:69-72
Timestamp: 2025-06-08T03:12:03.985Z
Learning: In the leader-worker barrier implementation in lib/runtime/src/utils/leader_worker_barrier.rs, the `wait_for_key_count` function correctly uses exact equality (`==`) instead of greater-than-or-equal (`>=`) because worker IDs must be unique (enforced by etcd create-only operations), ensuring exactly the expected number of workers can register.
Learnt from: PeaBrane
PR: ai-dynamo/dynamo#1285
File: lib/llm/src/kv_router/scoring.rs:58-63
Timestamp: 2025-05-30T06:38:09.630Z
Learning: In lib/llm/src/kv_router/scoring.rs, the user prefers to keep the panic behavior when calculating load_avg and variance with empty endpoints rather than adding guards for division by zero. They want the code to fail fast on this error condition.
🧬 Code Graph Analysis (1)

lib/llm/src/http/service.rs (1)

- lib/llm/src/http/service/service_v2.rs: `rate_limiter` (48-50)
🪛 LanguageTool
docs/guides/rate_limiting.md
[uncategorized] ~1-~1: If this is a compound adjective that modifies the following noun, use a hyphen.
Context: # Rate Limiting Guide ## Overview The Dynamo LLM serv...
(EN_COMPOUND_ADJECTIVE_INTERNAL)
🔇 Additional comments (35)
lib/llm/Cargo.toml (2)
`39-42`: LGTM! Benchmark configuration follows standard patterns.

The benchmark section is correctly configured with `harness = false` for a custom benchmark harness, which is appropriate for performance testing of the rate limiter implementation.

`116-116`: Appropriate dependency addition for concurrent data structures.

The dashmap 6.1.0 dependency is well suited for the rate limiter implementation, providing concurrent hash maps that are essential for thread-safe rate limiting across multiple requests.
lib/llm/src/http/service.rs (1)

`26-26`: LGTM! Module declaration follows existing patterns.

The public module declaration is correctly placed and follows the same pattern as other modules in the service. This appropriately exposes the rate limiter functionality to the rest of the codebase.

lib/bindings/python/rust/lib.rs (1)

`101-101`: LGTM! Correct Python class registration.

The RateLimiterConfig class registration follows the established pattern for exposing Rust structs to Python. The placement within the class registration section is appropriate.
lib/bindings/python/src/dynamo/_core.pyi (2)

`810-814`: LGTM! Updated constructor signature matches the Rust implementation.

The HttpService constructor is correctly updated to accept optional parameters for port and rate_limiter_config, maintaining backward compatibility while enabling the new rate limiting functionality.

`823-834`: LGTM! Well-designed RateLimiterConfig class.

The RateLimiterConfig class provides appropriate parameters for configuring rate limiting:
- TTFT and ITL thresholds in seconds (float precision)
- Time constant for EMA calculation
- Optional per-model limits flag with sensible default

The parameter types and naming conventions are consistent with the implementation.
components/http/src/main.rs (4)

`7-7`: LGTM! Appropriate import addition.

The RateLimiterConfig import is correctly added to support the new rate limiting functionality.

`33-67`: Well-designed CLI arguments for rate limiting configuration.

The CLI arguments are comprehensive and well documented:
- Clear help text for each parameter
- Sensible default values (1000 ms TTFT, 30 ms ITL, 15 s time constant)
- Appropriate data types (f64 for thresholds, bool for flags)
- Consistent naming conventions

The default values appear reasonable for typical LLM serving scenarios.

`81-95`: Correct conditional rate limiter integration.

The conditional logic properly:
- Converts milliseconds to seconds for the configuration
- Uses the builder pattern appropriately
- Only applies rate limiting when enabled

The unit conversion from milliseconds to seconds is handled correctly.

`123-139`: Appropriate validation for rate limiting parameters.

The validation function correctly checks that all rate limiting parameters are positive values, which is essential for proper rate limiter operation. The error messages are clear and descriptive.
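The millisecond-to-second conversion and positivity validation described in these comments can be sketched roughly as follows. The function names and error format are hypothetical, not the PR's actual code; only the behavior (reject non-positive values, divide by 1000) is taken from the review.

```rust
// Hypothetical sketch of the CLI value handling described above.
fn validate_positive(name: &str, value: f64) -> Result<(), String> {
    if value > 0.0 {
        Ok(())
    } else {
        Err(format!("{name} must be positive, got {value}"))
    }
}

// CLI flags take milliseconds; the RateLimiterConfig takes seconds.
fn ms_to_secs(ms: f64) -> f64 {
    ms / 1000.0
}

fn main() {
    // Defaults mentioned in the review: 1000 ms TTFT, 30 ms ITL.
    assert!(validate_positive("ttft-threshold-ms", 1000.0).is_ok());
    assert!(validate_positive("itl-threshold-ms", -5.0).is_err());
    assert_eq!(ms_to_secs(1000.0), 1.0);
    assert_eq!(ms_to_secs(30.0), 0.03);
}
```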
lib/bindings/python/rust/http.rs (2)
`40-49`: LGTM! Clean integration of rate limiter configuration.

The optional `RateLimiterConfig` parameter maintains backward compatibility while enabling rate limiting when needed. The builder pattern is properly utilized.

`196-221`: Well-structured Python bindings for rate limiter configuration.

The `RateLimiterConfig` class properly wraps the Rust implementation with appropriate error handling and parameter validation.

docs/guides/rate_limiting.md (1)

`1-165`: Excellent documentation coverage for the rate limiting feature.

The guide provides clear explanations of the time-weighted EMA algorithm, comprehensive configuration examples, monitoring guidance, and practical tuning recommendations. The mathematical formulas are correctly presented.
lib/llm/src/http/service/openai.rs (3)
`101-110`: Correct implementation of the rate limit error response.

The method properly returns HTTP 429 (Too Many Requests), which is the standard status code for rate limiting.

`197-199`: Consistent integration of the rate limiter with response collectors.

All response collectors are properly updated to accept the rate limiter, enabling metric collection across endpoints.

Also applies to: 385-387, 554-556

`598-634`: Well-structured rate limiting enforcement logic.

The function properly:
- Checks if rate limiting is enabled before processing
- Records both TTFT and ITL metrics for monitoring
- Increments rejection counters for observability
- Provides clear error messages including the model name
lib/llm/src/http/service/service_v2.rs (3)
`23-23`: Clean integration of the rate limiter into service state.

The rate limiter is properly integrated following the existing patterns for shared state management with Arc, and the accessor methods are consistent with the codebase conventions.

Also applies to: 27-32, 48-54

`99-100`: Proper builder pattern extension for rate limiter configuration.

The optional configuration field and builder method follow the established patterns and maintain the fluent interface.

Also applies to: 151-155

`161-162`: Correct instantiation and integration of the rate limiter.

The rate limiter is properly instantiated with optional configuration and wrapped in Arc for thread-safe sharing.
lib/llm/src/http/service/rate_limiter.rs (4)
`1-44`: Excellent module documentation with clear mathematical explanations.

The documentation provides a thorough understanding of the time-weighted EMA algorithm and the design philosophy prioritizing "good-put" over raw throughput.
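One plausible formulation of such a time-weighted EMA with idle decay, written out for reference (this is an assumption consistent with the recovery behavior described in the review; the module docs define the exact formula):

```latex
\alpha_i = 1 - e^{-\Delta t_i / \tau}, \qquad
\bar{x}_i = \alpha_i x_i + (1 - \alpha_i)\,\bar{x}_{i-1}, \qquad
\bar{x}(t) = \bar{x}_i\, e^{-(t - t_i)/\tau}
```

where $\tau$ is the configured time constant, $\Delta t_i$ is the gap between samples $x_{i-1}$ and $x_i$, and the last expression decays the average during idle periods after the most recent sample at time $t_i$.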
`46-101`: Well-designed configuration with proper validation.

The `RateLimiterConfig` properly validates inputs to ensure positive thresholds and a minimum time constant. The default values (1 s TTFT, 100 ms ITL, 30 s time constant) are reasonable starting points.
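To make the EMA mechanics concrete, here is a minimal sketch of a tracker with idle decay. The struct name, formula, and API are assumptions (the PR's `TimeWeightedAverageTracker` may differ in detail); the sketch takes elapsed time as an explicit argument rather than reading the clock, so its behavior is deterministic.

```rust
// Minimal time-weighted EMA sketch with idle decay (assumed formulation).
struct EmaTracker {
    tau_secs: f64, // time constant: larger tau = slower reaction and recovery
    value: f64,    // current exponential moving average
}

impl EmaTracker {
    fn new(tau_secs: f64) -> Self {
        Self { tau_secs, value: 0.0 }
    }

    // Blend in a new latency sample, weighted by time since the last sample:
    // a long gap makes the new sample count for more.
    fn record(&mut self, sample: f64, dt_secs: f64) {
        let alpha = 1.0 - (-dt_secs / self.tau_secs).exp();
        self.value = alpha * sample + (1.0 - alpha) * self.value;
    }

    // Decay the average toward zero during idle periods, modeling recovery
    // of an overloaded system that stops receiving traffic.
    fn decayed(&self, idle_secs: f64) -> f64 {
        self.value * (-idle_secs / self.tau_secs).exp()
    }
}

fn main() {
    let mut t = EmaTracker::new(15.0);
    t.record(2.0, 1.0);
    let before = t.value;
    // The average moves toward the sample but stays below it.
    assert!(before > 0.0 && before < 2.0);
    // After a long idle period the average has decayed toward zero.
    assert!(t.decayed(60.0) < before);
}
```

Both `record` and `decayed` are O(1), which matches the review's note that no sample history needs to be stored.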
`249-401`: Robust rate limiter implementation with good observability.

The implementation properly handles:
- Thread-safe concurrent access with DashMap
- Per-model vs global limiting based on configuration
- Informative logging at the 90% threshold for early warning
- Safe unwrap usage in `get_metrics` due to a prior existence check
`409-1171`: Exceptional test coverage ensuring correctness and robustness.

The test suite comprehensively covers:
- Mathematical correctness with quantitative verification
- Thread safety under concurrent access
- Numerical stability with extreme values and long time series
- Edge cases including single samples and rapid updates
- Per-model vs global limiting behavior

The tests provide strong confidence in the implementation's correctness.
lib/llm/benches/rate_limiter.rs (9)
`1-342`: Excellent comprehensive benchmark suite!

The benchmarks provide thorough coverage of the rate limiter functionality, including:
- Performance characteristics under different configurations
- Concurrent access patterns with varying thread counts
- Memory allocation and bounded growth behavior
- Edge cases with extreme values

This will be valuable for performance regression testing and optimization.

`1-12`: Well-structured benchmark setup with comprehensive test parameters.

The imports are appropriate, and the constants provide good coverage across different scales and scenarios for thorough performance testing.

`15-35`: Solid benchmark implementation for value recording performance.

The benchmark correctly measures sequential recording performance with proper throughput measurement and clean state initialization.

`68-93`: Well-designed benchmark for time constant impact analysis.

The benchmark effectively measures how different time constants affect performance with a clean and systematic approach.

`96-136`: Comprehensive rate limiter decision benchmarking.

The benchmark effectively measures all key rate limiter operations with realistic pre-populated data and proper performance isolation.

`139-181`: Excellent concurrent access benchmark with proper thread safety.

The benchmark effectively tests scalability across different thread counts with appropriate Arc usage and comprehensive throughput measurement.

`184-226`: Effective memory pattern benchmarking for performance under memory stress.

The benchmark appropriately tests performance characteristics under memory-intensive scenarios and per-model isolation patterns.

`284-327`: Comprehensive configuration comparison benchmark.

The benchmark effectively compares different rate limiter configurations with realistic usage patterns, providing valuable performance insights across different tuning profiles.

`329-341`: Proper Criterion framework setup.

The benchmark group and main function are correctly configured with all benchmark functions included.
lib/llm/tests/http-service.rs (2)
`476-561`: Well-designed mock engines for testing rate limiting scenarios.

The `SlowTTFTEngine` and `SlowITLEngine` implementations effectively simulate the specific performance characteristics needed to trigger rate limiting. The controlled delays allow for deterministic testing of the rate limiter behavior.

`562-1095`: Comprehensive test coverage for rate limiting functionality.

The test suite thoroughly covers:
- Rate limiting trigger conditions with configurable thresholds
- HTTP integration with 429 status codes
- Per-model vs global rate limiting behavior
- Recovery dynamics with EMA decay

Excellent use of async testing patterns and metrics verification.
lib/llm/src/http/service/metrics.rs (1)
`13-14`: Excellent integration of rate limiting metrics!

The implementation properly extends the existing metrics infrastructure with:
- Clear separation between regular and rate-limit-specific metrics
- Consistent naming conventions following the established pattern
- Proper documentation updates
- Clean integration with ResponseMetricCollector to record both standard and rate limiting metrics

The metrics will provide valuable observability into the rate limiter's behavior.

Also applies to: 23-25, 33-33, 41-42, 83-92, 97-98, 103-104, 131-148, 224-260, 264-328, 365-374, 404-424, 515-515, 521-524, 555-555, 576-576
```rust
// Rate limit check
// TODO: handle streaming, currently just unary
should_reject_request(
    &state,
    &request.inner.model,
    &Endpoint::Responses,
    &RequestType::Unary,
)?;
```
Address the TODO for streaming support in the responses endpoint.

The rate limit check currently hardcodes `RequestType::Unary`, which means streaming responses won't be properly categorized in metrics. This should be fixed to properly detect streaming mode.

Would you like me to help implement proper streaming detection for the responses endpoint?
🤖 Prompt for AI Agents
In lib/llm/src/http/service/openai.rs around lines 487 to 494, the rate limit
check uses a hardcoded RequestType::Unary, which does not correctly handle
streaming responses. Modify the code to detect if the request is for streaming
or unary dynamically, and pass the appropriate RequestType variant to
should_reject_request. This will ensure streaming responses are properly
categorized in metrics.
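The fix suggested in the prompt amounts to deriving the request type from the request's stream flag instead of hardcoding it. A hedged sketch, where the `stream: Option<bool>` field is an assumption based on the OpenAI request schema and the struct name is hypothetical:

```rust
// Hypothetical sketch: pick the RequestType from the request's `stream` flag
// rather than hardcoding Unary.
#[derive(Debug, PartialEq)]
enum RequestType {
    Unary,
    Stream,
}

struct ResponsesRequest {
    stream: Option<bool>, // assumed field, per the OpenAI-style schema
}

fn request_type(req: &ResponsesRequest) -> RequestType {
    // Absent or false means a unary (non-streaming) request.
    if req.stream.unwrap_or(false) {
        RequestType::Stream
    } else {
        RequestType::Unary
    }
}

fn main() {
    assert_eq!(request_type(&ResponsesRequest { stream: Some(true) }), RequestType::Stream);
    assert_eq!(request_type(&ResponsesRequest { stream: None }), RequestType::Unary);
}
```

The result would then be passed to `should_reject_request` in place of the hardcoded `&RequestType::Unary`.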
…r flagging rate limit settings, integrate rate limiter with dynamo launch logic
FYI, details on the cargo check failure:

I ran into a similar issue in the past, which I fixed by running
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

This PR has been closed due to inactivity. If you believe this PR is still relevant, please feel free to reopen it with additional context or information.
Overview:
This PR implements adaptive rate limiting for the HTTP service to prevent system overload during periods of degraded model performance. The rate limiter monitors Time-to-First-Token (TTFT) and Inter-Token-Latency (ITL) metrics using exponential moving averages and automatically rejects new requests when performance thresholds are exceeded.
Details:
Core Rate Limiting Logic:
Integration Points:
- OpenAI-compatible endpoints (`/v1/chat/completions`, `/v1/completions`) with early request rejection (429 status)

Why This Approach is Robust:
Key Files Added/Modified:
- `lib/llm/src/http/service/rate_limiter.rs` - Core rate limiting logic with TimeWeightedAverageTracker
- `lib/llm/src/http/service/service_v2.rs` - HTTP service integration and configuration
- `lib/llm/src/http/service/openai.rs` - Request rejection logic in endpoint handlers
- `lib/llm/src/http/service/metrics.rs` - Prometheus metrics for rate limiting
- `components/http/src/main.rs` - CLI argument support
- `lib/bindings/python/` - Python bindings for rate limiter configuration

Where should the reviewer start?
- `lib/llm/src/http/service/rate_limiter.rs` - Review the TimeWeightedAverageTracker EMA implementation and rate limiting decision logic
- `lib/llm/src/http/service/openai.rs` - Check the `should_reject_request()` integration in endpoint handlers
- `lib/llm/src/http/service/service_v2.rs` - Verify rate limiter configuration and state management
- `lib/llm/tests/http-service.rs` - Review integration tests covering rate limiting scenarios and recovery

Related Issues:
Summary by CodeRabbit
New Features
Documentation
Tests
Bug Fixes