-
Notifications
You must be signed in to change notification settings - Fork 1.8k
[CHORE] use send_timeout in component channels #5405
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Reviewer ChecklistPlease leverage this checklist to ensure your code review is thorough before approving Testing, Bugs, Errors, Logs, Documentation
System Compatibility
Quality
|
Introduce send_timeout for Robust Inter-Component Messaging This PR revamps the messaging infrastructure across component channels by transitioning from an immediate-failure, non-blocking message send (try_send) to a blocking send with a configurable timeout (send_timeout). The change is motivated by observed failures (see Incident #20) where components, particularly the metering event producer, could not reliably enqueue work due to instantaneous capacity checks. Now, message senders will block for a specified timeout, after which a clear error is returned if enqueueing still fails. Configurations are fully backward-compatible with new parameters added where necessary. Key Changes• Component channels now use Affected Areas• System messaging primitives ( This summary was automatically generated by @propel-code-bot |
…per/send-timeout
@@ -42,6 +42,7 @@ pub enum ComponentRuntime { | |||
pub trait Component: Send + Sized + Debug + 'static { | |||
fn get_name() -> &'static str; | |||
fn queue_size(&self) -> usize; | |||
fn send_timeout(&self) -> Duration; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not have a default implementation here so as to avoid implementing for every concrete type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was mimicking the queue size - imo we shouldn't set a blanket send_timeout that gets applied to all Components by default. instead we should force implementers to define their own send_timeout based on their use case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like it is easy to shoot yourself in the foot with a wrong value. I'd advocate for a good default value in the base like < 5ms and only override it in places where we need a custom value
@@ -159,11 +160,18 @@ impl ConsumableJoinHandle { | |||
#[derive(Debug)] | |||
pub(crate) struct ComponentSender<C: Component> { | |||
sender: tokio::sync::mpsc::Sender<WrappedMessage<C>>, | |||
send_timeout: Duration, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does this need the param? If C is already a component that has a send_timeout() then why not directly use that here something like C::send_timeout()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because ComponentSender's wrap_and_send is generic over a C, it doesn't take a C reference, and send_timeout should take a C reference so that implementers can store their send_timeout in config like the queue size
@@ -62,6 +62,10 @@ pub trait Orchestrator: Debug + Send + Sized + 'static { | |||
1000 | |||
} | |||
|
|||
fn send_timeout(&self) -> Duration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this needed? Is this used anywhere?
@@ -443,6 +452,10 @@ mod tests { | |||
1000 | |||
} | |||
|
|||
fn send_timeout(&self) -> Duration { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't these be < 5ms?
Description of changes
To address Incident #20, we are adding a blocking send with a timeout to the inter-component messaging system. This will (hopefully) allow the metering event producer to enqueue work for the consumer more reliably, instead of failing on a single check of the queue's fullness as the current
try_send
does.Test plan
pytest
for python,yarn test
for js,cargo test
for rustMigration plan
Config is all backward-compatible. Necessary updates have been made where appropriate.
Observability plan
Monitor error frequency after hotfix.
Documentation Changes
N/A