Conversation

@andrewjstone andrewjstone commented Sep 4, 2025

`Faults` has become a layer of indirection for reaching `crashed_nodes`. Early on when writing this test I figured that we'd have separate actions for connecting and disconnecting nodes in addition to crashing and restarting them. While I didn't open up the possibility of asymmetric connectivity (hard to do realistically with TLS!), I made it so that we could track connectivity between alive nodes.

On further reflection this seems unnecessary. As of #8993, we crash and restart nodes. We anticipate that, on restart, every alive node will reconnect at some point. Reconnection can trigger the sending of messages destined for the node that had crashed. This is how retries are implemented in this connection-oriented protocol. So the only real thing we are trying to ensure is that those retried messages get interleaved upon connection and don't always end up delivered in the same order at the destination node. This is accomplished by randomising the connection order. If we decide later on that we want to interleave connections via a new action, we can add similar logic and remove the automatic `on_connect` calls.
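For readers skimming the PR, here is a rough sketch of the idea described above: once only `crashed_nodes` is tracked, a restart can compute the set of alive peers and shuffle it to get the reconnection order. This is not the code in this branch. `TestState`, `all_nodes`, `restart`, `NodeId`, and the small `XorShift64` generator are all hypothetical names chosen for illustration, and the generator only stands in for whatever source of randomness the test actually uses.

```rust
use std::collections::BTreeSet;

/// Hypothetical node identifier; the real test has its own ID type.
type NodeId = u32;

/// Tiny xorshift64 generator so the sketch needs no external crates.
/// (Assumption: the real test draws its randomness from elsewhere.)
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift state must never be zero.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Value in `0..bound`; modulo bias is acceptable for a sketch.
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Hypothetical test state: no per-link connectivity, only crashed nodes.
struct TestState {
    all_nodes: BTreeSet<NodeId>,
    crashed_nodes: BTreeSet<NodeId>,
}

impl TestState {
    /// Mark `id` as alive again and return the alive peers in the
    /// randomised order in which they would reconnect to it.
    fn restart(&mut self, id: NodeId, rng: &mut XorShift64) -> Vec<NodeId> {
        self.crashed_nodes.remove(&id);

        let mut peers: Vec<NodeId> = self
            .all_nodes
            .iter()
            .copied()
            .filter(|n| *n != id && !self.crashed_nodes.contains(n))
            .collect();

        // Fisher-Yates shuffle, so retried messages are not always
        // delivered to the restarted node in the same order.
        for i in (1..peers.len()).rev() {
            let j = rng.below(i + 1);
            peers.swap(i, j);
        }
        peers
    }
}
```

A caller would iterate over the returned `Vec` and invoke its connect handler for each peer in turn. Swapping the seed changes the interleaving while keeping a failing run reproducible.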

@andrewjstone andrewjstone force-pushed the tq-test-utils-remove-faults branch from b1c62ab to bd56c5f Compare September 4, 2025 21:15
@andrewjstone andrewjstone force-pushed the tq-test-utils-remove-faults branch from bd56c5f to d70dec4 Compare September 5, 2025 19:43
Not sure why clippy didn't catch this one.