TQ: Add node crash/restart actions to cluster test #8993
Merged
Conversation
andrewjstone added a commit that referenced this pull request on Sep 4, 2025
`Faults` has become a layer of indirection for reaching `crashed_nodes`. Early on, when writing this test, I figured that we'd have separate actions for connecting and disconnecting nodes in addition to crashing and restarting them. While I didn't open up the possibility of asymmetric connectivity (hard to do realistically with TLS!), I made it so that we could track connectivity between alive nodes. On further reflection, this seems unnecessary. As of #8993, we crash and restart nodes. We anticipate that, on restart, every alive node will reconnect at some point. Reconnection can trigger the sending of messages destined for a crashed node. This is how retries are implemented in this connection-oriented protocol. So the only real thing we are trying to ensure is that those retried messages get interleaved upon connection and don't always end up delivered in the same order at the destination node. This is accomplished by randomising the connection order. If we decide later on that we want to interleave connections via a new action, we can add similar logic and remove the automatic `on_connect` calls.
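For a sense of what randomising the connection order looks like, here is a minimal sketch, assuming hypothetical `NodeId`/`System` types and an `on_connect` hook that flushes retried messages; the real test harness in this PR will differ in its details:

```rust
use rand::seq::SliceRandom;
use rand::Rng;

// Hypothetical stand-ins for the real test-harness types.
type NodeId = usize;

struct System;

impl System {
    // Stand-in for the hook that flushes any messages queued (retried) for a
    // destination that just became reachable again.
    fn on_connect(&mut self, _from: NodeId, _to: NodeId) {}
}

/// Reconnect a restarted node to every other alive node in a random order, so
/// that retried messages are not always delivered in the same order at the
/// destination.
fn reconnect_restarted_node(
    restarted: NodeId,
    alive: &[NodeId],
    rng: &mut impl Rng,
    system: &mut System,
) {
    let mut peers: Vec<NodeId> =
        alive.iter().copied().filter(|&id| id != restarted).collect();
    peers.shuffle(rng);
    for peer in peers {
        system.on_connect(peer, restarted);
    }
}
```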
andrewjstone force-pushed from e0e82be to 847f524
andrewjstone added a commit that referenced this pull request on Sep 4, 2025
andrewjstone added a commit that referenced this pull request on Sep 4, 2025
andrewjstone added a commit that referenced this pull request on Sep 4, 2025
There's a bug here. I forgot to actually remove the restarted node from the set of crashed nodes. This causes failing tests, which I'm digging into.
Fixed in fe1fe40
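For context, the missing step amounts to something like the following sketch, with hypothetical names; the actual fix in fe1fe40 will differ in detail:

```rust
use std::collections::BTreeSet;

// Hypothetical stand-in for the node identifier type used by the test.
type NodeId = usize;

struct TestState {
    crashed_nodes: BTreeSet<NodeId>,
    // ... the rest of the in-memory test state
}

impl TestState {
    /// Restart a previously crashed node. The bug described above was the
    /// missing `remove` call: the node stayed in `crashed_nodes`, so the
    /// harness kept treating it as down after the restart action ran.
    fn restart_node(&mut self, id: NodeId) {
        let was_crashed = self.crashed_nodes.remove(&id);
        assert!(was_crashed, "can only restart a node that is crashed");
        // ... clear the node's in-memory state and let peers reconnect
    }
}
```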
andrewjstone added a commit that referenced this pull request on Sep 5, 2025
We ensure that messages don't get sent to crashed nodes and that API calls on the crashed nodes are not triggered. We clear all in-memory state on node restart, while maintaining persistent state.
andrewjstone force-pushed from fe1fe40 to c65af26
andrewjstone added a commit that referenced this pull request on Sep 17, 2025
andrewjstone added a commit that referenced this pull request on Sep 17, 2025
charliepark pushed a commit that referenced this pull request on Sep 19, 2025
We ensure that messages don't get sent to crashed nodes and that API calls on the crashed nodes are not triggered. We clear all in-memory state on node restart, while maintaining persistent state. This builds on #8984.
charliepark pushed a commit that referenced this pull request on Sep 19, 2025
We ensure that messages don't get sent to crashed nodes and that API calls on the crashed nodes are not triggered.
We clear all in-memory state on node restart, while maintaining persistent state.
This builds on #8984
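A minimal sketch of that restart semantic, using hypothetical types rather than the actual ones in this PR: persistent state survives the restart untouched, while everything ephemeral is rebuilt from scratch.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins; the real node state in this PR is richer.
type NodeId = usize;

/// Durable state that survives a crash/restart (e.g. committed configurations).
#[derive(Clone, Default)]
struct PersistentState {
    committed: BTreeMap<u64, String>,
}

/// Volatile state that is lost on restart (connections, queued messages).
#[derive(Default)]
struct EphemeralState {
    connected_peers: Vec<NodeId>,
    outbound_queue: Vec<(NodeId, String)>,
}

struct Node {
    persistent: PersistentState,
    ephemeral: EphemeralState,
}

impl Node {
    /// Restarting rebuilds all in-memory (ephemeral) state while leaving the
    /// persistent state untouched, mirroring a process restart on top of
    /// durable storage.
    fn restart(&mut self) {
        self.ephemeral = EphemeralState::default();
        // `self.persistent` is intentionally left as-is.
    }
}
```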