andrewjstone (Contributor)

We ensure that messages don't get sent to crashed nodes and that API calls on crashed nodes are not triggered.

We clear all in-memory state on node restart, while maintaining persistent state.

This builds on #8984
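
A minimal sketch of those two invariants as a test harness might encode them; every name here (`TestState`, `deliver`, `callable_nodes`, `restart`) is a hypothetical stand-in, not the actual test code:

```rust
use std::collections::{BTreeMap, BTreeSet};

type NodeId = usize;

// Hypothetical stand-ins for the real harness types.
#[derive(Default)]
struct PersistentState; // committed configurations, key shares, ...

#[derive(Default)]
struct NodeCtx {
    persistent: PersistentState,
    in_memory: Vec<String>, // volatile state: received messages, timers, ...
}

#[derive(Default)]
struct TestState {
    nodes: BTreeMap<NodeId, NodeCtx>,
    crashed_nodes: BTreeSet<NodeId>,
}

impl TestState {
    /// Messages destined for a crashed node are dropped, never delivered.
    fn deliver(&mut self, to: NodeId, msg: String) {
        if self.crashed_nodes.contains(&to) {
            return; // crashed nodes receive nothing
        }
        if let Some(node) = self.nodes.get_mut(&to) {
            node.in_memory.push(msg);
        }
    }

    /// API calls are only generated for nodes that are alive.
    fn callable_nodes(&self) -> Vec<NodeId> {
        self.nodes
            .keys()
            .copied()
            .filter(|id| !self.crashed_nodes.contains(id))
            .collect()
    }

    /// Restart: clear all in-memory state, keep persistent state.
    fn restart(&mut self, id: NodeId) {
        self.crashed_nodes.remove(&id); // mark the node alive again
        if let Some(node) = self.nodes.get_mut(&id) {
            node.in_memory.clear(); // `persistent` is deliberately untouched
        }
    }
}

fn main() {
    let mut s = TestState::default();
    s.nodes.insert(0, NodeCtx::default());
    s.nodes.insert(1, NodeCtx::default());
    s.crashed_nodes.insert(1);
    s.deliver(1, "hello".into()); // silently dropped
    assert_eq!(s.callable_nodes(), vec![0]);
    s.restart(1);
    assert_eq!(s.callable_nodes(), vec![0, 1]);
}
```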

andrewjstone added a commit that referenced this pull request Sep 4, 2025
`Faults` has become a layer of indirection for reaching `crashed_nodes`.
Early on when writing this test I figured that we'd have separate
actions for connecting and disconnecting nodes in addition to crashing
and restarting them. While I didn't open up the possibility of
asymmetric connectivity (hard to do realistically with TLS!), I made it
so that we could track connectivity between alive nodes.

With further reflection this seems unnecessary. As of #8993, we crash
and restart nodes. We anticipate that on restart every alive node will
reconnect at some point, and reconnection can trigger the sending of
messages destined for a crashed node. This is how retries are
implemented in this connection-oriented protocol. So the only real thing
we are trying to ensure is that those retried messages get interleaved
upon connection and don't always end up delivered in the same order at
the destination node. This is accomplished by randomising the connection
order. If we decide later on that we want to interleave connections via
a new action, we can add similar logic and remove the automatic
`on_connect` calls.
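
A minimal sketch of that randomised connection order, assuming the `rand` crate; `on_connect` here is a hypothetical stand-in for the real callback:

```rust
use rand::rngs::StdRng;
use rand::seq::SliceRandom;
use rand::SeedableRng;

type NodeId = usize;

/// Hypothetical: on (re)connection a peer flushes any messages it had
/// queued for the destination while it was down. This is how retries
/// work in a connection-oriented protocol.
fn on_connect(peer: NodeId, restarted: NodeId) {
    println!("node {peer} connects to node {restarted} and resends queued messages");
}

fn main() {
    // Deterministic RNG so the test itself stays reproducible.
    let mut rng = StdRng::seed_from_u64(0xfeed);
    let restarted: NodeId = 3;
    let mut alive: Vec<NodeId> = vec![0, 1, 2, 4];

    // Shuffle so that retried messages from different peers get
    // interleaved at the destination rather than always arriving
    // in the same order.
    alive.shuffle(&mut rng);
    for peer in alive {
        on_connect(peer, restarted);
    }
}
```
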
@andrewjstone force-pushed the tq-cluster-test-crash-restart branch 2 times, most recently from e0e82be to 847f524 on September 4, 2025 21:00
andrewjstone (Contributor, Author)

There's a bug here: I forgot to actually remove the restarted node from the set of crashed nodes. This causes failing tests, which I'm digging into.

andrewjstone (Contributor, Author)

Fixed in fe1fe40
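
In terms of the hypothetical `restart` sketched earlier in the thread, the fix in fe1fe40 amounts to the one bookkeeping line that returns the node to the live set:

```rust
fn restart(&mut self, id: NodeId) {
    // The forgotten line: without it the restarted node stays in
    // `crashed_nodes`, so messages to it keep getting dropped and its
    // API calls keep getting skipped, hence the failing tests.
    self.crashed_nodes.remove(&id);
    if let Some(node) = self.nodes.get_mut(&id) {
        node.in_memory.clear(); // persistent state still survives
    }
}
```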

Base automatically changed from tq-load-rack-secret-2 to main September 17, 2025 18:49
@andrewjstone force-pushed the tq-cluster-test-crash-restart branch from fe1fe40 to c65af26 on September 17, 2025 18:53
@andrewjstone enabled auto-merge (squash) on September 17, 2025 19:07
@andrewjstone merged commit e612e09 into main on September 17, 2025
16 checks passed
@andrewjstone deleted the tq-cluster-test-crash-restart branch on September 17, 2025 20:43
charliepark pushed a commit that referenced this pull request Sep 19, 2025
charliepark pushed a commit that referenced this pull request Sep 19, 2025