andrewjstone (Contributor)

We ensure that messages don't get sent to crashed nodes and that API calls on crashed nodes are not triggered.

We clear all in-memory state on node restart, while maintaining persistent state.

This builds on #8984
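
A minimal sketch of those two invariants as a test harness might encode them; every name here (`TestState`, `deliver`, `callable_nodes`, `restart`) is a hypothetical stand-in, not the actual test code:

```rust
use std::collections::{BTreeMap, BTreeSet};

type NodeId = usize;

// Hypothetical stand-ins for the real harness types.
#[derive(Default)]
struct PersistentState; // committed configurations, key shares, ...

#[derive(Default)]
struct NodeCtx {
    persistent: PersistentState,
    in_memory: Vec<String>, // volatile state: received messages, timers, ...
}

#[derive(Default)]
struct TestState {
    nodes: BTreeMap<NodeId, NodeCtx>,
    crashed_nodes: BTreeSet<NodeId>,
}

impl TestState {
    /// Messages destined for a crashed node are dropped, never delivered.
    fn deliver(&mut self, to: NodeId, msg: String) {
        if self.crashed_nodes.contains(&to) {
            return; // crashed nodes receive nothing
        }
        if let Some(node) = self.nodes.get_mut(&to) {
            node.in_memory.push(msg);
        }
    }

    /// API calls are only generated for nodes that are alive.
    fn callable_nodes(&self) -> Vec<NodeId> {
        self.nodes
            .keys()
            .copied()
            .filter(|id| !self.crashed_nodes.contains(id))
            .collect()
    }

    /// Restart: clear all in-memory state, keep persistent state.
    fn restart(&mut self, id: NodeId) {
        self.crashed_nodes.remove(&id); // mark the node alive again
        if let Some(node) = self.nodes.get_mut(&id) {
            node.in_memory.clear(); // `persistent` is deliberately untouched
        }
    }
}

fn main() {
    let mut s = TestState::default();
    s.nodes.insert(0, NodeCtx::default());
    s.nodes.insert(1, NodeCtx::default());
    s.crashed_nodes.insert(1);
    s.deliver(1, "hello".into()); // silently dropped
    assert_eq!(s.callable_nodes(), vec![0]);
    s.restart(1);
    assert_eq!(s.callable_nodes(), vec![0, 1]);
}
```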

andrewjstone added a commit that referenced this pull request Sep 4, 2025
`Faults` has become a layer of indirection for reaching `crashed_nodes`.
Early on when writing this test I figured that we'd have separate
actions for connecting and disconnecting nodes in addition to crashing
and restarting them. While I didn't open up the possibility of
asymmetric connectivity (hard to do realistically with TLS!), I made it
so that we could track connectivity between alive nodes.

With further reflection this seems unnecessary. As of #8993, we crash
and restart nodes. We anticipate that on restart every alive node will
reconnect at some point, and reconnection can trigger the sending of
messages destined for a crashed node. This is how retries are
implemented in this connection-oriented protocol. So the only real thing
we are trying to ensure is that those retried messages get interleaved
upon connection and don't always end up delivered in the same order at
the destination node. This is accomplished by randomising the connection
order. If we decide later on that we want to interleave connections via
a new action, we can add similar logic and remove the automatic
`on_connect` calls.
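
A minimal sketch of that randomised connection order, assuming the `rand` crate; `on_connect` here is a hypothetical stand-in for the real callback:

```rust
use rand::rngs::StdRng;
use rand::seq::SliceRandom;
use rand::SeedableRng;

type NodeId = usize;

/// Hypothetical: on (re)connection a peer flushes any messages it had
/// queued for the destination while it was down. This is how retries
/// work in a connection-oriented protocol.
fn on_connect(peer: NodeId, restarted: NodeId) {
    println!("node {peer} connects to node {restarted} and resends queued messages");
}

fn main() {
    // Deterministic RNG so the test itself stays reproducible.
    let mut rng = StdRng::seed_from_u64(0xfeed);
    let restarted: NodeId = 3;
    let mut alive: Vec<NodeId> = vec![0, 1, 2, 4];

    // Shuffle so that retried messages from different peers get
    // interleaved at the destination rather than always arriving
    // in the same order.
    alive.shuffle(&mut rng);
    for peer in alive {
        on_connect(peer, restarted);
    }
}
```
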
@andrewjstone force-pushed the tq-cluster-test-crash-restart branch 2 times, most recently from e0e82be to 847f524 on September 4, 2025 21:00
andrewjstone (Contributor, Author)

There's a bug here: I forgot to actually remove the restarted node from the set of crashed nodes. This causes failing tests, which I'm digging into.

andrewjstone (Contributor, Author)

Fixed in fe1fe40
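
In terms of the hypothetical `restart` sketched earlier in the thread, the fix in fe1fe40 amounts to the one bookkeeping line that returns the node to the live set:

```rust
fn restart(&mut self, id: NodeId) {
    // The forgotten line: without it the restarted node stays in
    // `crashed_nodes`, so messages to it keep getting dropped and its
    // API calls keep getting skipped, hence the failing tests.
    self.crashed_nodes.remove(&id);
    if let Some(node) = self.nodes.get_mut(&id) {
        node.in_memory.clear(); // persistent state still survives
    }
}
```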

Base automatically changed from tq-load-rack-secret-2 to main September 17, 2025 18:49
@andrewjstone force-pushed the tq-cluster-test-crash-restart branch from fe1fe40 to c65af26 on September 17, 2025 18:53
@andrewjstone enabled auto-merge (squash) on September 17, 2025 19:07
@andrewjstone merged commit e612e09 into main on September 17, 2025
16 checks passed
@andrewjstone deleted the tq-cluster-test-crash-restart branch on September 17, 2025 20:43
charliepark pushed a commit that referenced this pull request Sep 19, 2025
charliepark pushed a commit that referenced this pull request Sep 19, 2025