Skip to content

Conversation

jmwample
Copy link
Contributor

@jmwample jmwample commented Oct 8, 2025

Ticket

Net-596 / NET-512 / NET-343

Description

  1. While connecting using the Quic transport, there was an issue (specifics don’t really matter, the gateway was incompatible with quic). While it was waiting in attempting to connect to a gateway I knew wouldn’t work I hit ctrl-c (or sent disconnect) to stop the connect attempt. The cancel blocked on the quic connection failng, i.e. timing out. This should definitely not block this way. The quic connection should be aborted when the connect attempt is cancelled.

  2. If an error occurs in the UdpForwarder it will stop and cancel it’s (local) CancellationToken. However, there is no mechanism at the moment by which that error is propagated back to the tunnel state to inform it that the tunnel has entered an error state and should shut down and potentially attempt to reconnect.

Checklist:

  • condition quic connection establishment on statemachine cancellation
  • tunnel subsystem exit handling and propagation -- i.e. if UDP forwarder closes early, close tunnel.

This change is Reviewable

@jmwample jmwample requested a review from pronebird October 8, 2025 21:34
Copy link
Collaborator

@pronebird pronebird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pronebird reviewed 4 of 4 files at r1.
Reviewable status: all files reviewed (commit messages unreviewed), all discussions resolved (waiting on @jmwample)

@jmwample jmwample marked this pull request as ready for review October 9, 2025 20:21
Copy link
Collaborator

@pronebird pronebird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pronebird reviewed 1 of 3 files at r2.
Reviewable status: 2 of 4 files reviewed, 1 unresolved discussion (waiting on @doums, @mrkiwi, @neacsu, and @rokas-ambrazevicius)


nym-vpn-core/crates/nym-vpn-lib/src/tunnel_state_machine/tunnel_monitor.rs line 891 at r2 (raw file):

            let bridge_conn = transports::BridgeConn::try_connect(
                entry_bridge_params,
                self.shutdown_token.clone(),

Prefer child token to avoid uncontroller propagation of cancellation.

Copy link
Collaborator

@pronebird pronebird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 4 files reviewed, 2 unresolved discussions (waiting on @doums, @jmwample, @mrkiwi, @neacsu, and @rokas-ambrazevicius)


nym-vpn-core/crates/nym-vpn-lib/src/tunnel_state_machine/tunnel_monitor.rs line 899 at r2 (raw file):

                transports::UdpForwarder::launch(bridge_conn, None, self.shutdown_token.clone())
                    .await
                    .inspect_err(|_| self.shutdown_token.cancel())?;

Why is it needed to call cancellation, is it not enough to return an error from here?

Copy link
Collaborator

@pronebird pronebird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pronebird reviewed 1 of 3 files at r2.
Reviewable status: 3 of 4 files reviewed, 2 unresolved discussions (waiting on @doums, @jmwample, @mrkiwi, @neacsu, and @rokas-ambrazevicius)

Copy link
Collaborator

@pronebird pronebird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pronebird reviewed 1 of 3 files at r2.
Reviewable status: all files reviewed (commit messages unreviewed), 2 unresolved discussions (waiting on @doums, @jmwample, @mrkiwi, @neacsu, and @rokas-ambrazevicius)

Copy link
Contributor Author

@jmwample jmwample left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed (commit messages unreviewed), 2 unresolved discussions (waiting on @doums, @mrkiwi, @neacsu, @pronebird, and @rokas-ambrazevicius)


nym-vpn-core/crates/nym-vpn-lib/src/tunnel_state_machine/tunnel_monitor.rs line 891 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Prefer child token to avoid uncontroller propagation of cancellation.

This is the same issue as returning an error, these steps are on the critical connection path, so if they fail then the connection needs a clean retry. Using a child token here means I have to rely on the returned error triggering the retry properly, which it is not doing.

wrt. uncontrolled propagation of cancel - that seems like an issue for calling funtions to scope their token / child tokens properly.


nym-vpn-core/crates/nym-vpn-lib/src/tunnel_state_machine/tunnel_monitor.rs line 899 at r2 (raw file):

Previously, pronebird (Andrej Mihajlov) wrote…

Why is it needed to call cancellation, is it not enough to return an error from here?

Returning an error didn't work in tests. When I just return an error then something doesn't get shutdown properly as it proceeds to retries. The first error will be reported properly and it moves to retries, it then continuously fails with the error:

2025-10-09T18:16:16.333164Z  INFO perform_initial_authentication: nym_gateway_client::client: close time.busy=11.6ms time.idle=1.32s gateway=<GATEWAY_ID> gateway_address=wss://mainnet-gateway-de.dev.nymte.ch:9001/
2025-10-09T18:16:16.333356Z ERROR nym_client_core::client::base_client: Could not authenticate and start up the gateway connection - gateway returned an error response: There is already an open connection to this client
2025-10-09T18:16:16.334095Z ERROR nym_vpn_lib::tunnel_state_machine::tunnel_monitor: Error: Tunnel monitor exited with error
Caused by: tunnel error
Caused by: registration client error
Caused by: failed to connect to mixnet
Caused by: gateway client error (<GATEWAY_ID>): gateway returned an error response: There is already an open connection to this client
Caused by: gateway returned an error response: There is already an open connection to this client
2025-10-09T18:16:16.335200Z  INFO nym_vpn_lib::tunnel_state_machine::states::connecting_state: Tunnel closed
2025-10-09T18:16:16.335295Z  INFO nym_vpn_lib::tunnel_state_machine::states::connecting_state: Reconnecting, attempt 2

Cancelling the token ensures that on heading towards retry things are stopped and then started again.

Copy link
Collaborator

@pronebird pronebird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed (commit messages unreviewed), 2 unresolved discussions (waiting on @doums, @jmwample, @mrkiwi, @neacsu, and @rokas-ambrazevicius)


nym-vpn-core/crates/nym-vpn-lib/src/tunnel_state_machine/tunnel_monitor.rs line 891 at r2 (raw file):

Previously, jmwample (Jack Wampler) wrote…

This is the same issue as returning an error, these steps are on the critical connection path, so if they fail then the connection needs a clean retry. Using a child token here means I have to rely on the returned error triggering the retry properly, which it is not doing.

wrt. uncontrolled propagation of cancel - that seems like an issue for calling funtions to scope their token / child tokens properly.

Alright I think I'll add a drop guard in TunnelMonitor::run_inner() instead of doing it here.

Copy link
Collaborator

@pronebird pronebird left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: all files reviewed (commit messages unreviewed), all discussions resolved (waiting on @doums, @mrkiwi, @neacsu, and @rokas-ambrazevicius)

@pronebird pronebird merged commit 506273e into develop Oct 10, 2025
21 of 22 checks passed
@pronebird pronebird deleted the jmwample/quic-failure-handling branch October 10, 2025 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants