Skip to content

Conversation

@MadLittleMods
Copy link
Contributor

@MadLittleMods MadLittleMods commented Nov 26, 2025

Be able to shutdown homeserver that failed to start

For example, a homeserver can fail to start if the port is already in use or the port number is invalid (not 0-65535)

Fix #19189

Follow-up to #18828

Pull Request Checklist

  • Pull request is based on the develop branch
  • Pull request includes a changelog file. The entry should:
    • Be a short description of your change which makes sense to users. "Fixed a bug that prevented receiving messages from other servers." instead of "Moved X method from EventStore to EventWorkerStore.".
    • Use markdown where necessary, mostly for code blocks.
    • End with either a period (.) or an exclamation mark (!).
    • Start with a capital letter.
    • Feel free to credit yourself, by adding a sentence "Contributed by @github_username." or "Contributed by [Your Name]." to the end of the entry.
  • Code style is correct (run the linters)

This actually turns out to work without failing because
the `MemoryReactor.listenTCP(...)` doesn't actually bind
to the port (just fakes it).
@MadLittleMods MadLittleMods added the A-Shutdown Cleanly shutting down the Synapse homeserver label Nov 26, 2025
We can't bind real ports in the Twisted tests since it's using
`MemoryReactor` and just fakes binding to ports. Doesn't actually
fail when we pass an invalid port, etc.
site,
context_factory,
reactor=reactor,
try:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Best to review this part with the "Hide whitespace" option when viewing the diff

Comment on lines +512 to +523
elif isinstance(listener_config, UnixListenerConfig):
ports = listen_unix(
listener_config.path, listener_config.mode, site, reactor=reactor
)
# getHost() returns a UNIXAddress which contains an instance variable of 'name'
# encoded as a byte string. Decode as utf-8 so pretty.
logger.info(
"Synapse now listening on TCP port %d (TLS)", listener_config.port
"Synapse now listening on Unix Socket at: %s",
ports[0].getHost().name.decode("utf-8"),
)
else:
ports = listen_tcp(
listener_config.bind_addresses,
listener_config.port,
site,
reactor=reactor,
)
logger.info("Synapse now listening on TCP port %d", listener_config.port)

else:
ports = listen_unix(
listener_config.path, listener_config.mode, site, reactor=reactor
)
# getHost() returns a UNIXAddress which contains an instance variable of 'name'
# encoded as a byte string. Decode as utf-8 so pretty.
logger.info(
"Synapse now listening on Unix Socket at: %s",
ports[0].getHost().name.decode("utf-8"),
)
assert_never(listener_config)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superfluous but I updated to use a more robust matching pattern since I used it in ListenerException above

Comment on lines +818 to +823
# Replace the resource tree with an empty resource to break circular references
# to the resource tree which holds a bunch of homeserver references. This is
# important if we try to call `hs.shutdown()` after `start` fails. For some
# reason, this doesn't seem to be necessary in the normal case where `start`
# succeeds and we call `hs.shutdown()` later.
self.resource = Resource()
Copy link
Contributor Author

@MadLittleMods MadLittleMods Nov 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spent too long trying to figure out why this works exactly. Specifically, why the normal case works without this change but the error case requires it.

The internal references to hs should be the same between the normal and error cases. And we have even less external references to SynapseSite since it just gets orphaned in this function context in the error case and drops away as soon as we handle the traceback/exception in the layers above.

In the normal case, Port holds a reference to the site and then when we call Port.stopListening(), it calls site.stopFactory() down the line just like we're doing in the error case now. But in the error case, it doesn't work without this additional change to sever the circular references.

I've looked through the Twisted internals to try to spot the difference but was unsuccessful. Also tried throwing an LLM at the problem but they were also unable to spot anything.

In any case, clearing circular references in these kinds of callbacks are pretty normal. For example, it's even called out in the HTTPChannel.connectionLost docstring. twisted.web.server.Site.stopFactory describes it as:

This can be overridden to perform 'shutdown' tasks such as disconnecting database connections, closing files, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to have this + the comment saying we don't understand why it's necessary only sometimes.

Replacing your inners with empty values on shutdown/destruction is totally fine as a practice to me.

Comment on lines +524 to +534
except Exception as exc:
# The Twisted interface says that "Users should not call this function
# themselves!" but this appears to be the correct/only way handle proper cleanup
# of the site when things go wrong. In the normal case, a `Port` is created
# which we can call `Port.stopListening()` on to do the same thing (but no
# `Port` is created when an error occurs).
#
# We use `site.stopFactory()` instead of `site.doStop()` as the latter assumes
# that `site.doStart()` was called (which won't be the case if an error occurs).
site.stopFactory()
raise ListenerException(listener_config) from exc
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried adding a test for this case but it's not effective as we're using the MemoryReactor in tests which doesn't actually try to bind any ports (just fakes it).

We do have a Complement test in the Synapse Pro for small hosts project (see element-hq/synapse-small-hosts -> complement/tests/multi_synapse/provision_test.go#L278-L295 which we will uncomment when this PR merges). I have tested that this PR does fix the problem (see https://github.com/element-hq/synapse-small-hosts/pull/326 for where I was playing around with this).

@MadLittleMods MadLittleMods marked this pull request as ready for review November 26, 2025 22:50
@MadLittleMods MadLittleMods requested a review from a team as a code owner November 26, 2025 22:50
Copy link
Contributor

@reivilibre reivilibre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always a minor shame to not have clearer answers, but can't blame you and fine with this

Comment on lines +818 to +823
# Replace the resource tree with an empty resource to break circular references
# to the resource tree which holds a bunch of homeserver references. This is
# important if we try to call `hs.shutdown()` after `start` fails. For some
# reason, this doesn't seem to be necessary in the normal case where `start`
# succeeds and we call `hs.shutdown()` later.
self.resource = Resource()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to have this + the comment saying we don't understand why it's necessary only sometimes.

Replacing your inners with empty values on shutdown/destruction is totally fine as a practice to me.

Comment on lines +525 to +526
# The Twisted interface says that "Users should not call this function
# themselves!" but this appears to be the correct/only way handle proper cleanup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heh, well, I guess it is what it is.
If this is only during error handling code I'm OK with this, otherwise I find the idea of doing something we're told not to do a bit unappealing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-Shutdown Cleanly shutting down the Synapse homeserver

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants