Skip to content

Conversation

@erikjohnston
Copy link
Member

@erikjohnston erikjohnston commented Nov 20, 2025

Fixes #19175

This PR moves tracking of what lazy loaded membership we've sent to each room out of the required state table. This avoids that table from continuously growing, which massively helps performance as we pull out all matching rows for the connection when we receive a request.

The new table is only read when we have data in a room to send, so we end up reading a lot fewer rows from the DB. Though we now read from that table for every room we have events to return in, rather than once at the start of the request.

For an explanation of how the new table works, see the comment on the table schema.

The table is designed so that we can later prune old entries if we wish, but that is not implemented in this PR.

Reviewable commit-by-commit.

We then filter them out before sending to the client, but it is
unnecessary to do so and interferes with later changes.
This is so that clients know if they can use a cached `/members`
response or not.
@erikjohnston erikjohnston force-pushed the erikj/sss_better_membership_storage2 branch from f67e114 to 0d6ccbe Compare November 20, 2025 13:43
This ensures that the set of required state doesn't keep growing as we
add and remove member state. We then only load them from the DB when
needed, rather than all state for all rooms when we get a request.
It was thinking the table name was `IN`, as it matched
`connection_positi(on IS) NULL`.
@erikjohnston erikjohnston force-pushed the erikj/sss_better_membership_storage2 branch from 0d6ccbe to 4984858 Compare November 20, 2025 13:52
@erikjohnston erikjohnston marked this pull request as ready for review November 20, 2025 15:52
@erikjohnston erikjohnston requested a review from a team as a code owner November 20, 2025 15:52
Copy link
Contributor

@MadLittleMods MadLittleMods left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't fully onboarded onto the concept and details to be confident in the approach.

SCHEMA_FILE_REGEX = re.compile(r"^synapse/storage/schema/(.*)/delta/(.*)/(.*)$")
INDEX_CREATION_REGEX = re.compile(
r"CREATE .*INDEX .*ON ([a-z_0-9]+)", flags=re.IGNORECASE
r"CREATE .*INDEX .*ON ([a-z_0-9]+)\s+\(", flags=re.IGNORECASE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What edge case does this change solve? It doesn't seem relevant to the new indexes being added.

It looks like this just matches some extra faff at the end for some reason?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The commit message briefly mentions it, but the regex was matching against connection_position IS NULL and thinking the table name was IN (as position ends in "on") due to the greedy nature of wildcards.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superseded by #19224

Comment on lines +1539 to +1548
Attributes:
required_state_map_change: The updated required state map to store in
the room config, or None if there is no change.
added_state_filter: The state filter to use to fetch any additional
current state that needs to be returned to the client.
lazy_members_previously_returned: The set of user IDs we should add to
the lazy members cache that we had previously returned.
lazy_members_invalidated: The set of user IDs whose membership has
changed but we didn't send down, so we need to invalidate them from
the cache.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is the standard way for the docstring but some of these attributes are a bit tricky and I'd rather see the docstring when I hover the properties.

Potential to convert to the """ variant below the property itself

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's really annoying that VSCode doesn't pick these up, it does for functions.

I'm not sure about moving style, especially since its a bit confusing that the docstring for attributes go after the attribute.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does for functions because the spec says it should do that.

For attributes, it feels like unfortunate Python decisions but we get this from https://peps.python.org/pep-0257/ (2001)

String literals occurring immediately after a simple assignment at the top level of a module, class, or __init__ method are called “attribute docstrings”.

And expanded upon in https://peps.python.org/pep-0258/#attribute-docstrings (2001)

A string literal immediately following an assignment statement is interpreted by the docstring extraction machinery as the docstring of the target of the assignment statement, under the following conditions:

Comment on lines 1754 to 1761
# Limit the number of state_keys we should remember sending down the connection
# for each (room_id, user_id). We don't want to store and pull out too much data
# in the database. This is a happy-medium between remembering nothing and
# everything. We can avoid sending redundant state down the connection most of
# the time given that most rooms don't have 100 members anyway and it takes a
# while to cycle through 100 members.
#
# Only remember up to (MAX_NUMBER_PREVIOUS_STATE_KEYS_TO_REMEMBER)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this whole sliding_sync_connection_lazy_members refactor really help things?

We were already supposedly limiting the amount of data that we remembered so there was a cap.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few intertwined issues:

  1. The code assumes that the total required_state is small, as it reads everything on receiving a request. This makes it much easier to check for changes and reduces DB roundtrips
  2. To do the above we do deduping of required_state between rooms, which requires reading everything when persisting changes to required_state.
  3. Lazy loaded members stop the deduping from happening, and so the total required_state is much larger than expected.

By storing the lazy loaded members separately and only fetching them when we have data to send fixes the above issues. Alternatively, we could have tried to refactor things to avoid pulling out all the required_state on startup, but given the need to check for changes on required_state plus the fact that lazy loading is a special case anyway, I think this approach of storing the lazy members in a separate table is easier and clearer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to sync with you on this to understand properly.

I've added a meeting to the calendar on Monday (feel free to change).

Comment on lines 1089 to 1101
else:
# For non-limited timelines we always return all
# membership changes. This is so that clients
# who have fetched the full membership list
# already can continue to maintain it for
# non-limited syncs.
#
# This assumes that for non-limited syncs there
# won't be many membership changes that wouldn't
# have been included already (this can only
# happen if membership state was rolled back due
# to state resolution anyway).
required_state_types.append((EventTypes.Member, None))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a bigger behavioral change.

I think this fixes #18782 🤔 - If so, we should add a test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, did mean to factor that out but it sneaked in as it needs to be accounted for in the lazy loading stuff.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added with test_lazy_load_state_reset

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this only fixes it for non-limited syncs. I think we should also return state reset membership in limited timeline scenarios as well.

We should at-least leave a FIXME with a link to the issue in the if-block above.

@erikjohnston erikjohnston force-pushed the erikj/sss_better_membership_storage2 branch from fe94608 to ec45e00 Compare November 25, 2025 11:12
@erikjohnston erikjohnston force-pushed the erikj/sss_better_membership_storage2 branch from 6c2cf0d to 2e844aa Compare November 25, 2025 14:28
SCHEMA_FILE_REGEX = re.compile(r"^synapse/storage/schema/(.*)/delta/(.*)/(.*)$")
INDEX_CREATION_REGEX = re.compile(
r"CREATE .*INDEX .*ON ([a-z_0-9]+)", flags=re.IGNORECASE
r"CREATE .*INDEX .*ON ([a-z_0-9]+)\s+\(", flags=re.IGNORECASE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superseded by #19224


# How often to update the last seen timestamp for lazy members. We don't want to
# update it too often as that causes DB writes.
LAZY_MEMBERS_UPDATE_INTERVAL_MS = ONE_HOUR_SECONDS * MILLISECONDS_PER_SECOND
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs updating now that #19223 was merged

Comment on lines +1539 to +1548
Attributes:
required_state_map_change: The updated required state map to store in
the room config, or None if there is no change.
added_state_filter: The state filter to use to fetch any additional
current state that needs to be returned to the client.
lazy_members_previously_returned: The set of user IDs we should add to
the lazy members cache that we had previously returned.
lazy_members_invalidated: The set of user IDs whose membership has
changed but we didn't send down, so we need to invalidate them from
the cache.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it does for functions because the spec says it should do that.

For attributes, it feels like unfortunate Python decisions but we get this from https://peps.python.org/pep-0257/ (2001)

String literals occurring immediately after a simple assignment at the top level of a module, class, or __init__ method are called “attribute docstrings”.

And expanded upon in https://peps.python.org/pep-0258/#attribute-docstrings (2001)

A string literal immediately following an assignment statement is interpreted by the docstring extraction machinery as the docstring of the target of the assignment statement, under the following conditions:

When persisting this *only* contains updates to the state.
The `room_lazy_membership` field is only used when persisting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `room_lazy_membership` field is only used when persisting.
The `room_lazy_membership` field is only used when persisting (not reading from the database).

StateFilter.from_types([(EventTypes.Member, "@user4:test")]),
# Previous request did not include any explicit members,
# so nothing to store.
lazy_members_previously_returned=frozenset(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we see @user4:test here? We're going to send @user4:test in this sync request. And the relevant code:

# We remember the user if either a) they haven't been
# invalidated...
if (EventTypes.Member, state_key) not in state_deltas:
lazy_members_previously_returned.add(state_key)
# ...or b) if we are going to send the delta down in this
# sync.
if state_key in lazy_load_user_ids:
lazy_members_previously_returned.add(state_key)


Also applies to expected_without_state_deltas

Comment on lines +1568 to +1569
lazy_members_previously_returned: The set of user IDs we should add to
the lazy members cache that we had previously returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the usage below, this also includes members that will be returned in the current sync:

                    # ...or b) if we are going to send the delta down in this
                    # sync.

Perhaps needs a better name as well as updating the description here.

Comment on lines +1568 to +1569
lazy_members_previously_returned: The set of user IDs we should add to
the lazy members cache that we had previously returned.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lazy_members_previously_returned doesn't appear to be used anywhere but tests.

Is it useful to assert there? We should remove or rename to test_only_xxx

},
request_required_state_map={EventTypes.Member: {StateValues.LAZY}},
previously_returned_user_state=frozenset(),
required_user_state={"@user3:test"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, we're talking about state_key_expand_lazy_keep_previous_memberships.

In the previous version of this test, we never requested @user3:test but now it appears as lazy_load_user_ids={"@user3:test"},. Or put another way, lazy_load_user_ids={"@user3:test"}, came out of thin air

This is a new test variant that should be added instead of edited into an existing test.

(still relevant)

previous_required_state_map={EventTypes.Member: {StateValues.LAZY}},
request_required_state_map={EventTypes.Member: {"@user4:test"}},
previously_returned_lazy_user_ids={"@user2:test", "@user3:test"},
lazy_load_user_ids={"@user4:test"},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request is no longer lazy, why is this here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Slow sliding sync when connection metadata gets large

3 participants