@dav-is
Member

dav-is commented Oct 8, 2025

Uses a separate index based on https://v5.mui.com/. We remove the master filter because the version is set to v5 everywhere except Toolpad, which has master as its version. There are only two versions, so there's little reason to filter.

Adds a new crawler that crawls once a month.

Fix: #45771

@dav-is added the labels "type: bug" (It doesn't behave as expected.) and "scope: docs-infra" (Involves the docs-infra product, https://www.notion.so/mui-org/b9f676062eb94747b6768209f7751305) on Oct 8, 2025
@mui-bot

mui-bot commented Oct 8, 2025

Netlify deploy preview

https://deploy-preview-47049--material-ui.netlify.app/

Bundle size report

No bundle size changes (Toolpad)
No bundle size changes

Generated by 🚫 dangerJS against 0eb6aff

@Janpot
Member

Janpot commented Oct 8, 2025

So the alternative would be to keep a single index, but versioned? That is, create one crawler per version, with all of them writing to the same index.

@dav-is
Member Author

dav-is commented Oct 8, 2025

@Janpot Why put old data into the main index? The index has page content in it. We can have up to 20 indexes in Algolia, so why create a large monolithic index when it can be partitioned according to usage?

@Janpot
Member

Janpot commented Oct 8, 2025

> Why put old data into the main index?

I haven't thought too deeply about it; it was just intuition. But thinking a bit more about it:

  • Less forking of the frontend search code across major versions; the search code can just filter based on an env var that probably already exists.
  • We're currently only using 0.1% of the available space or so, so I wouldn't necessarily call it monolithic.
  • Would allow for decoupled release cycles of the sub-products, i.e. separate indexes strongly couple our index creation to the idea that we do synchronised releases across the products.
  • Allows for searching across versions should we want that at some point (I don't have an immediate use case).
  • Maybe some day we want to index every minor version? With LLMs it may become important to be able to do finer-grained search per version. In that case we're soon going to need more than 20 indices.
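
To make the first bullet concrete, here is a minimal sketch of a single index filtered by version. `facetFilters` is a standard Algolia search parameter; the env var name `DOCS_VERSION`, the facet attribute `version`, and the `master` fallback are assumptions for illustration, not taken from the repo.

```typescript
// Sketch only: DOCS_VERSION and the "version" facet are hypothetical names.
type VersionEnv = Record<string, string | undefined>;

interface VersionedSearchParams {
  // Algolia's facetFilters parameter accepts "attribute:value" strings.
  facetFilters: string[];
}

// Build the search parameters that restrict results to one docs version.
function buildVersionedSearchParams(env: VersionEnv): VersionedSearchParams {
  const version = env.DOCS_VERSION?.trim() || 'master';
  return { facetFilters: [`version:${version}`] };
}
```

With this shape, every major version would ship the same search code and differ only in the environment it is built with.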

@dav-is
Member Author

dav-is commented Oct 8, 2025

@Janpot

> Less forking of the frontend search code across major versions, the search code can just filter based on an env var that probably already exists

The index name can just as easily be an ENV variable.
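
A rough sketch of what that could look like. The env var name `DOCS_SEARCH_INDEX` and the index names are made up for illustration; the real names in the repo may differ.

```typescript
// Sketch only: DOCS_SEARCH_INDEX and both index names are hypothetical.
type IndexEnv = Record<string, string | undefined>;

const DEFAULT_SEARCH_INDEX = 'material-ui';

// Pick the search index from the environment, falling back to the default
// index when the variable is missing or blank.
function resolveSearchIndexName(env: IndexEnv): string {
  const name = env.DOCS_SEARCH_INDEX?.trim();
  return name ? name : DEFAULT_SEARCH_INDEX;
}
```

The v5 branch would then set `DOCS_SEARCH_INDEX` to its own index at build time, and the search code itself stays identical across branches.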

> We're currently only using 0.1% of the available space or so, I wouldn't necessarily call it monolithic.

If we extend the idea of indexing multiple major versions to new packages, such as Base UI, a 40 MB index (the size of today's index) could already be considered quite significant, and it is close to the limit of what fits into a serverless function.

> Would allow for decoupled release cycles of the sub products. i.e., it strongly couples our index creation to the idea that we do synchronized releases across the products.

I think of previous major versions as an archive. They exist on a dedicated branch and receive mostly backports for serious fixes. There are also some logistical hurdles to maintaining a separate branch when packages have decoupled releases (for example, how "v7" of MUI X is frozen to 7.0.2 of Material UI). We treat major releases as entirely separate "branches" of content; otherwise, why would we have a subdomain? If the docs of each major version were stored in a subdirectory maintained in the master branch, that might be a case for a single index, but I don't think that's a scalable approach either.

Major versions are the only time when deprecated features are deliberately removed, so each new major version is also a chance to prune the index: we are choosing to remove unhelpful context. With a monolithic index, each major version instead significantly increases the index's size.

You would expect the latest index to receive a lot of traffic, the previous major version to receive less, and the major versions before that to receive significantly less traffic. This is why lumping them into one seems monolithic to me.

There is also precedent for splitting the index by version: v4.mui.com uses a separate index and still works correctly today, even though that branch and index probably haven't been touched in a long time.

If indexes combine versions, then they become dependent on the crawler or a database to create the index. We can no longer assume that it is produced by the site content we have checked out in git. An index created for a PR would depend on outside information (or it would work differently from production).

> But maybe some day we want to index every minor version? Maybe with LLMs it potentially may become important to have the ability to do finer grained search per version. In that case we're soon going to need more than 20 indices.

It would make sense for minor versions to share an index. They should have far more in common with one another and evolve incrementally. Ideally, the content would reference older minors explicitly, e.g. "The checkbox component was added in v0.5.0". Then you could search for all features released in v0.4.0 vs v0.6.0.
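
A small sketch of that kind of query, assuming each record carries a hypothetical `addedIn` field (no such field is confirmed to exist in the current index), with the filtering done client-side for simplicity:

```typescript
// Sketch only: the record shape and the addedIn field are assumptions.
interface FeatureRecord {
  title: string;
  addedIn: string; // e.g. "v0.5.0"
}

// Split "v0.5.0" into its numeric parts: [0, 5, 0].
function parseVersionParts(version: string): number[] {
  return version.replace(/^v/, '').split('.').map(Number);
}

// Negative if a < b, zero if equal, positive if a > b.
function compareVersionStrings(a: string, b: string): number {
  const pa = parseVersionParts(a);
  const pb = parseVersionParts(b);
  for (let i = 0; i < Math.max(pa.length, pb.length); i += 1) {
    const diff = (pa[i] ?? 0) - (pb[i] ?? 0);
    if (diff !== 0) {
      return diff;
    }
  }
  return 0;
}

// Features added after `from` (exclusive) and up to `to` (inclusive).
function featuresAddedBetween(
  records: FeatureRecord[],
  from: string,
  to: string,
): FeatureRecord[] {
  return records.filter(
    (record) =>
      compareVersionStrings(record.addedIn, from) > 0 &&
      compareVersionStrings(record.addedIn, to) <= 0,
  );
}
```

For example, given records added in v0.3.0, v0.5.0, and v0.6.0, `featuresAddedBetween(records, 'v0.4.0', 'v0.6.0')` keeps only the last two.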

> Maybe with LLMs it potentially may become important to have the ability to do finer grained search per version.

I think with LLMs it is important to filter out information that might be misleading. My feeling is that even glancing at content from a previous major could confuse an LLM. A major version is meant to be cohesive, and previous majors don't account for later capabilities or improvements. For example, maybe a page on the last major suggests using a deprecated function that has since been removed. This recommendation would be deliberately removed in the latest version; however, if you're running the older version, the recommendation remains valid, and removing it from the older docs would also be incorrect.

> Allows for searching across versions should we want that at some point. (don't have an immediate use-case)

If we needed this, I think it would be on a separate page or context from the global docs search. We could create a heavier aggregate index for this case specifically. We could also optimize this index for that particular case: maybe we would exclude the content itself, or maybe we would add more metadata.
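
A sketch of how records for such an aggregate index could be derived. The record shapes and the `majorVersion` field are hypothetical; only `objectID` is an actual Algolia record attribute.

```typescript
// Sketch only: both record shapes are assumptions, not the real index schema.
interface PageRecord {
  objectID: string;
  url: string;
  title: string;
  content: string;
}

interface AggregateRecord {
  objectID: string;
  url: string;
  title: string;
  majorVersion: string; // extra metadata for cross-version search
}

// Drop the page content and tag the record with its major version.
// The objectID is prefixed so records from different majors don't collide.
function toAggregateRecord(record: PageRecord, majorVersion: string): AggregateRecord {
  return {
    objectID: `${majorVersion}:${record.objectID}`,
    url: record.url,
    title: record.title,
    majorVersion,
  };
}
```

Leaving out `content` keeps the aggregate index small even though it spans every major version.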
