Docs sites - read “last commit date/author” efficiently from Git #216

@slorber

Description

I’d like to ask the e18e community for help in solving a performance bottleneck that could speed up all Node-based documentation frameworks.

The task is to efficiently read the “last commit date / author” of thousands of files to use/display that metadata on your site.

Context

I’m the maintainer of Docusaurus, a content-driven (Markdown/MDX) static site generator, created by Meta, powering many technical docs websites (React Native, Babel, Prettier, Jest, Electron, Ionic, and more).

As many other docs-related frameworks do, a quite popular feature of Docusaurus is to display the “last update date / author” of a given page (example page):

[Screenshot: "Last updated" date and author metadata displayed at the bottom of a docs page]

After optimizing various aspects of Docusaurus, we found out that the way we retrieve that “last update date / author” info from Git is one of the most important remaining bottlenecks of the project, which could significantly impact build times on large projects.

The naive approach we have historically used is to spawn a git log <filePath> process with the relevant options for each file. This obviously doesn't scale well. We implemented a "git command queue" to avoid overwhelming the OS, but it still feels wrong to spawn one process per file. There are definitely better alternatives, but which one is best? Here's our original Docusaurus issue that tracks this performance problem.
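For reference, the naive per-file strategy can be sketched roughly like this (a simplified illustration, not the actual Docusaurus code; `%ct,%an` formats the commit timestamp and author name):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

export interface LastUpdatedInfo {
  date: Date;
  author: string;
}

// Parse one `git log --format=%ct,%an` line, e.g. "1700000000,Jane Doe".
export function parseGitLogLine(line: string): LastUpdatedInfo | null {
  const match = line.trim().match(/^(\d+),(.*)$/);
  if (!match) return null;
  return { date: new Date(Number(match[1]) * 1000), author: match[2] };
}

// Naive strategy: one `git log` process per file. Simple, but doesn't scale.
export async function readFromGitNaive(
  filePath: string,
): Promise<LastUpdatedInfo | null> {
  const { stdout } = await execFileAsync("git", [
    "log",
    "-1",
    "--format=%ct,%an",
    "--",
    filePath,
  ]);
  // An untracked file produces empty output, which parses to null.
  return parseGitLogLine(stdout);
}
```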

We have also noticed that we are not alone in encountering this bottleneck, and many other docs frameworks would benefit from an efficient solution. Similar “read last update date / author from git” logic can be found in:

Considering the ecosystem impact it could have, I think the e18e initiative is a good place to collaboratively tackle this problem once and for all, and speed up all the docs sites at once.

I’m going to try various approaches, but each attempt takes time and might require a significant effort just to see if this is viable (sometimes working in languages like C/Rust I’m not familiar with). Docusaurus also has constraints that may not be exactly the same as other frameworks, and it could be useful to work on a general solution that works well for the whole community and not just Docusaurus.

We have to make this feature easy to use for the end user, but it might also be possible that a single approach doesn’t always perform best, and creating an npm package offering different strategies/tradeoffs/heuristics could be beneficial to the community. Maybe even this package could be composable and support more than just a Git-based strategy?

Docusaurus constraints

Here are some of the constraints we have for Docusaurus.

Large sites and repositories

We have documentation websites that can contain more than 10k Markdown source files for which we need the last update date / author from Git. The Git history can also be quite large, above 10k commits. The solution should scale to extremely large setups.

Support for multiple Git repositories/submodules

We have seen community members (PR) assembling multiple Git repositories to create their final documentation website. This means that the "last update date" may not be stored in a single Git repository, but may be scattered across multiple repositories.
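One way to support this (a sketch with a hypothetical helper, not an existing Docusaurus API) is to assign each file to the repository whose root is its longest matching path prefix, so that one bulk Git read can be issued per repository instead of per file:

```typescript
import * as path from "node:path";

// Hypothetical helper: given known repository root directories, group files
// by the repo whose root is their longest matching path prefix.
export function groupByRepoRoot(
  repoRoots: string[],
  filePaths: string[],
): Map<string, string[]> {
  // Longest roots first so nested repos/submodules win over their parent.
  const roots = [...repoRoots].sort((a, b) => b.length - a.length);
  const groups = new Map<string, string[]>();
  for (const filePath of filePaths) {
    const root = roots.find((r) => !path.relative(r, filePath).startsWith(".."));
    if (root === undefined) continue; // not under any known repo
    const list = groups.get(root) ?? [];
    list.push(filePath);
    groups.set(root, list);
  }
  return groups;
}
```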

Remain fast for untracked files

Not all Docusaurus sites use Markdown files tracked in Git. We also have plugins that download Markdown files from a CMS or a translation SaaS, or that generate Markdown files from large OpenAPI schemas, for example.

The solution should remain fast when retrieving the "last update info" for thousands of files, whether they are tracked or not. Even checking whether a file is tracked can add overhead.
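One way to keep that check cheap (a sketch, assuming the plain NUL-separated output of `git ls-files -z`) is to list all tracked paths once per repository and answer per-file checks from an in-memory Set:

```typescript
// Parse the NUL-separated output of a single `git ls-files -z` call into a
// Set, so "is this file tracked?" becomes an O(1) in-memory lookup instead of
// a Git invocation per file.
export function parseTrackedFiles(lsFilesOutput: string): Set<string> {
  return new Set(lsFilesOutput.split("\0").filter((p) => p.length > 0));
}
```

Untracked files can then bail out immediately instead of spawning Git at all.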

Sensible defaults

Docusaurus should ship with good defaults and let you display the "last updated info" metadata by simply toggling a boolean flag, as it does today.

It is reasonable to assume users will either have their docs tracked in Git (one or more repositories/submodules) or have untracked files.

Simple API

I’d like to give Docusaurus users maximum flexibility and let them provide their own strategy when the default one we provide is not enough. After all, not everyone uses Git, and sometimes the “last update info” is tracked in an external system like a CMS.

I find that the most idiomatic public API to expose is something like:

async function getLastUpdatedInfo(filePath: string)

This public API has 2 sides:

  • Users can provide the implementation as a callback
  • Plugin authors can call the default or user-provided API

This lets users implement custom logic, and even add bail-out strategies so that the logic runs faster (which may be useful for very large setups):

async function getLastUpdatedInfo(filePath: string) {
  if (filePath.startsWith('./docs/')) {
    return readFromGit(filePath);
  } else if (filePath.startsWith('./blog/')) {
    return readFromMercurial(filePath);
  } else if (filePath.startsWith('./api/')) {
    return readFromFileSystem(filePath);
  } else if (filePath.startsWith('./i18n/')) {
    return readFromTranslationSaaS(filePath);
  } else if (filePath.startsWith('./untracked/')) {
    // Bail out from trying to find these files in Git
    return null;
  } else {
    // Fallback: use the default strategy
    return readFromGit(filePath);
  }
}

The challenge with this kind of API is that it makes it harder to implement something like Astro/Starlight does: reading the whole Git repository history at once in a single git log command, then using the aggregated result for all the file paths.

Reading everything at once seems way more efficient than our current strategy, and afaik it scales well even for large repositories. However, it might not be as easy to deal with multiple Git repositories, and that strategy might not always be possible. For example, does it scale well for other VCSs, and would it even be possible to implement on top of a SaaS/CMS API?
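Assuming the whole history is read once with something like `git log --format=%ct,%an --name-only`, the output can be folded into a per-file map. This is a simplified sketch: file names containing newlines, or lines that happen to look like a commit header, would need a more robust delimiter in a real implementation.

```typescript
export interface LastUpdatedInfo {
  date: Date;
  author: string;
}

// Fold `git log --format=%ct,%an --name-only` output into a map of
// filePath -> last update info. git log lists commits newest-first, so the
// first commit touching a file wins.
export function parseBulkGitLog(output: string): Map<string, LastUpdatedInfo> {
  const result = new Map<string, LastUpdatedInfo>();
  let current: LastUpdatedInfo | null = null;
  for (const line of output.split("\n")) {
    const commit = line.match(/^(\d+),(.*)$/);
    if (commit) {
      // Commit header line, e.g. "1700000300,Alice"
      current = { date: new Date(Number(commit[1]) * 1000), author: commit[2] };
    } else if (line.trim() !== "" && current !== null && !result.has(line)) {
      // File path line belonging to the current (most recent) commit
      result.set(line, current);
    }
  }
  return result;
}
```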

Also note that Docusaurus has a modular plugin architecture, and each plugin has the freedom to choose when to retrieve the “last update info” for the files that it handles, at different stages. For us, there’s no single place where calling a big “collectAllLastUpdateInfo” could make sense. To give a concrete example:

  • Our “docs” plugin would retrieve the “last update info” in our plugin.loadContent() lifecycle while reading Markdown source files
  • Our “sitemap” plugin would retrieve the “last update info” in our plugin.postBuild() lifecycle while generating a sitemap.xml file with <lastmod> tags, by iterating over all the routes and their metadata (that may include the source file path that led to the creation of a given route).
  • Third-party plugins should be able to conveniently call the API we expose if they need to display the “last updated info” anywhere. It’s their responsibility to decide when this makes sense to call it.

I also find that a getLastUpdatedInfo(filePath) API is more intuitive to customize for users than a getAllLastUpdateInfos(filePaths), and for most sites that do not have extreme scales, these customizations should perform well enough.

For all these reasons, I believe getLastUpdatedInfo makes more sense. Given our constraints, we can still implement a strategy like Astro Starlight does by reading the whole repo lazily on the first getLastUpdatedInfo call on a given repo. The tradeoff would be that very large sites providing their own getLastUpdatedInfo (does it happen often?) would now have to implement this “read the whole repo lazily” themselves. That seems acceptable considering it’s rare to have very large sites that do not use Git and need to customize the default strategy.
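The lazy variant described above could look roughly like this (a sketch; `loadAll` stands in for any bulk reader, e.g. one that shells out to git log once per repository):

```typescript
export interface LastUpdatedInfo {
  date: Date;
  author: string;
}

// Expose the per-file getLastUpdatedInfo(filePath) API, but read the whole
// repository history only once, lazily, on the first call. All concurrent and
// subsequent calls await the same promise and do a cheap map lookup.
export function createLazyGitStrategy(
  loadAll: () => Promise<Map<string, LastUpdatedInfo>>,
): (filePath: string) => Promise<LastUpdatedInfo | null> {
  let allPromise: Promise<Map<string, LastUpdatedInfo>> | null = null;
  return async (filePath) => {
    allPromise ??= loadAll();
    const all = await allPromise;
    return all.get(filePath) ?? null;
  };
}
```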

My ideal default Git-based solution would let me spam the API with thousands of independent function calls with a single file path, and it would figure out on its own how to optimize/batch those independent calls to respond relatively fast and to not overwhelm the system.
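That "spam the API, let it figure out batching" behavior is essentially the DataLoader pattern: independent calls made in the same tick are coalesced into one batch. A minimal sketch (`batchFn` is a hypothetical bulk resolver, e.g. one Git invocation for many paths; error handling omitted for brevity):

```typescript
// Coalesce independent single-file calls into one batch per microtask tick,
// so callers keep a simple per-file API while the implementation issues one
// bulk operation for many paths (the DataLoader pattern).
export function createBatcher<T>(
  batchFn: (filePaths: string[]) => Promise<Map<string, T>>,
): (filePath: string) => Promise<T | null> {
  let pending: Array<{ path: string; resolve: (value: T | null) => void }> = [];
  return (filePath) =>
    new Promise<T | null>((resolve) => {
      pending.push({ path: filePath, resolve });
      if (pending.length === 1) {
        // First call in this tick: flush the whole batch on the next microtask
        queueMicrotask(async () => {
          const batch = pending;
          pending = [];
          const results = await batchFn(batch.map((item) => item.path));
          for (const { path, resolve } of batch) {
            resolve(results.get(path) ?? null);
          }
        });
      }
    });
}
```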

Implementation ideas

This is just a subset of the solutions to explore:

Related links / discussions


What’s next?

I’m going to see how to create a representative repository of a very large and complex Docusaurus site, with a good mix of untracked and tracked files under distinct Git repositories. From there, it should become easier to create a benchmark that compares various technical solutions side by side.

What I’m going to try first is the approach that Astro implements, with a few enhancements to support our constraints. It’s probably not the optimal solution, but it’s probably the easiest for me to implement.

If you want to help, please let me know.
