Docs sites - read “last commit date/author” efficiently from Git #216

@slorber

Description

I’d like to ask the e18e community for help in solving a performance bottleneck that could speed up all Node-based documentation frameworks.

The task is to efficiently read the “last commit date / author” of thousands of files to use/display that metadata on your site.

Context

I’m the maintainer of Docusaurus, a content-driven (Markdown/MDX) static site generator, created by Meta, powering many technical docs websites (React Native, Babel, Prettier, Jest, Electron, Ionic, and more).

As many other docs-related frameworks do, a quite popular feature of Docusaurus is to display the “last update date / author” of a given page (example page):

[Screenshot: "Last updated" date and author metadata displayed at the bottom of a docs page]

After optimizing various aspects of Docusaurus, we found out that the way we retrieve that “last update date / author” info from Git is one of the most important remaining bottlenecks of the project, which could significantly impact build times on large projects.

The naive approach we have historically used is to spawn a git log <filePath> process with the relevant options for each file. This obviously doesn't scale well. We implemented a "git command queue" to avoid overwhelming the OS, but it still feels wrong to spawn one process per file. There are definitely better alternatives, but which one is best? Here's our original Docusaurus issue that tracks this performance problem.
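For reference, the naive per-file strategy can be sketched roughly like this (a simplified illustration, not the actual Docusaurus code; `%ct,%an` formats the commit timestamp and author name):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

export interface LastUpdatedInfo {
  date: Date;
  author: string;
}

// Parse one `git log --format=%ct,%an` line, e.g. "1700000000,Jane Doe".
export function parseGitLogLine(line: string): LastUpdatedInfo | null {
  const match = line.trim().match(/^(\d+),(.*)$/);
  if (!match) return null;
  return { date: new Date(Number(match[1]) * 1000), author: match[2] };
}

// Naive strategy: one `git log` process per file. Simple, but doesn't scale.
export async function readFromGitNaive(
  filePath: string,
): Promise<LastUpdatedInfo | null> {
  const { stdout } = await execFileAsync("git", [
    "log",
    "-1",
    "--format=%ct,%an",
    "--",
    filePath,
  ]);
  // An untracked file produces empty output, which parses to null.
  return parseGitLogLine(stdout);
}
```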

We have also noticed that we are not alone in encountering this bottleneck, and many other docs frameworks would benefit from an efficient solution. Similar “read last update date / author from git” logic can be found in:

Considering the ecosystem impact it could have, I think the e18e initiative is a good place to collaboratively tackle this problem once and for all, and speed up all the docs sites at once.

I’m going to try various approaches, but each attempt takes time and might require a significant effort just to see if this is viable (sometimes working in languages like C/Rust I’m not familiar with). Docusaurus also has constraints that may not be exactly the same as other frameworks, and it could be useful to work on a general solution that works well for the whole community and not just Docusaurus.

We have to make this feature easy to use for the end user, but it might also be possible that a single approach doesn’t always perform best, and creating an npm package offering different strategies/tradeoffs/heuristics could be beneficial to the community. Maybe even this package could be composable and support more than just a Git-based strategy?

Docusaurus constraints

Here are some of the constraints we have for Docusaurus.

Large sites and repositories

We have documentation websites that can contain more than 10k Markdown source files for which we need the last update date / author from Git. The Git history can also be quite large, above 10k commits. The solution should scale to extremely large setups.

Support for multiple Git repositories/submodules

We have seen community members (PR) assembling multiple Git repositories to create their final documentation website. This means that the "last update date" may not be stored in a single Git repository, but may be scattered across multiple repositories.
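One way to support this (a sketch with a hypothetical helper, not an existing Docusaurus API) is to assign each file to the repository whose root is its longest matching path prefix, so that one bulk Git read can be issued per repository instead of per file:

```typescript
import * as path from "node:path";

// Hypothetical helper: given known repository root directories, group files
// by the repo whose root is their longest matching path prefix.
export function groupByRepoRoot(
  repoRoots: string[],
  filePaths: string[],
): Map<string, string[]> {
  // Longest roots first so nested repos/submodules win over their parent.
  const roots = [...repoRoots].sort((a, b) => b.length - a.length);
  const groups = new Map<string, string[]>();
  for (const filePath of filePaths) {
    const root = roots.find((r) => !path.relative(r, filePath).startsWith(".."));
    if (root === undefined) continue; // not under any known repo
    const list = groups.get(root) ?? [];
    list.push(filePath);
    groups.set(root, list);
  }
  return groups;
}
```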

Remain fast for untracked files

Not all Docusaurus sites use Markdown files tracked in Git. We also have plugins that download Markdown files from a CMS or a translation SaaS, or that generate Markdown files from large OpenAPI schemas, for example.

The solution should remain fast when retrieving the "last update info" for thousands of files, whether they are tracked or not. Even checking whether a file is tracked can add overhead.
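One way to keep that check cheap (a sketch, assuming the plain NUL-separated output of `git ls-files -z`) is to list all tracked paths once per repository and answer per-file checks from an in-memory Set:

```typescript
// Parse the NUL-separated output of a single `git ls-files -z` call into a
// Set, so "is this file tracked?" becomes an O(1) in-memory lookup instead of
// a Git invocation per file.
export function parseTrackedFiles(lsFilesOutput: string): Set<string> {
  return new Set(lsFilesOutput.split("\0").filter((p) => p.length > 0));
}
```

Untracked files can then bail out immediately instead of spawning Git at all.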

Sensible defaults

Docusaurus should ship with good defaults and let you display the "last updated info" metadata by simply toggling a boolean flag, as it does today.

It is reasonable to assume users will either have their docs tracked in Git (one or more repositories/submodules) or have untracked files.

Simple API

I’d like to give Docusaurus users maximum flexibility and let them provide their own strategy when the default one we provide is not enough. After all, not everyone uses Git, and sometimes the “last update info” is tracked in an external system like a CMS.

I find that the most idiomatic public API to expose is something like:

async function getLastUpdatedInfo(filePath: string)

This public API has 2 sides:

  • Users can provide the implementation as a callback
  • Plugin authors can call the default or user-provided API

This lets users implement custom logic, and even add bail-out strategies so that the logic runs faster (which may be useful for very large setups):

async function getLastUpdatedInfo(filePath: string) {
  if (filePath.startsWith('./docs/')) {
    return readFromGit(filePath);
  } else if (filePath.startsWith('./blog/')) {
    return readFromMercurial(filePath);
  } else if (filePath.startsWith('./api/')) {
    return readFromFileSystem(filePath);
  } else if (filePath.startsWith('./i18n/')) {
    return readFromTranslationSaaS(filePath);
  } else if (filePath.startsWith('./untracked/')) {
    // Bail out from trying to find these files in Git
    return null;
  } else {
    // Fallback: use the default strategy
    return readFromGit(filePath);
  }
}

The challenge with this kind of API is that it makes it harder to implement something like Astro/Starlight does: reading the whole Git repository history at once in a single git log command, then using the aggregated result for all the file paths.

Reading everything at once seems way more efficient than our current strategy, and afaik it scales well even for large repositories. However, it might not be as easy to deal with multiple Git repositories, and that strategy might not always be possible. For example, does it scale well for other VCSs, and would it even be possible to implement on top of a SaaS/CMS API?
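Assuming the whole history is read once with something like `git log --format=%ct,%an --name-only`, the output can be folded into a per-file map. This is a simplified sketch: file names containing newlines, or lines that happen to look like a commit header, would need a more robust delimiter in a real implementation.

```typescript
export interface LastUpdatedInfo {
  date: Date;
  author: string;
}

// Fold `git log --format=%ct,%an --name-only` output into a map of
// filePath -> last update info. git log lists commits newest-first, so the
// first commit touching a file wins.
export function parseBulkGitLog(output: string): Map<string, LastUpdatedInfo> {
  const result = new Map<string, LastUpdatedInfo>();
  let current: LastUpdatedInfo | null = null;
  for (const line of output.split("\n")) {
    const commit = line.match(/^(\d+),(.*)$/);
    if (commit) {
      // Commit header line, e.g. "1700000300,Alice"
      current = { date: new Date(Number(commit[1]) * 1000), author: commit[2] };
    } else if (line.trim() !== "" && current !== null && !result.has(line)) {
      // File path line belonging to the current (most recent) commit
      result.set(line, current);
    }
  }
  return result;
}
```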

Also note that Docusaurus has a modular plugin architecture, and each plugin has the freedom to choose when to retrieve the “last update info” for the files that it handles, at different stages. For us, there’s no single place where calling a big “collectAllLastUpdateInfo” could make sense. To give a concrete example:

  • Our “docs” plugin would retrieve the “last update info” in our plugin.loadContent() lifecycle while reading Markdown source files
  • Our “sitemap” plugin would retrieve the “last update info” in our plugin.postBuild() lifecycle while generating a sitemap.xml file with <lastmod> tags, by iterating over all the routes and their metadata (that may include the source file path that led to the creation of a given route).
  • Third-party plugins should be able to conveniently call the API we expose if they need to display the “last updated info” anywhere. It’s their responsibility to decide when this makes sense to call it.

I also find that a getLastUpdatedInfo(filePath) API is more intuitive to customize for users than a getAllLastUpdateInfos(filePaths), and for most sites that do not have extreme scales, these customizations should perform well enough.

For all these reasons, I believe getLastUpdatedInfo makes more sense. Given our constraints, we can still implement a strategy like Astro Starlight does by reading the whole repo lazily on the first getLastUpdatedInfo call on a given repo. The tradeoff would be that very large sites providing their own getLastUpdatedInfo (does it happen often?) would now have to implement this “read the whole repo lazily” themselves. That seems acceptable considering it’s rare to have very large sites that do not use Git and need to customize the default strategy.
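The lazy variant described above could look roughly like this (a sketch; `loadAll` stands in for any bulk reader, e.g. one that shells out to git log once per repository):

```typescript
export interface LastUpdatedInfo {
  date: Date;
  author: string;
}

// Expose the per-file getLastUpdatedInfo(filePath) API, but read the whole
// repository history only once, lazily, on the first call. All concurrent and
// subsequent calls await the same promise and do a cheap map lookup.
export function createLazyGitStrategy(
  loadAll: () => Promise<Map<string, LastUpdatedInfo>>,
): (filePath: string) => Promise<LastUpdatedInfo | null> {
  let allPromise: Promise<Map<string, LastUpdatedInfo>> | null = null;
  return async (filePath) => {
    allPromise ??= loadAll();
    const all = await allPromise;
    return all.get(filePath) ?? null;
  };
}
```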

My ideal default Git-based solution would let me spam the API with thousands of independent function calls with a single file path, and it would figure out on its own how to optimize/batch those independent calls to respond relatively fast and to not overwhelm the system.
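That "spam the API, let it figure out batching" behavior is essentially the DataLoader pattern: independent calls made in the same tick are coalesced into one batch. A minimal sketch (`batchFn` is a hypothetical bulk resolver, e.g. one Git invocation for many paths; error handling omitted for brevity):

```typescript
// Coalesce independent single-file calls into one batch per microtask tick,
// so callers keep a simple per-file API while the implementation issues one
// bulk operation for many paths (the DataLoader pattern).
export function createBatcher<T>(
  batchFn: (filePaths: string[]) => Promise<Map<string, T>>,
): (filePath: string) => Promise<T | null> {
  let pending: Array<{ path: string; resolve: (value: T | null) => void }> = [];
  return (filePath) =>
    new Promise<T | null>((resolve) => {
      pending.push({ path: filePath, resolve });
      if (pending.length === 1) {
        // First call in this tick: flush the whole batch on the next microtask
        queueMicrotask(async () => {
          const batch = pending;
          pending = [];
          const results = await batchFn(batch.map((item) => item.path));
          for (const { path, resolve } of batch) {
            resolve(results.get(path) ?? null);
          }
        });
      }
    });
}
```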

Implementation ideas

This is just a subset of the solutions to explore:

Related links / discussions


What’s next?

I’m going to see how to create a representative repository of a very large and complex Docusaurus site, with a good mix of untracked and tracked files under distinct Git repositories. From there, it should become easier to create a benchmark that compares various technical solutions side by side.

What I’m going to try first is the approach that Astro implements, with a few enhancements to support our constraints. It’s probably not the optimal solution, but it’s probably the easiest for me to implement.

If you want to help, please let me know.
