## Description
We need to read git log entries for:

- Content plugins' `loadContent()` lifecycle, to compute docs/blog/pages last update metadata when `showLastUpdateTime` or `showLastUpdateAuthor` is enabled
- Sitemap `postBuild()` lifecycle, to compute the value of a `<lastmod>` tag from `route?.metadata?.sourceFilePath` (if provided by plugins, which may not be the case for third-party plugins)
As of Docusaurus 3.8, we queue a command like this one for each relevant file:

```bash
git -c log.showSignature=false log --format=RESULT:%ct,%an --max-count=1 -- "i18n/fr/docusaurus-plugin-content-blog/releases/2.3/index.mdx"
```

More precisely, we first `cd` into the file's directory:

```bash
cd i18n/fr/docusaurus-plugin-content-blog/releases/2.3/
git -c log.showSignature=false log --format=RESULT:%ct,%an --max-count=1 -- "index.mdx"
```
We do this because it's possible that users have git submodules, and the Git repo at the root may not contain the history of that specific file.
## Performance problems
There are various performance bottlenecks in our implementation.
### Main case
The `git log` command queueing works, but it remains a bottleneck, and there's probably room to improve things.
On a large 11k-docs site, I noticed that `loadContent()` takes 24s when docs lastUpdate options are enabled, but only 4s when they are disabled. These options are the main bottleneck of the `loadContent()` lifecycle for large sites.
I'm not sure how to improve the performance of this, but we likely have a few options to explore.
Useful resources:
- https://github.com/Brooooooklyn/simple-git
- https://www.nodegit.org/
- https://github.com/isomorphic-git/isomorphic-git
Suggested in:
- https://x.com/sebastienlorber/status/1927310075284111744
- https://bsky.app/profile/sebastienlorber.com/post/3lq5hx3kuec2d
We could also try to optimize the git commands themselves. I think `rev-parse` and batching history-read commands might work, but we need to study that.
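As a sketch of what batching could look like (hypothetical, not the current implementation): a single `git log --name-only` pass over a content directory prints each commit's metadata followed by the files it touched, and since the log is newest-first, the first commit mentioning a file gives that file's last update:

```javascript
// Hypothetical batching sketch: instead of one `git log` invocation per file,
// run a single command over the whole content directory, e.g.:
//
//   git -c log.showSignature=false log --format=RESULT:%ct,%an --name-only -- docs/
//
// and parse its output. Sample output is hardcoded here for illustration.
const sampleGitLogOutput = `RESULT:1718000000,Alice
docs/intro.mdx
docs/guide.mdx

RESULT:1715000000,Bob
docs/guide.mdx
`;

function parseLastUpdates(logOutput) {
  const lastUpdates = new Map(); // filePath -> {timestamp, author}
  let current = null;
  for (const rawLine of logOutput.split("\n")) {
    const line = rawLine.trim();
    if (line.startsWith("RESULT:")) {
      const payload = line.slice("RESULT:".length);
      const commaIndex = payload.indexOf(",");
      current = {
        timestamp: Number(payload.slice(0, commaIndex)),
        author: payload.slice(commaIndex + 1),
      };
    } else if (line && current) {
      // The log is newest-first: keep only the first entry seen per file
      if (!lastUpdates.has(line)) {
        lastUpdates.set(line, current);
      }
    }
  }
  return lastUpdates;
}
```

A single repo-wide pass trades many small process spawns for one larger parse, which is usually much cheaper; whether it beats the current per-file queue on real repos would need benchmarking.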
We could also try to create an incremental "last update" cache: read the full history once, then on rebuilds only read the newly added commits. This could also help to display the real last update data in dev. But maybe it would be slow to initialize for very large Git repos 🤷‍♂️
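The merge step for such a cache could look like this (a minimal sketch with hypothetical names; it assumes we persist the last processed commit hash and, on rebuild, only parse commits in the `<cachedHead>..HEAD` range):

```javascript
// Hypothetical incremental-cache sketch: `cache` holds per-file results from
// previous builds, `newEntries` holds results parsed from commits added since
// the cached HEAD (e.g. via `git log <cachedHead>..HEAD --name-only`).
// Newer commits always win, so merging is a simple overwrite-if-newer pass.
function mergeIncrementalUpdates(cache, newEntries) {
  for (const [filePath, entry] of newEntries) {
    const cached = cache.get(filePath);
    if (!cached || entry.timestamp > cached.timestamp) {
      cache.set(filePath, entry);
    }
  }
  return cache;
}
```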
### i18n / codegen / untracked case
I've also noticed that the commands we issue are significantly slower for files not tracked by Git. This is usually the case:

- when downloading i18n MDX files from Crowdin
- when using codegen plugins that create untracked MDX files before running `docusaurus build`
This can be seen locally on our own website on a localized build after downloading French translation files:
```
[PERF] Build > en > Load site - 4.53 seconds!
[PERF] Build > en > postBuild() - 3.34 seconds!
[PERF] Build > fr > Load site - 6.15 seconds!
[PERF] Build > fr > postBuild() - 11.53 seconds!
```
It is unexpected that a French site build has slower `loadContent()` and `postBuild()` phases than the English site build.
The impact is even more significant on our Netlify CI, which is not as powerful as my Mac M3 Pro:
```
[PERF] Build > en > Load site - 5.71 seconds!
[PERF] Build > en > postBuild() - 12.78 seconds!
[PERF] Build > fr > Load site - 104.85 seconds!
[PERF] Build > fr > postBuild() - 203.77 seconds!
[PERF] Build > pt-BR > Load site - 191.19 seconds!
[PERF] Build > pt-BR > postBuild() - 298.34 seconds!
[PERF] Build > ko > Load site - 265.14 seconds!
[PERF] Build > ko > postBuild() - 372.63 seconds!
[PERF] Build > zh-CN > Load site - 352.54 seconds!
[PERF] Build > zh-CN > postBuild() - 449.76 seconds!
```
Note: there's likely a logging issue (or something worse 🤔): the numbers look like they accumulate, and they are not expected to grow like that. But we can already see that `loadSite()` and `postBuild()` are much slower on localized sites. In practice, the `postBuild()` hook seems to take more time for each subsequent locale, so I suspect something really weird is happening here, but I can't reproduce it locally yet.
### Duplicated work
If a file is not tracked by Git and the history couldn't be read in `loadContent()`, then we try to read it again in `postBuild()`, which, as described above, is quite slow.

This should be easy to fix, so I'll handle it right now in #11211
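One possible shape for such a fix (a sketch with hypothetical names, not the actual patch): memoize lookups, including `null` failures, so `postBuild()` never repeats a read that `loadContent()` already attempted:

```javascript
// Hypothetical memoization sketch: cache every lookup result, including null
// (file untracked / no history), so the expensive git read happens at most
// once per file across loadContent() and postBuild().
const lastUpdateCache = new Map(); // filePath -> result | null

async function readLastUpdateOnce(filePath, readFromGit) {
  if (lastUpdateCache.has(filePath)) {
    // .has() (not .get()) so cached null failures also short-circuit
    return lastUpdateCache.get(filePath);
  }
  const result = await readFromGit(filePath); // returns null for untracked files
  lastUpdateCache.set(filePath, result);
  return result;
}
```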
## Design issue
There are cases where we know for sure there's no last update metadata available. It should be possible to configure the site or plugins to avoid reading the Git history entirely. It's probably possible today through `parseFrontMatter` and `process.env.DOCUSAURUS_CURRENT_LOCALE`, but it is clearly not intuitive and we should provide better APIs.
Examples include:

- i18n sites: if translations are all downloaded from Crowdin, we likely don't want to read the last update metadata from Git for any file in the untracked `i18n` folder.
- plugins using code generation: we have various third-party plugins that generate MDX files that our content plugins will then read. An example is our own Changelog plugin. It should be possible to configure our content plugins so that they directly emit `null` values instead of running useless `git log` commands.
In general, we probably want to introduce a global `siteConfig.readLastUpdateData` callback hook so that it's possible to implement other strategies that are not related to Git.

Here's an example usage based on a pseudo-implementation:
```js
const siteConfig = {
  readLastUpdateData: async (filePath, utils) => {
    // Skip Git entirely for untracked Crowdin translations
    if (filePath.startsWith("i18n")) {
      return null;
    }
    // Use filesystem metadata for code-generated files
    if (filePath.startsWith("changelog")) {
      return utils.readFromFS(filePath);
    }
    // Default: read from Git history
    return utils.readFromGit(filePath);
  },
};
```