Performance issue in git log commands to read last update metadata #11208

Open
@slorber

Description

We need to read git log entries for:

  • Content plugins loadContent() lifecycle, to compute docs/blog/pages last update metadata when showLastUpdateTime or showLastUpdateAuthor is enabled
  • Sitemap postBuild() lifecycle to compute the value of a <lastmod> tag from the route?.metadata?.sourceFilePath (if provided by plugins, which may not be the case for third-party plugins)

As of Docusaurus 3.8, we queue a command like this one for each relevant file:

git -c log.showSignature=false log --format=RESULT:%ct,%an --max-count=1 -- "i18n/fr/docusaurus-plugin-content-blog/releases/2.3/index.mdx"
More precisely, we run the command from the file's own directory:

cd i18n/fr/docusaurus-plugin-content-blog/releases/2.3/
git -c log.showSignature=false log --format=RESULT:%ct,%an --max-count=1 -- "index.mdx"

We do this because users may have Git submodules, in which case the Git repo at the root may not contain the history of that specific file.
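For illustration, here is a minimal Node.js sketch of the per-file command described above. It is not the actual Docusaurus implementation: the function names `parseLastUpdate` and `readLastUpdate` are hypothetical, and only the git arguments are taken from the command shown in this issue.

```javascript
const { execFile } = require("node:child_process");
const path = require("node:path");

// Parse the "RESULT:<timestamp>,<author>" line produced by
// --format=RESULT:%ct,%an. Returns null when git printed no such line
// (for example, when the file has no history).
function parseLastUpdate(stdout) {
  const match = /^RESULT:(\d+),(.*)$/m.exec(stdout);
  if (!match) return null;
  return { timestamp: Number(match[1]), author: match[2] };
}

// Hypothetical helper: run git from the file's own directory, so that a
// file living in a submodule is resolved against the submodule's history.
function readLastUpdate(filePath) {
  const args = [
    "-c", "log.showSignature=false",
    "log", "--format=RESULT:%ct,%an", "--max-count=1",
    "--", path.basename(filePath),
  ];
  return new Promise((resolve, reject) => {
    execFile("git", args, { cwd: path.dirname(filePath) }, (error, stdout) => {
      if (error) reject(error);
      else resolve(parseLastUpdate(stdout));
    });
  });
}
```

The parser splits on the first comma only, so an author name that itself contains a comma is preserved.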


Performance problems

There are various performance bottlenecks in our implementation.

Main case

The git log command queueing works, but it remains a bottleneck, and there's probably room to improve things.

On a large 11k docs site, I noticed that loadContent() takes 24s when the docs lastUpdate options are enabled, but only 4s when they are disabled. These options are the main bottleneck of the loadContent() lifecycle for large sites.

I'm not sure how to improve the performance of this, but we likely have a few options to explore.

Useful resources:

Suggested in:

We could also try to optimize the Git commands. I think rev-parse and batching the history read commands might work, but we need to study that.

We could also try to create an incremental "last update" cache, reading the full history first, and then only reading from the newly added commits on rebuilds. This could also help to display the real last update data in dev. But maybe it would be slow to initialize for very large Git repos 🤷‍♂
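As a rough sketch of the batching idea, a single `git log --name-only` call could replace the per-file commands, with the output parsed into a map of file path to last update. This is an untested assumption about what might help, not a measured improvement. The function name `parseBatchedLog` is hypothetical, and the sketch assumes git prints paths unquoted (paths with special characters may need `core.quotePath=false`).

```javascript
// Build a Map of file path -> { timestamp, author } from the output of:
//   git -c log.showSignature=false log --format=RESULT:%ct,%an --name-only
// Each commit prints one "RESULT:" header line followed by the paths it touched.
function parseBatchedLog(stdout) {
  const lastUpdates = new Map();
  let current = null;
  for (const line of stdout.split("\n")) {
    const header = /^RESULT:(\d+),(.*)$/.exec(line);
    if (header) {
      current = { timestamp: Number(header[1]), author: header[2] };
    } else if (line !== "" && current && !lastUpdates.has(line)) {
      // git log prints newest commits first by default, so the first
      // commit that mentions a path is treated as its last update.
      lastUpdates.set(line, current);
    }
  }
  return lastUpdates;
}
```

For the incremental variant, the same parser could be fed the output of a revision range such as `<cachedCommit>..HEAD`, with the resulting entries overriding those of a previously stored map.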

i18n / codegen / untracked case

I've also noticed that the commands we issue are significantly slower for untracked Git files. This is usually the case:

  • when downloading i18n mdx files from Crowdin
  • when using codegen plugins creating untracked MDX files before running docusaurus build

This can be seen locally on our own website on a localized build after downloading French translation files:

[PERF] Build > en > Load site - 4.53 seconds!
[PERF] Build > en > postBuild() - 3.34 seconds!

[PERF] Build > fr > Load site - 6.15 seconds!
[PERF] Build > fr > postBuild() - 11.53 seconds!

It is unexpected that a French site build has a slower loadContent() and postBuild() compared to the English site.

The impact is even more significant on our Netlify CI, which is not as powerful as my M3 Pro Mac:

[PERF] Build > en > Load site - 5.71 seconds!
[PERF] Build > en > postBuild() - 12.78 seconds!

[PERF] Build > fr > Load site - 104.85 seconds!
[PERF] Build > fr > postBuild() - 203.77 seconds!

[PERF] Build > pt-BR > Load site - 191.19 seconds!
[PERF] Build > pt-BR > postBuild() - 298.34 seconds!

[PERF] Build > ko > Load site - 265.14 seconds!
[PERF] Build > ko > postBuild() - 372.63 seconds!

[PERF] Build > zh-CN > Load site - 352.54 seconds!
[PERF] Build > zh-CN > postBuild() - 449.76 seconds!

Note: there's likely a logging issue (or something worse 🤔). The numbers look like they accumulate and are not expected to grow like that, but we can already see that loadSite() and postBuild() are much slower on localized sites.

In practice, it seems that the postBuild() hook takes more time for each subsequent locale, so I suspect something really weird is happening here, but I can't reproduce it locally yet.
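One possible mitigation for the untracked-file case described in this section would be to list tracked files once with `git ls-files` and skip `git log` for anything else. This is only a sketch under assumptions: the helper names `parseTrackedFiles` and `listTrackedFiles` are hypothetical, it has not been measured, and files inside submodules would not appear in the parent repo's listing.

```javascript
const { execFileSync } = require("node:child_process");

// Split the NUL-separated output of `git ls-files -z` into a Set of paths.
// With -z, git does not quote paths, so entries can be used as-is.
function parseTrackedFiles(stdout) {
  return new Set(stdout.split("\0").filter((entry) => entry !== ""));
}

// Hypothetical helper: list tracked files, relative to repoDir.
function listTrackedFiles(repoDir) {
  const stdout = execFileSync("git", ["ls-files", "-z"], {
    cwd: repoDir,
    encoding: "utf8",
    maxBuffer: 64 * 1024 * 1024, // large repos can list many files
  });
  return parseTrackedFiles(stdout);
}
```

A caller could then return null right away for any path missing from the set, instead of spawning a git process for it.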

Duplicated work

If a file is not tracked by Git and the history couldn't be read in loadContent(), then we try to read it again in postBuild(), which, as described above, is quite slow.

This should be easy to fix so I'll handle this right now in #11211
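To illustrate the idea (not necessarily how #11211 implements it), here is a small sketch of a per-path cache that also remembers null results, so a second lifecycle would not spawn another git command for the same file. The name `memoizeByPath` is hypothetical.

```javascript
// Wrap a reader so each file path is read at most once per process.
// The cached promise may resolve to null (e.g. an untracked file); that
// null is reused instead of triggering a second read.
function memoizeByPath(readLastUpdate) {
  const cache = new Map();
  return (filePath) => {
    if (!cache.has(filePath)) {
      cache.set(filePath, readLastUpdate(filePath));
    }
    return cache.get(filePath);
  };
}
```

Because the cache stores the promise rather than the resolved value, concurrent calls for the same path also share a single read.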


Design issue

There are cases where we know for sure there's no last update metadata available. It should be possible to configure the site or plugins to avoid reading the Git history entirely. It's probably possible today through parseFrontMatter and process.env.DOCUSAURUS_CURRENT_LOCALE, but it is clearly not intuitive and we should provide better APIs.

Examples include:

  • i18n sites: if translations are all downloaded from Crowdin, we likely don't want to read the last update metadata from Git for any file in the untracked i18n folder.
  • plugins using code-generation: we have various third-party plugins that generate MDX files that our content plugins will then read. An example is our own Changelog plugin. It should be possible to configure our content plugins so that they directly emit null values instead of running useless git log commands.

In general, we probably want to introduce a global siteConfig.readLastUpdateData callback hook so that it's possible to implement other strategies that are not related to Git.

Here's an example usage based on a pseudo-implementation:

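const siteConfig = {
  // Proposed hook: decide per file how (or whether) to read last update data
  readLastUpdateData: async (filePath, utils) => {
    // Untracked translations (e.g. downloaded from Crowdin): skip Git entirely
    if (filePath.startsWith("i18n")) {
      return null;
    }
    // Generated changelog files: read last update data from the file system
    if (filePath.startsWith("changelog")) {
      return utils.readFromFS(filePath);
    }
    // Default: read from the Git history
    return utils.readFromGit(filePath);
  },
};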

Metadata

Assignees

No one assigned

    Labels

    bug (An error in the Docusaurus core causing instability or issues with its execution)
    domain: performance (Related to bundle size or perf optimization)
