Copilot AI commented Oct 10, 2025

Problem

AI assistants (ChatGPT, Claude, Gemini, etc.) sometimes reference outdated MLflow documentation when answering questions about MLflow features. This happens because AI crawlers index all documentation versions equally, including legacy 1.x and 2.x versions, leading to confusion and incorrect information being provided to users.

Solution

This PR adds a robots.txt file to the MLflow website that optimizes AI crawler behavior by:

  1. Allowing only the latest documentation (/docs/latest/) to be indexed
  2. Disallowing all legacy versions (/docs/1.*/, /docs/2.*/, /docs/0.*/) from being indexed

The robots.txt includes specific configurations for major AI crawlers:

  • OpenAI (ChatGPT, GPTBot)
  • Google Gemini (Google-Extended)
  • Anthropic Claude (ClaudeBot, Claude-Web)
  • Common Crawl (CCBot)
  • Perplexity (PerplexityBot)
  • Cohere (cohere-ai)

Implementation

The robots.txt file is placed in the website/static/ directory, which Docusaurus automatically copies to the root of the built site. The file follows the standard robots.txt format and includes:

User-agent: *
Allow: /docs/latest/
Disallow: /docs/1.*/
Disallow: /docs/2.*/
Disallow: /docs/0.*/

Additional rules are included for each major AI crawler listed above, to ensure maximum compatibility with crawlers that ignore the wildcard user-agent group.
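As an illustration, a per-crawler group might look like the following (GPTBot is OpenAI's published crawler token; this sketch assumes the merged file repeats the same allow/disallow pattern for each bot named above):

```text
# Per-crawler group; the same pattern would be repeated for
# ClaudeBot, Claude-Web, Google-Extended, CCBot, PerplexityBot, etc.
User-agent: GPTBot
Allow: /docs/latest/
Disallow: /docs/1.*/
Disallow: /docs/2.*/
Disallow: /docs/0.*/
```

Note that `*` wildcards inside paths are an extension honored by most major crawlers rather than part of the original robots.txt convention, which is why per-crawler groups help with compatibility.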

Testing

  • ✅ Built the site successfully - robots.txt is correctly copied to build output
  • ✅ Verified robots.txt is accessible at /robots.txt endpoint
  • ✅ TypeScript type checking passes
  • ✅ Added comprehensive test suite in robots.spec.ts to validate robots.txt content
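The contents of the test suite are not shown in this thread; a minimal sketch of what robots.spec.ts-style validation could look like follows, assuming the robots.txt content is available as a string. The parser and assertions here are illustrative plain TypeScript, not the actual test framework code from the PR.

```typescript
// Minimal robots.txt validation sketch: parse the file into
// user-agent groups and check the allow/disallow rules.
interface RobotsRule {
  userAgent: string;
  allow: string[];
  disallow: string[];
}

function parseRobots(content: string): RobotsRule[] {
  const rules: RobotsRule[] = [];
  let current: RobotsRule | null = null;
  for (const raw of content.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    switch (field.trim().toLowerCase()) {
      case "user-agent":
        current = { userAgent: value, allow: [], disallow: [] };
        rules.push(current);
        break;
      case "allow":
        current?.allow.push(value);
        break;
      case "disallow":
        current?.disallow.push(value);
        break;
    }
  }
  return rules;
}

// Sample content matching the wildcard group shown above.
const sample = `
User-agent: *
Allow: /docs/latest/
Disallow: /docs/1.*/
Disallow: /docs/2.*/
Disallow: /docs/0.*/
`;

const rules = parseRobots(sample);
console.log(rules.length); // 1 group in this sample
```

A real spec file would load the built robots.txt from disk instead of a string literal and assert that every expected crawler group is present.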

Screenshot

robots.txt in production

The screenshot shows the robots.txt file being served correctly with all AI crawler configurations.
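The copy step can also be simulated locally from the shell. This sketch reproduces the Docusaurus static-asset behavior by hand (the website/static/ path mirrors the layout described above; the build directory name is an assumption):

```shell
# Docusaurus copies everything under static/ verbatim into the build
# root; reproduce that step manually for a quick spot-check.
mkdir -p website/static website/build
printf 'User-agent: *\nAllow: /docs/latest/\nDisallow: /docs/1.*/\n' \
  > website/static/robots.txt
cp website/static/robots.txt website/build/robots.txt
# The served file should contain exactly one Allow rule for latest docs
grep -c '^Allow: /docs/latest/$' website/build/robots.txt
```

In a real deployment the same check can be run against the live endpoint, e.g. by fetching /robots.txt and grepping for the expected rules.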

Impact

After this change is deployed, AI assistants will prioritize the latest MLflow documentation when crawling the site, ensuring users receive accurate and up-to-date information when asking about MLflow features. This addresses the issue where AI assistants sometimes point to older version documentation, causing confusion.

Original prompt

This section details the original issue to resolve

<issue_title>Optimization of AI crawlers for MLflow docs</issue_title>
<issue_description>There is no optimization for AI crawlers in our documentation at the moment. Sometimes, asking an AI assistant about certain features won't yield the correct results due to legacy docs references; in some cases, asking for the docs explicitly will point to older version documentation.

To mitigate this, we should at least add a robots.txt that covers:

  1. Only allow the "latest" documentation.
  2. Disallow all legacy versions (1.x, 2.x) from being indexed.</issue_description>

<agent_instructions>Please fix the issue based on the issue description. If you have other ideas to fix this beyond the scope of this issue, please add them in a separate PR.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes #385



Copilot AI changed the title [WIP] Add robots.txt for AI crawler optimization Add robots.txt to optimize AI crawler indexing for MLflow documentation Oct 10, 2025
Copilot AI requested a review from kevin-lyn October 10, 2025 03:56
@kevin-lyn kevin-lyn marked this pull request as ready for review October 13, 2025 03:41
@kevin-lyn kevin-lyn merged commit 31c4d2e into main Oct 13, 2025
8 of 9 checks passed
@kevin-lyn kevin-lyn deleted the copilot/optimize-ai-crawlers-docs branch October 13, 2025 03:42