Copilot AI commented Oct 10, 2025

Problem

AI assistants (ChatGPT, Claude, Gemini, etc.) sometimes reference outdated MLflow documentation when answering questions about MLflow features. This happens because AI crawlers index all documentation versions equally, including legacy 1.x and 2.x versions, leading to confusion and incorrect information being provided to users.

Solution

This PR adds a robots.txt file to the MLflow website that optimizes AI crawler behavior by:

  1. Allowing only the latest documentation (/docs/latest/) to be indexed
  2. Disallowing all legacy versions (/docs/1.*/, /docs/2.*/, /docs/0.*/) from being indexed

The robots.txt includes specific configurations for major AI crawlers:

  • OpenAI (ChatGPT, GPTBot)
  • Google Gemini (Google-Extended)
  • Anthropic Claude (ClaudeBot, Claude-Web)
  • Common Crawl (CCBot)
  • Perplexity (PerplexityBot)
  • Cohere (cohere-ai)

Implementation

The robots.txt file is placed in the website/static/ directory, which Docusaurus automatically copies to the root of the built site. The file follows the standard robots.txt format and includes:

User-agent: *
Allow: /docs/latest/
Disallow: /docs/1.*/
Disallow: /docs/2.*/
Disallow: /docs/0.*/

Additional rules are included for each major AI crawler listed above, to ensure maximum compatibility with crawlers that ignore the wildcard user-agent group.
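As an illustration, a per-crawler group might look like the following (GPTBot is OpenAI's published crawler token; this sketch assumes the merged file repeats the same allow/disallow pattern for each bot named above):

```text
# Per-crawler group; the same pattern would be repeated for
# ClaudeBot, Claude-Web, Google-Extended, CCBot, PerplexityBot, etc.
User-agent: GPTBot
Allow: /docs/latest/
Disallow: /docs/1.*/
Disallow: /docs/2.*/
Disallow: /docs/0.*/
```

Note that `*` wildcards inside paths are an extension honored by most major crawlers rather than part of the original robots.txt convention, which is why per-crawler groups help with compatibility.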

Testing

  • ✅ Built the site successfully - robots.txt is correctly copied to build output
  • ✅ Verified robots.txt is accessible at /robots.txt endpoint
  • ✅ TypeScript type checking passes
  • ✅ Added comprehensive test suite in robots.spec.ts to validate robots.txt content
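The contents of the test suite are not shown in this thread; a minimal sketch of what robots.spec.ts-style validation could look like follows, assuming the robots.txt content is available as a string. The parser and assertions here are illustrative plain TypeScript, not the actual test framework code from the PR.

```typescript
// Minimal robots.txt validation sketch: parse the file into
// user-agent groups and check the allow/disallow rules.
interface RobotsRule {
  userAgent: string;
  allow: string[];
  disallow: string[];
}

function parseRobots(content: string): RobotsRule[] {
  const rules: RobotsRule[] = [];
  let current: RobotsRule | null = null;
  for (const raw of content.split("\n")) {
    const line = raw.trim();
    if (line === "" || line.startsWith("#")) continue;
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    switch (field.trim().toLowerCase()) {
      case "user-agent":
        current = { userAgent: value, allow: [], disallow: [] };
        rules.push(current);
        break;
      case "allow":
        current?.allow.push(value);
        break;
      case "disallow":
        current?.disallow.push(value);
        break;
    }
  }
  return rules;
}

// Sample content matching the wildcard group shown above.
const sample = `
User-agent: *
Allow: /docs/latest/
Disallow: /docs/1.*/
Disallow: /docs/2.*/
Disallow: /docs/0.*/
`;

const rules = parseRobots(sample);
console.log(rules.length); // 1 group in this sample
```

A real spec file would load the built robots.txt from disk instead of a string literal and assert that every expected crawler group is present.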

Screenshot

robots.txt in production

The screenshot shows the robots.txt file being served correctly with all AI crawler configurations.
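The copy step can also be simulated locally from the shell. This sketch reproduces the Docusaurus static-asset behavior by hand (the website/static/ path mirrors the layout described above; the build directory name is an assumption):

```shell
# Docusaurus copies everything under static/ verbatim into the build
# root; reproduce that step manually for a quick spot-check.
mkdir -p website/static website/build
printf 'User-agent: *\nAllow: /docs/latest/\nDisallow: /docs/1.*/\n' \
  > website/static/robots.txt
cp website/static/robots.txt website/build/robots.txt
# The served file should contain exactly one Allow rule for latest docs
grep -c '^Allow: /docs/latest/$' website/build/robots.txt
```

In a real deployment the same check can be run against the live endpoint, e.g. by fetching /robots.txt and grepping for the expected rules.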

Impact

After this change is deployed, AI assistants will prioritize the latest MLflow documentation when crawling the site, ensuring users receive accurate and up-to-date information when asking about MLflow features. This addresses the issue where AI assistants sometimes point to older version documentation, causing confusion.

Original prompt

This section details the original issue to resolve

<issue_title>Optimization of AI crawlers for MLflow docs</issue_title>
<issue_description>There is no optimization for AI crawlers in our documentation at the moment. Sometimes, asking an AI assistant about certain features won't yield the correct results due to legacy docs references; in some cases, asking for the docs explicitly will point to older version documentation.

To mitigate this, we should at least add a robots.txt that covers:

  1. Only allow the "latest" documentation.
  2. Disallow all legacy versions (1.x, 2.x) from being indexed.</issue_description>

<agent_instructions>Please fix the issue based on the issue description. If you have other ideas to fix this beyond the scope of this issue, please add them in a separate PR.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes #385



Copilot AI changed the title [WIP] Add robots.txt for AI crawler optimization Add robots.txt to optimize AI crawler indexing for MLflow documentation Oct 10, 2025
Copilot AI requested a review from kevin-lyn October 10, 2025 03:56
@kevin-lyn kevin-lyn marked this pull request as ready for review October 13, 2025 03:41
@kevin-lyn kevin-lyn merged commit 31c4d2e into main Oct 13, 2025
8 of 9 checks passed
@kevin-lyn kevin-lyn deleted the copilot/optimize-ai-crawlers-docs branch October 13, 2025 03:42