Collaborator

@JAORMX JAORMX commented Oct 13, 2025

Summary

This PR adds a comprehensive architecture documentation suite in docs/arch/ covering ToolHive's design, components, and concepts.

Documentation Added

  • 00-overview.md: High-level architecture overview, key components, and platform philosophy
  • 01-deployment-modes.md: Local CLI, UI, and Kubernetes deployment patterns
  • 02-core-concepts.md: Core terminology, abstractions, nouns/verbs, and design patterns
  • 03-transport-architecture.md: MCP transport protocols (stdio, SSE, streamable-http) and proxy architecture
  • 04-secrets-management.md: Secret handling and backend integrations (1Password, encrypted storage)
  • 05-runconfig-and-permissions.md: Configuration schema, permission profiles, and security model
  • 06-registry-system.md: Registry architecture, distribution, and server catalog management
  • 07-groups.md: Group management, virtual MCP servers, and logical organization
  • 08-workloads-lifecycle.md: Workload state management, lifecycle operations, and process model
  • 09-operator-architecture.md: Kubernetes operator design, CRDs, and reconciliation patterns
  • README.md: Navigation guide with quick links and documentation index

🤖 Generated with Claude Code

@JAORMX JAORMX force-pushed the docs/arch branch 5 times, most recently from 54cb6dc to 680158a, on October 13, 2025 10:44

codecov bot commented Oct 13, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 53.36%. Comparing base (012d3b8) to head (c9a2889).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2165      +/-   ##
==========================================
+ Coverage   53.32%   53.36%   +0.03%     
==========================================
  Files         231      231              
  Lines       29529    29529              
==========================================
+ Hits        15747    15757      +10     
+ Misses      12649    12633      -16     
- Partials     1133     1139       +6     

@JAORMX JAORMX changed the title from "Add comprehensive architecture documentation" to "Add architecture documentation" on Oct 13, 2025
@JAORMX JAORMX force-pushed the docs/arch branch 2 times, most recently from 4d6d2bc to 309ebb5, on October 13, 2025 11:26
@ChrisJBurns
Collaborator

Wondering if @danbarr has any thoughts on this, as there seem to be overlaps between the documentation on the docs website and in this repo?

@JAORMX
Collaborator Author

JAORMX commented Oct 13, 2025

@ChrisJBurns These docs serve a different purpose; they're for devs.

@eleftherias
Member

We have some other documentation throughout the codebase that overlaps with this, for example https://github.com/stacklok/toolhive/blob/main/docs/middleware.md.
Should we move them all to this section or keep them spread out?

Also is there an opportunity to split up this PR? It's quite long and dense to fit into my mind at once.

@JAORMX
Collaborator Author

JAORMX commented Oct 15, 2025

@eleftherias I can split it into multiple PRs... but then I'd have broken markdown references and incomplete parts 😕 I figured it might just be worth getting something started and iterating on top of this.

Regarding middleware.md, that's a good idea! We could ditch that one and absorb it into the new arch docs.

Contributor

@jhrozek jhrozek left a comment

Documentation review for docs/arch/04-secrets-management.md: Found 2 technical inaccuracies with suggested fixes.

Contributor

@jhrozek jhrozek left a comment

Documentation Review - Factual Accuracy

Thorough review of architecture documentation for factual accuracy against codebase. Found 14 issues across 4 files:

  • 06-registry-system.md: 6 issues (file paths, annotations, phases, README reference)
  • 07-groups.md: 1 issue (stale PR reference)
  • 08-workloads-lifecycle.md: 3 issues (line numbers, storage paths, label names)
  • 09-operator-architecture.md: 4 issues (filename, annotation, example code, missing controller)

Most issues have inline suggestions that can be applied directly.

Contributor

@jhrozek jhrozek left a comment

Documentation review findings: Found 2 inaccuracies in the Groups documentation that should be corrected for accuracy.

Contributor

@jhrozek jhrozek left a comment

Documentation review findings for registry system architecture doc. Found 13 issues including incorrect CLI flags, wrong CRD field names, non-existent file paths, and incomplete examples. Most have inline suggestions for fixes.

Contributor

@jhrozek jhrozek left a comment

Documentation review findings for 08-workloads-lifecycle.md. Found several inaccuracies in CLI commands, file paths, and label formats. Most issues have inline suggestions for easy fixes.

JAORMX and others added 22 commits October 17, 2025 19:26
This commit introduces a new architectural documentation suite in docs/arch/
that provides in-depth coverage of ToolHive's design, components, and concepts.

The documentation is organized into the following sections:

- 00-overview.md: High-level architecture overview and introduction
- 01-deployment-modes.md: Local CLI, UI, and Kubernetes deployment patterns
- 02-core-concepts.md: Core terminology, abstractions, and design patterns
- 03-transport-architecture.md: MCP transport protocols and proxy architecture
- 04-secrets-management.md: Secret handling and backend integrations
- 05-runconfig-and-permissions.md: Configuration schema and security profiles
- 06-registry-system.md: Registry architecture and distribution
- 07-groups.md: Group management and virtual MCP servers
- 08-workloads-lifecycle.md: Workload state management and operations
- 09-operator-architecture.md: Kubernetes operator design and patterns
- README.md: Navigation guide and documentation index

This documentation serves as the canonical reference for understanding
ToolHive's architecture, making it easier for contributors to navigate
the codebase and for users to understand deployment options.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Juan Antonio Osorio <[email protected]>

Made the following changes based on review comments:

- Fix API version references: point to actual examples instead of inline YAML
- Fix CRD names: ToolConfig → MCPToolConfig, add MCPExternalAuthConfig
- Remove all line number references from code file paths
- Fix CLI commands: registry show → info, group delete → rm
- Remove non-existent CLI commands from documentation
- Fix 1Password implementation details (uses SDK not CLI)
- Point to cmd/thv-operator/ README instead of duplicating info
- Add note that thv-registry-api is moving out of tree

These changes make the documentation more maintainable by reducing
references to implementation details that change frequently and
ensuring all commands and APIs referenced actually exist.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Reduces duplication and improves maintainability of architecture documentation:

- Remove duplicated Core Concepts section from overview, replace with brief summary
- Update stdio flow diagram to show independent stdin/stdout streams more clearly
- Add context for when to use exported configs (sharing, migration, version control)
- Remove Project Structure section to reduce maintenance burden
- Simplify Registry API Server section with note about out-of-tree migration
- Fix persistent volume statement in Kubernetes scaling section

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Change "metrics" to "telemetry" for proxy endpoints clarity
- Clarify stdio session limitations (single connection to container)
- Explain why tool filter vs tool call filter (context optimization)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Use backticks for proper code formatting in attach process documentation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Remove non-existent commands and fix interactive command documentation:
- Remove 'thv group move' (doesn't exist)
- Fix 'thv client setup' description (is interactive, doesn't take client name)
- Update group operations list to match actual CLI

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Soften HA scaling claim (not currently tested)
- Add stdio transport limitation for proxy scaling
- Clarify MCP server scaling applies to SSE/Streamable HTTP transports

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Add note that SSE transport is deprecated in the MCP specification,
though ToolHive continues to support it with potential future transition.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Replace full struct definition with link to pkg/runner/config.go
and categorized field summary to reduce maintenance burden.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Verified against source code and corrected:
- Export command syntax (requires 2 args: workload and path, no stdout)
- Cedar policy format (Client:: not User::, Action::call_tool not "tools/call")
- Group operations (thv list --group, not thv group list <name>)
- File locations (data files in ~/.local/share, state in ~/.local/state)
- Complete socket paths including macOS locations (Podman Machine, Docker Desktop, Rancher)

All changes verified against pkg/authz/cedar.go, cmd/thv/app/export.go,
pkg/container/docker/sdk/client_unix.go, and pkg state/workloads code.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Verified against actual code:
- Scalar UI path is /api/doc not /scalar (pkg/api/docs.go:13, server.go:234)
- Fixed audit event types based on pkg/audit/mcp_events.go (15 total types)
- Corrected mcp_list_operation to actual types: mcp_tools_list, mcp_resources_list, mcp_prompts_list

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Project structure section was removed from overview, update index to match.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Clarify tool-filter and tool-call-filter middleware descriptions
- Separate tool filtering from tool overriding in documentation
- Rename "Filter" section to "Filter and Override" to reflect both operations
- Change "metrics" to "telemetry" for consistency with middleware naming
- Explain that both middlewares work together with shared configuration

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Replace CRD examples with references to examples/operator/mcp-servers/ directory
- Fix export command syntax (thv export requires output path)
- Fix group commands documentation (thv list --group instead of thv group list)
- Refocus groups documentation on architecture rather than CLI usage
- Remove excessive CLI usage examples to reduce maintenance burden

All changes verified against actual codebase implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

- Clarify token storage security in remote authentication (AES-256-GCM encryption)
- Add Kubernetes Mode section to secrets documentation explaining native K8s Secret usage
- Note that Kubernetes uses SecretKeyRef, not the provider system used in CLI mode

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

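For readers unfamiliar with the AES-256-GCM scheme named in the commit above, here is a generic sketch using only Go's standard library. It is not ToolHive's implementation: the function names (`newGCM`, `encrypt`, `decrypt`) and the nonce-prefixed output layout are assumptions chosen for illustration.

```go
package main

import (
	"bytes"
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD and insists on a 32-byte key, which is what
// selects AES-256 (aes.NewCipher also accepts 16- and 24-byte keys).
func newGCM(key []byte) (cipher.AEAD, error) {
	if len(key) != 32 {
		return nil, errors.New("AES-256 requires a 32-byte key")
	}
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// encrypt returns nonce || ciphertext || tag. A fresh random nonce is
// generated for every call; a nonce must never repeat under the same key.
func encrypt(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	// Seal appends the ciphertext and tag to its first argument (the nonce).
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// decrypt splits off the nonce and verifies the authentication tag;
// any modification of the sealed bytes makes Open return an error.
func decrypt(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("sealed value shorter than nonce")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	sealed, err := encrypt(key, []byte("example-token"))
	if err != nil {
		panic(err)
	}
	plain, err := decrypt(key, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", bytes.Equal(plain, []byte("example-token")))
}
```

GCM is an authenticated mode, so a tampered value fails on decryption instead of yielding garbage plaintext. How the key itself is stored or derived is out of scope for this sketch.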
Add a new section to CLAUDE.md instructing agents to update
architecture documentation when making code changes. Includes
a mapping table of code areas to documentation files and
guidelines for keeping docs in sync with implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Fix all 12 unresolved review comments by improving architectural focus:

- Remove CLI command examples, focus on architectural concepts
- Update file path references to actual implementation files
- Fix middleware type name from 'authz' to 'authorization'
- Organize RunConfig fields by architectural categories
- Simplify audit events to categories instead of exhaustive list
- Simplify request flow diagram and reference middleware.md
- Correct file paths for registry, session, client, MCP, audit, monitor, healthcheck

These changes align the documentation with architectural best practices:
focusing on concepts, patterns, and system design rather than CLI
usage or exhaustive implementation details.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Address PR feedback by removing CLI examples and correcting technical details:

- Remove all CLI command examples (architecture docs should focus on design, not usage)
- Fix container monitor path: pkg/container/docker/monitor.go (not pkg/container/monitor.go)
- Correct OAuth token storage: tokens managed in-memory by TokenSource, not persisted
- Clarify MCP_HOST: defaults to 127.0.0.1 locally, 0.0.0.0 in Kubernetes
- Replace CLI examples with architectural descriptions of concepts
- Update port management to describe architecture, not command flags
- Document TokenSource pattern and client credential storage distinction

These changes align documentation with actual implementation and follow
architecture documentation best practices.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

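The MCP_HOST clarification in the commit above (127.0.0.1 locally, 0.0.0.0 in Kubernetes) can be pictured with a small Go sketch. Everything here is illustrative: `defaultMCPHost` and `runningInKubernetes` are hypothetical names, and detecting Kubernetes through the `KUBERNETES_SERVICE_HOST` variable (which Kubernetes injects into pods) is an assumption about one possible approach, not a description of how ToolHive decides.

```go
package main

import (
	"fmt"
	"os"
)

const (
	localHost      = "127.0.0.1" // loopback only: reachable from the same machine
	kubernetesHost = "0.0.0.0"   // all interfaces: reachable from other pods
)

// defaultMCPHost returns the default bind address for the given environment.
// Hypothetical helper that mirrors the defaults stated in the commit message.
func defaultMCPHost(inKubernetes bool) string {
	if inKubernetes {
		return kubernetesHost
	}
	return localHost
}

// runningInKubernetes reports whether KUBERNETES_SERVICE_HOST is set.
// This detection method is an assumption made for the sketch.
func runningInKubernetes() bool {
	_, ok := os.LookupEnv("KUBERNETES_SERVICE_HOST")
	return ok
}

func main() {
	fmt.Printf("MCP_HOST=%s\n", defaultMCPHost(runningInKubernetes()))
}
```

Binding to loopback by default keeps a locally run server off the network, while a pod has to listen on all interfaces for a Service or sidecar to reach it.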
Address final round of PR feedback by removing CLI examples and correcting technical details:

- Remove all CLI command examples from architecture docs
- Fix 1Password implementation: SDK not CLI (diagram and text)
- Add missing secret providers: environment and none
- Document Environment provider security: ListSecrets disabled for security
- Correct environment variable merge order with architectural reasoning
- Fix Windows path handling: allowed as host paths only, not container paths
- Replace export/import CLI examples with architectural descriptions
- Update permission auditing, network isolation, secrets management sections
- Remove CLI flags from custom profiles section

All changes verified by toolhive-expert agent. Documentation now focuses on
architectural concepts and design patterns rather than CLI usage.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Fix architecture documentation inaccuracies identified in code review:

Registry System (06):
- Update file references to actual provider implementation files
- Remove reference to non-existent README
- Fix annotation keys to use correct toolhive.stacklok.dev domain
- Correct MCPRegistry phases (remove Degraded, add Terminating)
- Fix YAML examples (apiVersion, Git repository field, sync policy)
- Remove incomplete OAuth example
- Update CLI flags to match actual implementation
- Remove reference to non-existent converter command
- Simplify architecture diagram to reflect actual implementation

Groups (07):
- Clarify group move functionality is internal only
- Add note about empty default registry groups
- Remove stale PR reference, use generic description

Workloads Lifecycle (08):
- Remove all line number references per documentation guidelines
- Fix storage paths to match XDG directory structure
- Correct label format to simple prefix style

Operator Architecture (09):
- Fix MCPExternalAuthConfig filename reference
- Add missing controller reference
- Remove incorrect StatusCollector example code
- Fix sync trigger annotation key

All changes verified against actual code implementation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Signed-off-by: Juan Antonio Osorio <[email protected]>

Remove CLI-focused content and maintain architecture focus:
- Fix state transition: container exit goes to stopped (was already correct in diagram)
- Remove non-existent update command section
- Remove CLI examples from List section, describe architecture instead
- Rename 'Async Operations' to 'Batch Operations' for clarity
- Remove CLI flags from filtering, describe capability architecturally
- Expand label descriptions with purpose/meaning

Architecture docs should describe system design, not CLI usage.
Verified against pkg/workloads/manager.go and pkg/container/runtime/types.go

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

4 participants