Skip to content

Conversation

JAORMX
Copy link
Collaborator

@JAORMX JAORMX commented Sep 25, 2025

Summary

This PR addresses issue #2013 by significantly reducing the MCPServer CRD size from ~9500 lines to 651 lines (93% reduction).

Problem

The MCPServer CRD was too large (~9500 lines) to apply without server-side apply due to the embedded PodTemplateSpec taking up ~8500 lines. This was causing deployment issues as reported in #2013.

Solution

Changed the PodTemplateSpec field from a strongly-typed `corev1.PodTemplateSpec` to `runtime.RawExtension`, which stores the raw JSON without schema validation at the CRD level.

Key Benefits

  • 93% CRD size reduction (from ~9500 to 651 lines)
  • Full backwards compatibility - users can still use the same YAML structure
  • Runtime validation - validation now happens at runtime in the operator
  • Proper error handling - via Kubernetes events and status conditions

Changes Made

  • Modified MCPServer type to use `runtime.RawExtension` for PodTemplateSpec
  • Updated PodTemplateSpecBuilder to unmarshal and validate at runtime
  • Added event recording and status conditions for validation errors
  • Added comprehensive tests for invalid PodTemplateSpec scenarios
  • Fixed race conditions in parallel tests

Testing

Added comprehensive test coverage including:

  • Invalid PodTemplateSpec validation tests
  • Integration tests for error scenarios
  • Race condition fixes in parallel test execution

Breaking Changes

None - this change maintains full backwards compatibility.

Fixes #2013

@codecov
Copy link

codecov bot commented Sep 25, 2025

Codecov Report

❌ Patch coverage is 79.64072% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.57%. Comparing base (ac8f112) to head (34b57e1).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...d/thv-operator/controllers/mcpserver_controller.go 74.61% 25 Missing and 8 partials ⚠️
cmd/thv-operator/main.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2015      +/-   ##
==========================================
+ Coverage   53.43%   53.57%   +0.13%     
==========================================
  Files         238      238              
  Lines       30572    30658      +86     
==========================================
+ Hits        16336    16424      +88     
+ Misses      13068    13064       -4     
- Partials     1168     1170       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@JAORMX JAORMX force-pushed the fix/mcpserver-crd-size-reduction branch from bf4a82f to c22bf13 Compare September 29, 2025 13:41
jhrozek
jhrozek previously approved these changes Sep 29, 2025
@jhrozek
Copy link
Contributor

jhrozek commented Sep 29, 2025

aah, I approved based on reading the code :-)

But in general the approach of not reconciling again if the podTemplateSpec is bad seems good to me.

@JAORMX JAORMX force-pushed the fix/mcpserver-crd-size-reduction branch 4 times, most recently from 4d28669 to cd7f211 Compare September 30, 2025 11:43
@JAORMX JAORMX requested a review from jhrozek September 30, 2025 11:46
@JAORMX JAORMX force-pushed the fix/mcpserver-crd-size-reduction branch 3 times, most recently from 0401195 to ac354b1 Compare October 1, 2025 09:31
@JAORMX JAORMX force-pushed the fix/mcpserver-crd-size-reduction branch 3 times, most recently from 160cd36 to 58d7698 Compare October 14, 2025 13:45
@JAORMX JAORMX force-pushed the fix/mcpserver-crd-size-reduction branch 2 times, most recently from 0831563 to 8a1d9d2 Compare October 16, 2025 08:21
…teSpec

The MCPServer CRD was too large (~9500 lines) to apply without server-side
apply due to the embedded PodTemplateSpec taking up ~8500 lines. This was
causing issues as reported in GitHub issue #2013.

Changed the PodTemplateSpec field from a strongly-typed corev1.PodTemplateSpec
to runtime.RawExtension, which stores the raw JSON without schema validation
at the CRD level. This reduces the CRD size from ~9500 lines to 651 lines
(93% reduction).

The solution maintains full backwards compatibility - users can still use
the same YAML structure. Validation now happens at runtime in the operator,
with proper error handling via Kubernetes events and status conditions to
notify users when invalid PodTemplateSpec data is provided.

Key changes:
- Modified MCPServer type to use runtime.RawExtension for PodTemplateSpec
- Updated PodTemplateSpecBuilder to unmarshal and validate at runtime
- Added event recording and status conditions for validation errors
- Added comprehensive tests for invalid PodTemplateSpec scenarios
- Fixed race conditions in parallel tests

Fixes #2013
@JAORMX JAORMX force-pushed the fix/mcpserver-crd-size-reduction branch from 53cc1aa to 33b8094 Compare October 22, 2025 12:24
The CRD has been modified (MCPServer.spec.podTemplateSpec changed to
runtime.RawExtension), so the chart version needs to be bumped per
CI requirements.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@JAORMX
Copy link
Collaborator Author

JAORMX commented Oct 22, 2025

@claude please do a thorough review of this PR.

@claude
Copy link
Contributor

claude bot commented Oct 22, 2025

Claude finished @JAORMX's task —— View job


Thorough PR Review

I've completed a comprehensive review of this PR. Here are my findings:

Review Summary

This PR successfully reduces the MCPServer CRD size from ~9,500 lines to 678 lines (93% reduction) by changing the PodTemplateSpec field from a strongly-typed corev1.PodTemplateSpec to runtime.RawExtension. The approach is sound and includes proper validation, error handling, and test coverage.


Major Strengths ✅

1. Excellent Problem-Solution Fit

  • The CRD size reduction is dramatic and addresses the immediate issue (The MCPServers CRD is too big #2013)
  • Using runtime.RawExtension is the correct approach for this problem
  • The +kubebuilder:pruning:PreserveUnknownFields marker is appropriately used

2. Robust Error Handling

The validation logic in cmd/thv-operator/controllers/mcpserver_controller.go:474-529 is well-designed:

  • Early validation prevents invalid specs from propagating
  • Proper status conditions (PodTemplateValid) with clear reasons
  • Event recording for user visibility
  • Phase set to Failed to clearly indicate the problem
  • Returns without error to avoid infinite retry loops

3. Strong Test Coverage

Three levels of testing:

  • Unit tests (mcpserver_podtemplatespec_builder_test.go) - comprehensive coverage of builder logic
  • Controller unit tests (mcpserver_invalid_podtemplate_reconcile_test.go) - validates reconciliation behavior
  • Integration tests (mcpserver_controller_integration_test.go) - end-to-end validation

4. Backward Compatibility Maintained

Users can continue using the same YAML structure - the change is transparent to them.


Issues and Concerns 🔴

1. Security Concern: Validation Gap ⚠️

Location: cmd/thv-operator/controllers/mcpserver_controller.go:880-906

The code handles invalid PodTemplateSpec gracefully by logging and continuing:

builder, err := NewMCPServerPodTemplateSpecBuilder(m.Spec.PodTemplateSpec)
if err != nil {
    ctxLogger := log.FromContext(ctx)
    ctxLogger.Error(err, "Invalid PodTemplateSpec in MCPServer spec, continuing without customizations")
    // Continue without pod template patch - the deployment will still work
}

Problem: This creates a critical security issue. If a user specifies a PodTemplateSpec with security hardening (e.g., runAsNonRoot, readOnlyRootFilesystem, restricted capabilities), but makes a typo in the JSON, the operator silently ignores it and deploys the pod without those security constraints.

Example Attack Scenario:

# User intends to enforce security
podTemplateSpec:
  spec:
    containers:
    - name: mcp
      securityContext:
        runAsNonRoot: true
        readOnlyRootFilesystem: true
    # Accidentally adds invalid field
    invalidField: "typo"

The deployment would proceed without runAsNonRoot or readOnlyRootFilesystem, potentially allowing privilege escalation.

Impact: High - users may believe they have security protections that aren't actually applied.

Recommendation:

  1. The validation in the reconcile loop (line 184) correctly blocks deployment
  2. But the deploymentForMCPServer function should fail hard if validation fails, not silently continue
  3. Consider adding a check at the beginning of deploymentForMCPServer:
// Validate PodTemplateSpec early - fail if invalid
if m.Spec.PodTemplateSpec != nil {
    _, err := NewMCPServerPodTemplateSpecBuilder(m.Spec.PodTemplateSpec)
    if err != nil {
        // This should never happen if reconcile validation works correctly
        ctxLogger.Error(err, "CRITICAL: Invalid PodTemplateSpec passed validation")
        return nil // or panic in development
    }
}

This was partially addressed by the early validation in reconcile (line 184), but there's still a defensive programming issue in deploymentForMCPServer.

2. Documentation Gap

Location: cmd/thv-operator/api/v1alpha1/mcpserver_types.go:98-106

The comment says "This field accepts a PodTemplateSpec object as JSON/YAML" but doesn't explain:

  • What happens if the spec is invalid (deployment blocked, condition set)
  • The validation is runtime, not at CRD admission time
  • How users can check if their spec is valid

Recommendation: Enhance the godoc comment:

// PodTemplateSpec defines the pod template to use for the MCP server
// This allows for customizing the pod configuration beyond what is provided by the other fields.
// Note that to modify the specific container the MCP server runs in, you must specify
// the `mcp` container name in the PodTemplateSpec.
// This field accepts a PodTemplateSpec object as JSON/YAML.
//
// IMPORTANT: Validation occurs at runtime during reconciliation. If the PodTemplateSpec
// is invalid, the MCPServer will transition to Failed phase and deployment will be blocked
// until the spec is corrected. Check the PodTemplateValid condition in status for validation results.
//
// +optional
// +kubebuilder:pruning:PreserveUnknownFields
// +kubebuilder:validation:Type=object
PodTemplateSpec *runtime.RawExtension `json:"podTemplateSpec,omitempty"`

3. Missing Condition in API Types

Location: cmd/thv-operator/api/v1alpha1/mcpserver_types.go:8-37

The code adds a new condition type PodTemplateValid, but it's not documented in the constants section alongside ConditionImageValidated and ConditionGroupRefValidated.

Recommendation: Add to the constants:

const (
    // ConditionImageValidated indicates whether this image is fine to be used
    ConditionImageValidated = "ImageValidated"

    // ConditionGroupRefValidated indicates whether the GroupRef is valid
    ConditionGroupRefValidated = "GroupRefValidated"

    // ConditionPodTemplateValid indicates whether the PodTemplateSpec is valid
    ConditionPodTemplateValid = "PodTemplateValid"
)

And add reason constants:

const (
    // ConditionReasonPodTemplateValid indicates PodTemplateSpec validation succeeded
    ConditionReasonPodTemplateValid = "ValidPodTemplateSpec"
    
    // ConditionReasonPodTemplateInvalid indicates PodTemplateSpec validation failed
    ConditionReasonPodTemplateInvalid = "InvalidPodTemplateSpec"
)

Then use these constants instead of string literals in the controller code.

4. Test Coverage Gap (Codecov)

The codecov report shows 60.43% patch coverage with 55 lines missing coverage, primarily in:

  • cmd/thv-operator/controllers/mcpserver_controller.go (49 lines missing)

However, reviewing the test files shows good coverage of the validation logic. The missing coverage appears to be:

  • Error paths in deploymentForMCPServer when marshaling fails (line 898-902)
  • Some status update error handling paths

Recommendation: This is acceptable for this PR - the critical paths are covered. Consider adding tests for the marshaling error path in a follow-up.


Minor Issues 🟡

5. Inconsistent Error Message Formatting

Location: cmd/thv-operator/controllers/mcpserver_controller.go:488

The error message in the event uses %v which will include the full error chain:

"Failed to parse PodTemplateSpec: %v. Deployment blocked until PodTemplateSpec is fixed."

For json.Unmarshal errors, this produces user-unfriendly messages like:

Failed to parse PodTemplateSpec: failed to unmarshal PodTemplateSpec: json: cannot unmarshal string into Go struct field PodSpec.spec.containers of type []v1.Container. Deployment blocked until PodTemplateSpec is fixed.

Recommendation: Consider wrapping with a cleaner message or using %s for just the error message:

"Invalid PodTemplateSpec syntax. Please verify the JSON/YAML structure matches Kubernetes PodTemplateSpec format. Error: %v"

6. Magic String "PodTemplateValid"

Location: Multiple files

The condition type "PodTemplateValid" is used as a string literal in tests and controller code. This should be a constant (see issue #3).

7. Comment Inconsistency

Location: cmd/thv-operator/controllers/mcpserver_controller.go:884

The comment says "continuing without customizations" but this could be misleading - the deployment might not work correctly if the user was relying on those customizations.


Architecture and Design 🏗️

Positive Aspects:

  1. Follows Operator Design Guidelines

    • Aligns with cmd/thv-operator/DESIGN.md principles
    • PodTemplateSpec is correctly used for infrastructure concerns
    • Business logic (image, transport, secrets) remains as dedicated CRD attributes
  2. Clean Builder Pattern

    • MCPServerPodTemplateSpecBuilder is well-designed
    • Proper separation of concerns
    • Validates early in constructor
  3. Proper Status Management

    • Uses standard Kubernetes condition types
    • Sets appropriate phase transitions
    • Records events for observability

Concerns:

  1. Runtime vs Admission Time Validation

    • Moving validation from CRD admission (via OpenAPI schema) to runtime validation means users get slower feedback
    • However, this is an acceptable trade-off for the 93% size reduction
    • The condition and event system provides clear feedback
  2. Loss of Type Safety

    • runtime.RawExtension removes compile-time type checking
    • This is mitigated by runtime validation
    • Tests provide good coverage

Code Quality 📝

Strengths:

  • Well-organized test files with clear test names
  • Good use of table-driven tests
  • Proper use of t.Parallel() for test parallelization
  • Helper functions (podTemplateSpecToRawExtension, findMCPContainer) improve readability

Suggestions:

  1. Consider extracting the validation logic into a dedicated validator type for easier testing
  2. The deploymentForMCPServer function is already marked //nolint:gocyclo - this PR adds complexity. Consider refactoring in a follow-up.

Review of Previous Comments

Review Comment from ChrisJBurns (2025-10-15):

"Could we do a Warn here maybe if we're worried about infinite retries from error?"

Status: ✅ Addressed - The author correctly responded that Warn is not available with this logger and removed unnecessary log messages. The event recorder is sufficient.

Review Comment from ChrisJBurns (2025-10-15):

"We can probably put these into the cmd/thv-operator/controllers/mcpserver_podtemplatespec_builder_test.go file?"

Status: ✅ Addressed - Tests were moved to integration tests instead (even better).

Review Comment from ChrisJBurns (2025-10-15):

"I think we can probably avoid doing the heavyweight e2e tests and instead do them in integration tests."

Status: ✅ Addressed - E2E chainsaw tests were removed in favor of integration tests in mcpserver_controller_integration_test.go.


Recommendations Summary

Must Fix (Security/Correctness):

  1. 🔴 Critical: Review the error handling in deploymentForMCPServer (line 880-906) to ensure invalid specs don't silently deploy without security constraints
  2. 🟡 Add condition type constants to API types for consistency

Should Fix (Best Practices):

  1. Enhance documentation in the PodTemplateSpec field comment
  2. Improve error message formatting for better UX
  3. Replace magic string literals with constants

Nice to Have:

  1. Add additional test coverage for error paths
  2. Consider refactoring deploymentForMCPServer to reduce complexity

Final Verdict ✅

This is a well-executed PR that solves a real problem with a sound technical approach. The validation logic is robust, test coverage is good, and the design aligns with operator best practices.

Primary concern: The security implication of silently continuing when PodTemplateSpec validation fails in deploymentForMCPServer. This is partially mitigated by the early validation in the reconcile loop, but it's a defensive programming issue that should be addressed.

Recommendation: Approve after addressing the critical security concern (issue #1) and adding the condition constants (issue #3). The other issues can be addressed in follow-up PRs.


Code References for Review

  • API Changes: cmd/thv-operator/api/v1alpha1/mcpserver_types.go:106
  • Validation Logic: cmd/thv-operator/controllers/mcpserver_controller.go:474-529
  • Builder Implementation: cmd/thv-operator/controllers/mcpserver_podtemplatespec_builder.go:20-40
  • Reconcile Integration: cmd/thv-operator/controllers/mcpserver_controller.go:184-188
  • Deployment Creation: cmd/thv-operator/controllers/mcpserver_controller.go:880-906
  • CRD Definition: deploy/charts/operator-crds/crds/toolhive.stacklok.dev_mcpservers.yaml

Job: https://github.com/stacklok/toolhive/actions/runs/UNKNOWN
Branch: fix/mcpserver-crd-size-reduction

@JAORMX JAORMX requested review from ChrisJBurns and Copilot October 22, 2025 12:30
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR reduces the MCPServer CRD size by 93% (from ~9,500 to 651 lines) by changing the PodTemplateSpec field from a strongly-typed corev1.PodTemplateSpec to runtime.RawExtension. This resolves deployment issues while maintaining full backwards compatibility through runtime validation.

Key Changes:

  • Modified PodTemplateSpec field type to runtime.RawExtension for schema flexibility
  • Added runtime validation with proper error handling via events and status conditions
  • Updated builder pattern to handle unmarshalling and validation of raw JSON/YAML

Reviewed Changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
cmd/thv-operator/api/v1alpha1/mcpserver_types.go Changed PodTemplateSpec field from corev1.PodTemplateSpec to runtime.RawExtension
cmd/thv-operator/api/v1alpha1/zz_generated.deepcopy.go Updated generated deep copy code for the new type
cmd/thv-operator/controllers/mcpserver_controller.go Added validation logic, event recorder, and error handling for invalid PodTemplateSpec
cmd/thv-operator/controllers/mcpserver_podtemplatespec_builder.go Updated builder to unmarshal RawExtension and return errors on invalid specs
cmd/thv-operator/controllers/mcpserver_podtemplatespec_builder_test.go Added comprehensive tests for invalid spec scenarios
cmd/thv-operator/controllers/mcpserver_invalid_podtemplate_reconcile_test.go New test file for reconciler behavior with invalid specs
cmd/thv-operator/controllers/mcpserver_runconfig.go Updated Vault detection to unmarshal RawExtension
cmd/thv-operator/controllers/mcpserver_pod_template_test.go Refactored to use helper function for RawExtension conversion
cmd/thv-operator/controllers/helpers_test.go New helper function for test PodTemplateSpec conversion
cmd/thv-operator/main.go Added event recorder initialization
test/e2e/chainsaw/operator/single-tenancy/test-scenarios/valid-podtemplatespec/* New e2e test suite for valid PodTemplateSpec scenarios
docs/operator/crd-api.md Updated documentation to reflect new field type
deploy/charts/operator-crds/Chart.yaml Bumped chart version to 0.0.40

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Fixes based on Copilot and reviewer feedback:
- Fix DeepCopy call to use pointer: (&userTemplate).DeepCopy()
- Remove duplicate test assertions in pod template tests
- Remove redundant PodTemplateSpec check in deploymentNeedsUpdate
- Remove ~200 lines of dead code (non-ConfigMap mode paths)
- Remove 5 unused functions (getProxyHost, generateOIDCArgs, etc.)
- Improve isEmpty() to check all meaningful PodTemplateSpec fields

All tests passing and linter clean.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@JAORMX JAORMX requested a review from Copilot October 22, 2025 13:00
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 20 out of 21 changed files in this pull request and generated 3 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

JAORMX and others added 3 commits October 22, 2025 16:46
Define constants for PodTemplateSpec validation condition types and
reasons in the API types file, following the pattern established for
ImageValidated and GroupRefValidated conditions.

Changes:
- Add ConditionPodTemplateValid constant
- Add ConditionReasonPodTemplateValid and ConditionReasonPodTemplateInvalid
- Update controller code to use constants instead of string literals
- Update tests to use constants

This improves maintainability and follows Kubernetes API conventions.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
The Watches() handler for MCPExternalAuthConfig was accidentally removed
during code cleanup. This handler is critical for triggering MCPServer
reconciliation when the referenced external auth config changes.

Without this handler, MCPServers would not be updated when their
ExternalAuthConfigRef changes, breaking the token exchange integration.

Changes:
- Restore externalAuthConfigHandler in SetupWithManager
- Add Watches() call for MCPExternalAuthConfig
- Add missing imports: handler and reconcile packages

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Update the comment in deploymentForMCPServer to clarify that the
PodTemplateSpec validation error handling is unreachable in normal flow
because early validation in Reconcile() blocks invalid specs.

The error handling at line 881-885 is defense-in-depth: it provides a
safety net if validation is somehow bypassed or refactored in the future.

This addresses confusion about whether invalid PodTemplateSpec can
bypass security constraints (it cannot - early validation blocks it).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@JAORMX
Copy link
Collaborator Author

JAORMX commented Oct 22, 2025

@claude can you remove the e2e tests from this PR? We don't need those right now for this feature

@claude
Copy link
Contributor

claude bot commented Oct 22, 2025

Claude finished @JAORMX's task —— View job


Removing E2E Tests

I've successfully removed the e2e tests from this PR as requested.

Todo List:

  • Identify all e2e test files to remove
  • Remove e2e test files using git rm
  • Commit changes with proper attribution
  • Push changes to PR branch

Summary:

Removed the entire test/e2e/chainsaw/operator/single-tenancy/test-scenarios/valid-podtemplatespec/ directory containing 5 e2e test files:

  • assert-deployment-exists.yaml
  • assert-mcpserver-running.yaml
  • assert-mcpserver-valid-condition.yaml
  • chainsaw-test.yaml
  • mcpserver-valid.yaml

The integration tests in cmd/thv-operator/test-integration/mcp-server/mcpserver_controller_integration_test.go already provide sufficient coverage for PodTemplateSpec validation, making these e2e tests redundant.

Changes have been committed and pushed to the fix/mcpserver-crd-size-reduction branch.


Branch: fix/mcpserver-crd-size-reduction

Integration tests provide sufficient coverage for this feature,
making the e2e tests redundant.

Co-authored-by: Juan Antonio Osorio <[email protected]>
ChrisJBurns
ChrisJBurns previously approved these changes Oct 22, 2025
@ChrisJBurns
Copy link
Collaborator

@jhrozek one more green tick and I think we should be good to go

@ChrisJBurns ChrisJBurns enabled auto-merge (squash) October 22, 2025 20:19
@ChrisJBurns ChrisJBurns disabled auto-merge October 22, 2025 20:19
@ChrisJBurns ChrisJBurns enabled auto-merge (squash) October 22, 2025 20:29
@ChrisJBurns ChrisJBurns merged commit f145cfc into main Oct 22, 2025
30 checks passed
@ChrisJBurns ChrisJBurns deleted the fix/mcpserver-crd-size-reduction branch October 22, 2025 22:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

The MCPServers CRD is too big

3 participants