
Conversation

@tgrunnagle commented Oct 17, 2025

Implements MCPGroup types and k8s controller per [this proposal](https://github.com/stacklok/toolhive/blob/73707af47db039f51b2bda07d1ef8eccb25ba819/docs/proposals/kubernetes-mcpgroup-crd.md), tracked by #2234.

The MCP group controller reconciles MCPGroup resources and watches for changes to MCPServer resources. When the latter happens, it requests reconciliation of the server's referenced MCPGroup.
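
For context, the cross-resource wiring typically looks something like the sketch below. This is not the PR's code, just a minimal illustration assuming controller-runtime v0.15+ and a plain-string `Spec.GroupRef` field; the `findMCPGroupForMCPServer` mapping function is named in the review further down:

```go
// Sketch: reconcile MCPGroups and map MCPServer events back to the
// group they reference, so a server change re-triggers its group.
func (r *MCPGroupReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&mcpv1alpha1.MCPGroup{}).
        Watches(&mcpv1alpha1.MCPServer{},
            handler.EnqueueRequestsFromMapFunc(r.findMCPGroupForMCPServer)).
        Complete(r)
}

// findMCPGroupForMCPServer maps an MCPServer event to a reconcile
// request for its referenced group (same-namespace references only).
func (r *MCPGroupReconciler) findMCPGroupForMCPServer(_ context.Context, obj client.Object) []reconcile.Request {
    server, ok := obj.(*mcpv1alpha1.MCPServer)
    if !ok || server.Spec.GroupRef == "" {
        return nil
    }
    return []reconcile.Request{{
        NamespacedName: types.NamespacedName{
            Namespace: server.Namespace,
            Name:      server.Spec.GroupRef,
        },
    }}
}
```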

The MCP server controller checks the groupRef during reconciliation and sets status conditions accordingly (no fatal error if the groupRef is invalid).
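
The check itself can be a same-namespace Get plus a condition update; a minimal sketch (the `GroupRefValidated` condition type is mentioned in the review below, while the reason strings and exact shape here are assumptions):

```go
// Sketch: verify the referenced MCPGroup exists in the same namespace
// and record the outcome as a status condition; failure is non-fatal.
func (r *MCPServerReconciler) validateGroupRef(ctx context.Context, server *mcpv1alpha1.MCPServer) {
    if server.Spec.GroupRef == "" {
        return // groupRef is optional; no condition is set when omitted
    }
    cond := metav1.Condition{
        Type:               "GroupRefValidated",
        Status:             metav1.ConditionTrue,
        Reason:             "GroupFound", // assumed reason strings
        ObservedGeneration: server.Generation,
    }
    key := types.NamespacedName{Namespace: server.Namespace, Name: server.Spec.GroupRef}
    if err := r.Get(ctx, key, &mcpv1alpha1.MCPGroup{}); err != nil {
        cond.Status = metav1.ConditionFalse
        cond.Reason = "GroupNotFound"
        cond.Message = err.Error()
    }
    meta.SetStatusCondition(&server.Status.Conditions, cond)
}
```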

Generated code and manifests.
```console
toolhive % task operator-generate
toolhive % task operator-manifests
```

Added unit, integration, and e2e (chainsaw) tests.
```console
toolhive % task operator-test
toolhive % task operator-test-integration
toolhive % task operator-e2e-test
```

Many added lines are generated (CRD) or test cases.
@tgrunnagle linked an issue Oct 17, 2025 that may be closed by this pull request
@JAORMX (Collaborator) commented Oct 20, 2025

@claude please review this and take into account kubernetes best practices while you're at it.

claude bot (Contributor) commented Oct 20, 2025

Claude finished @JAORMX's task.


Code Review Complete ✅

I've completed my review of the MCPGroup implementation with a focus on Kubernetes best practices. Here's my comprehensive analysis:


Summary

This PR implements MCPGroup CRD support in the Kubernetes operator following the approved proposal. The implementation is solid overall with good adherence to Kubernetes best practices. I've identified several improvements related to best practices, edge cases, and API design.


✅ Strengths

1. Excellent Design Alignment

  • The implementation faithfully follows the proposal document
  • Clear separation of concerns between MCPGroup and MCPServer controllers
  • Proper use of status subresources for both CRDs

2. Strong Test Coverage

  • Unit tests for controller logic (mcpgroup_controller_test.go, mcpserver_groupref_test.go)
  • Integration tests using envtest (mcpgroup_controller_integration_test.go)
  • End-to-end tests using Chainsaw
  • Tests cover happy paths, edge cases, and cross-namespace scenarios

3. Good Controller Patterns

  • Proper use of controller-runtime watches for cross-resource reconciliation
  • MCPGroup controller correctly watches MCPServer changes via findMCPGroupForMCPServer
  • Appropriate use of status conditions

🔍 Issues & Improvements

Critical: Missing observedGeneration in Status Updates

Issue: Both controllers update status conditions without setting observedGeneration, which is a Kubernetes best practice for tracking whether conditions reflect the latest spec.

Location:

  • cmd/thv-operator/controllers/mcpgroup_controller.go:52-72
  • cmd/thv-operator/controllers/mcpserver_controller.go:410-429

Fix:

```go
// In mcpgroup_controller.go
meta.SetStatusCondition(&mcpGroup.Status.Conditions, metav1.Condition{
    Type:               mcpv1alpha1.ConditionTypeMCPServersChecked,
    Status:             metav1.ConditionFalse,
    Reason:             mcpv1alpha1.ConditionReasonListMCPServersFailed,
    Message:            "Failed to list MCPServers in namespace",
    ObservedGeneration: mcpGroup.Generation, // ADD THIS
})

// Similar fixes needed in mcpserver_controller.go validateGroupRef()
```

Why: observedGeneration allows clients to determine if a condition reflects the current spec or an outdated one. This is critical for automation and proper status interpretation.


High Priority: Lack of Finalizers for Cascade Handling

Issue: The proposal mentions that when an MCPGroup is deleted, MCPServers should continue running but will have dangling references. However, there's no finalizer to clean up the group gracefully or update server conditions before deletion.

Location: cmd/thv-operator/controllers/mcpgroup_controller.go

Recommendation:

  1. Add a finalizer to MCPGroup to ensure clean deletion
  2. Before removing the finalizer, optionally add a condition to all referencing MCPServers indicating their group is being deleted
  3. Or at minimum, document the current behavior clearly in the CRD status

Current behavior risk: when the group is deleted, referencing MCPServers will suddenly flip to GroupRefValidated=False, which could be confusing for users.
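
A minimal sketch of what the deletion path could look like in the group controller's Reconcile; the finalizer name and the `markReferencingServers` helper are assumptions, not code from this PR:

```go
const mcpGroupFinalizer = "toolhive.stacklok.dev/mcpgroup-cleanup" // assumed name

// Early in Reconcile, before the normal sync logic:
if mcpGroup.DeletionTimestamp.IsZero() {
    if controllerutil.AddFinalizer(mcpGroup, mcpGroupFinalizer) {
        if err := r.Update(ctx, mcpGroup); err != nil {
            return ctrl.Result{}, err
        }
    }
} else {
    // Group is being deleted: flag referencing servers, then release it.
    if err := r.markReferencingServers(ctx, mcpGroup); err != nil { // hypothetical helper
        return ctrl.Result{}, err
    }
    if controllerutil.RemoveFinalizer(mcpGroup, mcpGroupFinalizer) {
        if err := r.Update(ctx, mcpGroup); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}
```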


Medium: Status Update Without Conflict Handling

Issue: Status updates in the MCPGroup controller don't handle conflicts properly. If two reconciliations happen simultaneously, one could overwrite the other.

Location: cmd/thv-operator/controllers/mcpgroup_controller.go:98-101

```go
// Current code
if err := r.Status().Update(ctx, mcpGroup); err != nil {
    ctxLogger.Error(err, "Failed to update MCPGroup status")
    return ctrl.Result{}, err
}
```

Better approach:

```go
if err := r.Status().Update(ctx, mcpGroup); err != nil {
    if errors.IsConflict(err) {
        // Requeue to retry with fresh data
        return ctrl.Result{Requeue: true}, nil
    }
    ctxLogger.Error(err, "Failed to update MCPGroup status")
    return ctrl.Result{}, err
}
```
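
Alternatively, client-go's retry helper refetches and retries on conflict instead of round-tripping through a requeue; a sketch:

```go
import "k8s.io/client-go/util/retry"

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // Refetch the latest version before each attempt so the update
    // is applied against the current resourceVersion.
    latest := &mcpv1alpha1.MCPGroup{}
    if err := r.Get(ctx, client.ObjectKeyFromObject(mcpGroup), latest); err != nil {
        return err
    }
    latest.Status = mcpGroup.Status
    return r.Status().Update(ctx, latest)
})
```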

Medium: Non-Deterministic Server List Order

Issue: The status.servers list order is not deterministic, which could cause unnecessary status updates and confuse users watching changes.

Location: cmd/thv-operator/controllers/mcpgroup_controller.go:88-91

Fix:

```go
mcpGroup.Status.Servers = make([]string, len(filteredServers))
for i, server := range filteredServers {
    mcpGroup.Status.Servers[i] = server.Name
}
// Add this:
sort.Strings(mcpGroup.Status.Servers)
```

Medium: MCPServer Controller Doesn't Update Status After GroupRef Validation

Issue: The validateGroupRef function updates conditions in memory but doesn't persist them to the API server.

Location: cmd/thv-operator/controllers/mcpserver_controller.go:398-431

Impact: The GroupRef validation conditions are lost unless another reconciliation updates the status.

Fix: Either:

  1. Call r.Status().Update(ctx, mcpServer) after validation, or
  2. Document that validation is informational and status update happens in the main reconciliation loop

Looking at line 187 in the Reconcile function, it appears validation is called but status is only updated much later. This could lead to race conditions.


Low: Missing Printer Columns in MCPGroup CRD

Issue: MCPGroup CRD has printer columns defined in the Go types but they could be improved for better UX.

Location: cmd/thv-operator/api/v1alpha1/mcpgroup_types.go:60-64

Current:

//+kubebuilder:printcolumn:name="Servers",type="integer",JSONPath=".status.serverCount"
//+kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
//+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

Suggested addition:

//+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='MCPServersChecked')].status"

This would give users quick visibility into whether the group is healthy.


Low: Inconsistent Constant Naming

Issue: The condition type constant uses "Validated" while the reason uses "Succeeded".

Location: cmd/thv-operator/api/v1alpha1/mcpgroup_types.go:50-58

```go
const (
    ConditionTypeMCPServersChecked = "MCPServersChecked"
)

const (
    ConditionReasonListMCPServersFailed    = "ListMCPServersFailed"
    ConditionReasonListMCPServersSucceeded = "ListMCPServersSucceeded" // inconsistent with type name
)
```

Minor improvement: Consider MCPServersCheckSucceeded to match the type name pattern.


Low: CRD Missing Short Names

Issue: Neither CRD defines short names for kubectl convenience.

Location: CRD manifests

Suggestion:

```yaml
# In mcpgroups CRD
names:
  kind: MCPGroup
  plural: mcpgroups
  singular: mcpgroup
  shortNames:  # ADD THIS
    - mcpg
    - mcpgroup
```

Users could then use `kubectl get mcpg` instead of `kubectl get mcpgroups`.
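
Since the CRD manifests are generated, the short names would normally come from the kubebuilder resource marker on the Go type rather than from editing the YAML by hand; a sketch (the scope is assumed to be Namespaced):

```go
// In mcpgroup_types.go, above the MCPGroup type; regenerate the
// manifests afterwards (task operator-manifests).
//+kubebuilder:resource:scope=Namespaced,shortName=mcpg;mcpgroup
```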


📝 Documentation & API Design Notes

1. GroupRef is Optional but Has Conditions

The GroupRef field is optional, but when omitted, no condition is set. This is correct, but consider documenting this behavior more explicitly in the API comments.

2. Cross-Namespace References Intentionally Blocked

The design correctly restricts GroupRef to same-namespace only. This follows Kubernetes security best practices. ✅

3. No Field Selector Support Yet

The proposal mentions field selectors for querying servers by group (spec.groupRef=foo), but I don't see the field selector registration in the controller setup. This might be a future enhancement, but worth noting.
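
For reference, one way to wire this up is controller-runtime's field indexer, which builds a cache-level index the controller can use in List calls (this does not enable apiserver-side field selectors for external clients; the field path and types here are assumptions):

```go
// At setup time: index MCPServers by the group they reference.
if err := mgr.GetFieldIndexer().IndexField(context.Background(), &mcpv1alpha1.MCPServer{},
    "spec.groupRef", func(obj client.Object) []string {
        server := obj.(*mcpv1alpha1.MCPServer)
        if server.Spec.GroupRef == "" {
            return nil
        }
        return []string{server.Spec.GroupRef}
    }); err != nil {
    return err
}

// In the group controller: list only the servers referencing this group.
var servers mcpv1alpha1.MCPServerList
err := r.List(ctx, &servers,
    client.InNamespace(mcpGroup.Namespace),
    client.MatchingFields{"spec.groupRef": mcpGroup.Name})
```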


🎯 Kubernetes Best Practices Assessment

| Practice | Status | Notes |
| --- | --- | --- |
| Status subresources | ✅ | Properly defined |
| Conditions with standard fields | ⚠️ | Missing `observedGeneration` |
| RBAC rules | ✅ | Complete and minimal |
| Watch patterns | ✅ | Correct cross-controller watches |
| Namespace scoping | ✅ | Properly scoped |
| Finalizers | ❌ | Missing for cleanup |
| Printer columns | ⚠️ | Present but could be improved |
| API versioning | ✅ | v1alpha1 appropriate |
| Generated code | ✅ | Properly generated |
| Test coverage | ✅ | Excellent coverage |

🚀 Recommendations

Must Fix Before Merge:

  1. Add ObservedGeneration to all status condition updates
  2. Ensure GroupRef validation status is persisted

Should Fix:

  1. Add finalizer for graceful MCPGroup deletion
  2. Handle status update conflicts
  3. Sort server names in status for determinism

Nice to Have:

  1. Add CRD short names
  2. Consider additional printer columns
  3. Add field selector support per the proposal

✨ Conclusion

This is a well-implemented feature that follows the proposal closely and includes comprehensive testing. The main concerns are around status handling best practices (observedGeneration, conflict handling) and missing finalizers for graceful deletion. These are important for production-readiness but don't block the feature from functioning correctly.

Great work on the test coverage - the integration and e2e tests demonstrate thorough validation of the feature!


