
Conversation

@tgrunnagle commented Oct 17, 2025

Implements MCPGroup types and k8s controller per [this proposal](https://github.com/stacklok/toolhive/blob/73707af47db039f51b2bda07d1ef8eccb25ba819/docs/proposals/kubernetes-mcpgroup-crd.md), tracked by #2234.

The MCP group controller reconciles MCPGroup resources and watches for changes to MCPServer resources. When the latter happens, it requests reconciliation of the server's referenced MCPGroup.
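
For context, the cross-resource wiring typically looks something like the sketch below. This is not the PR's code, just a minimal illustration assuming controller-runtime v0.15+ and a plain-string `Spec.GroupRef` field; the `findMCPGroupForMCPServer` mapping function is named in the review further down:

```go
// Sketch: reconcile MCPGroups and map MCPServer events back to the
// group they reference, so a server change re-triggers its group.
func (r *MCPGroupReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&mcpv1alpha1.MCPGroup{}).
        Watches(&mcpv1alpha1.MCPServer{},
            handler.EnqueueRequestsFromMapFunc(r.findMCPGroupForMCPServer)).
        Complete(r)
}

// findMCPGroupForMCPServer maps an MCPServer event to a reconcile
// request for its referenced group (same-namespace references only).
func (r *MCPGroupReconciler) findMCPGroupForMCPServer(_ context.Context, obj client.Object) []reconcile.Request {
    server, ok := obj.(*mcpv1alpha1.MCPServer)
    if !ok || server.Spec.GroupRef == "" {
        return nil
    }
    return []reconcile.Request{{
        NamespacedName: types.NamespacedName{
            Namespace: server.Namespace,
            Name:      server.Spec.GroupRef,
        },
    }}
}
```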

The MCP server controller checks the groupRef during reconciliation and sets status conditions accordingly (no fatal error if the groupRef is invalid).
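
The check itself can be a same-namespace Get plus a condition update; a minimal sketch (the `GroupRefValidated` condition type is mentioned in the review below, while the reason strings and exact shape here are assumptions):

```go
// Sketch: verify the referenced MCPGroup exists in the same namespace
// and record the outcome as a status condition; failure is non-fatal.
func (r *MCPServerReconciler) validateGroupRef(ctx context.Context, server *mcpv1alpha1.MCPServer) {
    if server.Spec.GroupRef == "" {
        return // groupRef is optional; no condition is set when omitted
    }
    cond := metav1.Condition{
        Type:               "GroupRefValidated",
        Status:             metav1.ConditionTrue,
        Reason:             "GroupFound", // assumed reason strings
        ObservedGeneration: server.Generation,
    }
    key := types.NamespacedName{Namespace: server.Namespace, Name: server.Spec.GroupRef}
    if err := r.Get(ctx, key, &mcpv1alpha1.MCPGroup{}); err != nil {
        cond.Status = metav1.ConditionFalse
        cond.Reason = "GroupNotFound"
        cond.Message = err.Error()
    }
    meta.SetStatusCondition(&server.Status.Conditions, cond)
}
```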

Generated code and manifests.
```console
toolhive % task operator-generate
toolhive % task operator-manifests
```

Added unit, integration, and e2e (chainsaw) tests.
```console
toolhive % task operator-test
toolhive % task operator-test-integration
toolhive % task operator-e2e-test
```

Many added lines are generated (CRD) or test cases.
@tgrunnagle linked an issue Oct 17, 2025 that may be closed by this pull request
@JAORMX (Collaborator) commented Oct 20, 2025

@claude please review this and take into account kubernetes best practices while you're at it.

claude bot (Contributor) commented Oct 20, 2025

Claude finished @JAORMX's task.


Code Review Complete ✅

I've completed my review of the MCPGroup implementation with a focus on Kubernetes best practices. Here's my comprehensive analysis:


Summary

This PR implements MCPGroup CRD support in the Kubernetes operator following the approved proposal. The implementation is solid overall with good adherence to Kubernetes best practices. I've identified several improvements related to best practices, edge cases, and API design.


✅ Strengths

1. Excellent Design Alignment

  • The implementation faithfully follows the proposal document
  • Clear separation of concerns between MCPGroup and MCPServer controllers
  • Proper use of status subresources for both CRDs

2. Strong Test Coverage

  • Unit tests for controller logic (mcpgroup_controller_test.go, mcpserver_groupref_test.go)
  • Integration tests using envtest (mcpgroup_controller_integration_test.go)
  • End-to-end tests using Chainsaw
  • Tests cover happy paths, edge cases, and cross-namespace scenarios

3. Good Controller Patterns

  • Proper use of controller-runtime watches for cross-resource reconciliation
  • MCPGroup controller correctly watches MCPServer changes via findMCPGroupForMCPServer
  • Appropriate use of status conditions

🔍 Issues & Improvements

Critical: Missing observedGeneration in Status Updates

Issue: Both controllers update status conditions without setting observedGeneration, which is a Kubernetes best practice for tracking whether conditions reflect the latest spec.

Location:

  • cmd/thv-operator/controllers/mcpgroup_controller.go:52-72
  • cmd/thv-operator/controllers/mcpserver_controller.go:410-429

Fix:

```go
// In mcpgroup_controller.go
meta.SetStatusCondition(&mcpGroup.Status.Conditions, metav1.Condition{
    Type:               mcpv1alpha1.ConditionTypeMCPServersChecked,
    Status:             metav1.ConditionFalse,
    Reason:             mcpv1alpha1.ConditionReasonListMCPServersFailed,
    Message:            "Failed to list MCPServers in namespace",
    ObservedGeneration: mcpGroup.Generation, // ADD THIS
})

// Similar fixes needed in mcpserver_controller.go validateGroupRef()
```

Why: observedGeneration allows clients to determine if a condition reflects the current spec or an outdated one. This is critical for automation and proper status interpretation.


High Priority: Lack of Finalizers for Cascade Handling

Issue: The proposal mentions that when an MCPGroup is deleted, MCPServers should continue running but will have dangling references. However, there's no finalizer to clean up the group gracefully or update server conditions before deletion.

Location: cmd/thv-operator/controllers/mcpgroup_controller.go

Recommendation:

  1. Add a finalizer to MCPGroup to ensure clean deletion
  2. Before removing the finalizer, optionally add a condition to all referencing MCPServers indicating their group is being deleted
  3. Or at minimum, document the current behavior clearly in the CRD status

Current behavior risk: when the group is deleted, referencing MCPServers will suddenly flip to GroupRefValidated=False, which could be confusing for users.
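
A minimal sketch of what the deletion path could look like in the group controller's Reconcile; the finalizer name and the `markReferencingServers` helper are assumptions, not code from this PR:

```go
const mcpGroupFinalizer = "toolhive.stacklok.dev/mcpgroup-cleanup" // assumed name

// Early in Reconcile, before the normal sync logic:
if mcpGroup.DeletionTimestamp.IsZero() {
    if controllerutil.AddFinalizer(mcpGroup, mcpGroupFinalizer) {
        if err := r.Update(ctx, mcpGroup); err != nil {
            return ctrl.Result{}, err
        }
    }
} else {
    // Group is being deleted: flag referencing servers, then release it.
    if err := r.markReferencingServers(ctx, mcpGroup); err != nil { // hypothetical helper
        return ctrl.Result{}, err
    }
    if controllerutil.RemoveFinalizer(mcpGroup, mcpGroupFinalizer) {
        if err := r.Update(ctx, mcpGroup); err != nil {
            return ctrl.Result{}, err
        }
    }
    return ctrl.Result{}, nil
}
```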


Medium: Status Update Without Conflict Handling

Issue: Status updates in the MCPGroup controller don't handle conflicts properly. If two reconciliations happen simultaneously, one could overwrite the other.

Location: cmd/thv-operator/controllers/mcpgroup_controller.go:98-101

```go
// Current code
if err := r.Status().Update(ctx, mcpGroup); err != nil {
    ctxLogger.Error(err, "Failed to update MCPGroup status")
    return ctrl.Result{}, err
}
```

Better approach:

```go
if err := r.Status().Update(ctx, mcpGroup); err != nil {
    if errors.IsConflict(err) {
        // Requeue to retry with fresh data
        return ctrl.Result{Requeue: true}, nil
    }
    ctxLogger.Error(err, "Failed to update MCPGroup status")
    return ctrl.Result{}, err
}
```
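
Alternatively, client-go's retry helper refetches and retries on conflict instead of round-tripping through a requeue; a sketch:

```go
import "k8s.io/client-go/util/retry"

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // Refetch the latest version before each attempt so the update
    // is applied against the current resourceVersion.
    latest := &mcpv1alpha1.MCPGroup{}
    if err := r.Get(ctx, client.ObjectKeyFromObject(mcpGroup), latest); err != nil {
        return err
    }
    latest.Status = mcpGroup.Status
    return r.Status().Update(ctx, latest)
})
```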

Medium: Non-Deterministic Server List Order

Issue: The status.servers list order is not deterministic, which could cause unnecessary status updates and confuse users watching changes.

Location: cmd/thv-operator/controllers/mcpgroup_controller.go:88-91

Fix:

```go
mcpGroup.Status.Servers = make([]string, len(filteredServers))
for i, server := range filteredServers {
    mcpGroup.Status.Servers[i] = server.Name
}
// Add this:
sort.Strings(mcpGroup.Status.Servers)
```

Medium: MCPServer Controller Doesn't Update Status After GroupRef Validation

Issue: The validateGroupRef function updates conditions in memory but doesn't persist them to the API server.

Location: cmd/thv-operator/controllers/mcpserver_controller.go:398-431

Impact: The GroupRef validation conditions are lost unless another reconciliation updates the status.

Fix: Either:

  1. Call r.Status().Update(ctx, mcpServer) after validation, or
  2. Document that validation is informational and status update happens in the main reconciliation loop

Looking at line 187 in the Reconcile function, it appears validation is called but status is only updated much later. This could lead to race conditions.


Low: Missing Printer Columns in MCPGroup CRD

Issue: MCPGroup CRD has printer columns defined in the Go types but they could be improved for better UX.

Location: cmd/thv-operator/api/v1alpha1/mcpgroup_types.go:60-64

Current:

//+kubebuilder:printcolumn:name="Servers",type="integer",JSONPath=".status.serverCount"
//+kubebuilder:printcolumn:name="Phase",type="string",JSONPath=".status.phase"
//+kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

Suggested addition:

//+kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.conditions[?(@.type=='MCPServersChecked')].status"

This would give users quick visibility into whether the group is healthy.


Low: Inconsistent Constant Naming

Issue: The condition type constant uses "Validated" while the reason uses "Succeeded".

Location: cmd/thv-operator/api/v1alpha1/mcpgroup_types.go:50-58

```go
const (
    ConditionTypeMCPServersChecked = "MCPServersChecked"
)

const (
    ConditionReasonListMCPServersFailed    = "ListMCPServersFailed"
    ConditionReasonListMCPServersSucceeded = "ListMCPServersSucceeded" // inconsistent with type name
)
```

Minor improvement: Consider MCPServersCheckSucceeded to match the type name pattern.


Low: CRD Missing Short Names

Issue: Neither CRD defines short names for kubectl convenience.

Location: CRD manifests

Suggestion:

```yaml
# In mcpgroups CRD
names:
  kind: MCPGroup
  plural: mcpgroups
  singular: mcpgroup
  shortNames:  # ADD THIS
    - mcpg
    - mcpgroup
```

Users could then use `kubectl get mcpg` instead of `kubectl get mcpgroups`.
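
Since the CRD manifests are generated, the short names would normally come from the kubebuilder resource marker on the Go type rather than from editing the YAML by hand; a sketch (the scope is assumed to be Namespaced):

```go
// In mcpgroup_types.go, above the MCPGroup type; regenerate the
// manifests afterwards (task operator-manifests).
//+kubebuilder:resource:scope=Namespaced,shortName=mcpg;mcpgroup
```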


📝 Documentation & API Design Notes

1. GroupRef is Optional but Has Conditions

The GroupRef field is optional, but when omitted, no condition is set. This is correct, but consider documenting this behavior more explicitly in the API comments.

2. Cross-Namespace References Intentionally Blocked

The design correctly restricts GroupRef to same-namespace only. This follows Kubernetes security best practices. ✅

3. No Field Selector Support Yet

The proposal mentions field selectors for querying servers by group (spec.groupRef=foo), but I don't see the field selector registration in the controller setup. This might be a future enhancement, but worth noting.
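
For reference, one way to wire this up is controller-runtime's field indexer, which builds a cache-level index the controller can use in List calls (this does not enable apiserver-side field selectors for external clients; the field path and types here are assumptions):

```go
// At setup time: index MCPServers by the group they reference.
if err := mgr.GetFieldIndexer().IndexField(context.Background(), &mcpv1alpha1.MCPServer{},
    "spec.groupRef", func(obj client.Object) []string {
        server := obj.(*mcpv1alpha1.MCPServer)
        if server.Spec.GroupRef == "" {
            return nil
        }
        return []string{server.Spec.GroupRef}
    }); err != nil {
    return err
}

// In the group controller: list only the servers referencing this group.
var servers mcpv1alpha1.MCPServerList
err := r.List(ctx, &servers,
    client.InNamespace(mcpGroup.Namespace),
    client.MatchingFields{"spec.groupRef": mcpGroup.Name})
```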


🎯 Kubernetes Best Practices Assessment

| Practice | Status | Notes |
| --- | --- | --- |
| Status subresources | ✅ | Properly defined |
| Conditions with standard fields | ⚠️ | Missing `observedGeneration` |
| RBAC rules | ✅ | Complete and minimal |
| Watch patterns | ✅ | Correct cross-controller watches |
| Namespace scoping | ✅ | Properly scoped |
| Finalizers | ❌ | Missing for cleanup |
| Printer columns | ⚠️ | Present but could be improved |
| API versioning | ✅ | v1alpha1 appropriate |
| Generated code | ✅ | Properly generated |
| Test coverage | ✅ | Excellent coverage |

🚀 Recommendations

Must Fix Before Merge:

  1. Add ObservedGeneration to all status condition updates
  2. Ensure GroupRef validation status is persisted

Should Fix:

  1. Add finalizer for graceful MCPGroup deletion
  2. Handle status update conflicts
  3. Sort server names in status for determinism

Nice to Have:

  1. Add CRD short names
  2. Consider additional printer columns
  3. Add field selector support per the proposal

✨ Conclusion

This is a well-implemented feature that follows the proposal closely and includes comprehensive testing. The main concerns are around status handling best practices (observedGeneration, conflict handling) and missing finalizers for graceful deletion. These are important for production-readiness but don't block the feature from functioning correctly.

Great work on the test coverage - the integration and e2e tests demonstrate thorough validation of the feature!


