Collaborator

@machichima machichima commented Aug 15, 2025

Why are these changes needed?

APIServer v1 introduced a unique and mandatory feature, Compute Template, which requires users to predefine the amount of CPU and memory resources they need in templates.

APIServer v2 doesn’t require users to predefine templates anymore, but we still want to continue supporting the feature to help v1 users migrate to v2 more easily.

Implementation

  1. Unmarshal the incoming request body in a new apiserversdk proxy handler.
  2. Extract computeTemplate from the headGroupSpec and workerGroupSpec.
  3. Mutate the request body according to the template before marshaling back.

We need 3 new proxy handlers for RayCluster, RayJob, and RayService, respectively. They may be able to share some parts of the implementation.
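
The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: the `template` type, the `applyTemplate` and `mutateBody` helpers, the `workerGroupSpecs` key, and the flat `resources` key are all hypothetical simplifications, and the real handler would write into the pod template's container resources.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// template is a hypothetical stand-in for api.ComputeTemplate.
type template struct {
	CPU    uint32
	Memory uint32 // amount in Gi (assumption for this sketch)
}

// applyTemplate looks up the computeTemplate named in one group spec and
// writes the matching resources into the same map (illustrative location).
func applyTemplate(group map[string]any, templates map[string]template) error {
	name, ok := group["computeTemplate"].(string)
	if !ok || name == "" {
		return nil // groups without a template are left untouched
	}
	tpl, found := templates[name]
	if !found {
		return fmt.Errorf("compute template %q not found", name)
	}
	group["resources"] = map[string]any{
		"cpu":    fmt.Sprint(tpl.CPU),
		"memory": fmt.Sprintf("%dGi", tpl.Memory),
	}
	delete(group, "computeTemplate")
	return nil
}

// mutateBody runs the three steps: unmarshal, extract and apply, marshal back.
func mutateBody(body []byte, templates map[string]template) ([]byte, error) {
	var req map[string]any
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, err
	}
	// Reading from a nil map is safe in Go, so a missing spec is tolerated.
	spec, _ := req["spec"].(map[string]any)
	if head, ok := spec["headGroupSpec"].(map[string]any); ok {
		if err := applyTemplate(head, templates); err != nil {
			return nil, err
		}
	}
	workers, _ := spec["workerGroupSpecs"].([]any)
	for _, w := range workers {
		if group, ok := w.(map[string]any); ok {
			if err := applyTemplate(group, templates); err != nil {
				return nil, err
			}
		}
	}
	return json.Marshal(req)
}

func main() {
	body := []byte(`{"spec":{"headGroupSpec":{"computeTemplate":"small"}}}`)
	out, err := mutateBody(body, map[string]template{"small": {CPU: 2, Memory: 4}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the group handling lives in one helper, the RayCluster, RayJob, and RayService handlers could in principle share it and differ only in where they find the cluster spec.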

Added an e2e test in compute_template_e2e_test.go.

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Just a structure, test does not work now

Signed-off-by: machichima <[email protected]>
@machichima machichima marked this pull request as ready for review September 5, 2025 15:50
//
// Store http request handling function for unit test purpose.
- executeHttpRequest func(httpRequest *http.Request, URL string) ([]byte, *rpcStatus.Status, error)
+ ExecuteHttpRequest func(httpRequest *http.Request, URL string) ([]byte, *rpcStatus.Status, error)
Collaborator Author

Make this public so that we can use it in apiserversdk.

Collaborator

@machichima, can we revert this now?

Collaborator Author


// Compute template
- func (r *ResourceManager) populateComputeTemplate(ctx context.Context, clusterSpec *api.ClusterSpec, nameSpace string) (map[string]*api.ComputeTemplate, error) {
+ func (r *ResourceManager) PopulateComputeTemplate(ctx context.Context, clusterSpec *api.ClusterSpec, nameSpace string) (map[string]*api.ComputeTemplate, error) {
Collaborator Author

Make this public so that we can use it in apiserversdk.

}

// execCommandWithCurlInPod executes a curl command inside the specified pod's container by `kubectl exec`
func (rec *RemoteExecuteClient) execCommandWithCurlInPod(pod *corev1.Pod, url string, method string, body string, contentType string) ([]byte, error) {
Collaborator Author

Took this function from apiserver/test/e2e/exec.go and added a contentType argument.

return e2etc.ctx
}

func (e2etc *End2EndTestingContext) GetK8sHttpClient() *kubernetes.Clientset {
Collaborator Author

Originally, k8sHttpClient and k8sClient were duplicates, both of type *kubernetes.Clientset. We removed k8sHttpClient and kept only k8sClient.

)

// compute_template_middleware.go
func NewComputeTemplateMiddleware(clientManager manager.ClientManagerInterface) func(http.Handler) http.Handler {
Collaborator Author

Take clientManager as an argument so that we can mock it out in unit tests.

Signed-off-by: machichima <[email protected]>
@machichima
Collaborator Author

Not sure why the test failed in the "build apiserversdk" check. It works on my side.


@rueian
Collaborator

rueian commented Sep 7, 2025

Not sure why the test failed in the "build apiserversdk" check. It works on my side.


That is a new flaky failure that has been happening frequently of late. I wonder whether it is related to the newly added retry round tripper. Could you help verify that in another PR?

@rueian
Collaborator

rueian commented Sep 7, 2025

Or the flakiness may be related to #4061.

@machichima
Collaborator Author

Or the flakiness may be related to #4061.

I'll try fixing this flaky test in another PR if the error persists after #4061 is merged.

}

// verifyPodSpecResources verifies that the PodSpec has the expected resources from the compute template
func verifyPodSpecResources(t *testing.T, podSpec *corev1.PodSpec, _, groupType string, computeTemplate *api.ComputeTemplate) {
Collaborator

Is there any specific reason to put a _ in the function signature?

Collaborator Author

It's not needed, I think I forgot to remove it. Thank you for pointing this out!

Collaborator Author

Updated in edc99ad

Comment on lines 132 to 133
func convertRequestBodyToMap(requestBody []byte, contentType string) (map[string]any, error) {
var requestMap map[string]interface{}
Collaborator

nit: choose either any or interface{} and use it consistently.

Collaborator Author

Thank you! I changed interface{} to any, as suggested by my linter, in 8ad0573.

Comment on lines 257 to 259
} else {
tolerations = make([]interface{}, 0)
}
Collaborator

The else branch might not be necessary.

https://go.dev/play/p/ZB0ZrroHJl_C

Collaborator Author

Thank you! Removed the block in 559f2ae
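
For context, a small sketch of why the else branch can go: appending to a nil slice is valid Go, so no explicit make is needed. The `mergeTolerations` helper and the `tolerations` map key are hypothetical and only mirror the shape of the reviewed code.

```go
package main

import "fmt"

// mergeTolerations appends extra entries to whatever the spec already holds.
func mergeTolerations(spec map[string]any, extra ...any) []any {
	// A failed type assertion leaves tolerations as a nil []any.
	tolerations, _ := spec["tolerations"].([]any)
	// append allocates on demand, so a nil slice works without make().
	return append(tolerations, extra...)
}

func main() {
	fresh := mergeTolerations(map[string]any{}, "gpu")
	existing := mergeTolerations(map[string]any{"tolerations": []any{"spot"}}, "gpu")
	fmt.Println(len(fresh), len(existing))
}
```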

computeTemplate, err := getComputeTemplate(context.Background(), resourceManager, headGroupMap, namespace)
if err != nil {
klog.Errorf("ComputeTemplate middleware: Failed to get compute template for head group: %v", err)
http.Error(w, err.Error(), http.StatusInternalServerError)
Collaborator

I feel like this should be a client error. Returning HTTP 422 will be better.

Collaborator Author

Fixed in 967c314

func applyComputeTemplateToRequest(computeTemplate *api.ComputeTemplate, clusterSpecMap *map[string]any, group string) {
// calculate resources
cpu := fmt.Sprint(computeTemplate.GetCpu())
memory := fmt.Sprintf("%d%s", computeTemplate.GetMemory(), "Gi")
Collaborator

We need to support the new memory unit field.

Collaborator Author

Added in b56867d
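
One way the unit could be threaded through, as a sketch only: the `memoryQuantity` helper and its `unit` parameter are hypothetical, and falling back to "Gi" for an empty unit is an assumption made here to match the previously hard-coded suffix, not something confirmed by this thread.

```go
package main

import "fmt"

// memoryQuantity formats an amount and unit as a Kubernetes-style quantity.
// An empty unit falls back to "Gi" (assumption, mirrors the old behaviour).
func memoryQuantity(amount uint32, unit string) string {
	if unit == "" {
		unit = "Gi"
	}
	return fmt.Sprintf("%d%s", amount, unit)
}

func main() {
	fmt.Println(memoryQuantity(4, ""), memoryQuantity(512, "Mi"))
}
```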


// RemoteExecuteClient allows executing HTTP requests against a service running inside a Kubernetes pod
// by using `kubectl exec`-style command execution, without requiring a NodePort for external access.
type RemoteExecuteClient struct {
Collaborator

Could we use the existing ProxyRoundTripper instead of introducing a RemoteExecuteClient?

Collaborator Author

Updated in 9561d29, thank you!

@machichima machichima requested a review from rueian September 24, 2025 12:58
expectedMemory := fmt.Sprintf("%dGi", computeTemplate.GetMemory())
memoryLimit := rayContainer.Resources.Limits[corev1.ResourceMemory]
memoryRequest := rayContainer.Resources.Requests[corev1.ResourceMemory]
require.Equal(t, expectedMemory, memoryLimit.String(), "Expected memory limit to be 4Gi")
Collaborator

Could you remove the hard-coded values in these assertion messages in a follow-up?

Collaborator Author

Sure! No problem.


klog.Infof("ComputeTemplate middleware: Successfully processed request, sending to next handler")
// Update Content-Type to application/json and Content-Length header to match the new body size
r.Header.Set("Content-Type", "application/json")
Collaborator

Could you keep the original content type in a follow-up?

Collaborator Author

This is because we convert the map to JSON after processing. Therefore, I changed the Content-Type here to application/json.

https://github.com/ray-project/kuberay/pull/3959/files#diff-a310e70ce3c41e0d4d682d053af475d9d67168224dacd5eaa7bf4d3e92479419R112

@rueian rueian merged commit 536ca35 into ray-project:master Sep 28, 2025
27 checks passed
3 participants