Skip to content

Conversation

rueian
Copy link
Collaborator

@rueian rueian commented Jul 14, 2024

This PR uses a new RayClusterReplicaFailure condition to reflect the result of creating/deleting Pods when the features.RayClusterStatusConditions gate is enabled.

The idea is borrowed from the ReplicaSetReplicaFailure:

// These are valid conditions of a replica set.
const (
	// ReplicaSetReplicaFailure is added in a replica set when one of its pods fails to be created
	// due to insufficient quota, limit ranges, pod security policy, node selectors, etc. or deleted
	// due to kubelet being down or finalizers are failing.
	ReplicaSetReplicaFailure ReplicaSetConditionType = "ReplicaFailure"
)

which will be set by the Kubernetes ReplicaSet controller when there is a manageReplicasErr. The error will only occur when its API call to create or delete pods has failed. (ref)

We mirror the above behavior to make the new RayClusterReplicaFailure condition. More specifically, we turn the condition on when the controller fails to create or delete pods in the reconcilePods function. Additionally, We have covered all five kinds of failure reasons in the function:

  • FailedDeleteAllPods
  • FailedDeleteHeadPod
  • FailedCreateHeadPod
  • FailedDeleteWorkerPod
  • FailedCreateWorkerPod

These reasons are more fine-grained than the ReplicaSetReplicaFailure which has only two kinds of reasons: FailedCreate, and FailedDelete.

Related issue number

Closes #2232

ray-project/enhancements#54

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@rueian rueian changed the title [Feat][RayCluster] Use a new RayClusterReplicaFailure condition to re… [Feat][RayCluster] new RayClusterReplicaFailure condition Jul 14, 2024
@kevin85421
Copy link
Member

I think we should try to consolidate the code path to set up the condition. It is hard to maintain if we set the condition in multiple places.

@rueian
Copy link
Collaborator Author

rueian commented Jul 16, 2024

I think we should try to consolidate the code path to set up the condition. It is hard to maintain if we set the condition in multiple places.

Indeed, we should keep them in one place. Actually, I think the place should be the defer block in this draft and we will consolidate them to the defer block in the #2235

@kevin85421 kevin85421 self-requested a review July 17, 2024 00:14
@kevin85421 kevin85421 self-assigned this Jul 17, 2024
@rueian rueian force-pushed the replica-failure-condition branch 2 times, most recently from dfcf8a6 to 67d300a Compare July 18, 2024 14:54
…the result of creating/deleting Pods

Signed-off-by: Rueian <[email protected]>
@rueian rueian force-pushed the replica-failure-condition branch from 67d300a to 6a5afa1 Compare July 18, 2024 15:00
}

// conditions should be mutated by the following reconcileXXX functions.
conditions := defaultRayClusterConditions()
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The defaultRayClusterConditions returns a simple map. I made it a function because it can be reused in tests.

if !r.inconsistentRayClusterStatus(ctx, originalRayClusterInstance.Status, newInstance.Status) {

inconsistent := false
if features.Enabled(features.RayClusterStatusConditions) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditions will only be set if the gate is enabled.

// Calculate the new status for the RayCluster. Note that the function will deep copy `instance` instead of mutating it.
newInstance, calculateErr := r.calculateStatus(ctx, instance, reconcileErr)
var updateErr error
if calculateErr != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer not to set conditions in the reconcile functions. Instead, we should set conditions in calculateStatus based on reconcileErr.

func calculateStatus(...) {
    if reconcileErr is PodError {
       // set condition
    }

}

Copy link
Member

@kevin85421 kevin85421 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • It is not very important for me to support fine-grained condition reasons. If we can easily support it, that is good. Otherwise, it is not necessary to support it, and use something like PodFailure instead.

  • I prefer not to set conditions in the reconcile functions. Instead, we should set conditions in calculateStatus based on reconcileErr.

    func calculateStatus(...) {
        if reconcileErr is PodError {
           // set condition
        }
    }
  • I am not sure whether there is an easy way to determine whether an error is thrown by the Pod creation or deletion. If not, we can consider another way.

@rueian
Copy link
Collaborator Author

rueian commented Jul 18, 2024

Hi @kevin85421,

I prefer not to set conditions in the reconcile functions. Instead, we should set conditions in calculateStatus based on reconcileErr.

I am afraid that is not always feasible. We have multiple reconcileFuncs but have only one reconcileErr. Consider a case that the reconcileErr is not nil and is not an error from the reconcilePods function either, we will have difficulty with whether to turn the RayClusterReplicaFailure condition off or not.

If the reconcileErr comes from other preceding reconcileFuncs, we should leave the condition untouched.

If the reconcileErr comes from other succeeding reconcileFuncs, we should turn the condition off.

In other words, we will need to keep track of each execution of reconcileFuncs additionally if we set conditions based on the reconcileErr.

@kevin85421
Copy link
Member

@rueian can we close this one?

@rueian
Copy link
Collaborator Author

rueian commented Jul 22, 2024

Sure. No need to keep this.

@rueian rueian closed this Jul 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] REP 54: Implement the ReplicaFailures condition
2 participants