[Feat][RayCluster] new RayClusterReplicaFailure condition #2245
I think we should try to consolidate the code path that sets the condition. It is hard to maintain if we set the condition in multiple places.
Indeed, we should keep them in one place. Actually, I think that place should be the defer block in this draft, and we will consolidate them into the defer block in #2235.
…the result of creating/deleting Pods Signed-off-by: Rueian <[email protected]>
}

// conditions should be mutated by the following reconcileXXX functions.
conditions := defaultRayClusterConditions()
`defaultRayClusterConditions` returns a simple map. I made it a function so that it can be reused in tests.
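A minimal sketch of what such a helper could look like. The `Condition` type, the map key type, and the default contents are assumptions made for illustration (the real code would presumably use the Kubernetes `metav1.Condition` type); only the function name and the "returns a simple map" behavior come from the comment above.

```go
package main

import "fmt"

// ConditionStatus and Condition are simplified stand-ins for the Kubernetes
// condition types (assumption for this sketch, not the operator's real types).
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type Condition struct {
	Status  ConditionStatus
	Reason  string
	Message string
}

// RayClusterReplicaFailure is the condition named in this PR. The string
// value used here is an assumption; the real constant may differ.
const RayClusterReplicaFailure = "RayClusterReplicaFailure"

// defaultRayClusterConditions builds a fresh map on every call, so the
// reconciler and the tests can each mutate their own copy.
func defaultRayClusterConditions() map[string]Condition {
	return map[string]Condition{
		RayClusterReplicaFailure: {Status: ConditionFalse},
	}
}

func main() {
	conditions := defaultRayClusterConditions()
	// A reconcile step records a failure in its own copy of the map.
	conditions[RayClusterReplicaFailure] = Condition{
		Status: ConditionTrue,
		Reason: "PodFailure",
	}
	fmt.Println(conditions[RayClusterReplicaFailure].Status)
	// A later call still starts from the untouched defaults.
	fmt.Println(defaultRayClusterConditions()[RayClusterReplicaFailure].Status)
}
```

Returning a new map from a function, rather than sharing a package-level variable, keeps one test's mutations from leaking into another.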
if !r.inconsistentRayClusterStatus(ctx, originalRayClusterInstance.Status, newInstance.Status) {

inconsistent := false
if features.Enabled(features.RayClusterStatusConditions) {
Conditions will only be set if the gate is enabled.
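The gating described here can be sketched roughly as follows. The `gates` map, `enabled`, `Status`, and `applyConditions` are hypothetical stand-ins for illustration; the actual code goes through `features.Enabled(features.RayClusterStatusConditions)` as shown in the diff.

```go
package main

import "fmt"

// gates is a hypothetical stand-in for the operator's feature-gate registry.
var gates = map[string]bool{"RayClusterStatusConditions": false}

// enabled reports whether the named gate is switched on.
func enabled(gate string) bool { return gates[gate] }

// Status is a simplified stand-in for the RayCluster status.
type Status struct {
	State      string
	Conditions map[string]string // condition type -> status
}

// applyConditions writes conditions into the status only when the gate is
// enabled; otherwise the status is left exactly as it was.
func applyConditions(status *Status, conditions map[string]string) {
	if !enabled("RayClusterStatusConditions") {
		return
	}
	status.Conditions = conditions
}

func main() {
	conds := map[string]string{"RayClusterReplicaFailure": "True"}
	var s Status

	applyConditions(&s, conds)
	fmt.Println(len(s.Conditions)) // gate off: nothing is set

	gates["RayClusterStatusConditions"] = true
	applyConditions(&s, conds)
	fmt.Println(len(s.Conditions)) // gate on: the condition is recorded
}
```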
// Calculate the new status for the RayCluster. Note that the function will deep copy `instance` instead of mutating it.
newInstance, calculateErr := r.calculateStatus(ctx, instance, reconcileErr)
var updateErr error
if calculateErr != nil {
I prefer not to set conditions in the reconcile functions. Instead, we should set conditions in `calculateStatus` based on `reconcileErr`.
func calculateStatus(...) {
    if reconcileErr is PodError {
        // set condition
    }
}
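In Go, the pseudocode above would most naturally be expressed with `errors.As`. The following is only a sketch of that idea: `PodError`, `Condition`, `calculateConditions`, and `exampleReconcileErr` are hypothetical names introduced here, not types from the operator.

```go
package main

import (
	"errors"
	"fmt"
)

// PodError is a hypothetical error type that a reconcile function could
// return when a Pod API call fails.
type PodError struct {
	Op  string // "create" or "delete"
	Err error
}

func (e *PodError) Error() string { return "failed to " + e.Op + " pod: " + e.Err.Error() }
func (e *PodError) Unwrap() error { return e.Err }

// Condition is a simplified stand-in for a Kubernetes status condition.
type Condition struct {
	Type, Status, Reason, Message string
}

// calculateConditions derives the condition purely from reconcileErr, so the
// reconcile functions themselves never touch the status.
func calculateConditions(reconcileErr error) Condition {
	var podErr *PodError
	if errors.As(reconcileErr, &podErr) {
		return Condition{
			Type:    "RayClusterReplicaFailure",
			Status:  "True",
			Reason:  "PodFailure",
			Message: podErr.Error(),
		}
	}
	return Condition{Type: "RayClusterReplicaFailure", Status: "False"}
}

// exampleReconcileErr builds a wrapped PodError, as a reconcile function might.
func exampleReconcileErr() error {
	cause := errors.New("quota exceeded")
	return fmt.Errorf("reconcile pods: %w", &PodError{Op: "create", Err: cause})
}

func main() {
	fmt.Printf("%+v\n", calculateConditions(exampleReconcileErr()))
	fmt.Printf("%+v\n", calculateConditions(nil))
}
```

Because `errors.As` walks the wrap chain, the reconcile functions can add context with `%w` without hiding the `PodError` from the status calculation.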
- It is not very important for me to support fine-grained condition reasons. If we can easily support them, that is good. Otherwise, it is not necessary, and we can use something like `PodFailure` instead.
- I prefer not to set conditions in the reconcile functions. Instead, we should set conditions in `calculateStatus` based on `reconcileErr`.
  func calculateStatus(...) {
      if reconcileErr is PodError {
          // set condition
      }
  }
- I am not sure whether there is an easy way to determine whether an error is thrown by Pod creation or deletion. If not, we can consider another way.
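One possible way to tell creation failures from deletion failures, offered only as a sketch: wrap each failure with a sentinel error at the call site and classify it later with `errors.Is`. The sentinels, `wrapPodErr`, `reasonFor`, and `errAPI` are hypothetical names for illustration; the `FailedCreate` and `FailedDelete` reason strings are the ReplicaSet ones mentioned in the PR description.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors marking which Pod operation failed.
var (
	ErrFailedCreatePod = errors.New("failed to create pod")
	ErrFailedDeletePod = errors.New("failed to delete pod")
)

// errAPI is a sample underlying API error used by the example below.
var errAPI = errors.New("api server unavailable")

// wrapPodErr tags the underlying cause with the sentinel for the operation.
// Only the sentinel is wrapped with %w; the cause is kept as text.
func wrapPodErr(sentinel error, podName string, cause error) error {
	return fmt.Errorf("%w %q: %v", sentinel, podName, cause)
}

// reasonFor maps a reconcile error to a condition reason, or "" when the
// error did not come from a Pod create/delete call.
func reasonFor(reconcileErr error) string {
	switch {
	case errors.Is(reconcileErr, ErrFailedCreatePod):
		return "FailedCreate"
	case errors.Is(reconcileErr, ErrFailedDeletePod):
		return "FailedDelete"
	default:
		return ""
	}
}

func main() {
	createErr := wrapPodErr(ErrFailedCreatePod, "head", errAPI)
	deleteErr := wrapPodErr(ErrFailedDeletePod, "worker-1", errAPI)
	fmt.Println(reasonFor(createErr), createErr)
	fmt.Println(reasonFor(deleteErr), deleteErr)
}
```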
Hi @kevin85421,
I am afraid that is not always feasible. We have multiple reconcileFuncs but have only one If the If the In other words, we will need to keep track of each execution of reconcileFuncs additionally if we set conditions based on the
@rueian can we close this one?
Sure. No need to keep this.
This PR uses a new `RayClusterReplicaFailure` condition to reflect the result of creating/deleting Pods when the `features.RayClusterStatusConditions` gate is enabled.
The idea is borrowed from the `ReplicaSetReplicaFailure` condition, which is set by the Kubernetes ReplicaSet controller when there is a `manageReplicasErr`. That error occurs only when the controller's API call to create or delete Pods has failed. (ref)
We mirror this behavior for the new `RayClusterReplicaFailure` condition. More specifically, we turn the condition on when the controller fails to create or delete Pods in the `reconcilePods` function. Additionally, we have covered all five kinds of failure reasons in the function.
These reasons are more fine-grained than those of `ReplicaSetReplicaFailure`, which has only two kinds of reasons: `FailedCreate` and `FailedDelete`.
Related issue number
Closes #2232
ray-project/enhancements#54
Checks
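For readers unfamiliar with the ReplicaSet behavior that the description mirrors, here is a rough paraphrase of how that controller handles its `ReplicaFailure` condition, as far as I understand it: the condition is added when managing replicas fails and removed once the error clears. All types and helpers below are simplified stand-ins written for this sketch, not the Kubernetes or KubeRay code.

```go
package main

import (
	"errors"
	"fmt"
)

// Condition is a simplified stand-in for a status condition.
type Condition struct {
	Type, Status, Reason, Message string
}

const replicaFailure = "ReplicaFailure"

// errQuota is a sample API error used by the example below.
var errQuota = errors.New("pods exceeded quota")

// hasCondition reports whether a condition of the given type is present.
func hasCondition(conds []Condition, t string) bool {
	for _, c := range conds {
		if c.Type == t {
			return true
		}
	}
	return false
}

// updateReplicaFailure returns a new condition list: the failure condition is
// added when manageErr is set and removed again once manageErr is nil.
func updateReplicaFailure(conds []Condition, currentPods, desired int, manageErr error) []Condition {
	present := hasCondition(conds, replicaFailure)
	out := append([]Condition(nil), conds...) // copy so the caller's slice is untouched
	switch {
	case manageErr != nil && !present:
		reason := ""
		if currentPods < desired {
			reason = "FailedCreate" // too few Pods: a create call failed
		} else if currentPods > desired {
			reason = "FailedDelete" // too many Pods: a delete call failed
		}
		out = append(out, Condition{replicaFailure, "True", reason, manageErr.Error()})
	case manageErr == nil && present:
		kept := out[:0]
		for _, c := range out {
			if c.Type != replicaFailure {
				kept = append(kept, c)
			}
		}
		out = kept
	}
	return out
}

func main() {
	conds := updateReplicaFailure(nil, 1, 3, errQuota)
	fmt.Printf("%+v\n", conds)
	conds = updateReplicaFailure(conds, 3, 3, nil)
	fmt.Println(len(conds))
}
```

Note that in this scheme the reason is inferred from the replica count rather than from the error itself, which is why only two reasons are available there.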