Support scale to zero rabbitMQ #1899


Open

jonathanCaamano wants to merge 11 commits into main

Conversation

@jonathanCaamano commented Jun 26, 2025

This closes #1876

As discussed with @Zerpet and @mkuratczyk in the issue, we add some logic to allow scaling RabbitMQ to zero.

We also add logic to prevent a scale down when opting out of the zero state.

We add a new annotation, rabbitmq.com/before-zero-replicas-configured, to save the number of replicas configured before scaling RabbitMQ to zero.

With this annotation we verify whether the desired replica count after the zero state is equal to or greater than the count before the zero state.
If the replicas don't pass this verification, the change is treated like a scale down.
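
A minimal sketch of this idea, assuming plain helper functions around the annotation (the function names and wiring below are illustrative assumptions, not the PR's actual code):

package controllers

import "strconv"

const beforeZeroReplicasConfigured = "rabbitmq.com/before-zero-replicas-configured"

// saveReplicasBeforeZero records the current replica count in the cluster's
// annotations before the StatefulSet is scaled down to zero replicas.
func saveReplicasBeforeZero(annotations map[string]string, currentReplicas int32) {
	annotations[beforeZeroReplicasConfigured] = strconv.FormatInt(int64(currentReplicas), 10)
}

// replicasBeforeZero reads the saved count back; ok is false when the
// annotation is missing or malformed.
func replicasBeforeZero(annotations map[string]string) (replicas int32, ok bool) {
	saved, found := annotations[beforeZeroReplicasConfigured]
	if !found {
		return 0, false
	}
	parsed, err := strconv.ParseInt(saved, 10, 32)
	if err != nil {
		return 0, false
	}
	return int32(parsed), true
}

// treatedAsScaleDown applies the verification described above: if the desired
// count after the zero state is lower than the saved count, the change is
// handled like a regular scale down.
func treatedAsScaleDown(annotations map[string]string, desiredReplicas int32) bool {
	before, ok := replicasBeforeZero(annotations)
	return ok && desiredReplicas < before
}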

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

Additional Context

Local Testing

Please ensure you run the unit, integration and system tests before approving the PR.

To run the unit and integration tests:

$ make unit-tests integration-tests

You will need to target a k8s cluster and have the operator deployed for running the system tests.

For example, for a Kubernetes context named dev-bunny:

$ kubectx dev-bunny
$ make destroy deploy-dev
# wait for operator to be deployed
$ make system-tests

@mkuratczyk (Contributor)

Thanks for the PR. Just FYI, I will certainly test this soon, but I need to finish some other things first.

@mkuratczyk (Contributor)

Some initial feedback:

  1. ALLREPLICASREADY shows "true" when all replicas are stopped
# deploy a cluster, set replicas to 0, and then get the cluster:
> kubectl get rmq
NAME   ALLREPLICASREADY   RECONCILESUCCESS   AGE
rmq    True               True               13m

I think it should be set to False when scaled to 0 (see the sketch after this list).

  2. Attempting to scale up from zero to a lower number of replicas than before scaling to zero leads to an error:
2025-07-08T17:33:30+02:00	ERROR	Cluster Scale down not supported; tried to scale cluster from 3 nodes to 1 nodes	{"controller": "rabbitmqcluster", "controllerGroup": "rabbitmq.com", "controllerKind": "RabbitmqCluster", "RabbitmqCluster": {"name":"rmq","namespace":"default"}, "namespace": "default", "name": "rmq", "reconcileID": "338516a2-8aeb-447e-97fd-92e1774ae64d", "error": "UnsupportedOperation"}
github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).recordEventsAndSetCondition
	/Users/mkuratczyk/workspace/cluster-operator/controllers/reconcile_scale_zero.go:90
github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).scaleDownFromZero
	/Users/mkuratczyk/workspace/cluster-operator/controllers/reconcile_scale_zero.go:57
github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).Reconcile
	/Users/mkuratczyk/workspace/cluster-operator/controllers/rabbitmqcluster_controller.go:216
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202

(it is not expected to work as we discussed, but the stacktrace shouldn't be there, unless there's a good reason for it)

  3. Attempting to scale from zero up to a number of replicas higher than before scaling down to zero works, which surprised me.
    Steps: deploy a 3-node cluster, set replicas to 0, then set replicas to 5. I don't see any reason for this to cause problems on 4.1+ thanks to the new peer discovery mechanism, but I guess it could cause issues with older RabbitMQ versions. Not sure what to do about this one yet. Perhaps we should keep it like that and just warn that using it with older RabbitMQ versions is risky.
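
A minimal sketch of the rule suggested in point 1 above, assuming a plain helper; the name and the operator's actual condition wiring are assumptions, not taken from this PR:

package controllers

// allReplicasReady reports whether the AllReplicasReady condition should be
// True. Treating a desired replica count of zero as "not ready" matches the
// feedback in point 1. (Hypothetical helper, not the operator's real
// condition code.)
func allReplicasReady(desiredReplicas, readyReplicas int32) bool {
	return desiredReplicas > 0 && readyReplicas == desiredReplicas
}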

@jonathanCaamano (Author)

Hello @mkuratczyk,

  1. Sure, I will make some changes so that ALLREPLICASREADY is false.

  2. About this, I followed the same flow as a scale down, as defined in the code (because if the previously configured replicas are 3 and you now try to set 1, that represents a scale down). So, if you'd like, I can change it and remove the stack trace.

  3. About the RabbitMQ versions: the version in my cluster is 3.13 and it's working properly. Maybe you see something I don't?

Thank you for the feedback

@mkuratczyk (Contributor)

  1. If there's a stack trace when a scale down is attempted (without scaling down to zero) then I think ideally we should just fix that for both cases. Alternatively, you can ignore it and we can deal with this separately.

  2. I'm not saying it will never work, more that it could lead to random problems. Say we have 1 node, scale to zero and then scale to 3. What if the two new nodes start first for some reason? I think they could form a new cluster, at least in some cases. With 4.1+, that should not happen, since all nodes will wait for the node/pod with -0 suffix:
    https://www.rabbitmq.com/blog/2025/04/04/new-k8s-peer-discovery

@jonathanCaamano (Author) commented Jul 14, 2025

Hello @mkuratczyk!

I made some changes:

1. ALLREPLICASREADY is now false when the cluster is scaled to zero.
2. I tried to change this, but it needs further analysis and may mean changing how the global logger works.
3. We changed the approach: scaling RabbitMQ up from zero now has to restore the same number of replicas as before zero; if you want to scale up further, you first have to restore the replica count configured before zero. This avoids the problems you mentioned, since the annotation is always respected (see the sketch below).
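
A minimal sketch of this revised rule, again assuming an illustrative helper rather than the PR's actual code:

package controllers

import "strconv"

// allowedScaleFromZero reports whether scaling up from zero is allowed under
// the revised rule: the desired count must exactly match the count saved in
// the rabbitmq.com/before-zero-replicas-configured annotation before the
// cluster was scaled to zero.
func allowedScaleFromZero(annotations map[string]string, desiredReplicas int32) bool {
	saved, ok := annotations["rabbitmq.com/before-zero-replicas-configured"]
	if !ok {
		return false
	}
	before, err := strconv.ParseInt(saved, 10, 32)
	if err != nil {
		return false
	}
	return desiredReplicas == int32(before)
}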

Kind regards

@mkuratczyk (Contributor)

Thanks. My only additional feedback is that the error message is a bit cryptic ("Cluster Scale from zero to other replicas than before configured not supported; tried to scale cluster from 3 nodes to 5 nodes"). Perhaps "unsupported operation: when scaling from zero, you can only restore the previous number of replicas (3)"?

@Zerpet @ansd @MirahImage any thoughts about this PR?

@jonathanCaamano (Author)

Hello,

I changed the logger.

@Zerpet self-requested a review July 18, 2025 12:08

@Zerpet (Member) left a comment

Thank you for contributing this PR! I left some comments with feedback that I would like to be addressed before merging.

Comment on lines +20 to +24
func (r *RabbitmqClusterReconciler) scaleToZero(current, sts *appsv1.StatefulSet) bool {
currentReplicas := *current.Spec.Replicas
desiredReplicas := *sts.Spec.Replicas
return desiredReplicas == 0 && currentReplicas > 0
}

This function does not need to be part of the RabbitmqClusterReconciler, because it doesn't make any use of the struct functions or fields, it takes all its information from the arguments.

Comment on lines +27 to +31
func (r *RabbitmqClusterReconciler) scaleFromZero(current, sts *appsv1.StatefulSet) bool {
currentReplicas := *current.Spec.Replicas
desiredReplicas := *sts.Spec.Replicas
return currentReplicas == 0 && desiredReplicas > 0
}

This function does not need to be part of the RabbitmqClusterReconciler, because it doesn't make any use of the struct functions or fields, it takes all its information from the arguments.
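
For example, both helpers could become package-level functions along these lines (a sketch of the suggested refactor, not the final code):

package controllers

import appsv1 "k8s.io/api/apps/v1"

// scaleToZero reports whether the desired StatefulSet takes a running cluster
// down to zero replicas.
func scaleToZero(current, sts *appsv1.StatefulSet) bool {
	return *sts.Spec.Replicas == 0 && *current.Spec.Replicas > 0
}

// scaleFromZero reports whether the desired StatefulSet brings a cluster that
// is currently at zero replicas back up.
func scaleFromZero(current, sts *appsv1.StatefulSet) bool {
	return *current.Spec.Replicas == 0 && *sts.Spec.Replicas > 0
}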

Comment on lines +57 to +59
if err != nil {
return true
}

We should add a debug log line here indicating there was an error emitting the event and/or setting the status condition. As it is right now, it's silently ignoring the error.

In fact, since the function returns true at this point in either case, you could simply log a debug message (without returning inside the if) and return true after the if conditional.
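
A sketch of that shape, using the controller-runtime logger this file already uses elsewhere; the function and parameter names here are illustrative stand-ins, not the PR's actual signature:

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// recordAndReturn is an illustrative stand-in for the function under review:
// a failure to emit the event or set the status condition is logged at debug
// level instead of being silently dropped, and true is returned after the if.
func recordAndReturn(ctx context.Context, setCondition func() error) bool {
	logger := ctrl.LoggerFrom(ctx)
	if err := setCondition(); err != nil {
		logger.V(1).Info("failed to emit event or set status condition", "error", err)
	}
	return true
}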

@@ -213,7 +227,6 @@ func (r *RabbitmqClusterReconciler) Reconcile(ctx context.Context, req ctrl.Requ
return ctrl.Result{}, err
}
}


I don't think this line break removal improves readability of this function; in fact, I think the opposite: removing this empty line makes the function cluttered. This line break separates high-level steps of the reconcile process. Prior to this line break is the logic to reconcile the STS (plus other things earlier on); after this line break is the logic to update the STS and emit the relevant metadata. I believe this separation makes sense, and I want to keep it unless there's a compelling argument for the opposite.

Comment on lines -252 to +267

// Set ReconcileSuccess to true and update observedGeneration after all reconciliation steps have finished with no error
rabbitmqCluster.Status.ObservedGeneration = rabbitmqCluster.GetGeneration()


Same as before with both the empty line removal and the new line addition.

Comment on lines +48 to +51
if err != nil {
return true
}
return true

Similar to my other comment:

We should add a debug log line here indicating there was an error emitting the event and/or setting the status condition. As it is right now, it's silently ignoring the error.

Since the function returns true at this point in either case, you could simply log a debug message (without returning inside the if err) and return true after the if conditional.

var err error
currentReplicas := *current.Spec.Replicas
logger := ctrl.LoggerFrom(ctx)
msg := "Cluster Scale down to 0 replicas."

Suggested change
msg := "Cluster Scale down to 0 replicas."
msg := "Cluster Scale down to 0 replicas"

Log and/or event messages should not end with a dot.

Comment on lines +75 to +77
err = r.updateAnnotation(ctx, cluster, cluster.Namespace, cluster.Name, beforeZeroReplicasConfigured, fmt.Sprint(currentReplicas))
r.Recorder.Event(cluster, corev1.EventTypeNormal, reason, msg)
return err

Suggested change
err = r.updateAnnotation(ctx, cluster, cluster.Namespace, cluster.Name, beforeZeroReplicasConfigured, fmt.Sprint(currentReplicas))
r.Recorder.Event(cluster, corev1.EventTypeNormal, reason, msg)
return err
r.Recorder.Event(cluster, corev1.EventTypeNormal, reason, msg)
return r.updateAnnotation(ctx, cluster, cluster.Namespace, cluster.Name, beforeZeroReplicasConfigured, fmt.Sprint(currentReplicas))

The Event() function does not use err, so we can simply return the updateAnnotation() function call.

Comment on lines +88 to +90
logger := ctrl.LoggerFrom(ctx)
var statusErr error
logger.Error(errors.New(reason), msg)

I am confused. Why do we log an error unconditionally here?
