Support scale to zero rabbitMQ #1899


Open

jonathanCaamano wants to merge 11 commits into main

Conversation

@jonathanCaamano commented Jun 26, 2025

This closes #1876

As discussed with @Zerpet and @mkuratczyk in the issue, we add some logic to allow scaling RabbitMQ to zero.

We also add logic to prevent a scale down when opting out of the zero state.

We add a new annotation, rabbitmq.com/before-zero-replicas-configured, to save the number of replicas configured before scaling RabbitMQ to zero.

With this annotation we verify whether the desired replica count after the zero state is equal to or greater than the count before the zero state.
If the replicas don't pass this verification, the change is treated like a scale down.
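
A minimal sketch of this idea, assuming plain helper functions around the annotation (the function names and wiring below are illustrative assumptions, not the PR's actual code):

package controllers

import "strconv"

const beforeZeroReplicasConfigured = "rabbitmq.com/before-zero-replicas-configured"

// saveReplicasBeforeZero records the current replica count in the cluster's
// annotations before the StatefulSet is scaled down to zero replicas.
func saveReplicasBeforeZero(annotations map[string]string, currentReplicas int32) {
	annotations[beforeZeroReplicasConfigured] = strconv.FormatInt(int64(currentReplicas), 10)
}

// replicasBeforeZero reads the saved count back; ok is false when the
// annotation is missing or malformed.
func replicasBeforeZero(annotations map[string]string) (replicas int32, ok bool) {
	saved, found := annotations[beforeZeroReplicasConfigured]
	if !found {
		return 0, false
	}
	parsed, err := strconv.ParseInt(saved, 10, 32)
	if err != nil {
		return 0, false
	}
	return int32(parsed), true
}

// treatedAsScaleDown applies the verification described above: if the desired
// count after the zero state is lower than the saved count, the change is
// handled like a regular scale down.
func treatedAsScaleDown(annotations map[string]string, desiredReplicas int32) bool {
	before, ok := replicasBeforeZero(annotations)
	return ok && desiredReplicas < before
}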

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed

Summary Of Changes

Additional Context

Local Testing

Please ensure you run the unit, integration and system tests before approving the PR.

To run the unit and integration tests:

$ make unit-tests integration-tests

You will need to target a k8s cluster and have the operator deployed for running the system tests.

For example, for a Kubernetes context named dev-bunny:

$ kubectx dev-bunny
$ make destroy deploy-dev
# wait for operator to be deployed
$ make system-tests

@mkuratczyk (Contributor)

Thanks for the PR. Just FYI, I will certainly test this soon, but I need to finish some other things first.

@mkuratczyk (Contributor)

Some initial feedback:

  1. ALLREPLICASREADY shows "true" when all replicas are stopped
# deploy a cluster, set replicas to 0, and then get the cluster:
> kubectl get rmq
NAME   ALLREPLICASREADY   RECONCILESUCCESS   AGE
rmq    True               True               13m

I think it should be set to False when scaled to 0 (see the sketch after this list).

  2. Attempting to scale up from zero to a lower number of replicas than before scaling to zero leads to an error:
2025-07-08T17:33:30+02:00	ERROR	Cluster Scale down not supported; tried to scale cluster from 3 nodes to 1 nodes	{"controller": "rabbitmqcluster", "controllerGroup": "rabbitmq.com", "controllerKind": "RabbitmqCluster", "RabbitmqCluster": {"name":"rmq","namespace":"default"}, "namespace": "default", "name": "rmq", "reconcileID": "338516a2-8aeb-447e-97fd-92e1774ae64d", "error": "UnsupportedOperation"}
github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).recordEventsAndSetCondition
	/Users/mkuratczyk/workspace/cluster-operator/controllers/reconcile_scale_zero.go:90
github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).scaleDownFromZero
	/Users/mkuratczyk/workspace/cluster-operator/controllers/reconcile_scale_zero.go:57
github.com/rabbitmq/cluster-operator/v2/controllers.(*RabbitmqClusterReconciler).Reconcile
	/Users/mkuratczyk/workspace/cluster-operator/controllers/rabbitmqcluster_controller.go:216
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Reconcile
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:340
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:300
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.1
	/Users/mkuratczyk/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:202

(it is not expected to work as we discussed, but the stacktrace shouldn't be there, unless there's a good reason for it)

  3. Attempting to scale from zero up to a number of replicas higher than before scaling down to zero works, which surprised me.
    Steps: deploy a 3-node cluster, set replicas to 0, then set replicas to 5. I don't see any reason for this to cause problems on 4.1+ thanks to the new peer discovery mechanism, but I guess it could cause issues with older RabbitMQ versions. Not sure what to do about this one yet. Perhaps we should keep it like that and just warn that using it with older RabbitMQ versions is risky.
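
A minimal sketch of the rule suggested in point 1 above, assuming a plain helper; the name and the operator's actual condition wiring are assumptions, not taken from this PR:

package controllers

// allReplicasReady reports whether the AllReplicasReady condition should be
// True. Treating a desired replica count of zero as "not ready" matches the
// feedback in point 1. (Hypothetical helper, not the operator's real
// condition code.)
func allReplicasReady(desiredReplicas, readyReplicas int32) bool {
	return desiredReplicas > 0 && readyReplicas == desiredReplicas
}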

@jonathanCaamano (Author)

Hello @mkuratczyk,

  1. Sure, I will make some changes so that ALLREPLICASREADY is false.

  2. About this, I followed the same flow as a scale down, as defined in the code (because if the previously configured replicas are 3 and you now try to set 1, that represents a scale down). So, if you'd like, I can change it and remove the stack trace.

  3. About the RabbitMQ versions: the version in my cluster is 3.13 and it's working properly. Maybe you see something I don't?

Thank you for the feedback

@mkuratczyk (Contributor)

  1. If there's a stack trace when a scale down is attempted (without scaling down to zero) then I think ideally we should just fix that for both cases. Alternatively, you can ignore it and we can deal with this separately.

  2. I'm not saying it will never work, more that it could lead to random problems. Say we have 1 node, scale to zero and then scale to 3. What if the two new nodes start first for some reason? I think they could form a new cluster, at least in some cases. With 4.1+, that should not happen, since all nodes will wait for the node/pod with -0 suffix:
    https://www.rabbitmq.com/blog/2025/04/04/new-k8s-peer-discovery

@jonathanCaamano (Author) commented Jul 14, 2025

Hello @mkuratczyk!

I made some changes:

1. ALLREPLICASREADY is now false when the cluster is scaled to zero.
2. I tried to change this, but it needs further analysis and may mean changing how the global logger works.
3. We changed the approach: scaling RabbitMQ up from zero now has to restore the same number of replicas as before zero; if you want to scale up further, you first have to restore the replica count configured before zero. This avoids the problems you mentioned, since the annotation is always respected (see the sketch below).
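
A minimal sketch of this revised rule, again assuming an illustrative helper rather than the PR's actual code:

package controllers

import "strconv"

// allowedScaleFromZero reports whether scaling up from zero is allowed under
// the revised rule: the desired count must exactly match the count saved in
// the rabbitmq.com/before-zero-replicas-configured annotation before the
// cluster was scaled to zero.
func allowedScaleFromZero(annotations map[string]string, desiredReplicas int32) bool {
	saved, ok := annotations["rabbitmq.com/before-zero-replicas-configured"]
	if !ok {
		return false
	}
	before, err := strconv.ParseInt(saved, 10, 32)
	if err != nil {
		return false
	}
	return desiredReplicas == int32(before)
}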

Kind regards

@mkuratczyk (Contributor)

Thanks. My only additional feedback is that the error message is a bit cryptic ("Cluster Scale from zero to other replicas than before configured not supported; tried to scale cluster from 3 nodes to 5 nodes"). Perhaps "unsupported operation: when scaling from zero, you can only restore the previous number of replicas (3)"?

@Zerpet @ansd @MirahImage any thoughts about this PR?

@jonathanCaamano (Author)

Hello,

I changed the logger.

@Zerpet self-requested a review July 18, 2025 12:08

@Zerpet (Member) left a comment

Thank you for contributing this PR! I left some comments with feedback that I would like to be addressed before merging.

Comment on lines +20 to +24
func (r *RabbitmqClusterReconciler) scaleToZero(current, sts *appsv1.StatefulSet) bool {
currentReplicas := *current.Spec.Replicas
desiredReplicas := *sts.Spec.Replicas
return desiredReplicas == 0 && currentReplicas > 0
}

This function does not need to be part of the RabbitmqClusterReconciler, because it doesn't make any use of the struct functions or fields, it takes all its information from the arguments.

Comment on lines +27 to +31
func (r *RabbitmqClusterReconciler) scaleFromZero(current, sts *appsv1.StatefulSet) bool {
currentReplicas := *current.Spec.Replicas
desiredReplicas := *sts.Spec.Replicas
return currentReplicas == 0 && desiredReplicas > 0
}

This function does not need to be part of the RabbitmqClusterReconciler, because it doesn't make any use of the struct functions or fields, it takes all its information from the arguments.
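
For example, both helpers could become package-level functions along these lines (a sketch of the suggested refactor, not the final code):

package controllers

import appsv1 "k8s.io/api/apps/v1"

// scaleToZero reports whether the desired StatefulSet takes a running cluster
// down to zero replicas.
func scaleToZero(current, sts *appsv1.StatefulSet) bool {
	return *sts.Spec.Replicas == 0 && *current.Spec.Replicas > 0
}

// scaleFromZero reports whether the desired StatefulSet brings a cluster that
// is currently at zero replicas back up.
func scaleFromZero(current, sts *appsv1.StatefulSet) bool {
	return *current.Spec.Replicas == 0 && *sts.Spec.Replicas > 0
}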

Comment on lines +57 to +59
if err != nil {
return true
}

We should add a debug log line here indicating there was an error emitting the event and/or setting the status condition. As it is right now, it's silently ignoring the error.

In fact, since the function returns true at this point in either case, you could simply log a debug message (without returning inside the if) and return true after the if conditional.
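
A sketch of that shape, using the controller-runtime logger this file already uses elsewhere; the function and parameter names here are illustrative stand-ins, not the PR's actual signature:

package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// recordAndReturn is an illustrative stand-in for the function under review:
// a failure to emit the event or set the status condition is logged at debug
// level instead of being silently dropped, and true is returned after the if.
func recordAndReturn(ctx context.Context, setCondition func() error) bool {
	logger := ctrl.LoggerFrom(ctx)
	if err := setCondition(); err != nil {
		logger.V(1).Info("failed to emit event or set status condition", "error", err)
	}
	return true
}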

@@ -213,7 +227,6 @@ func (r *RabbitmqClusterReconciler) Reconcile(ctx context.Context, req ctrl.Requ
return ctrl.Result{}, err
}
}


I don't think this line break removal improves readability of this function; in fact, I think the opposite: removing this empty line makes the function cluttered. This line break separates high-level steps of the reconcile process. Prior to this line break is the logic to reconcile the STS (plus other things earlier on); after this line break is the logic to update the STS and emit the relevant metadata. I believe this separation makes sense, and I want to keep it unless there's a compelling argument for the opposite.

Comment on lines -252 to +267

// Set ReconcileSuccess to true and update observedGeneration after all reconciliation steps have finished with no error
rabbitmqCluster.Status.ObservedGeneration = rabbitmqCluster.GetGeneration()


Same as before with both the empty line removal and the new line addition.

Comment on lines +48 to +51
if err != nil {
return true
}
return true

Similar to my other comment:

We should add a debug log line here indicating there was an error emitting the event and/or setting the status condition. As it is right now, it's silently ignoring the error.

Since the function returns true at this point in either case, you could simply log a debug message (without returning inside the if err) and return true after the if conditional.

var err error
currentReplicas := *current.Spec.Replicas
logger := ctrl.LoggerFrom(ctx)
msg := "Cluster Scale down to 0 replicas."

Suggested change
msg := "Cluster Scale down to 0 replicas."
msg := "Cluster Scale down to 0 replicas"

Log and/or event messages should not end with a dot.

Comment on lines +75 to +77
err = r.updateAnnotation(ctx, cluster, cluster.Namespace, cluster.Name, beforeZeroReplicasConfigured, fmt.Sprint(currentReplicas))
r.Recorder.Event(cluster, corev1.EventTypeNormal, reason, msg)
return err

Suggested change
err = r.updateAnnotation(ctx, cluster, cluster.Namespace, cluster.Name, beforeZeroReplicasConfigured, fmt.Sprint(currentReplicas))
r.Recorder.Event(cluster, corev1.EventTypeNormal, reason, msg)
return err
r.Recorder.Event(cluster, corev1.EventTypeNormal, reason, msg)
return r.updateAnnotation(ctx, cluster, cluster.Namespace, cluster.Name, beforeZeroReplicasConfigured, fmt.Sprint(currentReplicas))

The Event() function does not use err, so we can simply return the updateAnnotation() function call.

Comment on lines +88 to +90
logger := ctrl.LoggerFrom(ctx)
var statusErr error
logger.Error(errors.New(reason), msg)

I am confused. Why do we log an error unconditionally here?
