
[Enhancement]: Allow editing broker.session.timeout.ms in Strimzi Operator 0.45.x, because brokers are being fenced during pod rolling restarts. #11736

@alicvsroxas

Description


Related problem

After upgrading our Kafka clusters to Kafka 3.9.0 and Strimzi Operator 0.45.0, and migrating our production cluster to KRaft mode, we started observing strange behaviour when initiating pod rolling restarts through the operator, both mid-migration to KRaft and after it.

During the pod rolling restart procedure, a broker is gracefully shut down by the KRaft controller, starts up again, and rejoins the cluster. Shortly afterward, while it is still busy replicating missing data, the broker suddenly freezes for approximately 20–50 seconds. During this time it produces no logs, no metrics, and no heartbeat messages to the controllers.
But because the broker has already caught back up on some of its partitions, the Strimzi Operator moves on to the next broker to be rolled in the pod rolling restart procedure.

2025-07-30 09:21:27,224 INFO [Broker id=8] Skipped the become-follower state change for my-topic-215 with topic id Some(yO4CQayIRbyESrHHVPdOrQ) and partition state LeaderAndIsrPartitionState(topicName='my-topic', partitionIndex=215, controllerEpoch=-1, leader=4, leaderEpoch=54, isr=[4, 8], partitionEpoch=102, replicas=[4, 8], addingReplicas=[], removingReplicas=[], isNew=false, leaderRecoveryState=0) since it is already a follower with leader epoch 54. (state.change.logger) [kafka-8-metadata-loader-event-handler]

After this “hanging” period, the broker resumes normal operation without emitting any error or warning messages — as if nothing happened. However, during this gap, because the broker fails to send heartbeats to the KRaft controllers, it gets fenced out of the cluster (9s timeout) during the next broker rollout, which leads to partitions going offline and, ultimately, data loss.

2025-07-30 09:21:40,325 INFO [QuorumController id=300] Fencing broker 8 because its session has timed out. (org.apache.kafka.controller.ReplicationControlManager) [quorum-controller-300-event-handler]

We haven't been able to reproduce or simulate this behavior in a test environment. However, we can confirm that it consistently occurs in our production clusters every time we perform a rolling restart of the brokers. We saw no unusual metric spikes in the pod or the node it runs on before the issue occurred, which leads us to believe it is not a hardware issue.

We opened an issue on Kafka (KAFKA-19586) and are actively investigating the root cause.
In the meantime, we thought of mitigating the problem by increasing the broker.session.timeout.ms configuration, which cannot be set through Strimzi Operator version 0.45.0.

Suggested solution

To avoid the broker being fenced by the KRaft controller during the pod rolling restart procedure, we would like to increase broker.session.timeout.ms from its default of 9 s to up to a minute, in the hope of mitigating this issue.
Since we are using the Strimzi Operator, it looks like this config will only become configurable in version 0.48.0.

Since we can't safely initiate a pod rolling restart on our clusters until we find the reason for this behaviour, we need the ability to configure broker.session.timeout.ms in the operator version we currently use, 0.45.x. A sketch of the configuration we have in mind is shown below.
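
For illustration, here is a minimal sketch of the Kafka custom resource change we have in mind. The cluster name and the exact values are hypothetical, and in 0.45.x the operator does not currently pass this option through, which is exactly what this request is about:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster              # hypothetical cluster name
spec:
  kafka:
    version: 3.9.0
    config:
      # Raise the KRaft broker session timeout from the default 9000 ms
      # so a briefly unresponsive broker is not fenced mid-rollout.
      broker.session.timeout.ms: 60000
      # Keep the broker heartbeat interval (Kafka default 2000 ms) well
      # below the session timeout.
      broker.heartbeat.interval.ms: 2000

As far as we understand, broker.session.timeout.ms is evaluated by the KRaft controllers when deciding whether to fence a broker, so the setting would also need to reach the controller nodes, not just the brokers.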

Alternatives

No response

Additional context

No response
