3.13.7 Shovel's rabbit_shovel_dyn_worker_sup_sup can fail without being restarted (exceed supervisor restart intensity?) #14791

kubrakaraman6 · 2025-10-23T07:05:36Z

Describe the bug

Environment

RabbitMQ Version: 3.13.7
Erlang: (SMP,ASYNC_THREADS) (BEAM) emulator version 14.2.5.10
Cluster Type: Multi-node (3 nodes)
OS: (e.g., Ubuntu 22.04)
Plugins Enabled:
- rabbitmq_shovel
- rabbitmq_shovel_management

We observed that a dynamic shovel supervisor process (rabbit_shovel_dyn_worker_sup_sup) on one cluster node crashes and does not recover automatically.
The affected node (rabbit@rabbit-01) cannot start any shovel workers until the node or shovel application is manually restarted.
Meanwhile, the same shovel definition works successfully when created on the second node (rabbit@rabbit-02).

Relevant log entries:
[error] <0.211588995.1> Shovel with the name 'Move from Queque_error' was not found on virtual host '/'
[error] <0.211588995.1> Could not find shovel data for shovel 'Move from Queque_error' in vhost: '/'
[error] <0.211588995.1> Failed to delete shovel 'Move from Queque_error' on vhost '/', reason: {exception,
{noproc,
{gen_server,
call,
[rabbit_shovel_dyn_worker_sup_sup,which_children,infinity]}}}
[info] <0.211582083.1> Waiting for Mnesia tables for 30000 ms, 9 retries left

2025-10-23 05:31:49.714145+00:00 [info] <0.211347036.1> Waiting for Mnesia tables for 30000 ms, 9 retries left
2025-10-23 05:31:49.714264+00:00 [info] <0.211347036.1> Successfully synced tables from a peer
2025-10-23 05:31:57.144238+00:00 [warning] <0.167982866.1> Recurring shovel spec clean up failed with exit:{noproc,
{gen_server,call,
[rabbit_shovel_dyn_worker_sup_sup,
which_children,infinity]}}
2025-10-23 05:32:43.820388+00:00 [error] <0.211166023.1> crasher:
initial call: cowboy_stream_h:request_process/3
pid: <0.211166023.1>
registered_name: []
exception exit: {{noproc,
{gen_server,call,
[rabbit_shovel_dyn_worker_sup_sup,which_children,
infinity]}},
[{gen_server,call,3,[{file,"gen_server.erl"},{line,419}]},
{mirrored_supervisor,child,2,
[{file,"mirrored_supervisor.erl"},{line,226}]},
{mirrored_supervisor,call,2,
[{file,"mirrored_supervisor.erl"},{line,203}]},
{mirrored_supervisor,fold,3,
[{file,"mirrored_supervisor.erl"},{line,219}]},
{rabbit_shovel_dyn_worker_sup_sup,child_exists,1,
[{file,"rabbit_shovel_dyn_worker_sup_sup.erl"},
{line,69}]},
{rabbit_shovel_dyn_worker_sup_sup,adjust,2,
[{file,"rabbit_shovel_dyn_worker_sup_sup.erl"},
{line,34}]},
{rabbit_runtime_parameters,set_any0,5,
[{file,"rabbit_runtime_parameters.erl"},{line,147}]},
{rabbit_runtime_parameters,set_any,5,
[{file,"rabbit_runtime_parameters.erl"},
{line,122}]}]}
in function gen_server:call/3 (gen_server.erl, line 419)
in call from mirrored_supervisor:child/2 (mirrored_supervisor.erl, line 226)
in call from mirrored_supervisor:call/2 (mirrored_supervisor.erl, line 203)
in call from mirrored_supervisor:fold/3 (mirrored_supervisor.erl, line 219)
in call from rabbit_shovel_dyn_worker_sup_sup:child_exists/1 (rabbit_shovel_dyn_worker_sup_sup.erl, line 69)
in call from rabbit_shovel_dyn_worker_sup_sup:adjust/2 (rabbit_shovel_dyn_worker_sup_sup.erl, line 34)
in call from rabbit_runtime_parameters:set_any0/5 (rabbit_runtime_parameters.erl, line 147)
in call from rabbit_runtime_parameters:set_any/5 (rabbit_runtime_parameters.erl, line 122)
ancestors: [<0.211394560.1>,<0.168020937.1>,<0.167981760.1>,
<0.167984701.1>,<0.168020606.1>,rabbit_web_dispatch_sup,
<0.628.0>]
message_queue_len: 0
messages: []
links: [<0.211394560.1>]
dictionary: []
trap_exit: false
status: running
heap_size: 2586
stack_size: 28
reductions: 4435
neighbours:
2025-10-23 05:32:43.820961+00:00 [error] <0.211394560.1> Ranch listener {acceptor,{0,0,0,0,0,0,0,0},15672}, connection process <0.211394560.1>, stream 1 had its request process <0.211166023.1> exit with reason {noproc,{gen_server,call,[rabbit_shovel_dyn_worker_sup_sup,which_children,infinity]}} and stacktrace [{gen_server,call,3,[{file,"gen_server.erl"},{line,419}]},{mirrored_supervisor,child,2,[{file,"mirrored_supervisor.erl"},{line,226}]},{mirrored_supervisor,call,2,[{file,"mirrored_supervisor.erl"},{line,203}]},{mirrored_supervisor,fold,3,[{file,"mirrored_supervisor.erl"},{line,219}]},{rabbit_shovel_dyn_worker_sup_sup,child_exists,1,[{file,"rabbit_shovel_dyn_worker_sup_sup.erl"},{line,69}]},{rabbit_shovel_dyn_worker_sup_sup,adjust,2,[{file,"rabbit_shovel_dyn_worker_sup_sup.erl"},{line,34}]},{rabbit_runtime_parameters,set_any0,5,[{file,"rabbit_runtime_parameters.erl"},{line,147}]},{rabbit_runtime_parameters,set_any,5,[{file,"rabbit_runtime_parameters.erl"},{line,122}]}]
2025-10-23 05:32:57.145259+00:00 [warning] <0.167982866.1> Recurring shovel spec clean up failed with exit:{noproc,
{gen_server,call,
[rabbit_shovel_dyn_worker_sup_sup,
which_children,infinity]}}
2025-10-23 05:33:57.146258+00:00 [warning] <0.167982866.1> Recurring shovel spec clean up failed with exit:{noproc,
{gen_server,call,
[rabbit_shovel_dyn_worker_sup_sup,
which_children,infinity]}}
2025-10-23 05:34:57.147260+00:00 [warning] <0.167982866.1> Recurring shovel spec clean up failed with exit:{noproc,
{gen_server,call,
[rabbit_shovel_dyn_worker_sup_sup,
which_children,infinity]}}
2025-10-23 05:35:14.815840+00:00 [info] <0.211412137.1> Waiting for Mnesia tables for 30000 ms, 9 retries left
2025-10-23 05:35:14.815968+00:00 [info] <0.211412137.1> Successfully synced tables from a peer
2025-10-23 05:35:22.407824+00:00 [info] <0.211412483.1> Waiting for Mnesia tables for 30000 ms, 9 retries left
2025-10-23 05:35:22.407949+00:00 [info] <0.211412483.1> Successfully synced tables from a peer
2025-10-23 05:35:23.590839+00:00 [info] <0.211412854.1> Waiting for Mnesia tables for 30000 ms, 9 retries left

After this point:

rabbitmqctl eval 'supervisor2:which_children(rabbit_shovel_dyn_worker_sup_sup).'
# => {:noproc, {:gen_server, :call, [:rabbit_shovel_dyn_worker_sup_sup, :which_children, :infinity]}}



### Reproduction steps

1. Create a dynamic shovel in vhost / (e.g. “Move from Queque_error”).
2. Trigger a cluster sync event (e.g., restart one node or cause a Mnesia table resync).
3. Attempt to delete and recreate the shovel on the affected node (rabbit@rabbit-01).
4. Observe noproc errors in the logs and that the supervisor process no longer exists.
5. Apply the same shovel on another node (rabbit@rabbit-02): the shovel works correctly.
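
The noproc check in the steps above can be done mechanically. The following is a minimal sketch, assuming the node log is available as plain text; the helper name `find_noproc_calls` is hypothetical, and the sample text copies the wrapped format of the log entries quoted in this report.

```python
import re

# Matches the exit reason seen in the report: a gen_server call to a
# registered process that no longer exists. Whitespace between terms is
# optional because the log wraps Erlang terms across lines.
NOPROC_RE = re.compile(
    r"\{noproc,\s*\{gen_server,\s*call,\s*\[(\w+),\s*(\w+),\s*(\w+)\]\}\}"
)


def find_noproc_calls(log_text):
    """Return (process_name, request, timeout) for each noproc exit found."""
    return NOPROC_RE.findall(log_text)


# Sample copied from the log excerpt in this report.
sample = """Recurring shovel spec clean up failed with exit:{noproc,
{gen_server,call,
[rabbit_shovel_dyn_worker_sup_sup,
which_children,infinity]}}
Waiting for Mnesia tables for 30000 ms, 9 retries left
"""

hits = find_noproc_calls(sample)
print(hits)
```

A non-empty result naming `rabbit_shovel_dyn_worker_sup_sup` indicates that callers could not reach the supervisor on that node; it does not by itself say why the process went away.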

### Expected behavior

The shovel supervisor should remain running or automatically recover after Mnesia sync or parameter update.
New shovels should start normally on all nodes without requiring a restart.

### Additional context

On one node, the shovel supervisor crashes and stays dead.

New shovel definitions cannot start workers.

The same definition works fine on a different cluster node.

Only restarting the rabbitmq_shovel application or node recovers functionality.

Answered by michaelklishin

michaelklishin · 2025-10-23T07:29:46Z

Maintainer

@kubrakaraman6 RabbitMQ 3.13.x is out of community support.

> But the Same Fundamental Problem Applies to 4.x?

That's correct, 4.x still uses mirrored_supervisor and it hosts all shovels on a single node.
As with all Mnesia-based features, our team considers mirrored_supervisor's fundamental limitations to be unfixable. You can only replace it with something else, just as Mnesia was replaced with Khepri; new 4.2.0 clusters will use Khepri by default.

This is one of the reasons why Tanzu RabbitMQ 4.2.0 has a new plugin that offers distributed shovels which, as the name suggests, distributes shovels across all nodes. As you can imagine, the key corporate sponsor of RabbitMQ is not going to open source that work any time soon, or likely ever.

What that means for the rabbitmq_shovel plugin we have in the open source edition, time will tell. Its current limitations are perfectly acceptable for a lot of users, as rabbitmq_shovel's 16-year-long history proves.

Those who want distributed shovels will have to get a Tanzu RabbitMQ license. This is the collective price we pay for core RabbitMQ being open source and a very significant amount of effort being invested into the open source edition.

michaelklishin · 2025-10-23T07:37:31Z

Maintainer

> Only restarting the rabbitmq_shovel application or node recovers functionality.

This means that in your specific case, it was likely only the local rabbit_shovel_dyn_worker_sup_sup process that had terminated. There are worse scenarios, where you'd have to restart a specific node, but they are extremely rare.

This could potentially be a matter of configuring the supervisor tree restart intensity settings around rabbit_shovel_dyn_worker_sup_sup; for example, those of its parent supervisor, rabbit_shovel_sup.
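
For readers unfamiliar with restart intensity: a standard OTP supervisor tolerates at most `intensity` restarts within any `period` seconds, and on the next restart inside that window it terminates its children and then itself. Below is a minimal Python sketch of that rule only. The class name `RestartWindow` and the limits used (3 restarts in 10 seconds) are hypothetical; this thread does not state the real values for rabbit_shovel_sup, and mirrored_supervisor may behave differently in detail.

```python
from collections import deque


class RestartWindow:
    """Sliding-window model of OTP supervisor restart intensity.

    At most `intensity` restarts are tolerated within any `period`
    seconds; one more restart inside that window makes the supervisor
    shut down instead of restarting the child.
    """

    def __init__(self, intensity, period):
        self.intensity = intensity
        self.period = period
        self.restarts = deque()

    def record_restart(self, now):
        """Record a child restart at time `now` (in seconds).

        Returns True if the supervisor keeps running, False if the
        restart limit is exceeded and the supervisor gives up.
        """
        self.restarts.append(now)
        # Forget restarts that have fallen out of the window.
        while self.restarts and now - self.restarts[0] > self.period:
            self.restarts.popleft()
        return len(self.restarts) <= self.intensity


# Hypothetical limits; the thread does not state the real values.
burst = RestartWindow(intensity=3, period=10)
burst_results = [burst.record_restart(t) for t in (0, 1, 2, 3)]

spaced = RestartWindow(intensity=3, period=10)
spaced_results = [spaced.record_restart(t) for t in (0, 20, 40, 60)]

print(burst_results)   # [True, True, True, False]
print(spaced_results)  # [True, True, True, True]
```

Four crashes in quick succession exceed the limit, while the same four crashes spread out do not. Once a supervisor gives up, it is up to its parent whether it is started again; if it is not, its registered name stays unregistered and callers see noproc. Whether this is what happened on the affected node is not confirmed in this thread.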

And it just happens to be the case that Tanzu RabbitMQ 4.2.0 distributed shovels address this specific scenario as well, even though it wasn't the reason why distributed shovels were originally developed, and the goal was not to make regular shovels worse (in fact, 4.2.0 ships with regular shovel improvements such as the local "protocol" support).

Be that as it may, significant shovel or mirrored_supervisor design changes are not in the cards for open source RabbitMQ, whether we like it or not. Everything else around shovels in the open source edition is still developed and maintained like any other feature.
