[Questions] 4.1 Enabling feature flag khepri_db while log exchange is enabled results in OOM #14069
RabbitMQ version used: 4.1.0
Erlang version used: 27.3.x
Operating system (distribution) used: Ubuntu
How is RabbitMQ deployed? Debian package
Steps to deploy RabbitMQ cluster: install the deb package on 3 servers.
Steps to reproduce the behavior in question: locally from a git repo
What problem are you trying to solve?

On a 3-node RabbitMQ 4.1.0 cluster the log exchange was enabled at info level. When enabling the khepri_db feature flag, a node ran out of memory.

The issue is easily reproducible locally with a fresh, empty cluster if the log exchange level is set to debug (so there is more logging), but it sometimes also happens at higher levels (I suspect a timing race there; it might depend on the number of objects in Mnesia).

The OOM happens because of a cyclic dependency between the log exchange and the khepri_db migration. I managed to capture some partial stack traces before the OOM.

This is similar to an issue that happened on 3.13 with the message containers feature flag (exchange logging also depended on the mc feature flag; sorry, I cannot find the discussion link). If it is hard to solve this properly, a workaround is to disable exchange logging while enabling the feature flag.

Another issue is that when the first node restarted after the OOM, the cluster couldn't recover. First we saw a crash, and after it the first node remained in a cyclic reboot, while the other nodes only logged errors. The khepri_db feature flag got stuck in state_changing.

My question is whether there is any advice on how to recover the cluster from this state. (I imagine enabling Khepri can be interrupted for various reasons: a hardware failure, or, if there is too much metadata in Mnesia on a server with little memory, the migration itself can cause an OOM.) I'm not sure what state Khepri could be in (badmatch …).
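A rough sketch of the setup and the workaround, assuming the log exchange is configured via the documented `log.exchange` / `log.exchange.level` keys in `rabbitmq.conf` and a systemd-managed Debian install; adjust names and steps to your environment:

```shell
# Reproduction setting (per node, in rabbitmq.conf) -- more logging makes
# the OOM easier to trigger:
#
#   log.exchange       = true
#   log.exchange.level = debug
#
# Workaround sketch: turn exchange logging off while the metadata store
# migration runs, then turn it back on.

# 1. Set "log.exchange = false" (or comment it out) in rabbitmq.conf on
#    every node and restart the nodes one by one.
sudo systemctl restart rabbitmq-server

# 2. With exchange logging off, enable the feature flag from any node.
rabbitmqctl enable_feature_flag khepri_db

# 3. Verify the flag is enabled cluster-wide before re-enabling the
#    log exchange and restarting again.
rabbitmqctl list_feature_flags name state | grep khepri_db
```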
Replies: 2 comments
@gomoripeti the easiest option is to remove that node from the cluster and re-create it.
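A rough sketch of that recovery path, assuming the two remaining nodes are healthy and the broken node's data can be thrown away; node names below are placeholders:

```shell
# On a healthy node, with the broken node (rabbit@node1) stopped:
rabbitmqctl forget_cluster_node rabbit@node1

# On the broken node: wipe its local state and re-join the cluster.
# (If it cannot boot far enough for stop_app/reset, clearing its data
# directory before restarting achieves the same.)
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl join_cluster rabbit@node2
rabbitmqctl start_app
```

After re-joining, the node should sync the metadata store from the other cluster members.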

Fixed by @dumbbell via #14796