
Conversation

@mihow
Collaborator

@mihow mihow commented Nov 16, 2025

Summary

Update several default Celery configurations in the Django settings, in coordination with several changes made directly to the shared Redis & RabbitMQ server instances.

List of Changes

  • The primary change we are testing is enabling worker_cancel_long_running_tasks_on_connection_loss for Celery, which should automatically cancel & requeue tasks when the worker loses its broker connection. This is exactly what is happening to us now: Antenna/Celery appears to lose its connection to the worker, but the tasks are NOT canceled; instead they stay in a running state and receive no further updates (no new captures are processed). Celery will enable this setting by default in the next major version.
  • Additional settings intended to reduce worker disconnections
  • Increase memory on the shared Redis & RabbitMQ server from 16GB to 45GB
  • Stop Redis from persisting data to disk, now that we only use Redis for caching (the snapshot save process had been running almost constantly at 100% CPU)
  • Decrease the very long timeouts on Redis & RabbitMQ. I believe these were originally set in the belief that they needed to accommodate our long-running tasks, but that was mistaken: these settings govern communication timeouts, not task duration.
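The settings changes above can be sketched in the Django settings roughly as below. The setting names come from this PR; any value marked "assumed" is illustrative only, not the actual production value.

```python
# config/settings/base.py (sketch)

# Cancel & requeue late-acknowledged tasks when the worker loses its
# broker connection (opt-in before Celery 6.0, default afterwards).
CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS = True

# Retry broker connections, including at worker startup.
CELERY_BROKER_CONNECTION_RETRY = True
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True

# RabbitMQ transport tuning: allow more time to establish a
# connection, but detect a dead one faster.
CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_connect_timeout": 40,  # was 30
    "heartbeat": 30,               # was 60
}

# Redis connection-health settings (values assumed).
CELERY_REDIS_SOCKET_KEEPALIVE = True
CELERY_REDIS_SOCKET_TIMEOUT = 30
CELERY_REDIS_BACKEND_HEALTH_CHECK_INTERVAL = 25
CELERY_REDIS_MAX_CONNECTIONS = 20
```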

See these config files, which are not tracked in the main Antenna app repo:
/etc/redis/redis.conf
/etc/rabbitmq/rabbitmq.conf
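Since those files are untracked, the kinds of directives involved probably look like the sketch below. None of these values come from the PR; they are placeholders showing the relevant knobs.

```ini
# /etc/redis/redis.conf (illustrative sketch only)
save ""       # disable RDB snapshot persistence; Redis is now cache-only
timeout 300   # assumed: shorter client idle timeout, in seconds

# /etc/rabbitmq/rabbitmq.conf (illustrative sketch only)
heartbeat = 30              # assumed: match Celery's 30s heartbeat
consumer_timeout = 1800000  # assumed: reduced consumer ack timeout, in ms
```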

Related Issues

#1025
#721 (this PR is one of many fixes for that issue)
#1041
#1051

How to Test the Changes

Run several small, medium & large jobs in production

Screenshots


worker_cancel_long_running_tasks_on_connection_loss

Redis with snapshot saving enabled, on the original 16GB server:

With snapshot saving disabled, on the new 45GB instance. We now use Redis only for caching and a few simple locks; the snapshot save process had been running almost constantly at 100% CPU.

Logs highlighting the new Celery setting and the missed heartbeats:

celeryworker-1  | [2025-11-15 23:42:39,708: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-15 23:45:23,108: INFO/MainProcess] sync with celery@b9a52de2e99c
celeryworker-1  | [2025-11-15 23:48:39,822: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-15 23:50:00,021: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[7a1f72e5-aa62-4ca9-a127-d021bca51109] received
celeryworker-1  | [2025-11-15 23:50:00,025: INFO/ForkPoolWorker-18] Checking if processing services are online.
celeryworker-1  | [2025-11-15 23:50:00,066: INFO/ForkPoolWorker-18] Checking service #5 "AMI Data Companion" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-15 23:50:00,144: INFO/ForkPoolWorker-18] Checking service #12 "AMI Data Companion" at https://ml.dev.insectai.org/
celeryworker-1  | [2025-11-15 23:50:00,201: INFO/ForkPoolWorker-18] Checking service #13 "Zero Shot Detector Pipelines" at https://ml-zs.dev.insectai.org/
celeryworker-1  | [2025-11-15 23:50:00,269: INFO/ForkPoolWorker-18] Checking service #14 "Zero Shot" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-15 23:50:00,325: INFO/ForkPoolWorker-18] Task ami.ml.tasks.check_processing_services_online[7a1f72e5-aa62-4ca9-a127-d021bca51109] succeeded in 0.29996979236602783s: None
celeryworker-1  | [2025-11-15 23:54:39,923: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:00:00,020: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[8c63441c-9432-4fce-a8ff-8f7d3824fb00] received
celeryworker-1  | [2025-11-16 00:00:00,025: INFO/ForkPoolWorker-18] Checking if processing services are online.
celeryworker-1  | [2025-11-16 00:00:00,054: INFO/ForkPoolWorker-18] Checking service #5 "AMI Data Companion" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-16 00:00:00,673: INFO/ForkPoolWorker-18] Checking service #12 "AMI Data Companion" at https://ml.dev.insectai.org/
celeryworker-1  | [2025-11-16 00:00:00,770: INFO/ForkPoolWorker-18] Checking service #13 "Zero Shot Detector Pipelines" at https://ml-zs.dev.insectai.org/
celeryworker-1  | [2025-11-16 00:00:01,010: INFO/ForkPoolWorker-18] Checking service #14 "Zero Shot" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-16 00:00:01,059: INFO/ForkPoolWorker-18] Task ami.ml.tasks.check_processing_services_online[8c63441c-9432-4fce-a8ff-8f7d3824fb00] succeeded in 1.0348858758807182s: None
celeryworker-1  | [2025-11-16 00:00:30,046: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:01:04,287: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
celeryworker-1  | Traceback (most recent call last):
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/celery/worker/consumer/consumer.py", line 340, in start
celeryworker-1  |     blueprint.start(self)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/celery/bootsteps.py", line 116, in start
celeryworker-1  |     step.start(parent)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/celery/worker/consumer/consumer.py", line 746, in start
celeryworker-1  |     c.loop(*c.loop_args())
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/celery/worker/loops.py", line 97, in asynloop
celeryworker-1  |     next(loop)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/kombu/asynchronous/hub.py", line 373, in create_loop
celeryworker-1  |     cb(*cbargs)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/kombu/transport/base.py", line 248, in on_readable
celeryworker-1  |     reader(loop)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/kombu/transport/base.py", line 230, in _read
celeryworker-1  |     drain_events(timeout=0)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/amqp/connection.py", line 526, in drain_events
celeryworker-1  |     while not self.blocking_read(timeout):
celeryworker-1  |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/amqp/connection.py", line 531, in blocking_read
celeryworker-1  |     frame = self.transport.read_frame()
celeryworker-1  |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/amqp/transport.py", line 297, in read_frame
celeryworker-1  |     frame_header = read(7, True)
celeryworker-1  |                    ^^^^^^^^^^^^^
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/amqp/transport.py", line 632, in _read
celeryworker-1  |     s = recv(n - len(rbuf))
celeryworker-1  |         ^^^^^^^^^^^^^^^^^^^
celeryworker-1  | ConnectionResetError: [Errno 104] Connection reset by peer
celeryworker-1  | [2025-11-16 00:01:04,290: WARNING/MainProcess] /usr/local/lib/python3.11/site-packages/celery/worker/consumer/consumer.py:391: CPendingDeprecationWarning: 
celeryworker-1  | In Celery 5.1 we introduced an optional breaking change which
celeryworker-1  | on connection loss cancels all currently executed tasks with late acknowledgement enabled.
celeryworker-1  | These tasks cannot be acknowledged as the connection is gone, and the tasks are automatically redelivered
celeryworker-1  | back to the queue. You can enable this behavior using the worker_cancel_long_running_tasks_on_connection_loss
celeryworker-1  | setting. In Celery 5.1 it is set to False by default. The setting will be set to True by default in Celery 6.0.
celeryworker-1  | 
celeryworker-1  |   warnings.warn(CANCEL_TASKS_BY_DEFAULT, CPendingDeprecationWarning)
celeryworker-1  | 
celeryworker-1  | [2025-11-16 00:01:04,291: INFO/MainProcess] Temporarily reducing the prefetch count to 15 to avoid over-fetching since 1 tasks are currently being processed.
celeryworker-1  | The prefetch count will be gradually restored to 16 as the tasks complete processing.
celeryworker-1  | [2025-11-16 00:01:04,300: INFO/MainProcess] Connected to amqp://antenna:**@rabbitmq:5672//
celeryworker-1  | [2025-11-16 00:01:04,509: INFO/MainProcess] mingle: searching for neighbors
celeryworker-1  | [2025-11-16 00:01:05,537: INFO/MainProcess] mingle: sync with 1 nodes
celeryworker-1  | [2025-11-16 00:01:05,537: INFO/MainProcess] mingle: sync complete
celeryworker-1  | [2025-11-16 00:07:00,638: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:10:00,019: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[761f5c57-a39e-4055-acd1-e86e7579a436] received
celeryworker-1  | [2025-11-16 00:13:00,729: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:19:00,806: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:20:00,015: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[817edb21-8302-411b-9efb-608f4e01ebdb] received
celeryworker-1  | [2025-11-16 00:20:00,018: INFO/MainProcess] Resuming normal operations following a restart.
celeryworker-1  | Prefetch count has been restored to the maximum of 16
celeryworker-1  | [2025-11-16 00:20:00,021: INFO/ForkPoolWorker-18] Checking if processing services are online.
celeryworker-1  | [2025-11-16 00:20:00,058: INFO/ForkPoolWorker-18] Checking service #5 "AMI Data Companion" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-16 00:20:00,224: INFO/ForkPoolWorker-18] Checking service #12 "AMI Data Companion" at https://ml.dev.insectai.org/
celeryworker-1  | [2025-11-16 00:20:00,366: INFO/ForkPoolWorker-18] Checking service #13 "Zero Shot Detector Pipelines" at https://ml-zs.dev.insectai.org/
celeryworker-1  | [2025-11-16 00:20:00,458: INFO/ForkPoolWorker-18] Checking service #14 "Zero Shot" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-16 00:20:00,497: INFO/ForkPoolWorker-18] Task ami.ml.tasks.check_processing_services_online[817edb21-8302-411b-9efb-608f4e01ebdb] succeeded in 0.47902000695466995s: None
celeryworker-1  | [2025-11-16 00:24:55,907: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:30:00,016: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[5fcc23a9-5648-4de0-90f0-5fcb8f39668f] received

Summary by CodeRabbit

  • New Features

    • Default task results backend switched to RPC (RabbitMQ).
    • Added Redis health checks and connection tuning options.
  • Bug Fixes

    • Increased background job connection timeout (30s → 40s) for improved reliability.
    • Reduced heartbeat interval (60s → 30s) to detect connection issues faster.
    • Enabled automatic cancellation of long-running tasks on connection loss and added broker retry/startup resilience.

@coderabbitai
Contributor

coderabbitai bot commented Nov 16, 2025

Caution

Review failed

The pull request is closed.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Changed Celery defaults and health/transport settings: result backend default set to rpc://, new Redis and Celery worker health-related settings added, broker transport socket_connect_timeout increased to 40s and heartbeat reduced to 30s, and worker cancel-on-connection-loss enabled.

Changes

Cohort / File(s) Change Summary
Celery settings
config/settings/base.py
Set CELERY_RESULT_BACKEND default to "rpc://"; added CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS = True; added Redis-related settings CELERY_REDIS_MAX_CONNECTIONS, CELERY_REDIS_SOCKET_TIMEOUT, CELERY_REDIS_SOCKET_KEEPALIVE, CELERY_REDIS_BACKEND_HEALTH_CHECK_INTERVAL; added CELERY_BROKER_CONNECTION_RETRY/startup retry behavior; updated CELERY_BROKER_TRANSPORT_OPTIONS (socket_connect_timeout: 30→40, heartbeat: 60→30).
Environment files
.envs/.ci/.django, .envs/.local/.django
Added CELERY_RESULT_BACKEND=rpc:// with comment indicating RabbitMQ is used for results backend.
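A minimal sketch of how the env files and the settings default fit together. The project likely reads the variable via django-environ, but the fallback logic is the same as a plain environment lookup; the helper name here is hypothetical.

```python
import os

# The env files set CELERY_RESULT_BACKEND=rpc:// explicitly, and the
# settings module now falls back to the same value when the variable
# is absent, so both paths resolve to the RabbitMQ RPC backend.
def celery_result_backend(environ=os.environ) -> str:
    return environ.get("CELERY_RESULT_BACKEND", "rpc://")
```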

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify impact of switching CELERY_RESULT_BACKEND to rpc:// on deployments and migrations of existing task result storage.
  • Confirm RabbitMQ transport option changes (socket_connect_timeout, heartbeat) align with infra/network expectations.
  • Review Redis health setting defaults and conditional usage to ensure monitoring/healthchecks remain correct.
  • Check CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS semantics for safe task termination and any required worker flags.
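One detail worth keeping in mind while reviewing the points above: Celery's Django integration (app.config_from_object with namespace="CELERY") maps CELERY_-prefixed Django settings to lowercase Celery conf keys. A minimal illustration of that mapping, not Celery's actual implementation:

```python
def to_celery_key(django_setting: str, namespace: str = "CELERY") -> str:
    """Map a namespace-prefixed Django setting to its Celery conf key."""
    prefix = namespace + "_"
    if not django_setting.startswith(prefix):
        raise ValueError(f"{django_setting!r} lacks the {prefix!r} prefix")
    # Strip the namespace prefix and lowercase the remainder.
    return django_setting[len(prefix):].lower()
```

For example, CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS in base.py is what the worker reads as worker_cancel_long_running_tasks_on_connection_loss.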

Suggested reviewers

  • carlosgjs

Poem

🐰 A little hop in config land,

Results ride Rabbit’s steady hand.
Heartbeat faster, timeouts grown,
Redis whispers, settings sown.
Tasks will stop if links are blown. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Celery options to improve stability' clearly summarizes the main objective of updating Celery settings to address worker connection loss and stability issues.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The PR description comprehensively covers all major required template sections with detailed explanations and supporting evidence.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34ce837 and d6e5044.

📒 Files selected for processing (3)
  • .envs/.ci/.django (1 hunks)
  • .envs/.local/.django (1 hunks)
  • config/settings/base.py (3 hunks)


@netlify

netlify bot commented Nov 16, 2025

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit d6e5044
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6919808b374aca0008f842a7

@mihow mihow marked this pull request as ready for review November 16, 2025 07:07
Copilot AI review requested due to automatic review settings November 16, 2025 07:07
Contributor

Copilot AI left a comment


Pull Request Overview

This PR updates Celery configuration settings to improve worker stability and reduce connection issues with Redis and RabbitMQ brokers. The primary goal is to enable automatic cancellation and requeuing of tasks when workers lose connection to the broker, preventing tasks from staying in a perpetual "running" state.

Key Changes:

  • Commenting out Redis-specific connection settings that are no longer needed with RabbitMQ as the broker
  • Adjusting RabbitMQ heartbeat interval from 60 to 30 seconds for faster connection issue detection
  • Attempting to enable the worker_cancel_long_running_tasks_on_connection_loss feature to handle worker disconnections


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b19ca04 and 34ce837.

📒 Files selected for processing (1)
  • config/settings/base.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (3)
config/settings/base.py (3)

357-357: LGTM! Connection timeout increase improves stability.

Increasing the connection timeout from 30 to 40 seconds allows more time for connection establishment in variable network conditions, reducing connection failures. This aligns well with the PR's stability objectives.


361-361: LGTM! Shorter heartbeat improves connection loss detection.

Reducing the heartbeat interval from 60 to 30 seconds enables faster detection of broken connections, which is especially important given the PR logs showing missed heartbeats. The trade-off of slightly increased network traffic is worthwhile for improved stability and faster recovery.


349-351: Code change is correct and ready.

The setting worker_cancel_long_running_tasks_on_connection_loss is available from Celery v5.1 onward, and the project uses celery==5.4.0, which fully supports this configuration. The placement as a top-level worker setting is now correct.

@mihow mihow merged commit 562f734 into main Nov 16, 2025
6 of 7 checks passed
@mihow mihow deleted the fix/celery-task-connections branch November 16, 2025 07:44