
Conversation

@mihow
Collaborator

@mihow mihow commented Nov 16, 2025

Summary

Update several default Celery configurations in the Django settings, in coordination with several changes made directly to the shared Redis & RabbitMQ server instances.

List of Changes

  • The primary change we are testing is enabling worker_cancel_long_running_tasks_on_connection_loss for Celery, which should automatically cancel & requeue tasks when the worker loses its broker connection. This is exactly what is happening to us now: Antenna/Celery appears to lose its connection to the worker, but the tasks are NOT canceled; instead they stay in a running state and receive no further updates (no new captures are processed). Celery will enable this setting by default in the next major version.
  • Additional settings intended to reduce worker disconnections
  • Increase memory on the shared Redis & RabbitMQ server from 16GB to 45GB
  • Stop Redis from persisting data to disk, now that we only use Redis for caching (the snapshot save process had been running almost constantly at 100% CPU)
  • Decrease the very long timeouts on Redis & RabbitMQ. I believe these were originally set in the belief that they needed to accommodate our long-running tasks, but that was mistaken: these settings govern communication timeouts, not task duration.
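The settings changes above can be sketched in the Django settings roughly as below. The setting names come from this PR; any value marked "assumed" is illustrative only, not the actual production value.

```python
# config/settings/base.py (sketch)

# Cancel & requeue late-acknowledged tasks when the worker loses its
# broker connection (opt-in before Celery 6.0, default afterwards).
CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS = True

# Retry broker connections, including at worker startup.
CELERY_BROKER_CONNECTION_RETRY = True
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True

# RabbitMQ transport tuning: allow more time to establish a
# connection, but detect a dead one faster.
CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_connect_timeout": 40,  # was 30
    "heartbeat": 30,               # was 60
}

# Redis connection-health settings (values assumed).
CELERY_REDIS_SOCKET_KEEPALIVE = True
CELERY_REDIS_SOCKET_TIMEOUT = 30
CELERY_REDIS_BACKEND_HEALTH_CHECK_INTERVAL = 25
CELERY_REDIS_MAX_CONNECTIONS = 20
```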

See these config files, which are not tracked in the main Antenna app repo:
/etc/redis/redis.conf
/etc/rabbitmq/rabbitmq.conf
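Since those files are untracked, the kinds of directives involved probably look like the sketch below. None of these values come from the PR; they are placeholders showing the relevant knobs.

```ini
# /etc/redis/redis.conf (illustrative sketch only)
save ""       # disable RDB snapshot persistence; Redis is now cache-only
timeout 300   # assumed: shorter client idle timeout, in seconds

# /etc/rabbitmq/rabbitmq.conf (illustrative sketch only)
heartbeat = 30              # assumed: match Celery's 30s heartbeat
consumer_timeout = 1800000  # assumed: reduced consumer ack timeout, in ms
```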

Related Issues

#1025
#721 (this PR is one of many fixes for that issue)
#1041
#1051

How to Test the Changes

Run several small, medium & large jobs in production

Screenshots


worker_cancel_long_running_tasks_on_connection_loss

Redis with snapshot saving enabled, on the original 16GB server:

With snapshot saving disabled, on the new 45GB instance. We now use Redis only for caching and a few simple locks; the snapshot save process had been running almost constantly at 100% CPU.

Logs highlighting the new Celery setting and the missed heartbeats:

celeryworker-1  | [2025-11-15 23:42:39,708: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-15 23:45:23,108: INFO/MainProcess] sync with celery@b9a52de2e99c
celeryworker-1  | [2025-11-15 23:48:39,822: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-15 23:50:00,021: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[7a1f72e5-aa62-4ca9-a127-d021bca51109] received
celeryworker-1  | [2025-11-15 23:50:00,025: INFO/ForkPoolWorker-18] Checking if processing services are online.
celeryworker-1  | [2025-11-15 23:50:00,066: INFO/ForkPoolWorker-18] Checking service #5 "AMI Data Companion" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-15 23:50:00,144: INFO/ForkPoolWorker-18] Checking service #12 "AMI Data Companion" at https://ml.dev.insectai.org/
celeryworker-1  | [2025-11-15 23:50:00,201: INFO/ForkPoolWorker-18] Checking service #13 "Zero Shot Detector Pipelines" at https://ml-zs.dev.insectai.org/
celeryworker-1  | [2025-11-15 23:50:00,269: INFO/ForkPoolWorker-18] Checking service #14 "Zero Shot" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-15 23:50:00,325: INFO/ForkPoolWorker-18] Task ami.ml.tasks.check_processing_services_online[7a1f72e5-aa62-4ca9-a127-d021bca51109] succeeded in 0.29996979236602783s: None
celeryworker-1  | [2025-11-15 23:54:39,923: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:00:00,020: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[8c63441c-9432-4fce-a8ff-8f7d3824fb00] received
celeryworker-1  | [2025-11-16 00:00:00,025: INFO/ForkPoolWorker-18] Checking if processing services are online.
celeryworker-1  | [2025-11-16 00:00:00,054: INFO/ForkPoolWorker-18] Checking service #5 "AMI Data Companion" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-16 00:00:00,673: INFO/ForkPoolWorker-18] Checking service #12 "AMI Data Companion" at https://ml.dev.insectai.org/
celeryworker-1  | [2025-11-16 00:00:00,770: INFO/ForkPoolWorker-18] Checking service #13 "Zero Shot Detector Pipelines" at https://ml-zs.dev.insectai.org/
celeryworker-1  | [2025-11-16 00:00:01,010: INFO/ForkPoolWorker-18] Checking service #14 "Zero Shot" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-16 00:00:01,059: INFO/ForkPoolWorker-18] Task ami.ml.tasks.check_processing_services_online[8c63441c-9432-4fce-a8ff-8f7d3824fb00] succeeded in 1.0348858758807182s: None
celeryworker-1  | [2025-11-16 00:00:30,046: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:01:04,287: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
celeryworker-1  | Traceback (most recent call last):
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/celery/worker/consumer/consumer.py", line 340, in start
celeryworker-1  |     blueprint.start(self)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/celery/bootsteps.py", line 116, in start
celeryworker-1  |     step.start(parent)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/celery/worker/consumer/consumer.py", line 746, in start
celeryworker-1  |     c.loop(*c.loop_args())
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/celery/worker/loops.py", line 97, in asynloop
celeryworker-1  |     next(loop)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/kombu/asynchronous/hub.py", line 373, in create_loop
celeryworker-1  |     cb(*cbargs)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/kombu/transport/base.py", line 248, in on_readable
celeryworker-1  |     reader(loop)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/kombu/transport/base.py", line 230, in _read
celeryworker-1  |     drain_events(timeout=0)
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/amqp/connection.py", line 526, in drain_events
celeryworker-1  |     while not self.blocking_read(timeout):
celeryworker-1  |               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/amqp/connection.py", line 531, in blocking_read
celeryworker-1  |     frame = self.transport.read_frame()
celeryworker-1  |             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/amqp/transport.py", line 297, in read_frame
celeryworker-1  |     frame_header = read(7, True)
celeryworker-1  |                    ^^^^^^^^^^^^^
celeryworker-1  |   File "/usr/local/lib/python3.11/site-packages/amqp/transport.py", line 632, in _read
celeryworker-1  |     s = recv(n - len(rbuf))
celeryworker-1  |         ^^^^^^^^^^^^^^^^^^^
celeryworker-1  | ConnectionResetError: [Errno 104] Connection reset by peer
celeryworker-1  | [2025-11-16 00:01:04,290: WARNING/MainProcess] /usr/local/lib/python3.11/site-packages/celery/worker/consumer/consumer.py:391: CPendingDeprecationWarning: 
celeryworker-1  | In Celery 5.1 we introduced an optional breaking change which
celeryworker-1  | on connection loss cancels all currently executed tasks with late acknowledgement enabled.
celeryworker-1  | These tasks cannot be acknowledged as the connection is gone, and the tasks are automatically redelivered
celeryworker-1  | back to the queue. You can enable this behavior using the worker_cancel_long_running_tasks_on_connection_loss
celeryworker-1  | setting. In Celery 5.1 it is set to False by default. The setting will be set to True by default in Celery 6.0.
celeryworker-1  | 
celeryworker-1  |   warnings.warn(CANCEL_TASKS_BY_DEFAULT, CPendingDeprecationWarning)
celeryworker-1  | 
celeryworker-1  | [2025-11-16 00:01:04,291: INFO/MainProcess] Temporarily reducing the prefetch count to 15 to avoid over-fetching since 1 tasks are currently being processed.
celeryworker-1  | The prefetch count will be gradually restored to 16 as the tasks complete processing.
celeryworker-1  | [2025-11-16 00:01:04,300: INFO/MainProcess] Connected to amqp://antenna:**@rabbitmq:5672//
celeryworker-1  | [2025-11-16 00:01:04,509: INFO/MainProcess] mingle: searching for neighbors
celeryworker-1  | [2025-11-16 00:01:05,537: INFO/MainProcess] mingle: sync with 1 nodes
celeryworker-1  | [2025-11-16 00:01:05,537: INFO/MainProcess] mingle: sync complete
celeryworker-1  | [2025-11-16 00:07:00,638: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:10:00,019: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[761f5c57-a39e-4055-acd1-e86e7579a436] received
celeryworker-1  | [2025-11-16 00:13:00,729: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:19:00,806: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:20:00,015: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[817edb21-8302-411b-9efb-608f4e01ebdb] received
celeryworker-1  | [2025-11-16 00:20:00,018: INFO/MainProcess] Resuming normal operations following a restart.
celeryworker-1  | Prefetch count has been restored to the maximum of 16
celeryworker-1  | [2025-11-16 00:20:00,021: INFO/ForkPoolWorker-18] Checking if processing services are online.
celeryworker-1  | [2025-11-16 00:20:00,058: INFO/ForkPoolWorker-18] Checking service #5 "AMI Data Companion" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-16 00:20:00,224: INFO/ForkPoolWorker-18] Checking service #12 "AMI Data Companion" at https://ml.dev.insectai.org/
celeryworker-1  | [2025-11-16 00:20:00,366: INFO/ForkPoolWorker-18] Checking service #13 "Zero Shot Detector Pipelines" at https://ml-zs.dev.insectai.org/
celeryworker-1  | [2025-11-16 00:20:00,458: INFO/ForkPoolWorker-18] Checking service #14 "Zero Shot" at https://ml.dev.insectai.org
celeryworker-1  | [2025-11-16 00:20:00,497: INFO/ForkPoolWorker-18] Task ami.ml.tasks.check_processing_services_online[817edb21-8302-411b-9efb-608f4e01ebdb] succeeded in 0.47902000695466995s: None
celeryworker-1  | [2025-11-16 00:24:55,907: INFO/MainProcess] missed heartbeat from celery@b9a52de2e99c
celeryworker-1  | [2025-11-16 00:30:00,016: INFO/MainProcess] Task ami.ml.tasks.check_processing_services_online[5fcc23a9-5648-4de0-90f0-5fcb8f39668f] received

Summary by CodeRabbit

  • New Features

    • Default task results backend switched to RPC (RabbitMQ).
    • Added Redis health checks and connection tuning options.
  • Bug Fixes

    • Increased background job connection timeout (30s → 40s) for improved reliability.
    • Reduced heartbeat interval (60s → 30s) to detect connection issues faster.
    • Enabled automatic cancellation of long-running tasks on connection loss and added broker retry/startup resilience.

@coderabbitai
Contributor

coderabbitai bot commented Nov 16, 2025

Caution

Review failed

The pull request is closed.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Changed Celery defaults and health/transport settings: result backend default set to rpc://, new Redis and Celery worker health-related settings added, broker transport socket_connect_timeout increased to 40s and heartbeat reduced to 30s, and worker cancel-on-connection-loss enabled.

Changes

Cohort / File(s) Change Summary
Celery settings
config/settings/base.py
Set CELERY_RESULT_BACKEND default to "rpc://"; added CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS = True; added Redis-related settings CELERY_REDIS_MAX_CONNECTIONS, CELERY_REDIS_SOCKET_TIMEOUT, CELERY_REDIS_SOCKET_KEEPALIVE, CELERY_REDIS_BACKEND_HEALTH_CHECK_INTERVAL; added CELERY_BROKER_CONNECTION_RETRY/startup retry behavior; updated CELERY_BROKER_TRANSPORT_OPTIONS (socket_connect_timeout: 30→40, heartbeat: 60→30).
Environment files
.envs/.ci/.django, .envs/.local/.django
Added CELERY_RESULT_BACKEND=rpc:// with comment indicating RabbitMQ is used for results backend.
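A minimal sketch of how the env files and the settings default fit together. The project likely reads the variable via django-environ, but the fallback logic is the same as a plain environment lookup; the helper name here is hypothetical.

```python
import os

# The env files set CELERY_RESULT_BACKEND=rpc:// explicitly, and the
# settings module now falls back to the same value when the variable
# is absent, so both paths resolve to the RabbitMQ RPC backend.
def celery_result_backend(environ=os.environ) -> str:
    return environ.get("CELERY_RESULT_BACKEND", "rpc://")
```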

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify impact of switching CELERY_RESULT_BACKEND to rpc:// on deployments and migrations of existing task result storage.
  • Confirm RabbitMQ transport option changes (socket_connect_timeout, heartbeat) align with infra/network expectations.
  • Review Redis health setting defaults and conditional usage to ensure monitoring/healthchecks remain correct.
  • Check CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS semantics for safe task termination and any required worker flags.
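One detail worth keeping in mind while reviewing the points above: Celery's Django integration (app.config_from_object with namespace="CELERY") maps CELERY_-prefixed Django settings to lowercase Celery conf keys. A minimal illustration of that mapping, not Celery's actual implementation:

```python
def to_celery_key(django_setting: str, namespace: str = "CELERY") -> str:
    """Map a namespace-prefixed Django setting to its Celery conf key."""
    prefix = namespace + "_"
    if not django_setting.startswith(prefix):
        raise ValueError(f"{django_setting!r} lacks the {prefix!r} prefix")
    # Strip the namespace prefix and lowercase the remainder.
    return django_setting[len(prefix):].lower()
```

For example, CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS in base.py is what the worker reads as worker_cancel_long_running_tasks_on_connection_loss.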

Suggested reviewers

  • carlosgjs

Poem

🐰 A little hop in config land,

Results ride Rabbit’s steady hand.
Heartbeat faster, timeouts grown,
Redis whispers, settings sown.
Tasks will stop if links are blown. 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Celery options to improve stability' clearly summarizes the main objective of updating Celery settings to address worker connection loss and stability issues.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The PR description comprehensively covers all major required template sections with detailed explanations and supporting evidence.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 34ce837 and d6e5044.

📒 Files selected for processing (3)
  • .envs/.ci/.django (1 hunks)
  • .envs/.local/.django (1 hunks)
  • config/settings/base.py (3 hunks)


@netlify

netlify bot commented Nov 16, 2025

Deploy Preview for antenna-preview canceled.

Name Link
🔨 Latest commit d6e5044
🔍 Latest deploy log https://app.netlify.com/projects/antenna-preview/deploys/6919808b374aca0008f842a7

@mihow mihow marked this pull request as ready for review November 16, 2025 07:07
Copilot AI review requested due to automatic review settings November 16, 2025 07:07
Contributor

Copilot AI left a comment


Pull Request Overview

This PR updates Celery configuration settings to improve worker stability and reduce connection issues with Redis and RabbitMQ brokers. The primary goal is to enable automatic cancellation and requeuing of tasks when workers lose connection to the broker, preventing tasks from staying in a perpetual "running" state.

Key Changes:

  • Commenting out Redis-specific connection settings that are no longer needed with RabbitMQ as the broker
  • Adjusting RabbitMQ heartbeat interval from 60 to 30 seconds for faster connection issue detection
  • Attempting to enable the worker_cancel_long_running_tasks_on_connection_loss feature to handle worker disconnections


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b19ca04 and 34ce837.

📒 Files selected for processing (1)
  • config/settings/base.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (3)
config/settings/base.py (3)

357-357: LGTM! Connection timeout increase improves stability.

Increasing the connection timeout from 30 to 40 seconds allows more time for connection establishment in variable network conditions, reducing connection failures. This aligns well with the PR's stability objectives.


361-361: LGTM! Shorter heartbeat improves connection loss detection.

Reducing the heartbeat interval from 60 to 30 seconds enables faster detection of broken connections, which is especially important given the PR logs showing missed heartbeats. The trade-off of slightly increased network traffic is worthwhile for improved stability and faster recovery.


349-351: Code change is correct and ready.

The setting worker_cancel_long_running_tasks_on_connection_loss is available from Celery v5.1 onward, and the project uses celery==5.4.0, which fully supports this configuration. The placement as a top-level worker setting is now correct.

@mihow mihow merged commit 562f734 into main Nov 16, 2025
6 of 7 checks passed
@mihow mihow deleted the fix/celery-task-connections branch November 16, 2025 07:44