Celery options to improve stability #1055
Conversation
Caution: Review failed. The pull request is closed.
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.
Walkthrough: Changed Celery defaults and health/transport settings: result backend default set to …
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
📜 Recent review details: Configuration used: CodeRabbit UI · Review profile: CHILL · Plan: Pro · 📒 Files selected for processing (3)
✅ Deploy Preview for antenna-preview canceled.
Pull Request Overview
This PR updates Celery configuration settings to improve worker stability and reduce connection issues with Redis and RabbitMQ brokers. The primary goal is to enable automatic cancellation and requeuing of tasks when workers lose connection to the broker, preventing tasks from staying in a perpetual "running" state.
Key Changes:
- Commenting out Redis-specific connection settings that are no longer needed with RabbitMQ as the broker
- Adjusting RabbitMQ heartbeat interval from 60 to 30 seconds for faster connection issue detection
- Attempting to enable the `worker_cancel_long_running_tasks_on_connection_loss` feature to handle worker disconnections (see the settings sketch after this list)
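For reference, here is a minimal sketch of how these options could look in `config/settings/base.py`, assuming the project loads its Celery configuration from Django settings via the `CELERY_` namespace (the commented-out transport option is a hypothetical example, not copied from the diff):

```python
# config/settings/base.py -- sketch only, assuming
# app.config_from_object("django.conf:settings", namespace="CELERY")

# Redis-specific transport options, no longer needed with RabbitMQ as the
# broker (hypothetical example of the kind of setting being commented out):
# CELERY_BROKER_TRANSPORT_OPTIONS = {"visibility_timeout": 3600}

# Cancel (and let the broker requeue) long-running tasks when the worker
# loses its broker connection. Available in Celery >= 5.1.
CELERY_WORKER_CANCEL_LONG_RUNNING_TASKS_ON_CONNECTION_LOSS = True

# Heartbeat every 30s instead of 60s, to detect dead connections faster.
CELERY_BROKER_HEARTBEAT = 30

# Allow a little more time for connection establishment (30s -> 40s).
CELERY_BROKER_CONNECTION_TIMEOUT = 40
```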
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI · Review profile: CHILL · Plan: Pro
📒 Files selected for processing (1)
`config/settings/base.py` (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (3)
config/settings/base.py (3)
357-357: LGTM! Connection timeout increase improves stability.
Increasing the connection timeout from 30 to 40 seconds allows more time for connection establishment in variable network conditions, reducing connection failures. This aligns well with the PR's stability objectives.
361-361: LGTM! Shorter heartbeat improves connection loss detection.
Reducing the heartbeat interval from 60 to 30 seconds enables faster detection of broken connections, which is especially important given the PR logs showing missed heartbeats. The trade-off of slightly increased network traffic is worthwhile for improved stability and faster recovery.
349-351: Code change is correct and ready.
The setting `worker_cancel_long_running_tasks_on_connection_loss` is available from Celery v5.1 onward, and the project uses celery==5.4.0, which fully supports this configuration. The placement as a top-level worker setting is now correct.
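One quick way to confirm the worker picks these values up at runtime (a sketch; the `config.celery_app` import path is an assumption based on the repo's `config/` layout):

```python
# Sketch: inspect the resolved Celery configuration at runtime.
from config.celery_app import app  # assumed module path

assert app.conf.worker_cancel_long_running_tasks_on_connection_loss is True
print(app.conf.broker_heartbeat)           # expected: 30
print(app.conf.broker_connection_timeout)  # expected: 40
```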
This was already in place in production for the Celery broker or results backend.
Summary
Update several default Celery configuration values in the Django settings, in coordination with server-level settings changed directly on the Redis & RabbitMQ instances.
List of Changes
Enable `worker_cancel_long_running_tasks_on_connection_loss` for Celery, which apparently should automatically cancel & requeue tasks when the worker loses its connection. This is exactly what is happening to us now: Antenna/Celery seems to lose its connection to the worker, and then the tasks are NOT canceled but rather stay in a running state and get no further updates (no new captures are processed). Celery will enable this by default in the next major version. A sketch of this behavior follows the list of config files below.
See these config files, which are not tracked in the main Antenna app repo:
- `/etc/redis/redis.conf`
- `/etc/rabbitmq/rabbitmq.conf`
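As a sketch of the behavior this setting enables (a standalone Celery app for illustration; the broker URL and task name are made up, not Antenna's):

```python
from celery import Celery

app = Celery("sketch", broker="amqp://guest@localhost//")

# With this flag, a worker that loses its broker connection cancels its
# currently executing tasks instead of letting them run detached.
app.conf.worker_cancel_long_running_tasks_on_connection_loss = True

# Requeuing relies on late acknowledgement: the message is only acked after
# the task finishes, so the broker can redeliver it to another worker.
@app.task(acks_late=True)
def process_capture(capture_id):
    ...
```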
Related Issues
#1025
#721 (this PR is part of the many fixes for this one)
#1041
#1051
How to Test the Changes
Run several small, medium & large jobs in production
Screenshots
If applicable, add screenshots to help explain this PR (ex. Before and after for UI changes).
`worker_cancel_long_running_tasks_on_connection_loss`
Redis with snapshot saving on, on the original 16GB server:

With snapshot saving disabled, on the new 45GB instance:

We are just using Redis for the cache and a few simple locks now. Saving snapshots was using 100% CPU and was almost always running.
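For context, the "simple locks" are the kind redis-py provides out of the box (a sketch; the lock name and timeouts are made up):

```python
import redis

client = redis.Redis(host="localhost", port=6379, db=0)

# redis-py's Lock wraps SET NX PX plus a safe, token-checked release.
with client.lock("sync-captures", timeout=60, blocking_timeout=5):
    ...  # only one worker at a time enters this section
```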
Logs that highlighted the new Celery setting and heartbeats being missed: