Celery settings to prevent long running tasks from being silently canceled #1073
Conversation
✅ Deploy Preview for antenna-preview canceled.
Note: CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough
Import socket and update the Celery broker transport options in config/settings/base.py: enable TCP keepalives with specific TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT settings, remove socket_timeout, add socket_connect_timeout, and retain the retry_on_timeout/max_connections and heartbeat settings.
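Putting the walkthrough together with the values listed in the PR description below, the resulting settings block could look roughly like this; a sketch assembled from the change list, not the committed code, so the exact layout in config/settings/base.py may differ:

```python
# config/settings/base.py (sketch; values taken from this PR's change list)
import socket

CELERY_BROKER_TRANSPORT_OPTIONS = {
    # OS-level TCP keepalives so the network stack doesn't silently drop idle connections
    "socket_keepalive": True,
    "socket_settings": {
        socket.TCP_KEEPIDLE: 60,   # start probing after 60s of silence
        socket.TCP_KEEPINTVL: 10,  # retry every 10s if no response
        socket.TCP_KEEPCNT: 9,     # give up after 9 failed probes
    },
    # NOTE: "socket_timeout" was deliberately removed (see description below)
    "socket_connect_timeout": 10,
    "retry_on_timeout": True,
    "max_connections": 10,
}
CELERY_BROKER_HEARTBEAT = 30
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True
CELERY_BROKER_CONNECTION_MAX_RETRIES = None  # retry the broker connection indefinitely
```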
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ Passed checks (5 passed)
Pull request overview
This PR adds Celery broker transport options to prevent long-running tasks from being silently canceled due to idle connection timeouts in OpenStack environments. The key change is implementing TCP keepalive settings to maintain active connections.
Key Changes:
- Adds TCP keepalive configuration with custom socket settings to prevent the 1-hour OpenStack timeout
- Removes socket_timeout from broker transport options to avoid workers terminating during long operations
- Adds an explicit CELERY_BROKER_HEARTBEAT setting
Comments suppressed due to low confidence (1)
config/settings/base.py:358
- This assignment to 'CELERY_BROKER_TRANSPORT_OPTIONS' is unnecessary as it is redefined before this value is used.
CELERY_BROKER_TRANSPORT_OPTIONS = {
@coderabbitai update PR description using the existing template
Tip: For best results, initiate chat on the files or code changes.

@mihow, I'm unable to update the PR description directly due to workflow restrictions, so here is a description following the existing template:

Summary
Configures Celery broker transport options to prevent long-running tasks from being silently canceled due to OpenStack networking idle timeouts.

List of Changes
- Added socket import to config/settings/base.py
- Updated CELERY_BROKER_TRANSPORT_OPTIONS with TCP keepalive settings: socket_keepalive enabled, socket_settings with TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9
- Added socket_connect_timeout=10; retained retry_on_timeout=True, max_connections=10
- Removed socket_timeout to prevent worker self-termination during long operations
- Added CELERY_BROKER_HEARTBEAT = 30 for application-level heartbeat monitoring
- Added CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True for connection retry on startup
- Added CELERY_BROKER_CONNECTION_MAX_RETRIES = None to allow unlimited connection retries
Related Issues
Fixes #1072

Detailed Description
This PR addresses the "Zombie Tasks" issue where Celery tasks become stuck in the "Started" state due to OpenStack Neutron's 1-hour idle timeout on TCP connections.

The Problem
When long-running ML processing tasks idle on RabbitMQ connections, the OpenStack firewall silently severs these connections after 1 hour of inactivity. This causes two failure modes:
The Solution
This PR implements OS-level TCP keepalive probes that ping the connection more frequently than the 1-hour timeout:

- TCP_KEEPIDLE = 60: start sending keepalive packets after 60 seconds of silence
- TCP_KEEPINTVL = 10: if no response, retry every 10 seconds
- TCP_KEEPCNT = 9: give up and close the connection after 9 failed attempts
This ensures the OS detects and handles connection failures within ~2.5 minutes (60 + 9×10 = 150 seconds) rather than relying on application-level timeouts or firewall drops.

Additionally:
- Removed socket_timeout to prevent workers from self-terminating during long blocking operations
- Added CELERY_BROKER_HEARTBEAT = 30 for application-level health monitoring
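For illustration, here is a minimal, Linux-only sketch of what these probes mean at the raw socket level; it is independent of this PR's code (kombu/py-amqp apply the same options to the broker socket via the socket_settings transport option):

```python
import socket

# Demonstration of the TCP keepalive options configured in this PR (Linux-only constants).
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)     # enable keepalive probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # first probe after 60s of silence
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # re-probe every 10s on no reply
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 9)     # close after 9 failed probes

# Worst-case dead-connection detection: 60 + 9 * 10 = 150 seconds (~2.5 minutes)
```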
Potential Side Effects
Long-term Recommendation
Consider refactoring long master tasks into Celery chords to avoid prolonged idle connections on the broker.

How to Test the Changes
# This lists TCP connections to port 5672 with timer details
sudo ss -tonp | grep 5672

Expected output should show a keepalive timer counting down from ~60 seconds.

Deployment Notes
Important: Each worker must be restarted after deploying for the new connection settings to take effect.

Checklist
You can edit the PR description by clicking the "..." menu next to the title on the PR page and selecting "Edit".
Actionable comments posted: 0
♻️ Duplicate comments (1)
config/settings/base.py (1)
375-375: Nit: fix quoting in the socket_timeout removal comment

The comment currently reads:

# REMOVED "socket_timeout: 120" to prevent workers self-destructing during long blocking operations.

For clarity and consistency with dictionary syntax, consider:

# REMOVED "socket_timeout": 120 to prevent workers self-destructing during long blocking operations.

This matches the actual key/value form that was removed.
🧹 Nitpick comments (2)
config/settings/base.py (2)
359-369: Make TCP keepalive settings safe on non-Linux platforms

Using socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL, and socket.TCP_KEEPCNT directly at import time can raise AttributeError on platforms where these constants are not defined (e.g., some macOS/Windows setups). That would break importing the Django settings even if Celery isn't used on those machines.

Consider constructing socket_settings defensively, only adding options that exist on the current platform, e.g.:

-CELERY_BROKER_TRANSPORT_OPTIONS = {
-    # Custom TCP Keepalives to ensure network stack doesn't silently drop connections
-    "socket_keepalive": True,
-    "socket_settings": {
-        # Start sending Keepalive packets after 60 seconds of silence.
-        # This forces traffic on the wire, preventing the OpenStack 1-hour timeout.
-        socket.TCP_KEEPIDLE: 60,
-        # If no response, retry every 10 seconds.
-        socket.TCP_KEEPINTVL: 10,
-        # Give up and close connection after 9 failed attempts.
-        socket.TCP_KEEPCNT: 9,
-    },
+_celery_socket_settings: dict[int, int] = {}
+for _name, _value in (
+    ("TCP_KEEPIDLE", 60),
+    ("TCP_KEEPINTVL", 10),
+    ("TCP_KEEPCNT", 9),
+):
+    if hasattr(socket, _name):
+        _celery_socket_settings[getattr(socket, _name)] = _value
+
+CELERY_BROKER_TRANSPORT_OPTIONS = {
+    # Custom TCP Keepalives to ensure network stack doesn't silently drop connections
+    "socket_keepalive": True,
+    "socket_settings": _celery_socket_settings,

This keeps the intended Linux behavior while avoiding crashes on platforms that don't expose these constants.
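Assembled as a runnable settings fragment, the suggestion above could look like this; a sketch combining the suggested loop with the other transport options from this PR, not the committed code:

```python
import socket

# Only include TCP keepalive constants that exist on the current platform
# (TCP_KEEPIDLE / TCP_KEEPINTVL / TCP_KEEPCNT are Linux-specific).
_celery_socket_settings: dict[int, int] = {}
for _name, _value in (("TCP_KEEPIDLE", 60), ("TCP_KEEPINTVL", 10), ("TCP_KEEPCNT", 9)):
    if hasattr(socket, _name):
        _celery_socket_settings[getattr(socket, _name)] = _value

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_keepalive": True,
    "socket_settings": _celery_socket_settings,  # empty dict on platforms without the constants
    "socket_connect_timeout": 10,
    "retry_on_timeout": True,
    "max_connections": 10,
}
```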
370-375: Consider using CELERY_BROKER_HEARTBEAT for clearer configuration intent

The heartbeat is currently set in CELERY_BROKER_TRANSPORT_OPTIONS as "heartbeat": 30. While this is a valid approach that does configure the AMQP heartbeat for the py-amqp transport, using the canonical Celery setting CELERY_BROKER_HEARTBEAT = 30 is the recommended pattern and more explicitly documents the intention.

Both methods ultimately negotiate the same client-side heartbeat with RabbitMQ, so the current configuration is functional at 30 seconds. However, migrating to the high-level Celery setting would improve clarity and align with Celery's documented best practices.

Optional refactoring (see the sketch below):
- Add CELERY_BROKER_HEARTBEAT = 30 to the Celery configuration
- Remove "heartbeat": 30 from CELERY_BROKER_TRANSPORT_OPTIONS to avoid duplication
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
config/settings/base.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: test
🔇 Additional comments (1)
config/settings/base.py (1)
5-5: socket import is appropriate for TCP keepalive settings

This import is required for the socket.TCP_* constants used in CELERY_BROKER_TRANSPORT_OPTIONS below; no issues here.
The fixes have been deployed
Check this job after Dec 3rd 9pm PST to see if there is still activity in the logs.

We can see that the new keepalive IS active! Here the keepalive timers are always under 60 seconds and counting down:

debian@ami-worker-2:~/ami-platform$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
af4d42049405 insectai/ami_backend "/entrypoint /start-…" 29 minutes ago Up 29 minutes ami-platform-celeryworker-1
debian@ami-worker-2:~/ami-platform$ PID=$(docker inspect -f '{{.State.Pid}}' af4d42049405)
debian@ami-worker-2:~/ami-platform$ sudo nsenter -t $PID -n ss -ton state established '( dport = :5672 )'
Recv-Q Send-Q Local Address:Port Peer Address:Port Process
0 0 172.18.0.2:45800 192.168.123.176:5672 timer:(keepalive,52sec,0)
8 0 172.18.0.2:45812 192.168.123.176:5672 timer:(keepalive,46sec,0)
8 0 172.18.0.2:37234 192.168.123.176:5672 timer:(keepalive,10sec,0)
24 0 172.18.0.2:60826 192.168.123.176:5672 timer:(keepalive,36sec,0)
debian@ami-worker-2:~/ami-platform$

This can be tested again with
Summary
Configures Celery broker transport options to prevent long-running tasks from being silently canceled due to OpenStack networking idle timeouts.
List of Changes
- Added socket import to config/settings/base.py
- Updated CELERY_BROKER_TRANSPORT_OPTIONS with TCP keepalive settings: socket_keepalive enabled, socket_settings with TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9
- Added socket_connect_timeout=10; retained retry_on_timeout=True, max_connections=10
- Removed socket_timeout to prevent worker self-termination during long operations
- Added CELERY_BROKER_HEARTBEAT = 30 for application-level heartbeat monitoring
- Added CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True for connection retry on startup
- Added CELERY_BROKER_CONNECTION_MAX_RETRIES = None to allow unlimited connection retries

Related Issues
Fixes #1072
Detailed Description
This PR addresses the "Zombie Tasks" issue where Celery tasks become stuck in "Started" state due to OpenStack Neutron's 1-hour idle timeout on TCP connections.
The Problem
When long-running ML processing tasks idle on RabbitMQ connections, the OpenStack firewall silently severs these connections after 1 hour of inactivity. This causes two failure modes:
The Solution
This PR implements OS-level TCP keepalive probes that ping the connection more frequently than the 1-hour timeout:

- TCP_KEEPIDLE = 60: start sending keepalive packets after 60 seconds of silence
- TCP_KEEPINTVL = 10: if no response, retry every 10 seconds
- TCP_KEEPCNT = 9: give up and close the connection after 9 failed attempts
This ensures the OS detects and handles connection failures within ~2.5 minutes (60 + 9×10 seconds) rather than relying on application-level timeouts or firewall drops.
Additionally:
- Removed socket_timeout to prevent workers from self-terminating during long blocking operations
- Added CELERY_BROKER_HEARTBEAT = 30 for application-level health monitoring

Potential Side Effects
Long-term Recommendation
Consider refactoring long master tasks into Celery chords to avoid prolonged idle connections on the broker.
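A rough sketch of that shape, with hypothetical task names (process_image, collect_results are illustrative, not from this codebase): instead of one master task blocking on the broker while sub-tasks run, a chord fans the work out and fires a callback once every piece finishes, so no connection sits idle for long stretches:

```python
from celery import chord, shared_task

@shared_task
def process_image(image_id):
    # Hypothetical per-item work; returns a small, serializable result.
    return {"image_id": image_id, "detections": 3}

@shared_task
def collect_results(results):
    # Hypothetical callback; runs once after every process_image task finishes.
    return {"processed": len(results)}

def launch_pipeline(image_ids):
    # The chord header fans out in parallel; the body runs when all complete.
    # No worker holds an idle broker connection while waiting on sub-tasks.
    return chord(process_image.s(i) for i in image_ids)(collect_results.s())
```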
How to Test the Changes
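For reference, the verification command from the discussion above, run on a worker host:

```bash
# This lists TCP connections to port 5672 with timer details
sudo ss -tonp | grep 5672
```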
Expected output should show a keepalive timer counting down from ~60 seconds.
Deployment Notes
Each worker must be restarted after deploying for the new connection settings to take effect (this happens automatically with the deployment script). Then a long-running job should be started.
Checklist