
Conversation

@mihow
Collaborator

@mihow mihow commented Dec 4, 2025

Summary

Configures Celery broker transport options to prevent long-running tasks from being silently canceled due to OpenStack networking idle timeouts.

List of Changes

  • Added socket import to config/settings/base.py
  • Configured CELERY_BROKER_TRANSPORT_OPTIONS with TCP keepalive settings:
    • Enabled socket_keepalive
    • Set socket_settings with TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9
    • Added socket_connect_timeout=10, retry_on_timeout=True, max_connections=10
  • Removed socket_timeout to prevent worker self-termination during long operations
  • Added CELERY_BROKER_HEARTBEAT = 30 for application-level heartbeat monitoring
  • Added CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True for connection retry on startup
  • Added CELERY_BROKER_CONNECTION_MAX_RETRIES = None to allow unlimited connection retries

Related Issues

Fixes #1072

Detailed Description

This PR addresses the "Zombie Tasks" issue where Celery tasks become stuck in "Started" state due to OpenStack Neutron's 1-hour idle timeout on TCP connections.

The Problem

When long-running ML processing tasks idle on RabbitMQ connections, the OpenStack firewall silently severs these connections after 1 hour of inactivity. This causes two failure modes:

  1. Hard Close: The firewall drops the connection without notification, causing worker crashes and orphaned tasks
  2. Soft Close: Missed application-level heartbeats lead RabbitMQ to close connections after detecting worker unavailability

The Solution

This PR implements OS-level TCP keepalive probes that ping the connection more frequently than the 1-hour timeout:

  • TCP_KEEPIDLE = 60 seconds: Start sending keepalive probes after 60 seconds of idle time
  • TCP_KEEPINTVL = 10 seconds: Send probes every 10 seconds if no response
  • TCP_KEEPCNT = 9: Send up to 9 probes before declaring the connection dead

This ensures the OS detects and handles connection failures within ~2.5 minutes (60 + 9×10 seconds) rather than relying on application-level timeouts or firewall drops.
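A minimal sketch of the configuration described above, assembled from the list of changes; treat it as illustrative rather than the exact merged diff (the review walkthrough below reports socket_connect_timeout=40 and max_connections=20):

```python
# Illustrative sketch only, not necessarily the exact merged settings.
import socket

CELERY_BROKER_TRANSPORT_OPTIONS = {
    # OS-level TCP keepalives keep traffic on the wire during idle periods.
    "socket_keepalive": True,
    "socket_settings": {
        socket.TCP_KEEPIDLE: 60,   # start probing after 60 seconds of silence
        socket.TCP_KEEPINTVL: 10,  # re-probe every 10 seconds when unanswered
        socket.TCP_KEEPCNT: 9,     # close the connection after 9 failed probes
    },
    "socket_connect_timeout": 10,  # fail fast when the broker is unreachable
    "retry_on_timeout": True,
    "max_connections": 10,
    # Note: no "socket_timeout" key, so long blocking operations cannot
    # cause the worker to time out its own broker socket.
}

CELERY_BROKER_HEARTBEAT = 30                      # application-level heartbeat
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True  # retry broker connection on startup
CELERY_BROKER_CONNECTION_MAX_RETRIES = None       # retry indefinitely
```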

Additionally:

  • Removed socket_timeout to prevent workers from self-terminating during long blocking operations
  • Configured connection retry parameters for better stability and resilience
  • Maintained CELERY_BROKER_HEARTBEAT = 30 for application-level health monitoring

Potential Side Effects

  • Slightly increased network traffic due to keepalive probes (minimal impact)
  • Workers will retry connections indefinitely on startup failures (may need monitoring)

Long-term Recommendation

Consider refactoring long master tasks into Celery chords to avoid prolonged idle connections on the broker.
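A hypothetical sketch of that pattern (task names and arguments are illustrative, not the project's actual tasks): instead of one master task blocking while child tasks run, a chord fires its callback only when all child tasks finish, so no worker holds an idle broker connection for the length of the job.

```python
from celery import chord, shared_task

@shared_task
def process_chunk(chunk_id):
    """Process one batch of source images (placeholder body)."""
    ...

@shared_task
def finalize_job(results, job_id):
    """Aggregate chunk results and mark the job complete (placeholder body)."""
    ...

def start_job(job_id, chunk_ids):
    # Header tasks run in parallel; the callback runs once they all complete,
    # so nothing sits idle on a broker connection waiting for children.
    return chord(process_chunk.s(c) for c in chunk_ids)(finalize_job.s(job_id))
```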

How to Test the Changes

  1. Start a new ML processing job in production
  2. SSH into a Worker VM after deployment
  3. Verify TCP keepalive timers are active:
# This lists TCP connections to port 5672 with timer details
sudo ss -tonp | grep 5672

Expected output should show a keepalive timer counting down from ~60 seconds.
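For illustration, each matching connection in the output should include a keepalive timer field like the ones captured in the post-deploy verification later in this thread:

```
timer:(keepalive,52sec,0)
```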

Deployment Notes

Each worker must be restarted after deployment for the new connection settings to take effect (this happens automatically with the deployment script). Then start a long-running job to verify the fix.

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Copilot AI review requested due to automatic review settings December 4, 2025 02:25
@netlify

netlify bot commented Dec 4, 2025

Deploy Preview for antenna-preview canceled.

🔨 Latest commit: dc2cbe5
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/6930f2712acbbb00084aab4a

@coderabbitai
Contributor

coderabbitai bot commented Dec 4, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Import socket and update Celery broker transport options in config/settings/base.py: enable TCP keepalives with specific TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT settings, remove socket_timeout, add socket_connect_timeout, retain retry_on_timeout/max_connections and heartbeat settings.

Changes

Cohort / File(s): Celery Broker Connection Settings (config/settings/base.py)
Summary: Imported socket; replaced CELERY_BROKER_TRANSPORT_OPTIONS to remove socket_timeout, enable "socket_keepalive": True, add socket_settings with socket.TCP_KEEPIDLE=60, socket.TCP_KEEPINTVL=10, socket.TCP_KEEPCNT=9, set socket_connect_timeout=40, keep retry_on_timeout=True, max_connections=20, and retain heartbeat=30. Added explanatory comment about removing socket_timeout.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Check correctness of the socket constants and import placement.
  • Confirm the removed socket_timeout was intentionally dropped and that no code depends on it.
  • Validate values against deployment environment/network idle time.

Possibly related PRs

Suggested reviewers

  • carlosgjs

Poem

"I'm a rabbit in the config field, nibbling the idle blight,
I poke the wire every sixty ticks, keeping tasks in sight.
No more silent severed threads, no zombies in the log,
With keepalives hopping steady, the broker walks the jog. 🐇✨"

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Title check: ✅ Passed. The title clearly and concisely summarizes the main change: updating Celery settings to prevent long-running tasks from being silently canceled by adding TCP keepalives.
Linked Issues check: ✅ Passed. The PR implements all coding-related requirements from issue #1072: enabling TCP keepalives with socket_settings (TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9), removing socket_timeout, setting socket_connect_timeout=40, and preserving CELERY_BROKER_HEARTBEAT=30.
Out of Scope Changes check: ✅ Passed. All changes are directly scoped to issue #1072: modifications are limited to CELERY_BROKER_TRANSPORT_OPTIONS configuration in config/settings/base.py and adding the required socket import.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check: ✅ Passed. The pull request description is comprehensive and follows the required template with all major sections completed.


Contributor

Copilot AI left a comment


Pull request overview

This PR adds Celery broker transport options to prevent long-running tasks from being silently canceled due to idle connection timeouts in OpenStack environments. The key change is implementing TCP keepalive settings to maintain active connections.

Key Changes:

  • Adds TCP keepalive configuration with custom socket settings to prevent 1-hour OpenStack timeout
  • Removes socket_timeout from broker transport options to avoid workers terminating during long operations
  • Adds explicit CELERY_BROKER_HEARTBEAT setting
Comments suppressed due to low confidence (1)

config/settings/base.py:358

  • This assignment to 'CELERY_BROKER_TRANSPORT_OPTIONS' is unnecessary as it is redefined before this value is used.
CELERY_BROKER_TRANSPORT_OPTIONS = {


@mihow
Collaborator Author

mihow commented Dec 4, 2025

@coderabbitai update PR description using the existing template

@coderabbitai
Contributor

coderabbitai bot commented Dec 4, 2025


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find PR template files
fd -t f "PULL_REQUEST_TEMPLATE" .github

# Also check for common template locations
fd -t f -e md "pull_request" .github

# If not found in .github, check root
fd -t f "PULL_REQUEST_TEMPLATE"

Length of output: 32


🏁 Script executed:

#!/bin/bash
# Get the current PR description
gh pr view 1073 --json title,body --jq '.body'

Length of output: 1289


🏁 Script executed:

#!/bin/bash
# Read the PR template
cat .github/pull_request_template.md

Length of output: 1431


🏁 Script executed:

#!/bin/bash
# Update the PR description with the template format
gh pr edit 1073 --body "## Summary

Configures Celery broker transport options to prevent long-running tasks from being silently canceled due to OpenStack networking idle timeouts.

### List of Changes

* Added \`socket\` import to \`config/settings/base.py\`
* Configured \`CELERY_BROKER_TRANSPORT_OPTIONS\` with TCP keepalive settings:
  - Enabled \`socket_keepalive\`
  - Set \`socket_settings\` with \`TCP_KEEPIDLE=60\`, \`TCP_KEEPINTVL=10\`, \`TCP_KEEPCNT=9\`
  - Added \`socket_connect_timeout=10\`, \`retry_on_timeout=True\`, \`max_connections=10\`
* Removed \`socket_timeout\` to prevent worker self-termination during long operations
* Added \`CELERY_BROKER_HEARTBEAT = 30\` for application-level heartbeat monitoring
* Added \`CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True\` for connection retry on startup
* Added \`CELERY_BROKER_CONNECTION_MAX_RETRIES = None\` to allow unlimited connection retries

### Related Issues

Fixes #1072

## Detailed Description

This PR addresses the \"Zombie Tasks\" issue where Celery tasks become stuck in \"Started\" state due to OpenStack Neutron's 1-hour idle timeout on TCP connections.

### The Problem

When long-running ML processing tasks idle on RabbitMQ connections, the OpenStack firewall silently severs these connections after 1 hour of inactivity. This causes two failure modes:

1. **Hard Close**: The firewall drops the connection without notification, causing worker crashes and orphaned tasks
2. **Soft Close**: Missed application-level heartbeats lead RabbitMQ to close connections after detecting worker unavailability

### The Solution

This PR implements OS-level TCP keepalive probes that ping the connection more frequently than the 1-hour timeout:

- **TCP_KEEPIDLE = 60 seconds**: Start sending keepalive probes after 60 seconds of idle time
- **TCP_KEEPINTVL = 10 seconds**: Send probes every 10 seconds if no response
- **TCP_KEEPCNT = 9**: Send up to 9 probes before declaring the connection dead

This ensures the OS detects and handles connection failures within ~2.5 minutes (60 + 9×10 seconds) rather than relying on application-level timeouts or firewall drops.

Additionally:
- Removed \`socket_timeout\` to prevent workers from self-terminating during long blocking operations
- Configured connection retry parameters for better stability and resilience
- Maintained \`CELERY_BROKER_HEARTBEAT = 30\` for application-level health monitoring

### Potential Side Effects

- Slightly increased network traffic due to keepalive probes (minimal impact)
- Workers will retry connections indefinitely on startup failures (may need monitoring)

### Long-term Recommendation

Consider refactoring long master tasks into Celery chords to avoid prolonged idle connections on the broker.

### How to Test the Changes

1. Start a new ML processing job in production
2. SSH into a Worker VM after deployment
3. Verify TCP keepalive timers are active:

\`\`\`bash
# This lists TCP connections to port 5672 with timer details
sudo ss -tonp | grep 5672
\`\`\`

Expected output should show a keepalive timer counting down from ~60 seconds.

## Deployment Notes

**Important**: Each worker must be restarted after deploying for the new connection settings to take effect.

## Checklist

- [ ] I have tested these changes appropriately.
- [ ] I have added and/or modified relevant tests.
- [ ] I updated relevant documentation or comments.
- [ ] I have verified that this PR follows the project's coding standards.
- [ ] Any dependent changes have already been merged to main."

Length of output: 877


@mihow, I'm unable to update the PR description directly due to workflow restrictions on the gh pr edit command. However, I've prepared the updated description using your existing template. You can copy and paste this into the PR description:


Summary

Configures Celery broker transport options to prevent long-running tasks from being silently canceled due to OpenStack networking idle timeouts.

List of Changes

  • Added socket import to config/settings/base.py
  • Configured CELERY_BROKER_TRANSPORT_OPTIONS with TCP keepalive settings:
    • Enabled socket_keepalive
    • Set socket_settings with TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9
    • Added socket_connect_timeout=10, retry_on_timeout=True, max_connections=10
  • Removed socket_timeout to prevent worker self-termination during long operations
  • Added CELERY_BROKER_HEARTBEAT = 30 for application-level heartbeat monitoring
  • Added CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True for connection retry on startup
  • Added CELERY_BROKER_CONNECTION_MAX_RETRIES = None to allow unlimited connection retries

Related Issues

Fixes #1072

Detailed Description

This PR addresses the "Zombie Tasks" issue where Celery tasks become stuck in "Started" state due to OpenStack Neutron's 1-hour idle timeout on TCP connections.

The Problem

When long-running ML processing tasks idle on RabbitMQ connections, the OpenStack firewall silently severs these connections after 1 hour of inactivity. This causes two failure modes:

  1. Hard Close: The firewall drops the connection without notification, causing worker crashes and orphaned tasks
  2. Soft Close: Missed application-level heartbeats lead RabbitMQ to close connections after detecting worker unavailability

The Solution

This PR implements OS-level TCP keepalive probes that ping the connection more frequently than the 1-hour timeout:

  • TCP_KEEPIDLE = 60 seconds: Start sending keepalive probes after 60 seconds of idle time
  • TCP_KEEPINTVL = 10 seconds: Send probes every 10 seconds if no response
  • TCP_KEEPCNT = 9: Send up to 9 probes before declaring the connection dead

This ensures the OS detects and handles connection failures within ~2.5 minutes (60 + 9×10 seconds) rather than relying on application-level timeouts or firewall drops.

Additionally:

  • Removed socket_timeout to prevent workers from self-terminating during long blocking operations
  • Configured connection retry parameters for better stability and resilience
  • Maintained CELERY_BROKER_HEARTBEAT = 30 for application-level health monitoring

Potential Side Effects

  • Slightly increased network traffic due to keepalive probes (minimal impact)
  • Workers will retry connections indefinitely on startup failures (may need monitoring)

Long-term Recommendation

Consider refactoring long master tasks into Celery chords to avoid prolonged idle connections on the broker.

How to Test the Changes

  1. Start a new ML processing job in production
  2. SSH into a Worker VM after deployment
  3. Verify TCP keepalive timers are active:
# This lists TCP connections to port 5672 with timer details
sudo ss -tonp | grep 5672

Expected output should show a keepalive timer counting down from ~60 seconds.

Deployment Notes

Important: Each worker must be restarted after deploying for the new connection settings to take effect.

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

You can edit the PR description by clicking the "..." menu next to the title on the PR page and selecting "Edit".

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
config/settings/base.py (1)

375-375: Nit: fix quoting in the socket_timeout removal comment

The comment currently reads:

# REMOVED "socket_timeout: 120" to prevent workers self-destructing during long blocking operations.

For clarity and consistency with dictionary syntax, consider:

# REMOVED "socket_timeout": 120 to prevent workers self-destructing during long blocking operations.

This matches the actual key/value form that was removed.

🧹 Nitpick comments (2)
config/settings/base.py (2)

359-369: Make TCP keepalive settings safe on non-Linux platforms

Using socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL, and socket.TCP_KEEPCNT directly at import time can raise AttributeError on platforms where these constants are not defined (e.g., some macOS/Windows setups). That would break importing the Django settings even if Celery isn’t used on those machines.

Consider constructing socket_settings defensively, only adding options that exist on the current platform, e.g.:

-CELERY_BROKER_TRANSPORT_OPTIONS = {
-    # Custom TCP Keepalives to ensure network stack doesn't silently drop connections
-    "socket_keepalive": True,
-    "socket_settings": {
-        # Start sending Keepalive packets after 60 seconds of silence.
-        # This forces traffic on the wire, preventing the OpenStack 1-hour timeout.
-        socket.TCP_KEEPIDLE: 60,
-        # If no response, retry every 10 seconds.
-        socket.TCP_KEEPINTVL: 10,
-        # Give up and close connection after 9 failed attempts.
-        socket.TCP_KEEPCNT: 9,
-    },
+_celery_socket_settings: dict[int, int] = {}
+for _name, _value in (
+    ("TCP_KEEPIDLE", 60),
+    ("TCP_KEEPINTVL", 10),
+    ("TCP_KEEPCNT", 9),
+):
+    if hasattr(socket, _name):
+        _celery_socket_settings[getattr(socket, _name)] = _value
+
+CELERY_BROKER_TRANSPORT_OPTIONS = {
+    # Custom TCP Keepalives to ensure network stack doesn't silently drop connections
+    "socket_keepalive": True,
+    "socket_settings": _celery_socket_settings,

This keeps the intended Linux behavior while avoiding crashes on platforms that don’t expose these constants.


370-375: Consider using CELERY_BROKER_HEARTBEAT for clearer configuration intent

The heartbeat is currently set in CELERY_BROKER_TRANSPORT_OPTIONS as "heartbeat": 30. While this is a valid approach that does configure the AMQP heartbeat for py-amqp transport, using the canonical Celery setting CELERY_BROKER_HEARTBEAT = 30 is the recommended pattern and more explicitly documents the intention.

Both methods ultimately negotiate the same client-side heartbeat with RabbitMQ, so the current configuration is functional at 30 seconds. However, migrating to the high-level Celery setting would improve clarity and align with Celery's documented best practices.

Optional refactoring (a minimal sketch follows this list):

  • Add CELERY_BROKER_HEARTBEAT = 30 to the Celery configuration
  • Remove "heartbeat": 30 from CELERY_BROKER_TRANSPORT_OPTIONS to avoid duplication
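A minimal sketch of that optional refactor, for illustration only (not the current diff):

```python
CELERY_BROKER_HEARTBEAT = 30  # canonical Celery setting for the AMQP heartbeat

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_keepalive": True,
    # ... keepalive socket_settings as above; no "heartbeat" key here,
    # so the heartbeat is configured in exactly one place.
}
```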
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35d2c6c and dc2cbe5.

📒 Files selected for processing (1)
  • config/settings/base.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (1)
config/settings/base.py (1)

5-5: socket import is appropriate for TCP keepalive settings

This import is required for the socket.TCP_* constants used in CELERY_BROKER_TRANSPORT_OPTIONS below; no issues here.

@mihow mihow merged commit f899482 into main Dec 4, 2025
7 checks passed
@mihow mihow deleted the fix/jobs-tcp-keepalive branch December 4, 2025 02:41
@mihow
Collaborator Author

mihow commented Dec 4, 2025

The fixes have been deployed:

  • Jobs should now continue showing activity after 1+ hours (check job logs within the UI)
  • We should stop seeing ConnectionResetError in the Celery logs (check the logs on the Celery server)

Check this job after Dec 3rd, 9pm PST to see if there is still activity in the logs:
https://antenna.insectai.org/projects/90/jobs/2047

We can see that the new keepalives ARE active! Here the keepalive timers are always under 60 seconds and counting down:

debian@ami-worker-2:~/ami-platform$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS          PORTS     NAMES
af4d42049405   insectai/ami_backend   "/entrypoint /start-…"   29 minutes ago   Up 29 minutes             ami-platform-celeryworker-1
debian@ami-worker-2:~/ami-platform$ PID=$(docker inspect -f '{{.State.Pid}}' af4d42049405)
debian@ami-worker-2:~/ami-platform$ sudo nsenter -t $PID -n ss -ton state established '( dport = :5672 )'
Recv-Q               Send-Q                             Local Address:Port                                  Peer Address:Port               Process                                  
0                    0                                     172.18.0.2:45800                              192.168.123.176:5672                timer:(keepalive,52sec,0)               
8                    0                                     172.18.0.2:45812                              192.168.123.176:5672                timer:(keepalive,46sec,0)               
8                    0                                     172.18.0.2:37234                              192.168.123.176:5672                timer:(keepalive,10sec,0)               
24                   0                                     172.18.0.2:60826                              192.168.123.176:5672                timer:(keepalive,36sec,0)               
debian@ami-worker-2:~/ami-platform$ 

This can be tested again with:

docker inspect -f '{{.State.Pid}}' $(docker ps -qf "name=celery" | head -n1) | xargs -I {} sudo nsenter -t {} -n ss -ton state established '( dport = :5672 )'


Development

Successfully merging this pull request may close these issues.

Fix dangling processing jobs by adding TCP keepalives
