
Conversation

@mihow
Collaborator

@mihow mihow commented Dec 4, 2025

Summary

Configures Celery broker transport options to prevent long-running tasks from being silently canceled due to OpenStack networking idle timeouts.

List of Changes

  • Added socket import to config/settings/base.py
  • Configured CELERY_BROKER_TRANSPORT_OPTIONS with TCP keepalive settings:
    • Enabled socket_keepalive
    • Set socket_settings with TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9
    • Added socket_connect_timeout=10, retry_on_timeout=True, max_connections=10
  • Removed socket_timeout to prevent worker self-termination during long operations
  • Added CELERY_BROKER_HEARTBEAT = 30 for application-level heartbeat monitoring
  • Added CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True for connection retry on startup
  • Added CELERY_BROKER_CONNECTION_MAX_RETRIES = None to allow unlimited connection retries

Related Issues

Fixes #1072

Detailed Description

This PR addresses the "Zombie Tasks" issue where Celery tasks become stuck in "Started" state due to OpenStack Neutron's 1-hour idle timeout on TCP connections.

The Problem

When long-running ML processing tasks idle on RabbitMQ connections, the OpenStack firewall silently severs these connections after 1 hour of inactivity. This causes two failure modes:

  1. Hard Close: The firewall drops the connection without notification, causing worker crashes and orphaned tasks
  2. Soft Close: Missed application-level heartbeats lead RabbitMQ to close connections after detecting worker unavailability

The Solution

This PR implements OS-level TCP keepalive probes that ping the connection more frequently than the 1-hour timeout:

  • TCP_KEEPIDLE = 60 seconds: Start sending keepalive probes after 60 seconds of idle time
  • TCP_KEEPINTVL = 10 seconds: Send probes every 10 seconds if no response
  • TCP_KEEPCNT = 9: Send up to 9 probes before declaring the connection dead

This ensures the OS detects and handles connection failures within ~2.5 minutes (60 + 9×10 seconds) rather than relying on application-level timeouts or firewall drops.
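A minimal sketch of the configuration described above, assembled from the list of changes; treat it as illustrative rather than the exact merged diff (the review walkthrough below reports socket_connect_timeout=40 and max_connections=20):

```python
# Illustrative sketch only, not necessarily the exact merged settings.
import socket

CELERY_BROKER_TRANSPORT_OPTIONS = {
    # OS-level TCP keepalives keep traffic on the wire during idle periods.
    "socket_keepalive": True,
    "socket_settings": {
        socket.TCP_KEEPIDLE: 60,   # start probing after 60 seconds of silence
        socket.TCP_KEEPINTVL: 10,  # re-probe every 10 seconds when unanswered
        socket.TCP_KEEPCNT: 9,     # close the connection after 9 failed probes
    },
    "socket_connect_timeout": 10,  # fail fast when the broker is unreachable
    "retry_on_timeout": True,
    "max_connections": 10,
    # Note: no "socket_timeout" key, so long blocking operations cannot
    # cause the worker to time out its own broker socket.
}

CELERY_BROKER_HEARTBEAT = 30                      # application-level heartbeat
CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True  # retry broker connection on startup
CELERY_BROKER_CONNECTION_MAX_RETRIES = None       # retry indefinitely
```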

Additionally:

  • Removed socket_timeout to prevent workers from self-terminating during long blocking operations
  • Configured connection retry parameters for better stability and resilience
  • Maintained CELERY_BROKER_HEARTBEAT = 30 for application-level health monitoring

Potential Side Effects

  • Slightly increased network traffic due to keepalive probes (minimal impact)
  • Workers will retry connections indefinitely on startup failures (may need monitoring)

Long-term Recommendation

Consider refactoring long master tasks into Celery chords to avoid prolonged idle connections on the broker.
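A hypothetical sketch of that pattern (task names and arguments are illustrative, not the project's actual tasks): instead of one master task blocking while child tasks run, a chord fires its callback only when all child tasks finish, so no worker holds an idle broker connection for the length of the job.

```python
from celery import chord, shared_task

@shared_task
def process_chunk(chunk_id):
    """Process one batch of source images (placeholder body)."""
    ...

@shared_task
def finalize_job(results, job_id):
    """Aggregate chunk results and mark the job complete (placeholder body)."""
    ...

def start_job(job_id, chunk_ids):
    # Header tasks run in parallel; the callback runs once they all complete,
    # so nothing sits idle on a broker connection waiting for children.
    return chord(process_chunk.s(c) for c in chunk_ids)(finalize_job.s(job_id))
```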

How to Test the Changes

  1. Start a new ML processing job in production
  2. SSH into a Worker VM after deployment
  3. Verify TCP keepalive timers are active:
# This lists TCP connections to port 5672 with timer details
sudo ss -tonp | grep 5672

Expected output should show a keepalive timer counting down from ~60 seconds.
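For illustration, each matching connection in the output should include a keepalive timer field like the ones captured in the post-deploy verification later in this thread:

```
timer:(keepalive,52sec,0)
```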

Deployment Notes

Each worker must be restarted after deployment for the new connection settings to take effect (this happens automatically with the deployment script). Then start a long-running job to verify the fix.

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

Copilot AI review requested due to automatic review settings December 4, 2025 02:25
@netlify

netlify bot commented Dec 4, 2025

Deploy Preview for antenna-preview canceled.

🔨 Latest commit: dc2cbe5
🔍 Latest deploy log: https://app.netlify.com/projects/antenna-preview/deploys/6930f2712acbbb00084aab4a

@coderabbitai
Contributor

coderabbitai bot commented Dec 4, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Import socket and update Celery broker transport options in config/settings/base.py: enable TCP keepalives with specific TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT settings, remove socket_timeout, add socket_connect_timeout, retain retry_on_timeout/max_connections and heartbeat settings.

Changes

Cohort / File(s): Celery Broker Connection Settings (config/settings/base.py)
Summary: Imported socket; replaced CELERY_BROKER_TRANSPORT_OPTIONS to remove socket_timeout, enable "socket_keepalive": True, add socket_settings with socket.TCP_KEEPIDLE=60, socket.TCP_KEEPINTVL=10, socket.TCP_KEEPCNT=9, set socket_connect_timeout=40, keep retry_on_timeout=True, max_connections=20, and retain heartbeat=30. Added explanatory comment about removing socket_timeout.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Check correctness of the socket constants and import placement.
  • Confirm the removed socket_timeout was intentionally dropped and that no code depends on it.
  • Validate values against deployment environment/network idle time.

Possibly related PRs

Suggested reviewers

  • carlosgjs

Poem

"I'm a rabbit in the config field, nibbling the idle blight,
I poke the wire every sixty ticks, keeping tasks in sight.
No more silent severed threads, no zombies in the log,
With keepalives hopping steady, the broker walks the jog. 🐇✨"

Pre-merge checks and finishing touches

✅ Passed checks (5 passed)
Title check: ✅ Passed. The title clearly and concisely summarizes the main change: updating Celery settings to prevent long-running tasks from being silently canceled by adding TCP keepalives.
Linked Issues check: ✅ Passed. The PR implements all coding-related requirements from issue #1072: enabling TCP keepalives with socket_settings (TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9), removing socket_timeout, setting socket_connect_timeout=40, and preserving CELERY_BROKER_HEARTBEAT=30.
Out of Scope Changes check: ✅ Passed. All changes are directly scoped to issue #1072: modifications are limited to CELERY_BROKER_TRANSPORT_OPTIONS configuration in config/settings/base.py and adding the required socket import.
Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check: ✅ Passed. The pull request description is comprehensive and follows the required template with all major sections completed.


Contributor

Copilot AI left a comment


Pull request overview

This PR adds Celery broker transport options to prevent long-running tasks from being silently canceled due to idle connection timeouts in OpenStack environments. The key change is implementing TCP keepalive settings to maintain active connections.

Key Changes:

  • Adds TCP keepalive configuration with custom socket settings to prevent 1-hour OpenStack timeout
  • Removes socket_timeout from broker transport options to avoid workers terminating during long operations
  • Adds explicit CELERY_BROKER_HEARTBEAT setting
Comments suppressed due to low confidence (1)

config/settings/base.py:358

  • This assignment to 'CELERY_BROKER_TRANSPORT_OPTIONS' is unnecessary as it is redefined before this value is used.
CELERY_BROKER_TRANSPORT_OPTIONS = {


@mihow
Collaborator Author

mihow commented Dec 4, 2025

@coderabbitai update PR description using the existing template

@coderabbitai
Contributor

coderabbitai bot commented Dec 4, 2025


🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find PR template files
fd -t f "PULL_REQUEST_TEMPLATE" .github

# Also check for common template locations
fd -t f -e md "pull_request" .github

# If not found in .github, check root
fd -t f "PULL_REQUEST_TEMPLATE"

Length of output: 32


🏁 Script executed:

#!/bin/bash
# Get the current PR description
gh pr view 1073 --json title,body --jq '.body'

Length of output: 1289


🏁 Script executed:

#!/bin/bash
# Read the PR template
cat .github/pull_request_template.md

Length of output: 1431


🏁 Script executed:

#!/bin/bash
# Update the PR description with the template format
gh pr edit 1073 --body "## Summary

Configures Celery broker transport options to prevent long-running tasks from being silently canceled due to OpenStack networking idle timeouts.

### List of Changes

* Added \`socket\` import to \`config/settings/base.py\`
* Configured \`CELERY_BROKER_TRANSPORT_OPTIONS\` with TCP keepalive settings:
  - Enabled \`socket_keepalive\`
  - Set \`socket_settings\` with \`TCP_KEEPIDLE=60\`, \`TCP_KEEPINTVL=10\`, \`TCP_KEEPCNT=9\`
  - Added \`socket_connect_timeout=10\`, \`retry_on_timeout=True\`, \`max_connections=10\`
* Removed \`socket_timeout\` to prevent worker self-termination during long operations
* Added \`CELERY_BROKER_HEARTBEAT = 30\` for application-level heartbeat monitoring
* Added \`CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True\` for connection retry on startup
* Added \`CELERY_BROKER_CONNECTION_MAX_RETRIES = None\` to allow unlimited connection retries

### Related Issues

Fixes #1072

## Detailed Description

This PR addresses the \"Zombie Tasks\" issue where Celery tasks become stuck in \"Started\" state due to OpenStack Neutron's 1-hour idle timeout on TCP connections.

### The Problem

When long-running ML processing tasks idle on RabbitMQ connections, the OpenStack firewall silently severs these connections after 1 hour of inactivity. This causes two failure modes:

1. **Hard Close**: The firewall drops the connection without notification, causing worker crashes and orphaned tasks
2. **Soft Close**: Missed application-level heartbeats lead RabbitMQ to close connections after detecting worker unavailability

### The Solution

This PR implements OS-level TCP keepalive probes that ping the connection more frequently than the 1-hour timeout:

- **TCP_KEEPIDLE = 60 seconds**: Start sending keepalive probes after 60 seconds of idle time
- **TCP_KEEPINTVL = 10 seconds**: Send probes every 10 seconds if no response
- **TCP_KEEPCNT = 9**: Send up to 9 probes before declaring the connection dead

This ensures the OS detects and handles connection failures within ~2.5 minutes (60 + 9×10 seconds) rather than relying on application-level timeouts or firewall drops.

Additionally:
- Removed \`socket_timeout\` to prevent workers from self-terminating during long blocking operations
- Configured connection retry parameters for better stability and resilience
- Maintained \`CELERY_BROKER_HEARTBEAT = 30\` for application-level health monitoring

### Potential Side Effects

- Slightly increased network traffic due to keepalive probes (minimal impact)
- Workers will retry connections indefinitely on startup failures (may need monitoring)

### Long-term Recommendation

Consider refactoring long master tasks into Celery chords to avoid prolonged idle connections on the broker.

### How to Test the Changes

1. Start a new ML processing job in production
2. SSH into a Worker VM after deployment
3. Verify TCP keepalive timers are active:

\`\`\`bash
# This lists TCP connections to port 5672 with timer details
sudo ss -tonp | grep 5672
\`\`\`

Expected output should show a keepalive timer counting down from ~60 seconds.

## Deployment Notes

**Important**: Each worker must be restarted after deploying for the new connection settings to take effect.

## Checklist

- [ ] I have tested these changes appropriately.
- [ ] I have added and/or modified relevant tests.
- [ ] I updated relevant documentation or comments.
- [ ] I have verified that this PR follows the project's coding standards.
- [ ] Any dependent changes have already been merged to main."

Length of output: 877


@mihow, I'm unable to update the PR description directly due to workflow restrictions on the gh pr edit command. However, I've prepared the updated description using your existing template. You can copy and paste this into the PR description:


Summary

Configures Celery broker transport options to prevent long-running tasks from being silently canceled due to OpenStack networking idle timeouts.

List of Changes

  • Added socket import to config/settings/base.py
  • Configured CELERY_BROKER_TRANSPORT_OPTIONS with TCP keepalive settings:
    • Enabled socket_keepalive
    • Set socket_settings with TCP_KEEPIDLE=60, TCP_KEEPINTVL=10, TCP_KEEPCNT=9
    • Added socket_connect_timeout=10, retry_on_timeout=True, max_connections=10
  • Removed socket_timeout to prevent worker self-termination during long operations
  • Added CELERY_BROKER_HEARTBEAT = 30 for application-level heartbeat monitoring
  • Added CELERY_BROKER_CONNECTION_RETRY_ON_STARTUP = True for connection retry on startup
  • Added CELERY_BROKER_CONNECTION_MAX_RETRIES = None to allow unlimited connection retries

Related Issues

Fixes #1072

Detailed Description

This PR addresses the "Zombie Tasks" issue where Celery tasks become stuck in "Started" state due to OpenStack Neutron's 1-hour idle timeout on TCP connections.

The Problem

When long-running ML processing tasks idle on RabbitMQ connections, the OpenStack firewall silently severs these connections after 1 hour of inactivity. This causes two failure modes:

  1. Hard Close: The firewall drops the connection without notification, causing worker crashes and orphaned tasks
  2. Soft Close: Missed application-level heartbeats lead RabbitMQ to close connections after detecting worker unavailability

The Solution

This PR implements OS-level TCP keepalive probes that ping the connection more frequently than the 1-hour timeout:

  • TCP_KEEPIDLE = 60 seconds: Start sending keepalive probes after 60 seconds of idle time
  • TCP_KEEPINTVL = 10 seconds: Send probes every 10 seconds if no response
  • TCP_KEEPCNT = 9: Send up to 9 probes before declaring the connection dead

This ensures the OS detects and handles connection failures within ~2.5 minutes (60 + 9×10 seconds) rather than relying on application-level timeouts or firewall drops.

Additionally:

  • Removed socket_timeout to prevent workers from self-terminating during long blocking operations
  • Configured connection retry parameters for better stability and resilience
  • Maintained CELERY_BROKER_HEARTBEAT = 30 for application-level health monitoring

Potential Side Effects

  • Slightly increased network traffic due to keepalive probes (minimal impact)
  • Workers will retry connections indefinitely on startup failures (may need monitoring)

Long-term Recommendation

Consider refactoring long master tasks into Celery chords to avoid prolonged idle connections on the broker.

How to Test the Changes

  1. Start a new ML processing job in production
  2. SSH into a Worker VM after deployment
  3. Verify TCP keepalive timers are active:
# This lists TCP connections to port 5672 with timer details
sudo ss -tonp | grep 5672

Expected output should show a keepalive timer counting down from ~60 seconds.

Deployment Notes

Important: Each worker must be restarted after deploying for the new connection settings to take effect.

Checklist

  • I have tested these changes appropriately.
  • I have added and/or modified relevant tests.
  • I updated relevant documentation or comments.
  • I have verified that this PR follows the project's coding standards.
  • Any dependent changes have already been merged to main.

You can edit the PR description by clicking the "..." menu next to the title on the PR page and selecting "Edit".

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
config/settings/base.py (1)

375-375: Nit: fix quoting in the socket_timeout removal comment

The comment currently reads:

# REMOVED "socket_timeout: 120" to prevent workers self-destructing during long blocking operations.

For clarity and consistency with dictionary syntax, consider:

# REMOVED "socket_timeout": 120 to prevent workers self-destructing during long blocking operations.

This matches the actual key/value form that was removed.

🧹 Nitpick comments (2)
config/settings/base.py (2)

359-369: Make TCP keepalive settings safe on non-Linux platforms

Using socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL, and socket.TCP_KEEPCNT directly at import time can raise AttributeError on platforms where these constants are not defined (e.g., some macOS/Windows setups). That would break importing the Django settings even if Celery isn’t used on those machines.

Consider constructing socket_settings defensively, only adding options that exist on the current platform, e.g.:

-CELERY_BROKER_TRANSPORT_OPTIONS = {
-    # Custom TCP Keepalives to ensure network stack doesn't silently drop connections
-    "socket_keepalive": True,
-    "socket_settings": {
-        # Start sending Keepalive packets after 60 seconds of silence.
-        # This forces traffic on the wire, preventing the OpenStack 1-hour timeout.
-        socket.TCP_KEEPIDLE: 60,
-        # If no response, retry every 10 seconds.
-        socket.TCP_KEEPINTVL: 10,
-        # Give up and close connection after 9 failed attempts.
-        socket.TCP_KEEPCNT: 9,
-    },
+_celery_socket_settings: dict[int, int] = {}
+for _name, _value in (
+    ("TCP_KEEPIDLE", 60),
+    ("TCP_KEEPINTVL", 10),
+    ("TCP_KEEPCNT", 9),
+):
+    if hasattr(socket, _name):
+        _celery_socket_settings[getattr(socket, _name)] = _value
+
+CELERY_BROKER_TRANSPORT_OPTIONS = {
+    # Custom TCP Keepalives to ensure network stack doesn't silently drop connections
+    "socket_keepalive": True,
+    "socket_settings": _celery_socket_settings,

This keeps the intended Linux behavior while avoiding crashes on platforms that don’t expose these constants.


370-375: Consider using CELERY_BROKER_HEARTBEAT for clearer configuration intent

The heartbeat is currently set in CELERY_BROKER_TRANSPORT_OPTIONS as "heartbeat": 30. While this is a valid approach that does configure the AMQP heartbeat for py-amqp transport, using the canonical Celery setting CELERY_BROKER_HEARTBEAT = 30 is the recommended pattern and more explicitly documents the intention.

Both methods ultimately negotiate the same client-side heartbeat with RabbitMQ, so the current configuration is functional at 30 seconds. However, migrating to the high-level Celery setting would improve clarity and align with Celery's documented best practices.

Optional refactoring (a minimal sketch follows this list):

  • Add CELERY_BROKER_HEARTBEAT = 30 to the Celery configuration
  • Remove "heartbeat": 30 from CELERY_BROKER_TRANSPORT_OPTIONS to avoid duplication
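A minimal sketch of that optional refactor, for illustration only (not the current diff):

```python
CELERY_BROKER_HEARTBEAT = 30  # canonical Celery setting for the AMQP heartbeat

CELERY_BROKER_TRANSPORT_OPTIONS = {
    "socket_keepalive": True,
    # ... keepalive socket_settings as above; no "heartbeat" key here,
    # so the heartbeat is configured in exactly one place.
}
```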
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35d2c6c and dc2cbe5.

📒 Files selected for processing (1)
  • config/settings/base.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: test
🔇 Additional comments (1)
config/settings/base.py (1)

5-5: socket import is appropriate for TCP keepalive settings

This import is required for the socket.TCP_* constants used in CELERY_BROKER_TRANSPORT_OPTIONS below; no issues here.

@mihow mihow merged commit f899482 into main Dec 4, 2025
7 checks passed
@mihow mihow deleted the fix/jobs-tcp-keepalive branch December 4, 2025 02:41
@mihow
Collaborator Author

mihow commented Dec 4, 2025

The fixes have been deployed:

  • Jobs should now continue showing activity after 1+ hours (check job logs within the UI)
  • We should stop seeing ConnectionResetError in the Celery logs (check the logs on the Celery server)

Check this job after Dec 3rd, 9pm PST to see if there is still activity in the logs:
https://antenna.insectai.org/projects/90/jobs/2047

We can see that the new keepalives ARE active! Here the keepalive timers are always under 60 seconds and counting down:

debian@ami-worker-2:~/ami-platform$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS          PORTS     NAMES
af4d42049405   insectai/ami_backend   "/entrypoint /start-…"   29 minutes ago   Up 29 minutes             ami-platform-celeryworker-1
debian@ami-worker-2:~/ami-platform$ PID=$(docker inspect -f '{{.State.Pid}}' af4d42049405)
debian@ami-worker-2:~/ami-platform$ sudo nsenter -t $PID -n ss -ton state established '( dport = :5672 )'
Recv-Q               Send-Q                             Local Address:Port                                  Peer Address:Port               Process                                  
0                    0                                     172.18.0.2:45800                              192.168.123.176:5672                timer:(keepalive,52sec,0)               
8                    0                                     172.18.0.2:45812                              192.168.123.176:5672                timer:(keepalive,46sec,0)               
8                    0                                     172.18.0.2:37234                              192.168.123.176:5672                timer:(keepalive,10sec,0)               
24                   0                                     172.18.0.2:60826                              192.168.123.176:5672                timer:(keepalive,36sec,0)               
debian@ami-worker-2:~/ami-platform$ 

This can be tested again with:

docker inspect -f '{{.State.Pid}}' $(docker ps -qf "name=celery" | head -n1) | xargs -I {} sudo nsenter -t {} -n ss -ton state established '( dport = :5672 )'


Development

Successfully merging this pull request may close these issues.

Fix dangling processing jobs by adding TCP keepalives
