
Fix/2229 webhook service down failures #2477


Open
wants to merge 2 commits into
base: main

Conversation

paulnegz

Fix: Webhook Service Down No Longer Blocks Async Predictions or Cancellation
- Background execution: Webhooks now run in background threads using a ThreadPoolExecutor, so failures or timeouts do not block the main prediction flow.
- Timeouts and retries improved: Webhook calls have a default 10s timeout (configurable via COG_WEBHOOK_TIMEOUT), and terminal status webhooks now retry up to 6 times (down from 12), reducing worst-case wait from ~320s to ~60s.
- Graceful error handling: Connection errors, timeouts, and HTTP errors are logged but do not block or crash the worker.
- Comprehensive tests added: New and improved tests simulate webhook timeouts, connection failures, and retry logic, and verify that cancellation and health checks are never blocked by webhook issues.
- Bonus: Fixed a bug in Dockerfile generation where GOARCH was incorrectly set to runtime.GOOS instead of runtime.GOARCH.
Closes #2229.
Async predictions and cancellation are now robust to webhook service outages.
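
For reviewers who want a picture of the background-execution idea in the description above, here is a minimal sketch. It is not the code in this PR: `BackgroundWebhook`, `Sender`, and the injected `sender` callable are hypothetical names chosen for illustration. The only point it shows is that a delivery submitted to a `ThreadPoolExecutor` returns immediately, and that a failure is logged and turned into a `False` result rather than raised into the prediction flow.

```python
import logging
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict

log = logging.getLogger("webhook_sketch")

# Hypothetical type: any callable that delivers one payload and may raise.
Sender = Callable[[Dict[str, Any]], None]


class BackgroundWebhook:
    """Runs webhook deliveries on worker threads so the caller never waits."""

    def __init__(self, sender: Sender, max_workers: int = 4) -> None:
        self._sender = sender
        self._pool = ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix="webhook"
        )

    def _deliver(self, payload: Dict[str, Any]) -> bool:
        # Any delivery error is logged and swallowed here, on the worker
        # thread, so it can never propagate into the caller.
        try:
            self._sender(payload)
            return True
        except Exception as exc:
            log.warning("webhook delivery failed: %s", exc)
            return False

    def notify(self, payload: Dict[str, Any]) -> "Future[bool]":
        # submit() queues the work and returns at once; the caller may
        # ignore the future or inspect it later.
        return self._pool.submit(self._deliver, payload)

    def close(self) -> None:
        # Wait for queued deliveries to finish before shutting down.
        self._pool.shutdown(wait=True)
```

With a sender that raises `ConnectionRefusedError`, `notify()` still returns right away and the future resolves to `False`, which is the behavior the description asks for when the webhook service is down.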

paulnegz added 2 commits July 29, 2025 04:07
This change replaces pip with uv for Python package installation in container builds.
Key changes:
- Update StandardGenerator to use uv for package installation
- Add proper uv caching configuration
- Update tests to expect uv-based commands
- Update documentation to reflect uv usage

Fixes replicate#2167

Signed-off-by: Paul Negedu <[email protected]>
…cancellation

- Add webhook timeout (10s default, configurable via COG_WEBHOOK_TIMEOUT)
- Use ThreadPoolExecutor for webhook calls to prevent blocking main thread
- Reduce max retries from 12 to 6 to avoid blocking too long (~60s vs 320s)
- Add comprehensive tests for timeout, retry behavior, and background execution
- Fix GOARCH assignment bug in dockerfile generation

This fixes issue replicate#2229 where webhook service being down would:
1. Block async /predictions requests indefinitely
2. Prevent cancellation of stuck requests
3. Leave health check stuck in 'BUSY' state

The fix ensures webhook failures are handled gracefully in background threads
without blocking the main prediction workflow.

Signed-off-by: Paul Negedu <[email protected]>
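
The timeout and retry limits in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `webhook_timeout`, `send_with_retries`, and `worst_case_seconds` are hypothetical helpers, the retry count is treated as the number of attempts, and backoff sleeps between attempts are ignored. Only the `COG_WEBHOOK_TIMEOUT` variable name, the 10s default, and the limit of 6 come from the PR text.

```python
import os
from typing import Callable, Mapping, Optional

DEFAULT_TIMEOUT_SECONDS = 10.0  # default stated in the PR description


def webhook_timeout(env: Optional[Mapping[str, str]] = None) -> float:
    """Read COG_WEBHOOK_TIMEOUT, falling back to the default when the
    variable is missing, not a number, or not positive."""
    env = os.environ if env is None else env
    raw = env.get("COG_WEBHOOK_TIMEOUT")
    if raw is None:
        return DEFAULT_TIMEOUT_SECONDS
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_TIMEOUT_SECONDS
    return value if value > 0 else DEFAULT_TIMEOUT_SECONDS


def send_with_retries(
    send: Callable[[float], None], attempts: int = 6, timeout: float = 10.0
) -> bool:
    """Call send(timeout) up to `attempts` times; report success as a bool.

    Only the built-in ConnectionError and TimeoutError families are caught
    here. A real HTTP client raises its own exception types, which would
    need to be added.
    """
    for _ in range(attempts):
        try:
            send(timeout)
            return True
        except (ConnectionError, TimeoutError):
            continue
    return False


def worst_case_seconds(attempts: int, timeout: float) -> float:
    # Upper bound when every attempt runs into the full timeout
    # (assumption: no sleep between attempts).
    return attempts * timeout
```

Under these assumptions, 6 attempts at 10s each bound the wait at 60s, which is one plausible reading of the "~60s" figure quoted in the commit message; the actual figure depends on the backoff policy the code uses.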
@paulnegz
Author

@zeke I'd appreciate a review and feedback on this. I'm open to making adjustments to move this forward.

Development

Successfully merging this pull request may close these issues.

[Bug] Webhook Service Down Causes Async /predictions to Fail & Blocks Cancellation with exceptions ConnectionRefusedError and MaxRetryError
1 participant