fix(iast): gevent flaky timeouts #14062
Bootstrap import analysis
Comparison of import times between this PR and base.
Summary
The average import time from this PR is: 279 ± 4 ms.
The average import time from base is: 285 ± 6 ms.
The import time difference between this PR and base is: -5.9 ± 0.2 ms.
Import time breakdown
The following import paths have appeared:
Performance SLOs
Candidate: avara1986/APPSEC-58276_iast_standalone (2f56e75)
🔵 No Baseline Data (24 suites)
🔵 coreapiscenario - 12/12 (2 unstable)
🔵 No baseline data available for this suite
Gevent Worker Timeouts with Gunicorn and IAST
We identified an issue where applications using Gunicorn with the Gevent worker class may experience random worker timeouts, particularly during shutdown sequences. A typical command setup might look like:
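The original command is not reproduced here. As a rough equivalent, the same kind of setup can be written as a Gunicorn configuration file. The bind address, worker count, and timeout below are assumed values for illustration, not the ones from the report:

```python
# gunicorn.conf.py: illustrative settings only, not the exact setup from this PR.
bind = "0.0.0.0:8000"    # assumed listen address
workers = 1              # assumed worker count
worker_class = "gevent"  # the Gevent worker class discussed above
timeout = 30             # seconds before Gunicorn kills a silent worker
```

A file like this would be loaded with `gunicorn -c gunicorn.conf.py app:app`, where `app:app` is a hypothetical application module. The tracer would also need to be enabled, for example by launching through `ddtrace-run`.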
After extensive testing, we found that the issue occurs sporadically in one specific scenario: endpoint A in the application sends an internal HTTP request to endpoint B using `urllib3`, and a `/shutdown` endpoint is later called to force the tracer to flush spans before exit.
Occasionally, this results in a worker timeout, and the trace spans are never sent.
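The scenario above can be sketched as a small framework-free WSGI application. This is only an approximation under stated assumptions: `make_app`, `fetch`, and `flush` are hypothetical names, the standard library's `urllib.request` stands in for `urllib3`, and `flush` marks the place where the real application would ask the tracer to flush its spans.

```python
import json
from urllib.request import urlopen  # stand-in: the affected app used urllib3


def make_app(base_url, fetch=None, flush=None):
    """Build a WSGI app with the three routes from the scenario."""

    def default_fetch(url):
        # Internal HTTP call from endpoint A to endpoint B.
        with urlopen(url, timeout=5) as resp:
            return resp.read().decode()

    fetch = fetch or default_fetch
    flush = flush or (lambda: None)  # hypothetical tracer-flush hook

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "/")
        if path == "/a":
            body = {"from_b": fetch(base_url + "/b")}
        elif path == "/b":
            body = {"ok": True}
        elif path == "/shutdown":
            flush()  # force pending spans out before the worker exits
            body = {"flushed": True}
        else:
            start_response("404 Not Found", [("Content-Type", "application/json")])
            return [b'{"error": "not found"}']
        start_response("200 OK", [("Content-Type", "application/json")])
        return [json.dumps(body).encode()]

    return app
```

Passing `fetch` and `flush` as parameters keeps the sketch testable without a running server. Under Gunicorn, `base_url` would point back at the same worker's listen address.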
Root Cause: Gevent monkey patching and IAST modules
IAST was relying on modules like `inspect`, `importlib`, and `subprocess`, which were not correctly released from memory. This led to conflicts between the in-memory versions of these modules and Gevent’s monkey patching mechanism.
✅ Solution 1: Early Initialization via `product.post_preload`
All IAST initialization logic (including AST-based propagation hooks and monkey patching of sink points) has been moved to the `post_preload` function so that it runs early enough during startup.
Additionally, we’ve removed modules like `importlib.metadata` from `sys.modules` to avoid potential conflicts with `gevent`.
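The idea of removing modules from `sys.modules` can be illustrated with a minimal sketch. The helper name `unload_modules` is hypothetical and this is not ddtrace's actual cleanup code. It only shows the general mechanism: once an entry is deleted, the next `import` of that name executes the module again, while objects that already hold a reference to the old module keep it alive.

```python
import sys
import types


def unload_modules(prefixes):
    """Remove matching modules (and their submodules) from sys.modules.

    Returns the sorted names that were removed.
    """
    removed = []
    for name in list(sys.modules):  # copy the keys: the dict is mutated below
        if any(name == p or name.startswith(p + ".") for p in prefixes):
            del sys.modules[name]
            removed.append(name)
    return sorted(removed)


# Demonstration with throwaway modules, so nothing real is unloaded here.
sys.modules["demo_pkg"] = types.ModuleType("demo_pkg")
sys.modules["demo_pkg.sub"] = types.ModuleType("demo_pkg.sub")
print(unload_modules(["demo_pkg"]))  # ['demo_pkg', 'demo_pkg.sub']
```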
✅ Solution 2: Avoid Late Imports from C Extensions
We also updated a line in the native C code that previously triggered a delayed import. That call caused `importlib.metadata.packages_distributions` to be lazily loaded in a way that could not be released or patched properly by Gevent, leading to sporadic blocking and timeouts.
To address this, we refactored the C module to expose a function, `set_packages_distributions_func`, which is now explicitly set from Python space. This gives us better control over when and how `importlib.metadata` is imported, ensuring compatibility with Gevent’s concurrency model.
✅ Solution 3: Simplifying Taint Sink Initialization in IAST
Finally, these changes also prompted a review of how IAST taint sinks were being loaded. Previously, we relied on a legacy pattern where all taint sink patches were triggered only when `hashlib` was imported.
However, this mechanism is no longer necessary. Each taint sink module already defines its own lazy instrumentation logic using `ModuleWatchdog`.
Since this approach watches for individual modules (e.g. `os`, `subprocess`, etc.) and applies patches independently as they are imported, there's no longer a need to defer sink initialization via a central import like `hashlib`.
Therefore, we’ve removed the old `when_imported("hashlib")(...)` setup and now invoke each taint sink’s patch function directly at startup.
This results in more robust and predictable IAST behavior across all use cases.
Summary
Gevent requires monkey patching to avoid blocking operations on the main thread.
IAST’s dynamic code instrumentation was interfering with this when not initialized early enough.
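We fixed the issue by moving IAST initialization into the `sitecustomize` preload hooks, avoiding late imports from the C extension, and patching taint sinks directly at startup.
This change improves stability and prevents data loss caused by worker timeouts during shutdown.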
Since this issue stems from modules used by `ddtrace` that weren't being properly released, adding an incorrect `from A import B` could bring back that flaky error.
To help debug the problem in case it resurfaces, we’ve introduced two private environment variables that proved very useful during investigation.
You’ll notice there are several `TODO`s documenting broken interactions and, in many cases, the tests hang indefinitely, just like what happens in `APPSEC-58276`.