Fix zombie process accumulation from git operations in cloud environments #1419
Conversation
Hey @krassowski, can you help review the PR?
Git commands spawn helper processes (git-credential-helper, git-remote-https, ssh) that become zombies when the main git process exits before children complete. This is problematic in cloud/container environments where the application runs as PID 1 or under minimal init systems that don't reliably reap orphaned processes. Added SIGCHLD signal handler to automatically reap zombie processes system-wide using non-blocking waitpid(), preventing resource leaks without affecting normal git operations.
Yes, it is related. The problem is that when, in enterprise settings, we run Jupyter in a container environment where PID 1 is not a signal-handling init process like tini, zombie process cleanup doesn't happen. Adding a SIGCHLD handler to reap the child processes covers those scenarios. This PR addresses those cases.
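To make the failure mode concrete: on Linux, zombie (defunct) processes show up with state `Z` in `/proc/<pid>/stat`. A small diagnostic sketch for spotting them (Linux-only; the helper name `find_zombies` is illustrative and not code from the PR):

```python
import os

def find_zombies():
    """Return (pid, comm) pairs for zombie processes visible in /proc.

    Linux-only diagnostic sketch: a process whose state field in
    /proc/<pid>/stat is "Z" is a zombie waiting to be reaped.
    """
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                data = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The comm field is wrapped in parentheses and may itself contain
        # spaces, so locate the state char relative to the closing paren.
        close = data.rindex(")")
        comm = data[data.index("(") + 1 : close]
        state = data[close + 2]
        if state == "Z":
            zombies.append((int(entry), comm))
    return zombies
```

Running this inside an affected container after a few authenticated git operations would show the accumulating `git-credential-*` / `ssh` entries.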
Thank you for opening this PR @nsingl00, and sorry for the late reply; I missed the notification after coming back from holidays.
I believe that `signal.SIGCHLD` will raise an `AttributeError` exception on Windows and break the extension; it is only available on Unix, as per the documentation: https://docs.python.org/3/library/signal.html#signal.SIGCHLD. At a minimum, this would need to be wrapped in some code sniffing whether `signal.SIGCHLD` is available.
In general, it feels like maybe there is something we could improve in the `execute()` function to make sure we avoid leaving zombies?
jupyterlab-git/jupyterlab_git/git.py
Lines 81 to 207 in 861374f
```python
async def execute(
    cmdline: "List[str]",
    cwd: "str",
    timeout: "float" = 20,
    env: "Optional[Dict[str, str]]" = None,
    username: "Optional[str]" = None,
    password: "Optional[str]" = None,
    is_binary=False,
) -> "Tuple[int, str, str]":
    """Asynchronously execute a command.

    Args:
        cmdline (List[str]): Command line to be executed
        cwd (Optional[str]): Current working directory
        env (Optional[Dict[str, str]]): Defines the environment variables for the new process
        username (Optional[str]): User name
        password (Optional[str]): User password
    Returns:
        (int, str, str): (return code, stdout, stderr)
    """

    async def call_subprocess_with_authentication(
        cmdline: "List[str]",
        username: "str",
        password: "str",
        cwd: "Optional[str]" = None,
        env: "Optional[Dict[str, str]]" = None,
    ) -> "Tuple[int, str, str]":
        try:
            p = pexpect.spawn(
                cmdline[0],
                cmdline[1:],
                cwd=cwd,
                env=env,
                encoding="utf-8",
                timeout=None,
            )
            # We expect a prompt from git
            # In most of cases git will prompt for username and
            # then for password
            # In some cases (Bitbucket) username is included in
            # remote URL, so git will not ask for username
            i = await p.expect(["Username for .*: ", "Password for .*:"], async_=True)
            if i == 0:  # ask for username then password
                p.sendline(username)
                await p.expect("Password for .*:", async_=True)
                p.sendline(password)
            elif i == 1:  # only ask for password
                p.sendline(password)
            await p.expect(pexpect.EOF, async_=True)
            response = p.before
            returncode = p.wait()
            p.close()
            return returncode, "", response
        except pexpect.exceptions.EOF:  # In case of pexpect failure
            response = p.before
            returncode = p.exitstatus
            p.close()  # close process
            return returncode, "", response

    def call_subprocess(
        cmdline: "List[str]",
        cwd: "Optional[str]" = None,
        env: "Optional[Dict[str, str]]" = None,
        is_binary=is_binary,
    ) -> "Tuple[int, str, str]":
        process = subprocess.Popen(
            cmdline, stdout=subprocess.PIPE, stderr=subprocess.PIPE, cwd=cwd, env=env
        )
        output, error = process.communicate()
        if is_binary:
            return (
                process.returncode,
                base64.encodebytes(output).decode("ascii"),
                error.decode("utf-8"),
            )
        else:
            return (process.returncode, output.decode("utf-8"), error.decode("utf-8"))

    try:
        await execution_lock.acquire(timeout=datetime.timedelta(seconds=timeout))
    except tornado.util.TimeoutError:
        return (1, "", "Unable to get the lock on the directory")
    try:
        # Ensure our execution operation will succeed by first checking and waiting for the lock to be removed
        time_slept = 0
        lockfile = os.path.join(cwd, ".git", "index.lock")
        while os.path.exists(lockfile) and time_slept < MAX_WAIT_FOR_LOCK_S:
            await tornado.gen.sleep(CHECK_LOCK_INTERVAL_S)
            time_slept += CHECK_LOCK_INTERVAL_S
        # If the lock still exists at this point, we will likely fail anyway, but let's try anyway
        get_logger().debug("Execute {!s} in {!s}.".format(cmdline, cwd))
        if username is not None and password is not None:
            code, output, error = await call_subprocess_with_authentication(
                cmdline,
                username,
                password,
                cwd,
                env,
            )
        else:
            current_loop = tornado.ioloop.IOLoop.current()
            code, output, error = await current_loop.run_in_executor(
                None, call_subprocess, cmdline, cwd, env
            )
        log_output = (
            output[:MAX_LOG_OUTPUT] + "..." if len(output) > MAX_LOG_OUTPUT else output
        )
        log_error = (
            error[:MAX_LOG_OUTPUT] + "..." if len(error) > MAX_LOG_OUTPUT else error
        )
        get_logger().debug(
            "Code: {}\nOutput: {}\nError: {}".format(code, log_output, log_error)
        )
    except BaseException as e:
        code, output, error = -1, "", traceback.format_exc()
        get_logger().warning("Fail to execute {!s}".format(cmdline), exc_info=True)
    finally:
        execution_lock.release()
    return code, output, error
```
Problem
Fixes #975 and #1418
Git commands spawn helper processes (git-credential-helper, git-remote-https, ssh, git-upload-pack, git-receive-pack) that can become zombie processes when the main git process exits before its children complete. This is particularly problematic in cloud/container environments, as described below.
Root Cause
In containerized environments, our Jupyter application often becomes PID 1 due to exec usage in startup scripts, making it responsible for zombie process cleanup. Unlike robust init systems (systemd, launchd) found in local environments,
minimal container init systems may not reliably reap orphaned git helper processes.
Solution
Added a SIGCHLD signal handler that automatically reaps zombie processes system-wide using non-blocking waitpid(), preventing resource leaks without affecting normal git operations.
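A minimal sketch of such a handler (assuming a POSIX platform; the exact code in the PR may differ):

```python
import os
import signal

def _reap_zombies(signum, frame):
    """SIGCHLD handler: reap any exited children without blocking."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)  # -1 means "any child"
        except ChildProcessError:
            break  # no children left to wait for
        if pid == 0:
            break  # children exist, but none have exited yet

# Guard the registration: SIGCHLD only exists on Unix.
if hasattr(signal, "SIGCHLD"):
    signal.signal(signal.SIGCHLD, _reap_zombies)
```

One design caveat worth noting: a process-wide reaper races with any code that later calls `Popen.wait()` on its own children, which may then see `ECHILD`; the Python `subprocess` module tolerates this, but it is one reason reaping is usually left to an init process such as tini when one is available.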
Testing
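One way to exercise the reaping behavior (a hypothetical pytest-style check, not necessarily what the PR ships) is to fork a short-lived child and confirm it gets reaped without an explicit `waitpid` from the test:

```python
import os
import signal
import time

def test_sigchld_reaps_exited_children():
    # Register a non-blocking reaper like the one described above (sketch).
    def reap(signum, frame):
        while True:
            try:
                pid, _ = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return
            if pid == 0:
                return

    old = signal.signal(signal.SIGCHLD, reap)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child exits immediately
        time.sleep(0.2)  # let the signal handler run
        try:
            os.waitpid(pid, os.WNOHANG)
            reaped = False  # still waitable: handler did not reap it
        except ChildProcessError:
            reaped = True  # already reaped by the handler
        assert reaped
    finally:
        signal.signal(signal.SIGCHLD, old)  # restore previous disposition
```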