Fix zombie process accumulation from git operations in cloud environments #1419
Conversation
Hey @krassowski, can you help review the PR?
Git commands spawn helper processes (git-credential-helper, git-remote-https, ssh) that become zombies when the main git process exits before children complete. This is problematic in cloud/container environments where the application runs as PID 1 or under minimal init systems that don't reliably reap orphaned processes. Added SIGCHLD signal handler to automatically reap zombie processes system-wide using non-blocking waitpid(), preventing resource leaks without affecting normal git operations.
Yes, it is related. The problem is that when, in enterprise settings, we run Jupyter in a container environment where PID 1 is not a signal-handling init process like tini, zombie process cleanup doesn't happen. Adding a SIGCHLD handler to reap the child processes covers those scenarios. This PR addresses those cases.
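To make the failure mode concrete: on Linux, zombie (defunct) processes show up with state `Z` in `/proc/<pid>/stat`. A small diagnostic sketch for spotting them (Linux-only; the helper name `find_zombies` is illustrative and not code from the PR):

```python
import os

def find_zombies():
    """Return (pid, comm) pairs for zombie processes visible in /proc.

    Linux-only diagnostic sketch: a process whose state field in
    /proc/<pid>/stat is "Z" is a zombie waiting to be reaped.
    """
    zombies = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                data = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # The comm field is wrapped in parentheses and may itself contain
        # spaces, so locate the state char relative to the closing paren.
        close = data.rindex(")")
        comm = data[data.index("(") + 1 : close]
        state = data[close + 2]
        if state == "Z":
            zombies.append((int(entry), comm))
    return zombies
```

Running this inside an affected container after a few authenticated git operations would show the accumulating `git-credential-*` / `ssh` entries.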
Thank you for opening this PR @nsingl00, and sorry for the late reply; I missed the notification after coming back from holidays.
I believe that `signal.SIGCHLD` will raise an `AttributeError` exception on Windows and break the extension; it is only available on Unix, as per the documentation: https://docs.python.org/3/library/signal.html#signal.SIGCHLD. At a minimum, this would need to be wrapped in some code sniffing whether `signal.SIGCHLD` is available.
In general, it feels like maybe there is something we could improve in the `execute()` function to make sure we avoid leaving zombies?
jupyterlab-git/jupyterlab_git/git.py
Lines 81 to 207 in 861374f
```python
async def execute(
    cmdline: "List[str]",
    cwd: "str",
    timeout: "float" = 20,
    env: "Optional[Dict[str, str]]" = None,
    username: "Optional[str]" = None,
    password: "Optional[str]" = None,
    is_binary=False,
) -> "Tuple[int, str, str]":
    """Asynchronously execute a command.

    Args:
        cmdline (List[str]): Command line to be executed
        cwd (Optional[str]): Current working directory
        env (Optional[Dict[str, str]]): Defines the environment variables for the new process
        username (Optional[str]): User name
        password (Optional[str]): User password
    Returns:
        (int, str, str): (return code, stdout, stderr)
    """

    async def call_subprocess_with_authentication(
        cmdline: "List[str]",
        username: "str",
        password: "str",
        cwd: "Optional[str]" = None,
        env: "Optional[Dict[str, str]]" = None,
    ) -> "Tuple[int, str, str]":
        try:
            p = pexpect.spawn(
                cmdline[0],
                cmdline[1:],
                cwd=cwd,
                env=env,
                encoding="utf-8",
                timeout=None,
            )
            # We expect a prompt from git
            # In most of cases git will prompt for username and
            # then for password
            # In some cases (Bitbucket) username is included in
            # remote URL, so git will not ask for username
            i = await p.expect(["Username for .*: ", "Password for .*:"], async_=True)
            if i == 0:  # ask for username then password
                p.sendline(username)
                await p.expect("Password for .*:", async_=True)
                p.sendline(password)
            elif i == 1:  # only ask for password
                p.sendline(password)
            await p.expect(pexpect.EOF, async_=True)
            response = p.before
            returncode = p.wait()
            p.close()
            return returncode, "", response
        except pexpect.exceptions.EOF:  # In case of pexpect failure
            response = p.before
            returncode = p.exitstatus
            p.close()  # close process
            return returncode, "", response

    def call_subprocess(
        cmdline: "List[str]",
        cwd: "Optional[str]" = None,
        env: "Optional[Dict[str, str]]" = None,
        is_binary=is_binary,
    ) -> "Tuple[int, str, str]":
        process = subprocess.Popen(
            cmdline, stdout=subprocess.PIPE, stderr=subprocess.PIPE, cwd=cwd, env=env
        )
        output, error = process.communicate()
        if is_binary:
            return (
                process.returncode,
                base64.encodebytes(output).decode("ascii"),
                error.decode("utf-8"),
            )
        else:
            return (process.returncode, output.decode("utf-8"), error.decode("utf-8"))

    try:
        await execution_lock.acquire(timeout=datetime.timedelta(seconds=timeout))
    except tornado.util.TimeoutError:
        return (1, "", "Unable to get the lock on the directory")
    try:
        # Ensure our execution operation will succeed by first checking and waiting for the lock to be removed
        time_slept = 0
        lockfile = os.path.join(cwd, ".git", "index.lock")
        while os.path.exists(lockfile) and time_slept < MAX_WAIT_FOR_LOCK_S:
            await tornado.gen.sleep(CHECK_LOCK_INTERVAL_S)
            time_slept += CHECK_LOCK_INTERVAL_S
        # If the lock still exists at this point, we will likely fail anyway, but let's try anyway
        get_logger().debug("Execute {!s} in {!s}.".format(cmdline, cwd))
        if username is not None and password is not None:
            code, output, error = await call_subprocess_with_authentication(
                cmdline,
                username,
                password,
                cwd,
                env,
            )
        else:
            current_loop = tornado.ioloop.IOLoop.current()
            code, output, error = await current_loop.run_in_executor(
                None, call_subprocess, cmdline, cwd, env
            )
        log_output = (
            output[:MAX_LOG_OUTPUT] + "..." if len(output) > MAX_LOG_OUTPUT else output
        )
        log_error = (
            error[:MAX_LOG_OUTPUT] + "..." if len(error) > MAX_LOG_OUTPUT else error
        )
        get_logger().debug(
            "Code: {}\nOutput: {}\nError: {}".format(code, log_output, log_error)
        )
    except BaseException as e:
        code, output, error = -1, "", traceback.format_exc()
        get_logger().warning("Fail to execute {!s}".format(cmdline), exc_info=True)
    finally:
        execution_lock.release()
    return code, output, error
```
Problem
Fixes #975 and #1418
Git commands spawn helper processes (git-credential-helper, git-remote-https, ssh, git-upload-pack, git-receive-pack) that can become zombie processes when the main git process exits before its children complete. This is particularly problematic in cloud/container environments, as described below.
Root Cause
In containerized environments, our Jupyter application often becomes PID 1 due to exec usage in startup scripts, making it responsible for zombie process cleanup. Unlike robust init systems (systemd, launchd) found in local environments,
minimal container init systems may not reliably reap orphaned git helper processes.
Solution
Added a SIGCHLD signal handler that automatically reaps zombie processes system-wide using non-blocking waitpid(), preventing resource leaks without affecting normal git operations.
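A minimal sketch of such a handler (assuming a POSIX platform; the exact code in the PR may differ):

```python
import os
import signal

def _reap_zombies(signum, frame):
    """SIGCHLD handler: reap any exited children without blocking."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)  # -1 means "any child"
        except ChildProcessError:
            break  # no children left to wait for
        if pid == 0:
            break  # children exist, but none have exited yet

# Guard the registration: SIGCHLD only exists on Unix.
if hasattr(signal, "SIGCHLD"):
    signal.signal(signal.SIGCHLD, _reap_zombies)
```

One design caveat worth noting: a process-wide reaper races with any code that later calls `Popen.wait()` on its own children, which may then see `ECHILD`; the Python `subprocess` module tolerates this, but it is one reason reaping is usually left to an init process such as tini when one is available.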
Testing
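One way to exercise the reaping behavior (a hypothetical pytest-style check, not necessarily what the PR ships) is to fork a short-lived child and confirm it gets reaped without an explicit `waitpid` from the test:

```python
import os
import signal
import time

def test_sigchld_reaps_exited_children():
    # Register a non-blocking reaper like the one described above (sketch).
    def reap(signum, frame):
        while True:
            try:
                pid, _ = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                return
            if pid == 0:
                return

    old = signal.signal(signal.SIGCHLD, reap)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child exits immediately
        time.sleep(0.2)  # let the signal handler run
        try:
            os.waitpid(pid, os.WNOHANG)
            reaped = False  # still waitable: handler did not reap it
        except ChildProcessError:
            reaped = True  # already reaped by the handler
        assert reaped
    finally:
        signal.signal(signal.SIGCHLD, old)  # restore previous disposition
```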