Skip to content

Conversation

squell
Copy link
Member

@squell squell commented May 10, 2025

The problem with #1129 was two-fold:

  1. When the monitor shuts down, it takes the pseudoterminal it controls with it. So we end up in a condition where the parent is racing with the monitor to read remaining data before the monitor crosses the finish line.

This could be demonstrated by adding a sleep in the monitor, but a better fix is simply to use the existing backchannel to synchronise parent and monitor. I re-used the old ExecCommand message for this, as that was already used to handle a similar race at the start.

So basically the backchannel protocol is: parent sends an Edge to signal the monitor it can go (this was already the case), do the usual thing while the monitor and child are doing their dance, and then again when the monitor has communicated that it is ended, let it wait for another Edge from the parent.

  1. Secondly, the monitor is sending the "the child process has terminated" signal to the parent via backchannel communication, but the child is communicating itself the data over a pipe. It so happens that its final data is still in flight but overtaken by the backchannel.

The fix: in the flush_left function, also read the right side to completion.

@squell squell added the C-exec Execution component (interfacing with OS) label May 10, 2025
@squell squell changed the title Fix race between parent and monitor. Fix race between parent and monitor in use_pty May 10, 2025
@squell
Copy link
Member Author

squell commented May 10, 2025

By the way, I have left b156eb9 in for illustration purposes only. That fix alone takes care of the problem "in practice". We can drop that one before it is merged.

@squell squell requested a review from pvdrz May 10, 2025 14:10
@squell squell linked an issue May 10, 2025 that may be closed by this pull request
@squell squell enabled auto-merge (rebase) May 10, 2025 14:14
@trifectatechfoundation trifectatechfoundation locked as spam and limited conversation to collaborators May 10, 2025
@trifectatechfoundation trifectatechfoundation unlocked this conversation May 11, 2025
@squell squell disabled auto-merge May 11, 2025 20:51
@squell
Copy link
Member Author

squell commented May 12, 2025

This fixes the problem on the FreeBSD machine I tested, but not on the Debian trixie machine.

@squell squell marked this pull request as draft May 12, 2025 11:59
@squell squell marked this pull request as ready for review May 12, 2025 18:29
@squell squell marked this pull request as draft May 13, 2025 22:46
@pvdrz
Copy link
Collaborator

pvdrz commented May 19, 2025

I checked the changes and they totally make sens to me. However I was unable to reproduce this bug on my system so I don't have a proper way to see it working.

@squell
Copy link
Member Author

squell commented Aug 4, 2025

While this fixes the problem on BSD for me, @rnijveld reported that it still happened for him even with this patch set. So that's why we haven't merged this yet.

@squell squell marked this pull request as ready for review August 25, 2025 17:54
@squell squell linked an issue Sep 11, 2025 that may be closed by this pull request
@squell squell marked this pull request as draft September 12, 2025 08:44
@squell squell marked this pull request as ready for review September 12, 2025 13:16
@squell
Copy link
Member Author

squell commented Sep 12, 2025

As for the final commit: the I/O subsystem doesn't know that the end of a pipe is at hand, since the child has stopped responding (but is being kept in suspended animation to avoid a "broken pipe").

So the solution that I used is to put the right end of the pipe in nonblocking mode, and then read until the OS indicates that a read on it will block (which signals the true end of the stream).

As far as I can tell, all the rest of our code will work fine with the Pty leader in nonblocking mode, but to avoid any changes in that part, I've only set it to 'nonblocking' before the final loop.

Since we're not going to be doing anything with the pipe after that, that is a conservative choice.

@squell
Copy link
Member Author

squell commented Sep 16, 2025

I have investigated whether the "red light" synchronization mechanism is still necessary or merely reading from the Pty until exhaustion is enough. But on FreeBSD I've found that just the last two commits don't fix the problem. So we can't drop those.

@squell
Copy link
Member Author

squell commented Sep 16, 2025

Seeing as this PR has been confirmed by @rnijveld to solve the problem and that the synch mechanism was also informally approved by @pvdrz I think we should now take this changeset. 🎈

@squell squell merged commit a7dba9e into main Sep 16, 2025
17 checks passed
@squell squell deleted the race-to-finish branch September 16, 2025 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-exec Execution component (interfacing with OS)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

In use_pty mode sometimes output is missed.
3 participants