defunct processes, the possible explanation #74
introduction
The following explanation focuses on using csi-s3 with goofys as a backend. All components are at their latest version.
The issue I stumbled upon is that goofys zombie processes accumulate.
How many there are does not matter for understanding the problem.
explanation
I looked at the csi-s3 code, more specifically at the FuseUnmount function and then at waitForProcess:
Lines 133 to 156 in ddbd6fd
    func waitForProcess(p *os.Process, backoff int) error {
        if backoff == 20 {
            return fmt.Errorf("Timeout waiting for PID %v to end", p.Pid)
        }
        cmdLine, err := getCmdLine(p.Pid)
        if err != nil {
            glog.Warningf("Error checking cmdline of PID %v, assuming it is dead: %s", p.Pid, err)
            return nil
        }
        if cmdLine == "" {
            // ignore defunct processes
            // TODO: debug why this happens in the first place
            // seems to only happen on k8s, not on local docker
            glog.Warning("Fuse process seems dead, returning")
            return nil
        }
        if err := p.Signal(syscall.Signal(0)); err != nil {
            glog.Warningf("Fuse process does not seem active or we are unprivileged: %s", err)
            return nil
        }
        glog.Infof("Fuse process with PID %v still active, waiting...", p.Pid)
        time.Sleep(time.Duration(backoff*100) * time.Millisecond)
        return waitForProcess(p, backoff+1)
    }
Given the name of the function, I expected to see a wait4 syscall that reaps the child process, in our case goofys.
If we look at the outputs below:
- we have a goofys zombie process with pid=32767

    $ ps aux | grep goofys
    root 32767 0.0 0.0 0 0 ? Zs Jun14 0:00 [goofys] <defunct>

- its parent process is the s3driver

    $ pstree -s 32767
    systemd───containerd-shim───s3driver───goofys

Since s3driver launches the goofys backend (I guess the same holds for the other backends 🤷🏼♂️), s3driver is the parent process. As a good parent 😃, it should wait4 its child to collect its exit status.
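To make the parent/child point concrete, here is a minimal Linux-only sketch, not taken from csi-s3. It starts a short-lived child (`true`, chosen only for illustration), observes that the child sits in state `Z` with an empty `/proc/<pid>/cmdline` until the parent waits on it, and that the entry disappears once it is reaped. The helpers `procState` and `zombieDemo` are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// procState returns the one-letter process state from /proc/<pid>/stat
// ("Z" means zombie). Hypothetical helper, Linux only.
func procState(pid int) (string, error) {
	data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return "", err
	}
	// Layout is "pid (comm) state ..."; comm may contain spaces,
	// so parse from the last closing parenthesis.
	s := string(data)
	i := strings.LastIndex(s, ")")
	if i < 0 {
		return "", fmt.Errorf("unexpected stat format for PID %d", pid)
	}
	fields := strings.Fields(s[i+1:])
	if len(fields) == 0 {
		return "", fmt.Errorf("missing state field for PID %d", pid)
	}
	return fields[0], nil
}

// zombieDemo starts a short-lived child and returns its state and cmdline
// before it is reaped, plus whether /proc/<pid> is gone after reaping.
func zombieDemo() (state, cmdline string, gone bool, err error) {
	cmd := exec.Command("true") // assumption: "true" is on PATH
	if err = cmd.Start(); err != nil {
		return
	}
	pid := cmd.Process.Pid

	// Poll (up to about 2s) until the child has exited but is not yet reaped.
	for i := 0; i < 200; i++ {
		if state, err = procState(pid); err != nil {
			return
		}
		if state == "Z" {
			break
		}
		time.Sleep(10 * time.Millisecond)
	}

	// A zombie has no command line left.
	raw, err := os.ReadFile(fmt.Sprintf("/proc/%d/cmdline", pid))
	if err != nil {
		return
	}
	cmdline = string(raw)

	// Waiting on the child reaps it; its /proc entry should then vanish.
	_ = cmd.Wait()
	_, statErr := os.Stat(fmt.Sprintf("/proc/%d", pid))
	gone = os.IsNotExist(statErr)
	return
}

func main() {
	state, cmdline, gone, err := zombieDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("state before Wait: %s, cmdline empty: %t, gone after Wait: %t\n",
		state, cmdline == "", gone)
}
```

The empty `cmdline` seen here for a zombie is consistent with the empty `cmdLine` branch in `waitForProcess`, assuming `getCmdLine` reads `/proc/<pid>/cmdline`.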
In other words, terminated children are never reaped, so they leak as zombies. The fix should be trivial: in waitForProcess, when cmdLine is empty, we have to call syscall.Wait4 on the given pid.
Lines 142 to 148 in ddbd6fd
    if cmdLine == "" {
        // ignore defunct processes
        // TODO: debug why this happens in the first place
        // seems to only happen on k8s, not on local docker
        glog.Warning("Fuse process seems dead, returning")
        return nil
    }
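A rough sketch of what the proposed fix could look like, under stated assumptions: `reapIfExited` is a hypothetical helper, not part of csi-s3, and it only works when the caller is the actual parent of the PID (otherwise `Wait4` returns `ECHILD`). It uses `syscall.WNOHANG` so the call never blocks. `spawnAndReap` exists only to exercise the helper with a throwaway child.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// reapIfExited tries to collect the exit status of child pid without blocking.
// It returns true if the child had exited and has now been reaped.
// Hypothetical helper: the caller must be the parent of pid.
func reapIfExited(pid int) (bool, error) {
	var status syscall.WaitStatus
	wpid, err := syscall.Wait4(pid, &status, syscall.WNOHANG, nil)
	if err != nil {
		return false, err // e.g. ECHILD if pid is not our child
	}
	// With WNOHANG, Wait4 returns 0 while the child is still running.
	return wpid == pid, nil
}

// spawnAndReap starts a short-lived child and polls reapIfExited
// (up to about 2s) until the child has been reaped.
func spawnAndReap() (bool, error) {
	cmd := exec.Command("true") // assumption: "true" is on PATH
	if err := cmd.Start(); err != nil {
		return false, err
	}
	pid := cmd.Process.Pid
	for i := 0; i < 200; i++ {
		reaped, err := reapIfExited(pid)
		if err != nil || reaped {
			return reaped, err
		}
		time.Sleep(10 * time.Millisecond)
	}
	return false, nil
}

func main() {
	reaped, err := spawnAndReap()
	fmt.Println("reaped:", reaped, "err:", err)
}
```

In `waitForProcess`, a call like this would go in the `cmdLine == ""` branch before returning. Since the function already holds an `*os.Process`, calling `p.Wait()` there may be a simpler alternative, provided nothing else waits on the same process.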
wdyt @ctrox?