
Docker client gets hung due to zombie connection, does not recover #1188

@gbhat618

Description


Jenkins and plugins versions report

Noticed in the latest version of Jenkins (2.528.3) and of this plugin (1308.vff6e33248305).
When connecting via the public IP of the Docker host VM, if for some reason the connection on the controller becomes a zombie (i.e. the socket is still present on the controller, but no longer exists on the Docker host), build triggers get stuck contacting the Docker host:

Thread dump:
"jenkins.util.Timer [#5]" #68 [103] daemon prio=5 os_prio=0 cpu=1486.31ms elapsed=7407.21s tid=0x000078fc40004630 nid=103 runnable  [0x000078fce68fb000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.Net.poll(java.base/Native Method)
	at sun.nio.ch.NioSocketImpl.park(java.base/NioSocketImpl.java:191)
	at sun.nio.ch.NioSocketImpl.timedFinishConnect(java.base/NioSocketImpl.java:548)
	at sun.nio.ch.NioSocketImpl.connect(java.base/NioSocketImpl.java:592)
	at java.net.SocksSocketImpl.connect(java.base/SocksSocketImpl.java:327)
	at java.net.Socket.connect(java.base/Socket.java:751)
	at org.apache.hc.client5.http.impl.io.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:205)
	at org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:490)
	at org.apache.hc.client5.http.impl.classic.InternalExecRuntime.connectEndpoint(InternalExecRuntime.java:164)
	at org.apache.hc.client5.http.impl.classic.InternalExecRuntime.connectEndpoint(InternalExecRuntime.java:174)
	at org.apache.hc.client5.http.impl.classic.ConnectExec.execute(ConnectExec.java:144)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement$$Lambda/0x000078fcee157b58.proceed(Unknown Source)
	at org.apache.hc.client5.http.impl.classic.ProtocolExec.execute(ProtocolExec.java:195)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement$$Lambda/0x000078fcee157b58.proceed(Unknown Source)
	at org.apache.hc.client5.http.impl.classic.ContentCompressionExec.execute(ContentCompressionExec.java:150)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement$$Lambda/0x000078fcee157b58.proceed(Unknown Source)
	at org.apache.hc.client5.http.impl.classic.HttpRequestRetryExec.execute(HttpRequestRetryExec.java:113)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement$$Lambda/0x000078fcee157b58.proceed(Unknown Source)
	at org.apache.hc.client5.http.impl.classic.RedirectExec.execute(RedirectExec.java:110)
	at org.apache.hc.client5.http.impl.classic.ExecChainElement.execute(ExecChainElement.java:51)
	at org.apache.hc.client5.http.impl.classic.InternalHttpClient.doExecute(InternalHttpClient.java:185)
	at org.apache.hc.client5.http.impl.classic.CloseableHttpClient.execute(CloseableHttpClient.java:87)
	at org.apache.hc.client5.http.impl.classic.CloseableHttpClient.execute(CloseableHttpClient.java:55)
	at org.apache.hc.client5.http.classic.HttpClient.executeOpen(HttpClient.java:183)
	at com.github.dockerjava.httpclient5.ApacheDockerHttpClientImpl.execute(ApacheDockerHttpClientImpl.java:189)
	at com.github.dockerjava.httpclient5.ApacheDockerHttpClient.execute(ApacheDockerHttpClient.java:9)
	at com.github.dockerjava.core.DefaultInvocationBuilder.execute(DefaultInvocationBuilder.java:228)
	at com.github.dockerjava.core.DefaultInvocationBuilder.get(DefaultInvocationBuilder.java:202)
	at com.github.dockerjava.core.DefaultInvocationBuilder.get(DefaultInvocationBuilder.java:74)
	at com.github.dockerjava.core.exec.ListContainersCmdExec.execute(ListContainersCmdExec.java:44)
	at com.github.dockerjava.core.exec.ListContainersCmdExec.execute(ListContainersCmdExec.java:15)
	at com.github.dockerjava.core.exec.AbstrSyncDockerCmdExec.exec(AbstrSyncDockerCmdExec.java:21)
	at com.github.dockerjava.core.command.AbstrDockerCmd.exec(AbstrDockerCmd.java:33)
	at com.nirima.jenkins.plugins.docker.DockerCloud.countContainersInDocker(DockerCloud.java:638)
	at com.nirima.jenkins.plugins.docker.DockerCloud.canAddProvisionedAgent(DockerCloud.java:656)
	at com.nirima.jenkins.plugins.docker.DockerCloud.provision(DockerCloud.java:394)
	- locked <0x000000069217bb88> (a com.nirima.jenkins.plugins.docker.DockerCloud)
	at io.jenkins.docker.FastNodeProvisionerStrategy.applyToCloud(FastNodeProvisionerStrategy.java:71)
	at io.jenkins.docker.FastNodeProvisionerStrategy.apply(FastNodeProvisionerStrategy.java:41)
	at hudson.slaves.NodeProvisioner.update(NodeProvisioner.java:327)
	at hudson.slaves.NodeProvisioner.lambda$suggestReviewNow$4(NodeProvisioner.java:199)
	at hudson.slaves.NodeProvisioner$$Lambda/0x000078fcedd2ea28.run(Unknown Source)
	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
	at java.util.concurrent.Executors$RunnableAdapter.call(java.base/Executors.java:572)
	at java.util.concurrent.FutureTask.run(java.base/FutureTask.java:317)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base/ScheduledThreadPoolExecutor.java:304)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base/ThreadPoolExecutor.java:642)
	at java.lang.Thread.runWith(java.base/Thread.java:1596)
	at java.lang.Thread.run(java.base/Thread.java:1583)

   Locked ownable synchronizers:
	- <0x000000068c284050> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	- <0x000000068ea724c8> (a java.util.concurrent.ThreadPoolExecutor$Worker)
	- <0x00000006ac631450> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)

No matching connections were seen on the Docker host VM when checked with netstat -natup.

In this case the configured connectionTimeout appeared to be ineffective.

What Operating System are you using (both controller, and any agents involved in the problem)?

The controller was CloudBees CI running in Kubernetes on RHEL 9; the Docker host was Debian 12.

Reproduction steps

It's hard to reproduce, since the JVM must still be waiting on the connection while the Docker host (the other side) has already dropped it.

Tried the following on the Docker host, to drop the daemon's outgoing packets toward the controller:

sudo apt install iptables iptables-persistent
sudo iptables -A OUTPUT -p tcp -d <ip of the controller> --sport 2375 -j DROP

But this does not reproduce the issue reliably.
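Another local repro idea (a sketch of my own, relying on Linux TCP backlog behaviour, not something from the original report): a listener whose accept backlog is full silently drops further SYNs, so a client connect hangs exactly as it does against a black-holed host, and only a connect timeout unblocks it:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.List;

public class BacklogRepro {
    // Returns true once a connect attempt hangs and is rescued by its timeout.
    public static boolean reproduce() throws IOException {
        // Listener with a backlog of 1 that we never accept() from.
        try (ServerSocket server = new ServerSocket(0, 1)) {
            int port = server.getLocalPort();
            List<Socket> fillers = new ArrayList<>();
            try {
                for (int i = 0; i < 10; i++) {
                    Socket s = new Socket();
                    try {
                        s.connect(new InetSocketAddress("127.0.0.1", port), 500);
                        fillers.add(s); // accept queue still had room
                    } catch (SocketTimeoutException e) {
                        s.close();
                        return true; // SYN silently dropped: the hang, made visible
                    }
                }
                return false; // backlog never filled on this OS
            } finally {
                for (Socket s : fillers) s.close();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(reproduce() ? "connect timed out as expected" : "no timeout");
    }
}
```

This only demonstrates the connect-phase hang seen in the thread dump; it does not simulate an already-established connection going dead.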

Expected Results

Some kind of timeout should unblock the stuck provision method.
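One possible shape for such a timeout (a hedged sketch, not the plugin's actual code; the wrapper name and the deadline value are invented here): bound the blocking Docker call with a Future deadline, so the provisioner thread is released even when the socket never errors out.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedCall {
    // Hypothetical wrapper: run a blocking call with a hard deadline so a
    // zombie connection cannot pin the caller (e.g. DockerCloud.provision).
    static <T> T callWithDeadline(Callable<T> task, long millis) throws Exception {
        ExecutorService ex = Executors.newSingleThreadExecutor();
        try {
            Future<T> f = ex.submit(task);
            try {
                return f.get(millis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true); // interrupt the stuck worker
                throw e;
            }
        } finally {
            ex.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            // Stand-in for a Docker API call stuck on a zombie connection.
            callWithDeadline(() -> { Thread.sleep(60_000); return 0; }, 200);
        } catch (TimeoutException e) {
            System.out.println("unblocked by deadline");
        }
    }
}
```

Caveat: cancel(true) only interrupts the worker, and a plain java.net socket connect/read is not interruptible, so the worker thread may stay parked until the OS gives up; the point is that the provisioner itself is no longer pinned.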

Actual Results

Stuck waiting for the other side of a zombie connection.

Anything else?

Currently there seems to be no way to set SO_TIMEOUT (a socket read timeout) to detect a dead connection.
Perhaps it could be set at https://github.com/docker-java/docker-java/blob/faa88e16460a8cb321c9695cdbc34cb7a662458e/docker-java-transport-httpclient5/src/main/java/com/github/dockerjava/httpclient5/ApacheDockerHttpClientImpl.java#L117-L122 ?
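For reference, the SO_TIMEOUT mechanism asked for above can be shown with plain java.net (a minimal sketch; how it would actually be wired into the linked ApacheDockerHttpClientImpl, e.g. via Apache HttpClient 5 socket configuration, is left to the maintainers):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class SoTimeoutDemo {
    // Returns true when SO_TIMEOUT unblocks a read from a silent peer.
    public static boolean readTimesOut() throws IOException {
        try (ServerSocket server = new ServerSocket(0)) {
            try (Socket client = new Socket("127.0.0.1", server.getLocalPort());
                 Socket peer = server.accept()) {
                client.setSoTimeout(300); // SO_TIMEOUT: bound every blocking read
                try {
                    client.getInputStream().read(); // peer never writes a byte
                    return false;
                } catch (SocketTimeoutException e) {
                    return true; // dead/silent connection detected in 300 ms
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTimesOut() ? "read timed out" : "read returned");
    }
}
```

With such a timeout, a read against a connection the Docker host has silently dropped would fail fast instead of blocking indefinitely.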

Are you interested in contributing a fix?

No response
