Contributor

@kolyshkin kolyshkin commented Oct 12, 2025

This currently includes #4930 (and serves as a test for it). Draft until that one is merged.

This currently includes #4951 and is therefore a draft until #4951 is merged.

Inspired by the discussion in #4905.

If the early stage of runc init (nsenter) fails for some reason, it
logs the error(s) at the FATAL log level, via bail().

The runc init log is read by the parent (runc create/run/exec) and
logged via the normal logrus mechanism. That is all fine and dandy, except
that when runc init fails, we return the error from the parent, which is
usually not too helpful. For example:

runc run failed: unable to start container process: can't get final child's PID from pipe: EOF

Now, the actual underlying error comes from runc init and was logged
earlier; here is what the full runc output looks like:

FATA[0000] nsexec-1[3247792]: failed to unshare remaining namespaces: No space left on device
FATA[0000] nsexec-0[3247790]: failed to sync with stage-1: next state: Success
ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF

The problem is that upper-level runtimes tend to ignore everything except
the last line from runc, and thus the error reported by e.g. docker is not
very helpful.

This patch tries to improve the situation by collecting FATAL errors
from runc init and appending them to the returned error (instead of
logging them). With it, the above error will look like this:

ERRO[0000] runc run failed: unable to start container process: can't get final child's PID from pipe: EOF; runc init error(s): nsexec-1[141549]: failed to unshare remaining namespaces: No space left on device; nsexec-0[141547]: failed to sync with stage-1: next state: Success

Yes, it is long and ugly, but at least the upper-level runtime will
report it.

Fixes: #4905

@kolyshkin kolyshkin force-pushed the better-init-errors branch 2 times, most recently from 08fb065 to 0200b76 Compare October 13, 2025 19:01
@kolyshkin kolyshkin marked this pull request as draft October 13, 2025 22:41
@kolyshkin kolyshkin marked this pull request as ready for review October 14, 2025 00:05
@kolyshkin kolyshkin marked this pull request as draft October 14, 2025 18:47
@kolyshkin kolyshkin marked this pull request as ready for review October 15, 2025 23:02
@kolyshkin kolyshkin requested review from AkihiroSuda, cyphar, lifubang and rata and removed request for cyphar October 15, 2025 23:02
@kolyshkin kolyshkin force-pushed the better-init-errors branch 2 times, most recently from abf4958 to ef31851 Compare October 24, 2025 01:48
@kolyshkin kolyshkin marked this pull request as draft October 25, 2025 03:50
Member

rata commented Oct 29, 2025

@kolyshkin The extra patch (the one not present in the other mentioned PRs) LGTM. But would that print the libcrypto issue? I mean, is the Go panic forwarded?

This panic you posted in this issue, for example: #4916 (comment)

It seems packages.microsoft.com is down now, so I can't easily test it myself (yeah, I'm sending some messages, but they are probably aware already :)). If you still have that install handy, it would be great if you could test it :)

@kolyshkin kolyshkin requested a review from lifubang October 29, 2025 16:55
@kolyshkin kolyshkin marked this pull request as ready for review October 29, 2025 16:55
Contributor Author

> @kolyshkin The extra patch (the one not present in the other mentioned PRs) LGTM. But would that print the libcrypto issue? I mean, is the Go panic forwarded?

Alas, no. This PR is about the C code of runc init (i.e. libct/nsenter).

You can emulate the libcrypto error by adding a "panic" call to libcontainer.Init, and then think of ways to catch that in the parent. I thought about it a bit and haven't found an easy way to catch it. This is because we redirect runc init stdout/stderr to our own stdout/stderr.


Development

Successfully merging this pull request may close these issues.

Error when starting the containers: "can't get final child's PID from pipe"
