[SPARK-53296][CORE] ESS exit main thread in case boss thread exits #52050

AmandeepSingh285 · 2025-08-16T14:20:10Z

What changes were proposed in this pull request?

This PR adds a custom BossThreadFactory that allows registering an onUncaughtException handler for ESS boss threads.

Specifically:

1. Introduced a new BossThreadFactory (extending Netty’s DefaultThreadFactory) to capture uncaught exceptions in boss threads.
2. Added logic in ESS to handle boss thread failures by:
   a. Logging the uncaught exception,
   b. Stopping the shuffle service,
   c. Counting down the shutdown barrier, and
   d. Exiting the JVM with a non-zero code to trigger a process restart.
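
The failure-handling sequence above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: SketchBossThreadFactory and BossFailureSketch are hypothetical names, the factory uses only the JDK instead of extending Netty's DefaultThreadFactory, the "stop" step is a placeholder, and the exit action is passed in as a function so the sketch can run without terminating the JVM.

```scala
import java.util.concurrent.{CountDownLatch, ThreadFactory, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the PR's BossThreadFactory (JDK only, no Netty).
class SketchBossThreadFactory(
    prefix: String,
    daemon: Boolean,
    onFailure: Throwable => Unit) extends ThreadFactory {

  private val counter = new AtomicInteger(0)

  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
    t.setDaemon(daemon)
    // Invoked on the dying thread when an exception escapes run().
    t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(thread: Thread, e: Throwable): Unit = onFailure(e)
    })
    t
  }
}

object BossFailureSketch {
  // Starts a failing "boss" thread and returns the handler's steps in order.
  // `exit` is an assumption of this sketch; a standalone process might pass
  // `code => sys.exit(code)`.
  def simulate(exit: Int => Unit): List[String] = {
    val steps = ArrayBuffer.empty[String]
    val barrier = new CountDownLatch(1)

    val onFailure: Throwable => Unit = { e =>
      steps += s"log: ${e.getMessage}" // a. log the uncaught exception
      steps += "stop"                  // b. placeholder for stopping the shuffle service
      barrier.countDown()              // c. release whoever waits on the shutdown barrier
      steps += "countDown"
      exit(1)                          // d. non-zero exit code
    }

    val factory = new SketchBossThreadFactory("sketch-boss", true, onFailure)
    val t = factory.newThread(new Runnable {
      override def run(): Unit = throw new RuntimeException("boss thread died")
    })
    t.start()
    t.join() // the handler has finished once the thread has terminated
    val released = barrier.await(1, TimeUnit.SECONDS)
    assert(released, "shutdown barrier was not released")
    steps.toList
  }
}
```

Passing the exit action in as a parameter is a design choice of this sketch only. It keeps the example testable, and it would let an embedding process choose a different reaction than exiting the JVM.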

With this change, if the boss thread dies because of an OOM or another fatal error, the main thread no longer stays alive in a broken state. Instead, the ESS process exits and is restarted by its runtime environment, so the host becomes usable again for shuffle operations.

Why are the changes needed?

In the current ESS implementation, if the boss thread encounters an OOM error and is killed, the process enters a degraded state where new connections cannot be established on the affected ESS hosts. This happens because the main thread continues running even after the boss thread has terminated.

This pull request terminates the main thread when the boss thread is killed. As a result, the JVM exits and the ESS process is restarted automatically, restoring the host's ability to serve shuffle-related operations.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Created a new thread through BossThreadFactory and threw an exception from it:

val threadFactory = new BossThreadFactory("testThread", true, essServerStopFunc)
val t = threadFactory.newThread(new Runnable {
  override def run(): Unit = {
    logInfo("Throwing exception in test thread")
    throw new RuntimeException("This is a test exception to check if the boss thread " +
      "is properly handling uncaught exceptions.")
  }
})
t.start()

With the proposed changes, when the boss thread was killed, the main thread exited as well. The JVM exited and the process was restarted automatically, so ESS was not left in a degraded state.

Was this patch authored or co-authored using generative AI tooling?

No

mridulm

I did not look at the PR in detail, but note that ESS runs within the NodeManager in YARN, so we cannot exit the VM.

Raised fix for ESS boss thread

82f25be

github-actions bot added the CORE label Aug 16, 2025

HyukjinKwon changed the title ~~[SPARK-53296] ESS exit main thread in case boss thread exits~~ [SPARK-53296][CORE] ESS exit main thread in case boss thread exits Aug 18, 2025

mridulm reviewed Aug 24, 2025
