AmandeepSingh285

What changes were proposed in this pull request?

This PR adds a custom BossThreadFactory that allows registering an onUncaughtException handler for ESS boss threads.

Specifically:

  1. Introduced a new BossThreadFactory (extending Netty’s DefaultThreadFactory) to capture uncaught exceptions in boss threads.
  2. Added logic in ESS to handle boss thread failures by:
    a. Logging the uncaught exception,
    b. Stopping the shuffle service,
    c. Counting down the shutdown barrier, and
    d. Exiting the JVM with a non-zero code to trigger process restart.
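
The steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `SketchBossThreadFactory` and `ShutdownSketch.failureHandler` are hypothetical names, a plain `java.util.concurrent.ThreadFactory` stands in for Netty's `DefaultThreadFactory` so the example runs without Netty, and the exit action is passed in as a function so that it can be replaced in tests (in a real process it would presumably be `System.exit`).

```scala
import java.util.concurrent.{CountDownLatch, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the PR's BossThreadFactory. The real class extends
// Netty's DefaultThreadFactory; a plain ThreadFactory is assumed here.
class SketchBossThreadFactory(
    poolName: String,
    daemon: Boolean,
    onBossThreadFailure: Throwable => Unit) extends ThreadFactory {

  private val nextId = new AtomicInteger(0)

  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, s"$poolName-${nextId.incrementAndGet()}")
    t.setDaemon(daemon)
    // Invoked on the dying thread when an exception escapes run().
    t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(thread: Thread, e: Throwable): Unit =
        onBossThreadFailure(e)
    })
    t
  }
}

object ShutdownSketch {
  // Builds a failure callback that mirrors steps a-d from the description.
  // stopService, barrier and exit are assumptions supplied by the caller.
  def failureHandler(
      stopService: () => Unit,
      barrier: CountDownLatch,
      exit: Int => Unit): Throwable => Unit = { e =>
    System.err.println(s"Uncaught exception in boss thread: $e") // a. log
    try {
      stopService()                                              // b. stop the service
    } finally {
      barrier.countDown()                                        // c. release the barrier
      exit(1)                                                    // d. non-zero exit code
    }
  }
}
```

Passing `code => System.exit(code)` as `exit` would give the behavior described in step d, while a test can pass a recording function instead and assert on the order of the calls.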

With this change, if the boss thread is killed by an OOM or another fatal error, the main thread no longer stays alive in a broken state. Instead, the ESS process exits and is restarted by the runtime environment that supervises it, so the host becomes usable again for shuffle operations.

Why are the changes needed?

In the current ESS implementation, if the boss thread encounters an OOM error and is killed, the process enters a degraded state where new connections cannot be established on the affected ESS hosts. This happens because the main thread continues running even after the boss thread has terminated.
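
The failure mode described above follows from ordinary JVM thread semantics: an uncaught error terminates only the thread it was thrown in. The sketch below illustrates this with a simulated error; `DegradedStateDemo` is a hypothetical name and the `OutOfMemoryError` is thrown by hand rather than caused by real memory pressure.

```scala
object DegradedStateDemo {
  // Returns (bossAlive, mainAlive) as observed after the simulated boss thread dies.
  def observe(): (Boolean, Boolean) = {
    val boss = new Thread(new Runnable {
      // Simulated fatal error; no memory is actually exhausted.
      override def run(): Unit = throw new OutOfMemoryError("simulated OOM")
    }, "boss-sketch")
    // Report the error briefly instead of printing the default stack trace.
    boss.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit =
        System.err.println(s"${t.getName} died: $e")
    })
    boss.start()
    boss.join() // wait until the boss thread has terminated
    // The calling thread is unaffected by the other thread's death.
    (boss.isAlive, Thread.currentThread().isAlive)
  }
}
```

The calling thread keeps running after `boss` has terminated, which corresponds to the state in which the process is still up but nothing is accepting new connections.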

This pull request introduces support for terminating the main thread when the boss thread is killed. As a result, the JVM exits and the ESS host is automatically restarted, restoring its ability to serve shuffle-related operations.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Created a new thread with BossThreadFactory and threw an exception from it:

    val threadFactory = new BossThreadFactory("testThread", true, essServerStopFunc)
    val t = threadFactory.newThread(new Runnable {
      override def run(): Unit = {
        logInfo("Throwing exception in test thread")
        throw new RuntimeException("This is a test exception to check if the boss thread " +
          "is properly handling uncaught exceptions.")
      }
    })
    t.start()

With the proposed changes, once the boss thread was killed, the main thread exited as well. The JVM therefore exited and the process was restarted, so ESS was not left in a degraded state.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the CORE label Aug 16, 2025
@HyukjinKwon HyukjinKwon changed the title from "[SPARK-53296] ESS exit main thread in case boss thread exits" to "[SPARK-53296][CORE] ESS exit main thread in case boss thread exits" Aug 18, 2025

@mridulm mridulm left a comment

I did not look at the PR in detail, but note that ESS runs within node manager in YARN - we cannot exit the vm.
