[SPARK-53296][CORE] ESS exit main thread in case boss thread exits #52050
What changes were proposed in this pull request?
This PR adds a custom BossThreadFactory that allows registering an onUncaughtException handler for ESS boss threads.
Specifically:
a. Logging the uncaught exception,
b. Stopping the shuffle service,
c. Counting down the shutdown barrier, and
d. Exiting the JVM with a non-zero code to trigger process restart.
With this change, if the boss thread is killed by an OOM or another fatal error, the main thread no longer stays alive in a broken state. Instead, the ESS process exits with a non-zero code and is restarted by the runtime environment, so the host becomes usable again for shuffle operations.
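The four steps above can be sketched as a thread factory that installs an uncaught-exception handler on every thread it creates. This is a minimal illustration, not the code in this PR: the class name `ExitOnFailureThreadFactory`, its constructor parameters, and the injectable `exit` function are all assumptions made so the sketch is self-contained and testable without killing the JVM.

```scala
import java.util.concurrent.{CountDownLatch, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for the PR's BossThreadFactory.
// Name, parameters and the injectable `exit` are assumptions for illustration.
class ExitOnFailureThreadFactory(
    prefix: String,
    daemon: Boolean,
    stopService: () => Unit,
    barrier: CountDownLatch,
    exit: Int => Unit = code => System.exit(code)) extends ThreadFactory {

  private val counter = new AtomicInteger(0)

  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
    t.setDaemon(daemon)
    t.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(thread: Thread, e: Throwable): Unit = {
        // a. Log the uncaught exception.
        System.err.println(s"Uncaught exception in ${thread.getName}: $e")
        try {
          // b. Stop the service.
          stopService()
        } finally {
          // c. Release whoever is waiting on the shutdown barrier.
          barrier.countDown()
          // d. Exit with a non-zero code so a supervisor can restart the process.
          exit(1)
        }
      }
    })
    t
  }
}
```

Passing a recording function as `exit` (instead of the default `System.exit`) lets a test observe the exit code without terminating the test JVM.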
Why are the changes needed?
In the current ESS implementation, if the boss thread encounters an OOM error and is killed, the process enters a degraded state where new connections cannot be established on the affected ESS hosts. This happens because the main thread continues running even after the boss thread has terminated.
This pull request introduces support for terminating the main thread when the boss thread is killed. As a result, the JVM exits and the ESS host is automatically restarted, restoring its ability to serve shuffle-related operations.
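The degraded state can be reproduced in miniature. The sketch below assumes the main thread parks on a `CountDownLatch` barrier, as the description of the shutdown barrier suggests; `DegradedStateDemo` and `barrierReleasedAfterBossDeath` are hypothetical names. It shows that a thread dying from an uncaught exception does not, by itself, wake a thread waiting on the barrier, and that counting the barrier down from the handler does.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}

// Hypothetical demo; not part of the ESS code base.
object DegradedStateDemo {
  // Returns true if the waiting (main) side is released after the boss thread dies.
  def barrierReleasedAfterBossDeath(releaseOnFailure: Boolean): Boolean = {
    val barrier = new CountDownLatch(1)
    val boss = new Thread(new Runnable {
      override def run(): Unit =
        throw new IllegalStateException("simulated fatal error")
    }, "boss")
    boss.setUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = {
        // Only the patched behaviour releases the barrier.
        if (releaseOnFailure) barrier.countDown()
      }
    })
    boss.start()
    boss.join()
    // Stand-in for the main thread's indefinite wait, bounded so the demo ends.
    barrier.await(200, TimeUnit.MILLISECONDS)
  }
}
```

With `releaseOnFailure = false` the wait times out, which mirrors a main thread that keeps running after the boss thread is gone; with `true` the waiter is released and can proceed to shut down.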
Does this PR introduce any user-facing change?
No
How was this patch tested?
Created a new thread from the factory and threw an exception inside it:

    val threadFactory = new BossThreadFactory("testThread", true, essServerStopFunc)
    val t = threadFactory.newThread(new Runnable {
      override def run(): Unit = {
        logInfo("Throwing exception in test thread")
        // Simulate a fatal error on the boss thread.
        throw new RuntimeException("This is a test exception to check if the boss thread " +
          "is properly handling uncaught exceptions.")
      }
    })
    t.start()
With the proposed changes, killing the boss thread also caused the main thread to exit, so the JVM was restarted automatically and the ESS was not left in a degraded state.
Was this patch authored or co-authored using generative AI tooling?
No