
Conversation

pan3793 (Member) commented Aug 21, 2025

This PR is based on #46889 authored by @JoshRosen

What changes were proposed in this pull request?

This PR adds a new SparkConf option, spark.submit.callSystemExitOnMainExit (default: false). When set to true, SparkSubmit calls System.exit() in the JVM once the user code's main method has exited (for Java / Scala jobs) or once the user's Python or R script has exited.

Why are the changes needed?

This is intended to address a longstanding issue where spark-submit runs might hang after user code has completed:

According to Java’s java.lang.Runtime docs:

The Java Virtual Machine initiates the shutdown sequence in response to one of several events:

  1. when the number of live non-daemon threads drops to zero for the first time (see note below on the JNI Invocation API);
  2. when the Runtime.exit or System.exit method is called for the first time; or
  3. when some external event occurs, such as an interrupt or a signal is received from the operating system.

For Python and R programs, SparkSubmit’s PythonRunner and RRunner will call System.exit() if the user program exits with a non-zero exit code (see python and R runner code).

But for Java and Scala programs, plus any successful R or Python programs, Spark will not automatically call System.exit.

In those situations, the JVM will only shut down when, via event (1), all non-daemon threads have exited (unless the job is cancelled and sent an external interrupt / kill signal, corresponding to event (3)).

Thus, non-daemon threads might cause logically-completed spark-submit jobs to hang rather than completing.
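
This is easy to reproduce outside Spark. As a minimal sketch (the helper and thread names below are illustrative, not Spark code), the live non-daemon threads that block shutdown event (1) can be enumerated directly:

```scala
import scala.jdk.CollectionConverters._

// Hypothetical helper (not part of Spark): lists the live non-daemon threads,
// i.e. exactly the threads that keep the JVM from reaching shutdown event (1).
object NonDaemonThreads {
  def names(): Set[String] =
    Thread.getAllStackTraces.keySet().asScala
      .filter(t => t.isAlive && !t.isDaemon)
      .map(_.getName)
      .toSet
}

object Demo {
  def main(args: Array[String]): Unit = {
    // A lingering non-daemon thread: without System.exit(), the JVM
    // waits for it to finish even though main has logically completed.
    val worker = new Thread(() => Thread.sleep(3000), "lingering-worker")
    worker.setDaemon(false)
    worker.start()
    println(NonDaemonThreads.names()) // typically includes "main" and "lingering-worker"
  }
}
```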

The non-daemon threads are not always under Spark's own control and may not necessarily be cleaned up by SparkContext.stop().

Thus, it is useful to have opt-in functionality for SparkSubmit to automatically call System.exit() when the main method exits (which usually, but not always, corresponds to job completion). This option allows users and data platform operators to enforce System.exit() calls without modifying individual jobs' code.
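
The opt-in semantics can be sketched as follows. This is not the actual SparkSubmit code; the names are illustrative, and `exitFn` stands in for System.exit (as it does in the patch itself) so the decision logic can be exercised without killing the JVM:

```scala
// Illustrative sketch of the opt-in behavior, not Spark's implementation.
object MainExitSketch {
  def afterMainExit(
      callSystemExitOnMainExit: Boolean, // spark.submit.callSystemExitOnMainExit
      exitCode: Int,
      exitFn: Int => Unit): Unit = {
    if (callSystemExitOnMainExit) {
      // Forces JVM shutdown event (2) even if non-daemon threads are still alive.
      exitFn(exitCode)
    }
    // Otherwise the JVM only exits once all non-daemon threads finish (event 1).
  }
}
```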

Does this PR introduce any user-facing change?

Yes, it adds a new user-facing configuration option for opting in to a behavior change.

How was this patch tested?

New tests in SparkSubmitSuite, including one which hangs (failing with a timeout) unless the new option is set to true.

Was this patch authored or co-authored using generative AI tooling?

No.

pan3793 (Member, Author) commented Aug 22, 2025

cc @JoshRosen who is the original author of this change.
cc @dongjoon-hyun @mridulm @yaooqinn @LuciferYang

There was some discussion in the original PR #46889, but no actual objections. This functionality is useful for Spark on K8s, allowing Spark to exit even with lingering non-daemon threads (aligning with Spark on YARN behavior).

SPARK-34674 was proposed to address the Spark on K8s exit issue, but it calls SparkContext.stop instead of System.exit, which does not cover the non-daemon thread cases. If this PR is accepted, SPARK-34674's change becomes unnecessary; since it also makes the Spark on K8s exit process differ from other schedulers, we can consider providing a config that allows users to disable the SparkContext.stop behavior and rely on the shutdown hook to stop SparkContext.

For other reviewers, in addition to the UT added in the change, I created a demo project to help you verify this functionality manually.

```scala
logInfo(
  log"Calling System.exit() with exit code ${MDC(LogKeys.EXIT_CODE, exitCode)} " +
  log"because main ${MDC(LogKeys.CONFIG, SUBMIT_CALL_SYSTEM_EXIT_ON_MAIN_EXIT.key)}=true")
exitFn(exitCode)
```

In one of our internal implementations, we initiated a timer that permitted certain slow non-daemon threads to complete their tasks, such as updating external metrics and reporting states. Upon the timer's expiration, the system logs a message, records thread dumps of all live threads (useful for triaging the cause of the condition), and then exits. Can you extend this patch to do the same?


Something like this:

```scala
import java.util.concurrent.TimeUnit

import scala.jdk.CollectionConverters._

// .....
  if (forceTerminateJVM) {
    createAndStartShutdownThread(timeToWaitInSeconds, exitCode)
  }
}

def createAndStartShutdownThread(timeToWaitInSeconds: Long, exitCode: Int): Unit = {
  logInfo("Starting the shutdown thread")
  val shutdownThread = new Thread(new Runnable {
    override def run(): Unit = {
      logInfo(s"Shutdown thread will wait for ${timeToWaitInSeconds}s before " +
        "exiting the JVM")
      Thread.sleep(TimeUnit.SECONDS.toMillis(timeToWaitInSeconds))
      logWarning("There are non-daemon threads preventing this JVM from shutting down")
      // Dump the stacks of the threads that are blocking shutdown, for triage.
      Thread.getAllStackTraces.keySet().asScala
        .filter(t => !t.isDaemon && t.isAlive)
        .foreach { t =>
          logWarning(s"== ${t.toString} ==")
          logWarning(t.getStackTrace.mkString("\n"))
        }
      logWarning(s"Stopping the JVM with System.exit($exitCode)")
      System.exit(exitCode)
    }
  }, "JVMShutdownThread")
  shutdownThread.setDaemon(true)
  shutdownThread.start()
}
```

pan3793 (Member, Author) commented Aug 25, 2025

@itskals I think it's more appropriate to implement this in a shutdown hook. You actually want a graceful shutdown mechanism that's not coupled to either daemon or non-daemon threads. If users want to do some cleanup work or graceful shutdown logic before the JVM terminates, they need to register a shutdown hook rather than creating non-daemon threads.

BTW, you can use three backticks to quote the code block:

```
this is
  a code block
```
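
A shutdown hook runs on both exit paths, whether the JVM is terminated by System.exit() or by the last non-daemon thread finishing, so cleanup registered this way keeps working when the new flag is enabled. A minimal sketch (the helper name is hypothetical, not Spark API):

```scala
// Hypothetical helper: register cleanup that runs when the JVM begins its
// shutdown sequence, regardless of which shutdown event triggered it.
// Returns the hook thread so callers can deregister it if needed.
object GracefulCleanup {
  def install(cleanup: () => Unit): Thread = {
    val hook = new Thread(() => cleanup(), "graceful-cleanup-hook")
    Runtime.getRuntime.addShutdownHook(hook)
    hook
  }
}
```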


While the usage of non-daemon threads is not desirable, it's hard to enforce in actual application development.

I think it's more appropriate to implement this in a shutdown hook.
Also possible there.

ForVic left a comment:

LGTM + cc @zhouyejoe @mridulm since we added something really similar internally

```scala
if (sparkConf.get(SUBMIT_CALL_SYSTEM_EXIT_ON_MAIN_EXIT)) {
  logInfo(
    log"Calling System.exit() with exit code ${MDC(LogKeys.EXIT_CODE, exitCode)} " +
    log"because main ${MDC(LogKeys.CONFIG, SUBMIT_CALL_SYSTEM_EXIT_ON_MAIN_EXIT.key)}=true")
```

`main` in this log line feels out of place

pan3793 (Member, Author):

Thank you for checking; updated.

yaooqinn (Member):

What's the behavior for cluster mode?

pan3793 (Member, Author) commented Aug 26, 2025

What's the behavior for cluster mode?

@yaooqinn In YARN cluster mode, the AM always sends a kill signal to terminate the driver JVM after main exits; there is no behavior change from the user's perspective, regardless of whether spark.submit.callSystemExitOnMainExit is on or off.

In K8s cluster mode, after the main method exits and SparkContext stops:

  1. if there are non-daemon threads in the driver JVM process
    1.1. the JVM process will hang without this patch, or when spark.submit.callSystemExitOnMainExit is false
    1.2. the JVM process will exit when spark.submit.callSystemExitOnMainExit is true, similar to YARN cluster mode
  2. if there are no non-daemon threads in the driver JVM process, the JVM process will exit

In summary, from the user's perspective, this functionality helps K8s cluster mode for jobs that leave non-daemon threads in the driver: they can enable spark.submit.callSystemExitOnMainExit to avoid the driver Pod hanging after the Spark job finishes.

yaooqinn (Member):

Since this is not a general config, can we make it k8s specific https://spark.apache.org/docs/latest/running-on-kubernetes.html#configuration?

pan3793 (Member, Author) commented Aug 28, 2025

Since this is not a general config, can we make it k8s specific https://spark.apache.org/docs/latest/running-on-kubernetes.html#configuration?

@yaooqinn Actually, compared to other Resource Managers, YARN is the special one: it does additional cleanup work when the AM main method exits. I tested Standalone cluster mode; it's similar to K8s, non-daemon threads also block driver JVM exit. I don't have an environment to test Mesos, but from a quick look at the codebase, I think it's similar to K8s.


For cases where the driver runs outside of the Resource Manager, e.g. local or client mode, non-daemon threads also block driver JVM exit.

In summary, I think this feature is a general solution to tackle the non-daemon thread issue.
