[SPARK-53198][CORE] Support terminating driver JVM after SparkContext is stopped #51929
Conversation
```diff
@@ -45,6 +45,9 @@ private[spark] object SparkExitCode {
      OutOfMemoryError. */
   val OOM = 52
 
+  /** Exit because the SparkContext is stopped. */
+  val SPARK_CONTEXT_STOPPED = 69
```
`SparkContext` is an essential component/service of the Spark application. According to [1][2], 69 is a widely used exit code with the typical meaning "Service unavailable: a service required to complete the task is unavailable."

[1] https://www.ditig.com/linux-exit-status-codes
[2] https://www.man7.org/linux/man-pages/man3/sysexits.h.3head.html

cc @yaooqinn
Can you elaborate more?
This is a bit of a misuse of `DriverPlugin` - the expectation is for plugins to terminate when `shutdown` is invoked.

Btw, if we do want to do this, we can trigger it from `shutdown` (and avoid it in the shutdown context, etc.).
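As an illustration of that suggestion (all class and thread names here are hypothetical, not part of this PR), a plugin could spawn the exit thread from its own `shutdown()` hook, which Spark invokes as part of `SparkContext.stop()`, instead of polling liveness:

```scala
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, SparkPlugin}

// Hypothetical sketch: exit the driver JVM from the plugin's shutdown() hook.
class TerminateOnShutdownPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def shutdown(): Unit = {
      // Exit from a separate daemon thread so the rest of the stop sequence
      // is not blocked; the delay is an assumed grace period.
      val exiter = new Thread(() => {
        Thread.sleep(10000L)
        System.exit(0)
      }, "terminate-on-plugin-shutdown")
      exiter.setDaemon(true)
      exiter.start()
    }
  }

  override def executorPlugin(): ExecutorPlugin = null
}
```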
```scala
        logWarning("SparkContext liveness check is disabled.")
      } else {
        val task: Runnable = () => {
          if (sc.isStopped) {
```
Although there is a delay, `terminateDelay`, this approach looks like a kind of race condition because it checks only the starting point of the whole stop logic of `SparkContext`, like the following. `SparkContext` is supposed to do many things after setting this flag.

spark/core/src/main/scala/org/apache/spark/SparkContext.scala, lines 2322 to 2346 in 5c52a00:
```scala
    if (!stopped.compareAndSet(false, true)) {
      logInfo("SparkContext already stopped.")
      return
    }
    if (_shutdownHookRef != null) {
      ShutdownHookManager.removeShutdownHook(_shutdownHookRef)
    }
    if (listenerBus != null) {
      Utils.tryLogNonFatalError {
        postApplicationEnd(exitCode)
      }
    }
    Utils.tryLogNonFatalError {
      _driverLogger.foreach(_.stop())
    }
    Utils.tryLogNonFatalError {
      _ui.foreach(_.stop())
    }
    Utils.tryLogNonFatalError {
      _cleaner.foreach(_.stop())
    }
    Utils.tryLogNonFatalError {
      _executorAllocationManager.foreach(_.stop())
    }
```
If this is really needed, it's easier to trigger a `System.exit` thread inside `SparkContext.stop` instead of `SparkLivenessPlugin`. That would be much cheaper.
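For illustration only, a minimal sketch of that alternative could look like the following, assuming a hypothetical config key, default delay, and helper name (none of these are part of this PR): `SparkContext.stop` would call the helper to schedule JVM termination from a daemon thread.

```scala
import org.apache.spark.SparkConf

// Hypothetical helper; config key, default delay, and object name are assumptions.
object DriverExitSketch {
  def scheduleDriverExit(conf: SparkConf): Unit = {
    if (conf.getBoolean("spark.driver.exitOnContextStop", defaultValue = false)) {
      val delayMs = conf.getTimeAsMs("spark.driver.exitOnContextStop.delay", "10s")
      val exiter = new Thread(() => {
        Thread.sleep(delayMs)
        // 69 is the SPARK_CONTEXT_STOPPED exit code proposed in this PR.
        System.exit(69)
      }, "driver-exit-on-context-stop")
      // A daemon thread does not keep the JVM alive by itself, but it keeps
      // running as long as some other non-daemon thread does, which is
      // exactly the problematic case this targets.
      exiter.setDaemon(true)
      exiter.start()
    }
  }
}
```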
@mridulm @dongjoon-hyun thanks for your suggestions. If we narrow the PR scope to address:
What do you think about the direction of SPARK-48547 (#46889)? I feel this might be a more generic approach.
I agree with you, @pan3793. I'd give +1 for @JoshRosen's approach.
@dongjoon-hyun thank you for the suggestion, and sorry for the late reply. Closing this in favor of #52091.
What changes were proposed in this pull request?

This PR adds `SparkLivenessPlugin` to allow checking the liveness of `SparkContext`, which is an essential component of the Spark application, and terminating the Spark driver JVM once the `SparkContext` is detected to be stopped.
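As a rough sketch of the mechanism (the class name, polling interval, grace period, and exit-code usage below are illustrative assumptions; the actual `SparkLivenessPlugin` in this PR may differ), the plugin periodically checks `SparkContext.isStopped` on the driver and terminates the JVM after a delay once the context is observed to be stopped:

```scala
import java.util.concurrent.{Executors, TimeUnit}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Illustrative sketch of a SparkContext liveness check, not the exact PR code.
class LivenessCheckPlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    // Daemon scheduler thread, so the checker itself never blocks JVM exit.
    private val scheduler = Executors.newSingleThreadScheduledExecutor((r: Runnable) => {
      val t = new Thread(r, "spark-context-liveness-check")
      t.setDaemon(true)
      t
    })

    override def init(sc: SparkContext, ctx: PluginContext): java.util.Map[String, String] = {
      val checkIntervalSec = 10L   // assumed polling interval
      val terminateDelaySec = 30L  // assumed grace period after stop is observed
      val task: Runnable = () => {
        if (sc.isStopped) {
          // Give the remaining stop logic time to finish, then exit with the
          // SPARK_CONTEXT_STOPPED code (69) proposed in this PR.
          Thread.sleep(TimeUnit.SECONDS.toMillis(terminateDelaySec))
          System.exit(69)
        }
      }
      scheduler.scheduleWithFixedDelay(task, checkIntervalSec, checkIntervalSec, TimeUnit.SECONDS)
      java.util.Collections.emptyMap[String, String]()
    }
  }

  override def executorPlugin(): ExecutorPlugin = null
}
```

A plugin like this would typically be enabled via the `spark.plugins` configuration and, as noted below, the feature is disabled by default.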
Why are the changes needed?

This helps two typical use cases:

1. In local / K8s cluster mode, unlike YARN, non-daemon threads block the driver JVM from exiting even after `SparkContext` is stopped. This is a challenge for users migrating their Spark workloads from YARN to K8s, especially when the non-daemon threads are created by third-party libraries. SPARK-48547 (#46889) was proposed to address this issue but unfortunately did not get in.

2. In some cases, `SparkContext` may stop abnormally after a driver OOM, but the driver JVM does not exit because other non-daemon threads are still alive, which causes services like Thrift Server / Connect Server to be unable to process new requests. Previously, we suggested users configure `spark.driver.extraJavaOptions=-XX:OnOutOfMemoryError="kill -9 %p"` to mitigate such issues, or consider using https://github.com/Netflix-Skunkworks/jvmquake

Does this PR introduce any user-facing change?
It's a new feature, disabled by default.
How was this patch tested?
An example project was created to help reviewers verify this patch in local mode. I also tested it in K8s cluster mode: without this patch, the driver pod runs forever; with this patch, the driver pod exits after the configured delay interval.
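For context, a minimal local-mode reproduction of the underlying problem could look like the following (an assumed sketch, not the linked example project): a leftover non-daemon thread keeps the driver JVM alive even though the `SparkContext` has been stopped.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed reproduction sketch: without a mechanism like this PR's, the JVM
// never exits because of the non-daemon thread, even after sc.stop().
object NonDaemonThreadRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("non-daemon-repro")
    val sc = new SparkContext(conf)

    // Simulate a third-party library leaving a non-daemon thread behind.
    new Thread(() => Thread.sleep(Long.MaxValue), "third-party-non-daemon").start()

    sc.stop()
    // main() returns here, but the driver JVM stays alive.
  }
}
```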
Was this patch authored or co-authored using generative AI tooling?
No.