Conversation

@pan3793 (Member) commented on Aug 8, 2025:

What changes were proposed in this pull request?

This PR adds SparkLivenessPlugin, which monitors the liveness of the SparkContext, an essential component of the Spark application, and terminates the Spark driver JVM once the SparkContext is detected to be stopped.
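For illustration, a minimal sketch of this kind of driver plugin using the public SparkPlugin / DriverPlugin API (the class name, thread name, 10-second check interval, 30-second grace period, and exit code below are illustrative assumptions and may not match the actual patch):

```scala
import java.util.{Collections, Map => JMap}
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Hypothetical sketch only; the actual SparkLivenessPlugin in this PR may differ
// in class name, configuration handling, check interval, and exit code.
class LivenessSketchPlugin extends SparkPlugin {

  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    private val factory: ThreadFactory = (r: Runnable) => {
      val t = new Thread(r, "driver-liveness")
      t.setDaemon(true)
      t
    }
    private val scheduler = Executors.newSingleThreadScheduledExecutor(factory)

    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      val check: Runnable = () => {
        if (sc.isStopped) {
          // Grace period before forcing the JVM down, even if non-daemon
          // threads are still alive (30s is an assumed default).
          TimeUnit.SECONDS.sleep(30)
          Runtime.getRuntime.halt(69) // 69 follows the "service unavailable" convention
        }
      }
      // Poll SparkContext liveness every 10 seconds (interval is an assumption).
      scheduler.scheduleWithFixedDelay(check, 10, 10, TimeUnit.SECONDS)
      Collections.emptyMap[String, String]()
    }

    override def shutdown(): Unit = scheduler.shutdownNow()
  }

  override def executorPlugin(): ExecutorPlugin = null
}
```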

Why are the changes needed?

This helps with two typical use cases:

1. In local / K8s cluster mode, unlike YARN, non-daemon threads block the driver JVM from exiting even after the SparkContext is stopped. This is a challenge for users migrating their Spark workloads from YARN to K8s, especially when the non-daemon threads are created by third-party libraries. SPARK-48547 (#46889) was proposed to address this issue but unfortunately was not merged.

2. In some cases, the SparkContext may stop abnormally after a driver OOM, but the driver JVM does not exit because other non-daemon threads are still alive, leaving services like Thrift Server / Connect Server unable to process new requests. Previously, we suggested users configure spark.driver.extraJavaOptions=-XX:OnOutOfMemoryError="kill -9 %p" to mitigate such issues, or consider using https://github.com/Netflix-Skunkworks/jvmquake.

Does this PR introduce any user-facing change?

It's a new feature, disabled by default.
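As a hedged usage sketch, assuming the plugin is enabled through Spark's standard spark.plugins mechanism (the fully qualified class name and any delay/interval configuration keys introduced by this patch are assumptions here, not confirmed by the excerpts in this thread):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical opt-in; the plugin class name below is an assumption.
val spark = SparkSession.builder()
  .config("spark.plugins", "org.apache.spark.deploy.SparkLivenessPlugin")
  .getOrCreate()
```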

How was this patch tested?

An example project was created to help reviewers verify this patch in local mode. I also tested it in K8s cluster mode: without this patch, the driver pod runs forever; with this patch, the driver pod exits after the configured delay interval.
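The exact Main.scala from the example project is not reproduced here; the following is a rough reconstruction sketched from the log output below (the 3-second print interval and builder options are assumptions):

```scala
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().getOrCreate()
    println("SparkSession created")

    // A non-daemon thread that keeps running and would normally prevent JVM exit.
    val loop: Runnable = () => {
      while (true) {
        println("Hello from non-daemon thread")
        Thread.sleep(3000)
      }
    }
    val t = new Thread(loop)
    t.setDaemon(false)
    t.start()

    println("Waiting 10s")
    Thread.sleep(10000)
    spark.stop() // roughly corresponds to "stop at Main.scala:24" in the log
    println("SparkSession stopped")
  }
}
```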

...
2025-08-08 21:45:56 [INFO] [main] org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend#185 - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
SparkSession created
Waiting 10s
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
2025-08-08 21:46:06 [INFO] [main] org.apache.spark.SparkContext#185 - SparkContext is stopping with exitCode 0 from stop at Main.scala:24.
2025-08-08 21:46:07 [INFO] [main] org.sparkproject.jetty.server.AbstractConnector#383 - Stopped Spark@1117cc7c{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
2025-08-08 21:46:07 [INFO] [main] org.apache.spark.ui.SparkUI#185 - Stopped Spark web UI at http://io-github-pan3793-main-96561e9889edf8fc-driver-svc.spark.svc:4040
2025-08-08 21:46:07 [INFO] [main] org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend#185 - Shutting down all executors
2025-08-08 21:46:07 [INFO] [dispatcher-CoarseGrainedScheduler] org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint#185 - Asking each executor to shut down
2025-08-08 21:46:07 [INFO] [-267463507-pool-23-thread-1] org.apache.spark.scheduler.cluster.k8s.ExecutorPodsWatchSnapshotSource#185 - Kubernetes client has been closed.
2025-08-08 21:46:07 [INFO] [dispatcher-event-loop-0] org.apache.spark.MapOutputTrackerMasterEndpoint#185 - MapOutputTrackerMasterEndpoint stopped!
2025-08-08 21:46:07 [INFO] [main] org.apache.spark.storage.memory.MemoryStore#185 - MemoryStore cleared
2025-08-08 21:46:07 [INFO] [main] org.apache.spark.storage.BlockManager#185 - BlockManager stopped
2025-08-08 21:46:07 [INFO] [main] org.apache.spark.storage.BlockManagerMaster#185 - BlockManagerMaster stopped
2025-08-08 21:46:07 [INFO] [dispatcher-event-loop-1] org.apache.spark.scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint#185 - OutputCommitCoordinator stopped!
2025-08-08 21:46:07 [INFO] [main] org.apache.spark.SparkContext#185 - Successfully stopped SparkContext
SparkSession stopped
Hello from non-daemon thread
2025-08-08 21:46:09 [WARN] [driver-liveness] org.apache.spark.deploy.SparkLivenessDriverPlugin#245 - SparkContext is stopped, will terminate Driver JVM after 30 seconds.
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
Hello from non-daemon thread
2025-08-08 21:46:39 [INFO] [shutdown-hook-0] org.apache.spark.util.ShutdownHookManager#185 - Shutdown hook called
2025-08-08 21:46:39 [INFO] [shutdown-hook-0] org.apache.spark.util.ShutdownHookManager#185 - Deleting directory /tmp/spark-ef530907-bfdf-4d1d-b089-3e1e4208779e
2025-08-08 21:46:39 [INFO] [shutdown-hook-0] org.apache.spark.util.ShutdownHookManager#185 - Deleting directory /var/data/spark-771d2b86-f238-4f45-a975-ea9e9e70535a/spark-ef68a0af-ab50-4257-b02f-c37966e265a5
2025-08-08 21:46:39 [INFO] [shutdown-hook-0] org.apache.spark.util.ShutdownHookManager#185 - Deleting directory /tmp/spark-f3efedf8-30d5-47d8-81da-cf8ac15d1a79
2025-08-08 21:46:39 [INFO] [shutdown-hook-0] org.apache.spark.util.ShutdownHookManager#185 - Deleting directory /tmp/spark-8e55c898-4fb8-4ba1-9f38-034d156a3b76
<pod terminated>

Was this patch authored or co-authored using generative AI tooling?

No.

github-actions bot added the CORE label on Aug 8, 2025
```diff
@@ -45,6 +45,9 @@ private[spark] object SparkExitCode {
      OutOfMemoryError. */
   val OOM = 52
 
+  /** Exit because the SparkContext is stopped. */
+  val SPARK_CONTEXT_STOPPED = 69
```
@pan3793 (Member, Author) commented on the diff:

SparkContext is an essential component/service of the Spark application. According to [1][2], 69 is a widely used exit code with the typical meaning "Service unavailable: a service required to complete the task is unavailable."

[1] https://www.ditig.com/linux-exit-status-codes
[2] https://www.man7.org/linux/man-pages/man3/sysexits.h.3head.html
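For context, a one-line illustration of how such a constant is typically consumed (the actual call site in this patch is not shown in the excerpt above):

```scala
// Illustrative only: exit the driver JVM with the new code once the
// SparkContext is confirmed to be stopped.
System.exit(SparkExitCode.SPARK_CONTEXT_STOPPED)
```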

@pan3793 (Member, Author) commented on Aug 11, 2025:

cc @yaooqinn

@yaooqinn (Member) commented:

> SPARK-48547 (#46889) was proposed to address this issue but unfortunately was not merged.

Can you elaborate more?

@mridulm (Contributor) left a comment:

This is a bit of a misuse of DriverPlugin - the expectation is for plugins to terminate when shutdown is invoked.

Btw, if we do want to do this, we can trigger it from shutdown (and avoid running it in the shutdown context, etc.).

```scala
      logWarning("SparkContext liveness check is disabled.")
    } else {
      val task: Runnable = () => {
        if (sc.isStopped) {
```
@dongjoon-hyun (Member) commented on Aug 13, 2025:

Although there is a delay (terminateDelay), this approach looks like a kind of race condition, because it only checks the starting point of SparkContext's stop logic, like the following. SparkContext is supposed to do many things after setting this flag.

```scala
if (!stopped.compareAndSet(false, true)) {
  logInfo("SparkContext already stopped.")
  return
}
if (_shutdownHookRef != null) {
  ShutdownHookManager.removeShutdownHook(_shutdownHookRef)
}
if (listenerBus != null) {
  Utils.tryLogNonFatalError {
    postApplicationEnd(exitCode)
  }
}
Utils.tryLogNonFatalError {
  _driverLogger.foreach(_.stop())
}
Utils.tryLogNonFatalError {
  _ui.foreach(_.stop())
}
Utils.tryLogNonFatalError {
  _cleaner.foreach(_.stop())
}
Utils.tryLogNonFatalError {
  _executorAllocationManager.foreach(_.stop())
}
```

If this is really needed, it's easier to trigger a System.exit thread inside SparkContext.stop instead of using SparkLivenessPlugin. That would be much cheaper.
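A minimal sketch of that alternative, assuming a small helper invoked near the end of SparkContext.stop; the method name, delay, and exit code are illustrative, not from any actual patch:

```scala
// Hypothetical helper (imagined to be called at the end of SparkContext.stop);
// the name, delay, and exit code are illustrative assumptions.
def scheduleDriverExit(delaySeconds: Long, exitCode: Int): Unit = {
  val exitTask: Runnable = () => {
    Thread.sleep(delaySeconds * 1000L)
    // System.exit runs shutdown hooks and then terminates the JVM even if
    // other non-daemon threads are still alive.
    System.exit(exitCode)
  }
  val t = new Thread(exitTask, "sparkcontext-stop-exit")
  t.setDaemon(true) // a daemon thread, so it does not itself block JVM exit
  t.start()
}
```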

@pan3793 (Member, Author) commented on Aug 15, 2025:

@mridulm @dongjoon-hyun thanks for your suggestions. If we narrow the PR scope to address only:

> Non-daemon threads might block driver JVM exit after the main method finishes.

then in K8s mode, SPARK-34674 made Spark always call SparkContext.stop after the main method finishes, so exiting after the SparkContext is stopped has the same effect in that situation.

What do you think about the direction of SPARK-48547 (#46889)? I feel it might be a more generic approach.

@dongjoon-hyun (Member) commented:

I agree with you, @pan3793. I'd give +1 for @JoshRosen's approach.
You may want to take over by making a PR with his authorship and pinging him at the same time.

@pan3793 (Member, Author) commented on Aug 21, 2025:

@dongjoon-hyun thank you for the suggestion, and sorry for the late reply. Closing this in favor of #52091.

pan3793 closed this on Aug 21, 2025.