Conversation

@behrica
Collaborator

@behrica behrica commented Nov 9, 2024

fixes #356

@anthony-khong
Member

Hi Carsten @behrica, my apologies for the delayed reply. Anything I can do to help to make this work for you? Do you need me to deploy a new version to Clojars?

@behrica
Collaborator Author

behrica commented Nov 18, 2024

My change makes it work for me,
but I don't know if there was any reason to have the default config in the first place.

@anthony-khong
Member

Hi Carsten, I just gave you maintainer access to the repo. Please feel free to change it however you see fit 🙏

@alza-bitz

@behrica it looks like the CI/CD build for this PR failed, but it was a while ago so the logs are no longer accessible to check why it failed. If you have maintainer access, you should have permission to re-try the build in the actions tab?

Let's fix the build and then the PR will be in a mergeable state.

@anthony-khong if you're ok with this PR could you approve it? Thanks 🙏

@alza-bitz

Actually @anthony-khong I'd be interested to know the original rationale for specifying the default config thus:

{:configs {:spark.sql.adaptive.enabled "true"
           :spark.sql.adaptive.coalescePartitions.enabled "true"}
 :checkpoint-dir "target/checkpoint/"}

As in, what would be lost (or break, or not work) by removing this? I can certainly look up what these config options mean, but I was wondering about anything non-obvious for newcomers. The purpose of having a default session is to get working quickly, correct?

@behrica
Collaborator Author

behrica commented Feb 13, 2025

This page,
https://www.sparkcodehub.com/spark-checkpoint-dir-configuration,
seems to say that the checkpoint dir cannot be set correctly for "all cluster setups", as it should ideally live on a "fast, durable drive".

Its old setting failed for me on the latest Azure Databricks, as it somehow points to a non-writable location.

The other two seem to be performance settings, which probably cannot be set correctly for all situations either.

I am by far no expert in Spark, but "no options" seems to be the best default.

But here we touch again on the question of the "default session".
I never understood if and why, on my Azure Databricks cluster, a new session with that checkpoint dir is even attempted.
It should use the existing one, with whatever settings it has.
But my JARs (submitted as a job) failed on start because of this, and I found no other way of fixing it.
(They failed with a clear exception: "... cannot write to target/checkpoint".)

@behrica
Collaborator Author

behrica commented Feb 13, 2025

The documentation on getOrCreate is clear:

In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.

even though I find this "weird".

In any case, to me a "user" needs full control over the "options", including the "checkpoint dir", and that is not given with the current code; some are always applied, it seems.

I cannot change the config "before it is used", I think...
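To illustrate why this behavior is surprising: getOrCreate does not simply return an existing session untouched, it also applies the builder's config options to it, as the quoted documentation says. Here is a toy Python model of that merge semantics; this is not Spark or Geni code, and all names below are invented purely for illustration:

```python
# Toy model of SparkSession.builder.getOrCreate() config semantics.
# NOT Spark code: Session and Builder are invented for illustration only.

class Session:
    def __init__(self, conf):
        self.conf = dict(conf)

class Builder:
    _existing = None  # simulates the one JVM-wide active session

    def __init__(self):
        self.options = {}

    def config(self, key, value):
        self.options[key] = value
        return self

    def get_or_create(self):
        if Builder._existing is None:
            Builder._existing = Session(self.options)
        else:
            # The surprising part: the builder's options are applied
            # to the EXISTING session, overriding its settings.
            Builder._existing.conf.update(self.options)
        return Builder._existing

# A cluster-managed session already exists with a sane checkpoint dir...
first = Builder().config("checkpoint-dir", "/dbfs/checkpoints").get_or_create()

# ...then a library's default-session code runs getOrCreate with its own defaults:
second = Builder().config("checkpoint-dir", "target/checkpoint/").get_or_create()

print(second is first)                # same session object is returned
print(first.conf["checkpoint-dir"])   # library default clobbered the cluster setting
```

Under this model, the library's hard-coded `target/checkpoint/` silently replaces whatever the cluster had configured, which matches the failure described above on Databricks.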

see here as well:
#332

The "atom" was once a "delay", which would maybe avoid the problem, but would still be brittle.
There is some explanation of this decision here:
#243

somehow related to being able to "swap it out" for testing.



Development

Successfully merging this pull request may close these issues.

Geni not usable wit latest Databricks
