Conversation

@behrica
Collaborator

@behrica behrica commented Nov 9, 2024

fixes #356

@anthony-khong
Member

Hi Carsten @behrica, my apologies for the delayed reply. Anything I can do to help to make this work for you? Do you need me to deploy a new version to Clojars?

@behrica
Collaborator Author

behrica commented Nov 18, 2024

My change makes it work for me,
but I don't know if there was any reason to have the default config in the first place.

@anthony-khong
Member

Hi Carsten, I just gave you maintainer access to the repo. Please feel free to change it however you see fit 🙏

@alza-bitz

@behrica it looks like the CI/CD build for this PR failed, but it was a while ago so the logs are no longer accessible to check why it failed. If you have maintainer access, you should have permission to re-try the build in the actions tab?

Let's fix the build and then the PR will be in a mergeable state.

@anthony-khong if you're ok with this PR could you approve it? Thanks 🙏

@alza-bitz

Actually @anthony-khong I'd be interested to know the original rationale for specifying the default config thus:

{:configs {:spark.sql.adaptive.enabled "true"
           :spark.sql.adaptive.coalescePartitions.enabled "true"}
 :checkpoint-dir "target/checkpoint/"}

As in, what would be lost (or break, or not work) by removing this? I can certainly look up what these config options mean, but I was wondering about anything non-obvious for newcomers. The purpose of having a default session is to get working quickly, correct?

@behrica
Collaborator Author

behrica commented Feb 13, 2025

This page,
https://www.sparkcodehub.com/spark-checkpoint-dir-configuration,
seems to say that the checkpoint dir cannot be set correctly for "all cluster setups", as it should ideally live on a "fast, durable drive".

Its old setting failed for me on the latest Azure Databricks, as it somehow points to a non-writable location.

The other two seem to be performance settings, which probably cannot be set correctly for all situations either.

I am by far no expert in Spark, but "no options" seems to be the best default.

But here we touch again on the question of the "default session".
I never understood if and why, on my Azure Databricks cluster, a new session with that checkpoint dir is even attempted.
It should use the existing one, with whatever settings it has.
But my JARs (submitted as a job) failed on start because of this, and I found no other way of fixing it.
(They failed with a clear exception: "... cannot write to target/checkpoint".)

@behrica
Collaborator Author

behrica commented Feb 13, 2025

The documentation on getOrCreate is clear:

In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.

even though I find this "weird".

In any case, to me a "user" needs full control over the "options", including the "checkpoint dir", and that is not given with the current code; some are always applied, it seems.

I cannot change the config "before it is used", I think...
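To illustrate why this behavior is surprising: getOrCreate does not simply return an existing session untouched, it also applies the builder's config options to it, as the quoted documentation says. Here is a toy Python model of that merge semantics; this is not Spark or Geni code, and all names below are invented purely for illustration:

```python
# Toy model of SparkSession.builder.getOrCreate() config semantics.
# NOT Spark code: Session and Builder are invented for illustration only.

class Session:
    def __init__(self, conf):
        self.conf = dict(conf)

class Builder:
    _existing = None  # simulates the one JVM-wide active session

    def __init__(self):
        self.options = {}

    def config(self, key, value):
        self.options[key] = value
        return self

    def get_or_create(self):
        if Builder._existing is None:
            Builder._existing = Session(self.options)
        else:
            # The surprising part: the builder's options are applied
            # to the EXISTING session, overriding its settings.
            Builder._existing.conf.update(self.options)
        return Builder._existing

# A cluster-managed session already exists with a sane checkpoint dir...
first = Builder().config("checkpoint-dir", "/dbfs/checkpoints").get_or_create()

# ...then a library's default-session code runs getOrCreate with its own defaults:
second = Builder().config("checkpoint-dir", "target/checkpoint/").get_or_create()

print(second is first)                # same session object is returned
print(first.conf["checkpoint-dir"])   # library default clobbered the cluster setting
```

Under this model, the library's hard-coded `target/checkpoint/` silently replaces whatever the cluster had configured, which matches the failure described above on Databricks.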

see here as well:
#332

The "atom" was once a "delay", which would maybe avoid the problem, but would still be brittle.
There is some explanation of this decision here:
#243

somehow related to being able to "swap it out" for testing.



Development

Successfully merging this pull request may close these issues.

Geni not usable wit latest Databricks
