⚠️ DEPRECATED: This repository has been deprecated and is no longer maintained. Please use the new repository at https://github.com/BERDataLakehouse/spark_notebook
This prototype provides a Docker container configuration for JupyterHub, offering a multi-user environment for running Spark jobs from Jupyter notebooks.
Accessing Spark Jupyter Notebook
To deploy the JupyterHub container and Spark nodes locally, execute the following command:

```bash
docker-compose up --build
```

To run a sample Spark job (calculating Pi) in the spark-test-node container:

```bash
docker exec -it spark-test-node \
sh -c '
/opt/bitnami/spark/bin/spark-submit \
--master $SPARK_MASTER_URL \
examples/src/main/python/pi.py 10 \
2>/dev/null
'
```

After launching the Jupyter Notebook, establish a Spark context or session with the Spark
master set to the environment variable SPARK_MASTER_URL and proceed to submit your job. Once the job is submitted,
you can monitor the job status and logs in the Spark UI.
Sample code to calculate Pi using SparkContext:

```python
from pyspark import SparkConf, SparkContext
import random
import os
spark_master_url = os.environ['SPARK_MASTER_URL']
conf = SparkConf().setMaster(spark_master_url).setAppName("Pi")
sc = SparkContext(conf=conf)
num_samples = 100000000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
sc.stop()
```

To build the uv dependencies, including when adding new modules, graphviz-dev must be installed. For example:
macOS:

```bash
# Install Graphviz system library (required for the pygraphviz dependency)
brew install graphviz
# Optionally, set environment variables to help uv find Graphviz headers and libraries
# This fixes build errors where the compiler can't locate the Graphviz headers
export CPATH=$(brew --prefix graphviz)/include:$CPATH
export LIBRARY_PATH=$(brew --prefix graphviz)/lib:$LIBRARY_PATH
```

Linux (Debian/Ubuntu):

```bash
# Install Graphviz development libraries
sudo apt install graphviz-dev
```
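Once Graphviz is installed, you can optionally confirm that the pygraphviz dependency builds and imports. This is a hypothetical spot check, not a step from the repository:

```python
# Hypothetical spot check: pygraphviz only builds if the Graphviz headers were found.
import pygraphviz as pgv

print(pgv.__version__)
```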
Python 3.12.10 must be installed on the system.

```bash
# Install Python dependencies
uv sync --locked  # only the first time or when uv.lock changes
```

To run the tests:

```bash
PYTHONPATH=src uv run pytest tests
```

Key environment variables:

- SPARK_MASTER_URL: spark://spark-master:7077
- NOTEBOOK_PORT: 4041
- SPARK_DRIVER_HOST: notebook (the hostname of the Jupyter notebook container)
When running Spark in the Jupyter notebook container, the default spark.driver.host configuration is set to
the hostname (SPARK_DRIVER_HOST) of the container.
The environment variable SPARK_MASTER_URL should also be configured.
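If you ever need to set these values on a session by hand, a minimal sketch looks like this; spark.driver.host is a standard Spark configuration key, while the session below is illustrative rather than code from this repository:

```python
import os

from pyspark.sql import SparkSession

# Illustrative: point the session at the cluster and advertise the notebook
# container's hostname as the driver address, using the variables described above.
spark = (
    SparkSession.builder
    .master(os.environ["SPARK_MASTER_URL"])
    .appName("DriverHostExample")
    .config("spark.driver.host", os.environ["SPARK_DRIVER_HOST"])
    .getOrCreate()
)
```

In the notebook, the provided get_spark_session helper covers the common case: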
```python
from spark.utils import get_spark_session
spark = get_spark_session(app_name)
# To build a Spark session for Delta Lake operations, set the delta_lake parameter to True
spark = get_spark_session(app_name, delta_lake=True)
```
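As a rough illustration of what the Delta-enabled session allows (the table path is hypothetical, and the exact setup depends on how the helper configures Delta Lake):

```python
# `spark` is the Delta-enabled session created above.
# Hypothetical example: write a small Delta table and read it back.
df = spark.range(10)
df.write.format("delta").mode("overwrite").save("/tmp/delta/example_table")

spark.read.format("delta").load("/tmp/delta/example_table").show()
```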
If you want to configure the SparkSession manually, you can do so as follows:

```python
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master(os.environ['SPARK_MASTER_URL']) \
    .appName("TestSparkJob") \
    .getOrCreate()
```

Alternatively, you can create a SparkContext from a SparkConf:

```python
import os

from pyspark import SparkConf, SparkContext

conf = SparkConf() \
    .setMaster(os.environ['SPARK_MASTER_URL']) \
    .setAppName("TestSparkJob")
sc = SparkContext(conf=conf)
```

You can also submit a job with spark-submit, for example:

```bash
/opt/bitnami/spark/bin/spark-submit \
--master $SPARK_MASTER_URL \
/opt/bitnami/spark/examples/src/main/python/pi.py 10 \
2>/dev/null
```

To reach the deployed service from your local machine, create an SSH tunnel through the KBase login node:

```bash
ssh -f -N -L localhost:44041:10.58.2.201:4041 <kbase_developer_username>@login1.berkeley.kbase.us
```

where kbase_developer_username is an Argonne account, typically starting with "ac.".
Navigate to http://localhost:44041/ in your browser.
Enjoy your Spark journey!
For more information, please consult the User Guide.
Regenerate cdm-spark-cluster-manager-api-client with openapi-python-client
- Python 3.9+ installed on your system.
- The openapi-python-client package installed. If not already installed, you can do so using pip:

  ```bash
  pip install openapi-python-client
  ```

- Access to the OpenAPI specification for the cdm-kube-spark-manager, either via a URL or a local file.
- TODO - Post the URL after deployment to Rancher2
From a URL:
```bash
openapi-python-client generate --url https://api.example.com/openapi.json --output-path cdm-spark-cluster-manager-api-client
```

From a Local File:

```bash
openapi-python-client generate --path ./openapi.yaml --output-path cdm-spark-cluster-manager-api-client
```

Copy the generated client files to cdm_spark_cluster_manager_api_client:

```bash
cp -r path_of_openapi_output_path/cdm-spark-cluster-manager-api-client/cdm_spark_cluster_manager_api_client path_of_cdm-jupyterhub/src/spark
cp path_of_openapi_output_path/cdm-spark-cluster-manager-api-client/README.md path_of_cdm-jupyterhub/src/spark/cdm_spark_cluster_manager_api_client
```
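As a rough sketch of how the regenerated client might be used: openapi-python-client generates a Client class at the package root, but the base URL below is an assumption and the endpoint modules depend on the OpenAPI spec, so none are shown:

```python
# Hypothetical usage of the generated client; the base URL is an assumption.
from cdm_spark_cluster_manager_api_client import Client

client = Client(base_url="https://example.org/cdm-spark-cluster-manager")

# Endpoint functions are generated under cdm_spark_cluster_manager_api_client.api.*
# and depend on the OpenAPI spec, so none are called here.
```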