# Image Specifics
This page provides details about features specific to one or more images.
## Apache Spark™

### Specific Docker Image Options
- `-p 4040:4040` - The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images open SparkUI (Spark Monitoring and Instrumentation UI) at the default port `4040`;
  this option maps port `4040` inside the Docker container to port `4040` on the host machine.

  > **Note**
  > Every new Spark context that is created is put onto an incrementing port (i.e. 4040, 4041, 4042, etc.), and it might be necessary to open multiple ports.

  For example, `docker run --detach -p 8888:8888 -p 4040:4040 -p 4041:4041 quay.io/jupyter/pyspark-notebook`.
### IPython low-level output capture and forward
Spark images (`pyspark-notebook` and `all-spark-notebook`) have been configured to disable IPython low-level output capture and forward system-wide.
The rationale behind this choice is that Spark logs can be verbose, especially at startup, when Ivy is used to load additional jars.
Those logs are still available, but only in the container's logs.
If you want to make them appear in the notebook, you can overwrite the configuration in a user-level IPython kernel profile.
To do that, uncomment the following line in your `~/.ipython/profile_default/ipython_kernel_config.py` and restart the kernel.
```python
c.IPKernelApp.capture_fd_output = True
```
If you have no IPython profile, you can initiate a fresh one by running the following command.
```bash
ipython profile create
# [ProfileCreate] Generating default config file: '/home/jovyan/.ipython/profile_default/ipython_config.py'
# [ProfileCreate] Generating default config file: '/home/jovyan/.ipython/profile_default/ipython_kernel_config.py'
```
### Build an Image with a Different Version of Spark
You can build a `pyspark-notebook` image with a different Spark version by overriding the default values of the following arguments at build time.
`all-spark-notebook` is inherited from `pyspark-notebook`, so you have to first build `pyspark-notebook` and then `all-spark-notebook` to get the same Spark version in both images.
The Spark distribution is defined by the combination of Spark, Hadoop, and Scala versions; see Download Apache Spark and the archive repo for more information.
- `openjdk_version`: The version of the OpenJDK (JRE headless) distribution (`17` by default).
  This version needs to match the version supported by the Spark distribution used above.
  See Spark Overview and Ubuntu packages.
- `spark_version` (optional): The Spark version to install, for example `3.5.0`.
  If not specified (this is the default), the latest stable Spark version will be installed.
- `hadoop_version`: The Hadoop version (`3` by default).
  Note that Spark < 3.3 requires specifying the `major.minor` Hadoop version (i.e. `3.2`).
- `scala_version` (optional): The Scala version, for example `2.13` (not specified by default).
  Starting with Spark >= 3.2, the distribution file might contain the Scala version.
- `spark_download_url`: URL to use for Spark downloads.
  You may need to use the https://archive.apache.org/dist/spark/ URL if you want to download old Spark versions.
For example, here is how to build a `pyspark-notebook` image with Spark `3.2.0`, Hadoop `3.2`, and OpenJDK `11`.
> **Warning**
> This recipe is not tested and might be broken.
```bash
# From the root of the project
# Build the image with different arguments
docker build --rm --force-rm \
    -t my-pyspark-notebook ./images/pyspark-notebook \
    --build-arg openjdk_version=11 \
    --build-arg spark_version=3.2.0 \
    --build-arg hadoop_version=3.2 \
    --build-arg spark_download_url="https://archive.apache.org/dist/spark/"

# Check the newly built image
docker run -it --rm my-pyspark-notebook pyspark --version

# Welcome to
#       ____              __
#      / __/__  ___ _____/ /__
#     _\ \/ _ \/ _ `/ __/  '_/
#    /___/ .__/\_,_/_/ /_/\_\   version 3.2.0
#       /_/
#
# Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 11.0.21
# Branch HEAD
# Compiled by user ubuntu on 2021-10-06T12:46:30Z
# Revision 5d45a415f3a29898d92380380cfd82bfc7f579ea
# Url https://github.com/apache/spark
# Type --help for more information.
```
### Usage Examples
The `jupyter/pyspark-notebook` and `jupyter/all-spark-notebook` images support the use of Apache Spark in Python and R notebooks.
The following sections provide some examples of how to get started using them.
#### Using Spark Local Mode
Spark local mode is useful for experimentation on small data when you do not have a Spark cluster available.
> **Warning**
> In these examples, Spark spawns all the main execution components in the same single JVM.
> You can read additional info about local mode here.
> If you want to use all the CPU cores, one of the simplest ways is to set up a Spark Standalone Cluster.
##### Local Mode in Python
In a Python notebook.
```python
from pyspark.sql import SparkSession

# Spark session & context
spark = SparkSession.builder.master("local").getOrCreate()
sc = spark.sparkContext

# Sum of the first 100 whole numbers
rdd = sc.parallelize(range(100 + 1))
rdd.sum()
# 5050
```
##### Local Mode in R
In an R notebook with SparkR.
```r
library(SparkR)

# Spark session & context
sc <- sparkR.session("local")

# Sum of the first 100 whole numbers
sdf <- createDataFrame(list(1:100))
dapplyCollect(sdf,
              function(x)
              { x <- sum(x)}
             )
# 5050
```
In an R notebook with sparklyr.
```r
library(sparklyr)

# Spark configuration
conf <- spark_config()
# Set the catalog implementation in-memory
conf$spark.sql.catalogImplementation <- "in-memory"

# Spark session & context
sc <- spark_connect(master = "local", config = conf)

# Sum of the first 100 whole numbers
sdf_len(sc, 100, repartition = 1) %>%
    spark_apply(function(e) sum(e))
# 5050
```
#### Connecting to a Spark Cluster in Standalone Mode
Connecting to a Spark cluster in Standalone Mode requires the following set of steps:

1. Verify that the Docker image (check the Dockerfile) and the Spark cluster being deployed run the same version of Spark.
2. Run the Docker container with `--net=host` in a location that is network-addressable by all of your Spark workers.
   (This is a Spark networking requirement.)

   > **Note**
   > When using `--net=host`, you must also use the flags `--pid=host -e TINI_SUBREAPER=true`.
   > See jupyter/docker-stacks#64 for details.
**Note**: In the following examples, we are using the Spark master URL `spark://master:7077`, which should be replaced by the URL of your Spark master.
##### Standalone Mode in Python
The same Python version needs to be used on the notebook (where the driver is located) and on the Spark workers.
The Python version used on the driver and worker side can be adjusted by setting the environment variables `PYSPARK_PYTHON` and/or `PYSPARK_DRIVER_PYTHON`;
see Spark Configuration for more information.
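For instance, one way to pin the worker-side interpreter is to export the variable from the notebook before the session below is created.
This is only a minimal sketch: `/usr/bin/python3` is an example path that must exist and point to matching Python versions on both the driver and the workers.

```python
import os

# Hypothetical interpreter path: it must exist and point to the same
# Python version on the driver (this notebook) and on every Spark worker.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
```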
```python
from pyspark.sql import SparkSession

# Spark session & context
spark = SparkSession.builder.master("spark://master:7077").getOrCreate()
sc = spark.sparkContext

# Sum of the first 100 whole numbers
rdd = sc.parallelize(range(100 + 1))
rdd.sum()
# 5050
```
##### Standalone Mode in R
In an R notebook with SparkR.
```r
library(SparkR)

# Spark session & context
sc <- sparkR.session("spark://master:7077")

# Sum of the first 100 whole numbers
sdf <- createDataFrame(list(1:100))
dapplyCollect(sdf,
              function(x)
              { x <- sum(x)}
             )
# 5050
```
In an R notebook with sparklyr.
```r
library(sparklyr)

# Spark configuration
conf <- spark_config()
# Set the catalog implementation in-memory
conf$spark.sql.catalogImplementation <- "in-memory"

# Spark session & context
sc <- spark_connect(master = "spark://master:7077", config = conf)

# Sum of the first 100 whole numbers
sdf_len(sc, 100, repartition = 1) %>%
    spark_apply(function(e) sum(e))
# 5050
```
#### Define Spark Dependencies
> **Note**
> This example is given for Elasticsearch.
Spark dependencies can be declared using the `spark.jars.packages` property (see Spark Configuration for more information).
They can be defined as a comma-separated list of Maven coordinates at the creation of the Spark session.
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("elasticsearch")
    .config(
        "spark.jars.packages", "org.elasticsearch:elasticsearch-spark-30_2.12:7.13.0"
    )
    .getOrCreate()
)
```
Dependencies can also be defined in the `spark-defaults.conf` file.
However, this has to be done by `root`, so it should only be considered when building custom images.
```dockerfile
USER root
RUN echo "spark.jars.packages org.elasticsearch:elasticsearch-spark-30_2.12:7.13.0" >> "${SPARK_HOME}/conf/spark-defaults.conf"
USER ${NB_UID}
```
Jars will be downloaded dynamically at the creation of the Spark session and stored by default in `${HOME}/.ivy2/jars` (this location can be changed by setting `spark.jars.ivy`).
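As a minimal sketch of that override, the Elasticsearch example above could point the cache at a custom directory; `/tmp/ivy2` is only an example path.

```python
from pyspark.sql import SparkSession

# Same Elasticsearch connector as above, but the downloaded jars are
# cached in a custom Ivy directory ("/tmp/ivy2" is only an example path).
spark = (
    SparkSession.builder.appName("elasticsearch")
    .config(
        "spark.jars.packages", "org.elasticsearch:elasticsearch-spark-30_2.12:7.13.0"
    )
    .config("spark.jars.ivy", "/tmp/ivy2")
    .getOrCreate()
)
```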
## Tensorflow
The `jupyter/tensorflow-notebook` image supports the use of Tensorflow in a single machine or distributed mode.
### Single Machine Mode
```python
import tensorflow as tf

# Define a variable and evaluate it in a TensorFlow 1.x-style session
hello = tf.Variable("Hello World!")

sess = tf.Session()
init = tf.global_variables_initializer()

sess.run(init)
sess.run(hello)
```
### Distributed Mode
```python
import tensorflow as tf

# Start an in-process TensorFlow server and evaluate the variable against it
hello = tf.Variable("Hello Distributed World!")

server = tf.train.Server.create_local_server()
sess = tf.Session(server.target)
init = tf.global_variables_initializer()

sess.run(init)
sess.run(hello)
```