How to set up a PySpark Jupyter notebook locally

Here is a quick tutorial on how to set up PySpark on your local computer with a Jupyter notebook environment.

Downloading Spark

Go to the Spark downloads page and pick out the latest Spark version. The default package type, "Pre-built for Apache Hadoop 2.7 and later", is fine to select. Proceed to download.
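If you prefer the command line, the archive can also be fetched directly. The URL below is a sketch pointing at the Apache archive for the 2.4.0 build; adjust it to whatever version and mirror the download page gives you.

# Download the Spark archive (swap in your chosen version/mirror)
$ wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz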

# Untar Spark and put it into /opt
$ sudo tar xvfz spark-2.4.0-bin-hadoop2.7.tgz -C /opt

# Give it a cleaner name
$ sudo mv /opt/spark-2.4.0-bin-hadoop2.7 /opt/spark

Run the REPL

You can start a PySpark REPL by running /opt/spark/bin/pyspark.
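Once the REPL is up, sc and spark are already defined, so a quick sanity check might look like the following (the version string will simply match whichever release you downloaded):

$ /opt/spark/bin/pyspark
>>> sc.version
'2.4.0'
>>> spark.range(5).count()
5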

Run a Jupyter notebook

People often suggest using environment variables directly to run PySpark in a Jupyter notebook; however, this is a messy approach. It is much better to set up a custom Jupyter kernel instead.
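For comparison, the environment-variable approach typically looks something like this (PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are standard Spark settings): it works, but it ties your notebook server to the pyspark launcher and to whatever happens to be exported in the current shell.

# The env-var approach: launch Jupyter through the pyspark script
$ export PYSPARK_DRIVER_PYTHON=jupyter
$ export PYSPARK_DRIVER_PYTHON_OPTS=notebook
$ /opt/spark/bin/pyspark

A dedicated kernel avoids that coupling, so that is what we will set up.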

I can check what kernels I have available with jupyter kernelspec list.

$ jupyter kernelspec list
Available kernels:
  julia-1.0    /home/suzil/.local/share/jupyter/kernels/julia-1.0
  scala        /home/suzil/.local/share/jupyter/kernels/scala
  python3      /home/suzil/anaconda3/share/jupyter/kernels/python3
  python2      /usr/share/jupyter/kernels/python2

Unfortunately, these auto-created kernels ended up in various places, which isn't ideal, but whatever. You can create your custom kernel under any of these paths; I will put it in the anaconda3 path alongside the python3 kernel.

All I have to do is create a file called kernel.json and put it into a folder that I will create at /home/suzil/anaconda3/share/jupyter/kernels/pyspark.
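Here is a minimal sketch of what that kernel.json can look like, assuming the anaconda3 interpreter from the kernel listing above and the py4j zip that ships with Spark 2.4.0 (check /opt/spark/python/lib for the exact file name in your distribution). The PYTHONSTARTUP entry points at PySpark's own shell.py, which is what gets sc defined automatically when the kernel starts.

{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "/home/suzil/anaconda3/bin/python",
    "-m",
    "ipykernel_launcher",
    "-f",
    "{connection_file}"
  ],
  "env": {
    "SPARK_HOME": "/opt/spark",
    "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.10.7-src.zip",
    "PYTHONSTARTUP": "/opt/spark/python/pyspark/shell.py",
    "PYSPARK_SUBMIT_ARGS": "--master local[*] pyspark-shell"
  }
}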

These path values are specific to my machine, so make sure you modify them to point to your own Python interpreter.

Now, try to run jupyter kernelspec list again.

$ jupyter kernelspec list
Available kernels:
  julia-1.0    /home/suzil/.local/share/jupyter/kernels/julia-1.0
  scala        /home/suzil/.local/share/jupyter/kernels/scala
  pyspark      /home/suzil/anaconda3/share/jupyter/kernels/pyspark
  python3      /home/suzil/anaconda3/share/jupyter/kernels/python3
  python2      /usr/share/jupyter/kernels/python2

pyspark is now available as a kernel! I can just run:

$ jupyter notebook

I select the PySpark kernel option when creating a new notebook.

Let me verify that I can create an RDD and run an action on it. Note that sc should already be in scope without any import.

words = sc.parallelize([
   'scala',
   'java',
   'hadoop',
   'spark',
   'akka',
   'spark vs hadoop',
   'pyspark',
   'pyspark and spark',
])
words.count()
8
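As one more optional check, a transformation on the same RDD should also run locally; the expected output below just applies the filter by hand.

words.filter(lambda word: 'spark' in word).collect()
['spark', 'spark vs hadoop', 'pyspark', 'pyspark and spark']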

Voilà! PySpark is running locally in a Jupyter notebook.