Configuring Anaconda with Spark¶
Overview¶
Apache Spark is an analytics engine and parallel computation framework with Scala, Python and R interfaces. Spark can load data directly from disk, memory and other data storage technologies such as Amazon S3, Hadoop Distributed File System (HDFS), HBase, Cassandra, and others.
You can install Spark on a cluster using an enterprise Hadoop distribution such as Cloudera CDH or Hortonworks HDP.
Different Ways to Use Spark with Anaconda¶
Spark scripts are often developed interactively and can be written as a Python script or in a Jupyter notebook.
A PySpark script can be submitted to a Spark cluster using various methods:
- Running the script directly on the head node.
- Using the spark-submit command either in Standalone mode or with the YARN resource manager.
- Interactively in an IPython shell or Jupyter Notebook on the cluster.
To run a script on the head node, simply execute python example.py on the cluster (a minimal sketch of such a script is shown below). Alternatively, you can install Jupyter Notebook on the cluster using Anaconda Scale. Refer to the Installation documentation for more information.
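As a point of reference, example.py might contain something as simple as the following sketch; the job itself (summing the squares of a small range of numbers) is only a placeholder:
from pyspark import SparkConf, SparkContext

# Create a SparkContext using the cluster's default master setting
conf = SparkConf().setAppName('example')
sc = SparkContext(conf=conf)

# Distribute a small range of numbers and sum their squares on the executors
rdd = sc.parallelize(range(1000))
print(rdd.map(lambda x: x * x).sum())

sc.stop()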
You can also use Anaconda Scale with enterprise Hadoop distributions such as Cloudera CDH or Hortonworks HDP. The sections below provide details on how to configure and use Anaconda with Spark jobs and initialize a SparkContext.
Configuring Anaconda for Spark jobs¶
Configuring Anaconda with the spark-submit command¶
You can submit Spark jobs by setting the PYSPARK_PYTHON environment variable to the location of the Python executable in Anaconda, for example:
$ PYSPARK_PYTHON=/opt/continuum/anaconda/bin/python spark-submit pyspark_script.py
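If you want to submit the job to the YARN resource manager rather than run it in Standalone mode, you can also pass the master setting on the command line; for example, with the same placeholder script name:
$ PYSPARK_PYTHON=/opt/continuum/anaconda/bin/python spark-submit --master yarn-client pyspark_script.py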
Configuring Anaconda with Jupyter Notebooks and Cloudera CDH¶
You can configure Jupyter Notebooks to use Anaconda Scale with Cloudera CDH using the following Python code at the top of your notebook:
import os
import sys

# Point Spark at the Anaconda Python interpreter and the CDH installations of Java and Spark
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.7.0_67-cloudera/jre"
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# Make the PySpark and Py4J libraries importable in this notebook
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
The above configuration was tested with Cloudera CDH 5.9 and Spark 1.6. Depending on the version of Cloudera CDH that you have installed, you might need to customize these paths according to the location of Java, Spark, and Anaconda on your cluster.
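If you prefer not to hard-code the py4j version suffix, which can change between CDH releases, one option is to locate the zip file at runtime. A minimal sketch, assuming SPARK_HOME has already been set as above:
import glob
import os
import sys

# Find the py4j zip shipped with Spark, whatever its version suffix is
pylib = os.environ["SPARK_HOME"] + "/python/lib"
py4j_zip = glob.glob(pylib + "/py4j-*-src.zip")[0]

sys.path.insert(0, py4j_zip)
sys.path.insert(0, pylib + "/pyspark.zip")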
Configuring Anaconda with Jupyter Notebooks and Hortonworks HDP¶
You can configure Jupyter Notebooks to use Anaconda Scale with Hortonworks HDP using the following Python code at the top of your notebook:
import os
import sys

# Point Spark at the Anaconda Python interpreter and the HDP installation of Spark
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"

# Make the PySpark and Py4J libraries importable in this notebook
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
The above configuration was tested with Hortonworks HDP 2.5, Apache Ambari 2.4, and Spark 1.6. Depending on the version of Hortonworks HDP that you have installed, you might need to customize these paths according to the location of Spark and Anaconda on your cluster.
Creating a SparkContext¶
Once you've configured the appropriate environment variables, you can initialize a SparkContext (in yarn-client mode in this example) using:
from pyspark import SparkConf
from pyspark import SparkContext

# Run against the YARN resource manager in client mode
conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('anaconda-pyspark')

sc = SparkContext(conf=conf)
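As a quick check that the context is working and that the executors are using the Anaconda interpreter configured above, you can run a trivial job; a minimal sketch:
import sys

# Count a small distributed dataset and report the Python version seen on an executor
rdd = sc.parallelize(range(100))
print(rdd.count())
print(rdd.map(lambda _: sys.version).first())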
For more information about configuring Spark settings, refer to the PySpark documentation.
Once you’ve initialized a SparkContext, you can start using Anaconda with Spark jobs. Refer to the Using Anaconda with Spark documentation for example Spark jobs that use libraries from Anaconda.
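As a taste of what such jobs can look like, here is a minimal sketch of a job that uses NumPy from Anaconda on the executors, assuming sc is the SparkContext created above:
import numpy as np

# Apply a NumPy function to each partition's data on the executors
rdd = sc.parallelize(range(1000), 4)
print(rdd.mapPartitions(lambda part: [np.mean(list(part))]).collect())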