Configuring Anaconda with Spark¶
You can configure Anaconda to work with Spark jobs in three ways: with the “spark-submit” command, with Jupyter Notebooks and Cloudera CDH, or with Jupyter Notebooks and Hortonworks HDP.
After you configure Anaconda with one of these three methods, you can create and initialize a SparkContext.
Configuring Anaconda with the spark-submit command¶
You can submit Spark jobs with the spark-submit command by setting the PYSPARK_PYTHON environment variable to the location of the Python executable in Anaconda.
EXAMPLE:
PYSPARK_PYTHON=/opt/continuum/anaconda/bin/python spark-submit pyspark_script.py
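Here, pyspark_script.py is whatever PySpark script you want to run. As a minimal sketch, a hypothetical pyspark_script.py might look like the following; the file name and job logic are placeholders, not part of the Anaconda configuration:
from pyspark import SparkConf, SparkContext

# Create a SparkContext; spark-submit and your cluster's Spark configuration
# supply settings such as the master
conf = SparkConf().setAppName("anaconda-pyspark-script")
sc = SparkContext(conf=conf)

# A trivial job: sum the squares of 0..99 across the cluster
result = sc.parallelize(range(100)).map(lambda x: x * x).sum()
print(result)

sc.stop()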
Configuring Anaconda with Jupyter Notebooks and Cloudera CDH¶
Configure Jupyter Notebooks to use Anaconda Scale with Cloudera CDH using the following Python code at the top of your notebook:
import os
import sys
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["JAVA_HOME"] = "/usr/java/jdk1.7.0_67-cloudera/jre"
os.environ["SPARK_HOME"] = "/opt/cloudera/parcels/CDH/lib/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
The above configuration was tested with Cloudera CDH 5.11 and Spark 1.6. Depending on the version of Cloudera CDH that you have installed, you might need to customize these paths according to the location of Java, Spark and Anaconda on your cluster.
If you’ve installed a custom Anaconda parcel, the path for PYSPARK_PYTHON will be /opt/cloudera/parcels/PARCEL_NAME/bin/python, where PARCEL_NAME is the name of the custom parcel you created.
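For example, with a hypothetical custom parcel named anaconda-custom, only the PYSPARK_PYTHON line in the configuration above would change:
# Hypothetical custom parcel named "anaconda-custom"; substitute your parcel's name
os.environ["PYSPARK_PYTHON"] = "/opt/cloudera/parcels/anaconda-custom/bin/python"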
Configuring Anaconda with Jupyter Notebooks and Hortonworks HDP¶
Configure Jupyter Notebooks to use Anaconda Scale with Hortonworks HDP using the following Python code at the top of your notebook:
import os
import sys
os.environ["PYSPARK_PYTHON"] = "/opt/continuum/anaconda/bin/python"
os.environ["SPARK_HOME"] = "/usr/hdp/current/spark-client"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
sys.path.insert(0, os.environ["PYLIB"] + "/py4j-0.9-src.zip")
sys.path.insert(0, os.environ["PYLIB"] + "/pyspark.zip")
The above configuration was tested with Hortonworks HDP 2.6, Apache Ambari 2.4 and Spark 1.6. Depending on the version of Hortonworks HDP that you have installed, you might need to customize these paths according to the location of Spark and Anaconda on your cluster.
If you’ve installed a custom Anaconda management pack, the path for PYSPARK_PYTHON will be /opt/continuum/PARCEL_NAME/bin/python, where PARCEL_NAME is the name of the custom management pack you created.
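If the notebook cannot import pyspark, a quick way to see whether these paths match your cluster layout is to check that each configured location exists on the node running the notebook. This is a minimal sketch for troubleshooting, not part of the required configuration:
import os

# Confirm that each configured location exists on this node
for name in ("PYSPARK_PYTHON", "SPARK_HOME", "PYLIB"):
    path = os.environ.get(name, "")
    status = "found" if os.path.exists(path) else "MISSING"
    print(name, path, status)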
Creating a SparkContext¶
Once you have configured the appropriate environment variables, you can initialize a SparkContext (in yarn-client mode in this example) using:
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('anaconda-pyspark')
sc = SparkContext(conf=conf)
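To confirm that the executors are using the Anaconda Python you configured, you can run a small job that reports the Python executable on each worker. This is an optional check, not part of the configuration:
def python_on_executor(_):
    # Report which Python interpreter the executor is running
    import sys
    return sys.executable

print(sc.parallelize(range(2)).map(python_on_executor).distinct().collect())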
For more information about configuring Spark settings, see the PySpark documentation.
Once you’ve initialized a SparkContext, you can start using Anaconda with Spark jobs. For examples of Spark jobs that use libraries from Anaconda, see Using Anaconda with Spark.