Cloudera CDH
There are two methods of using Anaconda on an existing cluster with Cloudera CDH, Cloudera’s Distribution Including Apache Hadoop: 1) the Anaconda parcel for Cloudera CDH, and 2) Anaconda Scale. The instructions below describe how to install the Anaconda parcel on a CDH cluster using Cloudera Manager.
SEE ALSO: Blog post Self-service Open Data Science: Custom Anaconda parcels for Cloudera.
Using the Anaconda parcel
The Anaconda parcel provides a static installation of Anaconda (based on Python 2.7) that can be used with Python and PySpark jobs on the cluster.
1. From the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar.
2. Click the Edit Settings button on the top right of the parcels page.
3. Click the plus symbol in the Remote Parcel Repository URLs section, and add the following repository URL for the Anaconda parcel: https://repo.continuum.io/pkgs/misc/parcels/
4. Click the Save Changes button on the top of the page.
5. Click the Parcels indicator in the top navigation bar to return to the list of available parcels, where you should see the latest version of the Anaconda parcel that is available.
6. Click the Download button to the right of the Anaconda parcel listing.
7. After the parcel is downloaded, click the Distribute button to distribute the parcel to all of the cluster nodes.
8. After the parcel is distributed, click the Activate button to activate the parcel on all of the cluster nodes, which will prompt with a confirmation dialog.

After the parcel is activated, Anaconda is available on all of the cluster nodes.
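Once the parcel is activated, a quick check on any node can confirm that the interpreter is in place. The following is a minimal sketch, assuming the default parcel location /opt/cloudera/parcels/Anaconda that appears in the spark-submit example in this section. The function name anaconda_python_path is hypothetical and is not part of Anaconda or Cloudera Manager.

```python
import os

# Assumed default location of the activated parcel; adjust this if your
# cluster uses a different parcel directory.
DEFAULT_PARCEL_ROOT = "/opt/cloudera/parcels/Anaconda"


def anaconda_python_path(parcel_root=DEFAULT_PARCEL_ROOT):
    """Return the parcel's Python interpreter path, or None if it is
    missing or not executable on this node."""
    candidate = os.path.join(parcel_root, "bin", "python")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None
```

Calling anaconda_python_path() on a node returns the interpreter path when the parcel is active there, and None otherwise. It only inspects the local filesystem, so it would need to be run on each node of interest.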
You can submit Spark jobs with the PYSPARK_PYTHON environment variable set to the location of the Anaconda Python interpreter, for example:

    $ PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit pyspark_script.py
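When jobs are launched from a Python driver script rather than an interactive shell, the same setting can be passed through the subprocess environment. This is a minimal sketch, assuming spark-submit is on the PATH and the parcel lives at the default location shown above. The names build_spark_submit and submit are hypothetical helpers, not part of Spark or Anaconda.

```python
import os
import subprocess

# Interpreter path taken from the spark-submit example above (assumed default).
ANACONDA_PYTHON = "/opt/cloudera/parcels/Anaconda/bin/python"


def build_spark_submit(script, python_path=ANACONDA_PYTHON, base_env=None):
    """Return (command, environment) for running `script` with spark-submit
    while pointing PySpark at the given Python interpreter."""
    # Copy the environment so the caller's process environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["PYSPARK_PYTHON"] = python_path
    return ["spark-submit", script], env


def submit(script, python_path=ANACONDA_PYTHON):
    """Run spark-submit and raise CalledProcessError on a non-zero exit."""
    command, env = build_spark_submit(script, python_path)
    subprocess.run(command, env=env, check=True)
```

With these helpers, submit("pyspark_script.py") is intended to behave like the shell command above. Keeping the command construction separate from the call to subprocess.run makes it possible to inspect the environment before anything is launched.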
Note: The repository URL shown above installs the most recent version of the Anaconda parcel. To install an older version of the Anaconda parcel, add the following repository URL to the Remote Parcel Repository URLs in Cloudera Manager, then follow the above steps with your desired version of the Anaconda parcel:
https://repo.continuum.io/pkgs/misc/parcels/archive/
Note: Continuum builds new Cloudera parcels at least once a year each spring, and also offers custom parcel creation for our enterprise customers. The Anaconda parcel provided at the repository URL shown above is based on Python 2.7. To use the Anaconda parcel with other versions of Python or with additional packages, please contact sales@continuum.io for more information about custom Anaconda parcel builds or other enterprise solutions for using Anaconda with cluster computing.
Anaconda Workgroup and Anaconda Enterprise subscribers can also use Anaconda Repository to create and distribute their own custom Anaconda parcels for Cloudera Manager.
For more information about managing Cloudera parcels, refer to the Cloudera documentation.
Using Anaconda Scale
Anaconda Scale provides additional functionality, including the ability to manage multiple conda environments and packages (including Python and R) alongside an existing CDH cluster. For more information, refer to the Anaconda Scale documentation.