Using Anaconda with Cloudera CDH

There are two methods of using Anaconda Scale on a cluster with Cloudera CDH:

  1. The Anaconda parcel for Cloudera CDH
  2. A dynamic, managed version of Anaconda on all of the nodes

SEE ALSO: The blog post "Self-service Open Data Science: Custom Anaconda parcels for Cloudera".

The freely available Anaconda parcel is based on Python 2.7 and includes the default conda packages that are available in the free Anaconda distribution.

In addition to the free Python 2.7 parcel, we can provide custom parcels for other versions of Python with additional conda packages, depending on your needs. For more information about custom Anaconda parcels for Cloudera CDH, contact sales@continuum.io.

Anaconda Workgroup and Anaconda Enterprise subscribers can also use Anaconda Repository to create and distribute their own custom Anaconda parcels for Cloudera Manager.

If you need more flexibility than the Anaconda parcel offers, Anaconda Scale can also dynamically install and manage multiple conda environments (such as Python 2, Python 3, and R environments) and packages across a cluster.

Using the Anaconda Parcel

Refer to the Anaconda parcel documentation for more information about installing the Anaconda parcel on a CDH cluster using Cloudera Manager.

To transition from the Anaconda parcel for CDH to the dynamic, managed version of Anaconda Scale, follow the instructions below to uninstall the Anaconda parcel from the CDH cluster and move to a centrally managed version of Anaconda.

Uninstalling the Anaconda parcel

If the Anaconda parcel is installed on the CDH cluster, use the following steps to uninstall the parcel.

  1. From the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar.
  2. Click the Deactivate button to the right of the Anaconda parcel listing.
  3. Click OK on the Deactivate prompt to deactivate the Anaconda parcel and restart Spark and related services.
  4. Click the arrow to the right of the Anaconda parcel listing and choose Remove From Hosts.
  5. Confirm the removal in the dialog that appears. The Anaconda parcel is then removed from the cluster nodes.

For more information about managing Cloudera parcels, refer to the Cloudera documentation.

Transitioning to a centrally managed Anaconda installation

Once you’ve uninstalled the Anaconda parcel, refer to the Anaconda Scale installation instructions for more information about installing a centrally managed version of Anaconda.

Using Anaconda with Cloudera CDH and Spark

You can submit Spark jobs that run on Anaconda by setting the PYSPARK_PYTHON environment variable to the location of the Anaconda Python interpreter, for example:

$ PYSPARK_PYTHON=/opt/continuum/anaconda/bin/python spark-submit pyspark_script.py
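If you submit jobs often, the one-line form above can be wrapped in a small shell function. The following is a minimal sketch, not part of Anaconda Scale or CDH: the function name submit_with_anaconda and the variables ANACONDA_PYTHON and SPARK_SUBMIT are hypothetical names chosen for illustration, and the default interpreter path is the one from the example above. The existence check covers only the node you submit from; it does not verify that Anaconda is present at the same path on every worker node.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): run spark-submit with
# PYSPARK_PYTHON pointing at the Anaconda interpreter.

# Default path taken from the example above; override it if Anaconda
# is installed elsewhere on your cluster nodes.
ANACONDA_PYTHON="${ANACONDA_PYTHON:-/opt/continuum/anaconda/bin/python}"

# Assumed override hook for choosing a specific spark-submit binary.
SPARK_SUBMIT="${SPARK_SUBMIT:-spark-submit}"

submit_with_anaconda() {
    # Fail early if the interpreter is missing on the submitting node.
    if [ ! -x "$ANACONDA_PYTHON" ]; then
        echo "Python interpreter not found: $ANACONDA_PYTHON" >&2
        return 1
    fi
    # Set PYSPARK_PYTHON for this command only, as in the one-line form,
    # and pass all arguments through to spark-submit unchanged.
    PYSPARK_PYTHON="$ANACONDA_PYTHON" "$SPARK_SUBMIT" "$@"
}
```

After sourcing the file that defines the function, a job could be submitted with `submit_with_anaconda pyspark_script.py`. Because the variable is set only for the single command, the rest of your shell session is unaffected.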