Using Anaconda with Cloudera CDH
There are two methods of using Anaconda on an existing cluster with Cloudera CDH, Cloudera’s Distribution Including Apache Hadoop: 1) the Anaconda parcel for Cloudera CDH, and 2) Anaconda for cluster management. The instructions below describe how to uninstall the Anaconda parcel on a CDH cluster and transition to Anaconda for cluster management.
Uninstalling the Anaconda parcel
If the Anaconda parcel is installed on the CDH cluster, use the following steps to uninstall the parcel. Otherwise, you can skip to the next section.
- From the Cloudera Manager Admin Console, click the Parcels indicator in the top navigation bar.
- Click the Deactivate button to the right of the Anaconda parcel listing.
- Click OK on the Deactivate prompt to deactivate the Anaconda parcel and restart Spark and related services.
- Click the arrow to the right of the Anaconda parcel listing and choose Remove From Hosts, which will prompt with a confirmation dialog.
- After you confirm, the Anaconda parcel is removed from the cluster nodes.
For more information about managing Cloudera parcels, refer to the Cloudera documentation.
Using Anaconda for cluster management
Anaconda for cluster management provides additional functionality, including the ability to manage multiple conda environments and packages (including Python and R) alongside an existing CDH cluster.
Configure the nodes with Anaconda for cluster management using the Bare-metal Cluster Setup instructions.
During this process, you will create a profile and provider that describe the cluster.
Provision the cluster using the following command, replacing cluster-cdh with the name of your cluster and profile-cdh with the name of your profile:
    $ acluster create cluster-cdh -p profile-cdh
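If you provision clusters regularly, it can help to keep the cluster and profile names in variables. The following is a minimal sketch, not part of Anaconda for cluster management: the variable names CLUSTER_NAME and PROFILE_NAME and the helper build_create_command are hypothetical, and the script only prints the command so that you can review it before running it yourself.

```shell
#!/bin/sh
# Hypothetical variables; the defaults are the placeholder names used above.
CLUSTER_NAME="${CLUSTER_NAME:-cluster-cdh}"
PROFILE_NAME="${PROFILE_NAME:-profile-cdh}"

build_create_command() {
    # Print the acluster invocation for the given cluster ($1) and profile ($2).
    # Only the "create NAME -p PROFILE" form shown in this document is used.
    printf 'acluster create %s -p %s\n' "$1" "$2"
}

# Show the command that would provision the cluster.
build_create_command "$CLUSTER_NAME" "$PROFILE_NAME"
```

Printing the command instead of executing it keeps the sketch safe to run on a machine where acluster is not installed.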
When you submit Spark jobs, set the PYSPARK_PYTHON environment variable to the Python interpreter in the Anaconda installation, for example:
    $ PYSPARK_PYTHON=/opt/anaconda/bin/python spark-submit pyspark_script.py
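A small wrapper can catch a wrong interpreter path before the job is submitted. The sketch below assumes Anaconda is installed at /opt/anaconda, as in the example above; the variable ANACONDA_PYTHON and the function submit_with_anaconda are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Assumed install location, taken from the example above; adjust if yours differs.
ANACONDA_PYTHON="${ANACONDA_PYTHON:-/opt/anaconda/bin/python}"

submit_with_anaconda() {
    # Refuse to submit if the interpreter is missing or not executable.
    if [ ! -x "$ANACONDA_PYTHON" ]; then
        echo "Python interpreter not found: $ANACONDA_PYTHON" >&2
        return 1
    fi
    # Set PYSPARK_PYTHON for this one command only and pass all arguments through.
    PYSPARK_PYTHON="$ANACONDA_PYTHON" spark-submit "$@"
}
```

After sourcing the script, you could run submit_with_anaconda pyspark_script.py. The check only covers the machine that runs spark-submit, so it does not confirm that the same path exists on every cluster node.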