Overview of Spark, YARN, and HDFS¶
Apache Spark is an analytics engine and parallel computation framework with Scala, Python, and R interfaces. Spark can load data directly from local disk or memory, as well as from storage systems such as Amazon S3, the Hadoop Distributed File System (HDFS), HBase, and Cassandra.
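As a minimal sketch of what this looks like in PySpark, the snippet below loads text data from each kind of source. All paths and the bucket name are illustrative placeholders, and reading from S3 assumes an appropriate Hadoop S3 connector is configured on the cluster:

```python
# A sketch of loading data from different storage backends with PySpark.
# All paths below are illustrative placeholders, not real datasets.
from pyspark import SparkContext

sc = SparkContext(appName="storage-example")

local_lines = sc.textFile("file:///tmp/data.txt")        # local disk
hdfs_lines = sc.textFile("hdfs:///user/alice/data.txt")  # HDFS
s3_lines = sc.textFile("s3a://my-bucket/data.txt")       # Amazon S3

print(hdfs_lines.count())
sc.stop()
```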
You can install Spark on a cluster using an enterprise Hadoop distribution such as Cloudera CDH or Hortonworks HDP.
Submitting Spark Jobs¶
Spark jobs are often developed interactively and can be written as Python scripts or as Jupyter notebooks.
A Spark script can be submitted to a Spark cluster using various methods:
- Running the script directly on the head node.
- Using the spark-submit command either in Standalone mode or with the YARN resource manager.
- Interactively in an IPython shell or Jupyter Notebook on the cluster.
To run a script on the head node, execute python example.py on the cluster.
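As a sketch, an example.py along the following lines could be run this way. The word-count logic, input data, and application name are all illustrative:

```python
# example.py -- a minimal PySpark word count (illustrative).
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("example")
sc = SparkContext(conf=conf)

# Count words in a small in-memory dataset.
lines = sc.parallelize(["apache spark", "spark on yarn", "spark and hdfs"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())
sc.stop()
```

The same script can also be handed to the cluster with spark-submit, for example spark-submit --master yarn example.py under YARN, or spark-submit --master spark://<head-node>:7077 example.py in Standalone mode (7077 is the default port for a Spark Standalone master).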
You can install Jupyter Notebook on the cluster using Anaconda Scale. Refer to the Installation documentation for more information.