Running PySpark with the YARN resource manager¶
This example runs a short PySpark script on a Spark cluster managed by the YARN resource manager. The script distributes a simple NumPy computation across the cluster nodes and prints the first ten results.
Who is this for?¶
This example is for users of a Spark cluster who wish to run a PySpark job using the YARN resource manager.
Before you start¶
Download the spark-yarn.py example script to your cluster.
You need Spark running with the YARN resource manager. You can install Spark and YARN using an enterprise Hadoop distribution such as Cloudera CDH or Hortonworks HDP.
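Before submitting the job, you can confirm that YARN can see the cluster's worker nodes. A minimal check, assuming the Hadoop client tools are on your PATH, is the standard YARN CLI:
yarn node -list
Healthy worker nodes are listed with a RUNNING state.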
Running the job¶
This script is almost identical to the one on the page Running PySpark as a Spark standalone job, which describes the code in more detail.
Here is the complete script to run the Spark + YARN example in PySpark:
# spark-yarn.py
from pyspark import SparkConf
from pyspark import SparkContext

# Connect to the cluster through YARN in client mode
conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

def mod(x):
    # Import NumPy inside the function so the import also runs on the worker nodes
    import numpy as np
    return (x, np.mod(x, 2))

# Distribute the numbers 0-999, compute x mod 2 for each, and return the first 10 pairs to the driver
rdd = sc.parallelize(range(1000)).map(mod).take(10)
print(rdd)
NOTE: You may need to install NumPy on the cluster nodes using:
adam scale -n cluster conda install numpy
Run the script on the Spark cluster with spark-submit.
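For example, assuming spark-submit is on your PATH and you run the command from the directory containing the script:
spark-submit spark-yarn.py
Because the script sets the master to yarn-client in its SparkConf, no --master option is needed on the command line.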
The output shows the first 10 values returned by the spark-yarn.py script.
16/05/05 22:26:53 INFO spark.SparkContext: Running Spark version 1.6.0
[...]
16/05/05 22:27:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 3242 bytes)
16/05/05 22:27:04 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:46587 (size: 2.6 KB, free: 530.3 MB)
16/05/05 22:27:04 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 652 ms on localhost (1/1)
16/05/05 22:27:04 INFO cluster.YarnScheduler: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/05/05 22:27:04 INFO scheduler.DAGScheduler: ResultStage 0 (runJob at PythonRDD.scala:393) finished in 4.558 s
16/05/05 22:27:04 INFO scheduler.DAGScheduler: Job 0 finished: runJob at PythonRDD.scala:393, took 4.951328 s
[(0, 0), (1, 1), (2, 0), (3, 1), (4, 0), (5, 1), (6, 0), (7, 1), (8, 0), (9, 1)]
Troubleshooting¶
If something goes wrong, see Help and support.