How to perform distributed natural language processing

Overview

This example provides a simple PySpark job that uses NLTK, a popular Python package for natural language processing. It demonstrates how to install Python libraries on the cluster, how to use Spark with the YARN resource manager, and how to run the PySpark job.

Who is this for?

This how-to is for users of a Spark cluster who want to run a PySpark job with the YARN resource manager. It shows you how to integrate third-party Python libraries such as NLTK with Spark.

Before you start

To execute this example, download the spark-nltk.py example script or spark-nltk.ipynb example notebook.

For this example you’ll need Spark running with the YARN resource manager. You can install Spark and YARN using an enterprise Hadoop distribution such as Cloudera CDH or Hortonworks HDP.

Install NLTK

Install NLTK on all of the cluster nodes with the adam scale command:

$ adam scale -n cluster conda install nltk

You should see output similar to this from each node, which indicates that the package was successfully installed across the cluster:

All nodes (x4) response:
{
  "actions": {
    "EXTRACT": [
      "conda-env-2.5.2-py27_0",
      "conda-4.1.11-py27_0"
    ],
    "FETCH": [
      "conda-env-2.5.2-py27_0",
      "conda-4.1.11-py27_0"
    ],
    "LINK": [
      "conda-env-2.5.2-py27_0 1 None",
      "conda-4.1.11-py27_0 1 None"
    ],
    "PREFIX": "/opt/continuum/anaconda",
    "SYMLINK_CONDA": [
      "/opt/continuum/anaconda"
    ],
    "UNLINK": [
      "conda-4.1.6-py27_0",
      "conda-env-2.5.1-py27_0"
    ],
    "op_order": [
      "RM_FETCHED",
      "FETCH",
      "RM_EXTRACTED",
      "EXTRACT",
      "UNLINK",
      "LINK",
      "SYMLINK_CONDA"
    ]
  },
  "success": true
}

You also need to download the NLTK sample data for this example. You can download the data to all cluster nodes with the adam cmd command:

$ adam cmd 'sudo /opt/continuum/anaconda/bin/python -m nltk.downloader -d /usr/share/nltk_data all'

Downloading the sample data takes a few minutes. After the download completes, you should see output similar to:

All nodes (x4) response: [nltk_data] Downloading collection 'all'
[nltk_data]    |
[nltk_data]    | Downloading package abc to /usr/share/nltk_data...
[nltk_data]    |   Unzipping corpora/abc.zip.
[nltk_data]    | Downloading package alpino to /usr/share/nltk_data...
[nltk_data]    |   Unzipping corpora/alpino.zip.
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     /usr/share/nltk_data...

....

[nltk_data]    |   Unzipping models/bllip_wsj_no_aux.zip.
[nltk_data]    | Downloading package word2vec_sample to
[nltk_data]    |     /usr/share/nltk_data...
[nltk_data]    |   Unzipping models/word2vec_sample.zip.
[nltk_data]    |
[nltk_data]  Done downloading collection all
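
As an optional sanity check, you can confirm that NLTK and the sample data are visible to the cluster's Python interpreter by running a small command on every node with adam cmd. This is a hedged example; adjust the interpreter path if your Anaconda installation lives in a different location:

$ adam cmd '/opt/continuum/anaconda/bin/python -c "import nltk; print nltk.corpus.state_union.fileids()[:3]"'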

Running the job

Here is the complete script to run the Spark + NLTK example in PySpark.

# spark-nltk.py
from pyspark import SparkConf
from pyspark import SparkContext

# Run against the YARN resource manager in client mode
conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('spark-nltk')
sc = SparkContext(conf=conf)

# Load one of the sample documents that ships with the NLTK data
data = sc.textFile('file:///usr/share/nltk_data/corpora/state_union/1972-Nixon.txt')

def word_tokenize(x):
    # Import NLTK inside the function so the import happens on the worker nodes
    import nltk
    return nltk.word_tokenize(x)

def pos_tag(x):
    import nltk
    return nltk.pos_tag([x])

# Split every line of the document into words
words = data.flatMap(word_tokenize)
print words.take(10)

# Attach a part-of-speech tag to each word
pos_word = words.map(pos_tag)
print pos_word.take(5)

Let’s walk through the code example above. First, we import SparkConf and SparkContext from PySpark and create a SparkContext that runs against the YARN resource manager in client mode.

from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('spark-nltk')
sc = SparkContext(conf=conf)

After the SparkContext is created, we can load some data into Spark. In this case, the data file is one of the example documents provided with the NLTK data. Note that we could also copy the data to HDFS and read it from there, as sketched below.

data = sc.textFile('file:///usr/share/nltk_data/corpora/state_union/1972-Nixon.txt')
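
As noted above, the same file could also be copied into HDFS and read from there instead of from the local filesystem on each node. A hedged sketch, where the /user/anaconda/nltk_data HDFS directory is only illustrative:

$ hdfs dfs -mkdir -p /user/anaconda/nltk_data
$ hdfs dfs -put /usr/share/nltk_data/corpora/state_union/1972-Nixon.txt /user/anaconda/nltk_data/

The textFile call would then point at the HDFS location instead:

data = sc.textFile('hdfs:///user/anaconda/nltk_data/1972-Nixon.txt')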

Next, we define a word_tokenize function that imports nltk on the Spark worker nodes and calls nltk.word_tokenize. The function is applied to each line of the text file with flatMap, producing a single RDD of words.

def word_tokenize(x):
    import nltk
    return nltk.word_tokenize(x)

words = data.flatMap(word_tokenize)

We can confirm that the flatMap operation worked by taking the first ten words from the resulting RDD.

print words.take(10)

Finally, NLTK’s part-of-speech tagger can be used to attach a part-of-speech tag to each word in the dataset.

def pos_tag(x):
    import nltk
    return nltk.pos_tag([x])

pos_word = words.map(pos_tag)
print pos_word.take(5)
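
Because pos_tag is applied to one word at a time here, each tag is assigned without any sentence context. A hedged variation, using a pos_tag_line helper that is not part of the original script, tags each tokenized line as a whole so the tagger can use the surrounding words:

def pos_tag_line(line):
    # Tokenize the line and tag all of its tokens together
    import nltk
    return nltk.pos_tag(nltk.word_tokenize(line))

tagged_lines = data.map(pos_tag_line)
print tagged_lines.take(2)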

Run the script on the Spark cluster using the spark-submit script.
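
For example, assuming spark-submit is on your PATH (its exact location depends on your Spark installation):

$ spark-submit spark-nltk.py

The output shows the words returned from the Spark script, including the results of the flatMap operation and the POS tagger.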

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/13 05:14:29 INFO SparkContext: Running Spark version 1.4.0

[...]

['Address',
 'on',
 'the',
 'State',
 'of',
 'the',
 'Union',
 'Delivered',
 'Before',
 'a']

[...]

[[('Address', 'NN')],
 [('on', 'IN')],
 [('the', 'DT')],
 [('State', 'NNP')],
 [('of', 'IN')]]

Troubleshooting

If something goes wrong, consult the Help and Support page.

Further information

See the Spark and PySpark documentation pages for more information.

For more information on NLTK, see the NLTK book.