mapreduce {rmr2}                R Documentation

MapReduce using Hadoop Streaming

Description

Defines and executes a map reduce job.

Usage

 mapreduce(
  input,
  output = NULL,
  map = to.map(identity),
  reduce = NULL,
  vectorized.reduce = FALSE,
  combine = NULL,
  in.memory.combine = FALSE,
  input.format = "native",
  output.format = "native",
  backend.parameters = list(),
  verbose = TRUE) 

Arguments

input

Paths to the input folder(s) (on HDFS), a vector thereof, or the return value of another mapreduce or a to.dfs call

output

A path to the destination folder (on HDFS); if missing, a big.data.object is returned, see "Value" below

map

An optional R function of two arguments, returning either NULL or the return value of keyval, that specifies the map operation to execute as part of a mapreduce job. The two arguments represent multiple key-value pairs according to the definition of the mapreduce model. They can be any of the following: list, vector, matrix, data frame or NULL (the last one only allowed for keys). Keys are matched to the corresponding values by position (by row for matrices and data frames, by element otherwise), analogous to the behavior of cbind; see keyval for details. A worked example appears in the sketches at the end of this section.

reduce

An optional R function of two arguments, a key and a data structure representing all the values associated with that key (the same type as returned by the map call, merged with rbind for matrices and data frames and c otherwise), returning either NULL or the return value of keyval, that specifies the reduce operation to execute as part of a mapreduce job. The default is no reduce phase, that is, the output of the map phase is the output of the mapreduce job; see the vectorized.reduce argument for an alternative interface.

vectorized.reduce

When TRUE, the arguments to the reduce function are to be construed as a collection of keys and of the values associated with them by position (by row when 2-dimensional). Identical keys are consecutive, and once a key appears, all the records associated with that key are passed to the same reduce call (complete group guarantee). This form of reduce has been introduced mostly for efficiency reasons when processing small reduce groups, where the records are small and few of them are associated with the same key. This option affects the combiner too; see the sketches at the end of this section.

combine

A function with the same signature and possible return values as the reduce function, or TRUE, which means use the reduce function as combiner. NULL means no combiner is used.

in.memory.combine

Apply the combiner just after calling the map function, before returning the results to Hadoop. This is useful to reduce the amount of I/O and (de)serialization work whenever combining small sets of records already has an effect (you may want to tune the input format to read more data for each map call together with this approach; see the read.size or nrow arguments available for a variety of formats)

input.format

Input format specification, see make.input.format

output.format

Output format specification, see make.output.format

backend.parameters

This option is for advanced users only and may be removed in the future. It specifies additional, backend-specific options, as in backend.parameters = list(hadoop = list(D = "mapred.reduce.tasks=1"), local = list()). It is recommended not to use this argument to change the semantics of mapreduce (the output should be independent of it). Each backend sees only the nested list named after the backend itself. The interpretation is the following: for the hadoop backend, generate an additional Hadoop Streaming command line argument for each element of the list, "-name value"; if the value is TRUE, generate "-name" only, and if it is FALSE, skip that element. One possible use is to specify the number of mappers and reducers on a per-job basis, as sketched at the end of this section. It is not guaranteed that the generated streaming command will be a legal command; in particular, remember to put any generic options before any specific ones, as per the Hadoop Streaming manual. For the local backend, the list is currently ignored.

verbose

Run hadoop in verbose mode. When FALSE, the job id and, on YARN, the application id are returned as attributes. This has no effect on the local backend.

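As a worked example of the map, reduce and combine arguments, here is a minimal word-count sketch. It assumes the rmr2 package is loaded and plain-text input at the hypothetical HDFS path "/tmp/wordcount-in"; the whitespace split is illustrative only.

  library(rmr2)

  ## map: emit one (word, 1) pair per word in the input lines
  wc.map = function(., lines)
    keyval(unlist(strsplit(lines, split = " +")), 1)

  ## reduce: sum the counts collected for each word
  wc.reduce = function(word, counts)
    keyval(word, sum(counts))

  out = mapreduce(
    input = "/tmp/wordcount-in",   # hypothetical HDFS path
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    combine = TRUE)                # reuse the reduce function as combiner

  from.dfs(out)                    # fetch the results into the current session
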
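With vectorized.reduce = TRUE, the reduce function receives whole batches of keys and values matched by position rather than one group at a time. One possible sketch, reusing the hypothetical word-count map above and aggregating groups with tapply:

  ## keys arrive in batches, with identical keys adjacent and groups complete,
  ## so per-key totals can be computed within each batch
  wc.vec.reduce = function(words, counts) {
    totals = tapply(counts, words, sum)
    keyval(names(totals), as.vector(totals))
  }

  mapreduce(
    input = "/tmp/wordcount-in",   # hypothetical HDFS path
    input.format = "text",
    map = wc.map,
    reduce = wc.vec.reduce,
    vectorized.reduce = TRUE)
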
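As an example of backend.parameters, the number of reduce tasks could be set on a per-job basis roughly as follows; the property name mapred.reduce.tasks is taken from the description above and may vary with the Hadoop version in use.

  mapreduce(
    input = "/tmp/wordcount-in",   # hypothetical HDFS path
    input.format = "text",
    map = wc.map,
    reduce = wc.reduce,
    backend.parameters = list(
      hadoop = list(D = "mapred.reduce.tasks=1"),  # becomes "-D mapred.reduce.tasks=1"
      local = list()))                             # the local backend ignores this list
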
Details

Defines and executes a mapreduce job. Jobs can be chained together simply by providing the return value of one as input to the other (see the sketch below).

The map and reduce functions will run in an environment that is a close approximation of the environment of this call, even if the actual execution happens in a different interpreter on a different machine. Changes to the outer environments performed inside the map and reduce functions with the <<- operator will only affect a per-process copy of the environment, not the original one, in a departure from established but seldom used R semantics. This is unlikely to change in the future because of the challenges inherent in adopting reference semantics in a parallel environment.

The map function should not read from standard input or write to standard output. Logging and debugging messages should be written to standard error instead; they will be redirected to the appropriate logs or to the console by the backend. If necessary, library functions that cannot be prevented from writing to standard output can be surrounded by a pair of sink calls, as in sink(stderr()); library.function(); sink(NULL). See also the Tutorial https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial

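A minimal sketch of job chaining and of shielding standard output inside a map function; the data are toy values written with to.dfs, and noisy.function stands for a hypothetical library function that cannot be kept from printing.

  library(rmr2)

  ## chaining: the return value of one job is the input of the next
  squares = mapreduce(
    input = to.dfs(1:10),
    map = function(k, v) keyval(v, v^2))
  sums = mapreduce(
    input = squares,
    map = function(k, v) keyval(k %% 2, v),
    reduce = function(k, vv) keyval(k, sum(vv)))
  from.dfs(sums)

  ## shielding standard output: wrap the offending call in sink calls
  quiet.map = function(k, v) {
    sink(stderr())        # divert anything printed to standard error
    noisy.function(v)     # hypothetical function that writes to standard output
    sink(NULL)
    keyval(k, v)
  }
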
Value

The value of output, or, when missing, a big.data.object

See Also

to.map and to.reduce can be used to convert other functions into suitable arguments for the map and reduce arguments; see the tests directory in the package for more examples


[Package rmr2 version 3.3.1 Index]