mapreduce {rmr2}    R Documentation
Defines and executes a map reduce job.

Usage

mapreduce(
  input,
  output = NULL,
  map = to.map(identity),
  reduce = NULL,
  vectorized.reduce = FALSE,
  combine = NULL,
  in.memory.combine = FALSE,
  input.format = "native",
  output.format = "native",
  backend.parameters = list(),
  verbose = TRUE)
Arguments

input
    Paths to the input folder(s) (on HDFS), a vector thereof, or the return value of another mapreduce or to.dfs call.
output
    A path to the destination folder (on HDFS); if missing, a big.data.object is returned (see 'Value' below).
map
    An optional R function of two arguments, a key and a value, returning either NULL or the return value of keyval, that specifies the map step of the job.
reduce
    An optional R function of two arguments, a key and a data structure representing all the values associated with that key (the same type as returned by the map call, merged with rbind for matrices and data frames and c otherwise), returning either NULL or the return value of keyval, that specifies the reduce step of the job.
vectorized.reduce
    When TRUE, the arguments to the reduce function should be construed as a collection of keys and the values associated with them by position (by row when 2-dimensional). Identical keys are consecutive, and once a key is present, all the records associated with it are passed to the same reduce call (complete-group guarantee). This form of reduce was introduced mainly for efficiency when processing small reduce groups, where the records are small and few of them share the same key. This option affects the combiner too.
combine
    A function with the same signature and possible return values as the reduce function, or TRUE, which means use the reduce function as the combiner; NULL means no combiner is used.
in.memory.combine
    Apply the combiner just after calling the map function, before returning the results to hadoop. This is useful to reduce the amount of I/O and (de)serialization work when combining small sets of records has any effect (you may want to tune the input format to read more data for each map call together with this approach; see the input.format argument and make.input.format).
input.format
    Input format specification; see make.input.format.
output.format
    Output format specification; see make.output.format.
backend.parameters
    This option is for advanced users only and may be removed in the future. Specify additional, backend-specific options as a list keyed by backend name (see the sketch after this list).
verbose
    Run hadoop in verbose mode. When FALSE, hadoop's informational output is suppressed.
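A minimal end-to-end sketch (assuming the rmr2 package is installed and a backend has been selected with rmr.options; all names and values are illustrative): count how many input values fall into each residue class mod 10, reusing the reduce function as a combiner.

    library(rmr2)
    rmr.options(backend = "local")  # in-process backend, convenient for testing

    # Write a small in-memory vector to the backend's file system
    small.ints = to.dfs(1:1000)

    # Map emits (v mod 10, 1) pairs; reduce sums the counts per key.
    # combine = TRUE reuses the reduce function as the combiner.
    out = mapreduce(
      input   = small.ints,
      map     = function(k, v) keyval(v %% 10, 1),
      reduce  = function(k, vv) keyval(k, sum(vv)),
      combine = TRUE)

    from.dfs(out)

    # backend.parameters takes a list keyed by backend name, e.g.
    # (illustrative Hadoop property, not needed for the local backend):
    #   backend.parameters = list(hadoop = list(D = "mapred.reduce.tasks=1"))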
Details

Defines and executes a mapreduce job. Jobs can be chained together by simply providing the return value of one as input to the other (see the sketch below). The map and reduce functions will run in an environment that is a close approximation of the environment of this call, even if the actual execution happens in a different interpreter on a different machine. Changes to the outer environments performed inside the map and reduce functions with the <<- operator will only affect a per-process copy of the environment, not the original one, in a departure from established but seldom used R semantics. This is unlikely to change in the future because of the challenges inherent in adopting reference semantics in a parallel environment. The map and reduce functions should not read from standard input or write to standard output. Logging and debugging messages should be written to standard error, and will be redirected to the appropriate logs or to the console by the backend. If necessary, library functions that cannot be prevented from writing to standard output can be surrounded by a pair of sink calls, as in sink(stderr()); library.function(); sink(NULL). See also the Tutorial: https://github.com/RevolutionAnalytics/RHadoop/wiki/Tutorial
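A hedged sketch of chaining, under the same assumptions as the example above: the object returned by the first job is passed directly as the input of the second.

    # First job: emit (v, v^2) pairs
    squares = mapreduce(
      input = to.dfs(1:100),
      map   = function(k, v) keyval(v, v^2))

    # Second job: consume the first job's return value directly,
    # summing the squares of even and odd numbers separately
    totals = mapreduce(
      input  = squares,
      map    = function(k, v) keyval(k %% 2, v),
      reduce = function(k, vv) keyval(k, sum(vv)))

    from.dfs(totals)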
Value

The value of output or, when missing, a big.data.object.
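When output is missing, the returned big.data.object can be read back into memory with from.dfs (a sketch, assuming the result is small enough to fit in memory):

    res = mapreduce(
      input = to.dfs(1:10),
      map   = function(k, v) keyval(v %% 2, v))

    kv = from.dfs(res)  # retrieve the key-value pairs
    keys(kv)            # accessor for the keys
    values(kv)          # accessor for the values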
See Also

to.map and to.reduce can be used to convert other functions into suitable arguments for the map and reduce arguments; see the tests directory in the package for more examples.
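A small sketch of to.map, using the identity default shown in Usage above to run a pass-through, map-only job:

    # to.map(identity) yields a map function that passes key-value pairs
    # through unchanged; the job simply copies its input
    passthrough = mapreduce(
      input = to.dfs(1:10),
      map   = to.map(identity))

    from.dfs(passthrough)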