hadoop.settings {rmr2}    R Documentation
There are a few Hadoop settings that one should be aware of, and know how to modify, to allow the successful execution of mapreduce programs.
Since the transition to YARN and MR2, each mapreduce job needs to secure a container in order to execute. A container is a resource allocation unit, and the resource we are concerned with here is memory. At default settings, at least in a non-scientific sampling of deployments, the memory available to a container is used almost entirely by the map and reduce Java processes. This is not compatible with the Java process also launching an instance of the R interpreter, which is necessary for rmr2 to work. Therefore, by default, rmr2 modifies per-job settings so that the Java process uses 400MB of memory, leaving the rest for use by R. This assumes that the default container size is larger than 400MB and that R can work successfully in the remaining space. Under certain conditions, it is also possible that 400MB won't be enough for the Java process. To solve these problems, the user has access to a number of properties that can be set using configuration files, or on a per-job basis directly in rmr2 (see rmr.options, argument backend.parameters).

Four important properties are mapreduce.map.java.opts, mapreduce.reduce.java.opts, mapreduce.map.memory.mb and mapreduce.reduce.memory.mb. The first two are set by rmr2 to -Xmx400M, which sets the memory allocated to the map or reduce Java task. The other two control the size of the container for, respectively, the map and reduce phase; rmr2 leaves them at their default values unless the user decides otherwise. There are many other properties that control the execution environment of mapreduce jobs, but they are out of scope for this help entry (you are referred to the documentation accompanying your Hadoop distribution). These four, in the experience of the RHadoop team, are the ones that one needs to act upon most often.