rhive-export {RHive} | R Documentation
Export R functions to Hive using functions in package ‘RHive’
rhive.export(exportName, pos=-1, limit=100*1024*1024, ALL=FALSE)
rhive.exportAll(exportName, pos=1, limit=100*1024*1024)
rhive.assign(name, value)
rhive.assign.export(name, value)
rhive.rm(name)
rhive.rm.export(name)
rhive.script.export(exportName, mapper=NULL, reducer=NULL, mapArgs=NULL, reduceArgs=NULL, bufferSize=-1L)
rhive.script.unexport(exportName)
rhive.export.script(exportName, mapper=NULL, reducer=NULL, mapArgs=NULL, reduceArgs=NULL, bufferSize=-1L)
rhive.unexport.script(exportName)
rhive.list.udfs()
rhive.rm.udf(exportName)
exportName | Name of the function to be exported.
limit | Maximum total size of the exported objects. The default is 100 MB (100*1024*1024).
ALL | Whether to export all objects.
name | A variable name, given as a character string.
value | The value to be assigned to 'name'.
pos | Where to do the assignment.
mapper | R object used as the map function, or a Hive query.
reducer | R object used as the reduce function.
bufferSize | Streaming buffer size.
mapArgs | Custom environment for the mapper.
reduceArgs | Custom environment for the reducer.
RHive supports the following additional Hive functions. One is RUDF, whose
syntax is R(export-R-function-name, arguments, ..., return-type).
Another is RUDAF, whose syntax is RA(export-R-function-name, arguments, ...).
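As a rough illustration of the RUDF call shape, the sketch below assembles the HQL string for a scalar function and previews the same function locally on an R vector. The helper build_rudf_query and the sample salaries are hypothetical and not part of RHive; passing 0.0 as the last argument to stand for the return type simply mirrors the usage in this page's examples.

```r
# Scalar function intended to be called through the R() UDF.
coff <- 5.2
scoring <- function(sal) {
    coff * sal
}

# Hypothetical helper (not part of RHive): assemble an HQL string of the
# form R('<function>', <column>, <return-type sample>).
build_rudf_query <- function(fun_name, column, return_sample, table) {
    sprintf("select R('%s',%s,%s) from %s",
            fun_name, column, return_sample, table)
}

query <- build_rudf_query("scoring", "sal", "0.0", "emp")
print(query)

# Local preview on made-up salaries; no Hive connection is involved.
sample_sal <- c(1000, 2500)
local_result <- vapply(sample_sal, scoring, numeric(1))
print(local_result)
```

With a live connection, the assembled string would be passed to rhive.query() after the function and its dependencies have been assigned and exported.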
An R function that runs via RUDAF must follow a function naming rule.
An R aggregation function is composed of four sub-functions, and each
sub-function is named according to this rule.
The first sub-function uses the user-defined name, which is export-R-function-name.
The second is named by appending '.partial' to the first sub-function's name.
The third is named by appending '.merge' to the first sub-function's name.
The fourth is named by appending '.terminate' to the first sub-function's name.
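The naming rule can be tried out locally with a small sketch. The driver simulate_rudaf below is hypothetical and not part of RHive: it looks up the four sub-functions by name and applies them in an assumed iterate, partial, merge, terminate order over made-up partitions. How Hive actually schedules these calls is not specified here.

```r
# Four sub-functions following the naming rule, here computing a sum.
hsum <- function(prev, sal) {
    if (is.null(prev)) sal else prev + sal
}
hsum.partial <- function(agg_sal) agg_sal
hsum.merge <- function(prev, agg_sal) {
    if (is.null(prev)) agg_sal else prev + agg_sal
}
hsum.terminate <- function(agg_sal) agg_sal

# Hypothetical local driver (not part of RHive): resolves the sub-functions
# from the base name and applies them in an assumed order.
simulate_rudaf <- function(base_name, partitions) {
    iterate   <- get(base_name, mode = "function")
    partial   <- get(paste0(base_name, ".partial"), mode = "function")
    merge_fn  <- get(paste0(base_name, ".merge"), mode = "function")
    terminate <- get(paste0(base_name, ".terminate"), mode = "function")

    merged <- NULL
    for (rows in partitions) {
        # Fold the rows of one (non-empty) partition into a partial state.
        state <- NULL
        for (value in rows) {
            state <- iterate(state, value)
        }
        # Combine this partition's partial result with the running total.
        merged <- merge_fn(merged, partial(state))
    }
    terminate(merged)
}

total <- simulate_rudaf("hsum", list(c(100, 200), c(300)))
print(total)
```

Only the base name is passed to the driver, which matches the way the RA() call names just the first sub-function.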
A UDTF is a table-generating function in Hive.
RHive supports two kinds of UDTF: unfold and expand.
The 'unfold' syntax is unfold(value,col1-v,col2-v,...,delim) as (col1,col2,...).
The 'unfold' function lets the user turn one column into many columns.
The 'expand' syntax is expand(value,col-v,delim) as (col).
The 'expand' function lets the user turn one column into many rows.
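The difference between the two reshapings can be seen on plain R character vectors. The functions local_unfold and local_expand below are hypothetical stand-ins, not part of RHive; they ignore the col-v arguments of the Hive syntax and assume every value splits into the same number of fields.

```r
# Imitates 'unfold': one delimited column becomes several columns.
# Assumes each value contains the same number of fields.
local_unfold <- function(values, col_names, delim) {
    parts <- strsplit(values, delim, fixed = TRUE)
    out <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
    names(out) <- col_names
    out
}

# Imitates 'expand': one delimited column becomes several rows.
local_expand <- function(values, col_name, delim) {
    out <- data.frame(unlist(strsplit(values, delim, fixed = TRUE)),
                      stringsAsFactors = FALSE)
    names(out) <- col_name
    out
}

raw <- c("a,b,c", "d,e,f")
wide <- local_unfold(raw, c("col1", "col2", "col3"), ",")
long <- local_expand(raw, "col", ",")
print(wide)
print(long)
```

Here the two input values become a data frame with two rows and three columns under local_unfold, and a single column with six rows under local_expand.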
## try to connect hive server
## Not run: rhive.connect("127.0.0.1")

## execute HQL(hive query)
## Not run: rhive.query("select * from emp")

## define R function
## Not run: coff <- 5.2
## Not run: 
scoring <- function(sal) {
    coff * sal
}
## End(Not run)

## assign R object to Hive
## Not run: rhive.assign('scoring', scoring)
## Not run: rhive.assign('coff', coff)

## export R objects (scoring and coff) to Hive
## Not run: rhive.exportAll('scoring')

## execute HQL using exported R objects
## name of UDF is 'R'
## Not run: rhive.query("select R('scoring',sal,0.0) from emp")

## delete R object in .rhiveExportEnv
## Not run: rhive.rm('scoring')
## Not run: rhive.rm('coff')

## define R aggregation function
## define iterate operator
## Not run: 
hsum <- function(prev, sal) {
    if(is.null(prev)) sal else prev + sal
}
## End(Not run)

## define partial aggregation operator
## Not run: 
hsum.partial <- function(agg_sal) {
    agg_sal
}
## End(Not run)

## define merge operator
## Not run: 
hsum.merge <- function(prev, agg_sal) {
    if(is.null(prev)) agg_sal else prev + agg_sal
}
## End(Not run)

## define final aggregation operator
## Not run: 
hsum.terminate <- function(agg_sal) {
    agg_sal
}
## End(Not run)

## Not run: rhive.assign('hsum', hsum)
## Not run: rhive.assign('hsum.partial', hsum.partial)
## Not run: rhive.assign('hsum.merge', hsum.merge)
## Not run: rhive.assign('hsum.terminate', hsum.terminate)
## Not run: rhive.exportAll('hsum')

## name of UDAF is 'RA'
## Not run: rhive.query("select RA('hsum',sal) from emp group by empno")

## export/unexport user define map/reduce script
## Not run: 
map <- function(k, v) {
    if(is.null(v)) {
        put(NA,1)
    }
    lapply(v, function(vv) {
        lapply(strsplit(x=vv, split = "\t")[[1]],
               function(w) put(paste(args, w, sep = ""), 1))
    })
}

reduce <- function(k,vv) {
    put(k, sum(as.numeric(vv)))
}

mrscript <- rhive.script.export("scripttest", map, reduce)

rhive.query(paste("from (from emp MAP ename,position USING '", mrscript[1],
                  "' as position, one cluster by position) ",
                  "map_output REDUCE map_output.aa, map_output.bb USING '",
                  mrscript[2], "' as position, count", sep = ""))
## End(Not run)

## close connection
## Not run: rhive.close()
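To see what the map and reduce functions above compute without a Hive cluster, the following sketch runs them against a hypothetical local harness. The put collector, the value given to args, and the sample input lines are all assumptions made for illustration; RHive's own streaming runtime supplies its own versions of these, and its grouping behaviour is only approximated here with split().

```r
# Hypothetical local harness (not part of RHive): 'put' collects emitted
# key/value pairs, and 'args' stands in for the mapper's custom arguments.
emitted <- new.env()
emitted$pairs <- list()
put <- function(key, value) {
    emitted$pairs[[length(emitted$pairs) + 1]] <- list(key = key, value = value)
    invisible(NULL)
}
args <- "w_"

# Map and reduce functions as defined in the examples above.
map <- function(k, v) {
    if (is.null(v)) {
        put(NA, 1)
    }
    lapply(v, function(vv) {
        lapply(strsplit(x = vv, split = "\t")[[1]],
               function(w) put(paste(args, w, sep = ""), 1))
    })
}
reduce <- function(k, vv) {
    put(k, sum(as.numeric(vv)))
}

# Map phase on two made-up tab-separated lines.
invisible(map(NULL, c("clerk\tsales", "clerk\tmanager")))
map_pairs <- emitted$pairs
emitted$pairs <- list()

# Group the emitted values by key, then run the reduce phase per key.
keys <- vapply(map_pairs, function(p) p$key, character(1))
values <- vapply(map_pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)
for (k in names(grouped)) {
    reduce(k, grouped[[k]])
}

counts <- vapply(emitted$pairs, function(p) p$value, numeric(1))
names(counts) <- vapply(emitted$pairs, function(p) p$key, character(1))
print(counts)
```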