rhive-export {RHive}    R Documentation

Export R function to Hive using functions in Package ‘RHive’

Description

Export R functions and other R objects to Hive, and manage the exported objects and map/reduce scripts, using functions in package ‘RHive’.

Usage

rhive.export(exportName, pos=-1, limit=100*1024*1024, ALL=FALSE)
rhive.exportAll(exportName, pos=1, limit=100*1024*1024)
rhive.assign(name, value)
rhive.assign.export(name, value)
rhive.rm(name)
rhive.rm.export(name)
rhive.script.export(exportName, mapper=NULL, reducer=NULL, mapArgs=NULL,
  reduceArgs=NULL, bufferSize=-1L)
rhive.script.unexport(exportName)
rhive.export.script(exportName, mapper=NULL, reducer=NULL, mapArgs=NULL,
  reduceArgs=NULL, bufferSize=-1L)
rhive.unexport.script(exportName)
rhive.list.udfs()
rhive.rm.udf(exportName)

Arguments

exportName

name of the R function (or script) to be exported, given as a character string.

limit

limit on the total size of the exported objects, in bytes. The default is 100*1024*1024 (100 MB).

ALL

logical; if TRUE, export all objects. The default is FALSE.

name

a variable name, given as a character string.

value

a value to be assigned to 'name'

pos

where to do the assignment.

mapper

R object as map function or Hive query.

reducer

R object as reducer function.

bufferSize

streaming buffer size.

mapArgs

mapper custom environment.

reduceArgs

reducer custom environment.

Details

RHive supports the following additional Hive functions. One is RUDF and its syntax is R(export-R-function-name, arguments, ..., return-type).
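
As a rough illustration of the RUDF syntax, the sketch below assembles an R(...) call as a query string in plain R. The helper build_rudf_call is hypothetical and is not part of RHive. The return-type value 0.0 is taken from the example on this page, where it appears to mark a numeric result. Running the query needs a live Hive connection, so that step is shown only as a comment.

```r
# Hypothetical helper (not part of RHive): builds the text of an RUDF call
# in the form R('<exported name>',<arguments...>,<return-type>).
build_rudf_call <- function(fun_name, args, return_type) {
  sprintf("R('%s',%s,%s)", fun_name, paste(args, collapse = ","), return_type)
}

# Assemble the same query used in the Examples section.
hql <- paste("select", build_rudf_call("scoring", "sal", "0.0"), "from emp")
print(hql)

# With a connection open, the string could be run as: rhive.query(hql)
```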

Another is RUDAF and its syntax is RA(export-R-function-name, arguments, ...). An R function that runs via RUDAF must follow a naming rule. An R aggregation function is composed of four sub-functions. The first sub-function uses the user-defined name, that is, export-R-function-name. The second is named by appending '.partial' to the name of the first sub-function, the third by appending '.merge', and the fourth by appending '.terminate'.
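
The naming rule can be tried out locally without Hive. The sketch below defines the four sub-functions for a sum and runs them through a hypothetical driver, simulate_rudaf, which is not part of RHive. It assumes the usual order of a Hive aggregation (iterate over rows, take a partial result, merge partial results, terminate); how RHive actually schedules these calls is not described on this page.

```r
# The four sub-functions, named according to the rule above.
hsum <- function(prev, sal) if (is.null(prev)) sal else prev + sal
hsum.partial <- function(agg_sal) agg_sal
hsum.merge <- function(prev, agg_sal) if (is.null(prev)) agg_sal else prev + agg_sal
hsum.terminate <- function(agg_sal) agg_sal

# Hypothetical local driver (not part of RHive): looks up the sub-functions
# by name and applies them to values that have been split into chunks.
simulate_rudaf <- function(name, chunks) {
  iterate   <- get(name, mode = "function")
  partial   <- get(paste0(name, ".partial"), mode = "function")
  merge_fn  <- get(paste0(name, ".merge"), mode = "function")
  terminate <- get(paste0(name, ".terminate"), mode = "function")

  # Each chunk is iterated row by row, then reduced to a partial result.
  partials <- lapply(chunks, function(rows) {
    state <- NULL
    for (r in rows) state <- iterate(state, r)
    partial(state)
  })

  # Partial results are merged, then the final value is produced.
  merged <- NULL
  for (p in partials) merged <- merge_fn(merged, p)
  terminate(merged)
}

total <- simulate_rudaf("hsum", list(c(100, 200), c(300)))
print(total)
```

The driver finds the sub-functions purely from the base name, which is why all four must exist under the expected names before the function is exported.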

UDTF is a built-in table-generating function in Hive. RHive supports two kinds of UDTF, 'unfold' and 'expand'. The syntax of 'unfold' is unfold(value,col1-v,col2-v,...,delim) as (col1,col2,...). The 'unfold' function allows the user to change one column into many columns. The syntax of 'expand' is expand(value,col-v,delim) as (col). The 'expand' function allows the user to change one column into many rows.
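
To make the difference between the two concrete, the sketch below emulates their effect on a single delimited value in base R. The helpers unfold_local and expand_local are hypothetical and are not RHive code; they ignore the col-v arguments of the Hive syntax, whose role is not described on this page, and only show the change of shape.

```r
# Base R emulation only; these helpers are hypothetical, not part of RHive.

# Like 'unfold': split one delimited value into several named columns.
unfold_local <- function(value, col_names, delim) {
  parts <- strsplit(value, delim, fixed = TRUE)[[1]]
  names(parts) <- col_names
  as.data.frame(as.list(parts), stringsAsFactors = FALSE)
}

# Like 'expand': split one delimited value into several rows of one column.
expand_local <- function(value, col_name, delim) {
  parts <- strsplit(value, delim, fixed = TRUE)[[1]]
  out <- data.frame(parts, stringsAsFactors = FALSE)
  names(out) <- col_name
  out
}

wide <- unfold_local("a,b,c", c("col1", "col2", "col3"), ",")
long <- expand_local("a,b,c", "col", ",")
print(wide)  # one row, three columns
print(long)  # three rows, one column
```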

Author(s)

rhive@nexr.com

Examples

## connect to the Hive server
## Not run: rhive.connect("127.0.0.1")

## execute HQL (Hive query)
## Not run: rhive.query("select * from emp")


## define R function
## Not run: coff <- 5.2
## Not run: scoring <- function(sal) {
    coff * sal
}
## End(Not run)

## assign R object to Hive
## Not run: rhive.assign('scoring', scoring)
## Not run: rhive.assign('coff', coff)

## export R objects (scoring and coff) to Hive 
## Not run: rhive.exportAll('scoring')

## execute HQL using exported R objects
## name of UDF is 'R'
## Not run: rhive.query("select R('scoring',sal,0.0) from emp")

## delete R object in .rhiveExportEnv
## Not run: rhive.rm('scoring')
## Not run: rhive.rm('coff')

## define R aggregation function
## define iterate operator
## Not run: hsum <- function(prev, sal) {
    if(is.null(prev))
        sal
    else
        prev + sal
}
## End(Not run)
## define partial aggregation operator
## Not run: hsum.partial <- function(agg_sal) {
    agg_sal
}
## End(Not run)
## define merge operator
## Not run: hsum.merge <- function(prev, agg_sal) {
    if(is.null(prev))
        agg_sal
    else
        prev + agg_sal
}
## End(Not run)
## define final aggregation operator
## Not run: hsum.terminate <- function(agg_sal) {
    agg_sal
}
## End(Not run)

## Not run: rhive.assign('hsum', hsum)
## Not run: rhive.assign('hsum.partial', hsum.partial)
## Not run: rhive.assign('hsum.merge', hsum.merge)
## Not run: rhive.assign('hsum.terminate', hsum.terminate)
## Not run: rhive.exportAll('hsum')

## name of UDAF is 'RA'
## Not run: rhive.query("select RA('hsum',sal) from emp group by empno")


## export/unexport a user-defined map/reduce script
## Not run: 
map <- function(k, v) {
    if(is.null(v)) {
        put(NA,1)
    }
    lapply(v, function(vv) {
        lapply(strsplit(x=vv, split = "\t")[[1]],
            function(w) put(paste(args, w, sep = ""), 1))
    })
}

reduce <- function(k,vv) {
    put(k, sum(as.numeric(vv)))
}

mrscript <- rhive.script.export("scripttest", map, reduce)


## the REDUCE clause reads the columns produced by the MAP step
rhive.query(paste("from (from emp MAP ename,position USING '", mrscript[1],
    "' as position, one cluster by position) map_output ",
    "REDUCE map_output.position, map_output.one USING '",
    mrscript[2], "' as position, count", sep = ""))


## End(Not run)

## close connection
## Not run: rhive.close()

[Package RHive version 2.0-0.10 Index]