rhive.basic {RHive} | R Documentation |
R distributed basic statistics functions using Hive
rhive.basic.mode(tableName, col, forcedRef=TRUE)
rhive.basic.range(tableName, col)
rhive.basic.merge(x, y, by.x, by.y, forcedRef=TRUE)
rhive.basic.xtabs(formula, tableName)
rhive.basic.cut(tableName, col, breaks, right=TRUE, summary=FALSE, forcedRef=TRUE)
rhive.basic.cut2(tableName, col1, col2, breaks1, breaks2, right=TRUE, keepCol=FALSE, forcedRef=TRUE)
rhive.basic.by(tableName, INDICES, fun, arguments, forcedRef=TRUE)
rhive.basic.scale(tableName, col)
rhive.basic.t.test(x, col1, y, col2)
rhive.block.sample(tableName, percent=0.01, seed=0, subset)
tableName |
Hive table name. |
x, y |
Hive table names, or objects which can be coerced to table names. |
by.x, by.y |
specifications of the common columns used for merging: by.x names the column in x and by.y the column in y. |
col |
column name |
col1 |
column name |
col2 |
column name |
formula |
a formula object with the cross-classifying variables (separated by '+') on the right hand side (or an object which can be coerced to a formula). |
breaks |
the cut points, given in one of three ways: a string of the form 'min:max:step' (where 'step' is optional), a numeric vector of two or more cut points, or a single number (greater than or equal to 2) giving the number of intervals into which the column is to be cut. |
breaks1 |
the breaks for col1. |
breaks2 |
the breaks for col2. |
summary |
logical, indicating whether to summarize the result of the cut. |
INDICES |
a list of columns to group by. |
fun |
the name of the Hive function to be applied. |
arguments |
the input data for the function, for example arguments = c("sal", "deptno", 3.2, "'NexR'"). |
right |
logical, indicating if the intervals should be closed on the right (and open on the left) or vice versa. |
keepCol |
logical, indicating whether to keep the original columns. |
forcedRef |
logical, indicating whether to force creation of a temporary table for the result. |
percent |
the percentage of the data size to be sampled. |
seed |
the index of the first selected block. |
subset |
an optional record-set specifying a subset of observations to be used. |
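The 'min:max:step' form of breaks and the effect of right can be illustrated locally with base R. The following is a minimal sketch, not RHive code: parse_breaks is a hypothetical helper, the salary values are made up, and it assumes that rhive.basic.cut treats intervals the way base cut() does, which the description of right suggests but this page does not confirm. The behaviour when 'step' is omitted is not covered here.

```r
# Hypothetical helper (not part of RHive): expand a 'min:max:step' string
# into an explicit vector of cut points.
parse_breaks <- function(spec) {
  parts <- as.numeric(strsplit(spec, ":", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  seq(parts[1], parts[2], by = parts[3])
}

cut_points <- parse_breaks("0:5000:1000")

# Made-up salaries; 1000 and 5000 sit exactly on cut points.
sal <- c(800, 1000, 2450, 5000)

# right = TRUE: intervals are (a, b], so 1000 falls in (0,1000].
right_closed <- cut(sal, breaks = cut_points, right = TRUE, dig.lab = 4)

# right = FALSE: intervals are [a, b), so 1000 falls in [1000,2000)
# and 5000 lies outside the last interval, giving NA.
left_closed <- cut(sal, breaks = cut_points, right = FALSE, dig.lab = 4)

print(data.frame(sal, right_closed, left_closed))
```

Values that land exactly on a cut point are the ones that move between intervals when right changes, so they are worth checking when choosing breaks.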
## try to connect to the Hive server
## Not run: rhive.connect("hive-server-ip")

## find the most frequent value of the specified column
## Not run: rhive.basic.mode('emp','deptno')

## calculate the min and max of the specified column
## Not run: rhive.basic.range('emp','sal')

## merge two tables using a shared column
## Not run: rhive.basic.merge('emp','dept', by.x = 'deptno', by.y = 'id')

DF <- as.data.frame(UCBAdmissions)
## Not run: rhive.write.table(DF)

## Nice for taking margins ...
## Not run: rhive.basic.xtabs('freq', c('gender', 'admit'), 'df')

## divide the range of a column into intervals
## Not run: rhive.basic.cut('emp', 'sal', breaks='0:5000:100')

## divide the ranges of two columns into intervals
## Not run: 
rhive.basic.cut2('emp', 'dept', 'sal', 'loc', breaks1='0:5000:100', breaks2='0:100:10')
## End(Not run)

## extract the sum of salaries by group
## Not run: rhive.basic.by('emp', 'deptno', 'sum', c("sal"))

## center and/or scale the columns of a table
## Not run: rhive.basic.scale('emp', 'sal')

## analyze two datasets
## Not run: rhive.basic.t.test(emp$sal, emp$age)

## sampling
## Not run: rhive.basic.sample("emp", subset="id < 100")

## close the connection
## Not run: rhive.close()
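To make the intent of the examples concrete without a Hive server, the sketch below computes comparable quantities on a small in-memory data frame with base R. It only illustrates what each result is expected to represent, assuming the RHive functions mirror their base R namesakes. It says nothing about how RHive builds or runs its Hive queries, and the emp data frame is made up for illustration.

```r
# Made-up stand-in for the 'emp' Hive table.
emp <- data.frame(
  deptno = c(10, 20, 20, 30, 30, 30),
  sal    = c(2450, 800, 3000, 950, 1250, 2850)
)

# Most frequent value of a column (compare rhive.basic.mode).
counts <- table(emp$deptno)
mode_deptno <- names(counts)[which.max(counts)]

# Minimum and maximum of a column (compare rhive.basic.range).
sal_range <- range(emp$sal)

# Sum of salaries by department (compare rhive.basic.by with fun = 'sum').
sal_by_dept <- aggregate(sal ~ deptno, data = emp, FUN = sum)

# Centered and scaled column (compare rhive.basic.scale).
sal_scaled <- as.vector(scale(emp$sal))

# Cross-tabulation of the UCBAdmissions counts (compare rhive.basic.xtabs).
DF <- as.data.frame(UCBAdmissions)
admissions <- xtabs(Freq ~ Gender + Admit, data = DF)

print(mode_deptno)
print(sal_range)
print(sal_by_dept)
print(admissions)
```

Running the same computation locally on a small sample is one way to sanity-check the shape of a result before applying the distributed version to a full table.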