data.table-package {data.table} | R Documentation |
Enhanced data.frame
Description
data.table
inherits from data.frame
. It offers fast subset, fast grouping, fast update, fast ordered joins and list columns in a short and flexible syntax, for faster development. It is inspired by A[B]
syntax in R where A
is a matrix and B
is a 2-column matrix. Since a data.table
is a data.frame
, it is compatible with R functions and packages that only accept data.frame
.
The 10 minute quick start guide to data.table
may be a good place to start: vignette("datatable-intro")
. Or, the first section of FAQs is intended to be read from start to finish and is considered core documentation: vignette("datatable-faq")
. If you have read and searched these documents and the help page below, please feel free to ask questions on datatable-help or the Stack Overflow data.table tag. To report a bug please type: bug.report(package="data.table")
.
Please check the homepage for up to the minute news.
Tip: one of the quickest ways to learn the features is to type example(data.table)
and study the output at the prompt.
Usage
data.table(..., keep.rownames=FALSE, check.names=FALSE, key=NULL)
## S3 method for class 'data.table'
x[i, j, by, keyby, with = TRUE,
nomatch = getOption("datatable.nomatch"), # default: NA_integer_
mult = "all",
roll = FALSE,
rollends = if (roll=="nearest") c(TRUE,TRUE)
else if (roll>=0) c(FALSE,TRUE)
else c(TRUE,FALSE),
which = FALSE,
.SDcols,
verbose = getOption("datatable.verbose"), # default: FALSE
allow.cartesian = getOption("datatable.allow.cartesian"), # default: FALSE
drop = NULL,
on = NULL # join without setting keys, new feature from v1.9.6+
]
Arguments
... |
Just as ... in data.frame . Usual recycling rules are applied to vectors of different lengths to create a list of equal length vectors.
|
keep.rownames |
If ... is a matrix or data.frame , TRUE will retain the rownames of that object in a column named rn .
|
check.names |
Just as check.names in data.frame .
|
key |
Character vector of one or more column names which is passed to setkey . It may be a single comma separated string such as key="x,y,z" , or a vector of names such as key=c("x","y","z") .
|
x |
A data.table .
|
i |
Integer, logical or character vector, single column numeric matrix , expression of column names, list or data.table .
integer and logical vectors work the same way they do in [.data.frame . Other than NA s in logical i are treated as FALSE and a single NA logical is not recycled to match the number of rows, as it is in [.data.frame .
character is matched to the first column of x 's key.
expression is evaluated within the frame of the data.table (i.e. it sees column names as if they are variables) and can evaluate to any of the other types.
When i is a data.table , x must have a key. i is joined to x using x 's key and the rows in x that match are returned. An equi-join is performed between each column in i to each column in x 's key; i.e., column 1 of i is matched to the 1st column of x 's key, column 2 to the second, etc. The match is a binary search in compiled C in O(log n) time. If i has fewer columns than x 's key then not all of x's key columns will be joined to (a common use case) and many rows of x will (ordinarily) match to each row of i . If i has more columns than x 's key, the columns of i not involved in the join are included in the result. If i also has a key, it is i 's key columns that are used to match to x 's key columns (column 1 of i 's key is joined to column 1 of x 's key, column 2 of i 's key to column 2 of x 's key, and so on for as long as the shorter key) and a binary merge of the two tables is carried out. In all joins the names of the columns are irrelevant; the columns of x 's key are joined to in order, either from column 1 onwards of i when i is unkeyed, or from column 1 onwards of i 's key. In code, the number of join columns is determined by min(length(key(x)),if (haskey(i)) length(key(i)) else ncol(i)) .
All types of 'i' may be prefixed with ! . This signals a not-join or not-select should be performed. Throughout data.table documentation, where we refer to the type of 'i', we mean the type of 'i' after the '!', if present. See examples.
Advanced: When i is an expression of column names that evaluates to data.table or list , a join is performed. We call this a self join.
Advanced: When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope.
|
j |
A single column name, single expresson of column names, list() of expressions of column names, an expression or function call that evaluates to list (including data.frame and data.table which are list s, too), or (when with=FALSE ) a vector of names or positions to select.
j is evaluated within the frame of the data.table ; i.e., it sees column names as if they are variables. Use j=list(...) to return multiple columns and/or expressions of columns. A single column or single expression returns that type, usually a vector. See the examples.
|
by |
A single unquoted column name, a list() of expressions of column names, a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end), or a character vector of column names.
The list() of expressions is evaluated within the frame of the data.table (i.e. it sees column names as if they are variables). The data.table is then grouped by the by and j is evaluated within each group. The order of the rows within each group is preserved, as is the order of the groups. j=list(...) may be omitted when there is just one expression, for convenience, typically a single expression such as sum(colB) ; e.g., DT[,sum(colB),by=colA] .
When by contains the first n columns of x 's key, we call this a keyed by. In a keyed by the groups appear contiguously in RAM and memory is copied in bulk internally, for extra speed. Otherwise, we call it an ad hoc by. Ad hoc by is still many times faster than tapply , for example, but just not as fast as keyed by when datasets are very large, in particular when the size of each group is large. Not to be confused with keyby= defined below.
Advanced: When i is a data.table , DT[i,j,by=.EACHI] evaluates j for the groups in 'DT' that each row in i joins to. That is, you can join (in i ) and aggregate (in j ) simultaneously. We call this grouping by each i. It is particularly memory efficient as you don't have to materialise the join result only to aggregate later. Please refer to this Stackoverflow answer for a more detailed explanation until we roll out vignettes.
Advanced: When grouping, symbols .SD , .BY , .N , .I and .GRP may be used in the j expression, defined as follows.
.SD is a data.table containing the Subset of x 's Data for each group, excluding any columns used in by (or keyby ).
.BY is a list containing a length 1 vector for each item in by . This can be useful when by is not known in advance. The by variables are also available to j directly by name; useful for example for titles of graphs if j is a plot command, or to branch with if() depending on the value of a group variable.
.N is an integer, length 1, containing the number of rows in the group. This may be useful when the column names are not known in advance and for convenience generally. When grouping by i , .N is the number of rows in x matched to, for each row of i , regardless of whether nomatch is NA or 0 . It is renamed to N (no dot) in the result (otherwise a column called ".N" could conflict with the .N variable, see FAQ 4.6 for more details and example), unless it is explicity named; e.g., DT[,list(total=.N),by=a] .
.I is an integer vector equal to seq_len(nrow(x)) . While grouping, it holds for each item in the group, it's row location in x . This is useful to subset in j ; e.g. DT[, .I[which.max(somecol)], by=grp] .
.GRP is an integer, length 1, containing a simple group counter. 1 for the 1st group, 2 for the 2nd, etc.
.SD , .BY , .N , .I and .GRP are read only. Their bindings are locked and attempting to assign to them will generate an error. If you wish to manipulate .SD before returning it, take a copy(.SD) first (see FAQ 4.5). Using := in the j of .SD is reserved for future use as a (tortuously) flexible way to update DT by reference by group (even when groups are not contiguous in an ad hoc by).
Advanced: In the X[Y,j] form of grouping, the j expression sees variables in X first, then Y . We call this join inherited scope. If the variable is not in X or Y then the calling frame is searched, its calling frame, and so on in the usual way up to and including the global environment.
|
keyby |
An ad-hoc-by or keyed-by (just as by= defined above) but with an additional setkey() run on the by columns of the result afterwards, for convenience. It is common practice to use 'keyby=' routinely when you wish the result to be sorted. Out loud we read keyby= as by= then setkey. Otherwise, 'by=' can be relied on to return the groups in the order they appear in the data.
|
with |
By default with=TRUE and j is evaluated within the frame of x ; column names can be used as variables. When with=FALSE j is a character vector of column names or a numeric vector of column positions to select, and the value returned is always a data.table . with=FALSE is often useful in data.table to select columns dynamically.
|
nomatch |
Same as nomatch in match . When a row in i has no match to x 's key, nomatch=NA (default) means NA is returned for x 's non-join columns for that row of i . 0 means no rows will be returned for that row of i . The default value (used when nomatch is not supplied) can be changed from NA to 0 using options(datatable.nomatch=0) .
|
mult |
When multiple rows in x match to the row in i , mult controls which are returned: "all" (default), "first" or "last" .
|
roll |
Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i 's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF). Usually, there should be no duplicates in x 's key, the last key column is a date (or time, or datetime) and all the columns of x 's key are joined to. A common idiom is to select a contemporaneous regular time series (dts ) across a set of identifiers (ids ): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date) and CJ stands for cross join.
When roll is a positive number, this limits how far values are carried forward. roll=TRUE is equivalent to roll=+Inf .
When roll is a negative number, values are rolled backwards; i.e., next observation carried backwards (NOCB). Use -Inf for unlimited roll back.
When roll is "nearest" , the nearest value is joined to.
|
rollends |
A logical vector length 2 (a single logical is recycled). When rolling forward (e.g. roll=TRUE ) if a value is past the last observation within each group defined by the join columns, rollends[2]=TRUE will roll the last value forwards. rollends[1]=TRUE will roll the first value backwards if the value is before it. If rollends=FALSE the value of i must fall in a gap in x but not after the end or before the beginning of the data, for that group defined by all but the last join column. When roll is a finite number, that limit is also applied when rolling the ends.
|
which |
TRUE returns the row numbers of x that i matches to. NA returns the row numbers of i that have no match in x . By default FALSE and the rows in x that match are returned.
|
.SDcols |
Advanced. Specifies the columns of x included in .SD . May be character column names or numeric positions. This is useful for speed when applying a function through a subset of (possible very many) columns; e.g., DT[,lapply(.SD,sum),by="x,y",.SDcols=301:350] .
|
verbose |
TRUE turns on status and information messages to the console. Turn this on by default using options(datatable.verbose=TRUE) . The quantity and types of verbosity may be expanded in future.
|
allow.cartesian |
FALSE prevents joins that would result in more than nrow(x)+nrow(i) rows. This is usually caused by duplicate values in i 's join columns, each of which join to the same group in 'x' over and over again: a misspecified join. Usually this was not intended and the join needs to be changed. The word 'cartesian' is used loosely in this context. The traditional cartesian join is (deliberately) difficult to achieve in data.table : where every row in i joins to every row in x (a nrow(x)*nrow(i) row result). 'cartesian' is just meant in a 'large multiplicative' sense.
|
drop |
Never used by data.table . Do not use. It needs to be here because data.table inherits from data.frame . See vignette("datatable-faq") .
|
on |
A named atomic vector of column names indicating which columns in i should be joined to which columns in x . See Examples .
|
Details
data.table
builds on base R functionality to reduce 2 types of time :
programming time (easier to write, read, debug and maintain)
compute time
It combines database like operations such as subset
, with
and by
and provides similar joins that merge
provides but faster. This is achieved by using R's column based ordered in-memory data.frame
structure, eval
within the environment of a list
, the [.data.table
mechanism to condense the features, and compiled C to make certain operations fast.
The package can be used just for rapid programming (compact syntax). Largest compute time benefits are on 64bit platforms with plentiful RAM, or when smaller datasets are repeatedly queried within a loop, or when other methods use so much working memory that they fail with an out of memory error.
As with [.data.frame
, compound queries can be concatenated on one line; e.g.,
DT[,sum(v),by=colA][V1<300][tail(order(V1))]
# sum(v) by colA then return the 6 largest which are under 300
The j
expression does not have to return data; e.g.,
DT[,plot(colB,colC),by=colA]
# produce a set of plots (likely to pdf) returning no data
Multiple data.table
s (e.g. X
, Y
and Z
) can be joined in many ways; e.g.,
X[Y][Z]
X[Z][Y]
X[Y[Z]]
X[Z[Y]]
A data.table
is a list
of vectors, just like a data.frame
. However :
it never has rownames. Instead it may have one key of one or more columns. This key can be used for row indexing instead of rownames.
it has enhanced functionality in [.data.table
for fast joins of keyed tables, fast aggregation, fast last observation carried forward (LOCF) and fast add/modify/delete of columns by reference with no copy at all.
Since a list
is a vector
, data.table
columns may be type list
. Columns of type list
can contain mixed types. Each item in a column of type list
may be different lengths. This is true of data.frame
, too.
Several methods are provided for data.table
, including is.na
, na.omit
,
t
, rbind
, cbind
, merge
and others.
Note
If keep.rownames
or check.names
are supplied they must be written in full because R does not allow partial argument names after '...
'. For example, data.table(DF,keep=TRUE)
will create a
column called "keep"
containing TRUE
and this is correct behaviour; data.table(DF,keep.rownames=TRUE)
was intended.
POSIXlt is not supported as a column type because it uses 40 bytes to store a single datetime. Unexpected errors may occur if you manage to create a column of type POSIXlt. Please see NEWS for 1.6.3, and IDateTime
instead. IDateTime has methods to convert to and from POSIXlt.
References
data.table
homepage: http://datatable.r-forge.r-project.org/
User reviews: http://crantastic.org/packages/data-table
http://en.wikipedia.org/wiki/Binary_search
http://en.wikipedia.org/wiki/Radix_sort
See Also
data.frame
, [.data.frame
, setkey
, J
, SJ
, CJ
, merge.data.table
, tables
, test.data.table
, IDateTime
, unique.data.table
, copy
, :=
, alloc.col
, truelength
, rbindlist
, setNumericRounding
Examples
## Not run:
example(data.table) # to run these examples at the prompt
## End(Not run)
DF = data.frame(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
DF
DT
identical(dim(DT),dim(DF)) # TRUE
identical(DF$a, DT$a) # TRUE
is.list(DF) # TRUE
is.list(DT) # TRUE
is.data.frame(DT) # TRUE
tables()
DT[2] # 2nd row
DT[,v] # v column (as vector)
DT[,list(v)] # v column (as data.table)
DT[2:3,sum(v)] # sum(v) over rows 2 and 3
DT[2:5,cat(v,"\n")] # just for j's side effect
DT[c(FALSE,TRUE)] # even rows (usual recycling)
DT[,2,with=FALSE] # 2nd column
colNum = 2
DT[,colNum,with=FALSE] # same
setkey(DT,x) # set a 1-column key. No quotes, for convenience.
setkeyv(DT,"x") # same (v in setkeyv stands for vector)
v="x"
setkeyv(DT,v) # same
# key(DT)<-"x" # copies whole table, please use set* functions instead
DT["a"] # binary search (fast)
DT[x=="a"] # same; i.e. binary search (fast)
DT[,sum(v),by=x] # keyed by
DT[,sum(v),by=key(DT)] # same
DT[,sum(v),by=y] # ad hoc by
DT["a",sum(v)] # j for one group
DT[c("a","b"),sum(v),by=.EACHI] # j for two groups
X = data.table(c("b","c"),foo=c(4,2))
X
DT[X] # join
DT[X,sum(v),by=.EACHI] # join and eval j for each row in i
DT[X,mult="first"] # first row of each group
DT[X,mult="last"] # last row of each group
DT[X,sum(v)*foo,by=.EACHI] # join inherited scope
setkey(DT,x,y) # 2-column key
setkeyv(DT,c("x","y")) # same
DT["a"] # join to 1st column of key
DT[.("a")] # same, .() is an alias for list()
DT[list("a")] # same
DT[.("a",3)] # join to 2 columns
DT[.("a",3:6)] # join 4 rows (2 missing)
DT[.("a",3:6),nomatch=0] # remove missing
DT[.("a",3:6),roll=TRUE] # rolling join (locf)
DT[,sum(v),by=.(y%%2)] # by expression
DT[,.SD[2],by=x] # 2nd row of each group
DT[,tail(.SD,2),by=x] # last 2 rows of each group
DT[,lapply(.SD,sum),by=x] # apply through columns by group
DT[,list(MySum=sum(v),
MyMin=min(v),
MyMax=max(v)),
by=.(x,y%%2)] # by 2 expressions
DT[,sum(v),x][V1<20] # compound query
DT[,sum(v),x][order(-V1)] # ordering results
print(DT[,z:=42L]) # add new column by reference
print(DT[,z:=NULL]) # remove column by reference
print(DT["a",v:=42L]) # subassign to existing v column by reference
print(DT["b",v2:=84L]) # subassign to new column by reference (NA padded)
DT[,m:=mean(v),by=x][] # add new column by reference by group
# NB: postfix [] is shortcut to print()
DT[,.SD[which.min(v)],by=x][] # nested query by group
DT[!.("a")] # not join
DT[!"a"] # same
DT[!2:4] # all rows other than 2:4
DT[x!="b" | y!=3] # not yet optimized, currently vector scans
DT[!.("b",3)] # same result but much faster
# new feature: 'on' argument, from v1.9.6+
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), mul=6:8)
DT1[DT2, on=c(x="x")] # join on columns 'x'
DT1[DT2, on="x"] # same as above
DT1[DT2, .(sum(a) * mul), by=.EACHI, on="x"] # using by=.EACHI
DT1[DT2, .(sum(a) * mul), by=.EACHI, on="x", nomatch=0L] # using by=.EACHI
# Follow r-help posting guide, support is here (*not* r-help) :
# http://stackoverflow.com/questions/tagged/data.table
# or
# datatable-help@lists.r-forge.r-project.org
## Not run:
vignette("datatable-intro")
vignette("datatable-faq")
test.data.table() # over 1,300 low level tests
update.packages() # keep up to date
## End(Not run)
[Package
data.table version 1.9.6
Index]