createDataPartition {caret} | R Documentation |
A series of test/training partitions are created using createDataPartition, while createResample creates one or more bootstrap samples. createFolds splits the data into k groups, while createTimeSlices creates cross-validation sample information to be used with time series data.
createDataPartition(y, times = 1, p = 0.5, list = TRUE, groups = min(5, length(y)))
createResample(y, times = 10, list = TRUE)
createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
createMultiFolds(y, k = 10, times = 5)
createTimeSlices(y, initialWindow, horizon = 1, fixedWindow = TRUE, skip = 0)
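y | a vector of outcomes. For createTimeSlices, these should be in chronological order.
times | the number of partitions to create
p | the percentage of data that goes to training, given as a proportion between 0 and 1 (the default 0.5 puts half of the data in training)
list | logical - should the results be in a list (TRUE) or a matrix (FALSE)
groups | for numeric y, the number of breaks in the quantiles used for stratified sampling
k | an integer for the number of folds.
returnTrain | a logical. When true, the values returned are the sample positions corresponding to the data used during training. This argument only works in conjunction with list = TRUE
initialWindow | The initial number of consecutive values in each training set sample
horizon | The number of consecutive values in each test set sample
fixedWindow | A logical: if FALSE, the training set always starts at the first sample and the training set size varies over the data splits.
skip | An integer specifying how many (if any) resamples to skip to thin the total amount.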
For bootstrap samples, simple random sampling is used.
For other data splitting, the random sampling is done within the levels of y when y is a factor, in an attempt to balance the class distributions within the splits.
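As a minimal sketch of the difference between the two sampling schemes, the example below splits the built-in iris data by its Species factor and then draws two bootstrap samples. It assumes the caret package is installed; the seed, the 80% split, and the variable names are illustrative choices, not part of this help page.

```r
library(caret)  # assumes caret is installed

set.seed(123)   # arbitrary seed, only for reproducibility

# iris$Species is a factor with three classes of 50 rows each.
# Sampling is done within each level, so the split stays balanced.
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train <- iris[in_train, ]
test  <- iris[-in_train, ]

# Class counts in the training and test sets
print(table(train$Species))
print(table(test$Species))

# Bootstrap samples use simple random sampling with replacement;
# each resample has as many positions as there are outcomes.
boot <- createResample(iris$Species, times = 2)
print(lengths(boot))
```

Comparing the two tables shows whether each class contributes the same share of rows to the training and test sets.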
For numeric y, the sample is split into sections based on percentiles, and sampling is done within these subgroups. For createDataPartition, the number of percentiles is set via the groups argument. For createFolds and createMultiFolds, the number of groups is set dynamically based on the sample size and k. For smaller sample sizes, these two functions may not do stratified splitting and, at most, will split the data into quartiles.
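The idea of percentile-based stratification for a numeric outcome can be sketched in base R. This is an illustration of the general technique only, not caret's actual implementation: the variable names, the use of cut() with quantile() breaks, and the ceiling-based group size are all assumptions made for the example.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

y      <- rgamma(100, shape = 3, rate = 0.5)  # skewed numeric outcome
groups <- 5                                    # number of percentile groups
p      <- 0.5                                  # proportion sent to training

# Break points at evenly spaced percentiles of y
breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
bins   <- cut(y, breaks = breaks, include.lowest = TRUE)

# Sample the same proportion of positions within every percentile group
train_idx <- sort(unlist(lapply(split(seq_along(y), bins), function(i) {
  i[sample.int(length(i), size = ceiling(length(i) * p))]
}), use.names = FALSE))

# The training and held-out values should have similar distributions
print(summary(y[train_idx]))
print(summary(y[-train_idx]))
```

Sampling within bins rather than from the whole vector keeps the tails of a skewed outcome represented on both sides of the split.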
Also, for createDataPartition, when class sizes are very small (<= 3), the classes may not show up in both the training and test data.
For multiple k-fold cross-validation, completely independent folds are created. The names of the list objects will denote the fold membership using the pattern "Foldi.Repj", meaning the ith section (of k) of the jth cross-validation set (of times). Note that createMultiFolds calls createFolds with list = TRUE and returnTrain = TRUE.
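A short sketch of the naming pattern, assuming the caret package is installed. The iris data, the seed, and the choice of k = 3 with times = 2 are illustrative only.

```r
library(caret)  # assumes caret is installed

set.seed(7)     # arbitrary seed, only for reproducibility

# 3-fold cross-validation repeated twice gives 3 * 2 = 6 list elements
folds <- createMultiFolds(iris$Species, k = 3, times = 2)
print(names(folds))

# Each element holds the row positions used for training in that fold;
# the held-out rows are the remaining positions.
held_out <- setdiff(seq_along(iris$Species), folds[[1]])
print(length(folds[[1]]))
print(length(held_out))
```

Because the elements are training positions, they can be passed on wherever a list of training indices is expected, and the held-out set is recovered with setdiff() as shown.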
Hyndman and Athanasopoulos (2013) discuss rolling forecasting origin techniques that move the training and test sets in time. createTimeSlices can create the indices for this type of splitting.
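The following sketch contrasts a fixed training window with a growing one on a series of nine time points. It assumes the caret package is installed; the object names fixed and growing are illustrative.

```r
library(caret)  # assumes caret is installed

y <- 1:9  # nine consecutive time points

# Fixed window: every training set has exactly initialWindow values
fixed <- createTimeSlices(y, initialWindow = 5, horizon = 1,
                          fixedWindow = TRUE)

# Growing window: every training set starts at the first time point
growing <- createTimeSlices(y, initialWindow = 5, horizon = 1,
                            fixedWindow = FALSE)

# Second slice of each scheme
print(fixed$train[[2]])    # positions 2 to 6
print(growing$train[[2]])  # positions 1 to 6
print(fixed$test[[2]])     # position 7
```

In both schemes the test positions always follow the training positions in time, so no future value is used to predict a past one.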
A list or matrix of row position integers corresponding to the training data.
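To make the shape of the return value concrete, the sketch below requests the same kind of partition as a list and as a matrix, and also asks createFolds for a non-list result. It assumes the caret package is installed; the seed and object names are illustrative.

```r
library(caret)  # assumes caret is installed

set.seed(42)    # arbitrary seed, only for reproducibility
y <- iris$Species

# list = TRUE (the default): one list element of row positions per partition
as_list <- createDataPartition(y, times = 2, p = 0.5)
str(as_list)

# list = FALSE: a matrix with one column of row positions per partition
as_matrix <- createDataPartition(y, times = 2, p = 0.5, list = FALSE)
print(dim(as_matrix))

# For createFolds, list = FALSE gives one fold number per outcome
fold_id <- createFolds(y, k = 5, list = FALSE)
print(table(fold_id))
```

The matrix form is convenient for direct row indexing with a single partition, while the list form is easier to iterate over when several partitions are requested.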
Max Kuhn; createTimeSlices by Tony Cooper
http://topepo.github.io/caret/splitting.html
Hyndman and Athanasopoulos (2013), Forecasting: principles and practice. https://www.otexts.org/fpp
library(caret)
data(oil)

# two stratified 50/50 partitions of the oil types
createDataPartition(oilType, 2)

# numeric outcome: compare the densities of the two halves
x <- rgamma(50, 3, .5)
inA <- createDataPartition(x, list = FALSE)

plot(density(x[inA]))
rug(x[inA])

points(density(x[-inA]), type = "l", col = 4)
rug(x[-inA], col = 4)

# two bootstrap samples
createResample(oilType, 2)

# folds as a list, then as a vector of fold numbers
createFolds(oilType, 10)
createFolds(oilType, 5, FALSE)

createFolds(rnorm(21))

# time slices with growing and fixed training windows
createTimeSlices(1:9, 5, 1, fixedWindow = FALSE)
createTimeSlices(1:9, 5, 1, fixedWindow = TRUE)

createTimeSlices(1:9, 5, 3, fixedWindow = TRUE)
createTimeSlices(1:9, 5, 3, fixedWindow = FALSE)

# thinning the number of slices with skip
createTimeSlices(1:15, 5, 3)
createTimeSlices(1:15, 5, 3, skip = 2)
createTimeSlices(1:15, 5, 3, skip = 3)