step_classdist {recipes} | R Documentation |
step_classdist
creates a specification of a recipe step
that will convert numeric data into Mahalanobis distance measurements to
the data centroid. This is done for each value of a categorical class
variable.
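For reference, the Mahalanobis distance from an observation x to the centroid of class k is conventionally written as below. The symbols (mu_k for the class center, Sigma_k for the class covariance matrix) are notation introduced here for illustration and do not come from the package itself.

```latex
D_k(x) = \sqrt{(x - \mu_k)^{\top} \, \Sigma_k^{-1} \, (x - \mu_k)}
```

Note that base R's mahalanobis() returns the squared form, D_k(x)^2, rather than the square root shown here.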
step_classdist(recipe, ..., class, role = "predictor", trained = FALSE, mean_func = mean, cov_func = cov, pool = FALSE, log = TRUE, objects = NULL)
recipe |
A recipe object. The step will be added to the sequence of operations for this recipe. |
... |
One or more selector functions to choose which variables are
affected by the step. See |
class |
A single character string that specifies a single categorical variable to be used as the class. |
role |
For model terms created by this step, what analysis role should they be assigned? By default, the function assumes that the resulting distances will be used as predictors in a model. |
trained |
A logical to indicate if the quantities for preprocessing have been estimated. |
mean_func |
A function to compute the center of the distribution. |
cov_func |
A function that computes the covariance matrix. |
pool |
A logical: should the covariance matrix be computed by pooling the data for all of the classes? |
log |
A logical: should the distances be transformed by the natural log function? |
objects |
Statistics are stored here once this step has been trained
by |
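The effect of the pool and log arguments can be sketched in base R. This is a minimal illustration, not the package's internal code: it assumes that "pooling" means estimating one covariance matrix from all rows regardless of class, which is one reading of the argument description above. The object names are hypothetical.

```r
# Illustrative sketch only; not the recipes implementation.
x <- as.matrix(iris[, 1:4])
is_setosa <- iris$Species == "setosa"

# Center of one class, computed column by column (mean_func analogue)
center <- colMeans(x[is_setosa, ])

# pool = FALSE analogue: covariance estimated from the class's own rows
cov_class <- cov(x[is_setosa, ])

# pool = TRUE analogue (assumption): covariance estimated from all rows
cov_pooled <- cov(x)

# stats::mahalanobis() returns squared distances
d2_class  <- mahalanobis(x, center, cov_class)
d2_pooled <- mahalanobis(x, center, cov_pooled)

# log = TRUE analogue: natural log of the distances
log_class  <- log(d2_class)
log_pooled <- log(d2_pooled)

summary(log_class)
summary(log_pooled)
```

Swapping colMeans() or cov() for a robust alternative corresponds to supplying a different mean_func or cov_func.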
step_classdist
will create a new column for every unique value of the
class
variable. The resulting variables will not replace the
original values and have the prefix classdist_
.
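The shape of the result can be mimicked in base R: one new column per class level, named with the classdist_ prefix, appended alongside the original columns. This is a sketch under the assumption that each class uses its own centroid and covariance matrix and that the log is taken; it is not guaranteed to reproduce the step's exact values.

```r
# Illustrative sketch only; not the recipes implementation.
x <- as.matrix(iris[, 1:4])
classes <- levels(iris$Species)

# One vector of log distances per class level
dist_cols <- lapply(classes, function(cl) {
  x_cl <- x[iris$Species == cl, , drop = FALSE]
  log(mahalanobis(x, colMeans(x_cl), cov(x_cl)))
})
names(dist_cols) <- paste0("classdist_", classes)

# Original columns are kept; the distance columns are appended
result <- cbind(iris, as.data.frame(dist_cols))
names(result)
```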
Note that the default covariance function requires each
class to have at least as many rows as there are variables listed in the
terms
argument. If pool = TRUE
, there must be at least as
many data points as variables overall.
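The reason for the row requirement is that the distance needs the inverse of the covariance matrix, and a sample covariance matrix estimated from too few rows is singular. The sketch below uses made-up numbers (x_small is hypothetical data) to show the rank deficiency. With the default cov(), a matrix estimated from n rows has rank at most n - 1, so in practice a class needs more rows than variables for the inverse to exist.

```r
# Three rows, four variables: fewer rows than variables (made-up data)
x_small <- matrix(c(1, 2, 4,
                    3, 1, 5,
                    2, 7, 1,
                    9, 4, 6), nrow = 3)

# Rank of the 4 x 4 covariance matrix is limited by the number of rows
rank_small <- qr(cov(x_small))$rank

# 50 setosa rows, four variables: enough rows for a full-rank estimate
x_big <- as.matrix(iris[iris$Species == "setosa", 1:4])
rank_big <- qr(cov(x_big))$rank

c(small = rank_small, big = rank_big, variables = ncol(x_big))
```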
An updated version of recipe with the new step added to the sequence of
existing steps (if any).
library(recipes)

# in case of missing data...
mean2 <- function(x) mean(x, na.rm = TRUE)

rec <- recipe(Species ~ ., data = iris) %>%
  step_classdist(all_predictors(), class = "Species",
                 pool = FALSE, mean_func = mean2)

# estimate the class centers and covariance matrices
rec_dists <- prep(rec, training = iris)

# apply the trained step to the data
dists_to_species <- bake(rec_dists, newdata = iris, everything())

## on log scale:
dist_cols <- grep("classdist", names(dists_to_species), value = TRUE)
dists_to_species[, c("Species", dist_cols)]