postResample {caret} | R Documentation |
Given two numeric vectors of data, the root mean squared error and R-squared are calculated. For two factors, the overall agreement rate and Kappa are determined.
postResample(pred, obs)
defaultSummary(data, lev = NULL, model = NULL)
twoClassSummary(data, lev = NULL, model = NULL)
mnLogLoss(data, lev = NULL, model = NULL)
multiClassSummary(data, lev = NULL, model = NULL)
R2(pred, obs, formula = "corr", na.rm = FALSE)
RMSE(pred, obs, na.rm = FALSE)
getTrainPerf(x)
pred |
A vector of numeric data (could be a factor) |
obs |
A vector of numeric data (could be a factor) |
data |
a data frame or matrix with columns |
lev |
a character vector of factor levels for the response. In regression cases, this would be |
model |
a character string for the model name (as taken from the |
formula |
which R^2 formula should be used? Either "corr" or "traditional". See Kvalseth (1985) for a summary of the different equations. |
na.rm |
a logical value indicating whether |
x |
an object of class |
.
postResample is meant to be used with apply across a matrix. For numeric data, the code checks to see if the standard deviation of either vector is zero. If so, the correlation between those samples is assigned a value of zero. NA values are ignored everywhere.
Note that many models have more predictors (or parameters) than data points, so the typical mean squared error denominator (n - p) does not apply. Root mean squared error is calculated using sqrt(mean((pred - obs)^2)).
Also, R^2 is calculated either as the square of the correlation between the observed and predicted outcomes when formula = "corr" or, when formula = "traditional", as
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
where \bar{y} is the mean of the observed values.
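The calculations described above can be sketched in base R. This is an illustration under stated assumptions, not caret's source code: the helper names rmse_sketch, r2_corr and r2_traditional are hypothetical, and the zero-variance rule simply mirrors the behaviour described in this section.

```r
# Hypothetical helpers written in base R; not caret's implementation.
rmse_sketch <- function(pred, obs) sqrt(mean((pred - obs)^2))

r2_corr <- function(pred, obs) {
  # Documented rule: if either vector has zero standard deviation,
  # the correlation is treated as zero.
  if (sd(pred) == 0 || sd(obs) == 0) return(0)
  cor(pred, obs)^2
}

r2_traditional <- function(pred, obs) {
  # 1 - (residual sum of squares) / (total sum of squares)
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

obs  <- c(1, 2, 3, 4, 5)
pred <- c(1.1, 1.9, 3.2, 3.8, 5.3)

rmse_sketch(pred, obs)
r2_corr(pred, obs)
r2_traditional(pred, obs)
r2_corr(rep(3, 5), obs)  # constant predictions
```

The two R^2 definitions generally differ: the correlation-based value is unaffected by systematic bias in the predictions, while the traditional value is reduced by it.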
defaultSummary is the default function to compute performance metrics in train. It is a wrapper around postResample.
twoClassSummary computes sensitivity, specificity and the area under the ROC curve.
mnLogLoss computes the minus log-likelihood of the multinomial distribution (without the constant term):
logLoss = -\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^C y_{ij} \log(p_{ij})
where the y values are binary indicators for the classes and p are the predicted class probabilities.
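As a rough illustration of the log loss formula, the following base R sketch builds the indicator matrix y from a factor and averages the negative log probabilities. The helper name log_loss_sketch is hypothetical, and the sketch assumes all probabilities are strictly positive; it is not caret's implementation.

```r
# Hypothetical helper; base R only.
log_loss_sketch <- function(obs, probs) {
  # obs:   factor of observed classes (length n)
  # probs: n x C matrix of class probabilities, columns named by level
  lev <- levels(obs)
  # n x C matrix of 0/1 indicators: y[i, j] = 1 if sample i is in class j
  y <- outer(as.character(obs), lev, "==") * 1
  -sum(y * log(probs[, lev, drop = FALSE])) / length(obs)
}

obs   <- factor(c("a", "b", "a"), levels = c("a", "b"))
probs <- cbind(a = c(0.9, 0.2, 0.6), b = c(0.1, 0.8, 0.4))

log_loss_sketch(obs, probs)
```

Only the probability assigned to the observed class contributes to each sample's term, so here the result equals -(log(0.9) + log(0.8) + log(0.6)) / 3.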
multiClassSummary computes some overall measures of performance (e.g. overall accuracy and the Kappa statistic) and several averages of statistics calculated from "one-versus-all" configurations. For example, if there are three classes, three sets of sensitivity values are determined and the average is reported with the name "Mean_Sensitivity". The same is true for a number of statistics generated by confusionMatrix. With two classes, the basic sensitivity is reported with the name "Sensitivity".
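The "one-versus-all" averaging can be illustrated with a small base R sketch for sensitivity. The data below are made up for illustration, and the code only mimics the idea; it does not reproduce how multiClassSummary or confusionMatrix compute their results internally.

```r
# Made-up three-class example.
obs  <- factor(c("a", "a", "b", "b", "c", "c"), levels = c("a", "b", "c"))
pred <- factor(c("a", "b", "b", "b", "c", "a"), levels = c("a", "b", "c"))

# One-versus-all sensitivity for class k: among samples whose observed
# class is k, the fraction that were also predicted as k.
sens <- sapply(levels(obs), function(k) mean(pred[obs == k] == k))

sens        # one sensitivity value per class
mean(sens)  # the kind of average reported as "Mean_Sensitivity"
```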
To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE. multiClassSummary can be used without class probabilities but some statistics (e.g. overall log loss and the average of per-class area under the ROC curves) will not be in the result set.
Other functions can be used via the summaryFunction argument of trainControl. Custom functions must have the same arguments as defaultSummary.
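A minimal sketch of such a custom function is shown below. The name maeSummary and the choice of mean absolute error are hypothetical; the sketch assumes that data carries obs and pred columns, as in the examples on this page, and that a named numeric vector is an acceptable return value.

```r
# Hypothetical custom summary function with the same arguments as
# defaultSummary(data, lev = NULL, model = NULL).
maeSummary <- function(data, lev = NULL, model = NULL) {
  # Assumes data has numeric columns obs and pred.
  c(MAE = mean(abs(data$pred - data$obs)))
}

dat <- data.frame(obs = c(1, 2, 3), pred = c(1.5, 2, 2))
maeSummary(dat)
```

With caret loaded, a function of this shape would be supplied as the summaryFunction argument of trainControl.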
The function getTrainPerf returns a one-row data frame with the resampling results for the chosen model. The statistics will have the prefix "Train" (e.g. "TrainROC"). There is also a column called "method" that echoes the argument of the call to trainControl of the same name.
A vector of performance estimates.
Max Kuhn, Zachary Mayer
Kvalseth. Cautionary note about R^2. American Statistician (1985) vol. 39 (4) pp. 279-285
library(caret)

# Regression: each column of predicted is scored against observed
predicted <- matrix(rnorm(50), ncol = 5)
observed <- rnorm(10)
apply(predicted, 2, postResample, obs = observed)

# Classification: observed and predicted classes plus class probabilities
classes <- c("class1", "class2")
set.seed(1)
dat <- data.frame(obs = factor(sample(classes, 50, replace = TRUE)),
                  pred = factor(sample(classes, 50, replace = TRUE)),
                  class1 = runif(50),
                  class2 = runif(50))
defaultSummary(dat, lev = classes)
twoClassSummary(dat, lev = classes)
mnLogLoss(dat, lev = classes)