simEval {resemble} | R Documentation |
This function searches for the most similar sample of each sample in a given data set based on a similarity/dissimilarity matrix (e.g. a distance matrix). The samples are compared against their corresponding most similar samples in terms of the side information provided. The root mean square of differences and the correlation coefficient are computed for continuous variables, and the kappa index is computed for discrete variables.
simEval(d, sideInf, lower.tri = FALSE, cores = 1, ...)
d |
a |
sideInf |
a |
lower.tri |
a |
cores |
number of cores used to find the nearest neighbours of similarity/dissimilarity scores (default = 1). See details. |
... |
additional parameters (for internal use only). |
For the evaluation of similarity/dissimilarity matrices this function uses side information (information about one variable which is available for a group of samples, Ramirez-Lopez et al., 2013). It is assumed that there is a correlation (or at least an indirect or secondary correlation) between this side informative variable and the spectra. In other words, this approach assumes that the similarity measures between the spectra of a given group of samples should also reflect their similarity in terms of the side informative variable (e.g. compositional similarity).
If sideInf is a numeric vector, the root mean square of differences (RMSD) is used for assessing the similarity between the samples and their corresponding most similar samples in terms of the side information provided. It is computed as follows:
RMSD = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \ddot{y}_i)^2}
where y_i is the value of the side variable of the ith sample, \ddot{y}_i is the value of the side variable of the nearest neighbour
of the ith sample and n is the total number of observations.
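The RMSD evaluation described above can be sketched in base R. This is only an illustration under stated assumptions, not the internal code of simEval: the helper nearest_neighbour() and the toy vectors x and y are hypothetical, and the nearest neighbour is taken to be the sample with the smallest dissimilarity once the sample itself is excluded.

```r
# Hypothetical helper (not part of resemble): index of the nearest neighbour
# of each sample in a square dissimilarity matrix, excluding the sample itself
nearest_neighbour <- function(d) {
  d <- as.matrix(d)
  diag(d) <- Inf          # a sample must not be matched with itself
  apply(d, 1, which.min)  # position of the smallest dissimilarity in each row
}

# Toy data (assumed): four samples described by a single variable,
# plus a numeric side variable
x <- c(1.0, 1.2, 5.0, 5.5)
y <- c(10, 12, 30, 33)

d    <- as.matrix(dist(x))   # Euclidean distances between the samples
nn   <- nearest_neighbour(d) # nearest neighbour index of each sample
y_nn <- y[nn]                # side information of the nearest neighbours

# RMSD and correlation between the side information of the samples
# and that of their nearest neighbours
rmsd <- sqrt(mean((y - y_nn)^2))
r    <- cor(y, y_nn)
```

In this toy case samples 1 and 2 are each other's nearest neighbours, as are samples 3 and 4, so the differences in the side variable are -2, 2, -3 and 3. A lower RMSD (and a higher correlation) would suggest that the dissimilarity matrix reflects the side information better.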
If sideInf is a factor, the kappa index (κ) is used instead of the RMSD. It is computed as follows:
\kappa = \frac{p_o - p_e}{1 - p_e}
where p_o and p_e are two different agreement indices between the side information of the samples and the side information of their corresponding nearest samples (i.e. most similar samples).
While p_o is the relative agreement, p_e is the agreement expected by chance.
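As a rough illustration of the kappa formula, the following base R sketch computes p_o and p_e from a cross-tabulation of the side information against the side information of the nearest neighbours. The helper kappa_index() and the toy factors are hypothetical, and the sketch assumes that p_e is obtained from the marginal proportions, as in Cohen's kappa. The source does not state how simEval computes p_e internally.

```r
# Hypothetical helper (not part of resemble): kappa index between two
# categorical vectors, assuming chance agreement from marginal proportions
kappa_index <- function(a, b) {
  lev <- union(levels(factor(a)), levels(factor(b)))
  tab <- table(factor(a, levels = lev), factor(b, levels = lev))
  p   <- tab / sum(tab)                # joint proportions
  p_o <- sum(diag(p))                  # relative (observed) agreement
  p_e <- sum(rowSums(p) * colSums(p))  # agreement expected by chance
  (p_o - p_e) / (1 - p_e)
}

# Toy data (assumed): side information of six samples and of their
# nearest neighbours
side    <- factor(c("clay", "clay", "sand", "sand", "silt", "silt"))
side_nn <- factor(c("clay", "clay", "sand", "silt", "silt", "silt"))

k <- kappa_index(side, side_nn)
k
# 0.75
```

Here five of the six pairs agree, so p_o = 5/6, and the marginal proportions give p_e = 1/3, hence κ = (5/6 - 1/3) / (1 - 1/3) = 0.75.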
Multi-threading for the computation of dissimilarities (see the cores parameter) is based on OpenMP and hence works only on Windows and Linux.
simEval returns a list with the following components:
eval
either the RMSD (and the correlation coefficient) or the kappa index
firstNN
a data.frame containing the original side informative variable in the first column and the side informative values of the corresponding nearest neighbours in the second column
Leonardo Ramirez-Lopez
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
## Not run:
library(resemble)
require(prospectr)

data(NIRsoil)

Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

# Example 1
# Compute a principal components distance
pca.d <- orthoDiss(Xr = Xr, pcSelection = list("cumvar", 0.999),
                   method = "pca", local = FALSE,
                   center = TRUE, scaled = TRUE)

# The final number of pcs used for computing the distance
# matrix of objects in Xr
pca.d$n.components

# The final distance matrix
ds <- pca.d$dissimilarity

# Example 1.1
# Evaluate the distance matrix on the basis of the
# side information (Yr) associated with Xr
se <- simEval(d = ds, sideInf = Yr)

# The final evaluation results
se$eval

# The final values of the side information (Yr) and the values of
# the side information corresponding to the first nearest neighbours
# found by using the distance matrix
se$firstNN

# Example 1.2
# Evaluate the distance matrix on the basis of two side information
# variables (Yr and Yr2) associated with Xr
Yr2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
se2 <- simEval(d = ds, sideInf = cbind(Yr, Yr2))

# The final evaluation results
se2$eval

# The final values of the side information variables and the values
# of the side information variables corresponding to the first
# nearest neighbours found by using the distance matrix
se2$firstNN

###
# Example 2
# Evaluate the distances produced by retaining different numbers of
# principal components (this is the same principle used in the
# optimized principal components approach ("opc"))

# First project the data
pca <- orthoProjection(Xr = Xr, method = "pca",
                       pcSelection = list("manual", 30),
                       center = TRUE, scaled = TRUE)

# Standardize the scores
scores.s <- sweep(pca$scores, MARGIN = 2, STATS = pca$sc.sdv, FUN = "/")

# One row per number of retained components: pcs, RMSD and correlation
rslt <- matrix(NA, ncol(scores.s), 3)
colnames(rslt) <- c("pcs", "rmsd", "r")
rslt[, 1] <- 1:ncol(scores.s)

for (i in 1:ncol(scores.s)) {
  # Use only the first i components to compute the distances
  sc.ipcs <- scores.s[, 1:i, drop = FALSE]
  di <- fDiss(Xr = sc.ipcs, method = "euclid",
              center = FALSE, scaled = FALSE)
  se <- simEval(d = di, sideInf = Yr)
  rslt[i, 2:3] <- unlist(se$eval)
}
plot(rslt)

###
# Example 3
# Example 3.1
# Evaluate a dissimilarity matrix computed using a moving window
# correlation method
mwcd <- mcorDiss(Xr = Xr, ws = 35, center = FALSE, scaled = FALSE)
se.mw <- simEval(d = mwcd, sideInf = Yr)
se.mw$eval

# Example 3.2
# Evaluate a dissimilarity matrix computed using the correlation
# method
cd <- corDiss(Xr = Xr, center = FALSE, scaled = FALSE)
se.nc <- simEval(d = cd, sideInf = Yr)
se.nc$eval
## End(Not run)