orthoProjection {resemble}    R Documentation

Orthogonal projections using partial least squares and principal component analysis

Description

Functions to perform orthogonal projections of high dimensional data matrices using partial least squares (pls) and principal component analysis (pca).

Usage

orthoProjection(Xr, X2 = NULL, 
                Yr = NULL, 
                method = "pca", pcSelection = list("cumvar", 0.99), 
                center = TRUE, scaled = FALSE, cores = 1, ...)
                
pcProjection(Xr, X2 = NULL, Yr = NULL, 
             pcSelection = list("cumvar", 0.99), 
             center = TRUE, scaled = FALSE, 
             method = "pca",
             tol = 1e-6, max.iter = 1000, 
             cores = 1, ...)  
              
plsProjection(Xr, X2 = NULL, Yr, 
              pcSelection = list("opc", 40), 
              scaled = FALSE, 
              tol = 1e-6, max.iter = 1000, 
              cores = 1, ...) 
              
## S3 method for class 'orthoProjection'
predict(object, newdata, ...)

Arguments

Xr

a matrix (or data.frame) containing the (reference) data.

X2

an optional matrix (or data.frame) containing data of a second set of observations (samples).

Yr

the side information corresponding to the spectra in Xr. It must be provided if the method used in the pcSelection argument is "opc" or if the sm argument is either "pls" or "loc.pls". It is equivalent to the sideInf parameter of the simEval function. It can be a numeric vector or matrix (for one or more continuous variables), in which case the root mean square of differences (rmsd) is used for assessing the similarity between the samples and their corresponding most similar samples in terms of the side information provided. When sm = "pc", this parameter can also be a single discrete variable of class factor, in which case the kappa index is used. See the simEval function for more details.

method

the method for projecting the data. Options are: "pca" (principal component analysis using the singular value decomposition algorithm), "pca.nipals" (principal component analysis using the non-linear iterative partial least squares algorithm) and "pls" (partial least squares).

pcSelection

a list which specifies the method to be used for identifying the number of principal components to be retained for computing the Mahalanobis distance of each sample in X2 to the centre of Xr. It also specifies the number of components in any of the following cases: sm = "pc", sm = "loc.pc", sm = "pls" and sm = "loc.pls". This list must contain two objects in the following order:

  • method: the method for selecting the number of components. Possible options are: "opc" (optimized pc selection based on Ramirez-Lopez et al. (2013a, 2013b), in which the side information concept is used, see details); "cumvar" (for selecting the number of principal components based on a given cumulative amount of explained variance); "var" (for selecting the number of principal components based on a given amount of explained variance); and "manual" (for specifying manually the desired number of principal components).

  • value: a numerical value that complements the selected method. If "opc" is chosen, it must be a value indicating the maximal number of principal components to be tested (see Ramirez-Lopez et al., 2013a, 2013b). If "cumvar" is chosen, it must be a value (higher than 0 and lower than 1) indicating the maximum amount of cumulative variance that the retained components should explain. If "var" is chosen, it must be a value (higher than 0 and lower than 1) indicating that components that explain (individually) a variance lower than this threshold must be excluded. If "manual" is chosen, it must be a value specifying the desired number of principal components to retain.

As shown in Usage, the default for the pcSelection argument is list("cumvar", 0.99) in orthoProjection and pcProjection, and list("opc", 40) in plsProjection (i.e. a maximum of 40 components to be tested). Optionally, the pcSelection argument admits "opc", "cumvar", "var" or "manual" as a single character string. In such a case the default "value" is 40 when either "opc" or "manual" is used, 0.99 when "cumvar" is used, and 0.01 when "var" is used.
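
The selection rules above can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy data, the object names (n_cumvar, n_var, n_manual) and the handling of the "cumvar" boundary are assumptions made for illustration only.

```r
# Toy centred data matrix (hypothetical values)
set.seed(42)
X <- scale(matrix(rnorm(200), nrow = 20), center = TRUE, scale = FALSE)

# Proportion of variance explained by each component, from the singular values
d <- svd(X)$d
expl <- d^2 / sum(d^2)

# "cumvar": components whose cumulative explained variance stays within the
# limit (assumed boundary handling; at least one component is always kept)
n_cumvar <- max(1, sum(cumsum(expl) <= 0.99))

# "var": drop components that individually explain less than the threshold
n_var <- sum(expl >= 0.01)

# "manual": the requested number, capped at the number of available components
n_manual <- min(5, length(d))
```

The "opc" rule is not shown here because it additionally requires the side information Yr.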

center

a logical indicating if the data Xr (and X2 if specified) must be centered. If X2 is specified the data is centered on the basis of the union of Xr and X2. This argument only applies to the principal components projection. For pls projections the data is always centered.

scaled

a logical indicating if Xr (and X2 if specified) must be scaled. If X2 is specified the data is scaled on the basis of the union of Xr and X2.
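
As a rough illustration of the center and scaled arguments, the base R sketch below centers and scales two sets using statistics computed on their union. Scaling by the column standard deviation is an assumption (this page does not state which scaling is applied), and all object names are hypothetical.

```r
set.seed(7)
Xr <- matrix(rnorm(30, mean = 5), nrow = 10)  # toy reference set
X2 <- matrix(rnorm(12, mean = 6), nrow = 4)   # toy second set

# Column means and standard deviations computed on the union of both sets
Xall <- rbind(Xr, X2)
mu   <- colMeans(Xall)
sdv  <- apply(Xall, 2, sd)

# Apply the same centering and scaling to each set
Xr_cs <- sweep(sweep(Xr, 2, mu, "-"), 2, sdv, "/")
X2_cs <- sweep(sweep(X2, 2, mu, "-"), 2, sdv, "/")
```

Because both sets share the same mu and sdv, their projected scores remain directly comparable.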

cores

number of cores used when method in pcSelection is "opc", which can be computationally intensive (default = 1). See details.

...

additional arguments to be passed to pcProjection or plsProjection.

tol

tolerance limit for convergence of the nipals algorithm (default is 1e-06). In the case of PLS this applies only to Yr with more than one variable.

max.iter

maximum number of iterations (default is 1000). In the case of method = "pls" this applies only to Yr matrices with more than one variable.
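
To show how tol and max.iter typically interact in an iterative nipals fit, here is a minimal base R sketch that extracts the first principal component only. The helper nipals_pc1 is hypothetical and the convergence criterion (squared change in the scores) is an assumption; the package may use a different one.

```r
# Hypothetical helper: first principal component of a centred matrix by NIPALS
nipals_pc1 <- function(X, tol = 1e-6, max.iter = 1000) {
  # Initialise the scores with the column of largest variance
  t_old <- X[, which.max(apply(X, 2, var))]
  for (i in seq_len(max.iter)) {
    p <- crossprod(X, t_old) / sum(t_old^2)  # regress columns of X on the scores
    p <- p / sqrt(sum(p^2))                  # normalise loadings to unit length
    t_new <- X %*% p                         # update the scores
    converged <- sum((t_new - t_old)^2) < tol
    t_old <- t_new
    if (converged) break                     # stop once the scores stabilise
  }
  list(scores = as.vector(t_old), loadings = as.vector(p), iterations = i)
}

set.seed(5)
# Toy data with one dominant direction plus small noise
X <- outer(rnorm(30), c(3, 2, 1, 0.5)) + matrix(rnorm(120, sd = 0.1), nrow = 30)
Xc <- scale(X, center = TRUE, scale = FALSE)
fit <- nipals_pc1(Xc)
```

Up to sign, fit$loadings should be close to the first right singular vector returned by svd(Xc).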

object

object of class "orthoProjection" (as returned by orthoProjection, pcProjection or plsProjection).

newdata

an optional data frame or matrix in which to look for variables with which to predict. If omitted, the scores are used. It must contain the same number of columns as the data used to compute the projection, in the same order.
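
Conceptually, projecting new observations amounts to applying the reference centering and multiplying by the loadings. The base R sketch below illustrates this for a pca projection; it does not reproduce the internals of predict.orthoProjection, and all object names are hypothetical.

```r
set.seed(3)
Xr      <- matrix(rnorm(60), nrow = 15)  # toy reference data
newdata <- matrix(rnorm(20), nrow = 5)   # toy new data, same columns and order

# Centre the reference data and keep the loadings of the first two components
mu   <- colMeans(Xr)
Xr_c <- sweep(Xr, 2, mu, "-")
V    <- svd(Xr_c)$v[, 1:2]

# Project the new data using the reference centre and loadings
new_scores <- sweep(newdata, 2, mu, "-") %*% V
```

This is why newdata must have the same columns, in the same order, as the data used to build the projection.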

Details

In the case of method = "pca", the algorithm used is the singular value decomposition, in which a given data matrix X is factorized as follows:

X = UDV^{\mathrm{T}}

where U and V are orthogonal matrices, U is the matrix of left singular vectors of X, D is a diagonal matrix containing the singular values of X, and V is the matrix of right singular vectors of X. The matrix of principal component scores is obtained by the matrix multiplication of U and D, and the matrix of principal component loadings is equivalent to the matrix V.

When method = "pca.nipals", the algorithm used for principal component analysis is the non-linear iterative partial least squares (nipals). In the case of the partial least squares projection (a.k.a. projection to latent structures), the nipals regression algorithm is used. Details on the "nipals" algorithm are presented in Martens (1991).

When the "opc" method is used in pcSelection, the selection of the components is carried out with an iterative method based on the side information concept (Ramirez-Lopez et al. 2013a, 2013b). Let P = 1, 2, ..., k be a sequence of numbers of retained components. At each iteration, the function computes a dissimilarity matrix retaining p_i components. The side information values of the samples are then compared against the side information values of their most spectrally similar samples. The optimal number of components retrieved by the function is the one that minimizes the root mean squared differences (RMSD) in the case of continuous variables, or maximizes the kappa index in the case of categorical variables. In this process the simEval function is used. Note that for the "opc" method it is necessary to specify Yr (the side information of the samples).

Multi-threading for the computation of dissimilarities (see cores parameter) is based on OpenMP and hence works only on Windows and Linux.
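
The decomposition and the "opc" idea can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's code: the toy data and side information are made up, Euclidean distances on the raw scores stand in for the dissimilarity actually used, and nn_rmsd is a hypothetical helper.

```r
set.seed(10)
n <- 40
X <- scale(matrix(rnorm(n * 6), nrow = n), center = TRUE, scale = FALSE)
# Hypothetical side information related to the first two variables
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n, sd = 0.1)

s <- svd(X)
scores   <- s$u %*% diag(s$d)  # principal component scores, U D
loadings <- s$v                # principal component loadings, V

# RMSD between each sample's side information and that of its nearest
# neighbour when only the first k score columns are retained
nn_rmsd <- function(k) {
  D <- as.matrix(dist(scores[, seq_len(k), drop = FALSE]))
  diag(D) <- Inf                # a sample cannot be its own neighbour
  nn <- apply(D, 1, which.min)
  sqrt(mean((y - y[nn])^2))
}

k_max <- 6
rmsd_by_k <- vapply(seq_len(k_max), nn_rmsd, numeric(1))
k_opt <- which.min(rmsd_by_k)   # number of components minimising the RMSD
```

Here scores %*% t(loadings) reconstructs X, and k_opt plays the role of the optimal number of components for a continuous Yr.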

Value

orthoProjection, pcProjection and plsProjection return a list of class orthoProjection with the following components:

predict.orthoProjection returns a matrix of scores projected for newdata.

Author(s)

Leonardo Ramirez-Lopez

References

Martens, H. (1991). Multivariate calibration. John Wiley & Sons.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. (2013a). The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J.A.M., Scholten, T. (2013b). Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

See Also

orthoDiss, simEval, mbl

Examples

## Not run: 
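library(prospectr)  # provides the NIRsoil data set

data(NIRsoil)

# Split spectra (spc) and CEC values into the reference (train) set
# and the second set
Xu <- NIRsoil$spc[!as.logical(NIRsoil$train),]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train),]

# Remove observations with missing CEC values
Xu <- Xu[!is.na(Yu),]
Yu <- Yu[!is.na(Yu)]

Xr <- Xr[!is.na(Yr),]
Yr <- Yr[!is.na(Yr)]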

# A partial least squares projection using the "opc" method
# for the selection of the optimal number of components
plsProj <- orthoProjection(Xr = Xr, Yr = Yr, X2 = Xu, 
                           method = "pls", 
                           pcSelection = list("opc", 40))
                           
# A principal components projection using the "opc" method
# for the selection of the optimal number of components
pcProj <- orthoProjection(Xr = Xr, Yr = Yr, X2 = Xu, 
                          method = "pca", 
                          pcSelection = list("opc", 40))
                           
# A partial least squares projection using the "cumvar" method
# for the selection of the optimal number of components
plsProj2 <- orthoProjection(Xr = Xr, Yr = Yr, X2 = Xu, 
                            method = "pls", 
                            pcSelection = list("cumvar", 0.99))

## End(Not run) 

[Package resemble version 1.2.2 Index]