mbl {resemble} | R Documentation |
This function is implemented for memory-based learning (a.k.a. instance-based learning or local regression), which is a non-linear lazy learning approach for predicting a given response variable from a set of (spectral) predictor variables. For each sample in a prediction set, a specific local regression is carried out based on a subset of similar samples (nearest neighbours) selected from a reference set. The local model is then used to predict the response value of the target (prediction) sample. Therefore, this function does not yield a global regression model.
mbl(Yr, Xr, Yu = NULL, Xu, mblCtrl = mblControl(), dissimilarityM, group = NULL, dissUsage = "predictors", k, k.diss, k.range, method, pls.c, pls.max.iter = 1, pls.tol = 1e-6, noise.v = 0.001, ...)
Yr |
a numeric |
Xr |
input |
Yu |
an optional numeric |
Xu |
input |
mblCtrl |
a list (created with the |
dissimilarityM |
(optional) a dissimilarity matrix. This argument can be used in case a user-defined dissimilarity matrix is preferred over the automatic dissimilarity matrix computation specified in the |
group |
an optional |
dissUsage |
specifies how the dissimilarity information shall be used. The possible options are: |
k |
a numeric (integer) |
k.diss |
a |
k.range |
a |
method |
a character string indicating the method to be used at each local multivariate regression. Options are: |
pls.c |
the number of pls components to be used in the regressions if either |
pls.max.iter |
maximum number of iterations for the partial least squares methods. |
pls.tol |
limit for convergence in the partial orthogonal scores partial least squares regressions using the nipals algorithm. Default is 1e-6. |
noise.v |
a value indicating the variance of the noise for Gaussian process regression. Default is 0.001. |
... |
additional arguments to be passed to other functions. |
By using the group
argument one can specify groups of observations (spectra) that have something in common (e.g. spectra collected from the same batch of measurements, from the same sample, from samples with very similar origin, etc.) and which could produce biased cross-validation results due to pseudo-replication. This argument allows the selection of calibration points that are independent from the validation ones in cross-validation. In this regard, when valMethod = "loc_crossval"
(used in mblControl
function), then the p
argument refers to the percentage of groups of samples (rather than single samples) to be retained in each resampling iteration at each local segment.
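The group-wise resampling described above can be sketched as follows. This is only an illustration and not the code used inside mbl: the helper name sample_by_group is hypothetical, and it assumes that the retained groups form the calibration set while the remaining groups are held out for validation.

```r
# Hypothetical helper (illustration only, not part of resemble):
# draw calibration/validation indices by group instead of by single sample.
sample_by_group <- function(group, p = 0.75) {
  group <- droplevels(as.factor(group))
  lv <- levels(group)
  # number of groups retained in this resampling iteration
  n_keep <- max(1, floor(p * length(lv)))
  keep <- sample(lv, size = n_keep)
  list(calibration = which(group %in% keep),
       validation  = which(!group %in% keep))
}

set.seed(1)
# 8 samples belonging to 4 groups of replicated measurements
grp <- c("a", "a", "b", "b", "c", "c", "d", "d")
idx <- sample_by_group(grp, p = 0.75)
# no group is shared between the calibration and validation sets
intersect(grp[idx$calibration], grp[idx$validation])
```

Because whole groups are moved together, replicated spectra of the same sample never end up on both sides of the split.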
The dissUsage
argument is used to specify whether the dissimilarity information must be used within the local regressions and (if so), how. When dissUsage = "predictors"
the local (square symmetric) dissimilarity matrix corresponding to the selected neighbourhood is used as source
of additional predictors (i.e. the columns of this local matrix are treated as predictor variables). In some cases this may result in an improvement
of the prediction performance (Ramirez-Lopez et al., 2013a). If dissUsage = "weights"
, the neighbours of the query point (xu_{j}) are weighted according to their dissimilarity (e.g. distance) to xu_{j} prior to carrying out each local regression.
The following tricubic function (Cleveland and Devlin, 1988; Naes et al., 1990) is used for computing the final weights based on the measured dissimilarities:
W_{j} = (1 - v^{3})^{3}
where if xr_{i} \in neighbours of xu_{j}:
v_{j}(xu_{j}) = d(xr_{i}, xu_{j})
otherwise:
v_{j}(xu_{j}) = 0
In the above formulas d(xr_{i}, xu_{j}) represents the dissimilarity between the query point and each object in Xr. When dissUsage = "none"
is chosen
the dissimilarity information is not used.
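A minimal sketch of the tricubic weighting is given below. It is not the internal code of mbl. The function name tricube_weights is hypothetical, and the scaling of the dissimilarities by the largest value in the neighbourhood is an assumption made here so that v lies between 0 and 1.

```r
# Illustrative tricubic weighting of one neighbourhood (not resemble's internal code).
tricube_weights <- function(d) {
  # Assumption: dissimilarities are scaled by the largest one in the neighbourhood
  v <- d / max(d)
  (1 - v^3)^3
}

# dissimilarities between four neighbours and the query sample
d <- c(0, 0.2, 0.5, 1.0)
w <- tricube_weights(d)
round(w, 3)
```

Under this scaling the closest neighbours receive weights near 1 and the farthest neighbour in the neighbourhood receives a weight of 0.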
The possible options for performing regressions at each local segment implemented in the mbl
function are described as follows:
Partial least squares ("pls"
): It uses the orthogonal scores (non-linear iterative partial least squares, nipals) algorithm. The only parameter which needs to be optimized is the number of pls components. This can be done by cross-validation at each local segment.
Weighted average pls ("wapls1"
): It uses multiple models generated by multiple pls components (i.e. between a minimum and a maximum number of pls components). At each local partition the final predicted value is a weighted average of all the predicted values generated by the multiple pls models. The weight for each component is calculated as follows:
w_{j} = \frac{1}{s_{1:j}\times g_{j}}
where s_{1:j} is the root mean square of the spectral residuals of the unknown (or target) sample when a total of j pls components are used and g_{j} is the root mean square of the regression coefficients corresponding to the jth pls component (see Shenk et al., 1997 for more details). "wapls1"
is not compatible with valMethod = "loc_crossval"
since the weights are computed based on the sample to be predicted at each local iteration.
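The weighting scheme above can be sketched as follows. The function wapls_average and its inputs are hypothetical and only illustrate the formula. Normalising the weights so that they sum to one is an assumption made here to obtain a weighted average.

```r
# Illustrative weighted average of the predictions of several pls models
# (hypothetical inputs, not resemble's internal code).
wapls_average <- function(preds, s, g) {
  # preds: predictions obtained with j = min, ..., max pls components
  # s: root mean square of the spectral residuals of the target sample
  # g: root mean square of the regression coefficients
  w <- 1 / (s * g)
  w <- w / sum(w)  # assumption: weights are normalised to sum to one
  sum(w * preds)
}

preds <- c(10.2, 10.8, 11.1)
s <- c(0.04, 0.02, 0.01)
g <- c(1, 2, 8)
wa <- wapls_average(preds, s, g)
wa
```

Models with small spectral residuals and small regression coefficients contribute most to the final prediction.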
Gaussian process with dot product covariance ("gpr"
): Gaussian process regression is a probabilistic and non-parametric Bayesian approach. It is commonly described as a collection of random variables which have a joint Gaussian distribution and it is characterized by both a mean and a covariance function (Williams and Rasmussen, 1996). The covariance function used in the implemented method is the dot product, which implies that there are no parameters to be optimized for the computation of the covariance.
Here, the process for predicting the response variable of a new sample (y_{new}) from its predictor variables (x_{new}) is carried out first by computing a prediction vector (A). It is derived from a set of reference spectra (X) and their respective response vector (Y) as follows:
A = (X X^\textup{T} + \sigma^2 I)^{-1} Y
where \sigma^2 denotes the variance of the noise and I the identity matrix (with dimensions equal to the number of observations in X). The prediction of y_{new} is then carried out by:
y_{new} = (x_{new} X^\textup{T}) A
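A minimal sketch of Gaussian process regression with a dot product covariance is given below. It is not the implementation used in mbl. The function names gpr_dot_fit and gpr_dot_predict are hypothetical, no centering or scaling is applied, and the covariance between the new sample and the reference samples is taken as their dot product.

```r
# Minimal dot product Gaussian process regression (illustration only).
gpr_dot_fit <- function(X, Y, noise.v = 0.001) {
  K <- tcrossprod(X)                     # X X^T
  solve(K + noise.v * diag(nrow(X)), Y)  # prediction vector A
}

gpr_dot_predict <- function(A, X, x_new) {
  # dot product covariance between the new sample(s) and the reference samples
  drop(tcrossprod(x_new, X) %*% A)
}

set.seed(42)
X <- matrix(rnorm(20 * 5), nrow = 20)    # 20 reference samples, 5 variables
b <- c(1, -2, 0.5, 0, 3)
Y <- drop(X %*% b)                       # noise-free linear response
A <- gpr_dot_fit(X, Y, noise.v = 1e-6)
x_new <- matrix(rnorm(5), nrow = 1)
pred <- gpr_dot_predict(A, X, x_new)
c(predicted = pred, expected = sum(x_new * b))
```

With a noise-free linear response and a very small noise variance, the prediction should closely match the value given by the true coefficients.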
The loop used to iterate over the Xu
samples in mbl
uses the %dopar%
operator of the foreach
package, which can be used to parallelize this internal loop. The last example given in the mbl
function illustrates how to parallelize the mbl
function.
Note that the computational cost depends largely on the way in which the arguments of the function are set. For big datasets, it is recommended to carefully select the values of the parameters to test (e.g. validation method, regression method, dissimilarity information usage, dissimilarity method).
a list
of class mbl
with the following components (sorted by either k
or k.diss
according to the case):
call
: the call used.
cntrlParam
: the list with the control parameters used. If one or more control parameters were reset automatically, this is instead a list containing both a list with the control parameters initially specified and a list with the parameters which were finally used.
dissimilarities
: a list with the method used to obtain the dissimilarity matrices and the dissimilarity matrices corresponding to D(Xr, Xu) and D(Xr,Xr) if dissUsage = "predictors"
. This object is returned only if the returnDiss
argument in the mblCtrl
list was set to TRUE
in the call used.
totalSamplesPredicted
: the total number of samples predicted.
pcAnalysis
: a list containing the results of the principal component analysis. The first two objects (scores_Xr
and scores_Xu
) are the scores of the Xr
and Xu
matrices. It also contains the number of principal components used (n.componentsUsed
) and another object which is a vector
containing the standardized Mahalanobis dissimilarities (also called GH, Global H distance) between each sample in Xu
and the centre of Xr
.
components
: a list containing either the number of principal components or partial least squares components used for the computation of the orthogonal dissimilarities. This object is only returned if the dissimilarity measure specified in mblCtrl
is any of the following options: 'pc'
, 'loc.pc'
, "pls"
, 'loc.pls'
. If any of the local orthogonal dissimilarities was used ('loc.pc'
or 'loc.pls'
)
a data.frame
is also returned in this list. This object is equivalent to the loc.n.components
object returned by the orthoDiss
function. It specifies the number of local components (either principal components or partial least squares components) used for computing the dissimilarity between each query sample and its neighbour samples, as returned by the orthoDiss
function.
nnValStats
: a data frame containing the statistics of the nearest neighbour cross-validation for each k
or k.diss
depending on the arguments specified in the call. It is returned only if 'NNv'
or 'both'
were selected as validation method.
localCrossValStats
: a data frame containing the statistics of the local leave-group-out cross-validation for each k
or k.diss
depending on the arguments specified in the call. It is returned only if 'loc_crossval'
or 'both'
were selected as validation method.
YuPredictionStats
: a data frame containing the statistics of the cross-validation of the prediction of Yu
for each k
or k.diss
depending on the arguments specified in the call. It is returned only if Yu
was provided.
results
: a list of data frames which contains the results of the predictions for each k
or k.diss
. Each data.frame
contains the following columns:
o.index
: The index of the sample predicted in the input matrix.
k.diss
: This column is only output if the k.diss
argument is used. It indicates the corresponding dissimilarity threshold for selecting the neighbors used to predict a given sample.
distance
: This column is only output if the k.diss
argument is used. It is a logical that indicates whether the neighbors selected by the given dissimilarity threshold were outside the boundaries specified in the k.range
argument. In that case the number of neighbors used is coerced to one of the boundaries.
k.org
: This column is only output if the k.diss
argument is used. It indicates the number of neighbors that are retained when the given dissimilarity threshold is used.
pls.comp
: This column is only output if pls
regression was used. It indicates the final number of pls components used. If no optimization was set, it retrieves the original pls components specified in the pls.c
argument.
min.pls
: This column is only output if wapls1
regression was used. It indicates the final number of minimum pls components used. If no optimization was set, it retrieves the original minimum pls components specified in the pls.c
argument.
max.pls
: This column is only output if wapls1
regression was used. It indicates the final number of maximum pls components used. If no optimization was set, it retrieves the original maximum pls components specified in the pls.c
argument.
yu.obs
: This column is only output if the Yu
argument is used. It indicates the input values given in Yu
(the response variable corresponding to the data to be predicted).
pred
: The predicted values.
yr.min.obs
: The minimum reference value (of the response variable) in the neighborhood.
yr.max.obs
: The maximum reference value (of the response variable) in the neighborhood.
index.nearest.in.ref
: The index in Xr
of the nearest neighbor.
y.nearest
: The reference value (of the response variable) of the nearest neighbor in Xr
.
y.nearest.pred
: This column is only output if the validation method (selected with the mblControl
function) is equal to 'NNv'
. It represents the predicted value of the nearest neighbor sample in Xr
using the neighborhood of the predicted sample in Xu
.
loc.rmse.cv
: This column is only output if the validation method (selected with the mblControl
function) is equal to 'loc_crossval'
. It represents the cross-validation RMSE value computed for the neighborhood of the predicted sample in Xu
.
loc.st.rmse.cv
: This column is only output if the validation method (selected with the mblControl
function) is equal to 'loc_crossval'
. It represents the cross-validation standardized RMSE value computed for the neighborhood of the predicted sample in Xu
.
dist.nearest
: The distance to the nearest neighbor.
dist.k.farthest
: The distance to the farthest neighbor selected.
When the k.diss
argument is used, the printed results show a table with a column named 'p.bounded'. It represents the percentage of samples
for which the neighbors selected by the given dissimilarity threshold were outside the boundaries specified in the k.range
argument.
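The interplay between the dissimilarity threshold and the k.range boundaries can be sketched as follows. This is only an illustration of the behaviour described above and not the code used inside mbl. The helper select_neighbours is hypothetical, and it assumes that a neighbour is retained when its dissimilarity is lower than or equal to the threshold.

```r
# Illustrative neighbour selection by dissimilarity threshold
# (hypothetical helper, not resemble's internal code).
select_neighbours <- function(d, k.diss, k.range) {
  ord <- order(d)
  # neighbours retained by the threshold alone (assumption: d <= k.diss)
  k.org <- sum(d <= k.diss)
  # coerce the number of neighbours to the k.range boundaries
  k <- min(max(k.org, k.range[1]), k.range[2])
  list(index = ord[seq_len(k)], k.org = k.org, bounded = k != k.org)
}

# dissimilarities between six reference samples and one query sample
d <- c(0.05, 0.30, 0.10, 0.80, 0.20, 0.60)
res <- select_neighbours(d, k.diss = 0.15, k.range = c(3, 5))
res$k.org    # neighbours found by the threshold alone
res$index    # neighbours actually used after bounding
res$bounded  # whether the number of neighbours was coerced
```

Under these assumptions, a percentage comparable to 'p.bounded' would be the mean of the bounded flag over all predicted samples multiplied by 100.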
Leonardo Ramirez-Lopez and Antoine Stevens
Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.
Fernandez Pierna, J.A., Dardenne, P. 2008. Soil parameter quantification by NIRS as a Chemometric challenge at "Chimiométrie 2006". Chemometrics and Intelligent Laboratory Systems 91, 94-98.
Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62, 664-673.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
Rasmussen, C.E., Williams, C.K. 2006. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press.
Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.
mblControl
, fDiss
, corDiss
, sid
, orthoDiss
, neigCleaning
## Not run: 
require(prospectr)

data(NIRsoil)

# Filter the data using the Savitzky and Golay smoothing filter with
# a window size of 11 spectral variables and a polynomial order of 3
# (no differentiation).
sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0)

# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train),]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]

Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train),]

# Remove the samples with missing response values
Xu <- Xu[!is.na(Yu),]
Xr <- Xr[!is.na(Yr),]

Yu <- Yu[!is.na(Yu)]
Yr <- Yr[!is.na(Yr)]

# Example 1
# A mbl implemented in Ramirez-Lopez et al. (2013,
# the spectrum-based learner)
# Example 1.1
# An example where Yu is supposed to be unknown, but the Xu
# (spectral variables) are known
ctrl1 <- mblControl(sm = "pc", pcSelection = list("opc", 40),
                    valMethod = "NNv",
                    scaled = FALSE, center = TRUE)

sbl.u <- mbl(Yr = Yr, Xr = Xr, Yu = NULL, Xu = Xu,
             mblCtrl = ctrl1,
             dissUsage = "predictors",
             k = seq(40, 150, by = 10),
             method = "gpr")
sbl.u
plot(sbl.u)

# Example 1.2
# If Yu is actually known...
sbl.u2 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
              mblCtrl = ctrl1,
              dissUsage = "predictors",
              k = seq(40, 150, by = 10),
              method = "gpr")
sbl.u2

# Example 1.3
# A variation of the spectrum-based learner implemented in
# Ramirez-Lopez et al. (2013) where the dissimilarity matrices are
# recomputed based on partial least squares scores
ctrl_1.3 <- mblControl(sm = "pls", pcSelection = list("opc", 40),
                       valMethod = "NNv",
                       scaled = FALSE, center = TRUE)

sbl_1.3 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
               mblCtrl = ctrl_1.3,
               dissUsage = "predictors",
               k = seq(40, 150, by = 10),
               method = "gpr")
sbl_1.3

# Example 2
# A mbl similar to the ones implemented in
# Ramirez-Lopez et al. (2013)
# and Fernandez Pierna and Dardenne (2008)
ctrl.mbl <- mblControl(sm = "cor",
                       pcSelection = list("cumvar", 0.999),
                       valMethod = "NNv",
                       scaled = FALSE, center = TRUE)

local.mbl <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                 mblCtrl = ctrl.mbl,
                 dissUsage = "none",
                 k = seq(40, 150, by = 10),
                 pls.c = c(5, 15),
                 method = "wapls1")
local.mbl

# Example 3
# A variation of the previous example (using the optimized pc
# dissimilarity matrix) using the control list of the example 1
local.mbl2 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                  mblCtrl = ctrl1,
                  dissUsage = "none",
                  k = seq(40, 150, by = 10),
                  pls.c = c(5, 15),
                  method = "wapls1")
local.mbl2

# Example 4
# Using the function with user-defined dissimilarities:
# Examples 4.1 - 4.2: Compute a square symmetric matrix of
# dissimilarities between
# all the elements in Xr and Xu (dissimilarities will be used as
# additional predictor variables later in the mbl function)
# Examples 4.3 - 4.4: Derive a dissimilarity value of each element
# in Xu to each element in Xr (in this case dissimilarities will
# not be used as additional predictor variables later in the
# mbl function)

# Example 4.1
# the manhattan distance
manhattanD <- dist(rbind(Xr, Xu), method = "manhattan")
manhattanD <- as.matrix(manhattanD)

ctrl.udd <- mblControl(sm = "none",
                       pcSelection = list("cumvar", 0.999),
                       valMethod = c("NNv", "loc_crossval"),
                       resampling = 10, p = 0.75,
                       scaled = FALSE, center = TRUE)

mbl.udd1 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                mblCtrl = ctrl.udd,
                dissimilarityM = manhattanD,
                dissUsage = "predictors",
                k = seq(40, 150, by = 10),
                method = "gpr")
mbl.udd1

# Example 4.2
# first derivative spectra
Xr.der.sp <- t(diff(t(rbind(Xr, Xu)), lag = 7, differences = 1))
Xu.der.sp <- t(diff(t(Xu), lag = 7, differences = 1))

# The orthogonal (partial least squares) dissimilarity on the
# derivative spectra
der.ortho <- orthoDiss(Xr = Xr.der.sp, X2 = Xu.der.sp,
                       Yr = Yr,
                       pcSelection = list("opc", 40),
                       method = "pls",
                       center = FALSE, scale = FALSE)

der.ortho.diss <- der.ortho$dissimilarity

# mbl applied to the absorbance spectra
mbl.udd2 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                mblCtrl = ctrl.udd,
                dissimilarityM = der.ortho.diss,
                dissUsage = "none",
                k = seq(40, 150, by = 10),
                method = "gpr")

# Example 4.3
# first derivative spectra
der.Xr <- t(diff(t(Xr), lag = 1, differences = 1))
der.Xu <- t(diff(t(Xu), lag = 1, differences = 1))

# the sid on the derivative spectra
der.sid <- sid(Xr = der.Xr, X2 = der.Xu, mode = "density",
               center = TRUE, scaled = FALSE)
der.sid <- der.sid$sid

mbl.udd3 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                mblCtrl = ctrl.udd,
                dissimilarityM = der.sid,
                dissUsage = "none",
                k = seq(40, 150, by = 10),
                method = "gpr")
mbl.udd3

# Example 5
# For running the mbl function in parallel
require(parallel) # provides detectCores() and makeCluster()
n.cores <- detectCores() - 1
if(n.cores == 0) n.cores <- 1

# Set the number of cores according to the OS
if (.Platform$OS.type == "windows") {
  require(doParallel)
  clust <- makeCluster(n.cores)
  registerDoParallel(clust)
}else{
  require(doSNOW)
  clust <- makeCluster(n.cores, type = "SOCK")
  registerDoSNOW(clust)
  ncores <- getDoParWorkers()
}

ctrl <- mblControl(sm = "pc", pcSelection = list("opc", 40),
                   valMethod = "NNv",
                   scaled = FALSE, center = TRUE)

mbl.p <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
             mblCtrl = ctrl,
             dissUsage = "none",
             k = seq(40, 150, by = 10),
             method = "gpr")

# Go back to sequential processing and release the cluster
registerDoSEQ()
try(stopCluster(clust))
mbl.p

## End(Not run)