mbl {resemble}    R Documentation

A function for memory-based learning (mbl)

Description

This function implements memory-based learning (a.k.a. instance-based learning or local regression), a non-linear lazy learning approach for predicting a given response variable from a set of (spectral) predictor variables. For each sample in a prediction set, a specific local regression is carried out based on a subset of similar samples (nearest neighbours) selected from a reference set. The local model is then used to predict the response value of the target (prediction) sample. Therefore, this function does not yield a global regression model.

Usage

mbl(Yr, Xr, Yu = NULL, Xu,
    mblCtrl = mblControl(), 
    dissimilarityM,
    group = NULL,
    dissUsage = "predictors", 
    k, k.diss, k.range,
    method, 
    pls.c, pls.max.iter = 1, pls.tol = 1e-6,
    noise.v = 0.001,
    ...)

Arguments

Yr

a numeric vector containing the values of the response variable corresponding to the reference data

Xr

input matrix (or data.frame) of predictor variables of the reference data (observations in rows and variables in columns).

Yu

an optional numeric vector containing the values of the response variable corresponding to the data to be predicted

Xu

input matrix (or data.frame) of predictor variables of the data to be predicted (observations in rows and variables in columns).

mblCtrl

a list (created with the mblControl function) which contains parameters that control some aspects of the mbl function. See the mblControl function for more details.

dissimilarityM

(optional) a dissimilarity matrix. This argument can be used when a user-defined dissimilarity matrix is preferred over the automatic dissimilarity matrix computation specified in the sm argument of the mblControl function. When dissUsage = "predictors", dissimilarityM must be a square symmetric dissimilarity matrix (derived from a matrix of the form rbind(Xr, Xu)) whose diagonal values are zeros (since the dissimilarity between an object and itself must be 0). On the other hand, if dissUsage is set to either "weights" or "none", dissimilarityM must be a matrix representing the dissimilarity of each element in Xu to each element in Xr: its number of rows must be equal to the number of rows in Xr and its number of columns equal to the number of rows in Xu. If both dissimilarityM and sm are specified, only the dissimilarityM argument will be taken into account.

group

an optional factor (or vector that can be coerced to a factor by as.factor) to be taken into account for internal validations. The length of the vector must be equal to nrow(Xr), giving the identifier of related observations (e.g. spectra collected from the same batch of measurements, from the same sample, from samples with very similar origin, etc). When one observation is selected for cross validation, all observations of the same group are removed together and assigned to validation. See details.

dissUsage

specifies how the dissimilarity information shall be used. The possible options are: "predictors", "weights" and "none" (see details below). Default is "predictors".

k

a numeric (integer) vector containing the sequence of k nearest neighbours to be tested. Either k or k.diss must be specified. Numbers with decimal values will be coerced to their next higher integer values. This vector will be automatically sorted into ascending order.

k.diss

a vector containing the sequence of dissimilarity thresholds to be tested. When the dissimilarity between a sample in Xr and a sample in Xu is below the given threshold, the sample in Xr is treated as a neighbour of the sample in Xu; otherwise it is ignored. These thresholds depend on the corresponding dissimilarity measure specified in sm. Either k or k.diss must be specified.

k.range

a vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbours allowed when the k.diss argument is used.

method

a character string indicating the method to be used for each local multivariate regression. Options are: "pls", "wapls1" and "gpr" (see details below). Note: "wapls2" from the previous version of the package is no longer available/supported.

pls.c

the number of pls components to be used in the regressions if either "pls" or "wapls1" is used. When "pls" is used, this argument must be a single numerical value. When "wapls1" is used, this argument must be a vector of length 2 indicating the minimum (first value) and the maximum (second value) number of pls components used for the regressions (see details below).

pls.max.iter

maximum number of iterations for the partial least squares methods.

pls.tol

limit for convergence in the orthogonal scores partial least squares regressions using the NIPALS algorithm. Default is 1e-6.

noise.v

a value indicating the variance of the noise for Gaussian process regression. Default is 0.001.

...

additional arguments to be passed to other functions.
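As a rough illustration of the two shapes that dissimilarityM can take, the sketch below builds both from simulated data using base R only. The objects Xr.sim and Xu.sim and the choice of the Euclidean distance are assumptions made for this example; any other dissimilarity measure could be used instead.

```r
# Simulated predictors (hypothetical data, for illustration only)
set.seed(1)
Xr.sim <- matrix(rnorm(6 * 4), nrow = 6)  # 6 reference observations
Xu.sim <- matrix(rnorm(3 * 4), nrow = 3)  # 3 observations to be predicted

# Shape expected when dissUsage = "predictors":
# square symmetric matrix derived from rbind(Xr, Xu), with a zero diagonal
d.square <- as.matrix(dist(rbind(Xr.sim, Xu.sim)))

# Shape expected when dissUsage = "weights" or "none":
# rows correspond to Xr and columns correspond to Xu
r.idx <- seq_len(nrow(Xr.sim))
u.idx <- nrow(Xr.sim) + seq_len(nrow(Xu.sim))
d.cross <- d.square[r.idx, u.idx, drop = FALSE]

dim(d.square)  # (nrow(Xr) + nrow(Xu)) rows and columns
dim(d.cross)   # nrow(Xr) rows, nrow(Xu) columns
```

Checking these dimensions before calling mbl is a cheap way to catch a matrix that was built for the other dissUsage option.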

Details

By using the group argument one can specify groups of observations (spectra) that have something in common (e.g. spectra collected from the same batch of measurements, from the same sample, from samples with very similar origin, etc.) and which could produce biased cross-validation results due to pseudo-replication. This argument allows the selection of calibration points that are independent from the validation ones in cross-validation. In this regard, when valMethod = "loc_crossval" (set in the mblControl function), the p argument refers to the percentage of groups of samples (rather than single samples) to be retained in each resampling iteration at each local segment.

The dissUsage argument is used to specify whether the dissimilarity information must be used within the local regressions and, if so, how. When dissUsage = "predictors", the local (square symmetric) dissimilarity matrix corresponding to the selected neighbourhood is used as a source of additional predictors (i.e. the columns of this local matrix are treated as predictor variables). In some cases this may result in an improvement of the prediction performance (Ramirez-Lopez et al., 2013a). If dissUsage = "weights", the neighbours of the query point (xu_{j}) are weighted according to their dissimilarity (e.g. distance) to xu_{j} prior to carrying out each local regression. The following tricubic function (Cleveland and Devlin, 1988; Naes et al., 1990) is used for computing the final weights based on the measured dissimilarities:

W_{j} = (1 - v^{3})^{3}

where if xr_{i} \in neighbours of xu_{j}:

v_{j}(xu_{j}) = d(xr_{i}, xu_{j})

otherwise:

v_{j}(xu_{j}) = 0

In the above formulas d(xr_{i}, xu_{j}) represents the dissimilarity between the query point and each object in Xr. When dissUsage = "none" is chosen, the dissimilarity information is not used.

The regression carried out at each local segment is selected with the method argument; the available options are "pls", "wapls1" and "gpr".
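The tricubic weighting used when dissUsage = "weights" can be sketched in base R as shown below. The function name tricubic.weights and the example dissimilarities are hypothetical. The sketch also assumes that the dissimilarities are rescaled to the interval [0, 1] by dividing by the largest dissimilarity in the neighbourhood, as is usual in locally weighted regression; the scaling actually applied inside mbl may differ.

```r
# Tricubic weights W = (1 - v^3)^3 for the neighbours of one query sample.
# ASSUMPTION: v is the dissimilarity rescaled to [0, 1] by the largest
# dissimilarity in the neighbourhood; mbl itself may scale differently.
tricubic.weights <- function(d) {
  stopifnot(is.numeric(d), all(d >= 0), max(d) > 0)
  v <- d / max(d)
  (1 - v^3)^3
}

# Hypothetical dissimilarities between one query sample and 5 neighbours
d.nn <- c(0.1, 0.2, 0.4, 0.8, 1.6)
w <- tricubic.weights(d.nn)
round(w, 3)
```

Under this scaling the closest neighbours receive weights near 1, and the most dissimilar neighbour receives a weight of 0.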

The loop used to iterate over the Xu samples in mbl uses the %dopar% operator of the foreach package, which can be used to parallelize this internal loop. The last example in the Examples section illustrates how to parallelize the mbl function. Note that the computational cost depends largely on the way in which the arguments of the function are set. For big datasets, it is recommended to carefully select the values of the parameters to test (e.g. validation method, regression method, dissimilarity information usage, dissimilarity method).
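Returning to the group argument described at the start of this section, the following base R sketch illustrates what retaining a proportion p of groups (rather than of single samples) means for one resampling iteration. All object names and the grouping itself are hypothetical, and the code is only an illustration of the idea, not the internal implementation of mbl.

```r
# Group-wise hold-out for one resampling iteration (illustrative sketch;
# hypothetical names, not the internal mbl code)
set.seed(42)
grp <- factor(rep(letters[1:8], times = c(2, 3, 1, 2, 2, 1, 3, 2)))
p <- 0.75  # proportion of groups retained for calibration

groups <- levels(grp)
n.keep <- max(1, floor(p * length(groups)))
cal.groups <- sample(groups, size = n.keep)

cal.idx <- which(grp %in% cal.groups)     # calibration observations
val.idx <- which(!(grp %in% cal.groups))  # validation observations

# No group contributes observations to both sets
shared <- intersect(as.character(grp[cal.idx]), as.character(grp[val.idx]))
length(shared)
```

Because whole groups are assigned to one side or the other, related observations never appear in both the calibration and the validation sets.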

Value

a list of class mbl with the following components (sorted by either k or k.diss according to the case):

When the k.diss argument is used, the printed results show a table with a column named 'p.bounded'. It represents the percentage of samples for which the neighbours selected by the given dissimilarity threshold were outside the boundaries specified in the k.range argument.
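The interplay between k.diss, k.range and 'p.bounded' can be sketched with simulated data as follows. The dissimilarity matrix d.cross, the threshold and the bounds are hypothetical values chosen for the example, and clamping the number of neighbours to k.range is one reading of the description of the k.range argument rather than a copy of the internal code.

```r
# Threshold-based neighbour selection bounded by k.range (illustrative
# sketch with simulated data; not the internal mbl code)
set.seed(7)
d.cross <- matrix(runif(50 * 8), nrow = 50)  # rows: Xr, columns: Xu
k.diss  <- 0.2       # dissimilarity threshold
k.range <- c(5, 12)  # minimum and maximum number of neighbours

# Neighbours found by the threshold alone, for each sample in Xu
k.found <- colSums(d.cross < k.diss)

# Number of neighbours used after applying the bounds in k.range
k.used <- pmin(pmax(k.found, k.range[1]), k.range[2])

# Percentage of Xu samples whose threshold-based neighbourhood size
# fell outside k.range (the quantity reported as 'p.bounded')
p.bounded <- 100 * mean(k.found < k.range[1] | k.found > k.range[2])

# Indices of the neighbours of the first Xu sample, nearest first
nn.1 <- order(d.cross[, 1])[seq_len(k.used[1])]
```

A high 'p.bounded' suggests that the threshold and the bounds disagree for most samples, so the results are driven mainly by k.range rather than by k.diss.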

Author(s)

Leonardo Ramirez-Lopez and Antoine Stevens

References

Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.

Fernandez Pierna, J.A., Dardenne, P. 2008. Soil parameter quantification by NIRS as a Chemometric challenge at "Chimiométrie 2006". Chemometrics and Intelligent Laboratory Systems 91, 94-98.

Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62, 664-673.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Rasmussen, C.E., Williams, C.K. 2006. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press.

Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.

See Also

mblControl, fDiss, corDiss, sid, orthoDiss, neigCleaning

Examples

## Not run: 
require(resemble)
require(prospectr)

data(NIRsoil)

# Filter the data using the Savitzky and Golay smoothing filter with 
# a window size of 11 spectral variables and a polynomial order of 3 
# (no differentiation).
sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0) 

# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train),]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]

Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train),]

Xu <- Xu[!is.na(Yu),]
Xr <- Xr[!is.na(Yr),]

Yu <- Yu[!is.na(Yu)]
Yr <- Yr[!is.na(Yr)]

# Example 1
# An mbl as implemented in Ramirez-Lopez et al. (2013a, 
# the spectrum-based learner)
# Example 1.1
# An example where Yu is supposed to be unknown, but the Xu 
# (spectral variables) are known 
ctrl1 <- mblControl(sm = "pc", pcSelection = list("opc", 40), 
                    valMethod = "NNv", 
                    scaled = FALSE, center = TRUE)

sbl.u <- mbl(Yr = Yr, Xr = Xr, Yu = NULL, Xu = Xu,
             mblCtrl = ctrl1, 
             dissUsage = "predictors", 
             k = seq(40, 150, by = 10), 
             method = "gpr")
sbl.u
plot(sbl.u)


 
# Example 1.2
# If Yu is actually known... 
sbl.u2 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
              mblCtrl = ctrl1, 
              dissUsage = "predictors", 
              k = seq(40, 150, by = 10), 
              method = "gpr")
sbl.u2

# Example 1.3
# A variation of the spectrum-based learner implemented in 
# Ramirez-Lopez et al. (2013a) where the dissimilarity matrices are 
# recomputed based on partial least squares scores
ctrl_1.3 <- mblControl(sm = "pls", pcSelection = list("opc", 40), 
                       valMethod = "NNv", 
                       scaled = FALSE, center = TRUE)
                          
sbl_1.3 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
               mblCtrl = ctrl_1.3,
               dissUsage = "predictors",
               k = seq(40, 150, by = 10), 
               method = "gpr")
sbl_1.3

# Example 2
# An mbl similar to the ones implemented in 
# Ramirez-Lopez et al. (2013a) 
# and Fernandez Pierna and Dardenne (2008)
ctrl.mbl <- mblControl(sm = "cor", 
                       pcSelection = list("cumvar", 0.999), 
                       valMethod = "NNv", 
                       scaled = FALSE, center = TRUE)
                          
local.mbl <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                 mblCtrl = ctrl.mbl,
                 dissUsage = "none",
                 k = seq(40, 150, by = 10), 
                 pls.c = c(5, 15),
                 method = "wapls1")
local.mbl

# Example 3
# A variation of the previous example (using the optimized pc 
# dissimilarity matrix) using the control list of Example 1
                         
local.mbl2 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                  mblCtrl = ctrl1,
                  dissUsage = "none",
                  k = seq(40, 150, by = 10), 
                  pls.c = c(5, 15),
                  method = "wapls1")
local.mbl2

# Example 4
# Using the function with user-defined dissimilarities:
# Example 4.1: Compute a square symmetric matrix of 
# dissimilarities between 
# all the elements in Xr and Xu (dissimilarities will be used as 
# additional predictor variables later in the mbl function)

# Examples 4.2 - 4.3: Derive a dissimilarity value of each element 
# in Xu to each element in Xr (in this case dissimilarities will 
# not be used as additional predictor variables later in the 
# mbl function)

# Example 4.1
# the manhattan distance 
manhattanD <- dist(rbind(Xr, Xu), method = "manhattan") 
manhattanD <- as.matrix(manhattanD)

ctrl.udd <- mblControl(sm = "none", 
                       pcSelection = list("cumvar", 0.999), 
                       valMethod = c("NNv", "loc_crossval"), 
                       resampling = 10, p = 0.75,
                       scaled = FALSE, center = TRUE)

mbl.udd1 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                mblCtrl = ctrl.udd, 
                dissimilarityM = manhattanD,
                dissUsage = "predictors",
                k = seq(40, 150, by = 10), 
                method = "gpr")
mbl.udd1

# Example 4.2
# first derivative spectra (computed separately for Xr and Xu so that 
# the rows of the dissimilarity matrix match Xr and Yr)
Xr.der.sp <- t(diff(t(Xr), lag = 7, differences = 1)) 
Xu.der.sp <- t(diff(t(Xu), lag = 7, differences = 1)) 

# The principal components dissimilarity on the derivative spectra 
der.ortho <- orthoDiss(Xr = Xr.der.sp, X2 = Xu.der.sp,
                       Yr = Yr,
                       pcSelection = list("opc", 40),
                       method = "pls",
                       center = FALSE, scale = FALSE) 

der.ortho.diss <- der.ortho$dissimilarity

# mbl applied to the absorbance spectra
mbl.udd2 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                mblCtrl = ctrl.udd, 
                dissimilarityM = der.ortho.diss,
                dissUsage = "none",
                k = seq(40, 150, by = 10), 
                method = "gpr")
                                
# Example 4.3
# first derivative spectra
der.Xr <- t(diff(t(Xr), lag = 1, differences = 1)) 
der.Xu <- t(diff(t(Xu), lag = 1, differences = 1))
# the sid on the derivative spectra
der.sid <- sid(Xr = der.Xr, X2 = der.Xu, mode = "density", 
               center = TRUE, scaled = FALSE) 
der.sid <- der.sid$sid

mbl.udd3 <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
                mblCtrl = ctrl.udd, 
                dissimilarityM = der.sid,
                dissUsage = "none",
                k = seq(40, 150, by = 10), 
                method = "gpr")
mbl.udd3

# Example 5
# For running the mbl function in parallel
# Use all cores but one (at least one; detectCores() may return NA)
n.cores <- max(1, parallel::detectCores() - 1, na.rm = TRUE)

# Register a parallel backend according to the OS
if (.Platform$OS.type == "windows") {
  require(doParallel)
  clust <- makeCluster(n.cores)   
  registerDoParallel(clust)
} else {
  require(doSNOW)
  clust <- makeCluster(n.cores, type = "SOCK")
  registerDoSNOW(clust)
}

ctrl <- mblControl(sm = "pc", pcSelection = list("opc", 40), 
                   valMethod = "NNv",
                   scaled = FALSE, center = TRUE)

mbl.p <- mbl(Yr = Yr, Xr = Xr, Yu = Yu, Xu = Xu,
             mblCtrl = ctrl, 
             dissUsage = "none",
             k = seq(40, 150, by = 10), 
             method = "gpr")
registerDoSEQ()
try(stopCluster(clust))
mbl.p

## End(Not run)

[Package resemble version 1.2.2 Index]