kenStone {prospectr} | R Documentation |
Select calibration samples from a large multivariate dataset using the Kennard-Stone algorithm
kenStone(X,k,metric,pc,group,.center = TRUE,.scale = FALSE)
X |
a numeric |
k |
number of desired calibration samples |
metric |
distance metric to be used: 'euclid' (Euclidean distance) or 'mahal' (Mahalanobis distance, default). |
pc |
optional. If not specified, distances are computed in the Euclidean space. Alternatively, distances are computed in the principal component score space and |
group |
An optional |
.center |
logical value indicating whether the input matrix should be centered before Principal Component Analysis. Default set to TRUE. |
.scale |
logical value indicating whether the input matrix should be scaled before Principal Component Analysis. Default set to FALSE. |
The Kennard–Stone algorithm selects samples that are uniformly distributed over the predictor space (Kennard and Stone, 1969). It starts by selecting the pair of points that are farthest apart. They are assigned to the calibration set and removed from the list of points. The procedure then assigns remaining points to the calibration set by computing the distance between each unassigned point i_0 and each selected point i and finding the point i_0 for which:
d_{selected} = \max\limits_{i_0}(\min\limits_{i}(d_{i,i_{0}}))
This essentially selects the point i_0 that is farthest from its closest neighbor i in the calibration set. The algorithm uses the Euclidean distance to select the points. However, the Mahalanobis distance can also be used. This can be achieved by performing a PCA on the input data and computing the Euclidean distance on the truncated score matrix according to the following definition of the Mahalanobis H distance:
H^{2}_{ij} = \sum\limits_{a=1}^{A}{(\hat{t}_{ia}-\hat{t}_{ja})^{2}/\hat{\lambda}_{a}}
where \hat{t}_{ia} is the a^th principal component score of point i, \hat{t}_{ja} is the corresponding value for point j, \hat{\lambda}_a is the eigenvalue of principal component a and A is the number of principal components included in the computation.
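The max-min selection rule and the H distance can be illustrated with a short base R sketch. This is not the prospectr implementation: ks_sketch() and scaled_scores() are hypothetical helper names introduced only for illustration, and the sketch assumes a small data set with at least two distinct rows, since it builds the full pairwise distance matrix.

```r
# Illustrative sketch only; ks_sketch() and scaled_scores() are hypothetical
# helpers, not functions exported by prospectr.

# Max-min (Kennard-Stone style) selection using Euclidean distances
ks_sketch <- function(X, k) {
  X <- as.matrix(X)
  n <- nrow(X)
  stopifnot(k >= 2, k <= n)
  D <- as.matrix(dist(X))  # full n x n Euclidean distance matrix

  # Step 1: start with the pair of points that are farthest apart
  selected <- as.integer(which(D == max(D), arr.ind = TRUE)[1, ])

  # Step 2: add the candidate whose nearest selected point is farthest away
  while (length(selected) < k) {
    candidates <- setdiff(seq_len(n), selected)
    min_d <- apply(D[candidates, selected, drop = FALSE], 1, min)
    selected <- c(selected, candidates[which.max(min_d)])
  }
  list(model = selected, test = setdiff(seq_len(n), selected))
}

# PCA scores divided by sqrt(eigenvalue): the squared Euclidean distance
# between two rows of the result equals H^2 over the first A components
scaled_scores <- function(X, A, center = TRUE, scale = FALSE) {
  pca <- prcomp(X, center = center, scale. = scale)
  sweep(pca$x[, seq_len(A), drop = FALSE], 2, pca$sdev[seq_len(A)], "/")
}

# Usage on simulated data (assumed values, for illustration only)
set.seed(1)
X <- matrix(rnorm(200), ncol = 4)
S <- scaled_scores(X, A = 2)
sel <- ks_sketch(S, k = 10)
```

Running the Euclidean selection on the scaled scores is one way to obtain a selection based on the H distance, because dividing each score by the square root of its eigenvalue reproduces the weighting in the formula above.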
a list with components:
'model': numeric vector giving the row indices of the input data selected for calibration
'test': numeric vector giving the row indices of the remaining observations
'pc': if the pc argument is specified, a numeric matrix of the scaled pc scores
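As a minimal sketch of how the returned indices might be used, the following splits a data matrix into calibration and remaining rows. The sel list here is a hand-made stand-in that only mimics the 'model' and 'test' components described above; its index values are invented for illustration and do not come from kenStone.

```r
# Stand-in for the list returned by kenStone(); the indices are made up
set.seed(42)
X <- matrix(rnorm(40), nrow = 10)
sel <- list(model = c(2L, 5L, 9L),
            test  = c(1L, 3L, 4L, 6L, 7L, 8L, 10L))

# Rows selected for calibration and rows left over
X_cal <- X[sel$model, , drop = FALSE]
X_val <- X[sel$test, , drop = FALSE]
```

Using drop = FALSE keeps both results as matrices even when only one row is selected.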
Antoine Stevens & Leonardo Ramirez-Lopez
Kennard, R.W., and Stone, L.A., 1969. Computer aided design of experiments. Technometrics 11, 137-148.
duplex, shenkWest, naes, honigs
data(NIRsoil)
sel <- kenStone(NIRsoil$spc, k = 30, pc = .99)
plot(sel$pc[, 1:2], xlab = 'PC1', ylab = 'PC2')
points(sel$pc[sel$model, 1:2], pch = 19, col = 2)  # points selected for calibration
# Test on artificial data
X <- expand.grid(1:20, 1:20) + rnorm(1e5, 0, .1)
plot(X, xlab = 'VAR1', ylab = 'VAR2')
sel <- kenStone(X, k = 25, metric = 'euclid')
points(X[sel$model, ], pch = 19, col = 2)