nroKmeans {Numero} | R Documentation
K-means clustering for multi-dimensional data.
nroKmeans(data, k = 3, subsample = NULL, balance = 0, metric = "euclid")
data | A data frame or a matrix.
k | Number of centroids.
subsample | Number of randomly selected rows used during a single training cycle.
balance | Penalty parameter for size difference between clusters.
metric | Distance metric in data space, either "euclid" or "pearson".
The K centroids are determined by Lloyd's algorithm with Euclidean
distances or by using 1 - Pearson correlation as the distance measure.
If subsample is less than the number of data rows, a random subset
of the specified size is used for each training cycle.
If balance = 0.0, the algorithm is applied with no balancing;
if balance = 1.0, all the clusters are forced to be of equal size.
Intermediate values are permitted.
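The procedure described above can be illustrated with a minimal base R sketch. It is not the Numero implementation: the function names dist_to_centroids and lloyd_sketch, the cycles argument and the random initialization are assumptions made for illustration, the balance penalty is left out, and using the member mean as the centroid under the "pearson" metric is a simplification.

```r
# Distance from every row of x to every row of centroids.
# "euclid": Euclidean distance; "pearson": 1 - Pearson correlation.
dist_to_centroids <- function(x, centroids, metric = "euclid") {
  if (metric == "pearson") {
    # cor() correlates columns, so transpose to compare rows.
    return(1 - cor(t(x), t(centroids)))
  }
  d <- sapply(seq_len(nrow(centroids)), function(j) {
    sqrt(rowSums(sweep(x, 2, centroids[j, ])^2))
  })
  # Keep an n-by-k matrix even when x has a single row.
  matrix(d, nrow = nrow(x))
}

# Hypothetical Lloyd-style training loop with optional subsampling.
lloyd_sketch <- function(x, k, subsample = NULL, metric = "euclid",
                         cycles = 20) {
  x <- as.matrix(x)
  n <- nrow(x)
  if (is.null(subsample) || subsample > n) subsample <- n
  # Assumption: start from k randomly chosen data rows.
  centroids <- x[sample(n, k), , drop = FALSE]
  for (i in seq_len(cycles)) {
    # Draw a fresh random subset of rows for this training cycle.
    batch <- x[sample(n, subsample), , drop = FALSE]
    # Assignment step: label each row with its nearest centroid.
    d <- dist_to_centroids(batch, centroids, metric)
    labels <- max.col(-d, ties.method = "first")
    # Update step: move each centroid to the mean of its members.
    for (j in seq_len(k)) {
      members <- batch[labels == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  # Final labels for all rows, not only the last subset.
  d <- dist_to_centroids(x, centroids, metric)
  list(centroids = centroids, labels = max.col(-d, ties.method = "first"))
}

# Toy data: two well-separated groups of 20 points each.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- lloyd_sketch(toy, k = 2, subsample = 30)
print(table(fit$labels))
```

Because each cycle sees only a random subset, the centroids fluctuate slightly between cycles when subsample is smaller than the number of rows; the final labels are computed against all rows.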
A list with four named elements: centroids is a matrix of the
main results, layout contains the best-matching centroid labels
and model residuals for each usable data point, history is the
chronological record of training errors, and metric is the distance
metric that was used.
Lloyd SP (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137
# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname)

# Prepare training data.
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
trdata <- scale.default(dataset[,trvars])

# Unbalanced K-means clustering.
km0 <- nroKmeans(data = trdata, k = 5, balance = 0.0)
print(table(km0$layout$BMC))
print(km0$centroids)

# Balanced K-means clustering.
km1 <- nroKmeans(data = trdata, k = 5, balance = 1.0)
print(table(km1$layout$BMC))
print(km1$centroids)