nroKmeans {Numero}R Documentation

K-means clustering

Description

K-means clustering for multi-dimensional data.

Usage

nroKmeans(data, k = 3, subsample = NULL, balance = 0, metric = "euclid")

Arguments

data

A data frame or a matrix.

k

Number of centroids.

subsample

Number of randomly selected rows used during a single training cycle.

balance

Penalty parameter for size difference between clusters.

metric

Distance metric in data space, either "euclid" or "pearson".

Details

The K centroids are determined by Lloyd's algorithm with Euclidean distances or by using 1 - Pearson correlation as the distance measure. If subsample is less than the number of data rows, a random subset of the specified size is used for each training cycle.

If balance = 0.0, the algorithm is applied with no balancing, if balance = 1.0 all the clusters will be forced to be of equal size. Intermediate values are permitted.

Value

A list with four named elements: centroids is a matrix of the main results, layout contains the best-matching centroid labels and model residuals for each usable data point, history is the chronological record of training errors, and metric is the distance metric that was used.

References

Lloyd SP (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory, 28:129–137

Examples

# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname)

# Prepare training data.
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
trdata <- scale.default(dataset[,trvars]) 

# Unbalanced K-means clustering.
km0 <- nroKmeans(data = trdata, k = 5, balance = 0.0)
print(table(km0$layout$BMC))
print(km0$centroids)

# Balanced K-means clustering.
km1 <- nroKmeans(data = trdata, k = 5, balance = 1.0)
print(table(km1$layout$BMC))
print(km1$centroids)

[Package Numero version 1.2.0 Index]