ClustMMDD-package {ClustMMDD}

ClustMMDD: Clustering by Mixture Models for Discrete Data
ClustMMDD stands for "Clustering by Mixture Models for Discrete Data". This package addresses the two-fold problem of variable selection and model-based unsupervised classification for discrete data. Variable selection and classification are solved simultaneously via a model selection procedure using penalized criteria: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Integrated Completed Likelihood (ICL), or a general criterion whose penalty function is calibrated in a data-driven way.
Package:  ClustMMDD
Type:     Package
Version:  1.0.1
Date:     2015-05-18
License:  GPL (>= 2)
In this package, K denotes the number of clusters and S the subset of variables that are relevant for clustering. We assume that a clustering variable has different probability distributions in at least two clusters, whereas a non-clustering variable has the same distribution in all clusters. We consider a general situation in which the data are described by P random variables X^l, l = 1, \cdots, P, where each variable X^l is an unordered set \left\{X^{l,1}, \cdots, X^{l,ploidy}\right\} of ploidy categorical variables. For all l, the random variables X^{l,1}, \cdots, X^{l,ploidy} take their values in the same set of levels. A typical example of such data comes from population genetics, where the genotype of a diploid individual at each locus consists of ploidy = 2 unordered alleles.
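For intuition, the following toy snippet sketches what such data can look like for ploidy = 2: each locus occupies two columns holding its two unordered alleles. This layout is only illustrative and not necessarily the package's storage format; see data(genotype2) and ?genotype2 for the actual format.

# Toy diploid data: P = 2 loci, each described by ploidy = 2
# categorical alleles (illustrative layout only).
set.seed(1)
n <- 6
locus1 <- replicate(2, sample(c("A1", "A2", "A3"), n, replace = TRUE))
locus2 <- replicate(2, sample(c("B1", "B2"), n, replace = TRUE))
toyData <- data.frame(locus1, locus2)
names(toyData) <- c("Loc1.a", "Loc1.b", "Loc2.a", "Loc2.b")
toyData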
The two-fold problem of clustering and variable selection is treated as a model selection problem. A collection of competing models associated with different values of (K, S) is defined, and these models are compared using penalized criteria of the form

crit\left(K, S\right) = \gamma_n\left(K, S\right) + pen\left(K, S\right),

where \gamma_n\left(K, S\right) is the maximum log-likelihood and pen\left(K, S\right) the penalty function.
The penalty functions used in this package are listed below, where dim\left(K, S\right) is the dimension (number of free parameters) of the model defined by \left(K, S\right); a small numerical sketch follows the list.
Akaike Information Criterion (AIC):

pen\left(K, S\right) = dim\left(K, S\right)
Bayesian Information Criterion (BIC):

pen\left(K, S\right) = 0.5 * \log(n) * dim\left(K, S\right)
Integrated Completed Likelihood (ICL):

pen\left(K, S\right) = 0.5 * \log(n) * dim\left(K, S\right) + entropy\left(K, S\right),

where

entropy\left(K, S\right) = -\sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{i,k} \log\left(\tau_{i,k}\right)

and \tau_{i,k} = P\left(i \in \mathcal{C}_k\right) is the posterior probability that individual i belongs to cluster \mathcal{C}_k.
More general penalty function:

pen\left(K, S\right) = \alpha * \lambda * dim\left(K, S\right),

where \lambda is a multiplicative parameter to be calibrated from the data and \alpha is a coefficient in [1.5, 2] to be chosen by the user.
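As a quick numerical illustration of these penalty formulas, here is a minimal base-R sketch. The helper names are hypothetical and not part of the ClustMMDD API; tau stands in for the n x K matrix of posterior membership probabilities.

# Hypothetical helpers mirroring the penalty formulas above;
# they are not exported by ClustMMDD.
penAIC <- function(dim) dim
penBIC <- function(dim, n) 0.5 * log(n) * dim
# entropy(K, S) computed from the n x K matrix tau of posterior probabilities
entropyTau <- function(tau) -sum(tau * log(tau), na.rm = TRUE)
penICL <- function(dim, n, tau) penBIC(dim, n) + entropyTau(tau)
penGeneral <- function(dim, alpha, lambda) alpha * lambda * dim

# Example: n = 100 observations, model dimension 12, K = 3 clusters
set.seed(2)
tau <- matrix(runif(100 * 3), 100, 3)
tau <- tau / rowSums(tau)          # each row sums to one
penAIC(12)
penBIC(12, 100)
penICL(12, 100, tau)
penGeneral(12, alpha = 2, lambda = 0.5)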
We propose a data-driven calibration procedure based on the dimension jump version of the so-called "slope heuristics" (see Dominique Bontemps and Wilson Toussile (2013) and the references therein).
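The idea behind the dimension jump can be sketched as follows, under simplifying assumptions (a toy collection of explored models, and the equivalent convention of minimizing the negative log-likelihood plus \lambda * dim); dimJump.R implements the actual calibration. One increases \lambda, records the dimension of the selected model, and takes the estimate \lambda at the largest jump; the final penalty then uses \alpha * \lambda * dim.

# Toy sketch of the dimension jump heuristic (illustrative only;
# dimJump.R implements the actual procedure).
# models: one row per explored model, with its dimension and max log-likelihood.
models <- data.frame(dim    = c(5, 9, 14, 20, 27, 35),
                     loglik = c(-900, -840, -800, -780, -770, -765))
lambdas <- seq(0.01, 5, by = 0.01)
# Dimension of the model minimizing -loglik + lambda * dim, for each lambda
selDim <- sapply(lambdas, function(lam) {
  models$dim[which.min(-models$loglik + lam * models$dim)]
})
# lambda estimate: location of the largest drop in the selected dimension
jump <- which.max(abs(diff(selDim)))
lambdaHat <- lambdas[jump + 1]
lambdaHat
# The final penalty is alpha * lambdaHat * dim, with alpha in [1.5, 2]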
The maximum log-likelihood is approximated via the Expectation-Maximization (EM) algorithm. The maximum a posteriori (MAP) classification is then derived from the estimated parameters of the selected model.
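Given the matrix of posterior membership probabilities \tau_{i,k} produced by the EM step, the MAP rule simply assigns each individual to its most probable cluster. A one-line base-R sketch, where tau is an illustrative stand-in rather than a ClustMMDD object:

# MAP classification: assign each individual to the cluster with the
# highest posterior probability (tau is an illustrative n x K matrix).
tau <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6), nrow = 2, byrow = TRUE)
mapClass <- apply(tau, 1, which.max)
mapClass   # 1 3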
Author: Wilson Toussile

Maintainer: Wilson Toussile <wilson.toussile@gmail.com>
References:

Dominique Bontemps and Wilson Toussile (2013). Clustering and variable selection for categorical multivariate data. Electronic Journal of Statistics, 7, 2344-2371.

Wilson Toussile and Elisabeth Gassiat (2009). Variable selection in model-based clustering using multilocus genotype data. Advances in Data Analysis and Classification, 3(2), 109-134.
The main functions:

em.cluster.R
Computes an approximation of the maximum-likelihood estimates of the parameters via the EM algorithm, for a given value of (K, S). The maximum a posteriori classification is then derived.

backward.explorer
Gathers the most competitive models using a backward-stepwise exploration strategy.

dimJump.R
Performs the data-driven calibration of the penalty function via an estimation of \lambda. Two candidate values are proposed, together with a plot that helps the user choose between them.

selectK.R
Selects the number K of clusters for a given subset S of clustering variables.

model.selection.R
Performs model selection over a collection of competing models.
Examples:

data(genotype2)
head(genotype2)

data(genotype2_ExploredModels)
head(genotype2_ExploredModels)

# Calibration of the penalty function
outDimJump = dimJump.R(genotype2_ExploredModels, N = 1000, h = 5, header = TRUE)
cte1 = outDimJump[[1]][1]

# Model selection with the calibrated constant
outSelection = model.selection.R(genotype2_ExploredModels, cte = cte1, header = TRUE)
outSelection