1 A Simple Example to Start: T-shirt Sizes

Before clustering, we do not have any labels; we only have some information about each customer. Think about how a tailor makes a customer-tailored T-shirt: they may measure your neck width (collar), arm length, chest width, waistline, and so on. Most apparel companies, however, want as few sizes as possible so that they can cover most of their target customers at a low cost. Say they only want five sizes. The problem is then how to choose these five sizes so that most customers can buy a comfortable T-shirt, and, for a customer with the right size, the T-shirt is neither too large nor too small. In statistics, this problem is equivalent to finding five clusters based on the provided measurements such that the variation within clusters is small while the variation between clusters is large.
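As a toy illustration of this idea (using made-up measurements, not real customer data), we could simulate a few body measurements and ask k-means for five sizes; the customers object and its columns below are purely hypothetical:

# Toy sketch: find five "sizes" from simulated customer measurements
set.seed(1)
customers <- data.frame(
  chest = rnorm(500, mean = 100, sd = 8),  # chest width (cm), simulated
  arm   = rnorm(500, mean = 60,  sd = 4),  # arm length (cm), simulated
  waist = rnorm(500, mean = 85,  sd = 10)  # waistline (cm), simulated
)
sizes <- kmeans(scale(customers), centers = 5, nstart = 25)
table(sizes$cluster)  # how many simulated customers fall into each size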


2 Summary of Seeds data

We use the seeds data set to demonstrate cluster analysis in R. The examined group comprises kernels belonging to three different varieties of wheat, Kama, Rosa and Canadian, with 70 elements each. A description of the dataset is available at https://archive.ics.uci.edu/ml/datasets/seeds. Seven geometric measurements of the wheat kernels were recorded. Assume we only have these seven measurements (x) and our task is to cluster or group the 210 seeds; we therefore remove V8, the variety label.

seed <- read.table('http://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt', header=F)
seed <- seed[,1:7]
colnames(seed) <- c("area", "perimeter","compactness", "length", "width", "asymmetry", "groovelength")

Scale data to have zero mean and unit variance for each column:

seed <- scale(seed) 
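A quick (optional) sanity check confirms that, after scaling, each column has mean approximately zero and standard deviation one:

# Optional check: column means should be ~0 and standard deviations ~1
round(colMeans(seed), 10)
apply(seed, 2, sd)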


3 K-means

The basic idea of k-means clustering is to define clusters so as to minimize the total intra-cluster variation (known as the total within-cluster variation). The standard algorithm is the Hartigan-Wong algorithm (1979), which defines the within-cluster variation as the sum of squared Euclidean distances between items and the corresponding centroid: \[W(C_k) = \sum_{x_i \in C_k}(x_i - \mu_k)^2,\] where \(x_i\) is a data point assigned to cluster \(C_k\) and \(\mu_k\) is the mean of the points assigned to \(C_k\).
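To make the definition concrete, the following minimal sketch recomputes \(W(C_k)\) by hand on the scaled seed data and compares it with the withinss component returned by kmeans (fit_check is a hypothetical name used only for this check):

# Sketch: recompute W(C_k) by hand and compare with kmeans()'s withinss
fit_check <- kmeans(seed, centers = 2, nstart = 25)
W <- sapply(1:2, function(k) {
  Xk <- seed[fit_check$cluster == k, , drop = FALSE]  # observations in cluster k
  sum(sweep(Xk, 2, colMeans(Xk))^2)                   # squared distances to the centroid
})
cbind(by_hand = W, from_kmeans = fit_check$withinss)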

For clustering, one can rely on many kinds of distance measures, and the choice is a critical point. The distance measure determines how the similarity of two elements \((x, z)\) is quantified, and it strongly influences the results of the clustering analysis. The classical distance measures are the Euclidean and Manhattan distances, defined as follows (the Pearson correlation distance is another common choice):

Euclidean distance:

\(d_{euc}(x,z) = \sqrt{\sum^n_{i=1}(x_i - z_i)^2} \tag{1}\)

Manhattan distance:

\(d_{man}(x,z) = \sum^n_{i=1}|(x_i - z_i)| \tag{2}\)

Pearson correlation distance:

\(d_{cor}(x, z) = 1 - \frac{\sum^n_{i=1}(x_i-\bar x)(z_i - \bar z)}{\sqrt{\sum^n_{i=1}(x_i-\bar x)^2\sum^n_{i=1}(z_i - \bar z)^2}} \tag{3}\)
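As a small illustration, the distances between the first two (scaled) observations can be computed directly in base R; the comments below refer to equations (1)-(3):

# Distances between the first two scaled observations (rows 1 and 2)
x <- seed[1, ]; z <- seed[2, ]
sqrt(sum((x - z)^2))  # Euclidean distance, equation (1)
sum(abs(x - z))       # Manhattan distance, equation (2)
1 - cor(x, z)         # Pearson correlation distance, equation (3)
# Euclidean and Manhattan distances can also be obtained via dist()
dist(rbind(x, z), method = "euclidean")
dist(rbind(x, z), method = "manhattan")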

Before conducting k-means clustering, we can calculate the pairwise distances between any two rows (observations) to roughly check whether some observations are close to each other. Specifically, we can use get_dist to calculate the pairwise distances (the default is the Euclidean distance), and then fviz_dist to visualize the distance matrix generated by get_dist.

library(factoextra)
distance <- get_dist(seed)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

From the distance plot above, we can see that some observations are very close to each other. For example, the observations in the bottom-left rectangle appear quite similar to each other and may be grouped into one cluster.

To run k-means clustering and plot the results, we can use the kmeans function in R. It groups the data into a specified number of clusters, say \(k\) (the centers argument of kmeans). As mentioned before, the algorithm randomly selects \(k\) objects as the initial cluster centers to start the iteration, so the final results may vary with different initial centers. The nstart option allows the algorithm to attempt multiple initial configurations and report the best one. It is recommended to set a large value of nstart, which gives a more stable result. (Here we use k = 2 for illustration; you should definitely also try k = 3, etc.)

# K-Means Cluster Analysis
fit <- kmeans(seed, centers = 2, nstart = 25) # 2 clusters solution with 25 different initial configurations
# Display number of observations in each cluster
table(fit$cluster)
## 
##   1   2 
##  77 133
# The kmeans object has many components
fit
## K-means clustering with 2 clusters of sizes 77, 133
## 
## Cluster means:
##         area perimeter compactness     length      width   asymmetry
## 1  1.1379346  1.145151   0.5424726  1.1260130  1.0640703 -0.14617607
## 2 -0.6588042 -0.662982  -0.3140631 -0.6519023 -0.6160407  0.08462825
##   groovelength
## 1    1.1468793
## 2   -0.6639828
## 
## Clustering vector:
##   [1] 2 2 2 2 1 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 1 1
##  [38] 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 196.2357 459.7972
##  (between_SS / total_SS =  55.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

kmeans returns an object of class “kmeans”, which has a print and a fitted method. It is a list with at least the components cluster, centers, totss, withinss, tot.withinss, betweenss, size, iter, and ifault (see the output above).
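For example, a few of these components can be accessed directly from the fit object created above:

fit$size          # number of observations in each cluster
fit$centers       # cluster centroids (on the scaled variables)
fit$tot.withinss  # total within-cluster sum of squares
fit$betweenss     # between-cluster sum of squares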

3.1 Visualization of kmeans clusters

3.1.1 fviz_cluster in factoextra package

We can use fviz_cluster to view the result as a nice graph of the clusters. Since we usually have more than two dimensions (variables), fviz_cluster performs a principal component analysis (PCA) and plots the data points according to the first two principal components, which explain the majority of the variance.

fviz_cluster(fit, data = seed)
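Under the hood, this is roughly what fviz_cluster plots; a minimal base-R sketch (using prcomp on the already-scaled data and the fit object from above) would be:

# Rough equivalent: project onto the first two principal components and color by cluster
pca <- prcomp(seed)
plot(pca$x[, 1:2], col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "k-means clusters on the first two principal components")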

We can also visualize k-means results with more than 2 clusters.

k3 <- kmeans(seed, centers = 3, nstart = 25)
k4 <- kmeans(seed, centers = 4, nstart = 25)
k5 <- kmeans(seed, centers = 5, nstart = 25)

# plots to compare
p1 <- fviz_cluster(fit, geom = "point", data = seed) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = seed) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = seed) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = seed) + ggtitle("k = 5")

library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)

3.1.2 plotcluster in fpc package

Another package that can visualize clustering results is fpc. Its plotcluster function projects the data onto discriminant coordinates and plots the observations labeled by their cluster assignment:

install.packages("fpc")
library(fpc)
plotcluster(seed, fit$cluster)

# See exactly which items are in 1st group (Not run)
seed[fit$cluster==1,]
# get cluster means for scaled data
aggregate(seed,by=list(fit$cluster),FUN=mean)
# or alternatively, use the output of kmeans
fit$centers
##         area perimeter compactness     length      width   asymmetry
## 1  1.1379346  1.145151   0.5424726  1.1260130  1.0640703 -0.14617607
## 2 -0.6588042 -0.662982  -0.3140631 -0.6519023 -0.6160407  0.08462825
##   groovelength
## 1    1.1468793
## 2   -0.6639828

3.2 Determine number of clusters

  1. Here is an example of the simple within-group sum of squares (elbow) method. In the plot below, applying the elbow method suggests choosing k = 3 or 4 as the number of clusters.
# Determine number of clusters
wss <- (nrow(seed)-1)*sum(apply(seed, 2, var))
for (i in 2:12) wss[i] <- sum(kmeans(seed, centers=i)$withinss)
plot(1:12, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

  2. The prediction strength is defined according to Tibshirani and Walther (2005), who recommend choosing the optimal number of clusters as the largest number of clusters that leads to a prediction strength above 0.8 or 0.9. Using this criterion, \(k=2\) is chosen, as shown below.
prediction.strength(seed, Gmin=2, Gmax=15, M=10,cutoff=0.8)
## Prediction strength 
## Clustering method:  kmeans 
## Maximum number of clusters:  15 
## Resampled data sets:  10 
## Mean pred.str. for numbers of clusters:  1 0.852435 0.7716917 0.4891853 0.418128 0.3657107 0.352803 0.2760359 0.2717175 0.2469654 0.179632 0.1745815 0.1430647 0.1310394 0.1140371 
## Cutoff value:  0.8 
## Largest number of clusters better than cutoff:  2
  3. The fpc package has a cluster.stats() function that can calculate other cluster validity measures, such as the average silhouette width (between -1 and 1, the higher the better) or the Dunn index (between 0 and infinity, the higher the better):
d = dist(seed, method = "euclidean")
result = matrix(nrow = 14, ncol = 3)
for (i in 2:15){
  cluster_result = kmeans(seed, i)
  clusterstat=cluster.stats(d, cluster_result$cluster)
  result[i-1,1]=i
  result[i-1,2]=clusterstat$avg.silwidth
  result[i-1,3]=clusterstat$dunn   
}
plot(result[,c(1,2)], type="l", ylab = 'silhouette width', xlab = 'number of clusters')

As shown in the above plot, the silhouette width suggests 2 clusters.

plot(result[,c(1,3)], type="l", ylab = 'dunn index', xlab = 'number of clusters')

Judging from the Dunn index in the above plot, \(k=3\) would be chosen as the number of clusters.
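As a cross-check, the factoextra package (already loaded above) provides fviz_nbclust, which automates this kind of plot; a brief sketch:

# Automated elbow and silhouette plots for choosing k
fviz_nbclust(seed, kmeans, method = "wss")
fviz_nbclust(seed, kmeans, method = "silhouette")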

Remark: The NbClust package provides 30 indices for determining the optimal number of clusters in a data set. For more sophisticated methods, see, for example, this blog or these course notes.
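For instance, a minimal NbClust call (assuming the package is installed) might look like the following; it reports, among other things, the number of clusters favored by each of its indices:

# install.packages("NbClust")  # if not already installed
library(NbClust)
nb <- NbClust(data = seed, distance = "euclidean",
              min.nc = 2, max.nc = 10, method = "kmeans")
nb$Best.nc  # number of clusters suggested by each index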

Remark: This article on Cross Validated provides a great illustration of situations in which k-means can fail.


4 Hierarchical clustering

# Ward's method for hierarchical clustering
# Calculate the distance matrix
seed.dist=dist(seed)
# Obtain clusters using Ward's method ("ward.D" is the current name of the old "ward" option)
seed.hclust=hclust(seed.dist, method="ward.D")
plot(seed.hclust)

# Cut the dendrogram at the 3-cluster level and obtain cluster membership
seed.3clust = cutree(seed.hclust,k=3)

First, dist(seed) calculates the distance matrix between observations (how similar the observations are to each other, judging from the 7 numerical variables). Then hclust() takes the distance matrix as input and produces a hierarchical clustering solution. In hierarchical clustering you do not need to specify the number of clusters in advance; the number you obtain depends on where you cut the dendrogram.
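To make the cutting step concrete, we can draw the chosen clusters on the dendrogram and try a different cut (a short sketch using base R; seed.2clust is a hypothetical name introduced here):

# Draw rectangles around the 3 clusters on the dendrogram
plot(seed.hclust)
rect.hclust(seed.hclust, k = 3, border = "red")
# Cutting at a different number of clusters only requires a different k
seed.2clust <- cutree(seed.hclust, k = 2)
table(seed.2clust, seed.3clust)  # how the 2- and 3-cluster solutions relate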

# See exactly which items are in the third group (Not run)
seed[seed.3clust==3,]
# Get cluster means for the (scaled) data
aggregate(seed, by=list(seed.3clust), FUN=mean)
# Centroid plot against the first 2 discriminant functions
# Load the fpc library needed for plotcluster function
library(fpc)
#plotcluster(ZooFood, fit$cluster)
plotcluster(seed, seed.3clust)


5 (Optional) Model-Based Cluster Analysis

A newer clustering approach, model-based clustering, treats the clustering problem as fitting (maximizing the likelihood of) a Normal mixture model. Generating an observation under this model consists of first picking a centroid (the mean of a multivariate normal distribution) at random and then adding some noise (variance). If the noise is normally distributed with a spherical covariance, this procedure results in clusters of spherical shape. Model-based clustering assumes that the data were generated by such a model and tries to recover the original model from the data. The recovered model then defines the clusters and the assignment of observations to clusters. It can be thought of as a generalization of \(K\)-means.

The model “recovering” process is done via the Expectation-Maximization (EM) algorithm, an iterative approach to maximizing the likelihood of a statistical model that contains unobserved (latent) variables.

One obvious advantage of the approach is that we can treat the question “How Many Clusters?” as a model selection problem.

For a detailed description of the method and the package, see 1 and 2.

install.packages('mclust')
library(mclust)
mclust_result = Mclust(seed)
summary(mclust_result)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust EEV (ellipsoidal, equal volume and shape) model with 4 components: 
## 
##  log-likelihood   n  df      BIC      ICL
##        416.1655 210 122 179.9839 172.5306
## 
## Clustering table:
##  1  2  3  4 
## 50 49 45 66

The BIC used in the mclust package is the negative of the ‘usual’ BIC that we discussed for regression models. Therefore, we are trying to maximize the BIC here.
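To inspect the BIC values directly, a brief sketch using the mclust_result object from above:

# BIC curves for each covariance model and number of components
plot(mclust_result, what = "BIC")
head(mclust_result$BIC)  # raw BIC values stored in the fitted object
# Compare the model-based clusters with the earlier k-means solution
table(mclust = mclust_result$classification, kmeans = fit$cluster)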

plot(mclust_result)
