Lecture 11: Cluster Analysis Part 2
Massey University
Fall 2026
Understand the basics of K-means.
Understand the basics of validation.
\[ \mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \Longrightarrow \mathbf{D} = \begin{bmatrix} 0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\ d(x_2, x_1) & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\ d(x_3, x_1) & d(x_3, x_2) & 0 & \cdots & d(x_3, x_n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d(x_n, x_1) & d(x_n, x_2) & d(x_n, x_3) & \cdots & 0 \end{bmatrix} \Longrightarrow \mathbf{C} = \begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_n \\ \end{bmatrix} \]
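As a minimal sketch of the first arrow above (data matrix to distance matrix), the snippet below builds D from X in plain Python. Euclidean distance is an assumption here, since the slide leaves d(·,·) unspecified, and the function name `distance_matrix` and the toy data are illustrative only.

```python
import math

def distance_matrix(X):
    """Return the n x n matrix of pairwise Euclidean distances between rows of X."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(X[i], X[j])  # assumed choice of d: Euclidean distance
            D[i][j] = d
            D[j][i] = d  # D is symmetric, with zeros on the diagonal
    return D

# Toy data: n = 3 cases, p = 2 variables (hypothetical values)
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(X)
for row in D:
    print(row)
```

Any other dissimilarity measure could be swapped in for `math.dist` without changing the structure of D.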
Hierarchical clustering: these algorithms form clusters by connecting objects based on their distance. A cluster can be understood in terms of the maximum distance required to connect its elements.
Centroid-based clustering: each cluster is represented by a central vector, which is not necessarily a member of the data set. The number of clusters (k) is predefined, and the algorithm assigns each object to the nearest cluster center.
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to closest center.
4. Update cluster centers.
5. Re-assign cases.
6. Repeat steps 4 and 5 until convergence.
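The steps above can be sketched in plain Python as follows. This is a minimal illustration, assuming Euclidean distance and initial centers drawn at random from the cases; the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy data are hypothetical and not part of the lecture.

```python
import math
import random

def kmeans(X, k, seed=0, max_iter=100):
    """Minimal k-means sketch: returns (labels, centers)."""
    rng = random.Random(seed)
    # Step 2: select k cluster centers (assumption: k distinct cases chosen at random)
    centers = [list(x) for x in rng.sample(X, k)]
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign (or re-assign) each case to its closest center
        new_labels = [min(range(k), key=lambda c: math.dist(x, centers[c])) for x in X]
        if new_labels == labels:
            break  # Step 6: no case changed cluster, so the algorithm has converged
        labels = new_labels
        # Step 4: update each center to the mean of the cases assigned to it
        for c in range(k):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Step 1: select inputs (hypothetical toy data with two well-separated groups)
X = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
labels, centers = kmeans(X, k=2)
print(labels)
print(centers)
```

The cluster numbers themselves are arbitrary: which group is labelled 0 and which 1 depends on the starting centers.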

Strengths
Relatively efficient and intuitive.
Easy to implement.
Often finds good, if not optimal, solutions.
Can assign new cases to clusters.
Weaknesses
Applicable only when a mean is defined, so it cannot be applied directly to categorical data.
Need to specify k in advance.
Sensitive to outliers.
Only good for clusters with convex shapes.
Sensitive to the initial starting points, so it may not find the optimal solution.
The result depends on the starting point.
The within-group SS can get stuck in a local minimum.
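The sensitivity to starting points can be seen on a tiny example. The sketch below runs the same k-means procedure from two different pairs of starting centers on four hypothetical cases at the corners of a rectangle; the function name `lloyd` and the data are illustrative assumptions, with Euclidean distance assumed throughout.

```python
import math

def lloyd(X, start, max_iter=100):
    """Run k-means from the given starting centers; return (labels, centers, within SS)."""
    centers = [list(c) for c in start]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Assign each case to its closest center
        new_labels = [min(range(k), key=lambda c: math.dist(x, centers[c])) for x in X]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update each center to the mean of its cases
        for c in range(k):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Within-group sum of squares of the final solution
    wss = sum(math.dist(x, centers[l]) ** 2 for x, l in zip(X, labels))
    return labels, centers, wss

# Four cases at the corners of a 4 x 1 rectangle (hypothetical data)
X = [(0, 0), (0, 1), (4, 0), (4, 1)]

# Start A: one center on each short side; groups the left pair and the right pair
labels_a, _, wss_a = lloyd(X, [(0, 0), (4, 1)])
# Start B: both centers on the left side; converges to a top/bottom split instead
labels_b, _, wss_b = lloyd(X, [(0, 0), (0, 1)])

print(labels_a, wss_a)  # [0, 0, 1, 1] 1.0
print(labels_b, wss_b)  # [0, 1, 0, 1] 16.0
```

Both runs have converged (no case changes cluster), yet start B is stuck at a within-group SS sixteen times larger. Running from several starting points and keeping the solution with the smallest within-group SS is a common way to reduce this risk.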
For k-means, you need to choose the number of groups/clusters a priori.
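Calculate the within-group SS, which we want to minimize, for each of the k-means results for k = 2, 3, 4, 5, etc. groups. The within-group SS generally keeps falling as k grows, so choose the k beyond which adding groups gives little further reduction.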
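A minimal sketch of this comparison is given below. To keep the example deterministic it tries every set of k cases as starting centers and keeps the smallest within-group SS, which is only feasible for toy data; in practice a handful of random starts is the usual choice. The function names `lloyd` and `best_within_ss` and the data (three tight groups) are hypothetical, and Euclidean distance is assumed.

```python
import math
from itertools import combinations

def lloyd(X, start, max_iter=100):
    """Run k-means from the given starting centers; return (labels, centers, within SS)."""
    centers = [list(c) for c in start]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(x, centers[c])) for x in X]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for c in range(k):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(math.dist(x, centers[l]) ** 2 for x, l in zip(X, labels))
    return labels, centers, wss

def best_within_ss(X, k):
    """Smallest within-group SS over all starts made of k cases (toy data only)."""
    return min(lloyd(X, start)[2] for start in combinations(X, k))

# Hypothetical data: three tight groups of three cases each
X = [(0, 0), (1, 0), (0, 1),
     (10, 0), (11, 0), (10, 1),
     (0, 10), (1, 10), (0, 11)]

within_ss = {k: best_within_ss(X, k) for k in range(2, 6)}
for k, ss in within_ss.items():
    print(k, round(ss, 2))
```

On these data the within-group SS drops sharply from k = 2 to k = 3 and only slightly afterwards, which points to three groups.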
The number of seeds, k, typically translates to the final number of clusters obtained. The choice of k can be made using a variety of methods:
Subject-matter knowledge: you know k by context.
Convenience/constraint, e.g. it is convenient to market to three or four groups.
Based on the data (combined with hierarchical clustering).
To validate a cluster means that the groupings have survived an independent test.
Cluster analysis is good for generating hypotheses about possible groupings of samples.
Plot points on an ordination and see if clusters are clearly separate on the plot.
Compare the results of several (reasonable) clustering algorithms - underlying clusters are pretty robust if you get the same answer. But be aware that these are not formal tests.
Uses the data set itself to generate a test of the cluster results.
This can be done by re-sampling subsets of the data (jackknifing or bootstrapping).
Simulate data with the same parameters as your original variables under a certain model (Monte Carlo).
Analyse part of the data (training), and then use the other part to test the cluster results.
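One possible way to carry out the training/test idea is sketched below: cluster the training part, assign each held-out case to the nearest training center, cluster the held-out part on its own, and measure how well the two groupings of the held-out cases agree. The agreement measure used here (the Rand index, the share of case pairs treated the same way by both groupings) is an assumption, as are the function names and the toy data; the lecture does not prescribe a specific procedure.

```python
import math
from itertools import combinations

def lloyd(X, start, max_iter=100):
    """Run k-means from the given starting centers; return (labels, centers, within SS)."""
    centers = [list(c) for c in start]
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: math.dist(x, centers[c])) for x in X]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        for c in range(k):
            members = [x for x, l in zip(X, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(math.dist(x, centers[l]) ** 2 for x, l in zip(X, labels))
    return labels, centers, wss

def best_kmeans(X, k):
    """Best solution over all starts made of k cases (feasible for toy data only)."""
    return min((lloyd(X, start) for start in combinations(X, k)), key=lambda r: r[2])

def rand_index(a, b):
    """Share of case pairs on which two groupings agree (same cluster vs different)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

# Hypothetical training and test parts, each with two well-separated groups
train = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
test = [(1.5, 1.5), (2, 2), (1, 1.5), (8.5, 8.5), (9, 9), (8, 8.5)]
k = 2

# Cluster the training part, then assign test cases to the nearest training center
_, train_centers, _ = best_kmeans(train, k)
predicted = [min(range(k), key=lambda c: math.dist(x, train_centers[c])) for x in test]

# Cluster the test part on its own and compare the two groupings
test_labels, _, _ = best_kmeans(test, k)
agreement = rand_index(predicted, test_labels)
print(agreement)  # 1.0 for these well-separated toy data
```

The Rand index ignores the arbitrary cluster numbers, so the two groupings can agree perfectly even if their labels are swapped. As with the informal checks above, high agreement is supporting evidence rather than a formal test.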
Use information from an independent source.
Use a new data set to test cluster groupings previously obtained.
Compare cluster groupings with a previously defined hypothesis.
Important
Do NOT run an analysis of variance (or any other such statistical test) on groups you have obtained from a cluster analysis of the same dataset. If you do, you are bound to get significant results, and that is not a "valid" form of "validation".






