
Lecture 10: Cluster Analysis
Massey University
Fall 2026
\[ \mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \Longrightarrow \mathbf{D} = \begin{bmatrix} 0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\ d(x_2, x_1) & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\ d(x_3, x_1) & d(x_3, x_2) & 0 & \cdots & d(x_3, x_n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d(x_n, x_1) & d(x_n, x_2) & d(x_n, x_3) & \cdots & 0 \end{bmatrix} \]
Fill each entry \(d(x_i, x_j)\) with your preferred distance measure.
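As a minimal sketch of the step from \(\mathbf{X}\) to \(\mathbf{D}\), the function below fills the matrix with Euclidean distances. Euclidean distance is only one possible choice, and the function name `distance_matrix` and the toy data are illustrative assumptions, not part of the lecture.

```python
import math

def distance_matrix(X):
    """Return the n x n matrix of Euclidean distances between the rows of X."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]  # the diagonal stays at 0
    for i in range(n):
        for j in range(i + 1, n):
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(X[i], X[j])))
            D[i][j] = D[j][i] = d  # a distance matrix is symmetric
    return D

# Hypothetical toy data: n = 3 objects measured on p = 2 variables.
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(X)
for row in D:
    print(row)
```

Swapping the expression inside `math.sqrt` for another dissimilarity (for example Manhattan distance) changes \(\mathbf{D}\) and therefore may change every clustering built on it.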
Choose your analysis: ordination or clustering.
Clustering techniques intend to make groups out of the data.
Some applications are:
For example:
The aim of cluster analysis is to delineate “natural groups” of data, with high within-class similarity and low between-class similarity.
BUT this does not mean that the clusters actually exist!
Cluster Analysis will ALWAYS provide clusters (whether they exist or not).
Cluster profiling involves labelling a proposed cluster solution.
The objective is to identify the features, or combination of features, that uniquely describe each cluster.
It is useful to visualize clusters using an ordination, such as multidimensional scaling (MDS, which is also distance-based) or PCA.
Clustering can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster.
If the clusters are well-separated, almost any clustering method performs well.
\[ \mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \Longrightarrow \mathbf{D} = \begin{bmatrix} 0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\ d(x_2, x_1) & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\ d(x_3, x_1) & d(x_3, x_2) & 0 & \cdots & d(x_3, x_n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d(x_n, x_1) & d(x_n, x_2) & d(x_n, x_3) & \cdots & 0 \end{bmatrix} \Longrightarrow \mathbf{C} = \begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_n \\ \end{bmatrix} \]
Hierarchical clustering: these algorithms form clusters by connecting objects based on their distance. A cluster can be understood in terms of the maximum distance required to connect its elements.
Centroid-based clustering: each cluster is represented by a central vector, which is not necessarily a member of the data set. The number of clusters (k) is predefined, and the algorithm assigns each object to the nearest cluster centre.
Model-based clustering: this approach models the data as arising from a mixture of probability distributions.
Density-based clustering: clusters are defined as areas of higher density than the remainder of the data set.
Grid-based clustering: clusters are created by classifying the cells of an arbitrary grid into clusters based on the density of points in each cell.
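To make the centroid-based idea concrete, here is a minimal k-means-style sketch: it alternates between assigning each object to its nearest centre and moving each centre to the mean of its objects. The function name `kmeans`, the starting centres and the toy points are assumptions chosen for illustration. Real implementations add random restarts and better initialisation.

```python
def kmeans(points, centers, n_iter=10):
    """Alternate between nearest-centre assignment and recomputing the centres."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    labels = []
    for _ in range(n_iter):
        # Assignment step: each object goes to its nearest cluster centre.
        labels = [min(range(len(centers)), key=lambda k: sqdist(p, centers[k]))
                  for p in points]
        # Update step: each centre becomes the mean of the objects assigned to it.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(list(centers[k]))  # keep the centre of an empty cluster
        if new_centers == centers:
            break  # centres stopped moving
        centers = new_centers
    return labels, centers

# Hypothetical toy data with k = 2 and hand-picked starting centres.
points = [[1.0, 1.0], [1.0, 2.0], [8.0, 8.0], [9.0, 8.0]]
labels, centers = kmeans(points, centers=[[0.0, 0.0], [10.0, 10.0]])
print(labels, centers)
```

Note that the final centres are means of the members, so they need not coincide with any observed object.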
Hierarchy: an arrangement of items (objects, names, values, categories, etc.) that are represented as being “above”, “below”, or “at the same level as” one another.
Agglomerative (bottom-up): begins with each data point as an individual cluster. At each step, the algorithm merges the two most similar clusters based on a chosen distance metric and linkage criterion. This process continues until all data points are combined into a single cluster or a stopping criterion is met.
Divisive (top-down): starts with all data points in a single cluster and recursively splits the cluster into smaller ones. At each step, the algorithm selects a cluster and divides it into two or more subsets.
A dendrogram is a branching diagram that represents the relationships of similarity among a group of entities.
As soon as you have a group, you’re dealing with more than one dissimilarity.
There are a number of ways to calculate an overall dissimilarity between groups of objects.
Some examples of linkage criteria are:
Nearest neighbour (single linkage): two objects or clusters fuse when their closest objects reach the similarity of the considered partition.
Complete linkage: two objects or clusters fuse when their most distant points reach the similarity of the considered partition.
Centroid linkage: the distance between clusters is the squared Euclidean distance between cluster centroids.
Unweighted Pair-Group Method using Arithmetic averages (UPGMA): the two groups with the highest average similarity between them are joined (gives equal weight to the original similarities).
Weighted Pair-Group Method using Arithmetic averages (WPGMA): same as UPGMA, but gives equal weight to the branches of the dendrogram rather than to the original similarities (e.g. for unequal sample sizes).
Ward’s minimum variance: finds the pair of objects or clusters whose fusion increases as little as possible the sum of the squared distances between objects and cluster centroids.
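The criteria above can be written compactly in terms of distances rather than similarities. The notation here is an assumption introduced for illustration: \(A\), \(B\) and \(C\) are clusters, \(|A|\) is the number of objects in \(A\), \(\bar{x}_A\) is the centroid of \(A\), and \(d\) is the chosen distance between objects.

\[ D_{\text{single}}(A, B) = \min_{a \in A,\, b \in B} d(a, b), \qquad D_{\text{complete}}(A, B) = \max_{a \in A,\, b \in B} d(a, b) \]
\[ D_{\text{UPGMA}}(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b), \qquad D_{\text{WPGMA}}(A \cup B, C) = \frac{D(A, C) + D(B, C)}{2} \]
\[ D_{\text{centroid}}(A, B) = \lVert \bar{x}_A - \bar{x}_B \rVert^2, \qquad \Delta_{\text{Ward}}(A, B) = \frac{|A|\,|B|}{|A| + |B|} \lVert \bar{x}_A - \bar{x}_B \rVert^2 \]

Here \(\Delta_{\text{Ward}}(A, B)\) is the increase in the total within-cluster sum of squares caused by merging \(A\) and \(B\). In every case the pair of clusters with the smallest value is fused next.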
Grouping Customers by Product sales
\[ \begin{array}{c|cccccc} & P1 & P2 & P3 & P4 & P5 & P6 \\ \hline C1 & 9.7 & 21 & 19.4 & 7.7 & 32 & 36.5 \\ C2 & 8.1 & 16.7 & 18.3 & 7 & 30.3 & 32.9 \\ C3 & 13.5 & 27.3 & 26.8 & 10.6 & 41.9 & 48.1 \\ C4 & 11.5 & 24.3 & 24.5 & 9.3 & 40 & 44.6 \\ C5 & 10.7 & 23.5 & 21.4 & 8.5 & 28.8 & 37.6 \\ C6 & 9.6 & 22.6 & 21.1 & 8.3 & 34.4 & 43.1 \\ C7 & 10.3 & 22.1 & 19.1 & 8.1 & 32.2 & 35 \end{array} \Longrightarrow \begin{array}{c|ccccccc} & C1 & C2 & C3 & C4 & C5 & C6 & C7 \\ \hline C1 & 0 & & & & & & \\ C2 & 1.91 & 0 & & & & & \\ C3 & 5.38 & 7.12 & 0 & & & & \\ C4 & 3.38 & 5.06 & 2.14 & 0 & & & \\ C5 & 1.51 & 3.19 & 4.57 & 2.91 & 0 & & \\ C6 & 1.56 & 3.18 & 4.21 & 2.20 & 1.67 & 0 & \\ C7 & 0.66 & 2.39 & 5.12 & 3.24 & 1.26 & 1.71 & 0 \end{array} \]
\[ \tiny \begin{array}{c|ccccccc} & C1 & C2 & C3 & C4 & C5 & C6 & C7 \\ \hline C1 & 0 & & & & & & \\ C2 & \color{orange}{1.91} & 0 & & & & & \\ C3 & 5.38 & 7.12 & 0 & & & & \\ C4 & 3.38 & 5.06 & \color{blue}{2.14} & 0 & & & \\ C5 & \color{green}{1.51} & 3.19 & 4.57 & 2.91 & 0 & & \\ C6 & \color{green}{1.56} & 3.18 & 4.21 & 2.20 & \color{orange}{1.67} & 0 & \\ C7 & \color{purple}{0.66} & 2.39 & 5.12 & 3.24 & \color{red}{1.26} & \color{orange}{1.71} & 0 \end{array} \]

Step 1: C1 joins C7 at a distance of 0.66.
Step 2: C5 joins (C1, C7) at a distance of 1.26.
Step 3: C6 joins (C1, C5, C7) at a distance of 1.56.
Step 4: C2 joins (C1, C5, C6, C7) at a distance of 1.91.
Step 5: C3 joins C4 at a distance of 2.14.
Step 6: (C3, C4) joins (C1, C2, C5, C6, C7) at a distance of 2.20.
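The merge distances in these steps are the ones nearest-neighbour (single) linkage produces on this matrix. The sketch below reruns the procedure on the lecture's distance matrix. The names `LOWER`, `build_distances` and `single_linkage` are illustrative assumptions, and the quadratic search is written for readability rather than speed.

```python
# Lower triangle of the distance matrix from the example, rows C2..C7.
NAMES = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
LOWER = {
    "C2": [1.91],
    "C3": [5.38, 7.12],
    "C4": [3.38, 5.06, 2.14],
    "C5": [1.51, 3.19, 4.57, 2.91],
    "C6": [1.56, 3.18, 4.21, 2.20, 1.67],
    "C7": [0.66, 2.39, 5.12, 3.24, 1.26, 1.71],
}

def build_distances(lower, names):
    """Map each unordered pair of objects to its distance."""
    d = {}
    for row, values in lower.items():
        for col, value in zip(names, values):
            d[frozenset((row, col))] = value
    return d

def single_linkage(names, d):
    """Agglomerate until one cluster remains; return (left, right, height) per merge."""
    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                h = min(d[frozenset((a, b))] for a in clusters[i] for b in clusters[j])
                if best is None or h < best[0]:
                    best = (h, i, j)
        h, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), h))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = single_linkage(NAMES, build_distances(LOWER, NAMES))
heights = [h for _, _, h in merges]
for left, right, h in merges:
    print(left, "+", right, "at", h)
```

The six merge heights come out as 0.66, 1.26, 1.56, 1.91, 2.14 and 2.20, matching Steps 1 to 6. Replacing `min` with `max` in the inner expression would give complete linkage instead, and in general a different merge order.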
Problems occur when the data are actually along a continuum (e.g., along a gradient, where no natural “boundaries” exist).
An arbitrary decision must be made about where to “draw the line” across a dendrogram.
Clustering procedures will find groups even when there are none (i.e., in random data).
Groups obtained from a cluster analysis may or may not be “real”.
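To illustrate how much the answer depends on where the line is drawn, the sketch below cuts the customer example at three heights. It relies on a property of single linkage: the groups at height h are the connected components obtained by linking every pair of objects whose distance is at most h. The cut heights 1.0, 2.0 and 2.5 and the function name `groups_at_height` are arbitrary choices made for illustration.

```python
# Lower triangle of the distance matrix from the customer example, rows C2..C7.
NAMES = ["C1", "C2", "C3", "C4", "C5", "C6", "C7"]
LOWER = {
    "C2": [1.91],
    "C3": [5.38, 7.12],
    "C4": [3.38, 5.06, 2.14],
    "C5": [1.51, 3.19, 4.57, 2.91],
    "C6": [1.56, 3.18, 4.21, 2.20, 1.67],
    "C7": [0.66, 2.39, 5.12, 3.24, 1.26, 1.71],
}

def groups_at_height(names, lower, height):
    """Single-linkage groups after cutting the dendrogram at `height`."""
    parent = {n: n for n in names}

    def find(x):
        # Follow parent links up to the representative of x's group.
        while parent[x] != x:
            x = parent[x]
        return x

    for row, values in lower.items():
        for col, value in zip(names, values):
            if value <= height:
                parent[find(row)] = find(col)  # link the two groups

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())

cuts = {h: groups_at_height(NAMES, LOWER, h) for h in (1.0, 2.0, 2.5)}
for h, groups in cuts.items():
    print(h, len(groups), groups)
```

The same data yield six, three or one group depending on the cut, and nothing in the algorithm says which of these, if any, corresponds to a real grouping.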
