Lecture 10

Cluster Analysis

Nick Knowlton

What changes in an unsupervised problem?

So far, regression and classification have started with a known response or known class labels.
In supervised learning, we use those known outcomes to fit a prediction rule.
In clustering, the groups are unknown, so the first question is whether meaningful structure exists at all.
That is why the choice of similarity or dissimilarity measure becomes a central modelling decision.
Different clustering algorithms make different assumptions about the data structure.
- Hierarchical clustering gives a nested tree.
- K-means gives compact partitions.
- K-medoids swaps means for representative observations.
- DBSCAN keeps dense shapes and leaves sparse points as noise.

Lecture roadmap

Review distance ideas for multivariate data.
Use PCA and UMAP as display maps for data.
Build and interpret a hierarchical clustering tree.
Compare centroid, medoid, and density-based clustering.
Finish by comparing today’s methods on a shared 2D data display.

Synthetic dataset

synthetic_points: 118 lecture-generated two-dimensional observations with blobs, an ellipse, a crescent, a ring, and noise.

Dissimilarity matrices

Minkowski distance: \[ d_{st}^{(q)} = \left(\sum_{j=1}^{p} |x_{sj} - x_{tj}|^q\right)^{1/q}, \qquad q \ge 1. \]

\(q\) value	Interpretation
1	Manhattan distance
2	Euclidean distance
Larger \(q\)	Larger coordinate differences receive more weight
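
As a small worked example of the formula, the sketch below computes the distance between two made-up points (not taken from synthetic_points) for several values of \(q\). The helper name minkowski is illustrative; base R's dist() exposes the same exponent through its p argument.

```r
# Two illustrative points whose coordinate differences are 3 and 4.
pts <- rbind(s = c(0, 0), t = c(3, 4))

# Minkowski distance written out directly from the formula.
minkowski <- function(x, y, q) sum(abs(x - y)^q)^(1 / q)

d_manhattan <- minkowski(pts["s", ], pts["t", ], q = 1)   # 3 + 4 = 7
d_euclidean <- minkowski(pts["s", ], pts["t", ], q = 2)   # sqrt(9 + 16) = 5
d_large_q   <- minkowski(pts["s", ], pts["t", ], q = 20)  # close to max(3, 4) = 4

# The built-in dist() agrees; its `p` argument plays the role of q.
d_builtin <- as.numeric(dist(pts, method = "minkowski", p = 2))
```

As \(q\) grows, the largest coordinate difference dominates, which is why the \(q = 20\) value sits just above 4.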

Example distance matrix

import { syntheticDistanceWidget } from "./widgets/clustering/synthetic-distance-widget.js"
synthetic_distance_rows = await FileAttachment("widgets/clustering/synthetic_points.csv").csv({ typed: true })
syntheticDistanceWidget(synthetic_distance_rows)

D&D Monsters dataset

monsters: the TidyTuesday D&D Monsters data from 2025-05-27.

Each row is a monster, with a type label plus combat features like Strength, Dexterity, HP, challenge rating, and speed.
We will cluster on the numeric features, then use type afterward to help interpret the fitted groups.
The dataset has 330 monsters and 16 numeric features, so we will use dimension reduction to visualize the clusters in two dimensions.

To Scale or not to Scale?

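# monster_scale_compare is assumed to be a long-format tibble with columns
# value, feature, and scale_state (raw versus z-score scaled).
monster_scale_compare |>
  ggplot(aes(x = value, y = feature, fill = scale_state)) +
  # Dashed reference line at 0, drawn only in the "Z-score scaled" panel.
  geom_vline(
    data = tibble(scale_state = "Z-score scaled", xint = 0),
    aes(xintercept = xint),
    inherit.aes = FALSE,
    linewidth = 0.7,
    linetype = "dashed",
    colour = "#8a3324"
  ) +
  geom_boxplot(outlier.alpha = 0.2, width = 0.7, orientation = "y") +
  # One panel per scaling state, each with its own x-axis range.
  facet_wrap(~ scale_state, ncol = 1, scales = "free_x") +
  guides(fill = "none") +
  labs(x = "Raw units / z-score", y = NULL)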

The D&D ability scores live on a small scale, while hp_number, cr, and speed_base_number are much larger in raw units.
If we cluster on raw Euclidean distance, the large-scale variables dominate the smaller ones.
So the clustering examples below use z-score standardization before computing Euclidean distances.
Here scale() means subtract the column mean and divide by the column standard deviation, not min-max scaling.
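
A minimal sketch of what z-score scaling does to distances, using three made-up rows rather than the monsters data. The column name strength is illustrative; hp_number follows the feature named above.

```r
# Made-up values for three monsters: one small-scale and one large-scale feature.
toy <- data.frame(strength = c(10, 14, 18), hp_number = c(20, 110, 200))

# scale() centres each column at 0 and divides by the sample SD (n - 1 denominator).
toy_z <- scale(toy)

# The same calculation by hand for one column.
hp_by_hand <- (toy$hp_number - mean(toy$hp_number)) / sd(toy$hp_number)

# Raw Euclidean distances are driven almost entirely by hp_number;
# after scaling, both features contribute equally.
d_raw    <- dist(toy)
d_scaled <- dist(toy_z)
```

In this toy case the raw distance between the first two rows is about 90, almost all of it from hp_number, while the scaled distance is sqrt(2) with equal contributions from both columns.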

Dimension reduction for visualization

PCA

PCA is a linear way to visualize high-dimensional numeric data in just a few coordinates.
It chooses orthogonal directions that capture as much variance as possible, so PC1 explains the biggest spread and PC2 explains the next biggest spread.
PCA is a projection method, so it can be used as a display map for any clustering method that uses Euclidean distance.
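
One way to build such a display map in base R is prcomp(). The sketch below runs on simulated data standing in for a numeric feature matrix; it is not the lecture's monsters pipeline, and the object names are illustrative.

```r
set.seed(1)
# Simulated stand-in for a numeric feature matrix (not the monsters data).
x <- matrix(rnorm(200 * 5), ncol = 5)
x[, 2] <- x[, 1] + rnorm(200, sd = 0.3)   # make two features strongly correlated

# Centre and scale each column, then rotate onto the principal components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Scores on the first two components: the 2D coordinates used for plotting.
display_map <- pca$x[, 1:2]

# Proportion of total variance carried by each component (in decreasing order).
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Cluster labels from any method can then be used to colour the points in display_map; the clustering itself is still fitted on the full set of scaled features.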

UMAP

UMAP is nonlinear and aims to keep nearby points nearby in the 2D map.
It is often useful when local neighborhood structure matters more than a single global linear projection.

Monster projections: PCA and UMAP

Hierarchical Clustering

Agglomerative hierarchical clustering

Start with each observation in its own cluster.
Find the closest pair of clusters.
Merge them.
Repeat until only one cluster remains.

There are several ways to define closest when we are comparing clusters rather than individual observations.

This choice, the linkage rule, is separate from the original point-to-point distance metric.
It controls which clusters are merged at each step.
Single linkage uses the closest pair of observations across two clusters.
Complete linkage uses the furthest pair of observations across two clusters.
Average linkage uses the mean pairwise distance across the two clusters.
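
To see the three linkage rules disagree, here is a small sketch on four made-up points on a line, using base R's hclust(). The points are chosen for illustration only.

```r
# Four illustrative points on a line (not lecture data).
x <- c(a = 0, b = 1, c = 2.5, d = 4.4)
d <- dist(x)

hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")

# Merge heights: the dissimilarity at which each merge happens.
hc_single$height    # 1.0 1.5 1.9
hc_complete$height  # 1.0 1.9 4.4
hc_average$height   # 1.0 1.9 2.95

# Two-cluster cuts expose the different merge histories.
cutree(hc_single, k = 2)    # a, b, c together; d alone
cutree(hc_complete, k = 2)  # a, b together; c, d together
```

All three rules merge a and b first. Single linkage then attaches c to that pair because c is 1.5 from b, whereas complete and average linkage pair c with d.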

Hierarchical clustering on monsters

The result is a nested tree of clusters, called a dendrogram.
The tree can be cut at different heights to give different clusterings.

plot(
  monster_hc,
  labels = FALSE,
  hang = -1,
  main = "D&D Monsters dendrogram",
  xlab = "Monsters in leaf order",
  ylab = "Merge dissimilarity",
  sub = ""
)

The leaves on the x-axis are the individual monsters in the tree order.
The height on the y-axis is the dissimilarity at which two clusters merge.

Interactively building a hierarchical clustering

Keep the points fixed and change only the linkage rule.
Because “closest clusters” is defined differently, the merge order changes.
If you cut the tree at a fixed height later, those different merge histories can lead to different clusterings.

Linkage is not a minor detail

Single linkage is prone to chaining: a string of closely spaced points can link two otherwise distinct groups.
Complete linkage prefers tight, compact groups.
Average linkage usually sits between them.

Cutting the tree gives an actual partition

cutree(monster_hc, k = 5) turns the hierarchy into five explicit cluster labels.
Cutting higher or lower would give a different number of clusters.
Move the cut-height slider to see how the partition changes as the horizontal cut moves up or down.
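
A minimal sketch of the two ways to cut a tree with cutree(): by number of clusters (k) or by height (h). The five points are made up for illustration; monster_hc is not used here.

```r
# Five illustrative points on a line (not lecture data).
x <- c(a = 0, b = 1, c = 2.5, d = 4.4, e = 9)
hc <- hclust(dist(x), method = "complete")

hc$height                      # merge heights: 1.0 1.9 4.4 9.0

# Ask for a number of clusters ...
labels_k3 <- cutree(hc, k = 3)   # {a, b}, {c, d}, {e}

# ... or cut at a height: only merges at or below h are kept.
labels_h3 <- cutree(hc, h = 3)   # same three clusters
labels_h5 <- cutree(hc, h = 5)   # {a, b, c, d}, {e}
```

Any cut height between 1.9 and 4.4 gives the same three clusters, so a long vertical gap in the dendrogram marks a range of cuts that all yield the same partition.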

import { hierarchicalCutWidget } from "./widgets/clustering/hierarchical-widget.js"
hierarchicalCutWidget(
  monster_cut_ojs_points,
  monster_cut_ojs_merges,
  monster_cut_ojs_order,
  monster_cut_ojs_labels
)

Are the clusters stable?

After choosing a linkage rule and a tree cut, we can ask whether those same clusters would reappear if the data shifted slightly.
Resample observations with replacement to create a new dataset of the same size.
Refit the clustering, with the same linkage rule and cut, on each resample.
Compare the resulting partition with the original.
A stable cluster should survive small perturbations to the data.

Bootstrap stability on monsters

A bootstrap sample is a new dataset made by sampling the monsters with replacement from the original data.
The tibble below contains one adjusted Rand index (ARI) value per bootstrap resample; here we used 35 resamples.
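The adjusted Rand index compares every pair of monsters across two partitions: did that pair stay together, or stay apart, in both clusterings? It then subtracts the agreement expected by chance, \[ \mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - E}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - E}, \qquad E = \frac{\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}}{\binom{n}{2}}, \] where \(n_{ij}\) is the number of monsters in cluster \(i\) of the first partition and cluster \(j\) of the second, \(a_i\) and \(b_j\) are the row and column totals of that table, and \(n\) is the number of monsters compared. A value of 1 means a perfect match to the baseline clustering, 0 means about as much agreement as random assignment, and negative values mean less agreement than expected by chance. Higher values mean more stable cluster structure.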
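
The ARI formula and the bootstrap loop can be sketched in base R as follows. This is an illustration on simulated data, not the code behind the lecture's results. The names adjusted_rand_index and cluster_once are hypothetical, and the function does not handle the degenerate case where both partitions have a single cluster.

```r
adjusted_rand_index <- function(labels_a, labels_b) {
  tab <- table(labels_a, labels_b)              # contingency counts n_ij
  n_pairs <- function(m) sum(choose(m, 2))
  sum_ij <- n_pairs(tab)
  sum_a  <- n_pairs(rowSums(tab))               # from row totals a_i
  sum_b  <- n_pairs(colSums(tab))               # from column totals b_j
  expected <- sum_a * sum_b / choose(sum(tab), 2)
  (sum_ij - expected) / (0.5 * (sum_a + sum_b) - expected)
}

baseline <- c(1, 1, 1, 2, 2, 2)
adjusted_rand_index(baseline, c(2, 2, 2, 1, 1, 1))  # 1: same partition, labels swapped
adjusted_rand_index(baseline, c(1, 1, 2, 2, 3, 3))  # 8/33, about 0.24

set.seed(10)
# Simulated stand-in data: two well-separated groups of 30 points each.
x <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
           matrix(rnorm(60, mean = 8), ncol = 2))

cluster_once <- function(data, k = 2) {
  cutree(hclust(dist(data), method = "average"), k = k)
}
baseline_labels <- cluster_once(x)

ari_values <- replicate(35, {
  idx <- sample(nrow(x), replace = TRUE)          # bootstrap resample of rows
  boot_labels <- cluster_once(x[idx, , drop = FALSE])
  # Compare on the resampled rows: baseline label versus refitted label.
  adjusted_rand_index(baseline_labels[idx], boot_labels)
})
summary(ari_values)
```

Because ARI is computed from the contingency table alone, it is unaffected by how the clusters happen to be numbered in each refit.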

Interactive resampling stability widget

K-means and K-medoids

K-means minimises within-cluster variation

K-means chooses cluster assignments and centroids to minimise the total within-cluster sum of squares.

For Euclidean distance the objective is \[ \sum_{k=1}^K \sum_{C(i)=k} \lVert x_i - \mu_k \rVert^2. \]

Algorithm:

Start by placing the centroids arbitrarily.
Assign each point to the nearest centroid.
Recompute centroids.
Repeat the assignment and update steps until the assignments, and hence the centroids, stop changing.
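
The four steps can be written out directly. The sketch below is a bare-bones version of this loop on simulated data, assuming random observations as starting centroids; in practice the built-in kmeans() with several random starts (nstart) is the usual choice.

```r
set.seed(42)
# Simulated stand-in data: two compact groups of 20 points in 2D.
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2))
K <- 2

# Step 1: start from K randomly chosen observations as centroids.
centroids <- x[sample(nrow(x), K), , drop = FALSE]

for (iter in 1:50) {
  # Step 2: squared Euclidean distance from every point to every centroid,
  # then assign each point to the nearest one.
  d2 <- sapply(seq_len(K), function(k) rowSums(sweep(x, 2, centroids[k, ])^2))
  assignment <- max.col(-d2, ties.method = "first")

  # Step 3: recompute each centroid as the mean of its assigned points.
  new_centroids <- t(sapply(seq_len(K), function(k)
    colMeans(x[assignment == k, , drop = FALSE])))

  # Step 4: stop once the centroids no longer move.
  if (all(abs(new_centroids - centroids) < 1e-10)) break
  centroids <- new_centroids
}

# Total within-cluster sum of squares: the objective being minimised.
wss <- sum((x - centroids[assignment, ])^2)
```

Each pass can only lower or maintain the objective, so the loop settles, but only at a local minimum that depends on the starting centroids.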

K-means on monsters

K-medoids / PAM

PAM stands for Partitioning Around Medoids.
Replace the centroid with a real observation, the medoid.
This makes the method less sensitive to outliers.
It also lets us use any dissimilarity, not just Euclidean distance.

For a dissimilarity matrix, PAM chooses the set of \(K\) medoids that minimizes total assignment dissimilarity: \[ M^* = \arg\min_{M: |M| = K} \sum_{i=1}^n \min_{m \in M} d(i,m). \]

Algorithm:

Start with \(K\) candidate medoids.
Assign each observation to its nearest medoid.
Propose swapping a medoid with a non-medoid.
Accept the swap if the objective decreases; repeat until no improving swap remains.
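
A simplified sketch of the swap search in base R, on made-up 2D points with one outlier. It starts from the first \(K\) observations and accepts the first improving swap it finds, so it is not the full PAM procedure as implemented in cluster::pam(); the function names are hypothetical.

```r
# Objective: total dissimilarity from each observation to its nearest medoid.
pam_cost <- function(d, medoids) sum(apply(d[, medoids, drop = FALSE], 1, min))

simple_pam <- function(d, K, start = seq_len(K)) {
  d <- as.matrix(d)
  medoids <- start
  repeat {
    improved <- FALSE
    for (m in seq_along(medoids)) {
      for (candidate in setdiff(seq_len(nrow(d)), medoids)) {
        trial <- medoids
        trial[m] <- candidate                       # propose a swap
        if (pam_cost(d, trial) < pam_cost(d, medoids) - 1e-12) {
          medoids <- trial                          # accept if the objective drops
          improved <- TRUE
        }
      }
    }
    if (!improved) break                            # no improving swap remains
  }
  list(medoids = medoids,
       cluster = apply(d[, medoids, drop = FALSE], 1, which.min),
       cost = pam_cost(d, medoids))
}

x <- rbind(c(0, 0), c(1, 0), c(0, 1), c(1, 1), c(0.5, 0.5),        # group 1
           c(10, 0), c(11, 0), c(10, 1), c(11, 1), c(10.5, 0.5),   # group 2
           c(10.5, 8))                                              # outlier
fit <- simple_pam(dist(x), K = 2)

x[fit$medoids, ]      # medoids are rows 10 and 5: the two group centres
colMeans(x[6:11, ])   # mean of group 2 plus the outlier: (10.5, 1.75)
```

The outlier drags the mean of its group up to y = 1.75, while the medoid stays at the real observation (10.5, 0.5).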

Interactive K-means versus K-medoids

Silhouette widths for PAM

A silhouette is a diagnostic for how well an observation fits into its assigned cluster compared to the next best alternative.

A silhouette asks whether a point is closer to its own cluster than to a nearby competing cluster.
Values near 1 indicate a well-matched point.
Values near 0 indicate a boundary point.
Negative values suggest the point may fit better in another cluster.
Silhouettes are a general clustering diagnostic, not something unique to PAM.
Use the K control to compare the PAM silhouettes for 2 through 5 clusters.
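
For reference, the usual silhouette width of observation \(i\) is \(s(i) = (b_i - a_i) / \max(a_i, b_i)\), where \(a_i\) is the mean dissimilarity to the other members of its own cluster and \(b_i\) is the smallest mean dissimilarity to any other cluster. The sketch below computes it by hand on made-up 1D data; silhouette_widths is a hypothetical helper, and cluster::silhouette() is the standard packaged version.

```r
silhouette_widths <- function(d, cluster) {
  d <- as.matrix(d)
  sapply(seq_len(nrow(d)), function(i) {
    own <- cluster == cluster[i]
    if (sum(own) == 1) return(0)                 # convention for singleton clusters
    # a: mean distance to own cluster, excluding the point itself.
    a <- sum(d[i, own]) / (sum(own) - 1)
    # b: mean distance to the nearest competing cluster.
    b <- min(tapply(d[i, !own], cluster[!own], mean))
    (b - a) / max(a, b)
  })
}

# Two groups on a line, plus one point (6) sitting between them.
x <- c(0, 1, 2, 10, 11, 12, 6)
cluster <- c(1, 1, 1, 2, 2, 2, 1)
s <- silhouette_widths(dist(x), cluster)
round(s, 3)
```

The point at 6 is on average 5 units from both groups, so its silhouette width is exactly 0, the boundary case described above.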

PAM diagnostic: choosing K

import { pamDiagnosticWidget } from "./widgets/clustering/pam-diagnostic-widget.js"
pamDiagnosticWidget(
  monster_pam_projection_diagnostic_ojs,
  monster_pam_silhouette_diagnostic_ojs,
  monster_pam_silhouette_summary_diagnostic_ojs
)

When K-means and PAM disagree

Agreement	Count	Percent
Same cluster after relabelling	233	70.6%
Different cluster	97	29.4%

Disagreement usually shows up around boundary observations or points affected by outliers.
K-means is faster and works well for compact numeric clusters.
PAM is more robust when the center should be a real observation or when outliers matter.

Disagreement across K values

import { kmeansPamDisagreementWidget } from "./widgets/clustering/cluster-comparison-widget.js"
kmeansPamDisagreementWidget(
  monster_disagreement_ojs,
  monster_disagreement_axis_labels_ojs,
  monster_disagreement_axis_limits_ojs
)

Density Clustering

Clustering by density: DBSCAN

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

It groups points that are packed closely together (densely) and labels isolated points as noise.
Unlike K-means, DBSCAN does not require us to choose the number of clusters in advance, and it can recover irregular shapes such as rings, crescents, or curved bands.

Advantages:

Can find clusters of arbitrary shape.
Can identify noise points that do not belong to any cluster.

Disadvantages:

Sensitive to the choice of parameters (eps and MinPts).

Terms to know:

Core point: at least MinPts observations in its eps-neighborhood.
Border point: not core, but within the neighborhood of a core point.
Noise point: neither core nor border.
eps or epsilon: the neighborhood radius around a point.
MinPts: the minimum number of observations in the epsilon-neighborhood needed for a core point. Here that count includes the point itself.
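
These definitions translate almost directly into code. The sketch below only labels points as core, border, or noise on made-up data; it does not perform the cluster-growing step of DBSCAN, and classify_points is a hypothetical helper rather than a function from the dbscan package.

```r
classify_points <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))
  neighbours <- d <= eps                       # includes the point itself (distance 0)
  is_core <- rowSums(neighbours) >= min_pts
  # Border: not core, but inside the eps-neighbourhood of at least one core point.
  near_core <- rowSums(neighbours[, is_core, drop = FALSE]) > 0
  ifelse(is_core, "core", ifelse(near_core, "border", "noise"))
}

# Six illustrative points on a line (second coordinate fixed at 0).
x <- cbind(c(0, 0.5, 1, 1.5, 2.4, 8), 0)
status <- classify_points(x, eps = 1.1, min_pts = 3)
unname(status)   # "core" "core" "core" "core" "border" "noise"
```

The point at 2.4 has only two observations within eps (itself and 1.5), so it is not core, but it lies within eps of the core point at 1.5 and is therefore a border point. The point at 8 has no neighbours and is noise.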

DBSCAN on our ring/crescent data

This is exactly the kind of geometry where centroid-based methods tend to struggle.
The ring and crescent are useful because they show how density-based clustering can recover non-convex shapes.

ggplot(dbscan_plot_data, aes(x = x, y = y, colour = cluster, shape = point_type)) +
  geom_point(size = 2.2, alpha = 0.88) +
  scale_shape_manual(values = c(core = 16, border = 1, noise = 4)) +
  labs(x = NULL, y = NULL, colour = NULL, shape = "Point status")

Interactive DBSCAN widget

HDBSCAN

HDBSCAN stands for Hierarchical Density-Based Spatial Clustering of Applications with Noise.

The hierarchy comes from varying the density threshold: rather than fixing a single eps, HDBSCAN builds a tree of clusters across a range of density levels.

if (is.null(hdbscan_plot_data)) {
  tibble(status = "Install the dbscan package to render the HDBSCAN example.")
} else {
  ggplot(hdbscan_plot_data, aes(x = x, y = y, colour = cluster)) +
    geom_point(alpha = 0.9, size = 2.2) +
    labs(x = NULL, y = NULL, colour = NULL)
}

Advantages:

Handles clusters of varying density better than DBSCAN, which applies a single eps everywhere.

Disadvantages:

More complex than DBSCAN and may require more computational resources.
Like DBSCAN, it can be sensitive to parameter choices, although it has fewer parameters to tune.

Wrapping up: comparing methods on the same display

Monsters clustered using today’s methods

Practical comparison

Method	Strength	Risk	Needs scaling?
Hierarchical	Full tree, no need to pre-fix `K`	Linkage choice matters	Usually yes
K-means	Fast, familiar, easy to explain	Splits non-convex shapes	Yes
K-medoids	Robust, medoids are real points	Slower than K-means	Usually yes
DBSCAN	Handles noise and shape	Parameter tuning can be awkward	Yes
HDBSCAN	Better with varying density	Less intuitive	Yes