
Cluster Analysis
synthetic_points: 118 lecture-generated two-dimensional observations with blobs, an ellipse, a crescent, a ring, and noise.
The ring, crescent, blobs, ellipse, and noise are chosen on purpose: they let us see that no single clustering method handles every geometry equally well. The widgets use the same broad shapes, but they subsample points so each step remains legible on a slide.
Minkowski distance: \[ d_{st}^{(q)} = \left(\sum_{j=1}^{p} |x_{sj} - x_{tj}|^q\right)^{1/q}, \qquad q \ge 1. \]
| \(q\) value | Interpretation |
|---|---|
| 1 | Manhattan distance |
| 2 | Euclidean distance |
| Larger \(q\) | Larger coordinate differences receive more weight |
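As a quick check of the formula and the table, the sketch below evaluates the Minkowski distance for one pair of made-up points. The helper name `minkowski` and the point values are illustrative assumptions; `stats::dist()` offers the same family, with its argument `p` playing the role of \(q\).

```r
# Two illustrative points in p = 2 dimensions (made-up values).
x <- rbind(s = c(0, 0), t = c(3, 4))

# Direct translation of the formula for a single pair of observations.
minkowski <- function(a, b, q) sum(abs(a - b)^q)^(1 / q)

manhattan <- minkowski(x["s", ], x["t", ], q = 1)  # 3 + 4 = 7
euclidean <- minkowski(x["s", ], x["t", ], q = 2)  # sqrt(9 + 16) = 5

# Same distance family via base R; here `p` is the exponent q.
d3 <- dist(x, method = "minkowski", p = 3)
```

As \(q\) grows, the result moves toward the largest single coordinate difference (4 for these points), which is what the last table row describes.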
monsters: the TidyTuesday D&D Monsters data from 2025-05-27.
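- Each monster has a `type` label plus combat features like Strength, Dexterity, HP, challenge rating, and speed.
- We use `type` afterward to help interpret the fitted groups.

```r
library(ggplot2)
library(tibble)

# Assumes monster_scale_compare already exists with columns
# value, feature, and scale_state (raw vs. z-score scaled).
monster_scale_compare |>
  ggplot(aes(x = value, y = feature, fill = scale_state)) +
  # Dashed reference line at 0, drawn only in the z-score panel.
  geom_vline(
    data = tibble(scale_state = "Z-score scaled", xint = 0),
    aes(xintercept = xint),
    inherit.aes = FALSE,
    linewidth = 0.7,
    linetype = "dashed",
    colour = "#8a3324"
  ) +
  geom_boxplot(outlier.alpha = 0.2, width = 0.7, orientation = "y") +
  # One panel per scale state, each with its own x-axis range.
  facet_wrap(~ scale_state, ncol = 1, scales = "free_x") +
  guides(fill = "none") +
  labs(x = "Raw units / z-score", y = NULL)
```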
- `hp_number`, `cr`, and `speed_base_number` are much larger in raw units.
- `scale()` means subtract the column mean and divide by the column standard deviation, not min-max scaling.
- PC1 explains the biggest spread and PC2 explains the next biggest spread.

Hover a monster type or a point to highlight that type in both panels at once. If `uwot` is unavailable during rendering, the PCA panel still appears and the UMAP panel explains that the embedding is missing.
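To make the `scale()` and PCA statements concrete, here is a minimal sketch on a toy data frame. The values are made up, and the column `str` is a hypothetical stand-in for a Strength feature; only `hp_number` and `cr` are names taken from the lecture data.

```r
# Toy stand-in for the monster features (made-up values).
toy <- data.frame(
  hp_number = c(10, 50, 200, 90),
  cr        = c(0.5, 2, 10, 4),
  str       = c(8, 12, 20, 15)   # hypothetical column name
)

# scale() centers each column at its mean and divides by its standard deviation.
z <- scale(toy)

# The same z-scores computed by hand: subtract column means, divide by column sds.
centered <- sweep(as.matrix(toy), 2, colMeans(toy))
manual   <- sweep(centered, 2, apply(toy, 2, sd), "/")

# PCA on the scaled features; components come ordered by variance explained.
pca <- prcomp(z)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
```

Because every scaled column has standard deviation 1, no feature dominates the distances or the principal components merely because of its units.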
There are several ways to define "closest" when we are comparing clusters rather than individual observations.
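Three common definitions are single, complete, and average linkage. The sketch below assumes these are among the options meant here and computes each one for two made-up clusters `A` and `B`.

```r
# Two small made-up clusters in 2D.
A <- rbind(c(0, 0), c(1, 0))
B <- rbind(c(4, 0), c(6, 0))

# Pairwise Euclidean distances between points of A (rows) and points of B (columns).
cross <- as.matrix(dist(rbind(A, B)))[1:2, 3:4]

single_link   <- min(cross)   # closest pair across clusters: 3
complete_link <- max(cross)   # farthest pair across clusters: 6
average_link  <- mean(cross)  # mean over all cross pairs: 4.5
```

In `stats::hclust()` the same choices are selected with `method = "single"`, `"complete"`, or `"average"`, and they can produce noticeably different trees on the same distance matrix.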
- `cutree(monster_hc, k = 5)` turns the hierarchy into five explicit cluster labels.
- The stability summary uses 35 resamples.
- An agreement score of 1 means a perfect match to the baseline clustering, 0 means about as much agreement as random assignment, and negative values mean worse-than-random agreement. Higher values mean more stable cluster structure.

The matrix uses colour for the baseline cluster and transparency for how often a pair of monsters stays together across the summary resamples. The bars on the right summarise average within-cluster stability, so wider bars mean a cluster stays together more consistently.
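One score with exactly these properties is the adjusted Rand index. The sketch below assumes that is the agreement measure in use and that resampling means a bootstrap of rows; `adjusted_rand` is a hypothetical helper, and the two-blob data are simulated rather than taken from the monsters.

```r
# Adjusted Rand index from the contingency table of two label vectors.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)
  pairs <- function(n) n * (n - 1) / 2          # number of pairs among n items
  sum_cells <- sum(pairs(tab))
  sum_rows  <- sum(pairs(rowSums(tab)))
  sum_cols  <- sum(pairs(colSums(tab)))
  expected  <- sum_rows * sum_cols / pairs(length(a))
  (sum_cells - expected) / ((sum_rows + sum_cols) / 2 - expected)
}

set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 6), ncol = 2))

# Baseline clustering on the full data.
baseline <- cutree(hclust(dist(x), method = "average"), k = 2)

# One bootstrap resample: recluster the sampled rows, then compare
# the new labels with the baseline labels of those same rows.
idx <- sample(nrow(x), replace = TRUE)
resampled <- cutree(hclust(dist(x[idx, ]), method = "average"), k = 2)
stability <- adjusted_rand(baseline[idx], resampled)
```

Repeating the resampling step many times and averaging the scores gives a single stability summary per value of `k`.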
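For Euclidean distance the K-means objective is the within-cluster sum of squares \[ \sum_{k=1}^K \sum_{C(i)=k} \lVert x_i - \mu_k \rVert^2, \] where \(C(i)\) is the cluster assigned to observation \(i\) and \(\mu_k\) is the centroid (mean) of cluster \(k\).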
Algorithm:
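A common way to minimize this objective is Lloyd's iteration, which alternates an assignment step and a centroid update. The sketch below assumes that is the procedure intended here; `lloyd_kmeans` and its arguments are hypothetical names, and the four points are made up.

```r
lloyd_kmeans <- function(x, centers, max_iter = 100) {
  x <- as.matrix(x)
  for (iter in seq_len(max_iter)) {
    # Assignment step: squared Euclidean distance from each point to each centroid.
    d2 <- sapply(seq_len(nrow(centers)), function(k) {
      rowSums(sweep(x, 2, centers[k, ])^2)
    })
    cluster <- max.col(-d2, ties.method = "first")   # index of the nearest centroid

    # Update step: each centroid becomes the mean of its assigned points.
    new_centers <- centers
    for (k in seq_len(nrow(centers))) {
      members <- cluster == k
      if (any(members)) new_centers[k, ] <- colMeans(x[members, , drop = FALSE])
    }

    # Stop once the centroids no longer move.
    if (all(abs(new_centers - centers) < 1e-12)) break
    centers <- new_centers
  }
  objective <- sum((x - centers[cluster, , drop = FALSE])^2)
  list(cluster = cluster, centers = centers, objective = objective)
}

x <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
fit <- lloyd_kmeans(x, centers = x[c(1, 3), ])
```

Each step can only lower the objective or leave it unchanged, so the loop converges, but the result depends on the starting centroids. In practice `stats::kmeans()` with several random starts (`nstart`) is the usual tool.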
For a dissimilarity matrix, PAM chooses the set of \(K\) medoids that minimizes total assignment dissimilarity: \[ M^* = \arg\min_{M: |M| = K} \sum_{i=1}^n \min_{m \in M} d(i,m). \]
Algorithm:
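The objective can be checked directly on a small example. The sketch below assumes the `cluster` package (a recommended package in standard R installations) and uses six made-up one-dimensional values; `total_cost` is an illustrative name for the sum in the formula.

```r
library(cluster)

# Made-up data: two tight groups on a line.
x <- matrix(c(1, 2, 3, 20, 21, 22), ncol = 1)
d <- dist(x)

# PAM works from the dissimilarities alone.
fit <- pam(d, k = 2)
medoid_rows <- fit$id.med     # medoids are indices of actual observations

# Objective from the formula: each point's dissimilarity to its nearest medoid, summed.
dm <- as.matrix(d)
total_cost <- sum(apply(dm[, medoid_rows, drop = FALSE], 1, min))
```

Because only the dissimilarity matrix is needed, the same call works with non-Euclidean dissimilarities, which is one reason medoids are useful for mixed data.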
Switch between methods to compare how centroids and medoids react to bridge points and borderline observations.
A silhouette is a diagnostic for how well an observation fits into its assigned cluster compared to the next best alternative.
- Values near 1 indicate a well-matched point.
- Values near 0 indicate a boundary point.
- Use the K control to compare the PAM silhouettes for 2 through 5 clusters.

Use K = 2 through 5 and switch between PCA and UMAP to compare the final PAM clustering with its silhouette diagnostic. Hover a silhouette bar or a plotted monster to link the same observation across both panels.
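The usual silhouette width is \(s(i) = (b_i - a_i) / \max(a_i, b_i)\), where \(a_i\) is the mean dissimilarity to the other members of the point's own cluster and \(b_i\) is the smallest mean dissimilarity to any other cluster. The sketch below computes it by hand and compares it with `cluster::silhouette()`. The data and labels are made up, `silhouette_width` is a hypothetical helper, and clusters of size one are not handled.

```r
library(cluster)

# Made-up values; the last point (11) sits between the two groups.
x <- matrix(c(1, 2, 3, 20, 21, 22, 11), ncol = 1)
labels <- c(1L, 1L, 1L, 2L, 2L, 2L, 1L)
d <- as.matrix(dist(x))

silhouette_width <- function(i) {
  own <- labels == labels[i]
  others_in_own <- own & seq_along(labels) != i
  a <- mean(d[i, others_in_own])                     # fit to own cluster
  b <- min(tapply(d[i, !own], labels[!own], mean))   # fit to next best cluster
  (b - a) / max(a, b)
}

s_manual <- sapply(seq_along(labels), silhouette_width)
s_pkg    <- silhouette(labels, dist(x))[, "sil_width"]
```

Here the in-between point gets a width of 0.1 (a boundary point), while the points inside each tight group score much closer to 1.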
| Agreement | Count | Percent |
|---|---|---|
| Same cluster after relabelling | 233 | 70.6% |
| Different cluster | 97 | 29.4% |

Use K = 2 through 5 to see where K-means and PAM disagree after relabelling PAM clusters to the closest K-means labels. Disagreements are highlighted strongly; agreements stay visible as context.
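Cluster numbers are arbitrary, so labels from two methods must be aligned before they are compared. The sketch below shows one simple alignment, mapping each PAM cluster to the K-means label it overlaps most. Whether the widget uses this exact rule is an assumption, and both label vectors are made up.

```r
# Hypothetical labels from two methods on the same 8 observations.
kmeans_labels <- c(1, 1, 1, 2, 2, 3, 3, 3)
pam_labels    <- c(2, 2, 2, 3, 1, 1, 1, 3)

# Cross-tabulate: rows are PAM clusters, columns are K-means clusters.
overlap <- table(pam = pam_labels, kmeans = kmeans_labels)

# Map each PAM cluster to the K-means label with the largest overlap
# (ties go to the first such label).
mapping <- as.integer(colnames(overlap))[apply(overlap, 1, which.max)]
names(mapping) <- rownames(overlap)
pam_relabelled <- unname(mapping[as.character(pam_labels)])

agree <- pam_relabelled == kmeans_labels
agreement_rate <- mean(agree)   # 6 of 8 observations agree: 0.75
```

This greedy rule can send two PAM clusters to the same K-means label. A strict one-to-one matching would require solving an assignment problem instead.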
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.
It groups points that are densely packed together and labels isolated points as noise.
Unlike K-means, DBSCAN does not require us to choose the number of clusters in advance, and it can recover irregular shapes such as rings, crescents, or curved bands.
- Core point: a point with at least `MinPts` observations in its `eps`-neighborhood.
- `eps` or epsilon: the neighborhood radius around a point.
- `MinPts`: the minimum number of observations in the epsilon-neighborhood needed for a core point. Here that count includes the point itself.

DBSCAN repeats: inspect an unvisited point, test whether it is core, expand through its neighbors if it is core, and otherwise leave it as provisional noise until a later core point reaches it.
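The loop described above can be sketched in base R. This is a minimal illustration, not the lecture's implementation: `dbscan_sketch` is a hypothetical name, a neighbor is assumed to be any point at Euclidean distance at most `eps`, the neighbor count includes the point itself, and label 0 marks noise.

```r
dbscan_sketch <- function(x, eps, min_pts) {
  d <- as.matrix(dist(x))
  n <- nrow(d)
  # eps-neighborhood of each point, including the point itself.
  neighbors <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  is_core <- lengths(neighbors) >= min_pts

  cluster <- rep(NA_integer_, n)   # NA = unvisited, 0 = noise, 1.. = cluster id
  current <- 0L
  for (i in seq_len(n)) {
    if (!is.na(cluster[i])) next
    if (!is_core[i]) {             # provisional noise; a core point may claim it later
      cluster[i] <- 0L
      next
    }
    current <- current + 1L        # start a new cluster from this core point
    cluster[i] <- current
    queue <- neighbors[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (is.na(cluster[j]) || cluster[j] == 0L) {
        cluster[j] <- current      # unvisited or noise point joins the cluster
        if (is_core[j]) queue <- c(queue, neighbors[[j]])  # only core points expand
      }
    }
  }
  cluster
}

# Made-up data: two tight triples and one isolated point.
x <- rbind(c(0, 0), c(0, 1), c(1, 0),
           c(10, 10), c(10, 11), c(11, 10),
           c(50, 50))
labels <- dbscan_sketch(x, eps = 1.5, min_pts = 3)
```

For these points the result is two clusters plus one noise point, and the number of clusters was never supplied as an input.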
HDBSCAN stands for Hierarchical Density-Based Spatial Clustering of Applications with Noise.
For hierarchical clustering, K-means, and PAM, the cluster count is chosen by us. For DBSCAN, eps and MinPts are inputs and the number of clusters is an output.
| Method | Strength | Risk | Needs scaling? |
|---|---|---|---|
| Hierarchical | Full tree, no need to pre-fix K | Linkage choice matters | Usually yes |
| K-means | Fast, familiar, easy to explain | Splits non-convex shapes | Yes |
| K-medoids | Robust, medoids are real points | Slower than K-means | Usually yes |
| DBSCAN | Handles noise and shape | Parameter tuning can be awkward | Yes |
| HDBSCAN | Better with varying density | Less intuitive | Yes |
Chapter 7 of the course notes covers Gower dissimilarity, bootstrap stability, HDBSCAN, and UMAP with more mathematical detail.