161762 Multivariate Analysis for Big Data

Lecture 11: Cluster Analysis Part 2

Sergio Sclovich

Massey University

Fall 2026

Learning objectives

  • Understand the basics of K-means.

  • Understand the basics of validation.

Refresher from part 1

\[ \mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \Longrightarrow \mathbf{D} = \begin{bmatrix} 0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\ d(x_2, x_1) & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\ d(x_3, x_1) & d(x_3, x_2) & 0 & \cdots & d(x_3, x_n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d(x_n, x_1) & d(x_n, x_2) & d(x_n, x_3) & \cdots & 0 \end{bmatrix} \Longrightarrow \mathbf{C} = \begin{bmatrix} C_1 \\ C_2 \\ \vdots \\ C_n \\ \end{bmatrix} \]
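
The first arrow above (data matrix to distance matrix) can be sketched in a few lines of plain Python. This is only an illustration: it assumes Euclidean distance for d, and the function name `distance_matrix` and the toy data are made up for the example.

```python
import math

def distance_matrix(X):
    """Return the n x n matrix D with D[i][j] = Euclidean distance
    between rows i and j of X (assumption: d is Euclidean)."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # D is symmetric, so fill both halves at once
            D[i][j] = D[j][i] = math.dist(X[i], X[j])
    return D

# Toy data: n = 3 objects measured on p = 2 variables (hypothetical values)
X = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(X)
for row in D:
    print(row)
```

The diagonal is zero and the matrix is symmetric, matching the layout of D above.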

Types of clustering methods

  • Hierarchical clustering: these algorithms form clusters by connecting objects based on their distance. A cluster can be understood in terms of the maximum distance required to connect its elements.

  • Centroid-based clustering: each cluster is represented by a central vector, which is not necessarily a member of the data set. The number of clusters (k) is predefined, and the algorithm assigns each object to the nearest cluster center.

K-means method as a centroid-based example

  • You choose k, and the algorithm classifies objects into k classes.
  • The algorithm needs starting points. You can specify these or let it generate them randomly.
  • It is an iterative procedure.
  • It is also considered a partitioning method.
  • Partitioning methods scale linearly with the number of observations.
  • Often used when the dataset is too large for hierarchical clustering.

K-means method

  1. Select inputs.

  2. Select k cluster centers.

  3. Assign cases to closest center.

  4. Update cluster centers.

  5. Re-assign cases.

  6. Repeat steps 4 and 5 until convergence.

Training data
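
The steps listed above can be sketched in plain Python as follows. This is a minimal illustration rather than a reference implementation: it assumes squared Euclidean distance, draws random starting points from the data when none are given, and stops when the assignments no longer change. The names `kmeans`, `assign`, `update`, `init`, `max_iter` and `seed`, as well as the toy data, are hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def assign(points, centers):
    """Index of the closest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Mean of each cluster; an empty cluster keeps its old center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(tuple(old))
    return new_centers

def kmeans(points, k, init=None, max_iter=100, seed=0):
    # Step 2: starting points, either supplied or drawn at random from the data
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centers = [tuple(c) for c in start]
    labels = assign(points, centers)               # step 3
    for _ in range(max_iter):
        centers = update(points, labels, centers)  # step 4
        new_labels = assign(points, centers)       # step 5
        if new_labels == labels:                   # step 6: no case moved
            break
        labels = new_labels
    within_ss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, within_ss

# Step 1: inputs (two well-separated toy groups, hypothetical values)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers, within_ss = kmeans(points, k=2, init=[(0, 0), (10, 10)])
print(labels, centers, within_ss)

# Same data, random starting points
labels_rand, centers_rand, within_ss_rand = kmeans(points, k=2, seed=1)
```

Passing `init` mirrors the option of specifying the starting points yourself; leaving it out lets the sketch pick them at random.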

K-means considerations

Strengths

  • Relatively efficient and intuitive.

  • Easy to implement.

  • Often comes up with good, if not best, solutions.

  • Can assign new cases to clusters.

Weaknesses

  • Applicable only when a mean is defined. What about categorical data?

  • Need to specify k in advance.

  • Sensitive to outliers.

  • Works well only for clusters with convex shapes.

  • Sensitive to the starting points; may not find the optimal solution.

K-means considerations

  • The solution depends on the starting points.

  • The within-group SS (Within SS) can get stuck in a local minimum.
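
A small hand-made example shows how this can happen. The sketch below takes four points at the corners of a wide rectangle and compares two partitions into k = 2 groups. Both are stable under the k-means steps (no point is closer to the other cluster's centroid), yet their Within SS differ greatly, so which one you end up with depends on where the algorithm starts. The data and the function names are assumptions made for illustration.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroids(points, labels):
    """Mean of the points in each cluster, keyed by cluster label."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: tuple(sum(col) / len(m) for col in zip(*m))
            for lab, m in groups.items()}

def within_ss(points, labels):
    """Sum of squared distances from each point to its own centroid."""
    cent = centroids(points, labels)
    return sum(sq_dist(p, cent[lab]) for p, lab in zip(points, labels))

def is_fixed_point(points, labels):
    """True if no point is strictly closer to another cluster's centroid
    than to its own, i.e. a further k-means pass would change nothing."""
    cent = centroids(points, labels)
    return all(
        sq_dist(p, cent[lab]) <= min(sq_dist(p, c) for c in cent.values())
        for p, lab in zip(points, labels))

# Four corners of a 4 x 1 rectangle (hypothetical data)
points = [(0, 0), (0, 1), (4, 0), (4, 1)]
left_right = [0, 0, 1, 1]   # split into left pair and right pair
top_bottom = [0, 1, 0, 1]   # split into bottom pair and top pair

for name, labels in [("left/right", left_right), ("top/bottom", top_bottom)]:
    print(name, is_fixed_point(points, labels), within_ss(points, labels))
```

A common remedy, not shown here, is to run the algorithm from several different starting points and keep the solution with the smallest Within SS.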

How many groups?

For k-means, you need to choose the number of groups/clusters a priori.

  • Run k-means for k = 2, 3, 4, 5, etc. groups and calculate the Within-group SS, which we want to minimize, for each solution. Within SS generally keeps decreasing as k increases, so look for the k beyond which the improvement levels off rather than for the smallest value.

  • The number of seeds, k, typically translates to the final number of clusters obtained. The choice of k can be made using a variety of methods:

    • Subject-matter knowledge: you know k by context.

    • Convenience or constraints, e.g. it is convenient to market to three or four groups.

    • Based on the data (combined with hierarchical clustering).
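
For reference, the Within-group SS of a k-cluster solution can be written out explicitly. The formula below is a sketch in the notation of the refresher slide, where x_i is the i-th row of X and C_i is the cluster assigned to object i. The symbols W(k), \bar{x}_j (the centroid of cluster j) and n_j (the number of objects in cluster j) are introduced here for illustration only.

\[ W(k) = \sum_{j=1}^{k} \; \sum_{i \,:\, C_i = j} \lVert x_i - \bar{x}_j \rVert^2, \qquad \bar{x}_j = \frac{1}{n_j} \sum_{i \,:\, C_i = j} x_i \]

Plotting W(k) against k is one common way to compare the solutions side by side.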

Clusters validation

  • Validating a clustering means that the groupings have survived an independent test.

  • Cluster analysis is good for generating hypotheses about possible groupings of samples.

Non-statistical validation

  • Plot the points on an ordination and see whether the clusters are clearly separated on the plot.

  • Compare the results of several (reasonable) clustering algorithms. If you get the same answer, the underlying clusters are probably robust. Be aware that these are not formal tests.

Clusters validation

Statistical validation

Internal validation

Uses the data set itself to generate a test of the cluster results.

  • This can be done by re-sampling subsets of the data (jackknifing or bootstrapping).

  • Simulate data with the same parameters as your original variables under a certain model (Monte Carlo).

  • Analyse part of the data (training), and then use the other part to test the cluster results.

External validation

Use information from an independent source.

  • Use a new data set to test cluster groupings previously obtained.

  • Compare cluster groupings with a previously defined hypothesis.
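
One way to quantify such a comparison is to count how often two groupings agree on whether a pair of objects belongs together. The sketch below computes the Rand index, which is one possible agreement measure and is not named in the original slides. The function name `rand_index` and the example labels are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Proportion of object pairs on which two groupings agree:
    both put the pair in the same group, or both separate it."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both groupings must label the same objects")
    if len(labels_a) < 2:
        raise ValueError("need at least two objects to form a pair")
    agree = total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)
        total += 1
    return agree / total

# Cluster labels from an analysis vs. groups from an external hypothesis
cluster_labels = [0, 0, 1, 1]
hypothesis_same = [1, 1, 0, 0]   # same grouping, different label names
hypothesis_diff = [0, 1, 0, 1]   # a conflicting grouping

print(rand_index(cluster_labels, hypothesis_same))
print(rand_index(cluster_labels, hypothesis_diff))
```

The index depends only on which objects are grouped together, not on how the groups are labelled. It is not corrected for chance agreement, so it is a descriptive measure rather than a formal test.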

Clusters validation

Important

Do NOT do analysis of variance (or any other such statistical test) on groups you have obtained from a cluster analysis on the same dataset. If you do, you are bound to get significant results, and that is not a “Valid” form of “Validation”.

Example

Example Dendrogram

Example PCoA (Formal MDS)

Example profiling