Lecture 5: Distance methods
Massey University
Fall 2026
Define the concept of distance.
Understand the basics of a distance method.
Recognize different types of distances and assess the right use for each type.
We can quantify the distance between any pair of sampling units to build a Distance Matrix.
This is a useful starting point for some multivariate analyses.
With a variance-covariance matrix, I can capture the relationships among variables.
With a distance matrix, I can capture the relationships among individuals.
\[ \mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \Longrightarrow \mathbf{D} = \begin{bmatrix} 0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\ d(x_2, x_1) & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\ d(x_3, x_1) & d(x_3, x_2) & 0 & \cdots & d(x_3, x_n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d(x_n, x_1) & d(x_n, x_2) & d(x_n, x_3) & \cdots & 0 \end{bmatrix} \]
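As a minimal sketch of how a data matrix X becomes a distance matrix D, the Python below applies a distance function to every pair of rows. The helper name `distance_matrix` and the three-row data set are illustrative assumptions, and the standard-library `math.dist` (straight-line distance, Python 3.8+) is used only as a convenient default; any other distance function could be passed instead.

```python
import math

def distance_matrix(X, dist=math.dist):
    """Return the n x n matrix D with D[i][j] = dist(X[i], X[j])."""
    n = len(X)
    D = [[0.0] * n for _ in range(n)]  # the diagonal stays 0: d(x_i, x_i) = 0
    for i in range(n):
        for j in range(i + 1, n):
            # A distance is symmetric, so compute each pair once and mirror it.
            D[i][j] = D[j][i] = dist(X[i], X[j])
    return D

# Hypothetical data: n = 3 sampling units (rows) measured on p = 2 variables.
X = [[5, 4], [1, 1], [1, 4]]
D = distance_matrix(X)
for row in D:
    print(row)
```

Each row of D describes how far one sampling unit is from all the others, which is why D is n × n regardless of the number of variables p.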
The answer to this question largely depends on how you choose to operationalize (define) similarity.
Depending on the context it could have different names.
Different types of similarities/dissimilarities.
Euclidean
Manhattan
Mahalanobis
Bray-Curtis
Sokal and Sneath
Rogers and Tanimoto
Yule coefficient
Pearson’s Phi
Sorensen’s coefficient
Russell and Rao
Kulczynski
Ochiai
Faith distance
Euclidean distance is the standard straight-line distance from one point to another, “as the crow flies”.
\[ d_{ij} = \sqrt{ \sum_{k=1}^{p} \underbrace{(y_{ik} - y_{jk})^2}_{\text{difference on variable } k} } \\[4em] \begin{aligned} d_{ij} &: \text{distance between samples } i \text{ and } j \\ y_{ik} &: \text{value of variable } k \text{ for sample } i \\ y_{jk} &: \text{value of variable } k \text{ for sample } j \end{aligned} \]

\[c = \sqrt{a^2 + b^2}\]
\[d_{1,2} = \sqrt{(y_{1,1} - y_{2,1})^2 + (y_{1,2} - y_{2,2})^2}\]
\[d_{1,2} = \sqrt{(5-1)^2 + (4-1)^2} = 5\]
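The worked example can be checked with a few lines of Python. This is a sketch only: the function name `euclidean` is an illustrative choice, and `math.dist` from the standard library (Python 3.8+) is used as an independent cross-check.

```python
import math

def euclidean(y_i, y_j):
    """Square root of the summed squared differences over all p variables."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_i, y_j)))

# The two samples from the worked example: (5, 4) and (1, 1).
d_12 = euclidean([5, 4], [1, 1])
print(d_12)                       # 5.0
print(math.dist([5, 4], [1, 1]))  # 5.0, same result from the standard library

# The same function works unchanged for p > 2 variables.
print(euclidean([1, 2, 3], [4, 6, 3]))  # 5.0
```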
\[
d_{ij} = \sum_{k=1}^{p}
\underbrace{\left| y_{ik} - y_{jk} \right|}_{\text{absolute difference on variable } k}
\]
Manhattan distance is less sensitive to outliers than Euclidean distance.
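A short Python sketch of the sum of absolute differences, with a small made-up example of why squaring makes Euclidean distance react more strongly to one extreme variable. The function name and the vectors `u` and `v` are illustrative assumptions.

```python
def manhattan(y_i, y_j):
    """Sum of absolute differences over all p variables."""
    return sum(abs(a - b) for a, b in zip(y_i, y_j))

# Same two samples as the Euclidean example: |5 - 1| + |4 - 1|.
print(manhattan([5, 4], [1, 1]))  # 7

# Hypothetical pair of samples where the last variable holds an outlying difference.
u = [0, 0, 0, 0]
v = [1, 1, 1, 10]

# Share of the total contributed by the outlying variable.
abs_share = abs(u[3] - v[3]) / manhattan(u, v)                               # 10 / 13
sq_share = (u[3] - v[3]) ** 2 / sum((a - b) ** 2 for a, b in zip(u, v))      # 100 / 103
print(round(abs_share, 2))  # 0.77
print(round(sq_share, 2))   # 0.97
```

The outlying variable accounts for about 77% of the Manhattan total but about 97% of the squared total inside the Euclidean distance.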
Let’s say we have two customers, X and Y, and we have data on which products they bought.
| Customer | P1 | P2 | P3 | P4 |
|---|---|---|---|---|
| X | 1 | 1 | 0 | 0 |
| Y | 1 | 0 | 1 | 0 |
We will use the following notation.
Contingency table
| | Y = 1 | Y = 0 |
|---|---|---|
| X = 1 | a | b |
| X = 0 | c | d |
| Customer | P1 | P2 | P3 | P4 | P5 | P6 |
|---|---|---|---|---|---|---|
| X | 1 | 1 | 0 | 0 | 1 | 0 |
| Y | 1 | 0 | 1 | 0 | 1 | 1 |
Filling in the contingency table one cell at a time for these six products gives a = 2 (P1 and P5, bought by both customers), b = 1 (P2, bought by X only), c = 2 (P3 and P6, bought by Y only), and d = 1 (P4, bought by neither).
Contingency table
| | Y = 1 | Y = 0 |
|---|---|---|
| X = 1 | 2 | 1 |
| X = 0 | 2 | 1 |
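The four cell counts can be tallied directly from the two purchase vectors. A minimal Python sketch, assuming rows of the contingency table refer to customer X and columns to customer Y; the function name `contingency_counts` is illustrative.

```python
def contingency_counts(x, y):
    """Return (a, b, c, d) for two equal-length 0/1 vectors.

    a: both 1, b: x = 1 and y = 0, c: x = 0 and y = 1, d: both 0.
    """
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

# Purchases of products P1..P6 from the table above.
X = [1, 1, 0, 0, 1, 0]
Y = [1, 0, 1, 0, 1, 1]
counts = contingency_counts(X, Y)
print(counts)  # (2, 1, 2, 1)
```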
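### Euclidean distance
For binary data, each mismatch contributes \((1-0)^2 = 1\) to the sum of squared differences, so only the \(b + c\) mismatches count.
\[ d_{xy} = \sqrt{b+c} \]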
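The simple matching coefficient is a similarity: the proportion of variables on which the two customers agree, counting joint presences (a) and joint absences (d).
\[ S_{xy} = \dfrac{a+d}{a+b+c+d} \]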
Nominal — categorical data with no natural order (e.g. colour, gender, ethnicity).
Ordinal — categorical data with a meaningful order, but unequal intervals between values (e.g. low/medium/high, survey ratings).
We can use simple matching for these types of variables. The similarity is the proportion of the n variables on which x and y take the same value:
\[S_{xy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x_i = y_i)\]
or you can count the mismatches instead to obtain a dissimilarity:
\[d_{xy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x_i \neq y_i)\]
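A minimal Python sketch of simple matching on categorical values. The two records and their four variables are made up for illustration, and the function names are assumptions.

```python
def simple_matching_similarity(x, y):
    """Proportion of variables on which x and y take the same category."""
    return sum(1 for xi, yi in zip(x, y) if xi == yi) / len(x)

def simple_matching_dissimilarity(x, y):
    """Proportion of variables on which x and y differ."""
    return sum(1 for xi, yi in zip(x, y) if xi != yi) / len(x)

# Hypothetical records described by four categorical variables.
x = ["Retail", "Yes", "Low", "Red"]
y = ["Retail", "No", "Low", "Red"]

print(simple_matching_similarity(x, y))     # 0.75 (3 of 4 variables match)
print(simple_matching_dissimilarity(x, y))  # 0.25
```

The two values always sum to 1, so either one can be reported depending on whether a similarity or a dissimilarity is needed. Note that this treats ordinal categories as unordered, so "low" versus "high" counts the same as "low" versus "medium".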
What if data has both categorical & numeric variables and you wish to use them both?
| Customer | Age | Income ($k) | Segment | Has App | Satisfaction |
|---|---|---|---|---|---|
| A | 25 | 40 | Retail | Yes | Low |
| B | 45 | 80 | Corporate | No | Medium |
| C | 35 | 60 | Retail | Yes | High |
| D | 50 | 90 | Corporate | No | Medium |
What type of variable is each of these?
\[ d_{ij} = \frac{\sum_{k=1}^{p} w_k \, d_{ijk}}{\sum_{k=1}^{p} w_k } \qquad d_{ijk} = \begin{cases} \frac{|x_{ik} - x_{jk}|}{R_k}, & \text{numeric} \\ 0, & x_{ik} = x_{jk} \ (\text{categorical}) \\ 1, & x_{ik} \ne x_{jk} \ (\text{categorical}) \end{cases} \\[3em] \begin{aligned} d_{ij} &: \text{distance between observations } i,j \\ w_k &: \text{variable weight} \\ d_{ijk} &: \text{partial distance on variable } k \\ R_k &: \text{range of variable } k \end{aligned} \]
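A minimal Python sketch of this weighted average (commonly known as Gower's coefficient) applied to the customer table above. Each partial term is scored as a dissimilarity: 0 for matching categories, 1 for differing ones, and |x − y| / R_k for numeric variables, so one minus the result gives the corresponding similarity. Equal weights and treating Satisfaction as an unordered category are simplifying assumptions, and all names in the code are illustrative.

```python
# Columns: Age, Income ($k), Segment, Has App, Satisfaction.
customers = {
    "A": (25, 40, "Retail", "Yes", "Low"),
    "B": (45, 80, "Corporate", "No", "Medium"),
    "C": (35, 60, "Retail", "Yes", "High"),
    "D": (50, 90, "Corporate", "No", "Medium"),
}
is_numeric = (True, True, False, False, False)

# R_k: observed range of each numeric variable (None for categorical ones).
columns = list(zip(*customers.values()))
ranges = [max(col) - min(col) if num else None
          for col, num in zip(columns, is_numeric)]

def gower_distance(x, y, weights=None):
    """Weighted mean of partial distances; assumes every numeric range is > 0."""
    w = weights or [1.0] * len(x)
    total = 0.0
    for k, (xk, yk) in enumerate(zip(x, y)):
        if is_numeric[k]:
            d_k = abs(xk - yk) / ranges[k]   # scaled to [0, 1]
        else:
            d_k = 0.0 if xk == yk else 1.0   # match / mismatch
        total += w[k] * d_k
    return total / sum(w)

print(round(gower_distance(customers["A"], customers["B"]), 2))  # 0.92
print(round(gower_distance(customers["A"], customers["C"]), 2))  # 0.36
print(round(gower_distance(customers["B"], customers["D"]), 2))  # 0.08
```

Customers B and D come out closest because they share every categorical value and differ only slightly in age and income.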
Are two objects more similar because they both lack some particular characters?
\[ S_{ij}=\frac{a}{a+b+c} \]
| Customer | P1 | P2 | P3 | P4 | P5 | P6 |
|---|---|---|---|---|---|---|
| X | 1 | 1 | 0 | 0 | 1 | 0 |
| Y | 1 | 0 | 1 | 0 | 1 | 1 |
| | Y = 1 | Y = 0 |
|---|---|---|
| X = 1 | 2 | 1 |
| X = 0 | 2 | 1 |
\[ \text{Simple matching} = \frac{a+d}{a+b+c+d} = \frac{2+1}{2+1+2+1} = 0.5 \\[1em] \text{Jaccard} = \frac{a}{a+b+c} = \frac{2}{2+1+2} = 0.4 \]
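The two coefficients can be reproduced from the cell counts, which also makes the role of joint absences (d) easy to see. A minimal Python sketch; the function names and the extra scenario with four additional unbought products are illustrative assumptions.

```python
def simple_matching(a, b, c, d):
    """Agreements (joint presences and joint absences) over all variables."""
    return (a + d) / (a + b + c + d)

def jaccard(a, b, c):
    """Joint presences over variables present in at least one object; d is ignored."""
    return a / (a + b + c)

# Counts for customers X and Y: a = 2, b = 1, c = 2, d = 1.
print(simple_matching(2, 1, 2, 1))  # 0.5
print(jaccard(2, 1, 2))             # 0.4

# Hypothetical: four more products that neither customer bought (d goes from 1 to 5).
print(simple_matching(2, 1, 2, 5))  # 0.7
print(jaccard(2, 1, 2))             # 0.4, unchanged
```

Adding products that nobody bought raises the simple matching coefficient but leaves Jaccard untouched, which is the practical difference between the two when shared absences carry no information.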
