161762 Multivariate Analysis for Big Data

Lecture 5: Distance methods

Sergio Sclovich

Massey University

Fall 2026

Learning objectives

Define the concept of distance.
Understand the basics of a distance method.
Recognize different types of distances and assess the right use for each type.

What is a Distance?

We can quantify the distance between any pair of sampling units to build a Distance Matrix.
This is a useful starting point for some multivariate analyses.

What is a Distance?

With a Var-Covar Matrix I can capture the relationships among variables.
With a Distance Matrix I can capture the relationships among individuals.

\[ \mathbf{X}_{n \times p} = \begin{bmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{bmatrix} \Longrightarrow \mathbf{D} = \begin{bmatrix} 0 & d(x_1, x_2) & d(x_1, x_3) & \cdots & d(x_1, x_n) \\ d(x_2, x_1) & 0 & d(x_2, x_3) & \cdots & d(x_2, x_n) \\ d(x_3, x_1) & d(x_3, x_2) & 0 & \cdots & d(x_3, x_n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ d(x_n, x_1) & d(x_n, x_2) & d(x_n, x_3) & \cdots & 0 \end{bmatrix} \]

Another way of thinking of a distance is thinking about similarity (or disimilarity).

What is similarity?

The answer to this question largely depends on how you choose to operationalize (define) similarity.
Depending on the context it could have different names.
- Measures of association.
- coefficients or indices of resemblance.
- Similarity or dissimilarity measures.
- Distance measures.
Different types of similarities/dissimilarities.
Euclidean.
Manhattan
Mahalanobis
Bray-Curtis
Sokal and Sneath
Rogers and Tanimoto
Yule coefficient
Pearson’s Phi
Sorensen’s coefficient
Russell and Rao
Kulczynski
Ochiai
Faith distance

Euclidean Distance

Is a standard distance from one point to another “as the crow flies”.

\[ d_{ij} = \sqrt{ \sum_{k=1}^{p} \underbrace{(y_{ik} - y_{jk})^2}_{\text{difference on variable } k} } \\[4em] \begin{aligned} d_{ij} &: \text{distance between samples } i \text{ and } j \\ y_{ik} &: \text{value of variable } k \text{ for sample } i \\ y_{jk} &: \text{value of variable } k \text{ for sample } j \end{aligned} \]

Euclidean Distance

\[c = \sqrt{a^2 + b^2}\]

\[d_{1,2} = \sqrt{(y_{1,1} - y_{2,1})^2 + (y_{1,2} - y_{2,2})^2}\]

\[d_{1,2} = \sqrt{(5-1)^2 + (4-1)^2} = 5\]

In PCA the new orthogonal basis preserves the squared Euclidean distances between points as much as possible in the reduced-dimensional space.

Manhattan Distance (Taxicab)

\[ d_{ij} = \sum_{k=1}^{p} \underbrace{\left| y_{ik} - y_{jk} \right|}_{\text{absolute difference on variable } k} \quad \text{Less sensible to outliers than Euclidean } \]

We can measure similarity with qualitative variables

Binary variables

Let’s say we have to customers x and y. Also we have the data fo what products they bought.

Customer	P1	P2	P3	P4
X	1	1	0	0
Y	1	0	1	0

we will consider the following notations

a represents the count of times we have 1 for both cases.
b represents the count of times we have 0 for customer y and 1 for customer x.
c represents the count of times we have 1 for customer y and 0 for customer x.
d represents the count of times we have 0 for both customers.

Contingency table

	1	0
1	a	b
0	c	d

Binary variables

Customer	P1	P2	P3	P4	P5	P6
X	1	1	0	0	1	0
Y	1	0	1	0	1	1

Contingency table

	1	0
1	a	b
0	c	d

Binary variables

Customer	P1	P2	P3	P4	P5	P6
X	1	1	0	0	1	0
Y	1	0	1	0	1	1

Contingency table

	1	0
1	2	b
0	c	d

Binary variables

Customer	P1	P2	P3	P4	P5	P6
X	1	1	0	0	1	0
Y	1	0	1	0	1	1

Contingency table

	1	0
1	2	1
0	c	d

Binary variables

Customer	P1	P2	P3	P4	P5	P6
X	1	1	0	0	1	0
Y	1	0	1	0	1	1

Contingency table

	1	0
1	2	1
0	2	d

Binary variables

Customer	P1	P2	P3	P4	P5	P6
X	1	1	0	0	1	0
Y	1	0	1	0	1	1

Contingency table

	1	0
1	2	1
0	2	1

Now let’s measure some qualitative distance

###Euclidian distance

\[ d_{xy} = \sqrt{b^2+c^2} \]

Euclidean distance has min zero and no max.
It reflects dissimilarity.

Simple matching

\[ d_{xy} = \dfrac{a+d}{a+b+c+d} \]

Simple matching limits between 0 and 1.
In this case, dissimilarity is (1 – similarity)

We can expand now to other qualitative variables

Nominal — categorical data with no natural order (e.g. colour, gender, ethnicity).
Ordinal — categorical data with a meaningful order, but unequal intervals between values (e.g. low/medium/high, survey ratings).

We can use Simple matching for this type of variables

\[d_(xy) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x_i = y_)\]

or you can consider de opposite for dissimilarity

\[d_(xy) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}(x_i \neq y_)\]

Distance measures for mixed data

What if data has both categorical & numeric variables and you wish to use them both?

Customer	Age	Income ($k)	Segment	Has App	Satisfaction
A	25	40	Retail	Yes	Low
B	45	80	Corporate	No	Medium
C	35	60	Retail	Yes	High
D	50	90	Corporate	No	Medium

What type of variable are each of these?

Numeric
- Age
- Income
Categorical (nominal)
- Segment
Binary
- Has App
Ordinal
- Satisfaction (Low < Medium < High)

Gower’s distance

Gower Distance

\[ d_{ij} = \frac{\sum_{k=1}^{p} w_k \, d_{ijk}}{\sum_{k=1}^{p} w_k } \qquad d_{ijk} = \begin{cases} 1 - \frac{|x_{ik} - x_{jk}|}{R_k}, & \text{numeric} \\ 1, & x_{ik} = x_{jk} \ (\text{categorical}) \\ 0, & x_{ik} \ne x_{jk} \ (\text{categorical}) \end{cases} \\[3em] \begin{aligned} d_{ij} &: \text{distance between observations } i,j \\ w_k &: \text{variable weight} \\ d_{ijk} &: \text{partial distance} \\ R_k &: \text{range of variable } k \end{aligned} \]

Rescale numeric variables so their range is [0,1].
Recode nominal variables with >2 levels into multiple binary variables.
Rescale ordinal variables (Podani’s method, and others).
Calculate the similarity of each variable and average to get a total similarity.
Calculate D = 1 – G.

Jaccard similarity

Are two objects more similar because they both lack some particular characters?

If 1 and 0 are equally important is a symmetric binary variable.
If 1 is more important than 0 is an asymmetric Binary variable.

\[ S_{ij}=\frac{a}{a+b+c} \]

symetric vs asymetric binary variables

Customer	P1	P2	P3	P4	P5	P6
X	1	1	0	0	1	0
Y	1	0	1	0	1	1

	1	0
1	2	1
0	2	1

\[ \text{Simple mistmatch} = \frac{2+1}{2+1+2+1} = 0.5 \\[1em] \text{Jaccard} = \frac{2}{2+1+2} = 0.4 \]