Lecture 1: Introduction and the Nature of Multivariate Data
Massey University
Fall 2026
When you hear a fire alarm:
Massey emergency communication channels:
If you feel anxious, stressed, or need help:
On successful completion, students will be able to:
You will be required to:
Course is structured around weekly lectures and labs with the following topics:
A typical dataset is a matrix with one row per observation and one column per variable:
Example:
| Customer | Product 1 | Product 2 | Product 3 | Churn |
|---|---|---|---|---|
| 1 | 5 | 6 | 1 | 0 |
| 2 | 3 | 5 | 1 | 0 |
| 3 | 3 | 6 | 2 | 0 |
| 4 | 3 | 8 | 4 | 0 |
| 5 | 2 | 2 | 4 | 1 |
Datasets can mix numeric predictors, binary indicators, continuous outcomes, and categorical group labels in the same analysis:
| ID | Age | Income ($) | Online? | Churn | Segment | Rating |
|---|---|---|---|---|---|---|
| 1 | 34 | 52 000 | 1 | 0 | Retail | 1 |
| 2 | 45 | 88 000 | 0 | 0 | Horeca | 5 |
| 3 | 29 | 41 000 | 1 | 1 | Retail | 3 |
| 4 | 51 | 115 000 | 1 | 0 | Horeca | 4 |
| 5 | 38 | 63 000 | 0 | 1 | Retail | 2 |
Example: the Advertising dataset (200 markets). All variables are in $000, but their ranges differ dramatically:
| Variable | Min | Max | Mean | SD |
|---|---|---|---|---|
| TV spend | 0.7 | 296.4 | 147.0 | 85.9 |
| Radio spend | 0.0 | 49.6 | 23.3 | 14.8 |
| Newspaper spend | 0.3 | 114.0 | 30.6 | 21.8 |
| Sales | 1.6 | 27.0 | 14.0 | 5.2 |
Without standardisation, TV spend would dominate any distance calculation.
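To make the scale problem concrete, here is a minimal sketch that uses only the summary statistics in the table above. The helper name `standardise` is illustrative, not something defined in the lecture; it applies the usual z-score, \((x - \bar{x})/s\).

```python
import numpy as np

# Summary statistics copied from the table above (all in $000).
names = ["TV", "Radio", "Newspaper", "Sales"]
mins = np.array([0.7, 0.0, 0.3, 1.6])
maxs = np.array([296.4, 49.6, 114.0, 27.0])
means = np.array([147.0, 23.3, 30.6, 14.0])
sds = np.array([85.9, 14.8, 21.8, 5.2])


def standardise(x, mean, sd):
    """z-score: distance from the mean, measured in standard deviations."""
    return (x - mean) / sd


raw_range = maxs - mins
z_range = standardise(maxs, means, sds) - standardise(mins, means, sds)

for name, r, z in zip(names, raw_range, z_range):
    print(f"{name:<10} raw range = {r:6.1f}   standardised range = {z:4.2f}")
```

On the raw scale the TV range is more than ten times the Sales range; after dividing by the SD, the four ranges are within a factor of two of each other, so no single variable dominates a distance by scale alone.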
Example — does ad spend predict sales?
| TV ($000) | Radio ($000) | Newspaper ($000) | Sales ($000) |
|---|---|---|---|
| 230.1 | 37.8 | 69.2 | 22.1 |
| 44.5 | 39.3 | 45.1 | 10.4 |
| 17.2 | 45.9 | 69.3 | 9.3 |
| 151.5 | 41.3 | 58.5 | 18.5 |
TV, Radio, Newspaper are predictors; Sales is the response.
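As a small illustration of the predictor/response split, the sketch below fits a least-squares line of Sales on TV alone, using only the four rows shown above. This is an assumption-laden toy fit (four points, one predictor), not an analysis of the full 200-market dataset; the variable names are illustrative.

```python
import numpy as np

# The four rows shown above ($000).
tv = np.array([230.1, 44.5, 17.2, 151.5])    # predictor
sales = np.array([22.1, 10.4, 9.3, 18.5])    # response

# Design matrix with an intercept column: sales ≈ b0 + b1 * tv.
X = np.column_stack([np.ones_like(tv), tv])
coef, *_ = np.linalg.lstsq(X, sales, rcond=None)
b0, b1 = coef

fitted = X @ coef
residuals = sales - fitted
print(f"intercept = {b0:.3f}, slope = {b1:.4f}")
```

The slope is positive for these rows, and the residuals sum to zero, as they must for any least-squares fit that includes an intercept.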
Univariate analysis models one dependent variable at a time.
Examples:
For 200 Advertising markets, each column summarised separately:
| Variable | Mean | SD |
|---|---|---|
| TV | 147.0 | 85.9 |
| Radio | 23.3 | 14.8 |
| Newspaper | 30.6 | 21.8 |
| Sales | 14.0 | 5.2 |
Each column treated in isolation — joint patterns are missed.
A univariate model with one quantitative predictor:
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
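Where \(y_i\) is the response, \(x_i\) is the predictor, \(\beta_0\) is the intercept, \(\beta_1\) is the slope, and \(\epsilon_i\) is the random error for observation \(i\).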
Even with many predictors, there is still one response function:
Response vector with predictor matrix:
\[ \mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon} \]
If there are multiple response variables, collect them into a response matrix:
\[ \mathbf{Y} = f(\mathbf{X}) + \mathbf{E} \]
Each variable has a mean. Collect means into a mean vector:
\[ \boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_p)^\top \]
Reminder: the sample mean for variable \(j\), which is the \(j\)-th coordinate of the centroid:
\[ \bar{y}_j = \frac{1}{n}\sum_{i=1}^{n} y_{ij} \]
For example, in the Advertising dataset (\(n = 200\), \(p = 4\) variables):
\[ \bar{\mathbf{y}} = (147.0,\; 23.3,\; 30.6,\; 14.0)^\top \]
The entries are TV, Radio, Newspaper, and Sales, all in $000. The centroid is the point in 4D space around which all 200 markets balance.
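The mean-vector formula can be sketched in a few lines. The data here are only the four Advertising rows shown earlier in the lecture, so the result will not match the 200-market centroid; the point is the mechanics of averaging column by column.

```python
import numpy as np

# Rows = markets, columns = TV, Radio, Newspaper, Sales ($000).
Y = np.array([
    [230.1, 37.8, 69.2, 22.1],
    [ 44.5, 39.3, 45.1, 10.4],
    [ 17.2, 45.9, 69.3,  9.3],
    [151.5, 41.3, 58.5, 18.5],
])

n, p = Y.shape
centroid = Y.sum(axis=0) / n     # same result as Y.mean(axis=0)
centred = Y - centroid           # deviations of each market from the centroid
print(centroid)
```

The "balance" property is visible in `centred`: the deviations in every column sum to zero.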
Advertising dataset: the largest variance (TV) is about 275 times the smallest (Sales):
| Variable | Variance | SD |
|---|---|---|
| TV | 7 384.5 | 85.9 |
| Radio | 219.6 | 14.8 |
| Newspaper | 474.4 | 21.8 |
| Sales | 26.8 | 5.2 |
Knowing each variance separately tells us nothing about the direction the data cloud tilts.
Covariance extends variance to capture the linear association between two variables.
If \(y\) tends to increase as \(x\) increases, the covariance is positive; if \(y\) tends to decrease, it is negative.
Five Advertising observations — higher TV spend tends to go with higher Sales:
| TV | Sales | TV − mean | Sales − mean | product |
|---|---|---|---|---|
| 230 | 22.1 | +83 | +8.1 | +673 |
| 45 | 10.4 | −102 | −3.6 | +367 |
| 17 | 9.3 | −130 | −4.7 | +611 |
| 152 | 18.5 | +5 | +4.5 | +23 |
| 181 | 12.9 | +34 | −1.1 | −37 |
Most products are positive → Cov(TV, Sales) > 0. Full dataset: \(s_{\text{TV},\,\text{Sales}} = 399.7\).
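The product column can be reproduced directly. This sketch uses the rounded TV values from the table and the full-dataset means (147.0 and 14.0) quoted in the lecture, so it agrees with the table only up to rounding.

```python
import numpy as np

tv = np.array([230.0, 45.0, 17.0, 152.0, 181.0])
sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])

# Full-dataset means from the lecture, not the means of these five rows.
tv_mean, sales_mean = 147.0, 14.0

# One cross-product of deviations per observation.
products = (tv - tv_mean) * (sales - sales_mean)
n_positive = int((products > 0).sum())
print(np.round(products, 1), n_positive)
```

Four of the five products are positive, which is why the average of such products over the data (the covariance) comes out positive.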
For \(p\) variables, summarise dispersion and shape with the covariance matrix:
\[ \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p}\\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} \]
Properties:
Sample \(\mathbf{S}\) for TV, Radio, Sales (Advertising, \(n = 200\)):
\[ \mathbf{S} = \begin{pmatrix} 7384.5 & 45.3 & 399.7 \\ 45.3 & 219.6 & 53.8 \\ 399.7 & 53.8 & 26.8 \end{pmatrix} \]
TV has the largest variance and the strongest covariance with Sales.
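Two basic properties of a covariance matrix, symmetry and variances on the diagonal, can be checked against the matrix printed above. This is a minimal sketch that simply re-enters the printed numbers.

```python
import numpy as np

# Sample covariance matrix as printed above (order: TV, Radio, Sales).
S = np.array([
    [7384.5,  45.3, 399.7],
    [  45.3, 219.6,  53.8],
    [ 399.7,  53.8,  26.8],
])

is_symmetric = np.allclose(S, S.T)   # s_jk equals s_kj
sds = np.sqrt(np.diag(S))            # diagonal entries are the variances
print(is_symmetric, np.round(sds, 1))
```

The square roots of the diagonal reproduce the SD column of the earlier table (85.9, 14.8, 5.2) to one decimal place.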
Unbiased estimate of covariance between variables \(j\) and \(k\):
\[ s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k) \]
Sample covariance matrix:
\[ \mathbf{S} = (s_{jk}) \]
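The elementwise formula for \(s_{jk}\) is equivalent to centring the data matrix and forming a cross-product. The sketch below assumes a hypothetical helper name, `sample_covariance`, and uses the four Advertising rows shown earlier; it then compares the result with NumPy's built-in `np.cov`.

```python
import numpy as np


def sample_covariance(Y):
    """Unbiased sample covariance matrix: centred cross-products over n - 1."""
    n = Y.shape[0]
    centred = Y - Y.mean(axis=0)
    return centred.T @ centred / (n - 1)


# Rows = markets, columns = TV, Radio, Newspaper, Sales ($000).
Y = np.array([
    [230.1, 37.8, 69.2, 22.1],
    [ 44.5, 39.3, 45.1, 10.4],
    [ 17.2, 45.9, 69.3,  9.3],
    [151.5, 41.3, 58.5, 18.5],
])

S = sample_covariance(Y)
S_numpy = np.cov(Y, rowvar=False)   # rowvar=False: columns are variables
print(np.round(S, 1))
```

`np.cov` uses the \(n - 1\) divisor by default, so the two results should agree.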
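Eigendecomposition of the sample covariance matrix (the basis of principal component analysis), where the columns of \(\mathbf{V}\) are eigenvectors and \(\boldsymbol{\Lambda}\) is the diagonal matrix of eigenvalues:
\[\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top\]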
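A minimal numerical sketch of this decomposition, assuming \(\mathbf{S}\) is a sample covariance matrix computed from data. It uses the four Advertising rows shown earlier, so with only four observations the smallest eigenvalue is zero up to rounding.

```python
import numpy as np

# Rows = markets, columns = TV, Radio, Newspaper, Sales ($000).
Y = np.array([
    [230.1, 37.8, 69.2, 22.1],
    [ 44.5, 39.3, 45.1, 10.4],
    [ 17.2, 45.9, 69.3,  9.3],
    [151.5, 41.3, 58.5, 18.5],
])
S = np.cov(Y, rowvar=False)          # 4 x 4, divisor n - 1

# eigh is intended for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, V = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]    # put the largest eigenvalue first
eigvals, V = eigvals[order], V[:, order]

reconstructed = V @ np.diag(eigvals) @ V.T
print(np.round(eigvals, 2))
```

The eigenvectors are orthonormal, the product rebuilds \(\mathbf{S}\), and the eigenvalues add up to the total variance (the trace of \(\mathbf{S}\)).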
Factor analysis model for the covariance matrix:
\[\boldsymbol{\Sigma} \approx \boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}\]
where \(\boldsymbol{\Lambda}\) is the loading matrix and \(\boldsymbol{\Psi}\) is the diagonal matrix of uniquenesses (unique variances).
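Fisher's discriminant criterion, the ratio of between-group scatter \(\mathbf{S}_B\) to within-group scatter \(\mathbf{S}_W\) along direction \(\mathbf{w}\):
\[ J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}} \]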
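The ratio above can be evaluated for any candidate direction. The sketch below assumes \(\mathbf{S}_B\) and \(\mathbf{S}_W\) are the usual between-group and within-group scatter matrices; the function name `fisher_criterion` and the toy data are hypothetical, chosen so the answer can be checked by hand.

```python
import numpy as np


def fisher_criterion(w, X, labels):
    """J(w) = (w' S_B w) / (w' S_W w) for grouped data X."""
    overall_mean = X.mean(axis=0)
    p = X.shape[1]
    S_B = np.zeros((p, p))
    S_W = np.zeros((p, p))
    for g in np.unique(labels):
        Xg = X[labels == g]
        mean_g = Xg.mean(axis=0)
        d = (mean_g - overall_mean)[:, None]
        S_B += len(Xg) * (d @ d.T)               # between-group scatter
        S_W += (Xg - mean_g).T @ (Xg - mean_g)   # within-group scatter
    return float(w @ S_B @ w) / float(w @ S_W @ w)


# Hypothetical toy data: two unit squares, separated along the first axis only.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0],
              [4.0, 0.0], [5.0, 1.0], [4.0, 1.0], [5.0, 0.0]])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

j_first_axis = fisher_criterion(np.array([1.0, 0.0]), X, labels)
j_second_axis = fisher_criterion(np.array([0.0, 1.0]), X, labels)
print(j_first_axis, j_second_axis)
```

Here the groups differ only along the first axis, so that direction scores high and the second axis scores zero. Rescaling \(\mathbf{w}\) leaves \(J\) unchanged, because the scale cancels in the ratio.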
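The \(K\)-means objective: choose clusters \(C_1, \ldots, C_K\) to minimise the within-cluster sum of squared distances to the cluster means \(\boldsymbol{\mu}_k\):
\[ \underset{C_1,\ldots,C_K}{\text{minimise}} \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \]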
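To see what this objective measures, the sketch below evaluates it for two fixed assignments of the same points. It only computes the objective, not a clustering algorithm; the function name `kmeans_objective` and the toy data are hypothetical.

```python
import numpy as np


def kmeans_objective(X, labels):
    """Sum over clusters of squared distances from members to their cluster mean."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        mu_k = members.mean(axis=0)
        total += float(((members - mu_k) ** 2).sum())
    return total


# Hypothetical toy data: two tight pairs of points, far apart from each other.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
good = np.array([0, 0, 1, 1])   # groups neighbouring points together
bad = np.array([0, 1, 0, 1])    # pairs each point with a distant one

print(kmeans_objective(X, good), kmeans_objective(X, bad))
```

The assignment that keeps neighbours together gives a much smaller value, which is exactly what the minimisation rewards.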
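The squared covariance between the \(\mathbf{X}\)-scores \(\mathbf{t}\) and the \(\mathbf{Y}\)-scores \(\mathbf{u}\), the quantity that partial least squares maximises over the weight vectors \(\mathbf{w}\) and \(\mathbf{c}\):
\[ \text{Cov}(\mathbf{t},\; \mathbf{u})^2, \qquad \mathbf{t} = \mathbf{X}\mathbf{w}, \quad \mathbf{u} = \mathbf{Y}\mathbf{c} \]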
Spot the outlier when looking at variables jointly:
Five wholesale customers plotted in Milk × Grocery space:
| Customer | Milk ($) | Grocery ($) | Note |
|---|---|---|---|
| A | 3 659 | 7 541 | typical |
| B | 1 981 | 2 675 | typical |
| C | 7 058 | 9 198 | typical |
| D | 2 976 | 5 484 | typical |
| E | 6 800 | 3 117 | possible outlier |
Customer E is far from the cluster — only obvious when both variables are considered together.
Pairwise Mahalanobis distances between the five customers:
| | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0.00 | 2.44 | 2.32 | 1.05 | 6.45 |
| B | 2.44 | 0.00 | 2.38 | 1.39 | 4.53 |
| C | 2.32 | 2.38 | 0.00 | 1.99 | 4.55 |
| D | 1.05 | 1.39 | 1.99 | 0.00 | 5.56 |
| E | 6.45 | 4.53 | 4.55 | 5.56 | 0.00 |
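A minimal sketch of how such a distance matrix could be computed, using the five Milk and Grocery values listed earlier. The lecture does not say which covariance matrix was used for the table above; this sketch assumes one estimated from these five rows only, so its numbers need not match the table. The function name `mahalanobis_pairwise` is illustrative.

```python
import numpy as np

labels = ["A", "B", "C", "D", "E"]
# Columns: Milk ($), Grocery ($).
X = np.array([
    [3659.0, 7541.0],
    [1981.0, 2675.0],
    [7058.0, 9198.0],
    [2976.0, 5484.0],
    [6800.0, 3117.0],
])


def mahalanobis_pairwise(X, S):
    """D[i, j] = sqrt((x_i - x_j)' S^{-1} (x_i - x_j))."""
    S_inv = np.linalg.inv(S)
    diff = X[:, None, :] - X[None, :, :]                 # shape (n, n, p)
    d2 = np.einsum("ijk,kl,ijl->ij", diff, S_inv, diff)  # quadratic form per pair
    return np.sqrt(np.maximum(d2, 0.0))                  # clip tiny negative rounding


S = np.cov(X, rowvar=False)   # assumption: covariance from these five rows only
D = mahalanobis_pairwise(X, S)
print(np.round(D, 2))
```

Unlike Euclidean distance on raw dollars, this distance rescales by the variances and accounts for the correlation between Milk and Grocery, which is what lets a point that is unremarkable on each variable separately stand out jointly.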