161762 Multivariate Analysis for Big Data

Lecture 1: Introduction and the Nature of Multivariate Data

Nick Knowlton

Massey University

Fall 2026

Health, safety, and wellbeing

  • While studying, you may work in new environments with unfamiliar equipment and materials.
  • Hazards are anything with potential to cause harm.
  • Know the safety rules and required PPE for your learning environment.
  • Ask before starting an activity if hazards are unclear.

Hazards and risks

  • Recognize hazards early.
  • Follow area-specific safety instructions.
  • Use appropriate protective equipment when required.
  • Ask questions before you start.

When things go wrong

  • Report incidents as soon as possible, including near misses.
  • Aim: learn what happened and prevent recurrence (no-blame approach).
  • Notify the person in charge of the area for reporting and next steps.
  • Know where first aid kits and trained first aiders are located.

Evacuation

When you hear a fire alarm:

  • Leave via the nearest safe exit.
  • Walk quickly and safely.
  • Take keys, phone, and wallet.
  • Leave bags and drinks behind to prevent falls and delays.
  • Assemble at the emergency assembly area.
  • Follow building warden instructions.
  • Do not re-enter until the all-clear is given.

Staying in touch in an emergency

Massey emergency communication channels:

  1. Massey home page
  2. Massey social channels
  3. PING app (install ahead of time)

Wellbeing support

If you feel anxious, stressed, or need help:

  • Campus Student Health and Counselling
  • Email: studenthealth@massey.ac.nz
  • Use Student Life Services resources
  • For academic help:
    • Contact your teaching team via the course Stream site
    • Use online and face-to-face learning support services

Course information

Assessment

  • Computer Practicals (10): 20%
  • Assignments (2): 2 × 15%
  • Data analysis project (Group + Individual): 50%

Assignment schedule and deadlines

  • Deadlines (Friday hand-ins):
    • Assignment 1 due: Fri 3 Apr 2026
    • Assignment 2 due: Fri 29 May 2026
    • Final project due: Fri 12 Jun 2026
  • Practical hand-in bundles (Friday hand-ins):
    • Practicals Set 1 due: Fri 10 Apr 2026
    • Practicals Set 2 due: Fri 1 May 2026
    • Practicals Set 3 due: Fri 29 May 2026

Course objectives

On successful completion, students will be able to:

  1. Explain properties of multivariate data and distinguish multivariate vs univariate analysis.
  2. Explain problems and opportunities that big data brings to standard analytical methods.
  3. Explain, interpret, and demonstrate:
    • Cluster analysis
    • Ordination techniques (PCA, factor analysis, MDS)
    • Latent variable analyses
    • Discrimination methods
  4. Discriminate among a priori groups using multivariate data and assess classification error.
  5. Write up results in a structured, statistically sound report using the techniques above.

Multivariate project

You will be required to:

  • Define a scientific question of interest
  • Collect a dataset (often online)
  • Prepare the dataset properly
  • Analyse using appropriate models
  • Work in a team to interpret results and write a report
  • Communicate results clearly to a non-technical audience
  • Write a 5 to 10 page report (include graphics)

Stream (course hub)

  • Primary place to find information and contact others
  • Datasets and key files will be placed on Stream

SAS

  • SAS is the primary software for this course.
  • SAS Studio is the web-based interface for SAS University Edition.
  • SAS code examples will be provided in lectures and labs.

Course structure

The course is structured around weekly lectures and labs covering the following topics:

  1. Intro to multivariate analysis
  2. PCA 1
  3. PCA 2
  4. PCA 3
  5. Factor Analysis
  6. MDS
  7. Correspondence analysis
  8. RDA / CCorA
  9. LDA & QDA
  10. Clustering 1
  11. Clustering 2
  12. PLS

Lecture 1: the nature of multivariate data

Table of numeric data is a matrix

A typical dataset is a matrix:

  • Rows: sample units (customers, patients, devices)
  • Columns: variables (products, biomarkers, measurements)
  • Optionally: a response label (e.g., churn, diagnosis)

Example:

Customer  Product 1  Product 2  Product 3  Churn
1         5          6          1          0
2         3          5          1          0
3         3          6          2          0
4         3          8          4          0
5         2          2          4          1

Types of data

  • Quantitative
    • Discrete or continuous
    • Binary (1, 0), for example presence or absence
    • Semi-quantitative (ordinal estimation)
  • Frequencies, proportions, or percentages
  • Qualitative (nominal), for example factors

Table of mixed data (concept)

Datasets can mix numeric predictors, binary indicators, continuous outcomes, and categorical group labels in the same analysis:

ID  Age  Income ($)  Online?  Churn  Segment  Rating
1   34   52 000      1        0      Retail   1
2   45   88 000      0        0      Horeca   5
3   29   41 000      1        1      Retail   3
4   51   115 000     1        0      Horeca   4
5   38   63 000      0        1      Retail   2

Different units and different scales

  • Variables may be measured on different scales and in different units.
  • Decide whether you want some variables to dominate the analysis.
  • Standardisation is often required before distance-based or covariance-based methods.

Example — the Advertising dataset (200 markets), all in $000 but ranges differ dramatically:

Variable         Min  Max    Mean   SD
TV spend         0.7  296.4  147.0  85.9
Radio spend      0.0  49.6   23.3   14.8
Newspaper spend  0.3  114.0  30.6   21.8
Sales            1.6  27.0   14.0   5.2

Without standardisation, TV spend would dominate any distance calculation.
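
To see the domination concretely, here is a plain-Python sketch (the course software is SAS; this is just to illustrate the arithmetic). It compares the raw and standardised Euclidean distance between the first two Advertising markets tabulated later in this lecture, using the SDs from the summary table above; variable names are mine.

```python
# Sketch: why standardisation matters for distance calculations.
# Two Advertising markets (TV, Radio, Newspaper spend in $000).
import math

a = {"TV": 230.1, "Radio": 37.8, "Newspaper": 69.2}
b = {"TV": 44.5,  "Radio": 39.3, "Newspaper": 45.1}

# Sample SDs from the summary table above.
sd = {"TV": 85.9, "Radio": 14.8, "Newspaper": 21.8}

raw = math.sqrt(sum((a[v] - b[v]) ** 2 for v in a))
std = math.sqrt(sum(((a[v] - b[v]) / sd[v]) ** 2 for v in a))

# Share of the squared raw distance contributed by TV alone.
tv_share = (a["TV"] - b["TV"]) ** 2 / raw ** 2
print(round(raw, 1), round(std, 2), round(tv_share, 3))
```

The TV gap accounts for well over 95% of the raw squared distance; after dividing each variable by its SD, all three contribute on a comparable scale.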

Response vs predictor

  • Response variable(s):
    • the target of the scientific hypothesis
  • Predictor variable(s):
    • variables, factors, or classifications that may explain variation in the response

Example — does ad spend predict sales?

TV ($000)  Radio ($000)  Newspaper ($000)  Sales ($000)
230.1      37.8          69.2              22.1
44.5       39.3          45.1              10.4
17.2       45.9          69.3              9.3
151.5      41.3          58.5              18.5

TV, Radio, Newspaper are predictors; Sales is the response.

Univariate analysis (single response)

Univariate analysis models one dependent variable at a time.

Examples:

  • means and standard deviations computed one variable at a time
  • correlation analysis examines pairwise associations
  • regression predicts one outcome from one or more predictors

For 200 Advertising markets, each column summarised separately:

Variable   Mean   SD
TV         147.0  85.9
Radio      23.3   14.8
Newspaper  30.6   21.8
Sales      14.0   5.2

Each column treated in isolation — joint patterns are missed.

Simple linear regression (SLR)

A univariate model with one quantitative predictor:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

Where:

  • \(y_i\) is the response
  • \(x_i\) is the predictor
  • \(\epsilon_i\) is the error term
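
The least-squares estimates have closed forms: \(\hat\beta_1 = S_{xy}/S_{xx}\) and \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\). As a plain-Python sketch (the course itself uses SAS), here they are applied to the four Advertising rows tabulated earlier, with Sales regressed on TV; variable names are mine.

```python
# Sketch: closed-form simple linear regression of Sales on TV.
import statistics as st

tv    = [230.1, 44.5, 17.2, 151.5]   # predictor x
sales = [22.1, 10.4, 9.3, 18.5]      # response y

x_bar, y_bar = st.mean(tv), st.mean(sales)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(tv, sales))
sxx = sum((x - x_bar) ** 2 for x in tv)

b1 = sxy / sxx           # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate
print(round(b1, 4), round(b0, 2))
```

On these four rows the slope is positive, consistent with higher TV spend going with higher Sales.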

Multiple regression is still univariate

Even with many predictors, there is still one response function:

  • Fit a plane (or hyperplane) to predict one outcome from multiple regressors
  • Similar logic holds for n-way ANOVA, ANCOVA, and related models

Vector and matrix notation

Response vector with predictor matrix:

\[ \mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon} \]

  • \(\mathbf{y}\): \(n \times 1\) vector of responses (one outcome per observation)
  • \(\mathbf{X}\): \(n \times p\) matrix of predictors (rows = observations, columns = variables)
  • \(\boldsymbol{\epsilon}\): \(n \times 1\) vector of errors

If there are multiple response variables, collect them into a response matrix:

\[ \mathbf{Y} = f(\mathbf{X}) + \mathbf{E} \]

  • \(\mathbf{Y}\): \(n \times q\) matrix of responses (\(q\) outcome variables measured on the same \(n\) observations)
  • \(\mathbf{E}\): \(n \times q\) matrix of errors — one residual per observation per response variable

Multivariate data

  • Simultaneous response of many variables
  • Response can be quantified relative to:
    • other variables
    • classification criteria
    • experimental treatments

Why multivariate analysis?

  • Model and interpret outcomes for many variables simultaneously
  • Use association structure among variables
  • Compared to univariate methods, multivariate methods can:
    • save time and effort
    • reveal patterns that are hard to see one variable at a time

Mathematical foundations

Mean vector

Each variable has a mean. Collect means into a mean vector:

\[ \boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_p)^\top \]

Reminder: the sample mean for variable \(j\):

\[ \bar{y}_j = \frac{1}{n}\sum_{i=1}^{n} y_{ij} \]

For example, in an Advertising dataset (\(n = 200\), \(p = 4\) variables):

\[ \bar{\mathbf{y}} = (147.0,\; 23.3,\; 30.6,\; 14.0)^\top \]

(TV, Radio, Newspaper, Sales in $000). The centroid is the point in 4D space around which all 200 markets balance.
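
The mean vector is just the column-wise mean of the data matrix. A plain-Python sketch (the course software is SAS; this only illustrates the arithmetic) using the five-customer product table from earlier in the lecture:

```python
# Sketch: sample mean vector = column-wise means of the data matrix.
# Rows: five customers; columns: Product 1, 2, 3 purchases.
rows = [
    [5, 6, 1],
    [3, 5, 1],
    [3, 6, 2],
    [3, 8, 4],
    [2, 2, 4],
]
n = len(rows)
mean_vector = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
print(mean_vector)  # centroid of the 5 customers in 3D product space
```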

Variance captures spread, but not shape

  • Each variable has its own variance.
  • Variance measures spread parallel to the coordinate axes.
  • Correlated structure (the diagonal “tilt” of the data cloud) is not captured by the separate variances alone.

Advertising dataset — variances differ by a factor of roughly 275:

Variable   Variance  SD
TV         7 384.5   85.9
Radio      219.6     14.8
Newspaper  474.4     21.8
Sales      26.8      5.2

Knowing each variance separately tells us nothing about the direction the data cloud tilts.

Covariance captures shape

If \(x\) increases and \(y\) increases on average, the covariance is positive.

Covariance extends variance to capture linear association between variables.

Five Advertising observations — higher TV spend tends to go with higher Sales:

TV   Sales  TV − mean  Sales − mean  Product
230  22.1   +83        +8.1          +673
45   10.4   −102       −3.6          +367
17   9.3    −130       −4.7          +611
152  18.5   +5         +4.5          +23
181  12.9   +34        −1.1          −37

Most products are positive → Cov(TV, Sales) > 0. Full dataset: \(s_{\text{TV},\,\text{Sales}} = 399.7\).

Covariance matrix

For \(p\) variables, summarise dispersion and shape with:

\[ \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p}\\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} \]

Properties:

  • square \(p \times p\)
  • symmetric (\(\sigma_{jk} = \sigma_{kj}\))
  • diagonal entries are variances (\(\sigma_{jj} = \sigma_j^2\))

Sample \(\mathbf{S}\) for TV, Radio, Sales (Advertising, \(n = 200\)):

\[ \mathbf{S} = \begin{pmatrix} 7384.5 & 45.3 & 399.7 \\ 45.3 & 219.6 & 53.8 \\ 399.7 & 53.8 & 26.8 \end{pmatrix} \]

TV has the largest variance and the strongest covariance with Sales.

Sample covariance (estimation)

Unbiased estimate of covariance between variables \(j\) and \(k\):

\[ s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k) \]

Sample covariance matrix:

\[ \mathbf{S} = (s_{jk}) \]
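
As a plain-Python sketch of the estimator (the course uses SAS), here is \(s_{jk}\) computed for the five TV/Sales observations shown above. Note these five rows use their own sample means, so the result differs somewhat from the full-dataset value of 399.7.

```python
# Sketch: unbiased sample covariance for five TV/Sales observations.
tv    = [230, 45, 17, 152, 181]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

n = len(tv)
tv_bar = sum(tv) / n
sales_bar = sum(sales) / n

# 1/(n-1) times the sum of cross-products of deviations
s_tv_sales = sum((x - tv_bar) * (y - sales_bar)
                 for x, y in zip(tv, sales)) / (n - 1)
print(round(s_tv_sales, 1))  # positive, as expected
```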

PCA (concept)

  • Widely used for:
    • visualisation
    • variable reduction
  • Decomposes \(\mathbf{S}\) into orthogonal directions of maximum variance:

\[\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top\]

  • The eigenvectors of \(\mathbf{S}\) define the principal directions of the point cloud.
  • The corresponding eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p\) give the variance explained along each direction.
  • PCA rotates axes to align with eigenvectors; the first axis explains the most variance.
  • In a PCA biplot, arrows are proportional to the loading vectors — the sample eigenvectors of \(\mathbf{S}\).
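
To make the decomposition concrete, here is a plain-Python sketch (not course SAS code) finding the eigenvalues of the 2×2 TV/Sales block of the sample covariance matrix \(\mathbf{S}\) given earlier, using the closed form for a symmetric 2×2 matrix.

```python
# Sketch: eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]]
# via the closed form m ± sqrt(d^2 + b^2).
import math

a, b, c = 7384.5, 399.7, 26.8   # TV/Sales block of S

m = (a + c) / 2                 # mean of the diagonal
d = (a - c) / 2                 # half-difference of the diagonal
r = math.sqrt(d * d + b * b)

lam1, lam2 = m + r, m - r       # lam1 >= lam2
pc1_share = lam1 / (lam1 + lam2)
print(round(lam1, 1), round(lam2, 1), round(pc1_share, 3))
```

PC1 carries essentially all the variance here because TV's variance dominates — exactly the standardisation issue flagged earlier.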

Factor analysis (concept)

  • Superficially resembles PCA but differs in goals and assumptions
  • Used when an underlying factor structure is presumed to exist but is not directly observed
  • Each variable’s variance is split into communality (shared with factors) and uniqueness (specific to that variable)
  • Factors are rotated (e.g., Varimax) to improve interpretability
  • Key model:

\[\boldsymbol{\Sigma} \approx \boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}\]

where \(\boldsymbol{\Lambda}\) is the loading matrix and \(\boldsymbol{\Psi}\) is diagonal uniqueness
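
A plain-Python sketch of the key model with entirely made-up numbers: a one-factor loading vector reconstructs a correlation matrix as \(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}\), with uniquenesses chosen so the diagonal equals 1.

```python
# Sketch: one-factor reconstruction of a correlation matrix.
# Loadings are hypothetical, purely for illustration.
loadings = [0.9, 0.8, 0.7]            # Lambda (p x 1), made up
psi = [1 - l * l for l in loadings]   # uniqueness = 1 - communality

p = len(loadings)
sigma_hat = [[loadings[j] * loadings[k] + (psi[j] if j == k else 0.0)
              for k in range(p)] for j in range(p)]
# Off-diagonals are the implied correlations (e.g. 0.9 * 0.8 = 0.72);
# diagonals reconstruct 1 exactly by construction.
print(sigma_hat)
```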

Discriminant analysis and LDA (concept)

  • Goal: find a linear projection that maximally separates pre-defined groups
  • Fisher’s criterion maximises the ratio of between-group to within-group scatter:

\[ J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}} \]

  • \(\mathbf{S}_B\) = between-group scatter matrix; \(\mathbf{S}_W\) = within-group scatter matrix
  • Solution: generalised eigenvector problem \(\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}\)
  • QDA relaxes the assumption of equal group covariance matrices
  • Used for: classification, group separation, assessing predictive accuracy
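
For the two-group case, Fisher's criterion is maximised by \(\mathbf{w} \propto \mathbf{S}_W^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)\). A plain-Python sketch (the course uses SAS) with made-up group means and pooled within-group scatter, inverting \(\mathbf{S}_W\) with the explicit 2×2 formula:

```python
# Sketch: Fisher's discriminant direction for two groups,
# w proportional to Sw^{-1} (mu1 - mu2). All numbers hypothetical.
mu1 = [2.0, 3.0]                 # group 1 mean
mu2 = [4.0, 1.0]                 # group 2 mean
sw = [[2.0, 0.5], [0.5, 1.0]]    # pooled within-group scatter

det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
sw_inv = [[ sw[1][1] / det, -sw[0][1] / det],
          [-sw[1][0] / det,  sw[0][0] / det]]

diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
w = [sw_inv[0][0] * diff[0] + sw_inv[0][1] * diff[1],
     sw_inv[1][0] * diff[0] + sw_inv[1][1] * diff[1]]
print(w)  # direction that best separates the two groups
```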

Cluster analysis and K-means (concept)

  • Goal: partition \(n\) observations into \(k\) groups without a pre-defined response
  • K-means minimises within-cluster sum of squares:

\[ \underset{C_1,\ldots,C_K}{\text{minimise}} \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \]

  • The algorithm alternates assignment (each point → nearest centroid) and update (recompute centroids) until convergence
  • Choosing \(k\): scree / elbow plots, silhouette width, gap statistic
  • Hierarchical alternatives (Ward, complete, single linkage) do not require pre-specifying \(k\)
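
The assign/update loop can be sketched in plain Python (the course software is SAS) on a few made-up 2D points with \(K = 2\) and crude starting centroids:

```python
# Sketch: the K-means assignment/update loop, K = 2, made-up data.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]   # crude starting centroids

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

for _ in range(10):                      # fixed iteration budget
    # assignment step: each point joins its nearest centroid
    labels = [min(range(2), key=lambda k: sq_dist(p, centroids[k]))
              for p in points]
    # update step: recompute each centroid as its cluster's mean
    for k in range(2):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            centroids[k] = (sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members))
print(labels, centroids)
```

On this toy data the loop converges after one pass, recovering the two obvious groups.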

Partial least squares regression (concept)

  • Handles collinear predictors that defeat ordinary least squares
  • Simultaneously decomposes \(\mathbf{X}\) and \(\mathbf{Y}\) to find latent components that maximise:

\[ \text{Cov}(\mathbf{t},\; \mathbf{u})^2, \qquad \mathbf{t} = \mathbf{X}\mathbf{w}, \quad \mathbf{u} = \mathbf{Y}\mathbf{c} \]

  • Contrast with PCR (which ignores \(\mathbf{Y}\) when extracting components)
  • Number of components \(k\) chosen by minimising cross-validated RMSE
  • Widely used in chemometrics, genomics, and marketing mix modelling
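
For a single centered response, the first PLS weight vector is proportional to \(\mathbf{X}^\top\mathbf{y}\) — the direction in \(\mathbf{X}\) with maximal covariance with \(\mathbf{y}\) (PCR would instead take the top eigenvector of \(\mathbf{X}^\top\mathbf{X}\), ignoring \(\mathbf{y}\)). A plain-Python sketch with made-up centered data:

```python
# Sketch: first PLS1 weight vector w proportional to X'y, normalised.
# Data are hypothetical and already mean-centered.
import math

X = [[1.0, 0.2], [-1.0, -0.1], [0.5, 0.4], [-0.5, -0.5]]
y = [1.1, -0.9, 0.6, -0.8]

p = len(X[0])
w = [sum(X[i][j] * y[i] for i in range(len(X))) for j in range(p)]
norm = math.sqrt(sum(v * v for v in w))
w = [v / norm for v in w]

# scores t = Xw, the first latent component
t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(len(X))]
print(w)
```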

Multivariate analyses are complex

  • Often reduce high-dimensional problems to fewer dimensions.
  • Can be challenging to communicate clearly.
  • Emphasis: simplify results for a naive audience without losing correctness.
  • Can help identify unusual patterns, outliers, and clusters that are not apparent in univariate analyses.

Outlier detection example

Spot the outlier when looking at variables jointly:

  • Individually, a variable may look unremarkable.
  • Jointly, the same observation can appear far from the rest of the data cloud.
  • Multivariate outliers are visible as extreme points in a PCA biplot or a Mahalanobis distance plot.

Five wholesale customers plotted in Milk × Grocery space:

Customer  Milk ($)  Grocery ($)  Note
A         3 659     7 541        typical
B         1 981     2 675        typical
C         7 058     9 198        typical
D         2 976     5 484        typical
E         6 800     3 117        possible outlier

Customer E is far from the cluster — only obvious when both variables are considered together.

Outlier detection example (cont.)

Pairwise Mahalanobis distances between the five customers:

   A     B     C     D     E
A  0.00  2.44  2.32  1.05  6.45
B  2.44  0.00  2.38  1.39  4.53
C  2.32  2.38  0.00  1.99  4.55
D  1.05  1.39  1.99  0.00  5.56
E  6.45  4.53  4.55  5.56  0.00
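
The Mahalanobis distance is \(D(\mathbf{a},\mathbf{b}) = \sqrt{(\mathbf{a}-\mathbf{b})^\top \mathbf{S}^{-1} (\mathbf{a}-\mathbf{b})}\). A plain-Python sketch (not course SAS code) in 2D using the explicit 2×2 inverse; the covariance matrix and points below are made up for illustration, not those behind the distance table above.

```python
# Sketch: Mahalanobis distance between two 2D points,
# with a hypothetical 2x2 covariance matrix.
import math

def mahalanobis(a, b, s):
    """sqrt((a-b)' S^{-1} (a-b)) for 2D points and 2x2 covariance s."""
    d1, d2 = a[0] - b[0], a[1] - b[1]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    q = (s[1][1] * d1 * d1 - 2 * s[0][1] * d1 * d2
         + s[0][0] * d2 * d2) / det
    return math.sqrt(q)

s = [[4.0, 2.0], [2.0, 9.0]]   # hypothetical covariance matrix
a, e = (1.0, 2.0), (5.0, 4.0)  # hypothetical points

print(mahalanobis(a, e, s))
```

Unlike Euclidean distance, this accounts for both the scale of each variable and their correlation — which is why customer E stands out jointly but not marginally.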

General properties of multivariate data

  • Each variable has its own characteristics (mean and variance).
  • Variables have relationships with each other (covariance).
  • Direction and magnitude of response may differ by variable.
  • Variables may be dependent, but observations are generally assumed independent.

Fin