161762 Multivariate Analysis for Big Data

Lecture 1: Introduction and the Nature of Multivariate Data

Nick Knowlton

Massey University

Fall 2026

Health, safety, and wellbeing

  • While studying, you may work in new environments with unfamiliar equipment and materials.
  • Hazards are anything with potential to cause harm.
  • Know the safety rules and required PPE for your learning environment.
  • Ask before starting an activity if hazards are unclear.

Hazards and risks

  • Recognize hazards early.
  • Follow area-specific safety instructions.
  • Use appropriate protective equipment when required.
  • Ask questions before you start.

When things go wrong

  • Report incidents as soon as possible, including near misses.
  • Aim: learn what happened and prevent recurrence (no-blame approach).
  • Notify the person in charge of the area for reporting and next steps.
  • Know where first aid kits and trained first aiders are located.

Evacuation

When you hear a fire alarm:

  • Leave via the nearest safe exit.
  • Walk quickly and safely.
  • Take keys, phone, and wallet.
  • Leave bags and drinks behind to prevent falls and delays.
  • Assemble at the emergency assembly area.
  • Follow building warden instructions.
  • Do not re-enter until the all-clear is given.

Staying in touch in an emergency

Massey emergency communication channels:

  1. Massey home page
  2. Massey social channels
  3. PING app (install ahead of time)

Wellbeing support

If you feel anxious, stressed, or need help:

  • Campus Student Health and Counselling
  • Email: studenthealth@massey.ac.nz
  • Use Student Life Services resources
  • For academic help:
    • Contact your teaching team via the course Stream site
    • Use online and face-to-face learning support services

Course information

Assessment

  • Computer Practicals (10): 20%
  • Assignments (2): 2 × 15%
  • Data analysis project (Group + Individual): 50%

Assignment schedule and deadlines

  • Deadlines (Friday hand-ins):
    • Assignment 1 due: Fri 3 Apr 2026
    • Assignment 2 due: Fri 29 May 2026
    • Final project due: Fri 12 Jun 2026
  • Practical hand-in bundles (Friday hand-ins):
    • Practicals Set 1 due: Fri 10 Apr 2026
    • Practicals Set 2 due: Fri 1 May 2026
    • Practicals Set 3 due: Fri 29 May 2026

Course objectives

On successful completion, students will be able to:

  1. Explain properties of multivariate data and distinguish multivariate vs univariate analysis.
  2. Explain problems and opportunities that big data brings to standard analytical methods.
  3. Explain, interpret, and demonstrate:
    • Cluster analysis
    • Ordination techniques (PCA, factor analysis, MDS)
    • Latent variable analyses
    • Discrimination methods
  4. Discriminate among a priori groups using multivariate data and assess classification error.
  5. Write up results in a structured, statistically sound report using the techniques above.

Multivariate project

You will be required to:

  • Define a scientific question of interest
  • Collect a dataset (often online)
  • Prepare the dataset properly
  • Analyse using appropriate models
  • Work in a team to interpret results and write a report
  • Communicate results clearly to a non-technical audience
  • Write a 5 to 10 page report (include graphics)

Stream (course hub)

  • Primary place to find information and contact others
  • Datasets and key files will be placed on Stream

SAS

  • SAS is the primary software for this course.
  • SAS Studio is the web-based interface for SAS University Edition.
  • SAS code examples will be provided in lectures and labs.

Course structure

The course is structured around weekly lectures and labs covering the following topics:

  1. Intro to multivariate analysis
  2. PCA 1
  3. PCA 2
  4. PCA 3
  5. Factor Analysis
  6. MDS
  7. Correspondence analysis
  8. RDA / CCorA
  9. LDA & QDA
  10. Clustering 1
  11. Clustering 2
  12. PLS

Lecture 1: the nature of multivariate data

Table of numeric data is a matrix

A typical dataset is a matrix:

  • Rows: sample units (customers, patients, devices)
  • Columns: variables (products, biomarkers, measurements)
  • Optionally: a response label (e.g., churn, diagnosis)

Example:

Customer  Product 1  Product 2  Product 3  Churn
1         5          6          1          0
2         3          5          1          0
3         3          6          2          0
4         3          8          4          0
5         2          2          4          1

Types of data

  • Quantitative
    • Discrete or continuous
    • Binary (1, 0), for example presence or absence
    • Semi-quantitative (ordinal estimation)
  • Frequencies, proportions, or percentages
  • Qualitative (nominal), for example factors

Table of mixed data (concept)

Datasets can mix numeric predictors, binary indicators, continuous outcomes, and categorical group labels in the same analysis:

ID  Age  Income ($)  Online?  Churn  Segment  Rating
1   34   52 000      1        0      Retail   1
2   45   88 000      0        0      Horeca   5
3   29   41 000      1        1      Retail   3
4   51   115 000     1        0      Horeca   4
5   38   63 000      0        1      Retail   2

Different units and different scales

  • Variables may be measured on different scales and in different units.
  • Decide whether you want some variables to dominate the analysis.
  • Standardisation is often required before distance-based or covariance-based methods.

Example — the Advertising dataset (200 markets), all in $000 but ranges differ dramatically:

Variable         Min  Max    Mean   SD
TV spend         0.7  296.4  147.0  85.9
Radio spend      0.0  49.6   23.3   14.8
Newspaper spend  0.3  114.0  30.6   21.8
Sales            1.6  27.0   14.0   5.2

Without standardisation, TV spend would dominate any distance calculation.
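
To see the domination concretely, here is a plain-Python sketch (the course software is SAS; this is just to illustrate the arithmetic). It compares the raw and standardised Euclidean distance between the first two Advertising markets tabulated later in this lecture, using the SDs from the summary table above; variable names are mine.

```python
# Sketch: why standardisation matters for distance calculations.
# Two Advertising markets (TV, Radio, Newspaper spend in $000).
import math

a = {"TV": 230.1, "Radio": 37.8, "Newspaper": 69.2}
b = {"TV": 44.5,  "Radio": 39.3, "Newspaper": 45.1}

# Sample SDs from the summary table above.
sd = {"TV": 85.9, "Radio": 14.8, "Newspaper": 21.8}

raw = math.sqrt(sum((a[v] - b[v]) ** 2 for v in a))
std = math.sqrt(sum(((a[v] - b[v]) / sd[v]) ** 2 for v in a))

# Share of the squared raw distance contributed by TV alone.
tv_share = (a["TV"] - b["TV"]) ** 2 / raw ** 2
print(round(raw, 1), round(std, 2), round(tv_share, 3))
```

The TV gap accounts for well over 95% of the raw squared distance; after dividing each variable by its SD, all three contribute on a comparable scale.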

Response vs predictor

  • Response variable(s):
    • the target of the scientific hypothesis
  • Predictor variable(s):
    • variables, factors, or classifications that may explain variation in the response

Example — does ad spend predict sales?

TV ($000)  Radio ($000)  Newspaper ($000)  Sales ($000)
230.1      37.8          69.2              22.1
44.5       39.3          45.1              10.4
17.2       45.9          69.3              9.3
151.5      41.3          58.5              18.5

TV, Radio, Newspaper are predictors; Sales is the response.

Univariate analysis (single response)

Univariate analysis models one dependent variable at a time.

Examples:

  • means and standard deviations computed one variable at a time
  • correlation analysis examines pairwise associations
  • regression predicts one outcome from one or more predictors

For 200 Advertising markets, each column summarised separately:

Variable   Mean   SD
TV         147.0  85.9
Radio      23.3   14.8
Newspaper  30.6   21.8
Sales      14.0   5.2

Each column treated in isolation — joint patterns are missed.

Simple linear regression (SLR)

A univariate model with one quantitative predictor:

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]

Where:

  • \(y_i\) is the response
  • \(x_i\) is the predictor
  • \(\epsilon_i\) is the error term
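
The least-squares estimates have closed forms: \(\hat\beta_1 = S_{xy}/S_{xx}\) and \(\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}\). As a plain-Python sketch (the course itself uses SAS), here they are applied to the four Advertising rows tabulated earlier, with Sales regressed on TV; variable names are mine.

```python
# Sketch: closed-form simple linear regression of Sales on TV.
import statistics as st

tv    = [230.1, 44.5, 17.2, 151.5]   # predictor x
sales = [22.1, 10.4, 9.3, 18.5]      # response y

x_bar, y_bar = st.mean(tv), st.mean(sales)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(tv, sales))
sxx = sum((x - x_bar) ** 2 for x in tv)

b1 = sxy / sxx           # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate
print(round(b1, 4), round(b0, 2))
```

On these four rows the slope is positive, consistent with higher TV spend going with higher Sales.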

Multiple regression is still univariate

Even with many predictors, there is still one response function:

  • Fit a plane (or hyperplane) to predict one outcome from multiple regressors
  • Similar logic holds for n-way ANOVA, ANCOVA, and related models

Vector and matrix notation

Response vector with predictor matrix:

\[ \mathbf{y} = f(\mathbf{X}) + \boldsymbol{\epsilon} \]

  • \(\mathbf{y}\): \(n \times 1\) vector of responses (one outcome per observation)
  • \(\mathbf{X}\): \(n \times p\) matrix of predictors (rows = observations, columns = variables)
  • \(\boldsymbol{\epsilon}\): \(n \times 1\) vector of errors

If there are multiple response variables, collect them into a response matrix:

\[ \mathbf{Y} = f(\mathbf{X}) + \mathbf{E} \]

  • \(\mathbf{Y}\): \(n \times q\) matrix of responses (\(q\) outcome variables measured on the same \(n\) observations)
  • \(\mathbf{E}\): \(n \times q\) matrix of errors — one residual per observation per response variable

Multivariate data

  • Simultaneous response of many variables
  • Response can be quantified relative to:
    • other variables
    • classification criteria
    • experimental treatments

Why multivariate analysis?

  • Model and interpret outcomes for many variables simultaneously
  • Use association structure among variables
  • Compared to univariate methods, multivariate methods can:
    • save time and effort
    • reveal patterns that are hard to see one variable at a time

Mathematical foundations

Mean vector

Each variable has a mean. Collect means into a mean vector:

\[ \boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_p)^\top \]

Reminder: the sample mean for variable \(j\):

\[ \bar{y}_j = \frac{1}{n}\sum_{i=1}^{n} y_{ij} \]

For example, in an Advertising dataset (\(n = 200\), \(p = 4\) variables):

\[ \bar{\mathbf{y}} = (147.0,\; 23.3,\; 30.6,\; 14.0)^\top \]

(TV, Radio, Newspaper, Sales in $000). The centroid is the point in 4D space around which all 200 markets balance.
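
The mean vector is just the column-wise mean of the data matrix. A plain-Python sketch (the course software is SAS; this only illustrates the arithmetic) using the five-customer product table from earlier in the lecture:

```python
# Sketch: sample mean vector = column-wise means of the data matrix.
# Rows: five customers; columns: Product 1, 2, 3 purchases.
rows = [
    [5, 6, 1],
    [3, 5, 1],
    [3, 6, 2],
    [3, 8, 4],
    [2, 2, 4],
]
n = len(rows)
mean_vector = [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]
print(mean_vector)  # centroid of the 5 customers in 3D product space
```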

Variance captures spread, but not shape

  • Each variable has its own variance.
  • Variance measures spread parallel to the coordinate axes.
  • Correlated structure (the diagonal “tilt” of the data cloud) is not captured by the separate variances alone.

Advertising dataset — variances differ by a factor of roughly 275:

Variable   Variance  SD
TV         7 384.5   85.9
Radio      219.6     14.8
Newspaper  474.4     21.8
Sales      26.8      5.2

Knowing each variance separately tells us nothing about the direction the data cloud tilts.

Covariance captures shape

If \(x\) increases and \(y\) increases on average, the covariance is positive.

Covariance extends variance to capture linear association between variables.

Five Advertising observations — higher TV spend tends to go with higher Sales:

TV   Sales  TV − mean  Sales − mean  Product
230  22.1   +83        +8.1          +673
45   10.4   −102       −3.6          +367
17   9.3    −130       −4.7          +611
152  18.5   +5         +4.5          +23
181  12.9   +34        −1.1          −37

Most products are positive → Cov(TV, Sales) > 0. Full dataset: \(s_{\text{TV},\,\text{Sales}} = 399.7\).

Covariance matrix

For \(p\) variables, summarise dispersion and shape with:

\[ \boldsymbol{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p}\\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{pmatrix} \]

Properties:

  • square \(p \times p\)
  • symmetric (\(\sigma_{jk} = \sigma_{kj}\))
  • diagonal entries are variances (\(\sigma_{jj} = \sigma_j^2\))

Sample \(\mathbf{S}\) for TV, Radio, Sales (Advertising, \(n = 200\)):

\[ \mathbf{S} = \begin{pmatrix} 7384.5 & 45.3 & 399.7 \\ 45.3 & 219.6 & 53.8 \\ 399.7 & 53.8 & 26.8 \end{pmatrix} \]

TV has the largest variance and the strongest covariance with Sales.

Sample covariance (estimation)

Unbiased estimate of covariance between variables \(j\) and \(k\):

\[ s_{jk} = \frac{1}{n-1}\sum_{i=1}^{n} (y_{ij} - \bar{y}_j)(y_{ik} - \bar{y}_k) \]

Sample covariance matrix:

\[ \mathbf{S} = (s_{jk}) \]
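
As a plain-Python sketch of the estimator (the course uses SAS), here is \(s_{jk}\) computed for the five TV/Sales observations shown above. Note these five rows use their own sample means, so the result differs somewhat from the full-dataset value of 399.7.

```python
# Sketch: unbiased sample covariance for five TV/Sales observations.
tv    = [230, 45, 17, 152, 181]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

n = len(tv)
tv_bar = sum(tv) / n
sales_bar = sum(sales) / n

# 1/(n-1) times the sum of cross-products of deviations
s_tv_sales = sum((x - tv_bar) * (y - sales_bar)
                 for x, y in zip(tv, sales)) / (n - 1)
print(round(s_tv_sales, 1))  # positive, as expected
```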

PCA (concept)

  • Widely used for:
    • visualisation
    • variable reduction
  • Decomposes \(\mathbf{S}\) into orthogonal directions of maximum variance:

\[\mathbf{S} = \mathbf{V}\boldsymbol{\Lambda}\mathbf{V}^\top\]

  • The eigenvectors of \(\mathbf{S}\) define the principal directions of the point cloud.
  • The corresponding eigenvalues \(\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p\) give the variance explained along each direction.
  • PCA rotates axes to align with eigenvectors; the first axis explains the most variance.
  • In a PCA biplot, arrows are proportional to the loading vectors — the sample eigenvectors of \(\mathbf{S}\).
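
To make the decomposition concrete, here is a plain-Python sketch (not course SAS code) finding the eigenvalues of the 2×2 TV/Sales block of the sample covariance matrix \(\mathbf{S}\) given earlier, using the closed form for a symmetric 2×2 matrix.

```python
# Sketch: eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]]
# via the closed form m ± sqrt(d^2 + b^2).
import math

a, b, c = 7384.5, 399.7, 26.8   # TV/Sales block of S

m = (a + c) / 2                 # mean of the diagonal
d = (a - c) / 2                 # half-difference of the diagonal
r = math.sqrt(d * d + b * b)

lam1, lam2 = m + r, m - r       # lam1 >= lam2
pc1_share = lam1 / (lam1 + lam2)
print(round(lam1, 1), round(lam2, 1), round(pc1_share, 3))
```

PC1 carries essentially all the variance here because TV's variance dominates — exactly the standardisation issue flagged earlier.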

Factor analysis (concept)

  • Superficially resembles PCA but differs in goals and assumptions
  • Used when an underlying factor structure is presumed to exist but is not directly observed
  • Each variable’s variance is split into communality (shared with factors) and uniqueness (specific to that variable)
  • Factors are rotated (e.g., Varimax) to improve interpretability
  • Key model:

\[\boldsymbol{\Sigma} \approx \boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}\]

where \(\boldsymbol{\Lambda}\) is the loading matrix and \(\boldsymbol{\Psi}\) is diagonal uniqueness
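
A plain-Python sketch of the key model with entirely made-up numbers: a one-factor loading vector reconstructs a correlation matrix as \(\boldsymbol{\Lambda}\boldsymbol{\Lambda}^\top + \boldsymbol{\Psi}\), with uniquenesses chosen so the diagonal equals 1.

```python
# Sketch: one-factor reconstruction of a correlation matrix.
# Loadings are hypothetical, purely for illustration.
loadings = [0.9, 0.8, 0.7]            # Lambda (p x 1), made up
psi = [1 - l * l for l in loadings]   # uniqueness = 1 - communality

p = len(loadings)
sigma_hat = [[loadings[j] * loadings[k] + (psi[j] if j == k else 0.0)
              for k in range(p)] for j in range(p)]
# Off-diagonals are the implied correlations (e.g. 0.9 * 0.8 = 0.72);
# diagonals reconstruct 1 exactly by construction.
print(sigma_hat)
```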

Discriminant analysis and LDA (concept)

  • Goal: find a linear projection that maximally separates pre-defined groups
  • Fisher’s criterion maximises the ratio of between-group to within-group scatter:

\[ J(\mathbf{w}) = \frac{\mathbf{w}^\top \mathbf{S}_B \mathbf{w}}{\mathbf{w}^\top \mathbf{S}_W \mathbf{w}} \]

  • \(\mathbf{S}_B\) = between-group scatter matrix; \(\mathbf{S}_W\) = within-group scatter matrix
  • Solution: generalised eigenvector problem \(\mathbf{S}_B \mathbf{w} = \lambda \mathbf{S}_W \mathbf{w}\)
  • QDA relaxes the assumption of equal group covariance matrices
  • Used for: classification, group separation, assessing predictive accuracy
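
For the two-group case, Fisher's criterion is maximised by \(\mathbf{w} \propto \mathbf{S}_W^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)\). A plain-Python sketch (the course uses SAS) with made-up group means and pooled within-group scatter, inverting \(\mathbf{S}_W\) with the explicit 2×2 formula:

```python
# Sketch: Fisher's discriminant direction for two groups,
# w proportional to Sw^{-1} (mu1 - mu2). All numbers hypothetical.
mu1 = [2.0, 3.0]                 # group 1 mean
mu2 = [4.0, 1.0]                 # group 2 mean
sw = [[2.0, 0.5], [0.5, 1.0]]    # pooled within-group scatter

det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
sw_inv = [[ sw[1][1] / det, -sw[0][1] / det],
          [-sw[1][0] / det,  sw[0][0] / det]]

diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
w = [sw_inv[0][0] * diff[0] + sw_inv[0][1] * diff[1],
     sw_inv[1][0] * diff[0] + sw_inv[1][1] * diff[1]]
print(w)  # direction that best separates the two groups
```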

Cluster analysis and K-means (concept)

  • Goal: partition \(n\) observations into \(k\) groups without a pre-defined response
  • K-means minimises within-cluster sum of squares:

\[ \underset{C_1,\ldots,C_K}{\text{minimise}} \sum_{k=1}^{K} \sum_{i \in C_k} \|\mathbf{x}_i - \boldsymbol{\mu}_k\|^2 \]

  • The algorithm alternates assignment (each point → nearest centroid) and update (recompute centroids) until convergence
  • Choosing \(k\): scree / elbow plots, silhouette width, gap statistic
  • Hierarchical alternatives (Ward, complete, single linkage) do not require pre-specifying \(k\)
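
The assign/update loop can be sketched in plain Python (the course software is SAS) on a few made-up 2D points with \(K = 2\) and crude starting centroids:

```python
# Sketch: the K-means assignment/update loop, K = 2, made-up data.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]   # crude starting centroids

def sq_dist(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

for _ in range(10):                      # fixed iteration budget
    # assignment step: each point joins its nearest centroid
    labels = [min(range(2), key=lambda k: sq_dist(p, centroids[k]))
              for p in points]
    # update step: recompute each centroid as its cluster's mean
    for k in range(2):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            centroids[k] = (sum(p[0] for p in members) / len(members),
                            sum(p[1] for p in members) / len(members))
print(labels, centroids)
```

On this toy data the loop converges after one pass, recovering the two obvious groups.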

Partial least squares regression (concept)

  • Handles collinear predictors that defeat ordinary least squares
  • Simultaneously decomposes \(\mathbf{X}\) and \(\mathbf{Y}\) to find latent components that maximise:

\[ \text{Cov}(\mathbf{t},\; \mathbf{u})^2, \qquad \mathbf{t} = \mathbf{X}\mathbf{w}, \quad \mathbf{u} = \mathbf{Y}\mathbf{c} \]

  • Contrast with PCR (which ignores \(\mathbf{Y}\) when extracting components)
  • Number of components \(k\) chosen by minimising cross-validated RMSE
  • Widely used in chemometrics, genomics, and marketing mix modelling
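
For a single centered response, the first PLS weight vector is proportional to \(\mathbf{X}^\top\mathbf{y}\) — the direction in \(\mathbf{X}\) with maximal covariance with \(\mathbf{y}\) (PCR would instead take the top eigenvector of \(\mathbf{X}^\top\mathbf{X}\), ignoring \(\mathbf{y}\)). A plain-Python sketch with made-up centered data:

```python
# Sketch: first PLS1 weight vector w proportional to X'y, normalised.
# Data are hypothetical and already mean-centered.
import math

X = [[1.0, 0.2], [-1.0, -0.1], [0.5, 0.4], [-0.5, -0.5]]
y = [1.1, -0.9, 0.6, -0.8]

p = len(X[0])
w = [sum(X[i][j] * y[i] for i in range(len(X))) for j in range(p)]
norm = math.sqrt(sum(v * v for v in w))
w = [v / norm for v in w]

# scores t = Xw, the first latent component
t = [sum(X[i][j] * w[j] for j in range(p)) for i in range(len(X))]
print(w)
```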

Multivariate analyses are complex

  • Often reduce high-dimensional problems to fewer dimensions.
  • Can be challenging to communicate clearly.
  • Emphasis: simplify results for a naive audience without losing correctness.
  • Can help identify unusual patterns, outliers, and clusters that are not apparent in univariate analyses.

Outlier detection example

Spot the outlier when looking at variables jointly:

  • Individually, a variable may look unremarkable.
  • Jointly, the same observation can appear far from the rest of the data cloud.
  • Multivariate outliers are visible as extreme points in a PCA biplot or a Mahalanobis distance plot.

Five wholesale customers plotted in Milk × Grocery space:

Customer  Milk ($)  Grocery ($)  Note
A         3 659     7 541        typical
B         1 981     2 675        typical
C         7 058     9 198        typical
D         2 976     5 484        typical
E         6 800     3 117        possible outlier

Customer E is far from the cluster — only obvious when both variables are considered together.

Outlier detection example (cont.)

Pairwise Mahalanobis distances between the five customers:

   A     B     C     D     E
A  0.00  2.44  2.32  1.05  6.45
B  2.44  0.00  2.38  1.39  4.53
C  2.32  2.38  0.00  1.99  4.55
D  1.05  1.39  1.99  0.00  5.56
E  6.45  4.53  4.55  5.56  0.00
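
The Mahalanobis distance is \(D(\mathbf{a},\mathbf{b}) = \sqrt{(\mathbf{a}-\mathbf{b})^\top \mathbf{S}^{-1} (\mathbf{a}-\mathbf{b})}\). A plain-Python sketch (not course SAS code) in 2D using the explicit 2×2 inverse; the covariance matrix and points below are made up for illustration, not those behind the distance table above.

```python
# Sketch: Mahalanobis distance between two 2D points,
# with a hypothetical 2x2 covariance matrix.
import math

def mahalanobis(a, b, s):
    """sqrt((a-b)' S^{-1} (a-b)) for 2D points and 2x2 covariance s."""
    d1, d2 = a[0] - b[0], a[1] - b[1]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    q = (s[1][1] * d1 * d1 - 2 * s[0][1] * d1 * d2
         + s[0][0] * d2 * d2) / det
    return math.sqrt(q)

s = [[4.0, 2.0], [2.0, 9.0]]   # hypothetical covariance matrix
a, e = (1.0, 2.0), (5.0, 4.0)  # hypothetical points

print(mahalanobis(a, e, s))
```

Unlike Euclidean distance, this accounts for both the scale of each variable and their correlation — which is why customer E stands out jointly but not marginally.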

General properties of multivariate data

  • Each variable has its own characteristics (mean and variance).
  • Variables have relationships with each other (covariance).
  • Direction and magnitude of response may differ by variable.
  • Variables may be dependent, but observations are generally assumed independent.

Fin