161762 Multivariate Analysis for Big Data

Lecture 4: Visualizing Principal Components Analysis

Sergio Sclovich

Massey University

Fall 2026

Learning objectives

  • Explain the properties of eigenvectors.

  • Understand the difference between using the var-covar matrix vs the correlation matrix for PCA.

  • Project data onto the new space (making plots).

Recap from last class

Principal components analysis (PCA)

  • Is a dimension-reduction method that creates new variables, called principal components, from the eigenvectors of the covariance matrix.

  • It creates as many components as there are input variables.

X (n × p) → Covariance matrix Σ (p × p) → Eigenvectors / eigenvalues → Principal components

Eigenanalysis Procedure

\[ \mathbf{S}\mathbf{v}_i = \lambda_i \mathbf{v}_i \qquad \Longrightarrow \qquad \mathbf{S} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^\top \]

where

\[ \mathbf{S}_{p \times p} \qquad \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{bmatrix} \qquad \mathbf{V} = \begin{bmatrix} v_1 & v_2 & \cdots & v_p \end{bmatrix} \]
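The decomposition above can be checked numerically. A minimal sketch in NumPy, using a small synthetic data set (the data and variable names are illustrative, not from the lecture):

```python
# Eigenanalysis of a sample covariance matrix S: S = V Lambda V^T
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # n = 50 individuals, p = 3 variables

S = np.cov(X, rowvar=False)           # S is p x p
eigvals, V = np.linalg.eigh(S)        # eigh: for symmetric matrices

# eigh returns eigenvalues in ascending order; PCA convention is descending
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Rebuild S from its eigendecomposition to verify S = V Lambda V^T
S_rebuilt = V @ np.diag(eigvals) @ V.T
print(np.allclose(S, S_rebuilt))      # True
```

Note that `np.linalg.eigh` sorts eigenvalues in ascending order, so the columns must be reordered to get PC1 first.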

What are the eigenvectors?

\[ \mathbf{V} = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \cdots & v_p \\ | & | & & | \end{bmatrix} = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1p} \\ v_{21} & v_{22} & \cdots & v_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ v_{p1} & v_{p2} & \cdots & v_{pp} \end{bmatrix} \] Each element is called a loading and represents the influence of each variable on the corresponding eigenvector (Principal Component, PC).

The scores of the PCs are linear combinations of the original variables

\[ \text{Original values for the first individual:} \quad \mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1p} \end{bmatrix} \qquad \text{Loadings for the first PC:} \quad \mathbf{v}_1 = \begin{bmatrix} v_{11} \\ v_{21} \\ \vdots \\ v_{p1} \end{bmatrix} \]

Score for the first individual on the first PC:

\[ z_{11} = \mathbf{x}_1^{\top} \mathbf{v}_1 = x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{1p}v_{p1} \]

We can build a score for each individual on every PC, i.e. on every dimension of the new space.

\[ z_{11} = x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{1p}v_{p1} \\ z_{12} = x_{11}v_{12} + x_{12}v_{22} + \cdots + x_{1p}v_{p2} \]
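In matrix form, all scores are computed at once as Z = XV. A sketch with synthetic, centred data (names illustrative):

```python
# PC scores as linear combinations: z_ij = x_i^T v_j, or Z = X V in matrix form
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
Xc = X - X.mean(axis=0)                    # centre each variable first

S = np.cov(Xc, rowvar=False)
eigvals, V = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # descending eigenvalues
V = V[:, order]

Z = Xc @ V                                 # all scores at once: Z is n x p
z11 = Xc[0] @ V[:, 0]                      # score of individual 1 on PC1
print(np.isclose(z11, Z[0, 0]))            # True
```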

Biplots

  • Because the PCs are a linear combination of the original variables, we can project the original variables into the PCA space to get a visual idea of their relative importance.

  • Distances among objects in the plot are approximations of their Euclidean distances in multidimensional space.

  • PCA is an unsupervised method. It does not presuppose any grouping.
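A minimal biplot sketch: individuals plotted as points (scores) and variables as arrows (loadings), using NumPy and matplotlib on synthetic data. The arrow scaling factor and file name are arbitrary choices for illustration:

```python
# Biplot sketch: scores as points, loadings as arrows in PC1-PC2 space
import numpy as np
import matplotlib
matplotlib.use("Agg")                     # draw without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(axis=0)

eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
Z = Xc @ V                                # scores of the 30 individuals

fig, ax = plt.subplots()
ax.scatter(Z[:, 0], Z[:, 1], s=10)        # individuals
for j in range(V.shape[0]):               # one arrow per original variable
    ax.arrow(0, 0, 3 * V[j, 0], 3 * V[j, 1], head_width=0.1)
    ax.annotate(f"x{j + 1}", (3 * V[j, 0], 3 * V[j, 1]))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
fig.savefig("biplot.png")
```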


Scale matters

Customer   Product 1   Product 2   Product 3   Churn
1          5           6000        1           0
2          3           5000        1           0
3          3           6000        2           0
4          3           8000        4           0

\[ z_{11} = x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{ip}v_{p1} \]

  • PCA is sensitive to scale.
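A sketch of why this matters, using the toy churn table above: Product 2 is recorded in thousands while Products 1 and 3 are single digits, so a covariance-based PCA is dominated by Product 2:

```python
# Scale sensitivity: the large-scale variable dominates covariance-based PCA
import numpy as np

X = np.array([[5, 6000, 1],
              [3, 5000, 1],
              [3, 6000, 2],
              [3, 8000, 4]], dtype=float)  # Products 1, 2, 3 for 4 customers

S = np.cov(X, rowvar=False)
eigvals, V = np.linalg.eigh(S)
v1 = V[:, np.argmax(eigvals)]             # loading vector of PC1
print(np.round(np.abs(v1), 3))            # Product 2 dominates: close to [0, 1, 0]
```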

What can you do when there are different scales?

flowchart LR
    A[Raw data] --> B[Standardized data] --> C[Covariance matrix] --> D([PCA])
    E[Raw data] --> F[Correlation matrix] -. will give the same result as .-> D
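The two paths in the flowchart can be verified numerically: the correlation matrix of the raw data equals the covariance matrix of the standardized data. A sketch with synthetic mixed-scale data:

```python
# Correlation of raw data == covariance of standardized data
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3)) * np.array([1.0, 100.0, 0.01])  # mixed scales

R = np.corrcoef(X, rowvar=False)                   # correlation of raw data
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize (ddof matches np.cov)
S_std = np.cov(Xs, rowvar=False)                   # covariance of standardized data

print(np.allclose(R, S_std))                       # True
```

Note `ddof=1` in `np.std` to match the divisor `np.cov` uses by default; with mismatched divisors the equality is only approximate.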

With correlation

  • Distances among objects are independent of measurement units.
  • All variables contribute equally, since each is expressed in s.d. units (variance 1).
  • Do PCA on the correlation matrix if variables are not dimensionally homogeneous or are on different scales.

With Var-Covar

  • Distances among objects depend on measurement units.
  • Variables contribute according to their scale.
  • Can do PCA on the covariance matrix if variables are dimensionally homogeneous and on similar scales.

What is the global influence of each variable?

  • Loadings reflect the individual influence of each variable.

  • They can be represented in the same space as the individuals.

  • Loadings are directly comparable and interpretable.


PCA does not have formal assumptions, but there are some considerations.

  • Quantitative data (covariances have no meaning for qualitative data).

  • The number of variables (𝑝) should be less than the number of experimental units (𝑛); a general rule of thumb is at least 10 units per variable.

  • Avoid using PCA if there are lots of zeros in the data (e.g. sparse count data).

  • PCA works best to reduce dimensions when there is high correlation among variables.

  • Data are approx. normally distributed (reasonably symmetric with no gross outliers).

  • However, PCA is very useful for detecting multivariate outliers.
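One way PCA flags multivariate outliers: a point with an unusual combination of values scores extremely on the minor (smallest-variance) components. A sketch with two strongly correlated synthetic variables and one planted point whose values are in range individually but break the correlation:

```python
# Detecting a multivariate outlier via scores on the smallest PC
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)    # strongly correlated with x1
X = np.column_stack([x1, x2])
X[0] = [2.0, -2.0]                         # in-range values, odd combination

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, V = np.linalg.eigh(np.cov(Xs, rowvar=False))  # ascending order
z_last = Xs @ V[:, 0]                      # scores on the smallest-variance PC
flagged = int(np.argmax(np.abs(z_last)))
print(flagged)                             # 0: the planted outlier
```

Neither 2.0 nor -2.0 is extreme on its own; it is the combination, visible only on the minor component, that stands out.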

What is a multivariate outlier?

  • A univariate outlier is a data point that consists of an extreme value on one variable.

  • A multivariate outlier is an unusual combination of scores on two or more variables, even if no single value is extreme on its own.

  • Always check any outlier. Never remove them by default.