161762 Multivariate Analysis for Big Data

Lecture 4: Visualizing Principal Components Analysis

Sergio Sclovich

Massey University

Fall 2026

Learning objectives

  • Explain the properties of eigenvectors.

  • Understand the difference between using the var-covar matrix vs the correlation matrix for PCA.

  • Project data onto the new space (making plots).

Recap from last class

Principal components analysis (PCA)

  • Is a dimension-reduction method that creates new variables, called principal components, from the eigenvectors of the covariance matrix.

  • It creates as many components as there are input variables.

X (n × p) → Covariance matrix Σ (p × p) → Eigenvectors / eigenvalues → Principal components

Eigenanalysis Procedure

\[ \mathbf{S}\mathbf{v}_i = \lambda_i \mathbf{v}_i \qquad \Longrightarrow \qquad \mathbf{S} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^\top \]

where

\[ \mathbf{S}_{p \times p} \qquad \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{bmatrix} \qquad \mathbf{V} = \begin{bmatrix} v_1 & v_2 & \cdots & v_p \end{bmatrix} \]
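The decomposition above can be checked numerically. A minimal sketch in NumPy, using a small synthetic data set (the data and variable names are illustrative, not from the lecture):

```python
# Eigenanalysis of a sample covariance matrix S: S = V Lambda V^T
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # n = 50 individuals, p = 3 variables

S = np.cov(X, rowvar=False)           # S is p x p
eigvals, V = np.linalg.eigh(S)        # eigh: for symmetric matrices

# eigh returns eigenvalues in ascending order; PCA convention is descending
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# Rebuild S from its eigendecomposition to verify S = V Lambda V^T
S_rebuilt = V @ np.diag(eigvals) @ V.T
print(np.allclose(S, S_rebuilt))      # True
```

Note that `np.linalg.eigh` sorts eigenvalues in ascending order, so the columns must be reordered to get PC1 first.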

What are the eigenvectors?

\[ \mathbf{V} = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \cdots & v_p \\ | & | & & | \end{bmatrix} = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1p} \\ v_{21} & v_{22} & \cdots & v_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ v_{p1} & v_{p2} & \cdots & v_{pp} \end{bmatrix} \] Each element is called a loading and represents the influence of each variable on the corresponding eigenvector (Principal Component, PC).

The scores of the PCs are linear combinations of the original variables

\[ \text{Original values for the first individual:} \quad \mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1p} \end{bmatrix} \qquad \text{Loadings for the first PC:} \quad \mathbf{v}_1 = \begin{bmatrix} v_{11} \\ v_{21} \\ \vdots \\ v_{p1} \end{bmatrix} \]

Score for the first individual on the first PC:

\[ z_{11} = \mathbf{x}_1^{\top} \mathbf{v}_1 = x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{1p}v_{p1} \]

We can build a score for each individual on every PC, i.e. on every dimension of the new space.

\[ z_{11} = x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{1p}v_{p1} \\ z_{12} = x_{11}v_{12} + x_{12}v_{22} + \cdots + x_{1p}v_{p2} \]
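In matrix form, all scores are computed at once as Z = XV. A sketch with synthetic, centred data (names illustrative):

```python
# PC scores as linear combinations: z_ij = x_i^T v_j, or Z = X V in matrix form
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))
Xc = X - X.mean(axis=0)                    # centre each variable first

S = np.cov(Xc, rowvar=False)
eigvals, V = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]          # descending eigenvalues
V = V[:, order]

Z = Xc @ V                                 # all scores at once: Z is n x p
z11 = Xc[0] @ V[:, 0]                      # score of individual 1 on PC1
print(np.isclose(z11, Z[0, 0]))            # True
```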

Biplots

  • Because the PCs are a linear combination of the original variables, we can project the original variables into the PCA space to get a visual idea of their relative importance.

  • Distances among objects in the plot are approximations of their Euclidean distances in multidimensional space.

  • PCA is an unsupervised method. It does not presuppose any grouping.
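A minimal biplot sketch: individuals plotted as points (scores) and variables as arrows (loadings), using NumPy and matplotlib on synthetic data. The arrow scaling factor and file name are arbitrary choices for illustration:

```python
# Biplot sketch: scores as points, loadings as arrows in PC1-PC2 space
import numpy as np
import matplotlib
matplotlib.use("Agg")                     # draw without a display
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
Xc = X - X.mean(axis=0)

eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
Z = Xc @ V                                # scores of the 30 individuals

fig, ax = plt.subplots()
ax.scatter(Z[:, 0], Z[:, 1], s=10)        # individuals
for j in range(V.shape[0]):               # one arrow per original variable
    ax.arrow(0, 0, 3 * V[j, 0], 3 * V[j, 1], head_width=0.1)
    ax.annotate(f"x{j + 1}", (3 * V[j, 0], 3 * V[j, 1]))
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
fig.savefig("biplot.png")
```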


Scale matters

Customer   Product 1   Product 2   Product 3   Churn
1          5           6000        1           0
2          3           5000        1           0
3          3           6000        2           0
4          3           8000        4           0

\[ z_{11} = x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{ip}v_{p1} \]

  • PCA is sensitive to scale.
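A sketch of why this matters, using the toy churn table above: Product 2 is recorded in thousands while Products 1 and 3 are single digits, so a covariance-based PCA is dominated by Product 2:

```python
# Scale sensitivity: the large-scale variable dominates covariance-based PCA
import numpy as np

X = np.array([[5, 6000, 1],
              [3, 5000, 1],
              [3, 6000, 2],
              [3, 8000, 4]], dtype=float)  # Products 1, 2, 3 for 4 customers

S = np.cov(X, rowvar=False)
eigvals, V = np.linalg.eigh(S)
v1 = V[:, np.argmax(eigvals)]             # loading vector of PC1
print(np.round(np.abs(v1), 3))            # Product 2 dominates: close to [0, 1, 0]
```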

What can you do when there are different scales?

flowchart LR
    A[Raw data] --> B[Standardized data] --> C[Covariance matrix] --> D([PCA])
    E[Raw data] --> F[Correlation matrix] -. will give the same result as .-> D
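The two paths in the flowchart can be verified numerically: the correlation matrix of the raw data equals the covariance matrix of the standardized data. A sketch with synthetic mixed-scale data:

```python
# Correlation of raw data == covariance of standardized data
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3)) * np.array([1.0, 100.0, 0.01])  # mixed scales

R = np.corrcoef(X, rowvar=False)                   # correlation of raw data
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize (ddof matches np.cov)
S_std = np.cov(Xs, rowvar=False)                   # covariance of standardized data

print(np.allclose(R, S_std))                       # True
```

Note `ddof=1` in `np.std` to match the divisor `np.cov` uses by default; with mismatched divisors the equality is only approximate.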

With correlation

  • Distances among objects are independent of measurement units.
  • All variables contribute equally, since each is expressed in s.d. units (variance 1).
  • Do PCA on the correlation matrix if variables are not dimensionally homogeneous or are on different scales.

With Var-Covar

  • Distances among objects depend on measurement units.
  • Variables contribute according to their scale.
  • Can do PCA on the covariance matrix if variables are dimensionally homogeneous and on similar scales.

What is the global influence of each variable?

  • Loadings reflect the individual influence of each variable.

  • They can be represented in the same space as the individuals.

  • Loadings are directly comparable and interpretable.


PCA does not have formal assumptions, but there are some considerations.

  • Quantitative data (covariances have no meaning for qualitative data).

  • The number of variables (𝑝) should be less than the number of experimental units (𝑛); a general rule of thumb is at least 10 units per variable.

  • Avoid using PCA if there are lots of zeros in the data (e.g. sparse count data).

  • PCA works best to reduce dimensions when there is high correlation among variables.

  • Data are approx. normally distributed (reasonably symmetric with no gross outliers).

  • However, PCA is very useful for detecting multivariate outliers.
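One way PCA flags multivariate outliers: a point with an unusual combination of values scores extremely on the minor (smallest-variance) components. A sketch with two strongly correlated synthetic variables and one planted point whose values are in range individually but break the correlation:

```python
# Detecting a multivariate outlier via scores on the smallest PC
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)    # strongly correlated with x1
X = np.column_stack([x1, x2])
X[0] = [2.0, -2.0]                         # in-range values, odd combination

Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
eigvals, V = np.linalg.eigh(np.cov(Xs, rowvar=False))  # ascending order
z_last = Xs @ V[:, 0]                      # scores on the smallest-variance PC
flagged = int(np.argmax(np.abs(z_last)))
print(flagged)                             # 0: the planted outlier
```

Neither 2.0 nor -2.0 is extreme on its own; it is the combination, visible only on the minor component, that stands out.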

What is a multivariate outlier?

  • A univariate outlier is a data point that consists of an extreme value on one variable.

  • A multivariate outlier is an unusual combination of scores on two or more variables, even if no single value is extreme on its own.

  • Always check any outlier. Never remove them by default.