Lecture 4: Visualizing Principal Components Analysis
Massey University
Fall 2026
Explain the properties of eigenvectors.
Understand the difference between using the variance-covariance matrix versus the correlation matrix for PCA.
Project data onto the new space (making plots).
Principal components analysis (PCA)
Is a dimension-reduction method that creates new variables, called principal components, as linear combinations of the original variables; the eigenvectors of the covariance matrix supply the weights.
It creates as many components as there are input variables.
X (n × p)
↓
Covariance matrix Σ (p × p)
↓
Eigenvectors / eigenvalues
↓
Principal components
\[ \mathbf{S}\mathbf{v}_i = \lambda_i \mathbf{v}_i \qquad \Longrightarrow \qquad \mathbf{S} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^\top \]
where
\[ \mathbf{S}_{p \times p} \qquad \boldsymbol{\Lambda} = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_p \end{bmatrix} \qquad \mathbf{V} = \begin{bmatrix} v_1 & v_2 & \cdots & v_p \end{bmatrix} \]
\[ \mathbf{V} = \begin{bmatrix} | & | & & | \\ v_1 & v_2 & \cdots & v_p \\ | & | & & | \end{bmatrix} = \begin{bmatrix} v_{11} & v_{12} & \cdots & v_{1p} \\ v_{21} & v_{22} & \cdots & v_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ v_{p1} & v_{p2} & \cdots & v_{pp} \end{bmatrix} \] Each element is called a loading and represents the influence of each variable on the corresponding eigenvector (Principal Component, PC).
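The eigendecomposition above can be checked numerically. A minimal numpy sketch on simulated data (the data themselves are arbitrary; only the algebra matters here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # n = 100 observations, p = 4 variables

S = np.cov(X, rowvar=False)            # sample covariance matrix S (p x p)
eigvals, V = np.linalg.eigh(S)         # eigh suits symmetric S; eigenvalues ascending

# Check S v_i = lambda_i v_i for each eigenpair
for lam, v in zip(eigvals, V.T):
    assert np.allclose(S @ v, lam * v)

# Check the full decomposition S = V Lambda V^T
Lam = np.diag(eigvals)
assert np.allclose(S, V @ Lam @ V.T)
```

Note that `np.linalg.eigh` returns eigenvalues in ascending order, so the first PC (largest eigenvalue) is the last column of `V`.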
\[ \text{Original values for the first individual} \quad \mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{12} \\ \vdots \\ x_{1p} \end{bmatrix} \\ \text{Loadings for the first PC} \quad \mathbf{v}_1 = \begin{bmatrix} v_{11} \\ v_{21} \\ \vdots \\ v_{p1} \end{bmatrix} \\ \\ z_{11} = \mathbf{x}_1^{\top} \mathbf{v}_1 \\ \\ \text{Score for the first individual on the first PC} \\ z_{11} = x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{1p}v_{p1} \]
We can compute a score for each individual on every PC, giving each individual a coordinate in every dimension of the new space.
\[ z_{11} = x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{1p}v_{p1} \\ z_{12} = x_{11}v_{12} + x_{12}v_{22} + \cdots + x_{1p}v_{p2} \]
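The score computation can be sketched in numpy. This is an illustrative example on simulated data; note that the data are mean-centred before projecting, as is standard in PCA:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))                 # n x p data matrix

Xc = X - X.mean(axis=0)                      # centre each variable first
S = np.cov(Xc, rowvar=False)
eigvals, V = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]            # sort PCs by decreasing variance
eigvals, V = eigvals[order], V[:, order]

Z = Xc @ V                                   # score matrix: row i = scores of individual i
z11 = Xc[0] @ V[:, 0]                        # score of the first individual on PC1
assert np.isclose(z11, Z[0, 0])
```

The single dot product `Xc[0] @ V[:, 0]` is exactly the sum \(x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{1p}v_{p1}\) written above.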
Because the PCs are linear combinations of the original variables, we can project the original variables into the PCA space to get a visual idea of their relative importance.
Distances among objects in the plot are approximations of their Euclidean distances in multidimensional space.
PCA is an unsupervised method: it does not presuppose any grouping of the observations.
| Customer | Product 1 | Product 2 | Product 3 | Churn |
|---|---|---|---|---|
| 1 | 5 | 6000 | 1 | 0 |
| 2 | 3 | 5000 | 1 | 0 |
| 3 | 3 | 6000 | 2 | 0 |
| 4 | 3 | 8000 | 4 | 0 |
\[ z_{11} = x_{11}v_{11} + x_{12}v_{21} + \cdots + x_{1p}v_{p1} \]
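Applying this to the customer table shows why scaling matters: Product 2 is measured in thousands, so on the raw covariance scale it swamps the other variables. A numpy sketch (Churn is excluded, since PCA is unsupervised; the values come from the table above):

```python
import numpy as np

# The customer table above (Churn excluded: PCA is unsupervised)
X = np.array([[5, 6000, 1],
              [3, 5000, 1],
              [3, 6000, 2],
              [3, 8000, 4]], dtype=float)

Xc = X - X.mean(axis=0)
eigvals, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
v1 = V[:, -1]                      # PC1 loadings (largest eigenvalue)

# Product 2's huge variance makes it dominate PC1 on the raw covariance scale
assert abs(v1[1]) > 0.99
```

Standardizing the variables first (or, equivalently, using the correlation matrix) removes this scale effect.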
```mermaid
flowchart LR
    A[Raw data] --> B[Standardized data] --> C[Covariance matrix] --> D([PCA])
    E[Raw data] --> F[Correlation matrix] -. will give the same result as .-> D
```
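The equivalence in the diagram is easy to verify: the covariance matrix of standardized data is the correlation matrix of the raw data, so both routes feed PCA the same input. A numpy check on simulated data with deliberately different variable scales:

```python
import numpy as np

rng = np.random.default_rng(3)
# Four variables with very different scales
X = rng.normal(size=(100, 4)) * np.array([1.0, 10.0, 100.0, 0.5])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable

S_std = np.cov(Z, rowvar=False)        # covariance of standardized data
R = np.corrcoef(X, rowvar=False)       # correlation of raw data

assert np.allclose(S_std, R)           # identical matrices -> identical PCA
```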
[Biplots comparing PCA with the correlation matrix vs. with the variance-covariance matrix]
Loadings reflect the individual influence of each variable.
They can be represented in the same space as the individuals.
Loadings are directly comparable and interpretable.
Quantitative data (covariances have no meaning for qualitative data).
The number of variables (p) should be less than the number of experimental units (n); a general rule of thumb is at least 10 observations per variable (1:10).
Avoid using PCA if there are lots of zeros in the data.
PCA works best to reduce dimensions when there is high correlation among variables.
Data are approx. normally distributed (reasonably symmetric with no gross outliers).
However, PCA is very useful for detecting multivariate outliers.
A univariate outlier is a data point that consists of an extreme value on one variable.
A multivariate outlier is a case with an unusual combination of scores on two or more variables, even if no single value is extreme on its own.
Always check any outlier. Never remove them by default.
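One common way to flag multivariate outliers is the Mahalanobis distance, which measures how far a point lies from the centre relative to the covariance structure (equivalently, how extreme its variance-scaled PC scores are). A minimal numpy sketch, on simulated data with a deliberately planted outlier:

```python
import numpy as np

rng = np.random.default_rng(4)
# Two strongly correlated variables
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.1 * rng.normal(size=200)])

# A multivariate outlier: each coordinate is unremarkable on its own,
# but the combination (high x1, low x2) breaks the correlation pattern
X = np.vstack([X, [2.0, -2.0]])

mu = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum('ij,jk,ik->i', X - mu, S_inv, X - mu)   # squared Mahalanobis distances

assert d2.argmax() == len(X) - 1       # the planted point stands out
```

Note that neither 2.0 nor -2.0 is a univariate outlier for a standard normal variable; only the combination is unusual.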
