SVD and low-rank structure

evergreen#foundations#linear-algebra

Bar: visualize what SVD does to a unit circle, and relate it to PCA, embeddings, and matrix factorization without notes.

Intuition first

Every real matrix $A$ (any shape) factors into rotate → scale → rotate, where each "rotate" is an orthogonal map and may include a reflection:

A = U\Sigma V^{\top}

$V^{\top}$ rotates input space onto the "right singular" axes,
$\Sigma$ (diagonal, $\sigma_1\ge\sigma_2\ge\dots\ge 0$) stretches each axis,
$U$ rotates into output space along the "left singular" axes.

A unit circle becomes an ellipse whose semi-axis lengths are the singular values $\sigma_i$. The number of nonzero $\sigma_i$ is the rank. Small $\sigma_i$ = directions that barely matter → throw them away.
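A quick numerical check of the circle-to-ellipse picture, as a minimal sketch assuming NumPy. The matrix `A` is an arbitrary example, not from these notes: sample the unit circle, push it through `A`, and compare the longest and shortest radii of the image with the singular values.

```python
import numpy as np

# Arbitrary example matrix (an assumption for illustration); any real 2x2 works.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# s holds the singular values in descending order; rows of Vt are the v_i.
U, s, Vt = np.linalg.svd(A)

# Sample the unit circle densely and map every point through A.
theta = np.linspace(0.0, 2.0 * np.pi, 3601)
circle = np.stack([np.cos(theta), np.sin(theta)])  # shape (2, N)
ellipse = A @ circle

# The image is an ellipse: its extreme radii are the semi-axis lengths.
radii = np.linalg.norm(ellipse, axis=0)
longest, shortest = radii.max(), radii.min()

print("singular values:", s)
print("semi-axes from sampling:", longest, shortest)
```

Up to sampling error, `longest` and `shortest` should agree with `s[0]` and `s[1]`, and `A @ Vt[0]` should equal `s[0] * U[:, 0]`: the first right singular direction lands on the long axis of the ellipse.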

Low-rank approximation (the whole point)

Keep the top $k$ singular triplets:

A_k = \sum_{i=1}^{k}\sigma_i u_i v_i^{\top}

Eckart–Young: $A_k$ is the best possible rank-$k$ approximation in Frobenius (and spectral) norm, and the error is exactly the dropped energy:

\|A-A_k\|_F^2 = \sum_{i>k}\sigma_i^2

That is compression, denoising, and "find the latent structure" all at once.
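The truncation and the error identity can be checked numerically. A minimal sketch, assuming NumPy; `truncated_svd` is a hypothetical helper name and the random test matrix is made up for illustration.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k truncation A_k and all singular values of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k columns of U by sigma_i, then multiply by the first k rows of V^T.
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return A_k, s

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))  # made-up test matrix
k = 2

A_k, s = truncated_svd(A, k)

# Squared Frobenius error vs. the dropped energy sum_{i>k} sigma_i^2.
fro_err_sq = np.linalg.norm(A - A_k, "fro") ** 2
dropped_energy = np.sum(s[k:] ** 2)

# Spectral-norm error is the first dropped singular value, sigma_{k+1}.
spec_err = np.linalg.norm(A - A_k, 2)

print(fro_err_sq, dropped_energy)
print(spec_err, s[k])
```

The two printed pairs should match to floating-point precision. This verifies the error formula only. It does not prove optimality, which is the content of the theorem.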

Worked example (by hand)

A=\begin{bmatrix}1&1\\1&1\end{bmatrix}

$A$ is symmetric positive semidefinite, so SVD = eigen-decomposition (for a general symmetric matrix the singular values are $|\lambda_i|$). Eigenvalues: $2$ (eigenvector $\tfrac1{\sqrt2}(1,1)$) and $0$ (eigenvector $\tfrac1{\sqrt2}(1,-1)$). Hence

\sigma_1=2,\ \sigma_2=0,\qquad A = 2\,(\tfrac1{\sqrt2},\tfrac1{\sqrt2})^{\top}(\tfrac1{\sqrt2},\tfrac1{\sqrt2}).

$A$ is rank 1: one direction $(1,1)$ explains everything; the orthogonal direction has zero stretch. The best rank-1 approximation of $A$ is $A$ itself, with error $\sigma_2^2 = 0$.
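The hand computation can be confirmed with a minimal sketch, assuming NumPy. The signs of the singular vectors are not unique, so the check compares absolute values.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0]])

U, s, Vt = np.linalg.svd(A)

# Rank-1 reconstruction from the top singular triplet only.
A_1 = s[0] * np.outer(U[:, 0], Vt[0])

print("singular values:", s)        # expected (2, 0) up to rounding
print("top right vector:", Vt[0])   # +/- (1, 1) / sqrt(2)
print("rank-1 error:", np.linalg.norm(A - A_1, "fro"))
```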

How it connects to the rest of ML

PCA = SVD of the mean-centered data matrix ($n$ samples as rows). Right singular vectors $v_i$ = principal directions; $\sigma_i^2/(n-1)$ = variance along direction $i$, so $\sigma_i^2/\sum_j\sigma_j^2$ = fraction of variance explained. PCA is SVD with a centering step.
Embeddings / latent factors: factor a big sparse matrix (users × pages, words × contexts) into low-rank $U\Sigma V^{\top}$ → dense vectors that capture similarity. Classic recommender = truncated SVD.
Matrix factorization for clickstream: a user×page interaction matrix is low-rank because behavior is driven by a few latent intents → cluster journeys, fill gaps, recommend next page.
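The PCA link can be sketched as follows, assuming NumPy. The data is synthetic and the mixing matrix is made up for illustration. The sketch compares the SVD route on the centered data with the eigen-decomposition of the sample covariance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption): 200 samples, 3 correlated features.
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.5, 0.2]])
X = rng.standard_normal((200, 3)) @ mixing
n = X.shape[0]

# Route 1: center, then SVD. Rows of Vt are the principal directions.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_svd = s ** 2 / (n - 1)

# Route 2: eigen-decomposition of the sample covariance (np.cov divides by n - 1).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)               # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # flip to descending

explained_ratio = var_svd / var_svd.sum()

print("variances via SVD:  ", var_svd)
print("covariance eigvals: ", eigvals)
print("explained ratio:    ", explained_ratio)
```

The variances agree, and the directions agree up to sign. Going through the SVD avoids forming the covariance matrix explicitly.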

By-hand exercise (meets the bar)

Sketch what $A=\begin{bmatrix}3&0\\0&1\end{bmatrix}$ does to the unit circle (ellipse with semi-axes 3 and 1; $U=V=I$, $\sigma=(3,1)$).
For a rank-2 approximation of a matrix with $\sigma=(5,4,2,1)$, what fraction of energy is kept? ($\tfrac{25+16}{25+16+4+1}=\tfrac{41}{46}\approx 89\%$.)
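To check the answers after working them by hand, a minimal sketch assuming NumPy. `energy_kept` is a hypothetical helper name, not a library function.

```python
import numpy as np

def energy_kept(sigma, k):
    """Fraction of squared singular-value mass retained by the top k values."""
    sigma = np.asarray(sigma, dtype=float)
    return np.sum(sigma[:k] ** 2) / np.sum(sigma ** 2)

# Exercise 1: singular values of diag(3, 1) are the semi-axis lengths.
s_diag = np.linalg.svd(np.diag([3.0, 1.0]), compute_uv=False)

# Exercise 2: rank-2 truncation of sigma = (5, 4, 2, 1).
frac = energy_kept([5, 4, 2, 1], 2)

print(s_diag)  # semi-axes 3 and 1
print(frac)    # 41/46, about 0.89
```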