Disclaimer: This blog is my attempt to primarily document and organize concepts that I teach to myself, and in no way is it guaranteed that the reasoning developed in my posts is error-free. Through my blog I aspire to let all these ideas mature while receiving constructive feedback, eventually developing intuition. I warmly welcome constructive insights, and discussions so let’s keep this a friendly, respectful space for learning and knowledge sharing. Your contribution highly matters to me and is appreciated!
The SVD Decomposition into shifts and rotations.
Image source: Georg-Johann
SVD is the backbone of a lot of core concepts in Machine Learning and Data Science. One of the most celebrated tools based on SVD is the Principal Component Analysis (PCA), which we will talk about later in the future. In this post, we will be having a look at some quick Linear Algebra warm-up and then proceed to get a grasp the SVD’s core functionality.
Useful Linear Algebra Observations
A few observations from Linear Algebra to consider:
- Matrices do not necessarily have orthogonal eigenvectors. In general, if a matrix has GM(λi) = AM(λi) for each eigenvalue λi, the matrix is diagonalisable. Here, GM(λi) is the Geometric Multiplicity of λi (= the number of linearly independent eigenvectors) and AM(λi) is the Arithmetic Multiplicity of λi (= the number of roots at λi).
- Symmetric matrices have real eigenvalues; however, it is not guaranteed that these eigenvalues are also distinct.
- The $n$ eigenvectors of symmetric matrices can be chosen orthogonal, regardless of eigenvalue multiplicity.
- As a consequence of I.2., any symmetric matrix is diagonalisable.
- A matrix $A$ that satisfies $A^\top A=I$ (i.e. $A$ is orthogonal), also satisfies $AA^\top=I$.
Diagonalisation and Other Decompositions
A diagonalisable square matrix $A\in \mathbb{R}^{n\times n}$ can be expressed in the form $A=X \Lambda X^{-1}$, where $X$ is an invertible matrix whose columns are the eigenvectors of $A$, and $\Lambda$ is a diagonal matrix containing the corresponding eigenvalues of $A$.
Additionally, if $A$ is symmetric, then due to I.3. the eigenvectors can be chosen orthonormal. Then, $A=U \Lambda U^{\top}$, where $U$ is a matrix whose columns contain the eigenvectors of $A$ and satisfies $U^{\top}U=I$ (i.e. $U$ is orthogonal).
In the case of positive definite (real) matrix (i.e. symmetric with nonnegative eigenvalues $\lambda_i$) $A$, its LU decomposition yields a specific form $A=L\sqrt{P}L^\top$, where $L$ is lower triangular and $\sqrt{P}$ is a diagonal matrix containing the square roots of pivots. Choose $\tilde{L}:=L\sqrt{P}^\top$. Then, $A=\tilde{L}\tilde{L}^\top$. This is known as the Cholesky decomposition of $A$.
From the above observations, we can deduce that a positive definite (real) matrix $A$ can always be expressed in the following forms:
- $A = BB^\top=LPL^\top$, where $B:=L\sqrt{P}$ is lower triangular, and
- $A = BB^\top=Q\Lambda Q^\top$, where $B:=Q\sqrt{\Lambda}Q^\top$ is a symmetric matrix (symmetry is easy to validate). One can use I.5. to see this.
The SVD
Pheww! That was a nostalgic throwback to undergrad level linear algebra. You may have noticed that in the previous section we focused on symmetric, positive definite matrices. This is going to be useful later on when discussing one of the fundamental applications of SVD, the PCA.
Without further ado, let’s get to the core of this post. Since the matrix $A$ containing samples and records of data does not necessarily have to be square (i.e., the number of features is not necessarily the same as the number of samples), we need to generalize the notion of eigenvalues. It is straightforward to see how the definition of eigenvalues and eigenvectors fail rectangular matrices in general. This leads us to the introduction of singular values of $A\in\mathbb{R}^{n\times m}$ as a generalization of eigenvalues of $A$. Define $\sigma_i$, $1\leq i \leq r$, where $r$ is the rank of $A$, $u_i$ where $1\leq i\leq m$ and $v_i$ where $1\leq i\leq n$. Then, we have: $$ Au_i = \sigma_i v_i, \hspace{5px} 1\leq i \leq r,$$ $$ Au_i = 0, \hspace{5px} r+1\leq i \leq m,$$ From the above we can see that $u_i$ for $r+1\leq i \leq m$ belongs to the nullspace $N(A)$ of $A$. Moreover, $u_i$ for $r+1\leq i \leq m$ belongs to the nullspace $N(A^\top)$ of $A^\top$. Our matrix $A$ can be decomposed as $$A=V\Sigma U^\top,$$ where $U$ and $V$ contain all the $u$’s and $v$’s as columns. $\Sigma=\begin{bmatrix}\text{diag}(\sigma_1,\dots,\sigma_r) & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix}\in\mathbb{n\times m}$.
Let’s now check properties of $A^\top A$ and $AA^\top$: $$ A^\top A = (V\Sigma U^\top)^\top(V\Sigma U^\top) = U\Sigma^\top\Sigma U^\top $$ $$ A A^\top = (V\Sigma U^\top)(V\Sigma U^\top)^\top = V\Sigma\Sigma^\top V^\top $$ Notice that both $\Sigma^\top\Sigma$ and $\Sigma\Sigma^\top$ are square matrices that are always diagonal. We can clearly see how both these matrices can be brought into the form $A^\top A = B_1B_1^\top$ and $AA^\top = B_2B_2^\top$, where $B_1:=U\sqrt{\Sigma^\top\Sigma}U^\top$ and $B_2:=V\sqrt{\Sigma\Sigma^\top}V^\top$ are symmetric matrices. Therefore, $U$ contains the eigenvectors of $A^\top A$ as columns and $V$ contains the eigenvectors of $A A^\top$ as columns.
The SVD decomposition can be interpreted as successive rotations and scalings applied to the unit matrix. More specifically, the orthogonal matrices $U$, $V$ can be seen as rotation matrices, while the diagonal matrix $\Sigma$ can be seen as a scaling matrix. This idea is showcased in the chosen post image above.
Finally, let us now further expand the SVD decomposition into a sum of “smaller” matrices. $$ A = V\Sigma U^\top = \sigma_1v_1u_1^\top + \sigma_2v_2u_2^\top + \dots + \sigma_rv_ru_r^\top$$ Notice how any product $xy^\top$ ($x,y$ column vectors) results to rank-1 matrices. The SVD expresses a matrix into a weighted sum of rank-1 matrices. This is a fascinating insight into our matrix; given our choice of orthogonal matrices $U$ and $V$, we can have our matrix $A$ expressed as a sum of most “foundational” components, which are the rank-1 matrices. In other words, rank-1 matrices are the “smallest”/most “foundational” parts we can break our initial matrix into. Each of these foundational parts represents a combination between a feature pattern and a measurement pattern.
From the expansion above, it is now very clear that we can rewrite the SVD of $A$ in another equivalent form: $$A=V_r\Sigma_r U_r^\top,$$ where $V_r$ and $U_r$ only contain the first $r$ columns of $V$ and $U$, respectively. $\Sigma_r$ is a $r\times r$ square matrix.
This $\Sigma_r$ contains the singular values of the matrix, which denote the intensity of each such foundational component. That is, some components may have higher significance than others. We need to pay attention to this detail as this can give us more insight about actual patterns across features and measurements, as some weaker ones can be false patterns attributed, for example, to measurement noise. Thus, a correct interpretation of $\Sigma_r$ can help us retain useful information when it comes to extracting features, and we can even choose to represent the initial matrix using fewer (but truly useful) of these foundational parts — thus less can be more!
On the next post we will explore these ideas through a fun small project!