Linear Regression - least squares with orthogonal projection

Compared to the previous article where we simply used vector derivatives we’ll now try to derive the formula for least squares simply by the properties of linear transformations and the four fundamental subspaces of linear algebra. These are:

  • Kernel $Ker(A)$: The set of all solutions to $Ax = 0$. Sometimes we can say nullspace $N(A)$ instead of kernel.
  • Image $Im(A)$: The set of all right sides $b$, for which there is a solution $Ax = b$. We’ll show that this is equal to the column space $C(A)$, which is the span of the column vectors in $A$.
  • Row space $R(A)$: Span of the row vectors in $A$, sometimes also referred to as $Im(A^T)$ (the image of $A^T$). We can also refer to this as $C(A^T)$, because since $Im(A) = C(A)$, then $Im(A^T) = C(A^T) = R(A)$.
  • Left kernel $Ker(A^T)$ (or left nullspace): The set of all solutions to $A^T x = 0$. The name comes from left multiplying by $x$, specifically the set of solutions to $x^T A = 0^T$.

For this derivation we assume that $Ker(A) \perp R(A)$ and $Im(A) \perp Ker(A^T)$.

When A is not invertible (could be rectangular), there is no exact solution to $Ax = b$, because $b$ has a component in $Ker(A^T)$, which is outside the range of $A$ (literally). We can define $b = b_i + b_n$ where $b_i$ is the ortogonal projection of $b$ onto $Im(A)$, and $b_n$ is the ortogonal projection of $b$ onto $Ker(A^T)$. In other words, $b_i \perp b_n$.

The above is valid, because we assume $Im(A) \perp Ker(A^T)$, and that $span(Im(A) \cup Ker(A^T)) = rng(A)$, in other words that $Im(A)$ and $Ker(A^T)$ together generate the whole range of our linear mapping $A$. Now just using basic algebra:

$$ \begin{align} b &= b_i + b_n \\\
b &= Ax + b_n & \text{left multiply by $A^T$} \\\
A^T b &= A^T A x + A^T b_n & \text{since $b_n \in Ker(A^T)$, we know $A^T b_n = 0$} \\\
A^T b &= A^T A x & \text{$A^T A$ is invertible, see note below} \\\
(A^T A)^{-1} A^T b &= x & \text{finally, we get the normal equation} \end{align} $$

Here we used the fact that $A^T A$ is always a symmetric positive semi-definite matrix, and in case we have linearly independent columns, it is actually positive-definite, which means it is also invertible. This is actually easy to show.

First we show that $A^T A$ is symmetric. This is easy to see, because $(A^T A)_{ij}$ is just the dot product of $i$-th row of $A^T$ with the $j-th$ column of $A$. Note that $i$-th row of $A^T$ is actually $i$-th column of $A$. From this it is obvious that $(A^T A)_{ij} = (A^T A)_{ji}$, because dot product is symmetric.

Now we show that $A^T A$ is positive semi-definite. For an arbitrary matrix $M$, we say that $M$ is positive semi-definite if and only if $x^T M x \geq 0$ for all $x \in \mathbb{R}$. We can directly substitute $A^T A$ and use the same trick as below:

$$ x^T A^T A x = (A x)^T A x = ||A x||^2 \geq 0 $$

Since $A^T A$ satisfies the definition directly, it is positive-semidefinite. $\square$

There is also another very nice way to show that $A^T A$ is invertible, without showing that it is positive semi-definite.

Lemma $Ker(A^T A) = Ker(A)$.

Starting with $Ker(A^T A) \supseteq Ker(A)$, this is trivial, since $Ax = 0 \implies A^T (Ax) = 0$.

Next $Ker(A^T A) \subseteq Ker(A)$: $A^T A x = 0$, left multiply by $x^T$ and we get:

$$ 0 = x^T A^T A x = (A x)^T A x = || Ax ||^2. $$

Since the $L_2$ norm is zero only if the vector is zero, we get that any vector $x$ for which $A^T A x = 0$, it is also true that $|| Ax ||^2 = 0$, which can only be true when $A x = 0$, and hence $x \in Ker(A)$. $\square$

Because $Ker(A^T A) = Ker(A)$, we also know that $rank(A^T A) = rank(A)$, which means if $A$ has linearly independent columns, $A^T A$ is invertible, because it has a full rank (this is because $A^T A$ is square and has the same number of rows/columns as $A$ has columns).