The matrix inverse is a cornerstone of linear algebra, taught, along with its applications, since high school. The inverse of a matrix $latex A$, if it exists, is the matrix $latex A^{-1}$ such that $latex AA^{-1} = A^{-1}A = I_n$. Since the left and right products must both equal the identity, it only makes sense to speak of inverting square matrices. But the square shape alone is not enough: for a matrix $latex A$ to have an inverse, $latex A$ must be full rank.
The inverse provides an elegant (on paper) method of finding the solution to a system of $latex n$ equations in $latex n$ unknowns, which corresponds to solving $latex Ax = b$ for $latex x$. If we’re lucky and $latex A^{-1}$ exists, then we can find $latex x = A^{-1}b$. For this to work, it must be the case that:
- We have exactly as many unknowns as equations
- No equation is redundant, i.e. can be expressed as a linear combination of the others
In this setting, there is a unique solution for $latex x$.
The Moore-Penrose pseudoinverse
What if we have more equations than unknowns? It is most likely the case that we cannot satisfy all the equations perfectly, so let’s settle for a solution that best fits the constraints, in the sense of minimising the sum of squared errors. We solve $latex \operatorname{arg\,min}_x ||b - Ax||$.
And how about the other extreme, where we have a lot of unknowns but just a few equations constraining them? We will probably have an infinity of solutions, so how can we choose one? A popular choice is the one of least $latex \ell_2$ norm: $latex \operatorname{arg\,min}_x ||x|| \operatorname{s.t.} Ax = b$. Is there a way to generalize the idea of a matrix inverse to this setting?
The pseudoinverse of an arbitrarily-shaped matrix $latex A$, written $latex A^{+}$, has the same shape as $latex A^{T}$ and solves our problem: the answer to both optimization problems above is given by $latex x = A^{+}b$.
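To make this concrete, here is a quick sketch of mine using scipy.linalg (the shapes and the seed are arbitrary choices): for a tall system, pinv reproduces the least-squares solution returned by lstsq, and for a wide system it gives a solution that satisfies all the constraints.
[sourcecode lang="python"]
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)

# Tall system: 100 equations, 10 unknowns -> least-squares fit
A = rng.randn(100, 10)
b = rng.randn(100)
x_lstsq = linalg.lstsq(A, b)[0]
np.testing.assert_array_almost_equal(x_lstsq, np.dot(linalg.pinv(A), b))

# Wide system: 10 equations, 100 unknowns -> a solution satisfying every equation
A = rng.randn(10, 100)
b = rng.randn(10)
x = np.dot(linalg.pinv(A), b)
np.testing.assert_array_almost_equal(np.dot(A, x), b)
[/sourcecode]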
The theoretical definition of the pseudoinverse is given by the following conditions. The intuitive way to read them is as properties of $latex AA^+$ or $latex A^+A$ (a quick numerical check follows the list):
- $latex AA^+A = A$
- $latex A^+AA^+ = A^+$
- $latex (AA^+)^T = AA^+$
- $latex (A^+A)^T = A^+A$
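Taking scipy.linalg.pinv on faith for a moment, the four conditions are easy to check numerically on a small random matrix (a sanity-check sketch of mine):
[sourcecode lang="python"]
import numpy as np
from scipy import linalg

A = np.random.randn(7, 4)
A_pinv = linalg.pinv(A)

# The four Moore-Penrose conditions, checked up to numerical precision
np.testing.assert_array_almost_equal(np.dot(np.dot(A, A_pinv), A), A)
np.testing.assert_array_almost_equal(np.dot(np.dot(A_pinv, A), A_pinv), A_pinv)
np.testing.assert_array_almost_equal(np.dot(A, A_pinv), np.dot(A, A_pinv).T)
np.testing.assert_array_almost_equal(np.dot(A_pinv, A), np.dot(A_pinv, A).T)
[/sourcecode]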
These conditions, however, do not give us a way to get our hands on a pseudoinverse, so we need something else.
How to compute the pseudoinverse on paper
The first time I ran into the pseudoinverse, I didn’t even know its definition, only the closed-form expression for the solution of such a problem, given as:
$latex A^+ = (A^T A)^{-1}A^T$
What can we see from this expression?
- It gives us a way to compute the pseudoinverse, and hence to solve the problem.
- If $latex A$ is actually invertible, then $latex A^T$ is invertible too, so we have $latex A^+ = A^{-1}(A^T)^{-1}A^T = A^{-1}$.
- Something bad happens if $latex A^TA$ is not invertible.
The pseudoinverse is still defined, and unique, when $latex A^TA$ is not invertible, but we cannot use the expression above to compute it.
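For a tall matrix with full column rank, so that $latex A^TA$ is invertible, the closed-form expression does match the pseudoinverse. A minimal check of mine:
[sourcecode lang="python"]
import numpy as np
from scipy import linalg

A = np.random.randn(100, 10)   # tall, full column rank with probability one
closed_form = np.dot(linalg.inv(np.dot(A.T, A)), A.T)
np.testing.assert_array_almost_equal(closed_form, linalg.pinv(A))
[/sourcecode]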
Numerical issues
Before going on, we should clarify and demystify some of the urban legends about numerical computation of least squares problems. You might have heard the following unwritten rules:
- Never compute $latex A^{-1}$, solve the system directly
- If you really need $latex A^{-1}$, use pinv and not inv
The first of these rules is based on some misguided beliefs, but is still good advice. If your goal is a one-shot answer to a system, there’s no use in explicitly computing a possibly large inverse, when all you need is $latex x$. But this paper shows that computing the inverse is not necessarily a bad thing. The key to this is conditional accuracy, and as long as the inv function used has good conditional bounds, you will get results as good as with a least squares solver.
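To illustrate the first rule, here is a small sketch of mine comparing a direct solve with going through the explicit inverse; on a well-conditioned random matrix both give essentially the same $latex x$, but the direct solve avoids forming the full inverse.
[sourcecode lang="python"]
import numpy as np
from scipy import linalg

A = np.random.randn(500, 500)
b = np.random.randn(500)

x_solve = linalg.solve(A, b)        # solve the system directly
x_inv = np.dot(linalg.inv(A), b)    # explicit inverse, then multiply

np.testing.assert_array_almost_equal(x_solve, x_inv)
[/sourcecode]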
The second rule comes from numerical stability, and will definitely bite you if misunderstood. If $latex A$ is a square matrix with a row full of zeros, it’s clearly not invertible, so an algorithm attempting to compute the inverse will fail and you will be able to catch that failure. But what if the row is not exactly zero, but the sum of several other rows, and a slight loss of precision is propagated at every step?
Numerical rank vs. actual rank
The rank of a matrix $latex A$ is defined as the number of linearly independent rows (or equivalently, columns) in $latex A$. In other words, the number of non-redundant equations in the system. We’ve seen before that if the rank is less than the total number of rows, the system cannot have a unique solution anymore, so the matrix $latex A$ is not invertible.
Computing the rank of a matrix is a tricky numerical problem. On paper, with small matrices, you would look at minors of decreasing size, until you find the first non-zero one. This is infeasible to implement on a computer, so numerical analysis has a different approach. Enter the singular value decomposition!
The SVD of a matrix $latex A$ is $latex A = USV^{T}$, where $latex S$ is diagonal and $latex U, V$ are orthogonal. The elements on the diagonal of $latex S$ are called the singular values of $latex A$. It can be seen that to get a row full of zeros when multiplying three such matrices, a singular value needs to be exactly zero.
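A quick numerical illustration of these properties (a small sketch, with arbitrary shapes): scipy.linalg.svd returns the singular values as a vector, and $latex U$ and $latex V$ have orthonormal columns.
[sourcecode lang="python"]
import numpy as np
from scipy import linalg

A = np.random.randn(6, 4)
U, s, Vt = linalg.svd(A, full_matrices=False)  # s holds the singular values

np.testing.assert_array_almost_equal(np.dot(U.T, U), np.eye(4))
np.testing.assert_array_almost_equal(np.dot(Vt, Vt.T), np.eye(4))
np.testing.assert_array_almost_equal(np.dot(U * s, Vt), A)  # U diag(s) V^T == A
[/sourcecode]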
The ugly thing that could happen is that one (or usually more) singular values are not exactly zero, but very low values, due to propagated imprecision. Why is this a problem? By looking at the SVD and noting its properties, it becomes clear that $latex A^{-1} = VS^{-1}U^{T}$ and since $latex S$ is diagonal, its inverse is formed by taking the inverse of all the elements on the diagonal. But if a singular value is very small but not quite zero, its inverse is very large and it will blow up the whole computation of the inverse. The right thing to do here is either to tell the user that $latex A$ is numerically rank deficient, or to return a pseudoinverse instead. A pseudoinverse would mean: give up on trying to get $latex AA^+$ to be the identity matrix, and simply aim for a diagonal matrix with approximately ones and zeroes. In other words, when singular values are very low, set them to 0.
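Here is a minimal sketch of that idea (my own simplified version, not the actual scipy code, and the rcond value is an arbitrary choice): build the pseudoinverse from the SVD, inverting only the singular values above a relative threshold. For comparison I use np.linalg.pinv, which is also SVD-based.
[sourcecode lang="python"]
import numpy as np
from scipy import linalg

def pinv_by_svd(A, rcond=1e-11):
    # Simplified sketch: zero out singular values below rcond * largest
    U, s, Vt = linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    large = s > rcond * s.max()
    s_inv[large] = 1.0 / s[large]
    return np.dot(Vt.T * s_inv, U.T)   # V diag(s_inv) U^T

A = np.random.randn(1000, 50)
A = np.dot(A, A.T)                     # rank at most 50: many tiny singular values
np.testing.assert_array_almost_equal(pinv_by_svd(A), np.linalg.pinv(A))
[/sourcecode]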
How do you set the threshold? This is actually a delicate issue, one that has been discussed on the NumPy mailing list.
Scipy implementations
Scipy exposes inv, pinv and pinv2. inv secretly invokes LAPACK, that ancient but crazy robust code that’s been used since the 70s, to first compute a pivoted LU decomposition that is then used to compute the inverse. pinv also uses LAPACK, but for computing the least-squares solution to the system $latex AX = I$. pinv2 computes the SVD and transposes everything, as shown above. Both pinv and pinv2 expose cond and rcond arguments to handle the treatment of very small singular values, but (attention!) they behave differently!
The different implementations also lead to different speeds. Let’s look at inverting a random square matrix:
[sourcecode lang=”python”]
In [1]: import numpy as np
In [2]: from scipy import linalg
In [3]: a = np.random.randn(1000, 1000)
In [4]: timeit linalg.inv(a)
10 loops, best of 3: 132 ms per loop
In [5]: timeit linalg.pinv(a)
1 loops, best of 3: 18.8 s per loop
In [6]: timeit linalg.pinv2(a)
1 loops, best of 3: 1.58 s per loop
[/sourcecode]
Woah, huge difference! But do all three methods return the “right” result?
[sourcecode lang=”python”]
In [7]: linalg.inv(a)[:3, :3]
Out[7]:
array([[ 0.03636918, 0.01641725, 0.00736503],
[-0.04575771, 0.03578062, 0.02937733],
[ 0.00542367, 0.01246306, 0.0122156 ]])
In [8]: linalg.pinv(a)[:3, :3]
Out[8]:
array([[ 0.03636918, 0.01641725, 0.00736503],
[-0.04575771, 0.03578062, 0.02937733],
[ 0.00542367, 0.01246306, 0.0122156 ]])
In [9]: linalg.pinv2(a)[:3, :3]
Out[9]:
array([[ 0.03636918, 0.01641725, 0.00736503],
[-0.04575771, 0.03578062, 0.02937733],
[ 0.00542367, 0.01246306, 0.0122156 ]])
In [10]: np.testing.assert_array_almost_equal(linalg.inv(a), linalg.pinv(a))
In [11]: np.testing.assert_array_almost_equal(linalg.inv(a),
linalg.pinv2(a))
[/sourcecode]
Looks good! This is because we got lucky, though, and a was invertible to start with. Let’s look at its spectrum:
[sourcecode lang=”python”]
In [12]: _, s, _ = linalg.svd(a)
In [13]: np.min(s), np.max(s)
Out[13]: (0.029850235603382822, 62.949785645178906)
[/sourcecode]
This is a lovely range for the singular values of a matrix, not too small, not too large. But what if we built the matrix in a way that would always pose problems? Specifically, let’s look at the case of covariance matrices:
[sourcecode lang=”python”]
In [14]: a = np.random.randn(1000, 50)
In [15]: a = np.dot(a, a.T)
In [16]: _, s, _ = linalg.svd(a)
In [17]: s[-9:]
Out[17]:
array([ 7.40548924e-14, 6.48102455e-14, 5.75803505e-14,
5.44263048e-14, 4.51528730e-14, 3.55317976e-14,
2.46939141e-14, 1.54186776e-14, 5.08135874e-15])
[/sourcecode]
a has at least 9 tiny singular values. Actually it’s easy to see why there are 950 of them: a is the product of a 1000×50 matrix with its transpose, so its rank is at most 50.
[sourcecode lang=”python”]
In [18]: np.sum(s < 1e-10)
Out[18]: 950
[/sourcecode]
How do our functions behave in this case? Instead of just looking at a corner, let’s use our gift of sight:
[Figure: pseudoinverses.png, comparing the matrices computed by inv, pinv and pinv2]
The small eigenvalues are large enough that inv thinks the matrix is full rank. pinv does better, but it still fails: you can see a group of high-amplitude noisy columns. pinv2 is faster and it also gives us a useful result in this case.
Wait, does this mean that pinv2 is simply better, and pinv is useless? Not quite. Remember, we are now trying to actually invert matrices, and degrade gracefully in case of rank deficiency. But what if we need the pseudoinverse to solve an actual non-square, wide or tall system?
[sourcecode lang=”python”]
In [19]: a = np.random.randn(1000, 50)
In [20]: timeit linalg.pinv(a)
10 loops, best of 3: 104 ms per loop
In [21]: timeit linalg.pinv(a.T)
100 loops, best of 3: 7.08 ms per loop
In [22]: timeit linalg.pinv2(a)
10 loops, best of 3: 114 ms per loop
In [23]: timeit linalg.pinv2(a.T)
10 loops, best of 3: 126 ms per loop
[/sourcecode]
Huge victory for pinv in the wide case! Hurray! With all this insight, we can take a step back and sum up what we learned.
- If you are 100% sure that your matrix is invertible, use inv for a huge speed gain. The implementation of inv from Scipy is based on LAPACK’s *getrf + *getri, known to have good bounds.
- If you are trying to solve a tall or wide system, use pinv.
- If your matrix is square but might be rank deficient, use pinv2 for speed and numerical gain.
Improving the symmetric case
But wait a second, can’t we do better? $latex AA^T$ is symmetric, can’t we make use of that to speed up the computation even more? Clearly, if $latex A$ is symmetric (and positive semi-definite, as covariance matrices are), in its SVD $latex A = USV^T$ we must have $latex U = V$. But this is exactly the eigendecomposition of a symmetric matrix $latex A$. The eigendecomposition can be computed more cheaply than the SVD using Scipy’s eigh, which uses LAPACK’s *evr. As part of my GSoC this year, with help from Jake VanderPlas, we made a pull request to Scipy containing a pinvh function that is equivalent to pinv2 but faster for symmetric matrices.
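For intuition, here is a rough sketch of mine of the eigendecomposition-based approach (not the actual pinvh implementation; Scipy’s threshold handling is more careful, and the rcond value here is arbitrary):
[sourcecode lang="python"]
import numpy as np
from scipy import linalg

def pinv_symmetric(a, rcond=1e-11):
    # Sketch: eigendecompose, invert only the eigenvalues that are not tiny
    w, v = linalg.eigh(a)                    # a == v diag(w) v^T
    w_inv = np.zeros_like(w)
    large = np.abs(w) > rcond * np.abs(w).max()
    w_inv[large] = 1.0 / w[large]
    return np.dot(v * w_inv, v.T)            # v diag(w_inv) v^T

a = np.random.randn(1000, 50)
a = np.dot(a, a.T)                           # symmetric, rank at most 50
np.testing.assert_array_almost_equal(pinv_symmetric(a), linalg.pinvh(a))
[/sourcecode]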
[sourcecode lang=”python”]
In [24]: timeit linalg.pinv2(a)
1 loops, best of 3: 1.54 s per loop
In [25]: timeit linalg.pinvh(a)
1 loops, best of 3: 621 ms per loop
In [26]: np.testing.assert_array_almost_equal(linalg.pinv2(a),
linalg.pinvh(a))
[/sourcecode]