Optimizing Orthogonal Matching Pursuit code in Numpy, part 1

After intense code optimization work, my implementation of OMP finally beat least-angle regression! This was the primary issue discussed during the pull request, so once performance was taken care of, the code was ready for merge. Orthogonal matching pursuit is now available in scikits.learn as a sparse linear regression model. OMP is a key building block of the dictionary learning code that we are working on merging.

I will go through the process of developing this particular piece of code as an example of code refining and iterative improvements, as well as for the useful notes it will provide on optimizing numerical Python code. In the first part we will see how the code got from pseudocode state to a reasonably efficient code with smart memory allocation. In the next part we will see how to make it blazing fast by leveraging [1] lower level BLAS and LAPACK routines, and how to use profiling to find hot spots.

As stated before, orthogonal matching pursuit is a greedy algorithm for finding a sparse solution $\gamma$ to a linear regression problem $X\gamma = y$. Mathematically, it approximates the solution of the optimization problem:

\$\$ \text{argmin} {\big|\big|} \gamma {\big|\big|} _0 \text{ subject to }{\big |\big|}y-X\gamma{\big|\big|}_2\^2 \leq \epsilon \$\$
or (under a different parametrization):
\$\$\text{argmin} {\big |\big|}y - X\gamma{\big |\big|}_2\^2 \text{ subject to } {\big|\big|}\gamma{\big|\big|}_0 \leq n_{\text{nonzero coefs}}\$\$

In the code samples in this post I will omit the docstrings, but I will follow the notation in the formulas above.

Important note: The regressors/dictionary atoms (the columns of $X$) are assumed to be normalized throughout this post (as well as usually any discussion of OMP). We also assume the following imports:
[sourcecode language=”Python”]
import numpy as np
from scipy import linalg
[/sourcecode]

Orthogonal matching pursuit is a very simple algorithm in pseudocode, and as I stated before, it almost writes itself in Numpy. For this reason, instead of stating the pseudocode here, I will start with how naively implemented OMP looks like in Python:

[sourcecode language=”Python”]
def orthogonal_mp(X, y, n_nonzero_coefs, eps=None):
residual = y
idx = []
if eps == None:
stopping_condition = lambda: len(idx) == n_nonzero_coefs
else:
stopping_condition = lambda: np.inner(residual, residual) <= eps
while not stopping_condition():
lam = np.abs(np.dot(residual, X)).argmax()
idx.append(lam)
gamma, _, _, _ = linalg.lstsq(X[:, idx], y)
residual = y - np.dot(X[:, idx], gamma)
return gamma, idx
[/sourcecode]

Using lambda expressions as stopping conditions never looked like a brilliant idea, but it seems to me like the most elegant way to specify such a variable stopping condition. However, the biggest slowdown in this is the need for solving a least squares problem at each iteration, while least-angle regression is known to produce the entire regularization path for the cost of a single least squares problem. We will also see that this implementation is more vulnerable to numerical stability issues.

In [2], Rubinstein et al. described the Cholesky-OMP algorithm, an implementation of OMP that avoids solving a new least squares problem at each iteration by keeping a Cholesky decomposition $LL’$ of the Gram matrix $G = X_{\text{idx}}’X_{\text{idx}}$. Because $X_{\text{idx}}$ grows by exactly one column at each iteration, $L$ can be updated according to the following rule: Given $A = \begin{pmatrix} \tilde{A} & \mathbf{v}’ \ \mathbf{v} & c \end{pmatrix}$, and knowing the decomposition of $\tilde{A} = \tilde{L}\tilde{L}’$, the Cholesky decomposition $A = LL’$ is given by \$\$ L = \begin{pmatrix}\tilde{L} & \mathbf{0} \ \mathbf{w}’ & \sqrt{c - \mathbf{w}’\mathbf{w}} \end{pmatrix}, \text{ where } \tilde{L}\mathbf{w} = \mathbf{v}\$\$

Even if you are unfamiliar with the mathematical properties of the Cholesky decomposition, you can see from the construction detailed above that $L$ is always going to be a lower triangular matrix (it will only have null elements above the main diagonal). Actually, the letter L stands for lower. We have therefore replaced the step where we needed to solve the least-squares problem $X_{\text{idx}}\gamma = y$ with two much simpler computations: solving $\tilde{L}\mathbf{w} = \mathbf{v}$ and solving $LL’\gamma = X_{\text{idx}}’y$. Due to the $L$’s structure, these are much quicker operations than a least squares projection.
Here is the initial way I implemented this:

[sourcecode language=”Python”]

def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
if eps == None:
stopping_condition = lambda: it == n_nonzero_coefs
else:
stopping_condition = lambda: np.inner(residual, residual) \<= eps

alpha = np.dot(X.T, y)
residual = y
idx = []
L = np.ones((1,1))

while not stopping_condition():
lam = np.abs(np.dot(residual, X)).argmax()
if len(idx) > 0:
w = linalg.solve_triangular(L, np.dot(X[:, idx].T, X[:, lam]),
lower=True)
L = np.r_[np.c_[L, np.zeros(len(L))],
np.atleast_2d(np.append(w, np.sqrt(1 - np.dot(w.T, w))))]
idx.append(lam)
# solve LL’x = y in two steps:
Ltx = linalg.solve_triangular(L, alpha[idx], lower=True)
gamma = linalg.solve_triangular(L, Ltx, trans=1, lower=True)
residual = y - np.dot(X[:, idx], gamma)

return gamma, idx
[/sourcecode]

Note that a lot of the code remained unchanged, this is the same algorithm as before, only the Cholesky trick is used to improve performance. According to the plot in [3], we can see that the naive implementation has oscillations of the reconstruction error due to numerical instability, while this Cholesky implementation is well-behaved.

Along with this I also implemented the Gram-based version of this algorithm, which only needs $X’X$ and $X’y$ (and ${\big|\big|}y{\big|\big|}_2\^2$, in case the epsilon-parametrization is desired). This is called Batch OMP in [2], because it offers speed gains when many signals need to be sparse coded against the same dictionary $X$. A lot of speed is gained because two large matrix multiplications are avoided at each iteration, but for many datasets, the cost of the precomputations dominates the procedure. I will not insist on Gram OMP in this post, it can be found in the scikits.learn repository [4].

Now, the problems with this are a bit more subtle. At this point, I moved on to code other things, since OMP was passing tests and the signal recovery example was working. The following issues popped up during review:

​1. The lambda stopping condition does not pickle.
2. For well-constructed signals and data matrices, assuming normal atoms, $\mathbf{w}$ on line 14 will never have norm greater than or equal to zero, unless the chosen feature happens to be dependent of the already chosen set. In theory, this cannot happen, since we do an orthogonal projection at each step. However, if the matrix $X$ is not well-behaved (for example, if it has two identical columns, and $y$ is built using non-zero coefficients for those columns), then we end up with the square root of a negative value on line 17.
3. It was orders of magnitude slower than least-angle regression, given the same number of nonzero coefficients.

1 was an easy fix. 2 was a bit tricky since it was a little hidden: the first time I encountered such an error, I wrongfully assumed that given that the diagonal of $X_\text{idx}’X_\text{idx}$ was unit, then $L$ should also have a unit diagonal, so I passed the parameter unit_diagonal=True to linalg.solve_triangular, and the plethora of NaN’s along the diagonal were simply ignored. Let this show what happens when you don’t pay attention when coding.

When I realized my mistake, I first did something I saw in lars_path from the scikit: take the absolute value of the argument of sqrt, and also ensure it is practically larger than zero. However, tests started failing randomly. Confusion ensued until the nature of the issue, discussed above, was discovered. It’s just not right to take the abs: if that argument ends up less than zero, OMP simply cannot proceed and must stop due to malformed data. The reference implementation from the website of the authors of [2] includes explicit early stopping conditions for this, along with some other cases.

At the same time, I started to try a couple of optimizations. The most obvious thing was the way I was building the matrix $L$ was clearly suboptimal, reallocating it at each iteration.

This leads to the following code:

[sourcecode language=”Python”]

def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
min_float = np.finfo(X.dtype).eps
alpha = np.dot(X.T, y)
residual = y
n_active = 0
idx = []

max_features = X.shape1 if eps is not None else n_nonzero_coefs
L = np.empty((max_features, max_features), dtype=X.dtype)
L[0, 0] = 1.

while 1:
lam = np.abs(np.dot(X.T, residual)).argmax()
if lam \< n_active or alpha[lam] ** 2 > min_float:
# atom already selected or inner product too small
warn(“Stopping early”)
break
if n_active > 0:
# Updates the Cholesky decomposition of X’ X
w = linalg.solve_triangular(L[:n_active, :n_active],
np.dot(X[:, idx].T, X[:, lam]),
lower=True)
L[n_active, :n_active] = w
d = np.dot(w.T, w)
if 1 - d <= min_float: # selected atoms are dependent
warn(“Stopping early”)
break
L[n_active, n_active] = np.sqrt(1 - d)
idx.append(lam)
# solve LL’x = y in two steps:
Ltx = linalg.solve_triangular(L[:n_active, :n_active], alpha[idx], lower=True)
gamma = linalg.solve_triangular(L[:n_active, :n_active], Ltx, trans=1, lower=True)
residual = y - np.dot(X[:, idx], gamma)
if eps is not None and np.dot(residual.T, residual) <= eps:
break
elif n_active == max_features:
break
return gamma, idx
[/sourcecode]

What should be noted here, apart from the obvious fix for #1, are the early stopping conditions. It is natural to stop if the same feature gets picked twice: the residual is always orthogonalized with respect to the chosen basis, so the only way this could happen is if there would be no more unused independent regressors. This would either lead to this, or to the stopping criterion on line 25, depending on which equally insignificant vector gets picked. The other criterion for early stopping is if the chosen atom is orthogonal to y, which would make it uninformative and would again mean that there are no better ones left, so we might as well quit looking.

Also, we now make sure that $L$ is preallocated. Note that np.empty is marginally faster than np.zeros because it does not initialize the array to zero after allocating, so the untouched parts of the array will contain whatever happened to be in memory before. In our case, this means only the values above the main diagonal: everything on and beneath is initialized before access. Luckily, the linalg.solve_triangular function ignores what it doesn’t need.

This is a robust implementation, but still a couple of times slower than least-angle regression. In the next part of the article we will see how we can make it beat LARS.

1 I always wanted to use this word in a serious context :P
2 Rubinstein, R., Zibulevsky, M. and Elad, M., [Efficient Implementation of the K-SVD Algorithm using Batch Orthogonal Matching Pursuit][] Technical Report - CS Technion, April 2008.
3 First thoughts on Orthogonal Matching Pursuit on this blog.
4 omp.py on github.

[Efficient Implementation of the K-SVD Algorithm using Batch Orthogonal Matching Pursuit]: http://www.cs.technion.ac.il/~ronrubin/Publications/KSVD-OMP-v2.pdf