Optimizing Orthogonal Matching Pursuit code in Numpy, part 2

EDIT: There was a bug in the final version of the code presented here. It is fixed now, for its backstory, check out my blog post on it.

When we last saw our hero, he was fighting with the dreaded implementation of least-angle regression, knowing full well that it was his destiny to be faster.

We had come up with a more robust implementation, catching malformed cases that would have broken the naive implementation, and also it was orders of magnitude faster than said implementation. However, our benchmark [1] showed that it was a couple of times slower than least-angle regression.

By poking around the scikits.learn codebase, I noticed that there is a triangular system solver in scikits.learn.utils.arrayfuncs. Unlike the scipy.linalg one, this one only works with lower triangular arrays, and it forcefully overwrites b. Even though if weren’t faster, it should still be used: scikits.learn aims to be as backwards-compatible with SciPy as possible, and linalg.solve_triangular was added in 0.9.0. Anyway, let’s just see whether it’s faster:

[sourcecode language=”python”]
In 1: import numpy as np

In 2: from scipy import linalg

In [3]: from scikits.learn.datasets import make_spd_matrix

In [4]: from scikits.learn.utils.arrayfuncs import solve_triangular

In [5]: G = make_spd_matrix(1000)

In [6]: L = linalg.cholesky(G, lower=True)

In [7]: x = np.random.randn(1000)

In [8]: y = x.copy()

In [9]: timeit solve_triangular(L, x)
100 loops, best of 3: 3.45 ms per loop

In [10]: timeit linalg.solve_triangular(L, y, lower=True, overwrite_b=True)
10 loops, best of 3: 134 ms per loop
 [/sourcecode]

Wow! That’s 40x faster. We’re catching two rabbits with one stone here, let’s do the change! Notice that we can just copy [latex] \mathbf{v}[/latex] into the appropriate place in [latex] L[/latex] and then solve in place.

But whoops! When solving the [latex] LL’[/latex] system, we take advantage of the transpose attribute in linalg.solve_triangular, which the scikits.learn version does not expose. We could think of a solution, but here’s a better idea: Shouldn’t there be some way to directly solve the entire system in one go?

Well, there is. It is an LAPACK function by the name of potrs. If you are not aware, LAPACK is a Fortran library with solvers for various types of linear systems and eigenproblems. LAPACK along with BLAS (on which it is based) pretty much powers all the scientific computation that happens. BLAS is an API with multiple implementations dating from 1979, while LAPACK dates from 1992. If you ever used Matlab, this is what was called behind the scenes. SciPy, again, provides a high-level wrapper around this, the linalg.cho_solve function.

But SciPy also gives us the possibility to import functions directly from LAPACK, through the use of linalg.lapack.get_lapack_funcs. Let’s see how the low-level LAPACK function compares to the SciPy wrapper, for our use case:

[sourcecode language=”python”]
In [11]: x = np.random.randn(1000)

In [12]: y = x.copy()

In [13]: timeit linalg.cho_solve((L, True), x)
1 loops, best of 3: 95.4 ms per loop

In [14]: potrs, = linalg.lapack.get_lapack_funcs((‘potrs’,), (G,))

In [15]: potrs
Out[15]: <fortran object>

In [16]: timeit potrs(L, y)
100 loops, best of 3: 9.49 ms per loop
 [/sourcecode]

That’s 10 times faster! So now we found an obvious way to optimize the code:

[sourcecode language=”python”]
def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
min_float = np.finfo(X.dtype).eps
potrs, = get_lapack_funcs((‘potrs’,), (X,))
alpha = np.dot(X.T, y)
residual = y
n_active = 0
idx = []

max_features = X.shape1 if eps is not None else n_nonzero_coefs
L = np.empty((max_features, max_features), dtype=X.dtype)
L[0, 0] = 1.

while 1:
lam = np.abs(np.dot(X.T, residual)).argmax()
if lam < n_active or alpha[lam] ** 2 < min_float:
# atom already selected or inner product too small
warn(“Stopping early”)
break
if n_active > 0:
# Updates the Cholesky decomposition of X’ X
L[n_active, :n_active] = np.dot(X[:, idx].T, X[:, lam]
solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
d = np.dot(L[n_active, :n_active].T, L[n_active, :n_active])
if 1 - d <= min_float: # selected atoms are dependent
warn(“Stopping early”)
break
L[n_active, n_active] = np.sqrt(1 - d)
idx.append(lam)
# solve LL’x = y in two steps:
gamma, _ = potrs(L[:n_active, :n_active], alpha[idx], lower=True,
overwrite_b=False)
residual = y - np.dot(X[:, idx], gamma)
if eps is not None and np.dot(residual.T, residual) <= eps:
break
elif n_active == max_features:
break
return gamma, idx
 [/sourcecode]

Woohoo! But we still lag behind. Now that we delegated the trickiest parts of the code to fast and reliable solvers, it’s time to use a profiler and see what the bottleneck is now. Python has excellent tools for this purpose. What solved the problem in this case was line_profiler [2]. There is a great article by Olivier Grisel here 2 regarding how to use these profilers. I’m just going to say that line_profiler‘s output is very helpful, basically printing the time taken by each line of code next to that line.

Running the profiler on this code, we found that 58% of the time is spent on line 14, 20.5% on line 21, and 20.5% on line 32, with the rest being insignificant (potrs takes 0.1%!). The code is clearly dominated by the matrix multiplications. By running some more timings with IPython, I found that multiplying such column-wise views of the data as X[:, idx] is considerably slower then multiplying a contiguous array. The least-angle regression code in scikits.learn avoids this by swapping columns towards the front of the array as they are chosen, so we can replace X[:, idx] with X[:, :n_active]. The nice part is that if the array is stored in Fortran-contiguous order (ie. column contiguous order, as opposed to row contiguous order, as in C), swapping two columns is a very fast operation!. Let’s see some more benchmarks!

[sourcecode language=”python”]
In [17]: X = np.random.randn(5000, 5000)

In [18]: Y = X.copy(‘F’) # fortran-ordered

In [19]: a, b = 1000, 2500

In [20]: swap, = linalg.get_blas_funcs((‘swap’,), (X,))

In [21]: timeit X[:, a], X[:, b] = swap(X[:, a], X[:, b])
100 loops, best of 3: 6.29 ms per loop

In [22]: timeit Y[:, a], Y[:, b] = swap(Y[:, a], Y[:, b])
10000 loops, best of 3: 111 us per loop
 [/sourcecode]

We can see that using Fortran-order takes us from the order of miliseconds to the order of microseconds!

Side note: I almost fell into the trap of swapping columns the pythonic way. That doesn’t work:
[sourcecode language=”python”]
In [23]: X[:, a], X[:, b] = X[:, b], X[:, a]

In [24]: np.testing.assert_array_equal(X[:, a], X[:, b])

In [25]:
 [/sourcecode]

However this trick works great for swapping elements of one-dimensional arrays.

Another small optimization that we can do: I found that on my system, it’s slightly faster to compute the norm using the BLAS function nrm2. So by putting all of these together, we end up with the final version of our code:

[sourcecode language=”python”]
def cholesky_omp(X, y, n_nonzero_coefs, eps=None, overwrite_X=False):
if not overwrite_X:
X = X.copy(‘F’)
else: # even if we are allowed to overwrite, still copy it if bad order
X = np.asfortranarray(X)

min_float = np.finfo(X.dtype).eps
nrm2, swap = linalg.get_blas_funcs((‘nrm2’, ‘swap’), (X,))
potrs, = get_lapack_funcs((‘potrs’,), (X,))

indices = range(len(Gram)) # keeping track of swapping
alpha = np.dot(X.T, y)
residual = y
n_active = 0

max_features = X.shape1 if eps is not None else n_nonzero_coefs
L = np.empty((max_features, max_features), dtype=X.dtype)
L[0, 0] = 1.

while True:
lam = np.abs(np.dot(X.T, residual)).argmax()
if lam < n_active or alpha[lam] ** 2 < min_float:
# atom already selected or inner product too small
warn(“Stopping early”)
break
if n_active > 0:
# Updates the Cholesky decomposition of X’ X
L[n_active, :n_active] = np.dot(X[:, :n_active].T, X[:, lam])
solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
v = nrm2(L[n_active, :n_active]) ** 2
if 1 - v <= min_float: # selected atoms are dependent
warn(“Stopping early”)
break
L[n_active, n_active] = np.sqrt(1 - v)
X.T[n_active], X.T[lam] = swap(X.T[n_active], X.T[lam])
alpha[n_active], alpha[lam] = alpha[lam], alpha[n_active]
indices[n_active], indices[lam] = indices[lam], indices[n_active]
n_active += 1
# solves LL’x = y as a composition of two triangular systems
gamma, _ = potrs(L[:n_active, :n_active], alpha[:n_active], lower=True,
 overwrite_b=False)

residual = y - np.dot(X[:, :n_active], gamma)
if eps is not None and nrm2(residual) ** 2 <= eps:
break
elif n_active == max_features:
 break

return gamma, indices[:n_active]
 [/sourcecode]

Now, the benchmark at [1] indicates victory over least-angle regression! I hope you have enjoyed this short tour. See you next time!

1[]
2[]

Comments !