EDIT: There was a bug in the final version of the code presented here. It is fixed now; for its backstory, check out my blog post on it.

When we last saw our hero, he was fighting with the dreaded implementation of least-angle regression, knowing full well that it was his destiny to be faster.

We had come up with a more robust implementation, one that caught malformed cases that would have broken the naive implementation while also being orders of magnitude faster than it. However, our benchmark [1] showed that it was still a couple of times slower than least-angle regression.

By poking around the `scikits.learn` codebase, I noticed that there is a triangular system solver in `scikits.learn.utils.arrayfuncs`. Unlike the `scipy.linalg` one, this one only works with lower triangular arrays, and it forcefully overwrites `b`. Even if it weren’t faster, it should still be used: `scikits.learn` aims to be as backwards-compatible with SciPy as possible, and `linalg.solve_triangular` was only added in SciPy 0.9.0. Anyway, let’s just see whether it’s faster:

[sourcecode language="python"]
In [1]: import numpy as np

In [2]: from scipy import linalg

In [3]: from scikits.learn.datasets import make_spd_matrix

In [4]: from scikits.learn.utils.arrayfuncs import solve_triangular

In [5]: G = make_spd_matrix(1000)

In [6]: L = linalg.cholesky(G, lower=True)

In [7]: x = np.random.randn(1000)

In [8]: y = x.copy()

In [9]: timeit solve_triangular(L, x)
100 loops, best of 3: 3.45 ms per loop

In [10]: timeit linalg.solve_triangular(L, y, lower=True, overwrite_b=True)
10 loops, best of 3: 134 ms per loop
[/sourcecode]

Wow! That’s 40x faster. We’re killing two birds with one stone here, so let’s make the change! Notice that we can just copy [latex] \mathbf{v}[/latex] into the appropriate place in [latex] L[/latex] and then solve in place.
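To make that concrete, here is a minimal sketch of the update (my own illustration in plain NumPy/SciPy, not the `scikits.learn` code): copy the new inner products into the next row of [latex] L[/latex], solve the triangular system in place, and fill in the diagonal.

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
A = rng.randn(20, 4)
G = np.dot(A.T, A)      # SPD Gram matrix of four "atoms"

# suppose the factor of the leading 3x3 block is already known
L = np.zeros((4, 4))
L[:3, :3] = linalg.cholesky(G[:3, :3], lower=True)

# bring in the fourth atom: copy v = G[3, :3] into the next row of L,
# solve the triangular system in place, then fill in the diagonal
L[3, :3] = linalg.solve_triangular(L[:3, :3], G[3, :3], lower=True)
L[3, 3] = np.sqrt(G[3, 3] - np.dot(L[3, :3], L[3, :3]))

print(np.allclose(L, linalg.cholesky(G, lower=True)))  # True
```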

But whoops! When solving the [latex] LL’[/latex] system, we take advantage of the `trans` argument of `linalg.solve_triangular`, which the `scikits.learn` version does not expose. We could work around this, but here’s a better idea: shouldn’t there be some way to directly solve the entire system in one go?

Well, there is. It is a LAPACK function by the name of `potrs`. If you are not aware, LAPACK is a Fortran library with solvers for various types of linear systems and eigenproblems. Together with BLAS (on which it is built), LAPACK powers pretty much all the scientific computation that happens; BLAS is an API with multiple implementations dating from 1979, while LAPACK dates from 1992. If you have ever used Matlab, this is what was being called behind the scenes. SciPy, again, provides a high-level wrapper around this: the `linalg.cho_solve` function.
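As a quick sanity check (my own example, not part of the original benchmark), `cho_solve` takes the precomputed lower factor and solves the full system in one call:

```python
import numpy as np
from scipy import linalg

rng = np.random.RandomState(0)
A = rng.randn(50, 10)
G = np.dot(A.T, A)        # symmetric positive definite
b = rng.randn(10)

L = linalg.cholesky(G, lower=True)
x = linalg.cho_solve((L, True), b)  # the True flag says "lower factor"

print(np.allclose(np.dot(G, x), b))  # True
```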

But SciPy also gives us the possibility to import functions directly from LAPACK, through the use of `linalg.lapack.get_lapack_funcs`. Let’s see how the low-level LAPACK function compares to the SciPy wrapper, for our use case:

[sourcecode language="python"]
In [11]: x = np.random.randn(1000)

In [12]: y = x.copy()

In [13]: timeit linalg.cho_solve((L, True), x)
1 loops, best of 3: 95.4 ms per loop

In [14]: potrs, = linalg.lapack.get_lapack_funcs(('potrs',), (G,))

In [15]: potrs
Out[15]: <fortran object>

In [16]: timeit potrs(L, y)
100 loops, best of 3: 9.49 ms per loop
[/sourcecode]

That’s 10 times faster! So now we have an obvious way to optimize the code:

[sourcecode language="python"]
from warnings import warn

import numpy as np
from scipy.linalg.lapack import get_lapack_funcs
from scikits.learn.utils.arrayfuncs import solve_triangular


def cholesky_omp(X, y, n_nonzero_coefs, eps=None):
    min_float = np.finfo(X.dtype).eps
    potrs, = get_lapack_funcs(('potrs',), (X,))
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0
    idx = []

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while 1:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam < n_active or alpha[lam] ** 2 < min_float:
            # atom already selected or inner product too small
            warn("Stopping early")
            break
        if n_active > 0:
            # Updates the Cholesky decomposition of X' X
            L[n_active, :n_active] = np.dot(X[:, idx].T, X[:, lam])
            solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
            d = np.dot(L[n_active, :n_active].T, L[n_active, :n_active])
            if 1 - d <= min_float:  # selected atoms are dependent
                warn("Stopping early")
                break
            L[n_active, n_active] = np.sqrt(1 - d)
        idx.append(lam)
        n_active += 1
        # solve LL'x = y in two steps:
        gamma, _ = potrs(L[:n_active, :n_active], alpha[idx], lower=True,
                         overwrite_b=False)
        residual = y - np.dot(X[:, idx], gamma)
        if eps is not None and np.dot(residual.T, residual) <= eps:
            break
        elif n_active == max_features:
            break

    return gamma, idx
[/sourcecode]

Woohoo! But we still lag behind. Now that we have delegated the trickiest parts of the code to fast and reliable solvers, it’s time to run a profiler and see what the bottleneck is now. Python has excellent tools for this purpose; what solved the problem in this case was `line_profiler` [2]. There is a great article by Olivier Grisel [2] regarding how to use these profilers. I’m just going to say that `line_profiler`’s output is very helpful, basically printing the time taken by each line of code next to that line.

Running the profiler on this code, we found that 58% of the time is spent computing the correlations `np.dot(X.T, residual)`, and 20.5% each on the other two matrix products (the Cholesky update row and the residual update), with the rest being insignificant (`potrs` takes 0.1%!). The code is clearly dominated by the matrix multiplications. By running some more timings with IPython, I found that multiplying such column-wise views of the data as `X[:, idx]` is considerably slower than multiplying a contiguous array. The least-angle regression code in `scikits.learn` avoids this by swapping columns towards the front of the array as they are chosen, so we can replace `X[:, idx]` with `X[:, :n_active]`. The nice part is that if the array is stored in Fortran-contiguous order (i.e. column-major order, as opposed to the row-major order used in C), swapping two columns is a very fast operation! Let’s see some more benchmarks!

[sourcecode language="python"]
In [17]: X = np.random.randn(5000, 5000)

In [18]: Y = X.copy('F')  # fortran-ordered

In [19]: a, b = 1000, 2500

In [20]: swap, = linalg.get_blas_funcs(('swap',), (X,))

In [21]: timeit X[:, a], X[:, b] = swap(X[:, a], X[:, b])
100 loops, best of 3: 6.29 ms per loop

In [22]: timeit Y[:, a], Y[:, b] = swap(Y[:, a], Y[:, b])
10000 loops, best of 3: 111 us per loop
[/sourcecode]

We can see that using Fortran order takes us from the order of milliseconds to the order of microseconds!
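If you are unsure which layout an array has, the `flags` attribute tells you, and `np.asfortranarray` converts it. A small illustration of my own:

```python
import numpy as np

X = np.zeros((3, 3))           # NumPy defaults to C (row-major) order
Y = np.asfortranarray(X)       # column-major copy, so column swaps are cheap

print(X.flags['F_CONTIGUOUS'])  # False
print(Y.flags['F_CONTIGUOUS'])  # True
```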

Side note: I almost fell into the trap of swapping columns the pythonic
way. That doesn’t work:

[sourcecode language=”python”]

In [23]: X[:, a], X[:, b] = X[:, b], X[:, a]

In [24]: np.testing.assert_array_equal(X[:, a], X[:, b])

In [25]:

[/sourcecode]

(Both columns end up equal, because the right-hand side of the assignment holds views into `X`, not copies.) However, this trick works great for swapping elements of one-dimensional arrays.
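For example (my own snippet), scalar indexing returns copies rather than views, so the tuple assignment does what you expect on a 1-D array:

```python
import numpy as np

x = np.arange(5)
x[0], x[3] = x[3], x[0]   # scalars are copied before assignment, so this is safe
print(x)                  # [3 1 2 0 4]
```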

Another small optimization: I found that on my system, it’s slightly faster to compute the norm using the BLAS function `nrm2`.
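Whether that wins on your machine may vary, but it’s easy to check that the BLAS routine agrees with the NumPy expression (a small check of my own):

```python
import numpy as np
from scipy import linalg

x = np.random.randn(1000)
nrm2, = linalg.get_blas_funcs(('nrm2',), (x,))

# nrm2 returns the Euclidean norm, so its square is the inner product x'x
print(np.allclose(nrm2(x) ** 2, np.dot(x, x)))  # True
```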
So by putting all of these together, we end up with the final version of
our code:

[sourcecode language="python"]
from warnings import warn

import numpy as np
from scipy import linalg
from scipy.linalg.lapack import get_lapack_funcs
from scikits.learn.utils.arrayfuncs import solve_triangular


def cholesky_omp(X, y, n_nonzero_coefs, eps=None, overwrite_X=False):
    if not overwrite_X:
        X = X.copy('F')
    else:  # even if we are allowed to overwrite, still copy it if bad order
        X = np.asfortranarray(X)

    min_float = np.finfo(X.dtype).eps
    nrm2, swap = linalg.get_blas_funcs(('nrm2', 'swap'), (X,))
    potrs, = get_lapack_funcs(('potrs',), (X,))

    indices = range(X.shape[1])  # keeping track of swapping
    alpha = np.dot(X.T, y)
    residual = y
    n_active = 0

    max_features = X.shape[1] if eps is not None else n_nonzero_coefs
    L = np.empty((max_features, max_features), dtype=X.dtype)
    L[0, 0] = 1.

    while True:
        lam = np.abs(np.dot(X.T, residual)).argmax()
        if lam < n_active or alpha[lam] ** 2 < min_float:
            # atom already selected or inner product too small
            warn("Stopping early")
            break
        if n_active > 0:
            # Updates the Cholesky decomposition of X' X
            L[n_active, :n_active] = np.dot(X[:, :n_active].T, X[:, lam])
            solve_triangular(L[:n_active, :n_active], L[n_active, :n_active])
            v = nrm2(L[n_active, :n_active]) ** 2
            if 1 - v <= min_float:  # selected atoms are dependent
                warn("Stopping early")
                break
            L[n_active, n_active] = np.sqrt(1 - v)
        X.T[n_active], X.T[lam] = swap(X.T[n_active], X.T[lam])
        alpha[n_active], alpha[lam] = alpha[lam], alpha[n_active]
        indices[n_active], indices[lam] = indices[lam], indices[n_active]
        n_active += 1
        # solves LL'x = y as a composition of two triangular systems
        gamma, _ = potrs(L[:n_active, :n_active], alpha[:n_active],
                         lower=True, overwrite_b=False)
        residual = y - np.dot(X[:, :n_active], gamma)
        if eps is not None and nrm2(residual) ** 2 <= eps:
            break
        elif n_active == max_features:
            break

    return gamma, indices[:n_active]
[/sourcecode]

Now, the benchmark at [1] indicates victory over least-angle regression! I hope you have enjoyed this short tour. See you next time!
