An overview of dictionary learning: Terminology

My GSoC proposal is titled “Dictionary learning in scikits.learn” and in the project, I plan to implement methods used in state of the art research and industry applications in signal and image processing. In this post, I want to clarify the terminology used.

Usually the terms dictionary learning and sparse coding are used interchangably. Also my proposal contains methods such as Sparse PCA which are technically not sparse coding but closely related problems.

The basic idea is the approximation of a signal vector [latex] x \in \mathbb{R}\^d[/latex] by a linear combination of components, as good as possible, under certain constraints. This can be formulated as a basic (unconstrained) loss function measuring the quality of the approximation: [latex] \mathcal{L}(x, D, \alpha) = \big|\big|x - D\alpha\big|\big|\^2_2, D \in \mathbb{R}\^{d \times r}, \alpha \in \mathbb{R}\^r[/latex], where [latex] r[/latex] is the dimension of the dictionary (the number of components).

When working with a dataset of more signal vectors, the overall basic loss for such an approximation is [latex] \mathcal{L}(X, D, A) = \sum_{i = 1}\^N \mathcal{L}(x_i, D, \alpha_i) = \big|\big|X - DA\big|\big|\^2_F[/latex]. Minimizing such a loss function amounts to finding the closest (in the Frobenius sense) matrix factorization [latex] DA[/latex] that approximates the data matrix [latex] X[/latex]

This generic problem is called a matrix factorization problem. Many classical problems are matrix factorization problems with additional constraints. For example, PCA is a matrix factorization that constrains [latex] D[/latex] to be orthogonal. NMF constrains both [latex] D[/latex] and [latex] A[/latex] to have no negative elements. Sparse variants of these two decompositions are useful for getting local components, such as parts of faces. These are obtained by adding an aditional constraint on the dictionary columns.

It can be sometimes useful to consider the dictionary fixed. The signal processing community has introduced over the years many such dictionaries, for examples wavelets. These are used, for example, in the JPEG2000 compression standard.

A very useful represenation is when the dictionary is overcomplete ([latex] r > d[/latex]). The wavelets are an example of this. Given such a dictionary, we are interested in an efficient encoding of a vector [latex] x[/latex], in the sense of sparseness: we want to use as few dictionary components as possible in our representation. Such a solution is the vector [latex]\alpha[/latex] minimizing [latex] \mathcal{L}(x, D, \alpha) + \lambda\big|\big|\alpha\big|\big|_1[/latex] but other sparsity-inducing constraints can be used. Such a vector is a sparse coding of [latex] x[/latex] and it can be solved using algorithms such as least-angle regression and orthogonal matching pursuit.

However, we are not limited to using precomputed dictionaries. The term dictionary learning refers to methods of inferring, given [latex] X[/latex], a (usually overcomplete) dictionary that will perform good at sparsely encoding the data in [latex] X[/latex]. Such methods are more expensive than using precomputed dictionaries, but they perform better, since the dictionary is optimized for the current dataset.

Because usually such loss functions are non-convex in [latex] D[/latex] and [latex] A[/latex] simultaneously, dictionary learning algorithms alternate between minimizing each while keeping the other fixed. The step that minimizes [latex] D[/latex] is sometimes called the dictionary update step, and the one minimizing [latex] A[/latex] is (similarily to the case where the dictionary is always fixed) the sparse coding step. Dictionary learning algorithms differ in the method used for each of this step.

To resume, many problems can be posed as matrix factorization problems. Depending on the constraints imposed, the problem becomes interesting for different applications. Dictionary learning is very good for image reconstruction. Matrix decompositions with sparse undercomplete dictionaries such as Sparse PCA can be used to find local features that constitute the dataset, for example parts of faces, for a dataset of facial images. NMF can be used in both under and overcomplete settings and it offers a good model for additive data such as text or images. We are interested in these variants and they are planned for implementation in my GSoC proposal.

Julien Mairal’s presentation of his work in this domain, available here, shows the theoretical background of such methods, along with examples showing state of the art results in image processing.

Comments !