# K-Means for dictionary learning

![Dictionary learned with K-Means on the LFW dataset with whitening PCA](http://localhost:8001/wp-content/uploads/2011/07/kmeans_w.png) ![Dictionary learned with K-Means on the LFW dataset without whitening PCA](http://localhost:8001/wp-content/uploads/2011/07/kmeans_no_w.png)

One of the simplest, and yet most heavily constrained, forms of matrix factorization is vector quantization (VQ). Heavily used in image and video compression, the VQ problem is a factorization $X = WH$ where $H$ (our dictionary) is called the codebook and is designed to cover the cloud of data points effectively, and each row of $W$ is a standard basis vector (a single one, all other entries zero).

This means that each data point $x_i$ is approximated as $x_i \approx h_{k} = \sum_{j=1}^{r} \delta_{kj}h_{j}$. In other words, the row vector (codeword, or dictionary atom) $h_k$ of $H$ closest to $x_i$ is chosen as its approximation, and this choice is encoded as the standard basis vector $(\delta_{k1}, \ldots, \delta_{kr})$. The data representation $W$ is composed of such vectors.
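Concretely, the hard assignment can be sketched in a few lines of NumPy (the random data and codebook here are just placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))          # 100 data points in 8 dimensions
H = rng.normal(size=(16, 8))           # codebook of r = 16 codewords (rows)

# Assignment: each x_i goes to its closest codeword h_k (Euclidean distance)
dists = ((X[:, None, :] - H[None, :, :]) ** 2).sum(axis=-1)
k = dists.argmin(axis=1)               # index of the chosen atom per point

# Each row of W is a standard basis vector: W[i, j] = delta_{k_i j}
W = np.zeros((X.shape[0], H.shape[0]))
W[np.arange(X.shape[0]), k] = 1.0

X_approx = W @ H                       # row i is exactly h_{k_i}
```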

There is a variation called gain-shape VQ where, instead of approximating each point by one of the codewords, we allow invariance to scalar multiplication: $x_i \approx \alpha_i h_k$. This model requires more storage (each data point needs a floating point number and an unsigned index, as opposed to just the index), but it leads to a much better approximation.
Gain-shape VQ can equivalently be accomplished by normalizing each data vector prior to fitting the codebook.
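A sketch of gain-shape coding against a unit-norm codebook (again with placeholder data): for unit-norm atoms the optimal gain is just the inner product $\alpha_i = \langle x_i, h_k \rangle$, and the best atom is the one with the largest absolute correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
H = rng.normal(size=(16, 8))
H /= np.linalg.norm(H, axis=1, keepdims=True)   # unit-norm codewords

# Pick the atom maximizing |<x, h_j>|; the optimal gain is that inner product
corr = X @ H.T
k = np.abs(corr).argmax(axis=1)
alpha = corr[np.arange(len(X)), k]              # one scalar per data point
X_gs = alpha[:, None] * H[k]                    # gain-shape approximation

# Plain VQ on the same codebook can never reconstruct better
d = ((X[:, None, :] - H[None, :, :]) ** 2).sum(-1)
X_vq = H[d.argmin(axis=1)]
err_gs = ((X - X_gs) ** 2).sum()
err_vq = ((X - X_vq) ** 2).sum()
```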

In order to fit a codebook $H$ for efficient VQ use, the K-Means clustering [1] algorithm is a natural choice. K-Means is an iterative algorithm that refines the positions of $k$ cluster centers in the data space until convergence. The cluster centers are initialized in a random or procedural fashion; then, at each iteration, every data point is assigned to its closest cluster center, and each center is subsequently moved to the mean of the points assigned to it.
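The two alternating steps can be sketched directly in NumPy (random initialization from the data, placeholder inputs):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))          # stand-in for vectorized image patches
k = 16
centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points

for _ in range(100):
    # Assignment step: each point goes to its closest center
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    labels = d.argmin(axis=1)
    # Update step: move each center to the mean of its assigned points
    new_centers = np.array(
        [X[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
         for j in range(k)])
    if np.allclose(new_centers, centers):   # converged
        break
    centers = new_centers
```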

The scikits.learn.decomposition.KMeansCoder object from our work-in-progress dictionary learning toolkit can learn a dictionary from image patches using the K-Means algorithm, with optional local contrast normalization and a PCA whitening transform. Using a trained object to transform data points with orthogonal matching pursuit and the parameter n_atoms=1 is equivalent to gain-shape VQ. Of course, you are free to use any sparse coding method, such as LARS. The code used to produce the example images at the top of this post can be found in [2].
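Since KMeansCoder is still work-in-progress code, a rough sketch of the same pipeline assembled from standard building blocks might look like the following (import paths are from current scikit-learn rather than the scikits.learn namespace, the patch data is a stand-in, and all parameter values are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
patches = rng.normal(size=(1000, 64))  # stand-in for flattened 8x8 patches

# Local contrast normalization: center each patch, scale to unit norm
P = patches - patches.mean(axis=1, keepdims=True)
P /= np.linalg.norm(P, axis=1, keepdims=True) + 1e-8

# PCA whitening, then K-Means on the whitened patches
pca = PCA(n_components=32, whiten=True).fit(P)
km = KMeans(n_clusters=100, n_init=3, random_state=0).fit(pca.transform(P))

# Map the centroids back to pixel space to visualize the dictionary atoms
dictionary = pca.inverse_transform(km.cluster_centers_)
```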

Using K-Means to learn the dictionary does not optimize over linear combinations of dictionary atoms, as standard dictionary learning methods do. However, it is considerably faster, and Adam Coates and Andrew Ng suggest in [3] that as long as the dictionary contains a large enough number of atoms and covers the cloud of data points (and of future test points) well enough, K-Means, or even random sampling of image patches, can perform remarkably well for some tasks.

[1] [Wikipedia article on K-Means clustering][]
[2] [K-Means Coder example][]
[3] [**The importance of encoding versus training with sparse coding and vector quantization**, Adam Coates and Andrew Y. Ng. In Proceedings of the Twenty-Eighth International Conference on Machine Learning, 2011.][]
