We propose a convolutional form of sparse coding in which the input is a whole image (instead of a patch), the dictionary elements are convolution kernels (instead of vectors), and the code is a set of feature maps (instead of a vector). This is similar to Zeiler et al.'s deconvolutional network, except that we include an encoder that predicts the sparse code in an efficient feed-forward manner, as in PSD. With convolutional training, the system no longer needs to learn shifted versions of each filter, which makes the learned filters considerably more diverse than with patch-based sparse coding: they include center-surround filters, corner detectors, and cross detectors, in addition to the usual Gabor-like oriented filters. The method yields better recognition results than non-convolutional training on Caltech-101: 57.6% vs. 54.2% for a single-stage feature system, and 66.3% vs. 65.5% for a two-stage system. The method also gives excellent results on the Caltech Pedestrian detection dataset: an 11.5% miss rate at 1 false positive per image.
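To make the setup concrete, the sketch below shows what convolutional sparse coding implies in plain NumPy/SciPy: the image is reconstructed as a sum of convolutions of kernels with sparse feature maps, inference is a simple ISTA-style loop, and a feed-forward predictor stands in for the trainable PSD-style encoder. The function names, step size, and the soft-threshold nonlinearity are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_sparse_code(x, D, lam=0.1, n_iter=50, step=0.01):
    """Minimal ISTA-style inference for convolutional sparse coding.

    Minimizes  0.5 * ||x - sum_k D_k * z_k||^2 + lam * sum_k |z_k|_1,
    where * is 2-D convolution. Hypothetical sketch, not the paper's code.
    x : (H, W) input image
    D : (K, h, w) dictionary of convolution kernels
    Returns z : (K, H-h+1, W-w+1) sparse feature maps.
    """
    K, h, w = D.shape
    H, W = x.shape
    z = np.zeros((K, H - h + 1, W - w + 1))
    for _ in range(n_iter):
        # Reconstruct the whole image as a sum of kernel * feature-map convolutions.
        recon = sum(convolve2d(z[k], D[k], mode="full") for k in range(K))
        resid = x - recon
        for k in range(K):
            # Gradient step on z[k]: correlate the residual with kernel k.
            z[k] += step * convolve2d(resid, D[k][::-1, ::-1], mode="valid")
        # Soft-threshold (shrinkage) enforces sparsity of the feature maps.
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return z

def predict_code(x, We, bias=0.1):
    """Feed-forward approximation of the sparse code, in the spirit of PSD:
    one convolution per encoder filter followed by a shrinkage nonlinearity.
    We (K, h, w) and bias are hypothetical learned parameters."""
    pre = np.stack([convolve2d(x, We[k][::-1, ::-1], mode="valid")
                    for k in range(len(We))])
    return np.sign(pre) * np.maximum(np.abs(pre) - bias, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32))
    D = 0.1 * rng.standard_normal((8, 9, 9))   # 8 random 9x9 kernels
    z = conv_sparse_code(x, D)
    print(z.shape, (np.abs(z) > 0).mean())     # feature-map shape, fraction nonzero
```

Because the feature maps cover the whole image, a filter translated by a few pixels is redundant with a shift of the code, which is why convolutional training frees the dictionary to spend its capacity on genuinely different filter shapes.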