We propose a convolutional form of sparse coding in which the input is a whole image (instead of a patch), the dictionary elements are convolution kernels (instead of vectors), and the code is a set of feature maps (instead of a vector). This is similar to Zeiler et al.'s deconvolutional network, except that we include an encoder that predicts the sparse code in an efficient feed-forward manner, as in PSD. With convolutional training, the system no longer needs to learn shifted versions of each filter, which makes the learned filters considerably more diverse than with patch-based sparse coding: they include center-surround filters, corner detectors, and cross detectors, in addition to the usual Gabor-like oriented filters. The method yields better recognition results than non-convolutional training on Caltech-101: 57.6% vs. 54.2% for a single-stage feature system, and 66.3% vs. 65.5% for a two-stage system. The method also gives excellent results on the Caltech Pedestrian detection dataset: an 11.5% miss rate at 1 false positive per image.
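To make the setup concrete, the sketch below shows what convolutional sparse coding implies in plain NumPy/SciPy: the image is reconstructed as a sum of convolutions of kernels with sparse feature maps, inference is a simple ISTA-style loop, and a feed-forward predictor stands in for the trainable PSD-style encoder. The function names, step size, and the soft-threshold nonlinearity are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_sparse_code(x, D, lam=0.1, n_iter=50, step=0.01):
    """Minimal ISTA-style inference for convolutional sparse coding.

    Minimizes  0.5 * ||x - sum_k D_k * z_k||^2 + lam * sum_k |z_k|_1,
    where * is 2-D convolution. Hypothetical sketch, not the paper's code.
    x : (H, W) input image
    D : (K, h, w) dictionary of convolution kernels
    Returns z : (K, H-h+1, W-w+1) sparse feature maps.
    """
    K, h, w = D.shape
    H, W = x.shape
    z = np.zeros((K, H - h + 1, W - w + 1))
    for _ in range(n_iter):
        # Reconstruct the whole image as a sum of kernel * feature-map convolutions.
        recon = sum(convolve2d(z[k], D[k], mode="full") for k in range(K))
        resid = x - recon
        for k in range(K):
            # Gradient step on z[k]: correlate the residual with kernel k.
            z[k] += step * convolve2d(resid, D[k][::-1, ::-1], mode="valid")
        # Soft-threshold (shrinkage) enforces sparsity of the feature maps.
        z = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return z

def predict_code(x, We, bias=0.1):
    """Feed-forward approximation of the sparse code, in the spirit of PSD:
    one convolution per encoder filter followed by a shrinkage nonlinearity.
    We (K, h, w) and bias are hypothetical learned parameters."""
    pre = np.stack([convolve2d(x, We[k][::-1, ::-1], mode="valid")
                    for k in range(len(We))])
    return np.sign(pre) * np.maximum(np.abs(pre) - bias, 0.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32))
    D = 0.1 * rng.standard_normal((8, 9, 9))   # 8 random 9x9 kernels
    z = conv_sparse_code(x, D)
    print(z.shape, (np.abs(z) > 0).mean())     # feature-map shape, fraction nonzero
```

Because the feature maps cover the whole image, a filter translated by a few pixels is redundant with a shift of the code, which is why convolutional training frees the dictionary to spend its capacity on genuinely different filter shapes.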