
Current Projects
PhD
During my PhD, in 1985, I proposed and
published (in French), an early version of the learning algorithm known
as error backpropagation
[LeCun, 1985]
[LeCun, 1986].
Although a special form of my algorithm was equivalent to backpropagation
it was based on backpropagating "virtual target values" for the hidden
units rather than gradients. I had actually proposed the algorithm during
my Diplome d'Etudes Approfondies work in 1983/1984 but had no
theoretical justification for it back then.
The final form of the backpropagation algorithm has had a profound impact on
the fields of Machine Learning, and was the primary cause for the surge
of interest in Neural Networks in the mid 80's. Subsequently, I refined the
backpropagation algorithm, and proposed a consistent theoretical framework
for its analysis. I applied backpropagation to handwriting recognition,
medical diagnosis of abdominal pains, and intron/exon splicing site
detection in DNA sequences
[LeCun, 1987].
For the medical diagnosis application, I designed a network with
specific groups of hidden units to represent "syndromes". Connections
were established from symptoms (inputs) to syndroms, and from
syndromes to diagnoses (outputs) if there was a known causal relationship
between them (an early form of graphical models with logistic nodes, I guess).
I also applied multilayer nets to associative data retrieval
[LeCun and FogelmanSoulie, 1987]
[Gallinari et al., 1987]
[FogelmanSoulié et al., 1987c]
[FogelmanSoulié et al., 1987b]
[FogelmanSoulié
et al., 1987a]
At the end of my PhD, my friend
Leon Bottou and I
started writing a software system called SN to experiment with machine
learning and neural networks. It was built around a homegrown Lisp
interpreter that eventually morphed into the
Lush language.
I completed this software while I was a Research
Associate in Geoff Hinton's group at the University of Toronto.
[LeCun and Bottou, 1988]
[Bottou and LeCun, 1988]
and has been using its various embodiments ever since.
Learning Theory, Learning AlgorithmsDuring my PostDoc in Toronto, I extended and published my thesis work on the theoretical foundation of backpropagation, and showed that backpropagation and several of its generalizations could be derived rigorously using Lagrange functions [LeCun, 1988], reprinted in [LeCun, 1992].I also briefly worked on a new "perturbative" learning algorithm called GEMINI [LeCun, Gallant, and Hinton, 1989e]. With various colleagues, I published a theoretical analysis of learning curves [Solla, LeCun, 1991]; proposed a precursor of a now common Bayesian methods for producing class probability estimates from the output of a neural network [Denker, LeCun 1991]; and experimented with a practical boosting method for training and combining the outputs of multiple pattern recognition systems [Drucker et al. 1994]. I also worked on a method for estimating the effective capacity (or effective VCdimension) of a learning machine by fitting the training error and testing error curves [Vapnik, Levin, LeCun 1994]. One of my longtime interests is the inherently subjective nature of the notions of information, probability, complexity, and entropy (including the definitions of those used in physics). John Denker and I wrote a short paper on the subject [Denker, LeCun, 1993].
The Dynamics of (fast) Neural Net Training
There are many hacks in the literature to accelerate the convergence
of learning algorithms for multilayer neural nets. There have been
many claims that secondorder methods ought to converge faster than
plain gradient descent. Unfortunately, most second order methods found
in the classical nonlinear optimization litterature are practically
useless for neuralnet training because they are all designed to work
in "batch" mode (with accurate gradients and line searches). Neural
net training is known to be considerably faster with "online"
learning (also called stochastic gradient) where the parameters are
updated after every training sample.
Therefore, during my PhD thesis, I proposed a stochastic version of the LevenbergMarquardt algorithm with a diagonal approximation of the Hessian. The Diagonal Hessian in neural nets was shown to be very easy to compute with a "backprop" like procedure [LeCun, 1987]. Comparisons with classical nonlinear optimization algorithms have been performed [Becker and LeCun, 1989]. Recent descriptions and analyses of the stochastic diagonal LevenbergMarquardt method can be found in [LeCun et al. 1998] and [LeCun, Bottou, Orr, Muller 1998]. With Ido Kanter and Sara Solla, I published a formula for the distribution of eigenvalues of certain types of random matrices (covariance matrices of random vectors). This result was used to predict the convergence speed of adaptive filter algorithms and neuralnet learning procedures [LeCun, Kanter, Solla 1991] [Kanter, LeCun, Solla 1991]. A surprising result is that Linear Least Square converges fastest when the ratio of number of samples to number of input variables is about 4. A major headache when training neural nets is how to choose the step size (learning rate) of the learning algorithm. I proposed a simple (and well justified) method for automatically computing (and adjusting) the step size for gradientbased neuralnet training algorithms. It is based on a very efficient method for estimating and tracking the largest eigenvalue of the Hessian matrix of a neural nets error surface (without computing the Hessian itself) [LeCun, Simard, Pearlmutter 1993]. A recent tutorial paper on my work on efficient neural net training can be found in [LeCun, Bottou, Orr, Muller 1998].
Learning Algorithms, Generalization, and RegularizationI proposed a "pruning" (weight elimination) methods for neural networks called Optimal Brain Damage that has become quite popular over the years [LeCun, Denker, Solla, 1990].Harris Drucker and I worked on a regularization method called double backpropagation that improves the smoothness of the functions learned by multilayer neural nets (and therefore their robustness to small distortions of the inputs) [Drucker, LeCun 1992] [Drucker, LeCun 1991a] [Drucker, LeCun 1991b].
Later, Patrice Simard and I (with J. Denker and B. Victorri) found a
regularization method called Tangent Prop that can force neural nets
to become invariant (or robust) to a set of chosen transformations of
the inputs.
[Simard, Victorri, LeCun, Denker 1992]
[Simard, LeCun, Denker, Victorri 1992].
A recent review paper on Tangent Prop (and Tangent Distance) is available
[Simard, LeCun, Denker, Victorri 1998].
Invariant Recognition: Convolutional Neural Networks
Other Classification Algorithms: Tangent Distance, Boosting, SVM,....Patrice Simard, John Denker and I worked on a method called Tangent Distance for measuring similarities between shapes while being robust to small distortions and displacements of the input pattern. We applied this method to handwriting recognition [Simard, LeCun, Denker 1993] [Simard, LeCun, Denker 1994]. A recent and rather complete paper on Tangent Prop and Tangent Distance is available [Simard, LeCun, Denker, Victorri 1998].
Applications: Zipcode and Check Readers
Applications: Online Handwriting Recognition
Another line of work, in collaboration with Isabelle Guyon, concerned
handwriting recognizers for penbased computers using the idea of
"Subsampled TimeDelay Neural Network" (a special type of temporal
convolutional net) to recognize the trajectory of a stylus
[Guyon et al. 1991]
[Guyon et al. 1990]
[Guyon et al. 1992].
I also worked on an automatic signature verification system based on siamese networks that compute distances between objects [Bromley et al. 1994] [Bromley et al. 1993]. I briefly worked on applying similar ideas to Chinese handwriting recognition [Wu, LeCun, Jackel, Jeng 1993],
Later, Yoshua Bengio and I started a new line of work on online
handwriting recognition based on "pictorial" representations
of the written words rather than on the temporal pen trajectories.
Word images enhanced with the stroke direction information were fed
to convolutional nets with great success. We also found principled
ways of combining neural nets with Hidden Markov Model. This method
allowed us to train the entire system with a discriminative
criterion at the word level.
[Bengio, LeCun, Nohl, Burges 1995]
[LeCun, Bengio 1994]
[Bengio, LeCun 1994]
[Bengio, LeCun, Henderson 1994]
A few cute applications were built around this system, including a penbased
visitor registration system for corporate buildings
[Matic, Henderson, LeCun,
Bengio 1994].
We realized after a while that, although handwriting recognition is an interesting problem in itself, it is also a very inefficient way to communicate with a computer, except in a small number of very specific situations [Guyon, Warwick, LeCun 1996]. So we lost interest in the problem. But by working on it, we had figured out how to train a large and complex system at the word level without requiring manual segmentation. Those ideas led to the Graph Transformer Network concept and were immediately applied to recognizing bank checks.
HMM/Neural Net Hybrids, Graph Transformer Networks, Global Training of Heterogeneous Learning Machines
Document Image Compression and Digital LibrariesBetween 1996 and 2001, much of my work has been centered on the DjVu project. DjVu is an image compression technology and a software platform for distributing highresolution scanned documents over the Web.At the end of 1995, AT&T decided to spin off Lucent and NCR. Mosaic and the World Wide Web were only about 2 years old, and, as a book lover, I thought (naively) that digital libraries would become one of the Web's killer apps (naturally, just like with TV, the Web became the ultimate tool for entertainment and commerce more than for education and culture). I had just become head of Image Processing Research in the newly created AT&T Labs, and felt that our work on handwriting recognition had reached a point of diminishing return (particularly since NCR, the main outlet for our research, was spun off). So, I was looking for ways to apply our expertise in document understanding to Webbased digital libraries. I soon realized that there was no good way of distributing highresolution color images of scanned documents on the web. There was significant expertise in the Lab on compressing B&W textual images (with Paul Howard's work on JBIG2), and continuoustone images (with various work on JPEG, MPEG, and wavelets) [Cox et al. 1998] [Haskell et al. 1998]. I thought that it would be nice if we could segment the text from the pictures and backgrounds so as to compress them separately. DjVu was born. DjVu can compress scanned documents at 300dpi in B&W to 5KB30KB per page (38 times better than TIFF  Group IV), and color documents to 30KB100KB per page (510 times better than JPEG) [Bottou et al. 2000] [Haffner et al. 1999] [Haffner, Bottou, Howard, LeCun 1999] [LeCun, Bottou, Haffner, Howard 1998] [Bottou et al. 1998] [Haffner et al. 1998] [Cox et al. 1998]. Our first research prototype was ready at the end of 1996, our first practical system at the end of 1997, and our first public release of free compressors and browser plugins in March 1998. Commercial versions appeared in early 1999. Version 3.0 of the software appeared in summer 2000 and is commercialized by LizardTech, who partnered with AT&T. DjVu is now used by hundreds of website around the world to distribute documents (mostly scanned documents). I maintain several DjVurelated web sites including DjVuZone, Any2DjVu, Bib2Web, and DjVuLibre.
Face Detection
More recently, we developed a realtime system that can detect human faces in images and simultaneously estimate their pose. This system uses convolutional networks trained with a loss function derived from the "EnergyBased Model" concept [Osadchy, Miller, and LeCun, 2004].
Object Detection and Recognition
Robotics
EnergyBased Models
EnergyBased Models (EBMs) are a nonprobabilistic form of Graphical
Model (more precisely, a unnormalized form of Factor Graph).
They associate a scalar energy to configurations of variables
that are observed and variables that must be predicted.
Given the values of the observed variables, the inference process in an EBM consists in finding the value of the variables to be predicted that minimizes the energy. Learning consists in "shaping" the energy landscape so that it has minima near the desired values of the variables to be predicted. We characterize the type of loss functions that, when minimized with respect to the trainable parameters of the system, will shape the energy landscape in suitable ways [LeCun and Huang, 2005]. We demonstrate an application to face detection and pose estimation [Osadchy, Miller, and LeCun, 2004].

Yann LeCun, Professor
The Courant Institute of Mathematical Sciences
Room 1221, 715 Broadway, New York, NY 10012, USA
tel: (212)9983283
Copyright © 20002004 Yann LeCun.