YOSHUA BENGIO: [INAUDIBLE].
Thank you [INAUDIBLE].
So I'll talk about [INAUDIBLE].
I'll talk about representations and learning
And the word deep here, I'll explain what it means.
So my goal is to contribute to building intelligent machines,
also known as AI.
And how do we get a machine to be smart--
to take good decisions?
Well, it needs knowledge.
[? researchers ?]
from the early days--
'50s, '60s, '70s--
tried to give the knowledge to the machine--
the knowledge we have explicitly.
And it didn't work quite as well as was hoped.
One reason is that a lot of our knowledge is not something
we can communicate verbally and we can
write that in a program.
So that knowledge has to be taken somewhere else.
And basically what we have found is you can get that
knowledge through observing the world around us.
That means learning.
OK-- so we need learning for AI.
What is learning?
What is machine learning?
It's not about learning things by heart.
That's just a fact.
What it is about is generalizing from the examples
you've seen to new examples.
And what I like to tell my students is it's taking
probability mass-- that is, on the training examples and
somehow guessing where it should go-- which new
configurations of the things we see make
sense or are plausible.
This is what learning is about.
At first we can measure [INAUDIBLE] we can guess.
And I'll mention something about dimensionality and
geometry that comes up when we think about this [INAUDIBLE].
And one of the messages will be that we can maybe fight
this [? dimensionality ?]
problem by allowing the machine to discover underlying
causes-- the underlying factors that explain the data.
And this is a little bit like [INAUDIBLE] is about.
So let's start from learning, an easy [INAUDIBLE]
Let's say we observe x,y pairs where x is a number--
y is a number.
And the stars here represent the examples we've seen of x,y pairs.
So we want to [? generalize ?] for new configurations.
In other words, for example, in this problem, typically we
want to predict a y given a new x.
And there's an underlying relationship between y and x,
meaning the expected value of the y given x, which is given
with this purple curve.
But we don't know it.
That's the problem with machine learning.
We're trying to discover something
we don't know already.
And we can guess some function.
This is the predicted or learned function.
So how could we go about this?
One of the most basic principles by which machine
learning algorithms are able to do this is assume something
very simple about the world around us-- about the data
we're getting or the function we're trying to discover.
It's just assuming that the function we're trying to
discover is smooth, meaning if I know the value of the
function at some point x, and I want to know the
value at some nearby point x prime, then it's reasonable to
assume that the value at x prime of the function we want to
learn is close to the value at x.
I mean, you can formalize that and [INAUDIBLE] in many
different ways and exploit it in many ways.
And what it means here is if I ask you what y
should be at this point--
what I'm going to do is look up the value of y that I
observed at nearby points.
And combining these--
make a reasonable guess like this one.
And if I do that on problems like this, it's actually going
to work quite well.
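The look-up-and-combine idea just described can be sketched in a few lines. This is only an illustration and not the method shown on the slides: the function name `knn_predict`, the choice of a plain average over the k nearest neighbors, and the sine-curve training set are all assumptions made for the example.

```python
import math


def knn_predict(xs, ys, x_new, k=3):
    """Predict y at x_new by averaging the y values of the k nearest observed x."""
    # Sort the training indices by distance to the query point.
    order = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    nearest = order[:k]
    return sum(ys[i] for i in nearest) / len(nearest)


# Hypothetical training set: noiseless samples of a smooth function.
xs = [i / 10 for i in range(63)]
ys = [math.sin(x) for x in xs]

# Query a point that was never observed, between two training points.
prediction = knn_predict(xs, ys, x_new=1.05, k=2)
print(prediction, math.sin(1.05))
```

Because the target here is smooth and densely sampled, the local average lands close to the true value; the same code would do poorly if the function had many more ups and downs than there are examples.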
And a large fraction of the applications that we're
seeing use this principle,
and often just this principle.
But if we only rely on this principle for generalization,
we're going to be in trouble.
That's one of the messages I want to explain here.
So why are we going to be in trouble?
Well, basically we're doing some kind of interpolation.
So if I see enough examples--
the green stars here-- to cover the ups and down of the
function I'm trying to learn, then I'm going to be fine.
But what if the function I want to learn has many more
ups and downs than I can possibly observe through data?
Because even Google has a finite number of examples.
Even if you have millions or billions of examples, the
functions we want to learn for AI are not like this one.
The number of configurations of the variables of interest
may be exponentially large.
So something maybe bigger than the number of
atoms in the universe.
So there's no way we're going to have enough examples to
cover all the configurations.
For example, think of the number of different English
sentences, which is something that Google is interested in.
And this problem is illustrated by the so-called
curse of dimensionality where you consider what happens when
you have not just one variable but many variables and all of
their configurations.
How many configurations of [? N ?] variables do you have?
Well, you have an exponential number of configurations.
So if I wanted to learn about a single
variable, I can just divide--
[? it ?]
[? takes ?] a real variable.
And I divide its value into intervals.
And I count how many times each of those bins comes up in my data.
I can estimate the probability of different intervals coming up.
So that's easy because I only want to know about a small
number of different configurations.
But if I'm looking at two variables, then the number of
configurations may be [INAUDIBLE]
[? squared, ?] and with three, even more.
But typically, I'm going to have hundreds-- if you're
thinking about images, it's thousands-- tens of
thousands-- hundreds of thousands.
So it's crazy how many configurations there are.
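The counting argument above can be made concrete with a small sketch. The helper names `num_cells` and `max_coverage`, the choice of 10 bins per variable, and the figure of one billion examples are assumptions picked for illustration only.

```python
def num_cells(bins_per_variable, num_variables):
    """Number of cells when every variable is cut into the same number of bins."""
    return bins_per_variable ** num_variables


def max_coverage(num_examples, bins_per_variable, num_variables):
    """Upper bound on the fraction of cells that can hold at least one example."""
    return min(1.0, num_examples / num_cells(bins_per_variable, num_variables))


# One billion examples, 10 bins per variable, growing number of variables.
for d in (1, 2, 3, 10, 100):
    print(d, num_cells(10, d), max_coverage(10**9, 10, d))
```

With a handful of variables every cell can be visited, but at 100 variables even a billion examples can touch only a vanishing fraction of the cells, so counting per cell cannot generalize.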
So how do we possibly generalize to new configurations?
We cannot just break up this space into small cells and
count how many things happen in each cell because the new
examples that we want to [? carry-- ?] new
configurations that [INAUDIBLE] asked about might
be in some region where we hadn't [INAUDIBLE].
So that's the problem of generalizing [INAUDIBLE].
So there's one thing that can help us, but it's not going to
be enough.
It's something that happens with AI problems.
It's very often [INAUDIBLE] vision, [INAUDIBLE]
processing and understanding and many other problems where
the set of configurations of variables that are plausible--
that can happen in the real world--
occupies a very small volume of the whole set of possible
configurations.
So let me give an example.
In images, if I choose the pixels in an image randomly--
in other words, if I sample an image from a completely uniform
distribution, I'm going to get things like this.
And I can repeat this for eons and eons.
And I'm never going to sample something that looks
like a face.
So what it means is that faces--
images of faces--
are very rare in the space of images.
They occupy a very small volume, much less than what
this picture would suggest.
And so this is a very important hint.
It means that actually the task is to find out where this
I have another example here.
If you take the image of a four like this one and you do
some geometric transformations to it like rotating it,
scaling it, you get slightly different images.
And if at each point, you allow yourself to make any of
these transformations, you can create a so-called manifold--
so a surface of possible images.
Each point here corresponds to a different image.
And the number of different changes that you make is
basically the dimensionality of this manifold.
So in this case, even though the data lives in a
high-dimensional space, the actual variations we care about are
of low dimensionality.
And knowing that, we can maybe do
better in terms of learning.
One thing about the curse of dimensionality is I don't like
the name curse of dimensionality because it's
not really about dimensionality.
You can have many dimensions but have
a very simple function.
What really matters is how many variations does the
function have-- how many ups and downs?
So we actually had some fairly [? cool ?] results about--
the number of examples you would need if you were only
relying on this smoothness assumption, essentially is
linear in the number of ups and downs of the function.
So let's come back to this idea of learning where to put
probability mass.
So in machine learning, what we have is data.
Each example is a configuration of variables.
And we know that this configuration [? occurred ?]
in the real world.
So we can say the probability for this configuration.
So this is the [? space ?] of configuration
I'm showing in 2D.
So we know that this configuration is plausible.
So we can just put a [? beacon ?]
And we can put a [? beacon ?] at every example.
The question is how do we take this probability mass and sort
of give a little bit of that to other places.
In particular, we'd like to put mass in between. If there
really was a manifold that has some structure, and if we could
discover that structure, it would be great.
So the classical machine learning way of doing things
is to say that the distribution function-- the function that
you're trying to [? learn ?] in this case-- is smooth.
So if it's very probable here, it must also be probable in
the neighborhood.
So we can just do some mathematical equation that
will shift some mass from here to the [? different ?]
Then we get a distribution like this as our model.
And that works reasonably well.
But it's not the right thing to do.
It's putting mass in many directions
we don't care about.
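One common way to shift mass from each example to its neighborhood is a kernel density estimate with a round Gaussian bump on every example. The sketch below assumes that choice (the talk does not name the estimator), along with a hypothetical data set lying on the line y = x and a hand-picked bandwidth `sigma`.

```python
import math


def gaussian_kde(examples, sigma):
    """Density made of one isotropic 2-D Gaussian bump per training example."""
    norm = 1.0 / (2.0 * math.pi * sigma ** 2 * len(examples))

    def density(point):
        total = 0.0
        for ex in examples:
            sq_dist = (point[0] - ex[0]) ** 2 + (point[1] - ex[1]) ** 2
            total += math.exp(-sq_dist / (2.0 * sigma ** 2))
        return norm * total

    return density


# Hypothetical examples lying on a 1-D manifold (the line y = x) in 2-D.
examples = [(t / 10, t / 10) for t in range(11)]
density = gaussian_kde(examples, sigma=0.1)

on_manifold = density((0.55, 0.55))   # between two examples, along the line
off_manifold = density((0.55, 0.45))  # same distance from (0.5, 0.5), off the line
print(on_manifold, off_manifold)
```

The off-manifold point receives a density of the same order as the on-manifold one, which is the leakage of probability mass into directions that do not matter.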
Instead, what we're going to do is to discover that there
is something about this data.
There is some structure.
There is some abstraction that allows us to be very specific
about where we're going to put probability mass.
And we might discover something like this, which in
2D doesn't look like a big difference.
But in high dimensions, the number of directions you're
allowed to move here is very small compared to the number
of dimensions here.
And the volume grows exponentially with dimension.
So you can have a huge [? gain ?] by guessing
properly which directions things are allowed to move in
while keeping high probability.
So, now to the core of this presentation, which is about
representation learning.
I've talked about learning in general
and some of the issues--
some of the challenges with applying learning to AI.
Now, when you look at how machine learning is applied in
industry, what people do for 90% of the time-- what they do
with the effort of engineers is not really
improve machine learning.
They use existing machine learning.
But to make the machine learning [INAUDIBLE] work
well, they do [INAUDIBLE]
So that means taking the raw data and transforming it--
extracting some features-- deciding what matters--
throwing away the things that we think don't matter.
And that's essentially using humans and our intelligence
and our understanding of the problem to figure out the
factors that matter-- to figure out the dependencies
that matter and so on.
So what representation learning is about is trying to
do with machines what humans do right now, which is
extracting those features--
discovering what is a good representation for your data.
And one way to think about it is the machine
is trying to guess--
not just those features or those computations that are
useful for us to explain our [INAUDIBLE]
but really what are the underlying factors that
explain the [INAUDIBLE]?
What are the underlying causes?
And the guesses about what these are for our particular
example is exactly what we'd like to have as our
representation.
Of course, this is hard to define because we don't know
what the right factors are, what are the
right causes of it.
This is the objective we have.
This is [INAUDIBLE] by the way.
So there is a very important [? family ?] of algorithms as
[INAUDIBLE] mentioned [INAUDIBLE] that have been
around since at least [? those ?]
that have multiple levels like this that have been around
since the '80s.
And they have multiple layers of [? computations. ?]
And one of the things I've been trying to do is to find
some properties that they have that other algorithms may have
that may be useful and try to understand why these
properties are useful.
In particular, there's the [INAUDIBLE] of depth.
So the idea of deep learning is that not only are you going
to have representations of the data [INAUDIBLE]
But you're going to have multiple levels of
representation.
And why would it matter to have multiple levels of
representation?
Because you're going to have low level and high level
representations where high level representations are
going to be more abstract--
capture structure that is less obvious in the data.
So what we call deep learning is when the learning algorithm
can discover these representations and even
decide how many levels there should be.
So I mentioned [INAUDIBLE]
as the original example of deep learning.
What these algorithms do is they learn some computation--
some function that takes an input vector and maps it
to some output, which could be a vector, through different
levels of representation where each level is composed of
units which do a computation that's inspired by how the
neurons in the brain work.
So they have a property which you don't find in many
learning algorithms called [INAUDIBLE].
So let's first see how these other [INAUDIBLE]
how they generalize.
Remember, this is going to be very similar to when I talked
They rely on this smoothness assumption.
Deep learning also relies on this smoothness assumption
but introduces additional [INAUDIBLE]-- additional
knowledge, if you will.
So when you only rely on this smoothness assumption, the
way you work is you essentially take your
[? input ?] space--
[INAUDIBLE] space, and break it up into regions.
For example, this is what happens with clustering,
nearest neighbors, any SVMs, any classical statistical
non-parametric algorithms, decision trees and
[? so on. ?]
So what happens is after seeing the data, you break up
the [INAUDIBLE] space into regions, and
you generalize locally.
So if you have a function that outputs something here--
because you've seen an example here for example--
you can generalize and say, well in the neighborhood, the
output is going to be similar and maybe some kind of
interpolation with the neighboring regions is going
to be performed.
But the crucial point from a mathematical point of view is
that there's a counting argument here, which is how
many parameters-- how many degrees of freedom do we have
to define this partition?
Well, basically, you need at least one
parameter per region.
The number of parameters is going to grow with the number
of regions.
See-- if I want to distinguish two regions, I need to say
where the first one is or how to separate between these two.
And for example, [INAUDIBLE]
specifying the center of each region.
So the number of things I have to specify from the data is
essentially equal to the number of regions I can
distinguish.
So you can think, well, there's no other way you could do
it.
I mean, how could you possibly create a new region for each
[INAUDIBLE] not see any data, and distinguish it
Well, you can.
Let me give you an example.
So this is what happens with distributed representations,
which happens with things like factor models, PCA, RBM,
neural nets, sparse coding, and deep learning.
What you're going to do is you're going to still break up
the input space into regions and be able to
generalize locally in a sense that things that are nearby
are going to have similar outcomes.
But the way you're going to learn that
is completely different.
So, for example, you can [INAUDIBLE]
What I'm then doing is that I'm going to break it down in
different ways that are not mutually exclusive.
So, here, what I'm thinking about when I'm building this
model is there are three factors that explain the two
inputs that I'm seeing.
So this is two-dimensional input space.
And I'm breaking the space into different regions.
So, for example, the black line here tells me that you're
either on that side of it or the other side of it.
On that side, it's [? T1 ?] equals 1.
On that side, it's [? T1 ?] equals 0.
So this is a bit that tells me whether I'm in
this set or this set.
And I have this other bit that tells me whether I'm on
this set and this set, or that set.
And now you can see that the number of regions I've defined
in this way could be much larger than the number of
parameters,
because the number of parameters only grows
with the number of factors-- the number of these [INAUDIBLE].
So by being smart about how we define those regions by
allowing the [INAUDIBLE] to help us, you can get
potentially exponential gain in expressive power.
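A small sketch can make the region-counting argument tangible. It assumes three hypothetical lines in a two-dimensional input space, each contributing one bit, and counts the distinct bit patterns found on a grid. It also uses the standard formula for the maximum number of regions cut out by n hyperplanes in d dimensions, the sum of C(n, k) for k from 0 to d. The names `code` and `max_regions` are made up for the example.

```python
import math


def code(point, hyperplanes):
    """Binary code of a 2-D point: one bit per line, saying which side it is on."""
    x, y = point
    return tuple(1 if a * x + b * y + c > 0 else 0 for a, b, c in hyperplanes)


def max_regions(num_hyperplanes, dim):
    """Maximum number of regions cut out by hyperplanes in general position."""
    return sum(math.comb(num_hyperplanes, k) for k in range(dim + 1))


# Three hypothetical lines a*x + b*y + c = 0: x = 0, y = 0 and x + y = 1.
hyperplanes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, -1.0)]

# Count the distinct codes seen on a grid covering [-3, 3] x [-3, 3].
grid = [(i / 4 - 3, j / 4 - 3) for i in range(25) for j in range(25)]
regions = {code(p, hyperplanes) for p in grid}
print(len(regions), max_regions(3, 2))

# With as many factors as input dimensions, regions grow as 2**d
# while the parameter count grows only as d * (d + 1).
d = 20
print(max_regions(d, d), d * (d + 1))
```

In the toy 2-D case the gain is modest, but in the last line about a million regions are distinguished with 420 parameters, whereas a purely local method would need at least one parameter per region.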
Of course, from the machine learning point of view, this
comes with an assumption.
The assumption is that when I learn about being on that side
or that side, it's meaningful [INAUDIBLE] in some sense--
not quite in a statistical sense--
of what happens with the other configurations--
the other half of it.
So that makes sense if you think of, OK, this is images.
And this one is telling me is this a male or a female?
This one's telling me, does he wear glasses or not?
Is he tall or short, something like that.
So if you think about these factors as [INAUDIBLE]
meaningful things, usually, you can vary them [INAUDIBLE],
like the causes that explain the world around us.
And that's why you're able to generalize.
You're assuming something about the world that gives you
a kind of exponential power of representation.
Now, of course, in the real world, the features we care
about, the factors we care about are not going to be
simple, linear, or separated.
So that's one reason why we need deep representations.
Otherwise, just the same old [? level ?] will be enough.
Let me move on because time is flying.
So this is stolen from my brother, Samy, who gave a talk
here not long ago where they used this idea of representations
in a very interesting way where you have data of two
You have images.
And you have text queries--
short sequence of words.
And they learned a representation for images, so
they map the image to some high-dimensional vector, and
they learn a function that represents queries.
So they map the query to a point, also high-dimensional, in
the same space.
And they learn them in such a way that when someone types
"dolphin" and then is shown an image of a dolphin and then
clicks on it, the representation for the image
and the representation for the query end up
close to each other.
And in this way, once you learn that, you can of course
[INAUDIBLE] things like answering new queries you've
never seen and find images that match queries that
somehow you haven't seen before.
One question that people outside of machine learning
ask when they've considered what machine learning people are
doing is this is crazy.
Humans can learn from very few examples.
And you guys need thousands or millions of examples.
I mean, you're doing something wrong.
And they're right.
So how do humans manage to constantly learn something
very complicated from just a few examples?
Like, how do students learn something?
Well, there are a number of answers.
One is brains don't start from scratch.
They have some priors.
And in particular, I'm interested in generic priors
that allow us to generalize to things that [INAUDIBLE]
didn't train our species to do.
But still they do very well.
So we have some very general purpose priors we are born
with, and I'd like to figure out which they are because we
can exploit them as well.
Also-- and this is very, very important--
if you ask a newborn to do something, it
wouldn't work very well.
But of course, an adult has learned a lot of things before
you give him a few examples.
And so he's transferring knowledge from [INAUDIBLE].
This is [? crucial. ?]
And the way he's doing that is he's built in his mind
representations of the objects-- of the types of the
modalities which are given in the examples.
And these representations capture the relationships
between the factors-- the explanatory factors that
explain what is going on in your particular
setup of the new task.
And he's able to do that from unlabeled data--
from examples that were unrelated to the task we're
trying to solve.
So one of the things that humans are able to do is to do
what's called semi-supervised learning.
They're able to use examples that are not specifically for
the task you care about to generalize.
They are able to use information about the
statistical structure of the things around us to
[? greatly ?] answer new questions.
So here, let's say someone gives me just two examples.
We want to discriminate between the
green and the blue.
And the classical algorithm would do something like put a
straight line in between.
But what if you knew that there are all these other
points that are not [INAUDIBLE]
related to your task.
But these are the configurations that are
plausible in the [INAUDIBLE] distribution.
So those [INAUDIBLE] ones, you don't know if they
are green or blue.
But by the structure here, we guess that these ones are all
blue and these ones are all green.
And so you would put your decision like this.
So we're trying to take advantage of data from other
tasks, or data that are unlabeled, to find something generic about the
world like [INAUDIBLE] usually happen in this direction and
use that to quickly generalize from very few
examples to new examples.
So as for the motivations for learning about depth, there
are theoretical motivations that come from the discovery
of families of functions-- mathematical functions--
that can be represented very efficiently if
the representations have more levels, but that might
require exponentially more numbers
if you're only allowed one or two levels.
Even though one or two levels are enough to represent
any function, it might be very inefficient.
And of course, there are biological motivations, like
the brain seems to have the [INAUDIBLE]
[? picture. ?]
[? It's especially ?] true of the visual cortex, which is
the part we understand best.
And the cortex seems to have a generic learning
algorithm, in which the same principles seem to be at work in terms of
learning everywhere in the cortex.
Finally, there are cognitive motivations [INAUDIBLE].
We learn simpler things first.
And then we compose these simpler things to build
high level abstractions.
This has been exploited, for example, in the work of
[? Stanford-- ?]
by [INAUDIBLE] and [INAUDIBLE] and others to show how
[INAUDIBLE] representations can learn simple things like
edges, combine them to form parts, and combine those to form faces
and things like that.
Another sort of simple motivation is how do you
Do we program computers by having a main program that has
a bunch of lines of code?
Or do we program computers by having functions or
subroutine, [? like call ?] subroutine, [? then call ?]
This is [? the new ?] program.
If we were forced to program that way, it
wouldn't work very well.
But most of machine learning is basically trying to solve
the [INAUDIBLE] in this--
not in the programs they use but in the structure of the
functions that are learned.
And there are also, of course, motivations from looking at
what can be achieved by exploiting depth.
So I'm stealing this slide from another Google [? talk ?]
led by Geoff Hinton last summer, which shows how deep
nets, compared to the standard way, which has been the
state-of-the-art in speech recognition for 30 years, can
be substantially improved by exploiting these multiple
levels of representation--
even-- and this is something new that impressed me a lot--
even when the amount of data available is huge, the gain in
using these representations is-- representation learning
And this all comes from something that happened in
2006 when first Geoff Hinton followed by a group here in
Montreal and [INAUDIBLE] group in NYU in New York found that
you could actually train your deep neural network by using a
few simple tricks.
And the simple trick is essentially that we're going
to train layer by layer using unsupervised
learning, although recent work now allows us to train deep
networks without this trick and using other tricks.
This has given rise to lots of industrial interest, as I
not only in [INAUDIBLE] conditions but also in
[INAUDIBLE], for example.
I'm going to talk about some competitions we've won using
So, last year we won sort of a transfer learning competition,
where you were trying to take the representations learned
from some data and apply them on other data that relates to
similar but different tasks.
And so there was one competition where the results
were announced at ICML 2011
and another one at NIPS 2011.
So this is less than a year ago.
And what we see in those pictures is how the
[INAUDIBLE] improves with more layers.
More precisely, each of these graphs has on the
x-axis the log of the number of labeled examples used for
training the machine.
And the y-axis is [INAUDIBLE]
essentially, so you want this to be [? high. ?]
And for this task, as you add more levels of representation,
what happens is you especially get better in the case where
you have very few labeled examples-- the thing I was
talking about that humans can do so well--
generalize from very few examples.
Because they've learned the representation earlier on
using lots of other data.
One of the learning algorithms that came out of my lab that
has been used for this is called the denoising
auto-encoder.
And what it does-- in principle, it's pretty simple.
To learn a representation, you take each input example
and you corrupt it by, say, setting some of the [INAUDIBLE]
And then you learn a representation so that you can
reconstruct the input.
But you want to reconstruct the uncorrupted input-- the clean
input-- that's why it's called denoising.
And then you try to make this as close
as possible to [? it. ?]
I mean, this is as close as possible to the [? raw, ?]
And we can show this essentially models the density
of the [INAUDIBLE] distribution.
And you can learn these representations and stack them
on top of each other.
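The corrupt-then-reconstruct recipe can be sketched with a deliberately tiny model. Everything specific here is an assumption made for illustration and not the architecture used in the lab's work: the data lie on the line y = x, corruption is additive Gaussian noise with level `SIGMA`, and the auto-encoder is linear with a single hidden unit and tied weights, trained by plain stochastic gradient descent.

```python
import random

rng = random.Random(0)
SIGMA = 0.2  # assumed corruption noise level


def sample_clean():
    """Clean example on a 1-D manifold (the line y = x) in 2-D input space."""
    t = rng.uniform(-1.0, 1.0)
    return [t, t]


def corrupt(x):
    """Corruption step: add Gaussian noise to every input coordinate."""
    return [xi + rng.gauss(0.0, SIGMA) for xi in x]


def reconstruct(w, x_in):
    """Tied-weight linear auto-encoder with one hidden unit: h = w.x, r = w*h."""
    h = sum(wi * xi for wi, xi in zip(w, x_in))
    return [wi * h for wi in w], h


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


w = [0.5, 0.1]  # arbitrary starting weights
lr = 0.02
for _ in range(5000):
    x = sample_clean()
    x_tilde = corrupt(x)
    r, h = reconstruct(w, x_tilde)
    err = [ri - xi for ri, xi in zip(r, x)]  # compare with the CLEAN input
    err_dot_w = sum(e * wi for e, wi in zip(err, w))
    # Gradient of ||r - x||^2 with respect to the tied weights w.
    grad = [2.0 * (e * h + err_dot_w * xt) for e, xt in zip(err, x_tilde)]
    w = [wi - lr * g for wi, g in zip(w, grad)]

# Compare the error of the corrupted input with the error after reconstruction.
noisy_err = recon_err = 0.0
for _ in range(1000):
    x = sample_clean()
    x_tilde = corrupt(x)
    r, _ = reconstruct(w, x_tilde)
    noisy_err += sq_dist(x_tilde, x)
    recon_err += sq_dist(r, x)
print(w, noisy_err / 1000, recon_err / 1000)
```

After training, the weight vector lines up with the direction of the data, and reconstructions of corrupted points sit closer to the clean points than the corrupted inputs do.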
How am I doing with time?
MALE SPEAKER: 6:19.
YOSHUA BENGIO: Huh?
MALE SPEAKER: 6:19.
YOSHUA BENGIO: I have until [? when? ?]
MALE SPEAKER: [? Tomorrow ?]
[? morning. ?]
MALE SPEAKER: As long as [INAUDIBLE].
MALE SPEAKER: Just keep going.
YOSHUA BENGIO: OK.
YOSHUA BENGIO: OK, so I [INAUDIBLE] here a connection
between those denoising auto-encoders and the manifold
learning idea that I was mentioning earlier.
So how do these algorithms discover the manifolds-- the
regions where the configurations [INAUDIBLE] the
variables are plausible-- where the distribution
So, we're back on the same picture as before.
So these are our examples.
And what we're trying to do is to learn a representation.
So a mapping from the input space [INAUDIBLE] here that we
[INAUDIBLE] to a new space, such that we can essentially
recover the input-- in other words, we don't lose
information.
But at the same time because of the denoising part,
actually, you can [? show that ?] what this is
trying to do is throw away all the information.
So it seems crazy: you want to keep all the
information, but you want to throw away all the
information.
But there's a catch.
Here, you want to only be able to
reconstruct these examples--
not necessarily any configuration of inputs.
So you're trying to find the function which will preserve
the information for these guys.
In other words, it's able to reconstruct them
[? by the identity ?] function.
But it's applied on these guys.
But when you apply it in other places, it's allowed to do
anything it wants.
And it's also learning this [? new ?] function.
So in order to do that, let's see what happens.
Let's consider a particular point here--
It needs to distinguish this one from its neighbor.
In the representation, [INAUDIBLE].
The representation you learn from that guy has to be
different enough from that guy that we can actually recover
and distinguish this one from this one.
So we can learn an inverse mapping, an approximate
inverse mapping, from the representation.
So that means you have to have a representation which is
sensitive to changes in that direction.
So when I move slightly from here to here, the
representation has to change slightly as well.
On the other hand, if I move in this direction, then the
representation doesn't need to capture that.
It could be constant as I move in that direction.
In fact, it wants to be constant in all directions.
But what's going to happen is it's going to be constant in
all directions except directions that it actually
needs to reconstruct the data and in this way, recover the
directions that are the derivatives of this
And you recover the directions of the manifold-- the
directions where if I move in this direction, I still stay
in a region of high probability.
That's what the manifold really means.
So we can get rid of this direction.
And recently, we came up with an algorithm that you can use
to sample from the model.
So if you have an understanding of the manifold
as something that tells you at each point, these are the
directions you're allowed to move--
so as we stay in high probability [INAUDIBLE]
So these are the directions that keep you [? tangent ?] to
the manifold, then basically, the algorithm goes, well, we
are at a point.
We move in the directions that our algorithm discovered to be
good directions of change-- plausible
directions of change.
And that might correspond to something like taking an image
and translating it or [? updating it ?] or doing
something like removing part of an image.
And then projecting back towards the manifold-- it
turns out that the reconstruction
function does that.
And then iterating that random walk
to get samples from the model.
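The move-then-project loop can be sketched as follows. This is a simplified stand-in and not the published algorithm: the manifold is assumed to be the unit circle, `reconstruct` is a hand-written projection playing the role of a learned reconstruction function, and the random move uses isotropic Gaussian noise instead of noise restricted to learned tangent directions.

```python
import math
import random

rng = random.Random(0)


def reconstruct(point):
    """Stand-in for a learned reconstruction function.

    The manifold is assumed to be the unit circle, and reconstruction
    simply projects a point back onto it.
    """
    norm = math.hypot(point[0], point[1])
    if norm == 0.0:
        return (1.0, 0.0)  # arbitrary choice for the degenerate case
    return (point[0] / norm, point[1] / norm)


def sample_chain(start, num_steps, step_size=0.3):
    """Alternate a small random move with a projection back to the manifold."""
    samples = []
    point = start
    for _ in range(num_steps):
        moved = (point[0] + rng.gauss(0.0, step_size),
                 point[1] + rng.gauss(0.0, step_size))
        point = reconstruct(moved)
        samples.append(point)
    return samples


samples = sample_chain(start=(1.0, 0.0), num_steps=2000)
print(samples[:3])
```

Every sample stays on the assumed manifold while the chain wanders along it, which is the behavior the random walk is meant to have.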
And we apply this to modeling faces and digits.
Now, let's come back to this question of what is a good
representation.
People in computer [? vision ?] have used this
term invariance a lot.
And it's a word that's used a lot when
you handcraft features.
So, remember, at the beginning, I said the way most
of machine learning is applied is you take your raw data, and
you handcraft features based on your knowledge of what
matters and what doesn't matter.
For example, if your input is images, you'd like to design
features that are going to be insensitive to
translations of your input.
Because typically, the category you're trying to
detect should not depend on a small translation.
So this is the idea of invariant features.
But if we want to do unsupervised learning, where
no one tells us ahead of time what matters and what doesn't
matter-- what the task is going to be, then how do we
know which invariance matters?
For example, let's say we're doing speech recognition.
Well, if you're doing speech recognition, then you want to
be invariant to who the speaker is.
And you want to be invariant to what kind of microphone it
is and what's the volume of the sound.
But if you're doing speaker identification, then you want
to be invariant to what the person says and you want to be
very sensitive to the identity of the person.
But if someone gives you speech,
and you don't know if it's going to be used for
recognition of [INAUDIBLE] or for recognizing people, what
should you do?
Well, what you should be doing is learning
to disentangle factors--
basically, discovering that in speech, the things that matter
are the [INAUDIBLE], the person, the
microphone, and so on.
These are the factors that you'd like to discover
And if you're able to do that, then my claim is you can
essentially get around the curse of dimensionality.
You can solve very hard problems.
There's something funny that happens with the deep learning
algorithms I was talking about earlier, which is that if you
train these representations from purely unsupervised
learning, you discover that the features-- the
representations that they find-- have some form of
disentangling, in
that some of the units in the [INAUDIBLE]
are very sensitive to some of the underlying factors.
And they're very sensitive to one factor and very
insensitive to other factors.
So this is what disentangling is about.
But nobody told these algorithms what those factors
would be in the first place.
So something good is happening.
But we don't really understand why.
And we'd like to understand why.
One of the things that you see in many of these algorithms is
the idea of so-called sparse representations.
So what is that?
Well, up to now, I've talked about representations as just
a bunch of numbers that we associate to an input.
But one thing we can do is learn representations
[? that have ?]
[? the property-- ?]
that many of those numbers happen to be
zero or some constant--
[? other value. ?]
But zero is very convenient.
And it turns out, when you do that, it helps a lot, at least
for some problems.
So that's interesting.
And I conjecture that it helps us disentangle the underlying
factors in the problems where basically--
for any example, there are only a few concepts and
factors that matter.
So in the scene that I see right now that comes to my
eyes, of all the concepts that my brain knows about, only a
few are relevant to this scene.
And it's true of almost any input
that comes to my sensors.
So it makes sense to have representations that have this
property as well-- that even though we have a large number
of possible features, most of them are sort of not
applicable to the current situation.
Not applicable, in this case, zero.
So just by forcing many of these features to output not
applicable, somehow we're getting better.
This has been used in a number of papers.
And we've used it with so-called rectifier neural
networks, in which the units [? compute ?] a function like
this on top of the usual linear
transformation they perform.
And the result is that this function [INAUDIBLE]
So when x here is some weighted sum from the previous
layer, what happens is either the output is a positive real
number or the output is 0.
So let's say the input was sort of random, centered around
0, then half of the time, those features would output 0.
And if you just learn to shift this a little bit to the left,
then you know--
80% of the time or 95% of the time, the output will be 0.
So it's very easy to get sparsity with these kinds of units.
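As a rough illustration of the sparsity argument (a sketch, not code from the talk): assuming the weighted sum coming into a unit is a standard normal random variable, the hypothetical helper `zero_fraction` below measures how often a rectifier outputs exactly 0 as a bias shifts that input toward negative values. The Gaussian input, the helper names, and the bias values are all assumptions chosen for illustration.

```python
import numpy as np

def relu(x):
    """Rectifier: passes positive inputs through, outputs 0 otherwise."""
    return np.maximum(x, 0.0)

def zero_fraction(bias, n=100_000, seed=0):
    """Fraction of rectifier outputs that are exactly 0 when the
    pre-activation is a standard normal sample plus `bias`.
    The Gaussian input is an assumption made for illustration."""
    rng = np.random.default_rng(seed)
    pre_activation = rng.standard_normal(n) + bias
    return float(np.mean(relu(pre_activation) == 0.0))

# Shifting the input distribution toward negative values raises sparsity.
for bias in (0.0, -0.84, -1.64):
    print(f"bias={bias:+.2f}  fraction of zeros={zero_fraction(bias):.2f}")
```

Under these assumptions the fraction of zeros comes out at roughly one half with no shift, and roughly 80% and 95% for the two negative biases, which matches the figures mentioned in the talk.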
It turns out that these [INAUDIBLE] are sufficient to
learn very complicated things.
And that was used in particular in a really
outstanding system built by Alex Krizhevsky and
[INAUDIBLE] and [INAUDIBLE]
with Geoff Hinton in Toronto recently where they obtained
amazing results on one of the benchmarks that computer
vision people really care about [INAUDIBLE]
with [? 1,000 ?] classes.
So this contains millions of images taken from Google Image
search and 1,000 classes that you're trying to classify.
So these are images like this.
And there are 1,000 different categories you want to detect.
And this shows some of the outputs of this model.
[INAUDIBLE] obviously doing well.
And they managed to bring the state of the art, which had been
making small incremental changes from, say, 27% to 26% error, down
to 17% on this particular benchmark.
That's pretty amazing.
And one of the tricks they used--
which I've been doing publicity for--
is called dropout.
And Geoff Hinton is speaking a lot about this this year.
Next year will be something else.
And it's a very nice trick.
Basically, the idea is add some kind of randomness in the
typical neurons we use.
So you'd think that randomness hurts, right?
So if we learn a function like, you know-- say, thinking
about the brain doing something.
If you had noise in the computations of the brain,
you'd think it hurts.
But actually, when you do it during training, it helps.
And it helps for reasons that are yet to be completely
understood.
But the theory is it prevents the features you learn from
depending too much on the presence of the others.
So, half of the features will be turned off by this trick.
So the idea is you take the output of a neuron, and you
multiply it by 1 or 0, [INAUDIBLE]
with probability one-half.
So you turn off half of the features [INAUDIBLE].
We do that for all the layers.
And [INAUDIBLE] at test time, you don't do
this kind of thing.
You just multiply it by [INAUDIBLE].
So on average, it's the same thing.
But what happens is that during training, the features
learn to be more robust and more independent of each other
and collaborate in a less fragile way.
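A minimal sketch of the trick just described, under stated assumptions: during training each unit's output is multiplied by a random 1 or 0, and at test time no units are dropped and the output is scaled instead. The transcript does not capture the test-time factor; scaling by the keep probability (one-half here) is the standard formulation and is assumed. The function names `dropout_train` and `dropout_test` are hypothetical.

```python
import numpy as np

def dropout_train(h, rng, keep_prob=0.5):
    """Training: multiply each unit's output by 1 (kept, with
    probability keep_prob) or 0 (turned off)."""
    mask = rng.random(h.shape) < keep_prob
    return h * mask

def dropout_test(h, keep_prob=0.5):
    """Test: no random masking. Scaling by keep_prob (an assumed
    factor) makes the output match the training-time average."""
    return h * keep_prob

rng = np.random.default_rng(0)
h = np.ones((10_000, 8))            # stand-in for one layer's outputs
train_mean = dropout_train(h, rng).mean()
test_mean = dropout_test(h).mean()
print(train_mean, test_mean)        # both close to 0.5
```

In a multi-layer network the same masking would be applied independently at each layer during training, as the talk describes.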
This is actually similar to the denoising auto-encoder I
was talking about earlier where we introduce corruption
noise in the input.
But here, you do it at every layer.
And somehow, this very simple trick helps
a lot in many contexts.
So they've tested it on different benchmarks.
These are three image data sets and also in speech.
And in all cases, they've seen improvements.
Let's get back to the representation learning algorithms.
Many of them are based on learning one layer of
representation at a time.
And one of the algorithms that has been very [? practical ?]
for doing that is called a Restricted
Boltzmann Machine, or RBM.
And as a probability model, it's formalized this way.