Deep Learning of Representations

YOSHUA BENGIO: [INAUDIBLE].
Thank you [INAUDIBLE].
So I'll talk about [INAUDIBLE].
I'll talk about representations and learning
representations.
And the word deep here, I'll explain what it means.
So my goal is to contribute to building intelligent machines,
also known as AI.
And how do we get a machine to be smart--
to take good decisions?
Well, it needs knowledge.
[INAUDIBLE]
[? researchers ?]
from the early days--
'50s, '60s, '70s--
tried to give the knowledge to the machine--
the knowledge we have explicitly.
And it didn't work quite as well as was hoped.
One reason is that a lot of our knowledge is not something
we can communicate verbally or write down in a program.
So that knowledge has to come from somewhere else.
And basically what we have found is you can get that
knowledge through observing the world around us.
That means learning.
OK-- so we need learning for AI.
What is learning?
What is machine learning?
It's not about learning things by heart.
That's just a fact.
What it is about is generalizing from the examples
you've seen to new examples.
And what I like to tell my students is it's taking
probability mass-- that is, on the training examples and
somehow guessing where it should go-- which new
configurations of the things we see make
sense or are plausible.
This is what learning is about.
It's guesswork.
At first we can measure [INAUDIBLE] we can guess.
And I'll mention something about dimensionality and
geometry that comes up when we think about this [INAUDIBLE].
And one of the messages will be that we can maybe fight
this dimensionality problem by allowing the machine to discover
underlying causes-- the underlying factors that explain the data.
And this is a little bit like [INAUDIBLE] is about.
So let's start from learning, an easy [INAUDIBLE]
of learning.
Let's say we observe x,y pairs where x is a number--
y is a number.
And the stars here represent the examples we've seen of x,y
configurations.
So we want to generalize to new configurations.
In other words, for example, in this problem, typically we
want to predict a y given a new x.
And there's an underlying relationship between y and x,
meaning the expected value of the y given x, which is given
with this purple curve.
But we don't know it.
That's the problem with machine learning.
We're trying to discover something
we don't know already.
And we can guess some function.
This is the predicted or learned function.
So how could we go about this?
One of the most basic principles by which machine
learning algorithms are able to do this is assume something
very simple about the world around us-- about the data
we're getting or the function we're trying to discover.
It's just assuming that the function we're trying to
discover is smooth, meaning if I know the value of the
function at some point x, and I want to know the
value at some nearby point x prime, then it's reasonable to
assume that the value at x prime of the function we want to
learn is close to the value at x.
That's it.
I mean, you can formalize that and [INAUDIBLE] in many
different ways and exploit it in many ways.
And what it means here is if I ask you what y
should be at this point,
what I'm going to do is look up the values of y that I
observed at nearby points
and, combining these,
make a reasonable guess like this one.
And if I do that on problems like this, it's actually going
to work quite well.
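A minimal sketch of this "look up nearby points and combine them" idea, assuming a plain k-nearest-neighbor average as the combining rule. The function name `knn_predict`, the choice `k=3`, and the sine curve standing in for the unknown function are all illustrative assumptions, not something from the talk.

```python
import math

def knn_predict(train, x_new, k=3):
    """Predict y at x_new by averaging the y values of the k nearest training x's."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical training pairs (x, y) sampled from a smooth curve, here sin(x).
train = [(i * 0.5, math.sin(i * 0.5)) for i in range(13)]  # x from 0 to 6

x_query = 1.3
y_guess = knn_predict(train, x_query)      # average of y at x = 1.0, 1.5, 2.0
error = abs(y_guess - math.sin(x_query))   # small, because the curve is smooth
```

The guess is only as good as the coverage: it works here because the examples are dense relative to the ups and downs of the curve.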
And a large fraction of the applications that we're
seeing use this principle.
And [INAUDIBLE]
enough of just this principle.
But if we only rely on this principle for generalization,
we're going to be in trouble.
That's one of the messages I want to explain here.
So why are we going to be in trouble?
Well, basically we're doing some kind of interpolation.
So if I see enough examples--
the green stars here-- to cover the ups and downs of the
function I'm trying to learn, then I'm going to be fine.
But what if the function I want to learn has many more
ups and downs than I can possibly observe through data?
Because even Google has a finite number of examples.
Even if you have millions or billions of examples, the
functions we want to learn for AI are not like this one.
They have--
the number of configurations of the variables of interest--
that may be exponentially large.
So something maybe bigger than the number of
atoms in the universe.
So there's no way we're going to have enough examples to
cover all the configurations.
For example, think of the number of different English
sentences, which is something that Google is interested in.
And this problem is illustrated by the so-called
curse of dimensionality where you consider what happens when
you have not just one variable but many variables and all of
their configurations.
How many configurations of N variables do you have?
Well, you have an exponential number of configurations.
So if I wanted to learn about a single
variable, I can just divide--
say it takes a real value.
And I divide its values into intervals.
And I count how many times I've seen each of those bins in my data.
I can estimate the probability of different intervals coming up.
So that's easy because I only want to know about a small
number of different configurations.
But if I'm looking at two variables, then the number of
configurations may be [INAUDIBLE]
squared, and with three dimensions-- even more.
But typically, I'm going to have hundreds-- if you're
thinking about images, it's thousands-- tens of
thousands-- hundreds of thousands.
So it's crazy how many configurations there are.
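The counting argument can be made concrete with a small sketch. It assumes every variable is cut into the same number of bins (10 here, an arbitrary choice), so the number of cells is the bin count raised to the number of variables; `num_cells` is a hypothetical helper name.

```python
def num_cells(bins_per_variable, num_variables):
    """Number of grid cells when each variable is split into the same number of bins."""
    return bins_per_variable ** num_variables

for d in (1, 2, 3, 10, 100):
    print(d, num_cells(10, d))

# Even a billion examples can land in at most a billion cells, so the
# fraction of cells that could ever be observed collapses as d grows.
examples = 10 ** 9
fraction_covered = min(1.0, examples / num_cells(10, 100))
```

With 100 variables and 10 bins each, the fraction of cells that a billion examples could cover is far too small for per-cell counting to say anything about a new configuration.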
So how do we possibly generalize to new
configurations?
We cannot just break up this space into small cells and
count how many things happen in each cell because the new
examples that we want to [? carry-- ?] new
configurations that [INAUDIBLE] asked about might
be in some region where we hadn't [INAUDIBLE].
So that's the problem of generalizing [INAUDIBLE].
So there's one thing that can help us, but it's not going to
be sufficient.
It's something that happens with AI problems.
It's very often [INAUDIBLE] vision, [INAUDIBLE]
processing and understanding and many other problems where
the set of configurations of variables that are plausible--
that can happen in the real world--
occupies a very small volume of the whole set of possible
configurations.
So let me give an example.
In images, if I choose the pixels in an image randomly--
in other words, if I sample an image from a completely uniform
distribution, I'm going to get things like this.
Just [INAUDIBLE].
And I can repeat this for eons and eons.
And I'm never going to sample something that looks
like a face.
So what it means is that faces--
images of faces--
are very rare in the space of images.
They occupy a very small volume, much less than what
this picture would suggest.
And so this is a very important hint.
It means that actually the task is to find out where this
distribution concentrates.
I have another example here.
If you take the image of a four like this one and you do
some geometric transformations to it like rotating it,
scaling it, you get slightly different images.
And if at each point, you allow yourself to make any of
these transformations, you can create a so-called manifold--
so a surface of possible images.
Each point here corresponds to a different image.
And the number of different changes that you make is
basically the dimensionality of this manifold.
So in this case, even though the data lives in a
high-dimensional space, the actual variations we care about are
of low dimensionality.
And knowing that, we can maybe do
better in terms of learning.
One thing about the curse of dimensionality is I don't like
the name curse of dimensionality because it's
not really about dimensionality.
You can have many dimensions but have
a very simple function.
What really matters is how many variations does the
function have-- how many ups and downs?
So we actually had some fairly [? cool ?] results about
the number of examples you would need if you were only
relying on this smoothness assumption: it essentially is
linear in the number of ups and downs of the function
[INAUDIBLE].
So let's come back to this idea of learning where to put
probability mass.
So in machine learning, what we have is data.
Each example is a configuration of variables.
And we know that this configuration occurred
in the real world.
So we can say the probability for this configuration.
So this is the space of configurations
I'm showing in 2D.
So we know that this configuration is plausible.
[INAUDIBLE].
So we can just put a [? beacon ?]
[INAUDIBLE] here.
And we can put a [? beacon ?] at every example.
The question is how do we take this probability mass and sort
of give a little bit of that to other places.
In particular, we'd like to put mass in between if there
really was a manifold that has some structure and if we could
discover that structure, it would be great.
So the classical machine learning way of doing things
is to say that the distribution function-- the function that
you're trying to learn in this case-- is smooth.
So if it's very probable here, it must also be probable in
the neighborhood.
So we can just apply some mathematical equation that
will shift some mass from here to the different
neighbors.
Then we get a distribution like this as our model.
And that works reasonably well.
But it's not the right thing to do.
It's putting mass in many directions
we don't care about.
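One standard way to "shift some mass to the neighbors" is a kernel density estimate. The sketch below assumes a Gaussian kernel in one dimension; the talk does not name a specific estimator, so the kernel, the bandwidth of 0.3, and the example values are illustrative assumptions.

```python
import math

def gaussian_kde(examples, bandwidth):
    """Return a density function that spreads each example's mass with a Gaussian bump."""
    norm = 1.0 / (len(examples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - e) / bandwidth) ** 2) for e in examples)

    return density

examples = [1.0, 1.2, 3.0]            # hypothetical 1-D training examples
p = gaussian_kde(examples, bandwidth=0.3)

# Mass leaks into the neighborhood of each example, in every direction alike.
near, far = p(1.1), p(10.0)
```

In one dimension there is only one direction to leak into. In many dimensions the same isotropic bump spreads mass over all directions, including the ones that leave the region where the data concentrates.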
Instead, what we're going to do is to discover that there
is something about this data.
There is some structure.
There is some abstraction that allows us to be very specific
about where we're going to put probability mass.
And we might discover something like this, which in
2D doesn't look like a big difference.
But in high dimensions, the number of directions you're
allowed to move here is very small compared to the number
of dimensions here.
And the volume grows exponentially with dimension.
So you can have a huge gain by guessing
properly which directions things are allowed to move in
while keeping high probability.
So, now to the core of this presentation which is about
representation learning.
I've talked about learning in general
and some of the issues--
some of the challenges with applying learning to AI.
Now, when you look at how machine learning is applied in
industry, what people do 90% of the time-- what they do
with the effort of engineers-- is not really
improve machine learning.
They use existing machine learning.
But to make the machine learning [INAUDIBLE] work
well, they do [INAUDIBLE]
feature engineering.
So that means taking the raw data and transforming it--
extracting some features-- deciding what matters--
throwing away the things that we think don't matter.
And that's essentially using humans and our intelligence
and our understanding of the problem to figure out the
factors that matter-- to figure out the dependencies
that matter and so on.
So what representation learning is about is trying to
do with machines what humans do right now, which is
extracting those features--
discovering what is a good representation for your data.
And one way to think about it is the machine
is trying to guess--
not just those features or those computations that are
useful for us to explain our [INAUDIBLE]
but really what are the underlying factors that
explain the [INAUDIBLE]?
What are the underlying causes?
And the guesses about what these are for our particular
example is exactly what we'd like to have as our
representation.
Of course, this is hard to define because we don't know
what the right factors are, what are the
right causes of it.
This is the objective we have.
This is [INAUDIBLE] by the way.
So there is a very important family of algorithms, as
[INAUDIBLE] mentioned, [INAUDIBLE]--
at least those
that have multiple levels like this-- that have been around
since the '80s.
And they have multiple layers of computations.
And one of the things I've been trying to do is to find
some properties that they have that other algorithms may have
that may be useful and try to understand why these
properties are useful.
In particular, there's the [INAUDIBLE] of depth.
So the idea of deep learning is that not only are you going
to have representations of the data [INAUDIBLE]
learned.
But you're going to have multiple levels of
representation.
And why would it matter to have multiple levels of
representation?
Because you're going to have low level and high level
representations where high level representations are
going to be more abstract--
more nonlinear--
capture structure that is less obvious in the data.
So what we call deep learning is when the learning algorithm
can discover these representations and even
decide how many levels there should be.
So I mentioned [INAUDIBLE]
as the original example of deep learning.
What these algorithms do is they learn some computation--
some function that takes an input vector and maps it
to some output, which could be a vector, through different
levels of representation where each level is composed of
units which do a computation that's inspired by how the
neurons in the brain work.
So they have a property which you don't find in many
learning algorithms called [INAUDIBLE].
So let's first see how these other [INAUDIBLE]
numbers work--
how they generalize.
Remember, this is going to be very similar to when I talked
about the smoothness
assumption.
They rely on this smoothness assumption.
Deep learning also relies on this smoothness assumption
but introduces additional [INAUDIBLE]-- additional
knowledge, if you will.
So when you only rely on this smoothness assumption, the
way you work is you essentially take your
input space--
[INAUDIBLE] space, and break it up into regions.
For example, this is what happens with clustering,
nearest neighbors, any SVMs, any classical statistical
non-parametric algorithms, decision trees and
so on.
So what happens is after seeing the data, you break up
the input space into regions, and
you generalize locally.
So if you have a function that outputs something here--
because you've seen an example here for example--
you can generalize and say, well in the neighborhood, the
output is going to be similar and maybe some kind of
interpolation with the neighboring regions is going
to be performed.
But the crucial point from a mathematical point of view is
that there's a counting argument here, which is how
many parameters-- how many degrees of freedom do we have
to define this partition?
Well, basically, you need at least one
parameter per region.
The number of parameters is going to grow with the number
of regions.
See-- if I want to distinguish two regions, I need to say
where the first one is or how to separate between these two.
And for example, [INAUDIBLE]
specifying the center of each region.
So the number of things I have to specify from the data is
essentially equal to the number of regions I can
distinguish.
So you can think well, there's no other way you could do
that, right?
I mean, how could you possibly create a new region for each
[INAUDIBLE] not see any data, and distinguish it
meaningfully.
Well, you can.
Let me give you an example.
So this is what happens with distributed representations,
which happens with things like factor models, PCA, RBM,
neural nets, sparse coding, and deep learning.
What you're going to do is you're still going to break up
the input space into regions and be able to
generalize locally in the sense that things that are nearby
are going to have similar outcomes.
But the way you're going to learn that
is completely different.
So, for example, you can [INAUDIBLE]
space.
What I'm then doing is that I'm going to break it down in
different ways that are not mutually exclusive.
So, here, what I'm thinking about when I'm building this
model is there are three factors that explain the two
inputs that I'm seeing.
So this is a two-dimensional input space.
And I'm breaking the space into different regions.
So, for example, the black line here tells me that you're
either on that side of it or the other side of it.
On that side, it's [? T1 ?] equals 1.
On that side, it's [? T1 ?] equals 0.
So this is a bit that tells me whether I'm in
this set or this set.
And I have this other bit that tells me whether I'm in
this set and this set, or that set.
And now you can see that the number of regions I've defined
in this way could be much larger than the number of
parameters.
Because the number of parameters only grows
with the number of factors-- the number of these [INAUDIBLE].
So by being smart about how we define those regions, by
allowing the [INAUDIBLE] to help us, we can get a
potentially exponential gain in expressive power.
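A small sketch of the counting argument for distributed representations, assuming each factor is a single linear threshold (one line in 2D, one hyperplane in general). The three lines, the grid, and the helper names are hypothetical. The bound in `max_regions` is the standard count of regions cut out by n hyperplanes in general position in d dimensions.

```python
from math import comb

# Three hypothetical lines a*x + b*y + c = 0; each one contributes a single bit.
lines = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, -1.0)]

def code(x, y):
    """Binary code of a point: which side of each line it falls on."""
    return tuple(int(a * x + b * y + c > 0) for a, b, c in lines)

# Count the distinct codes (regions) seen on a grid covering [-3, 3] x [-3, 3].
grid = [i * 0.25 - 3.0 for i in range(25)]
regions_seen = {code(x, y) for x in grid for y in grid}

def max_regions(n, d):
    """Maximum number of regions cut out by n hyperplanes in d dimensions."""
    return sum(comb(n, i) for i in range(d + 1))

def num_parameters(n, d):
    """Each hyperplane needs d weights and one offset."""
    return n * (d + 1)
```

Three lines already separate 7 regions of the plane. With 20 hyperplanes in 10 dimensions, `num_parameters(20, 10)` is 220 while `max_regions(20, 10)` is 616666, which is the gap the talk refers to: parameters grow with the number of factors, regions grow much faster.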
Of course, from the machine learning point of view, this
comes with an assumption.
The assumption is that when I learn about being on that side
or that side, it's meaningful [INAUDIBLE] in some sense--
not quite in a statistical sense--
of what happens with the other configurations--
the other half of it.
So that makes sense if you think of, OK, this is images.
And this one is telling me is this a male or a female?
This one's telling me, does he wear glasses or not?
Is he tall or short, something like that.
So if you think about these factors as [INAUDIBLE]
meaningful things, usually, you can vary them [INAUDIBLE],
like the causes that explain the world around us.
And that's why you're able to generalize.
You're assuming something about the world that gives you
a kind of exponential power of representation.
Now, of course, in the real world, the features we care
about, the factors we care about are not going to be
simple, linear, or separated.
So that's one reason why we need deep representations.
Otherwise, just a single level would be enough.
Let me move on because time is flying.
So this is stolen from my brother, Samy, who gave a talk
here not long ago where they used this idea of representations
in a very interesting way where you have data of two
different modalities.
You have images.
And you have text queries--
short sequences of words.
And they learn a representation for images, so
they map the image to some high-dimensional vector, and
they learn a function that represents queries.
So they map the query to a point, also high-dimensional, in
the same space.
And they learn them in such a way that when someone types
"dolphin" and then is shown an image of a dolphin and then
clicks on it, the representation for the image
and the representation for the query end up
close to each other.
And in this way, once you learn that, you can of course
[INAUDIBLE] things like answering new queries you've
never seen and find images that match queries that
somehow you haven't seen before.
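Once images and queries live in the same space, retrieval reduces to finding the nearest image vector. The sketch below assumes cosine similarity as the closeness measure and uses made-up three-dimensional vectors; the real system's embeddings, dimensionality, and similarity function are not given in the talk.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical learned embeddings in a shared 3-D space (made-up numbers).
image_vectors = {
    "dolphin.jpg": (0.9, 0.1, 0.0),
    "whale.jpg": (0.8, 0.3, 0.1),
    "car.jpg": (0.0, 0.2, 0.9),
}
query_vectors = {"dolphin": (1.0, 0.0, 0.0), "sports car": (0.1, 0.1, 1.0)}

def best_image(query):
    """Return the image whose embedding is closest to the query embedding."""
    q = query_vectors[query]
    return max(image_vectors, key=lambda name: cosine(q, image_vectors[name]))
```

A query that was never clicked on can still be answered, as long as the query function maps it near the images it should match.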
One thing that people outside of machine learning
say when they consider what machine learning people are
doing is, this is crazy.
Humans can learn from very few examples.
And you guys need thousands or millions of examples.
I mean, you're doing something wrong.
And they're right.
So how do humans manage to constantly learn something
very complicated from just a few examples?
Like, how do students learn something?
Well, there are a number of answers.
One is brains don't start from scratch.
They have some priors.
And in particular, I'm interested in generic priors
that allow us to generalize the things that [INAUDIBLE]
didn't train our species to do.
But still they do very well.
So we have some very general purpose priors we are born
with, and I'd like to figure out which they are because we
can exploit them as well.
Also-- and this is very, very important--
if you ask a newborn to do something, it
wouldn't work very well.
But of course, an adult has learned a lot of things before
you give him a few examples.
And so he's transferring knowledge from [INAUDIBLE].
This is crucial.
And the way he's doing that is he's built in his mind
representations of the objects-- of the types of the
modalities which are given in the examples.
And these representations capture the relationships
between the factors-- the explanatory factors that
explain what is going on in your particular
setup of the new task.
And he's able to do that from unlabeled data--
from examples that were unrelated to the task we're
trying to solve.
So one of the things that humans are able to do is to do
what's called semi-supervised learning.
They're able to use examples that are not specifically for
the task you care about to generalize.
They are able to use information about the
statistical structure of the things around us to
quickly answer new questions.
So here, let's say someone gives me just two examples.
We want to discriminate between the
green and the blue.
And the classical algorithm would do something like put a
straight line in between.
But what if you knew that there are all these other
points that are not [INAUDIBLE]
related to your task.
But these are the configurations that are
plausible in the [INAUDIBLE] distribution.
So those [INAUDIBLE] ones, you don't know if they
are green or blue.
But by the structure here, we guess that these ones are all
blue and these ones are all green.
And so you would put your decision like this.
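A toy version of this picture, assuming a greedy nearest-neighbor label spreading rule. This is one simple way to let unlabeled points shape the decision and is not claimed to be the method behind the slide; the two strips of points and the function name `propagate_labels` are hypothetical.

```python
import math

def propagate_labels(labeled, unlabeled):
    """Greedy label spreading: repeatedly label the unlabeled point that is
    closest to any already-labeled point, copying that neighbor's label."""
    labeled = dict(labeled)          # point -> label
    remaining = set(unlabeled)
    while remaining:
        point, source = min(
            ((u, l) for u in remaining for l in labeled),
            key=lambda pair: math.dist(pair[0], pair[1]),
        )
        labeled[point] = labeled[source]
        remaining.remove(point)
    return labeled

# Hypothetical data: two parallel strips of points, with one label on each strip.
bottom = [(i * 0.5, 0.0) for i in range(11)]   # x from 0 to 5, y = 0
top = [(i * 0.5, 1.0) for i in range(11)]      # x from 0 to 5, y = 1
labeled = {(0.0, 0.0): "green", (5.0, 1.0): "blue"}
unlabeled = [p for p in bottom + top if p not in labeled]

labels = propagate_labels(labeled, unlabeled)
```

The point (5.0, 0.0) is much closer to the blue example than to the green one, so a straight line halfway between the two labeled points would call it blue. Following the structure of the unlabeled points labels the whole bottom strip green instead.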
So we're trying to take advantage of data from other
tasks and of unlabeled data to find something generic about the
world, like [INAUDIBLE] usually happen in this direction, and
use that to quickly generalize from very few
examples to new examples.
So among the motivations for learning about depth, there
are theoretical motivations that come from the discovery
of families of functions-- mathematical functions--
that can be represented very efficiently if you
allow representations with more levels, but that might
require exponentially more numbers--
bigger representations--
if you're only allowed one or two levels.
Even though one or two levels are enough to represent
any function, it might be very inefficient.
And of course, there are biological motivations, like
the brain seems to have the [INAUDIBLE]
[? picture. ?]
It's especially true of the visual cortex, which is
the part we understand best.
And the cortex seems to have a generic learning
algorithm: the same principles seem to be at work in terms of
learning everywhere in the cortex.
Finally, there are cognitive motivations [INAUDIBLE].
We learn simpler things first.
And then we compose these simpler things to build
high-level abstractions.
This has been exploited, for example, in the work of
[INAUDIBLE]
[? Stanford-- ?]
by [INAUDIBLE] and [INAUDIBLE] and others to show how
[INAUDIBLE] representations can learn simple things like
edges--
combine them
to form parts-- combine them to form faces
and things like that.
Another sort of simple motivation is how do you
program computers.
Do we program computers by having a main program that has
a bunch of lines of code?
Or do we program computers by having functions or
subroutines that call subroutines that call
subroutines?
This is [? the new ?] program.
If we were forced to program that way, it
wouldn't work very well.
But most of machine learning is basically trying to solve
the [INAUDIBLE] in this--
not in the programs they use but in the structure of the
functions that are learned.
And there are also, of course, motivations from looking at
what can be achieved by exploiting depth.
So I'm stealing this slide from another Google talk
led by Geoff Hinton last summer, which shows how deep
nets, compared to the standard way, which has been the
state-of-the-art in speech recognition for 30 years, can
be substantially improved by exploiting these multiple
levels of representation--
even-- and this is something new that impressed me a lot--
even when the amount of data available is huge, the gain in
using these representations is-- representation learning
algorithms.
And this all comes from something that happened in
2006 when first Geoff Hinton followed by a group here in
Montreal and [INAUDIBLE] group in NYU in New York found that
you could actually train your deep neural network by using a
few simple tricks.
And the simple trick is essentially that we're going
to train layer by layer using unsupervised
learning, although recent work now allows us to train deep
networks without this trick and using other tricks.
This has given rise to lots of industrial interest, as I
mentioned--
not only in [INAUDIBLE] conditions but also in
[INAUDIBLE], for example.
I'm going to talk about some competitions we've won using
deep learning.
So, last year we won sort of a transfer learning competition,
where you were trying to take the representations learned
from some data and apply them on other data that relates to
similar but different tasks.
And so there was one competition where the results
were announced at ICML 2011--
[INAUDIBLE]
[? 2011 ?] and another one at NIPS 2011.
So this is less than a year ago.
And what we see in those pictures is how the
[INAUDIBLE] improves with more layers.
More precisely, each of these graphs has on the
x-axis the log of the number of labeled examples used for
training the machine.
And the y-axis is [INAUDIBLE]
essentially, so you want this to be high.
And for this task, as you add more levels of representation,
what happens is you especially get better in the case where
you have very few labeled examples-- the thing I was
talking about that humans can do so well--
generalize from very few examples.
Because they've learned the representation earlier on
using lots of other data.
One of the learning algorithms that came out of my lab that
has been used for this is called the denoising
auto-encoder.
And what it does-- in principle, it's pretty simple.
To learn a representation, you take each input example
and you corrupt it by, say, setting some of the [INAUDIBLE]
zero [INAUDIBLE].
And then you learn a representation so that you can
reconstruct the input.
But you want to reconstruct the uncorrupted input-- the clean
input-- that's why it's called denoising.
And then you try to make this as close
as possible to it.
I mean, as close as possible to the raw,
uncorrupted input.
And we can show this essentially models the density
of the [INAUDIBLE] distribution.
And you can learn these representations and stack them
on top of each other.
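A deliberately tiny sketch of the corrupt-then-reconstruct-the-clean-input recipe. It makes several simplifying assumptions that are not from the talk: the autoencoder is linear with a single hidden unit and tied encoder/decoder weights, the corruption is additive Gaussian noise rather than zeroing inputs, and the data is a hypothetical set of points on a line in 2D.

```python
import random

random.seed(0)

def train_denoising_autoencoder(data, noise_std=0.1, lr=0.02, steps=5000):
    """Tied-weight linear autoencoder with one hidden unit, trained by SGD
    to reconstruct the clean input from a noise-corrupted copy."""
    w = [0.3, 0.3]                                   # shared encoder/decoder weights
    for _ in range(steps):
        x = random.choice(data)                      # clean example
        x_tilde = [v + random.gauss(0.0, noise_std) for v in x]   # corrupted copy
        h = sum(wi * xi for wi, xi in zip(w, x_tilde))            # representation
        err = [wi * h - xi for wi, xi in zip(w, x)]  # reconstruction minus clean x
        err_dot_w = sum(e * wi for e, wi in zip(err, w))
        # Gradient of the squared error ||w*h - x||^2 with respect to w.
        grad = [2.0 * (e * h + err_dot_w * xt) for e, xt in zip(err, x_tilde)]
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

# Hypothetical data lying on a 1-D manifold (the line y = 2x) in a 2-D input space.
data = [(t / 10.0, 2.0 * t / 10.0) for t in range(-10, 11)]
w = train_denoising_autoencoder(data)

def reconstruct(x, w):
    """Encode x to a single number, then decode it back to input space."""
    h = sum(wi * xi for wi, xi in zip(w, x))
    return [wi * h for wi in w]
```

In this toy setting the learned weight vector lines up with the direction of the data, so the representation responds to movement along the line and ignores movement away from it. Stacking would mean training a second model of the same kind on the representations produced by the first.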
How am I doing with time?
MALE SPEAKER: 6:19.
YOSHUA BENGIO: Huh?
MALE SPEAKER: 6:19.
YOSHUA BENGIO: I have until [? when? ?]
MALE SPEAKER: [? Tomorrow ?]
[? morning. ?]
MALE SPEAKER: As long as [INAUDIBLE].
MALE SPEAKER: Just keep going.
YOSHUA BENGIO: OK.
[INAUDIBLE].
[INTERPOSING VOICES]
[LAUGHING]
YOSHUA BENGIO: OK, so I [INAUDIBLE] here a connection
between those denoising auto-encoders and the manifold
learning idea that I was mentioning earlier.
So how do these algorithms discover the manifolds-- the
regions where the configurations of the
variables are plausible-- where the distribution
concentrates?
So, we're back on the same picture as before.
So these are our examples.
And what we're trying to do is to learn a representation.
So a mapping from the input space [INAUDIBLE] here that we
[INAUDIBLE] to a new space, such that we can essentially
recover the input-- in other words, we don't lose
information.
But at the same time because of the denoising part,
actually, you can [? show that ?] what this is
trying to do is throw away all the information.
So it seems crazy: you want to keep all the
information, but you also want to throw away all the
information.
But there's a catch.
Here, you want to only be able to
reconstruct these examples--
not necessarily any configuration of inputs.
So you're trying to find the function which will preserve
the information for these guys.
In other words, it's able to reconstruct them.
It's the identity function,
but only when it's applied on these guys.
But when you apply it in other places, it's allowed to do
anything it wants.
And it's also learning this [? new ?] function.
So in order to do that, let's see what happens.
Let's consider a particular point here--
particular example.
It needs to distinguish this one from its neighbor.
In the representation, [INAUDIBLE].
The representation you learn from that guy has to be
different enough from that guy that we can actually recover
and distinguish this one from this one.
So we can learn an inverse mapping, an approximate
inverse mapping, from the representation.
So that means you have to have a representation which is
sensitive to changes in that direction.
So when I move slightly from here to here, the
representation has to change slightly as well.
On the other hand, if I move in this direction, then the
representation doesn't need to capture that.
It could be constant as I move in that direction.
In fact, it wants to be constant in all directions.
But what's going to happen is it's going to be constant in
all directions except directions that it actually
needs to reconstruct the data and in this way, recover the
directions that are the derivatives of this
representation function.
And you recover the directions of the manifold-- the
directions where if I move in this direction, I still stay
in a region of high
probability.
That's what the manifold really means.
So we can get rid of this direction.
And recently, we came up with an algorithm that you can use
to sample from the model.
So if you have an understanding of the manifold
as something that tells you at each point, these are the
directions you're allowed to move in
so as to stay in high probability [INAUDIBLE]--
so these are the directions that keep you tangent to
the manifold-- then basically, the algorithm goes, well, we
are at a point.
We move in the directions that our algorithm discovered to be
good directions of change-- plausible
directions of change.
And that might correspond to something like taking an image
and translating it or rotating it or doing
something like removing part of an image.
And then projecting back towards the manifold-- it
turns out that the reconstruction
function does that.
And then iterating that random walk
to get samples from the model.
And we applied this to modeling faces and digits.
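The move-then-project loop can be sketched on a manifold where everything is known in closed form. The sketch assumes the manifold is the unit circle, so the tangent direction and the projection are written by hand; in the algorithm described in the talk both would come from the learned model (the tangent directions from the representation, the projection from the reconstruction function). All names and the step size are hypothetical.

```python
import math
import random

random.seed(0)

def tangent(point):
    """Direction along the unit circle at this point (known here, learned in the talk)."""
    x, y = point
    return (-y, x)

def project(point):
    """Pull a point back onto the unit circle; stands in for the reconstruction step."""
    x, y = point
    r = math.hypot(x, y)
    return (x / r, y / r)

def sample_walk(start, steps=200, step_size=0.3):
    """Random walk: step along the tangent, then project back onto the manifold."""
    point, samples = start, []
    for _ in range(steps):
        tx, ty = tangent(point)
        move = random.gauss(0.0, step_size)          # random move along the manifold
        moved = (point[0] + move * tx, point[1] + move * ty)
        point = project(moved)                       # return to high probability
        samples.append(point)
    return samples

samples = sample_walk((1.0, 0.0))
```

Every sample stays on the circle while the walk gradually explores it, which is the behavior wanted from a sampler that only moves in plausible directions of change.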
Now, let's come back to this question of what is a good
representation?
People in computer vision have used this
term invariance a lot.
And it's a word that's used a lot when
you handcraft features.
So, remember, at the beginning, I said the way most
of machine learning is applied is you take your raw data, and
you handcraft features based on your knowledge of what
matters and what doesn't matter.
For example, if your input is images, you'd like to design
features that are going to be insensitive to
translations of your input.
Because typically, the category you're trying to
detect should not depend on a small translation.
So this is the idea of invariant features.
But if we want to do unsupervised learning, where
no one tells us ahead of time what matters and what doesn't
matter-- what the task is going to be, then how do we
know which invariance matters?
For example, let's say we're doing speech recognition.
Well, if you're doing speech recognition, then you want to
be invariant to who the speaker is.
And you want to be invariant to what kind of microphone it
is and what's the volume of the sound.
But if you're doing speaker identification, then you want
to be invariant to what the person says and you want to be
very sensitive to the identity of the person.
But if someone gives you speech,
and you don't know if it's going to be used for
recognition of [INAUDIBLE] or for recognizing people, what
should you do?
Well, what you should be doing is learning
to disentangle factors--
basically, discovering that in speech, the things that matter
are the [INAUDIBLE], the person, the
microphone, and so on.
These are the factors that you'd like to discover
automatically.
And if you're able to do that, then my claim is you can
essentially get around the curse of dimensionality.
You can solve very hard problems.
There's something funny that happens with the deep learning
algorithms I was talking about earlier, which is that if you
train these representations from purely unsupervised
learning, you discover that the features-- the
representation that they find have some form of
disentanglement--
that some of the units in the [INAUDIBLE]
are very sensitive to some of the underlying factors.
And they're very sensitive to one factor and very
insensitive to other factors.
So this is what disentangling is about.
But no one told these algorithms what those factors
would be in the first place.
So something good is happening.
But we don't really understand why.
And we'd like to understand why.
One of the things that you see in many of these algorithms is
the idea of so-called sparse representations.
So what is that?
Well, up to now, I've talked about representations as just
a bunch of numbers that we associate to an input.
But one thing we can do is learn representations
[? that have ?]
[? the property-- ?]
that many of those numbers happen to be
zero or some constant--
[? other value. ?]
But zero is very convenient.
And it turns out, when you do that, it helps a lot, at least
for some problems.
So that's interesting.
And I conjecture that it helps us disentangle the underlying
factors in the problems where basically--
for any example, there are only a few concepts and
factors that matter.
So in the scene that I see right now that comes to my
eyes, of all the concepts that my brain knows about, only a
few are relevant to this scene.
And it's true of almost any input
that comes to my sensors.
So it makes sense to have representations that have this
property as well-- that even though we have a large number
of possible features, most of them are sort of not
applicable to the current situation.
Not applicable, in this case, zero.
So just by forcing many of these features to output not
applicable, somehow we're getting better
representations.
This has been used in a number of papers.
And we've used it with so-called rectifier neural
networks, in which the unit [? compute ?] a function like
this on top of the usual linear
transformation they perform.
And the result is that this function [INAUDIBLE]
0 [INAUDIBLE].
So when x here is some weighted sum from the previous
layer, what happens is either the output is a positive real
number or the output is 0.
So let's say the input was a sort of random centered around
0, then half of the time, those features would output 0.
And if you just learn to shift this a little bit to the left,
then you know--
80% of the time or 95% of the time, the output will be 0.
So it's very easy to get sparsity with these kinds of
[INAUDIBLE].
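The sparsity argument above can be checked numerically. The following is a minimal sketch, assuming the pre-activation is a standard normal variable and that "shifting to the left" corresponds to adding a negative bias; the function names and the specific bias values are illustrative and do not come from the talk.

```python
import numpy as np

def rectifier(x):
    """Rectifier nonlinearity: pass positive values through, output 0 otherwise."""
    return np.maximum(0.0, x)

def fraction_of_zeros(bias, n_samples=100_000, seed=0):
    """Fraction of exactly-zero outputs when the pre-activation is N(0, 1) plus a bias."""
    rng = np.random.default_rng(seed)
    pre_activation = rng.standard_normal(n_samples) + bias
    return float(np.mean(rectifier(pre_activation) == 0.0))

# A bias of 0 gives about half zeros; more negative biases give sparser outputs.
for bias in (0.0, -0.84, -1.64):
    print(f"bias={bias:+.2f}  fraction of zeros={fraction_of_zeros(bias):.2f}")
```

With these assumed biases the fractions come out at roughly one half, 80%, and 95%, matching the figures quoted in the talk.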
It turns out that these [INAUDIBLE] are sufficient to
learn very complicated things.
And that was used in particular in a really
outstanding system built by Alex Krizhevsky and
[INAUDIBLE] and [INAUDIBLE]
with Geoff Hinton in Toronto recently where they obtained
amazing results on one of the benchmarks that computer
vision people really care about [INAUDIBLE]
with [? 1,000 ?] classes.
So this contains millions of images taken from Google Image
search and 1,000 classes that you're trying to classify.
So these are images like this.
And there are 1,000 different categories you want to detect.
And this shows some of the outputs of this model.
[INAUDIBLE] obviously doing well.
And they managed to bring the state-of-the-art from making
small incremental changes from say 27% to 26% down to 17% on
this particular benchmark.
That's pretty amazing.
And one of the tricks they used--
I've been doing publicity for--
is called dropouts.
And Geoff Hinton is speaking a lot about this this year.
Next year will be something else.
And it's a very nice trick.
Basically, the idea is add some kind of randomness in the
typical neurons we use.
So you'd think that randomness hurts, right?
So if we learn a function like, you know-- say, thinking
about the brain doing something.
If you had noise in the computations of the brain,
you'd think it hurts.
But actually, when you do it during training, it helps.
And it helps for reasons that are yet to be completely
understood.
But the theory is it prevents the features you learn from
depending too much on the presence of the others.
So, half of the features will be turned off by this trick.
So the idea is you take the output of a neuron, and you
multiply it by 1 or 0, [INAUDIBLE]
probability one-half.
So you turn off half of the features [INAUDIBLE].
We do that for all the layers.
And [INAUDIBLE] at this time, you don't do
this kind of thing.
You just multiply it by [INAUDIBLE].
So it averages the same thing.
But what happens is that during training, the features
learn to be more robust and more independent of each other
and collaborate in a less fragile way.
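The trick described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the system mentioned in the talk. The keep probability of one-half comes from the talk; scaling by the same factor at test time is an assumption here, since the transcript is inaudible at that point, chosen so that the average output matches between training and test.

```python
import numpy as np

def dropout_layer(activations, rng, keep_prob=0.5, training=True):
    """Apply dropout to one layer's outputs.

    Training: multiply each unit by an independent 0/1 mask.
    Test: multiply by keep_prob instead (assumed), so the expected output matches.
    """
    if training:
        mask = rng.random(activations.shape) < keep_prob
        return activations * mask
    return activations * keep_prob

rng = np.random.default_rng(0)
h = np.ones((10_000, 20))  # toy layer outputs, all equal to 1
train_out = dropout_layer(h, rng, training=True)
test_out = dropout_layer(h, rng, training=False)
print(train_out.mean(), test_out.mean())  # both close to 0.5
```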
This is actually similar to the denoising auto-encoder I
was talking about earlier where we introduce corruption
noise in the input.
But here, you do it at every layer.
And somehow, this very simple trick helps
a lot in many contexts.
So they've tested it on different benchmarks.
These are three image data sets and also in speech.
And in all cases, they've seen improvements.
Let's get back to the representation learning
algorithms.
Many of them are based on learning one layer of
representation at a time.
And one of the algorithms that has been very [? practical ?]
for doing that is called a Restricted
Boltzmann Machine, or RBM.
And as a probability model, it's formalized this way.
Basically, we're trying to model this [INAUDIBLE]
of the vector--
x-- which is a [INAUDIBLE] vector of [? bits ?]
typically.
But it could be real numbers.
And we introduce a vector of [? bits ?]
[? h. ?]
And we consider the joint distribution of these vectors
[INAUDIBLE] formula.
We're trying to find the parameters in the formula--
the b, the c, and the w.
So that P of x is as large as possible.
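The formula shown on the slide is not captured in the transcript. For reference, a binary RBM is commonly written in the following form; which of b and c is attached to x and which to h is an assumption here, as is the orientation of W.

```latex
P(x, h) = \frac{1}{Z} \exp\left( b^\top x + c^\top h + h^\top W x \right),
\qquad
Z = \sum_{x', h'} \exp\left( b^\top x' + c^\top h' + h'^\top W x' \right),
\qquad
P(x) = \sum_{h} P(x, h).
```

Training then adjusts b, c, and W to make P(x) large on the training examples.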
To train this model, which has been very popular for
deep learning, you need to sample from the model.
In other words, the model represents a
distribution.
And you'd like to generate examples according to what the
model thinks is plausible.
And in principle, there are ways to do that.
You can do things like Gibbs sampling.
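As a rough sketch of what Gibbs sampling looks like for a binary RBM: the chain alternates between sampling the hidden bits given the visible ones and the visible bits given the hidden ones. The parameter shapes, the assignment of b to the visible units and c to the hidden units, and the toy random weights below are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(x, W, b, c, rng):
    """One block Gibbs step for a binary RBM: sample h given x, then x given h."""
    p_h = sigmoid(c + W @ x)            # P(h_j = 1 | x)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    p_x = sigmoid(b + W.T @ h)          # P(x_i = 1 | h)
    x_new = (rng.random(p_x.shape) < p_x).astype(float)
    return x_new, h

def gibbs_chain(x0, W, b, c, n_steps, seed=0):
    """Run the chain and return the sequence of visible samples."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    samples = []
    for _ in range(n_steps):
        x, _ = gibbs_step(x, W, b, c, rng)
        samples.append(x)
    return np.array(samples)

# Toy, untrained parameters (hypothetical sizes).
n_visible, n_hidden = 6, 3
init_rng = np.random.default_rng(1)
W = 0.1 * init_rng.standard_normal((n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
samples = gibbs_chain(np.zeros(n_visible), W, b, c, n_steps=100)
print(samples.shape)  # (100, 6)
```

Each new sample differs from the previous one only through one round of local resampling, which is why the chain moves by small steps.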
However, what we and others have found is that these
[? sampling ?] algorithms based on the particular
algorithm [INAUDIBLE] chain methods have some quirks.
They don't do exactly what we'd like.
In particular, we say they don't mix well.
So what does that mean?
For example, if you start a chain of samples--
so you're going to create a sequence of samples by making
small changes-- that's what Monte Carlo Markov chain
methods do--
well, it turns out that you get chains like this where it
stays around the same kind of examples.
So it doesn't move to a new category.
So you'd like your sample or your [INAUDIBLE] algorithm to
be able to visit all the plausible configurations and
be able to jump from one region in configuration space
to another one, or at least have a chance to visit all the
places that matter.
But there's a reason why this is happening.
And I'm going to try to explain it in a picture.
So, first of all, as I mentioned, MCMC methods move
in configuration space by making small steps.
You start from a configuration of the examples.
Let's say where I'm standing is a configuration of x,y
coordinates.
And I'm going to make small steps, such that if I'm in a
configuration of high probability, I'm going to move
to another high probability configuration.
And if I'm in a low probability configuration, I'm
going to tend to move to a neighboring high probability
configuration.
In this way, you stay in sort of high probability regions.
And you, in principle, can visit the whole distribution.
But you can see there's a problem.
If this white thing here is highly probable,
and the black thing there is highly probable, and the gray
stuff in the middle is very implausible, how could I
possibly make small moves to go from here to here from the
white to the black?
So this is illustrating the picture like this.
In this case, this is the density.
OK-- so the input is representing different
configurations of the variables of interest.
And this is what the model thinks the
distribution should be.
So it gives high probability to some places.
So these are what we call modes.
It's a region that has a peak.
And this is another mode.
So this is [INAUDIBLE] two modes.
The question is can we go from mode to mode and make sure to
visit all the modes?
And the MCMC is making small steps.
Now, [INAUDIBLE]
is the MCMC can go through these [INAUDIBLE] regions.
And they have enough probability--
it can move around here and then quickly go through these
and do a lot of steps here and go back in this way-- assemble
all the configurations with [INAUDIBLE]
probability.
The problem is--
remember, I said in the beginning, that there's this
geometry where the [INAUDIBLE]
problems have this property--
that the things we care about-- the images occupy a
very small volume in the space of configurations of pixels.
So the right distribution that we're trying to learn is one
that has very big peaks where there's a lot of probability.
And in most other places, the probability will be tiny,
tiny, tiny--
exponentially small.
So, we're trying to make moves between these modes.
But now these modes are separated
by vast empty spaces--
deserts of probability where it's impossible to cross
unless you make huge jumps.
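The mixing problem described above can be reproduced on a toy example. The sketch below runs a random-walk Metropolis sampler on a one-dimensional mixture of two Gaussians and counts how often the chain moves from one mode to the other. The density, the step size, and the mode-counting rule are all assumptions chosen for illustration; they are not the models discussed in the talk.

```python
import numpy as np

def log_density(x, separation):
    """Unnormalized log-density: equal mix of unit Gaussians at -separation and +separation."""
    return np.logaddexp(-0.5 * (x - separation) ** 2,
                        -0.5 * (x + separation) ** 2)

def count_mode_switches(separation, n_steps=20_000, step_size=0.5, seed=0):
    """Run random-walk Metropolis from the left mode and count visits to a new mode."""
    rng = np.random.default_rng(seed)
    x = -float(separation)
    last_mode = -1
    switches = 0
    for _ in range(n_steps):
        proposal = x + step_size * rng.standard_normal()
        log_accept = log_density(proposal, separation) - log_density(x, separation)
        if np.log(rng.random()) < log_accept:
            x = proposal
        # A mode counts as visited when the chain is within 1 of its centre.
        if abs(x - separation) < 1.0:
            current = 1
        elif abs(x + separation) < 1.0:
            current = -1
        else:
            current = last_mode
        if current != last_mode:
            switches += 1
            last_mode = current
    return switches

close = count_mode_switches(separation=1.5)   # modes nearly touching
far = count_mode_switches(separation=10.0)    # modes separated by a "desert"
print(close, far)
```

With nearby modes the chain crosses back and forth many times. With well-separated modes the same sampler stays in the mode where it started, because every path between the modes passes through configurations of vanishing probability.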
So that's a really big problem.
Because when we consider algorithms like the RBM,
what's going on is for learning, we need to sample
from the model.
Initially, when the model starts learning, it says I
don't know anything.
I'm assigning a kind of uniform [? probability ?] to
everything.
So the model thinks everything is uniform-- the probability
is the same for everything.
So we seem to move everywhere.
And as it keeps learning, it starts developing these
peaks-- these modes.
But still there's a way to go from mode to mode and go
through reasonably probable configurations.
As learning becomes more advanced, you
see these peaks emerge.
And now, it becomes impossible to cross.
Unfortunately, we need the sampling to learn these
algorithms.
And so there's a chicken and egg problem.
We need the sampling.
But if the sampling doesn't work well, the learning
doesn't work well.
And so we can't make progress.
So the one thing I wanted to talk about is a direction of
solution for this problem that [? we are ?] starting to
explore in my lab.
And it involves exploiting--
guess what--
deep representations.
The idea is instead of doing these steps in the original
space of the inputs where we observe things, if we did the
MCMC in abstract, high level representation, maybe things
would be easier.
So let's consider, for example, something [INAUDIBLE]
images and digits.
If we had a machine that had discovered that the factors
that matter for these images is something like, OK, is the
background black and foreground
white or vice versa.
So this is one [? bit ?] that says flip black and white.
And is the category zero, one, two, three, so that's just 10
[? bits ?] that tell us what the category is.
And what's the position of the digit in the image.
So these are high level factors that you could imagine
that learned or discovered.
And if it would discover these things, then if you represent
the image in that space, the MCMC would be much easier.
In particular, you could go from, say, one of these guys
to these guys directly simply because maybe these are the
zeros and these are the threes.
And there's one bit that allows you to flip-- or two
bits that allow you to flip from zero to three.
So in the space where my representation has
[? a big fuzzy ?] row of [INAUDIBLE] three--
I just need to flip two bits.
And that's easy.
It's a small move in that space.
In a space of abstract representations, it's easy to
generate data whereas the original space it's difficult.
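A toy example makes the contrast concrete. The renderer below is hypothetical: it draws a one-dimensional "image" from two factors, a position and a flip bit that swaps foreground and background. Changing the single flip bit is a one-step move in factor space but changes every pixel in the original space.

```python
import numpy as np

def render(flip, position, width=16, stroke=3):
    """Hypothetical renderer: a bright stroke on a dark background, optionally inverted."""
    image = np.zeros(width)
    image[position:position + stroke] = 1.0
    return 1.0 - image if flip else image

a = render(flip=0, position=5)
b = render(flip=1, position=5)

pixels_changed = int(np.sum(a != b))
factors_changed = 1  # only the flip bit differs
midpoint = 0.5 * (a + b)  # halfway point of a straight line in pixel space

print(pixels_changed, factors_changed)  # 16 1
print(np.allclose(midpoint, 0.5))       # True: a uniform gray image
```

The halfway point of a straight path in pixel space is a uniform gray image that looks like neither example, whereas in factor space there is no intermediate configuration to pass through at all.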
One way to see this visually is to interpolate between
examples at different levels of representation.
So this is what we've done.
So if you look in the pixel space, and you interpolate
between this line-- this picture of a line--
this picture of [INAUDIBLE]
[? linear ?] interpolation, what you see is between the
image of a line and the image of a three, you have to go in
between through images that don't look like anything--
I mean, don't look like a digit-- the things that the
model has seen-- and so the MCMC--
if it [? walks-- ?]
oh, that's a plausible thing.
Oh, no, this is not very plausible.
This is worse.
[? I'm ?] coming back.
So it's never going to go on the other side of this desert
of probability.
So this is a one-dimensional [INAUDIBLE] of the example I
was trying to explain.
Now, what you see with the other two lines is the same
thing but at different levels of representation that has to
do with [INAUDIBLE] using unsupervised learning.
And what you see is that it has learned a representation
which has kind of skewed the space so that somehow, I can
make lots of small changes and stay in three.
And suddenly, just a few pixels flip and it becomes a
nine magically.
And so I don't have to stay very long in the row
[INAUDIBLE] region.
In fact, it's not even so implausible.
And you can move--
all of these moves are rather plausible.
So you can smoothly move from mode to mode without having to
go through these low probability regions.
So it's not like it actually discovered these actual bits.
But it's discovered something that makes the
job of sampling easier.
And we've done experiments to validate that.
So the general idea is instead of sampling in the original
space, we're going to learn these representations and then
do our sampling which iterates between representations in
that high level space.
And then once we've found something in that abstract
representation, because we have inverse mappings, we can
map back to the input space and get, say, the digit we
care about or the face we care about.
So we've applied this, for example, with face images.
And what we find is that, for example, the red here uses a
deep representation.
And the blue here uses a single layer RBM.
And what we find is that for the same number of steps it
can visit more modes-- more classes--
by adding a deeper representation.
So I'm almost done.
What I would like you to keep in mind is that machine
learning involves really interesting [INAUDIBLE]
challenges that have a geometric nature in
particular.
And in order to face those challenges, something we found
very useful is to allow machines to look for
abstraction-- to look for high level
representations of the data.
And I think that we've only scratched the
surface of this idea--
that the algorithms we have now discover still rather low
level abstractions.
There's a lot more that could be done if we were able to
discover an even higher level of abstractions.
Ideally, what the high level abstractions do is
disentangle--
separate out--
the different underlying factors that explain the
data-- the factors we don't know but we'd like the machine
to discover.
Of course, if we know the factors, we can somehow cheat
and give back to the machine by telling it, here are some
random variables that we know about.
And here are some values of these variables in such and
such setting.
But you're not going to be able to do that for
everything.
We need machines that can make sense of the world by
themselves to some extent.
So we've shown that more abstract representations give
rise to successful transfers-- so being able to generalize
to new domains, new languages, new classes.
And one [INAUDIBLE]
[? computation ?] is using these tricks.
I'm done.
Thank you very much.
[APPLAUSE]
YOSHUA BENGIO: Before I take questions, I would like to
thank members of my team, so [INAUDIBLE].
[INAUDIBLE] of course very fortunate to have for my work.
And I'm open to questions.
Yes?
There's a microphone.
AUDIENCE: Hi.
So the problem you mentioned with the Gibbs sampling--
isn't that easily solved by [INAUDIBLE]?
YOSHUA BENGIO: No.
We've tried that.
And [INAUDIBLE].
AUDIENCE: [INAUDIBLE]
be maybe [INAUDIBLE]--
YOSHUA BENGIO: So the problem-- what happens is if
you [INAUDIBLE] restart, what typically can happen is your
walk is going to bring you to a few of the modes--
always the same ones.
And so you're not going to visit everything.
AUDIENCE: Well, why would it bring you
always to the same mode?
YOSHUA BENGIO: Because it's like a [? dynamical ?] system.
So somehow most [? routes ?] go [? to Rome ?]
for some reason.
[INAUDIBLE].
AUDIENCE: Well, [? if ?]
[? you ?]
[? did ?]
[? say ?]
Paris, you'll go to Paris.
YOSHUA BENGIO: Most routes go to a few big cities.
That's what happens in these algorithms.
We obviously tried this because it's-- you think well,
that should work.
Uh--
and there's a question right here.
AUDIENCE: So I was just wondering about
sampling from the model.
I thought that was interesting the way you had different
levels of abstraction.
You could sort of bridge that space.
So you're able to visit more classes.
But does that make the sort of distinction
between those classes?
It seems you're bringing those classes closer together.
So in terms of [INAUDIBLE]--
YOSHUA BENGIO: We had trouble understanding what
was going on there.
Because you'd think if you're--
[INAUDIBLE]--
think if somehow we made the threes and the nines closer,
it should now be harder to discriminate between them.
Whereas before, we had this big empty region where we
could put our separator.
So I don't have a complete answer for this.
But it's because we are working with these high
dimensional spaces that these are not necessarily as our
intuition would suggest.
What happens really is that the manifold--
so the regions where, say, the threes are is a really
complicated, curvy surface in high dimensional space.
And so is the nine.
And in the original space, these curvy spaces are
intertwined in complicated ways, which means the machine has
a hard time separating between nines and threes, even though
there's lots of spacing between nines and threes.
It's not like you have nines here and threes here.
It's a complicated thing.
And what we do when we move to these high level
representations is flatten those surfaces.
And now if you interpolate between points, you're kind of
staying in high probability configurations.
So this flattening also means that it's
easier to separate them.
Even though they may be closer, it's easier because
you [? need ?] a [? simpler ?]
surface is enough.
This is a conjecture I'm making.
I haven't actually seen [INAUDIBLE]
spaces like this.
This is what my mind makes of it.
AUDIENCE: So my understanding is that you don't even include
[INAUDIBLE] algorithm of what the representations
[INAUDIBLE].
YOSHUA BENGIO: We would like to have algorithms that can
learn from as little [? cue ?] as possible.
AUDIENCE: So on some of your [? research ?]
[INAUDIBLE] my understanding is that you worked on
[INAUDIBLE]
[? commission. ?]
YOSHUA BENGIO: Yes.
AUDIENCE: Did you try to understand what were the
representations that the algorithms produced?
YOSHUA BENGIO: Yes.
AUDIENCE: And did you try from that to understand if there
were units where using [INAUDIBLE] similar
representations of your representations would be
different between these [INAUDIBLE]
units?
YOSHUA BENGIO: Well, first of all, we don't
know what humans use.
But we can guess, right?
So, for example, I showed early on the results of
modeling faces from Stanford--
and other people have found the same, but
I'll use their picture--
that these deep learning algorithms discover
representations that, at least for the first two levels, seem
to be similar to what we see in the visual cortex.
So if you look at [? V1, ?]
[INAUDIBLE]
the major first area of the visual cortex where
[INAUDIBLE]
[? arrive, ?]
you actually see neurons that detect or are sensitive to
exactly the same kinds of things.
And in the second layer--
and [? V2 ?] is not a layer really-- it's an area--
well, you don't see these things.
But you'll see combinations of images.
And [INAUDIBLE] same group has actually compared what
the neurons in the brain seem to like according to
neuroscientists with what the machine learning models
discover and they found some similarities.
Other things you can do that I mentioned quickly was--
in some cases, we know what the factors are.
We know what humans will look for.
So you can just try to correlate the features that
have been learned with the factors you know humans think
are important.
So we've done that in the work here with [INAUDIBLE].
That's not the one I want to show you.
It is the right guys, but--
sorry.
Yes-- this one.
So for example, we've trained models on the problem of
sentiment analysis where you give them a sentence and you
try to predict whether the person liked or didn't like
something, like a book or a video or a [INAUDIBLE]--
something--
something you find on the web.
And what we found is that when we use purely unsupervised
learning-- so it doesn't know that the job is to present
[INAUDIBLE] analysis.
Some of the features specialize on the sentiment.
Is this [INAUDIBLE] more positive or more negative kind
of statement.
And some features specialize on the domain.
Because we train this across 25 different domains.
So some basically detect or are highly correlated with--
is this about books-- is this about food-- is this about
videos-- is this about music?
So these are underlying factors we know are
present in the data.
And we find [INAUDIBLE]
features tend to specialize toward these things, much more
than the original.
AUDIENCE: [SPEAKING FRENCH]
YOSHUA BENGIO: So I'm going to summarize your question in
English and answer in English [INAUDIBLE].
So my understanding of the question is, could we just not
use our prior knowledge to build in the property and
structure [INAUDIBLE] representations and so on?
And my answer to this is of course because
this is what we do.
And this is what everybody in machine learning does.
And it's especially true in computer vision where we use a
lot of prior knowledge of our understanding
of how vision works.
But my belief is that it's also interesting to see
whether machines could discover these things by
themselves.
Because if you have algorithms that can do that, then we
can-- we can still use our prior knowledge.
But we can discover other things that we didn't know or
that we were not able to formalize.
And with these algorithms, actually, it's not too
difficult to put in prior knowledge.
You can add [INAUDIBLE]
variables that correspond to the things you know matter and
you can put extra terms in the [INAUDIBLE] model that
correspond to your prior knowledge.
You can do all these things.
And some people do that.
If I work with an industrial partner, I'm going to use all
the knowledge that I have because I want to have
something that works in the next six months.
But if I want to solve AI, I think it's worth it to explore
more general purpose methods.
It could be combined with the prior knowledge.
But to make the task of discovering these general
purpose methods easier and focus on this aspect, I find
it interesting to actually avoid lots of very specific
human knowledge.
Although, different researchers look at this
differently.
AUDIENCE: [SPEAKING FRENCH]
YOSHUA BENGIO: You can do all kinds of things.
There's lots of freedom to combine our knowledge and
there are different ways.
And it's a subject of many different papers--
to combine our knowledge with learning.
So we can do that.
Sometimes, though, when you put too much prior knowledge,
it hurts the [INAUDIBLE].
AUDIENCE: So how much does the network [INAUDIBLE] matter?
And isn't that kind of the biological priors you were
talking about in the very early slide?
And isn't that the new kind of feature
engineering of deep learning--
figuring out the right [INAUDIBLE].
YOSHUA BENGIO: So the answer is no to
both of these questions.
[INAUDIBLE].
For example, the size of those layers doesn't matter much.
It matters in the sense that they have to be big enough to
capture the data.
And regarding the biology-- what-- the--
AUDIENCE: Yeah, because I was thinking, right--
our brain comes wired in certain ways because we have
the visual cortex and [INAUDIBLE] that.
So--
YOSHUA BENGIO: There might be things we will learn that
we'll be able to exploit that might be generic
enough in the brain.
And we're on the lookout for these things.
MALE SPEAKER: We'll take two more questions.
AUDIENCE: So I'm interested in the results that [INAUDIBLE]
variations [INAUDIBLE].
So that reminds me of [INAUDIBLE]
we try to avoid [INAUDIBLE].
But [INAUDIBLE] come with a lot of [INAUDIBLE]
scheduling and such.
So [INAUDIBLE].
YOSHUA BENGIO: No--
I mean, you can see this trick is so simple-- like in
[INAUDIBLE]
one nine.
So there's nothing complicated in this particular trick.
You just add noise in a very dumb way.
And somehow, it makes these networks more robust.
AUDIENCE: Well, [INAUDIBLE].
YOSHUA BENGIO: It's different from [INAUDIBLE].
It doesn't serve the same purpose.
AUDIENCE: You're also not reducing [INAUDIBLE]--
AUDIENCE: [INAUDIBLE].
YOSHUA BENGIO: Sorry?
AUDIENCE: When you're trying to [INAUDIBLE].
energy [INAUDIBLE].
YOSHUA BENGIO: Yeah-- so you're trying to minimize the
error under the [INAUDIBLE] created by this [INAUDIBLE].
So there is [INAUDIBLE].
AUDIENCE: And it's a very rough [? connection ?]
at this point.
But then for [INAUDIBLE], it's not just [? two ?]
random things.
You also have temperature decreasing and the end of the
[INAUDIBLE].
YOSHUA BENGIO: So, here, it's not happening in the space of
parameters that we are trying to optimize.
So in [INAUDIBLE] trying to do some optimization.
Here, the noise is used as a regularizer, meaning it's
injecting sort of a prior that your neural net should be kind
of robust to half of it becoming dead.
It's a different way of using [INAUDIBLE].
FEMALE SPEAKER: And the last question.
AUDIENCE: In the beginning, you were talking about priors
that humans have in their brain.
YOSHUA BENGIO: Absolutely.
AUDIENCE: OK-- so how do you exploit
that in your algorithm?
Or do you exploit that?
So it's a general question [INAUDIBLE].
It's not so precise.
YOSHUA BENGIO: Well, I could speak for hours about that.
AUDIENCE: Yeah, OK-- but how do you exploit that in your
algorithms?
Or what kind [INAUDIBLE]?
YOSHUA BENGIO: Each prior that we consider is basically
giving rise to a different answer to your question.
So in the case of the sparsity prior I was mentioning, it
comes about simply by adding a term in the [INAUDIBLE] adding
a prior explicitly.
That's how we do it.
In the case of the prior that--
I mentioned the prior that input distribution tells us
something about the task.
And we get it by combining unsupervised learning and
supervised learning.
But in the case of the prior that--
there are different abstractions that matter--
different levels of abstraction in the world
around us, we get the prior by adding a structure in the
model that has these different methods of representation.
There are other priors that I didn't mention.
For example, one of the priors is [INAUDIBLE]
study as the constancy prior or the [? slowness-- ?]
that the factors that matter that explain the world around
us, some of them change slowly over time.
The set of people in this room is not changing very
quickly right now.
It's a constant over time.
Eventually, it will change.
But there are properties of the world around us that
remain the same for many time steps.
And this is a prior, which you can also incorporate in your
model by changing the training criteria to say something like
well, some of the features in your representation should
stay the same from [? team to ?]
[? team ?]
[INAUDIBLE].
This is very easy to put in and has been done and
[INAUDIBLE].
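One way such a term could look, as a minimal sketch: penalize the squared change of the features between consecutive time steps and add that penalty to whatever training criterion is already in use. The function names and the weight of 0.1 are hypothetical. To constrain only some of the features, as described above, one would pass just those columns.

```python
import numpy as np

def slowness_penalty(features):
    """Mean squared change between consecutive time steps.

    features: array of shape (n_time_steps, n_features).
    """
    diffs = np.diff(features, axis=0)
    return float(np.mean(diffs ** 2))

def total_loss(task_loss, features, slowness_weight=0.1):
    """Existing training criterion plus the weighted slowness term."""
    return task_loss + slowness_weight * slowness_penalty(features)

constant = np.ones((50, 4))                      # features that never change
rng = np.random.default_rng(0)
noisy = rng.standard_normal((50, 4))             # features that change at every step

print(slowness_penalty(constant))                            # 0.0
print(slowness_penalty(noisy) > slowness_penalty(constant))  # True
```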
So each prior--
you can think of a way to incorporate it.
Basically, it's changing the structure or changing the
training criteria usually, is the way we do it.
AUDIENCE: And which kind of prior do you
think we humans have?
I guess it's a very complex question.
[INTERPOSING VOICES].
YOSHUA BENGIO: Basically, the question you're asking is the
question I'm trying to answer.
So I have some guesses.
And I mentioned a few already.
And our research is basically about finding out what these
priors are that are generic and work for many tasks in the
world around us.
AUDIENCE: OK.
Thank you.
YOSHUA BENGIO: Welcome.
MALE SPEAKER: [INAUDIBLE] and ask one last question.
As some of our practitioners are not machine learning
experts, do you think it's worthwhile for us to learn a
bit about machine learning--
YOSHUA BENGIO: Absolutely.
[LAUGHTER]
MALE SPEAKER: What's the starting point where we can--
we have a small problem.
We want to spend one month working on this kind of thing.
Where would be the starting point for us?
YOSHUA BENGIO: You have one month?
[LAUGHING]
MALE SPEAKER: So is there a library or something
[INAUDIBLE]?
YOSHUA BENGIO: You should take Geoff Hinton's [INAUDIBLE]
course.
You can probably do it in one month because you're
[INAUDIBLE].
AUDIENCE: There was also another one by Andrew
[INAUDIBLE].
YOSHUA BENGIO: Yes.
Andrew [INAUDIBLE] has a very good course.
And there are more and more resources on the web to help
people get started.
There are libraries that people share, like the library
from my lab that you can use to get started quickly.
There are all kinds of resources like that.
AUDIENCE: [INAUDIBLE] to it on a standard computer, or you
need a cluster or--
YOSHUA BENGIO: It helps to have a cluster for training.
Once you have the trained model, usually you can run it
on your laptop.
And the reason you need a cluster is that these
algorithms have [INAUDIBLE]
many [? knobs ?]
in the set.
And you want to explore many configurations of these knobs.
But actually training one model can be
on a regular computer.
It's just that you want to try many
configurations of these knobs.
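To give a sense of why the number of configurations grows quickly, here is a small sketch that enumerates every combination of a few knobs. The knob names and values are hypothetical and were not mentioned in the talk; each resulting configuration would be one independent training run, which is what a cluster lets you do in parallel.

```python
from itertools import product

# Hypothetical knobs and candidate values.
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "n_hidden": [256, 1024],
    "dropout_keep_prob": [0.5, 0.8],
}

def enumerate_configurations(space):
    """Return one dict per combination of knob values."""
    names = list(space)
    return [dict(zip(names, values))
            for values in product(*(space[name] for name in names))]

configs = enumerate_configurations(search_space)
print(len(configs))  # 12
print(configs[0])
```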
MALE SPEAKER: Thank you very much, Yoshua.
YOSHUA BENGIO: You're welcome.
MALE SPEAKER: --for this interesting talk.
[APPLAUSE]
[MUSIC]