ROSSI LUO: Good afternoon.
Welcome to Brown Biostatistics Seminar.
And I'm Rossi Luo, faculty host for today's event.
And for those of you new to our departmental seminar,
the format is usually a presentation followed by a question and answer session.
And because of the size of the crowd today, we are also going to use this red box thing to capture your questions for the videotaping and also to make sure your questions are heard.
And today I'm very pleased to introduce Professor Yann LeCun.
Professor LeCun is the director of Facebook AI Research, also known as FAIR.
And he is also Silver Professor of computer science, neuroscience, and electrical and computer engineering at New York University.
He's also the founding director of the NYU Center for Data Science.
Before joining NYU, he headed research departments in industry, including at AT&T and NEC.
Professor LeCun has made extraordinary research contributions in machine learning, computer vision, mobile robotics, and computational neuroscience.
Among these, he is a pioneer in developing convolutional neural networks and a founding father of convolutional nets.
And this work contributed to the creation of a new and exploding field in machine learning called deep learning, which is now used as an artificial intelligence tool for a wide range of applications, from images to natural language processing.
And his research contributions have earned him many honors and awards, including election to the US National Academy of Engineering.
Today he will give a seminar titled,
How Can Machines Learn as Efficiently as Animals
and Humans.
I understand some of you actually told me you drove from Boston or other places that are very far away.
So without further ado, let's welcome Professor Yann LeCun
for his talk.
[APPLAUSE]
YANN LECUN: Thank you very much.
It's a pleasure to be here.
A game I play now occasionally when I give a talk here is I
count how many former colleagues from AT&T are in the room.
I count at least two.
Chris Rose here, Michael Littman.
Maybe that's it.
That's pretty good, two.
Right.
So, how can machines learn as efficiently
as animals and humans?
I have a terrible confession to make.
AI systems today suck.
[LAUGHTER]
Here it is in a slightly less vernacular form.
Recently, I gave a talk at a conference at Columbia called the Cognitive Computational Neuroscience conference.
It was the first edition.
And before me, Josh Tenenbaum gave a keynote where he said this.
All of these AI systems that we see now, none of them
are real AI.
And what he means by this is that none of them actually learn things that are as complicated as what humans can learn, or learn things as efficiently as animals seem to learn them.
So we don't have robots that are nearly as
agile as a cat for example.
You know, we have machines that can play Go better than any human.
But that's kind of not quite the same.
And so that tells us there are major pieces of learning that we haven't figured out--
things that animals are able to do that we can't do with our machines.
And so, I'm sort of jumping ahead here and telling you the punch line in advance, which is that we need a new paradigm for learning, or a new way of formulating the old paradigms, that will allow machines to learn how the world works the way animals and humans do.
So the current paradigm of learning
is basically supervised learning.
So all the applications of machine learning,
AI, deep learning, all the stuff you see the actual real world
applications, most of them use supervised learning.
There's a tiny number of them that
use reinforcement learning.
Most of them use some form of supervised learning.
And you know, supervised learning, we all--
I'm sure most of you in the room know what it is.
You want to build a machine that classifies cars from airplanes.
You show an image of a car.
If a machine says car, you do nothing.
If it says airplane, you adjust the knobs on the machine
so that the output gets closer to what you want.
And then you show an example of an airplane.
And you do the same.
And then you keep showing images of airplanes and cars,
millions of them, thousands of them.
You adjust the knobs a little bit every time.
And eventually, the knobs settle on a configuration,
if you're lucky enough, that will distinguish every car
from every airplane, including the ones
that the machine has never seen before.
That's called a generalization ability.
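As a rough sketch, the knob-adjusting loop just described might look like this in PyTorch (the deep learning framework mentioned later in the talk); the tiny network, the car/airplane labels, and the 32x32 image size are made-up choices for illustration, not the actual system being discussed.

    import torch
    import torch.nn as nn

    # Hypothetical two-class setup: label 0 = car, 1 = airplane.
    model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 128), nn.ReLU(), nn.Linear(128, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    def training_step(images, labels):
        logits = model(images)               # what the machine currently says
        loss = loss_fn(logits, labels)       # discrepancy with what we want
        optimizer.zero_grad()
        loss.backward()                      # which way to turn each knob
        optimizer.step()                     # adjust the knobs a little bit
        return loss.item()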
And what deep learning has brought to the table there, in supervised learning, is the ability to build those machines more or less automatically, with very little human input into how the machine needs to be built, except in very general terms.
So the limitation of this is that you have to have lots of data that has been labeled by people.
And to get a machine to distinguish cars from airplanes, you need to show it thousands of examples.
And it's not the case that babies or animals
need thousands of examples of each category
to be able to recognize.
Now, I should say that even with supervised learning,
you could do something called transfer learning, where
you train a machine to recognize lots of different objects.
And then if you want to add a new object category,
you can just retrain with very few samples.
And generally it works.
And so what that says, what that tells
you is that when you train a machine,
you kind of figure out a way to represent the world that
is independent of the task somehow, even though you train
it for a particular task.
So what did deep learning bring to the table?
Deep learning brought to the table
the ability to basically train those machines
without having to hand craft too many modules of it.
The traditional way of doing pattern recognition
is you take an image, and you design a feature extractor that
turns the image into a list of numbers that can be digested
by a learning algorithm, regardless of what
your favorite learning algorithm is,
linear classifiers, [INAUDIBLE] machines, kernel machines,
trees, whatever you want, or neural nets.
But you have to preprocess it in a digestible way.
And what deep learning has allowed us to do
is basically design a learning machine
as a cascade of parametrised modules, each of which
computes a nonlinear function parametrised
by a set of coefficients, and train the whole machine end
to end to do a particular task.
And this is kind of an old idea.
People even in the 60s had the idea
that this would be great to come up
with learning algorithms that would train multilayer systems
of this type.
They didn't quite have the right framework, if you want, nor the right computers for it.
And so in the 80s, something came up
called back propagation with neural nets
that allowed us to do this.
And I'm going to come to this in a minute.
So the next question you can ask of course
is what do you put in those boxes?
And the simplest thing you can imagine as a nonlinear
function, it has to be non-linear,
because otherwise there's no point in stacking boxes.
So the simplest thing you can imagine is take an image,
think of it as a vector, essentially.
Multiply it by a matrix.
The coefficients of this matrix are going to be learned.
And you can think of every row of this matrix being
used to compute a dot product with an input vector.
And that produces basically a weighted sum
of the inputs multiplied by those coefficients.
That gives you another vector.
And you pass each component of this vector through a non-linearity like this one, for example.
Just half-wave rectification.
So you have two different steps.
Linear, nonlinear.
Linear pointwise, nonlinear.
Very simple.
And you can show that by stacking two layers of this,
you can approximate any function you want, as close as you want
as long as you have sufficiently many of these guys
in the middle by tweaking the parameters of the two layers.
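As a rough illustration of the linear / pointwise-nonlinear stacking just described, here is a sketch in PyTorch; the layer sizes are made up.

    import torch
    import torch.nn as nn

    # One "box" is a linear operator (matrix multiply) followed by a pointwise
    # nonlinearity (ReLU, i.e. half-wave rectification). Two such layers with
    # enough hidden units can approximate any function.
    two_layer_net = nn.Sequential(
        nn.Linear(784, 1024),   # each row of the matrix takes a dot product with the input
        nn.ReLU(),              # pointwise nonlinearity
        nn.Linear(1024, 10),
    )
    y = two_layer_net(torch.randn(1, 784))   # a random input vector, just to show the shapes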
But in fact, most functions we're interested in
are more economically represented by many layers.
And so that's the new approach to deep learning, if you want,
that changes from the neural nets of 30 years ago,
which typically had only two or three layers.
The neural nets of today, the deep learning systems of today
have anywhere between 20, 50, or 100 layers.
OK.
So we have linear operators that are
parametrized by coefficients.
And in supervised learning, we're basically going to train it with some sort of objective function that's going to measure the discrepancy between the output the machine produces and the output we want.
And so this objective function is going to be differentiable.
What we're going to do is compute the gradient
of the objective function with respect
to all the parameters in the machine averaged
over a number of training samples.
Or if we use stochastic gradient descent,
averaged over a small batch of training samples,
or even a single sample.
And then take one step in the direction opposite to the gradient using the stochastic gradient update rule.
Basically, the parameters are going
to kind of go down to a minimum in a stochastic fashion
as you train more and more.
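Written out, the stochastic gradient update described above looks something like this hand-rolled sketch; in practice one would use torch.optim.SGD, and model, loss_fn, and data_loader here are stand-ins rather than anything from the talk.

    import torch

    learning_rate = 0.01
    for images, labels in data_loader:             # one small batch of training samples
        loss = loss_fn(model(images), labels)      # objective averaged over the batch
        model.zero_grad()
        loss.backward()                            # gradient of the objective w.r.t. all parameters
        with torch.no_grad():
            for p in model.parameters():
                p -= learning_rate * p.grad        # step in the direction opposite the gradient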
So now the next step you have to do
is compute the gradient of the objective function
with respect to the parameters.
And the way you do this is through back propagation.
I'm not going to go through this.
The mathematical concept on which it's based is incredibly sophisticated.
It's called the chain rule.
[LAUGHTER]
And some people learn this in high school.
And it basically comes down to the fact
that if you have arranged parametrized functions in a graph of computation, which in this case
is a very simple one.
It's just a linear stack of modules.
But it doesn't need to be such a simple graph.
It could be any graph.
And you compute gradients by propagating signals
backwards through this graph.
Basically taking the gradient of some cost
function you want to minimize with respect
to this red variable.
And so this gradient is represented by this green variable.
And multiplying it by the Jacobian of this box, you get the gradient with respect to the input of that box.
This is chain rule.
So it's this guy here.
Gradient with respect to the input
equals gradient with respect to the output
multiplied by Jacobian.
Very easy.
And so you propagate this backwards through the graph.
And the cool thing about this is that you can do this
automatically by having a bunch of modules of this type
that have been predefined.
And you assemble them in a graph.
And then automatically you get a gradient back.
You don't have to figure out how to compute it.
So that's what all of those deep learning frameworks
can allow you to do.
They're very simple to use.
Our favorite one is called PyTorch.
And you know, there are several Jacobians for each of those boxes--
one that propagates gradients to the input, others that propagate them to the parameters.
And that allows you to compute all the gradients
of the objective function, or whatever
you want to minimize with respect to all the parameters.
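A tiny sketch of what that automatic gradient computation looks like in PyTorch, the framework mentioned just above; the shapes and the cost here are arbitrary.

    import torch

    x = torch.randn(4, 10)
    w1 = torch.randn(10, 20, requires_grad=True)    # parameters of the first box
    w2 = torch.randn(20, 1, requires_grad=True)     # parameters of the second box

    h = torch.relu(x @ w1)          # linear followed by a pointwise nonlinearity
    cost = (h @ w2).pow(2).mean()   # some differentiable cost at the output

    cost.backward()                 # chain rule applied backwards through the graph
    print(w1.grad.shape, w2.grad.shape)   # gradients with respect to both sets of parameters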
So, OK, back prop.
That's an old idea.
The basic idea of it actually goes back
to Leibniz and Newton, obviously.
But more recently, the people in optimal control actually have used things like this, called adjoint state methods or adjoint system methods for optimal control, which were invented in the 60s.
That's what NASA used to compute rocket trajectories
and things of that type.
And it wasn't used for learning.
It was used for optimal control.
But it is very similar idea.
So we can think of those variables as being kind of the control variables of a rocket, and this as being kind of the trajectory of the rocket, if you want.
And then people realized you could use this
for learning in the late 70s, early 80s,
but never quite actually made it work.
And it started being used in the late 80s essentially.
And that's when the first wave of neural nets--
or rather the second wave of neural nets-- took off, around 1986, 1987, when people realized you could train multilayer neural nets with this.
And then it died in the 90s, the mid 90s.
OK.
So the next question you can ask is those linear operators
are nice.
But you know, if my image is a long vector with millions
of pixels, I'm not going to multiply it by a matrix that's several million by several million.
So you have to organize those linear operators
in ways that make them practical for things like images
or high dimensional inputs.
That's where the idea of convolutional nets comes in.
It actually doesn't come from sort of theoretical hypotheses.
But it was actually inspired by biology.
So I know there are neuroscientists in the room.
So this is inspired by Hubel and Wiesel, 1962.
Very classical work in neuroscience, Nobel Prize winning work.
There were computational models of these basic ideas by Hubel and Wiesel, and by Fukushima with his neocognitron model, which was the inspiration for convolutional nets.
And the basic idea is that in visual cortex, and this is something you can derive from first principles, it's probably a good idea for images to be able to detect local features by basically having a template that you match with the input.
And you get a score for how well this thing matches with this one, basically a dot product, the weighted sum of those pixels by those coefficients.
And then you sweep this over the image everywhere.
And the results are recorded in something we call a feature map here.
And that operation is a discrete convolution.
But it's very similar to the kind of operation that what are called simple cells in the visual cortex do on images, where a particular neuron in visual cortex is connected to a local neighborhood in the visual field and sort of detects local features as well.
So that's what this first layer is doing.
So these are multiple filters.
These are the convolution kernels, [INAUDIBLE] filters applied to this image to produce those maps.
And then you do what's called a pooling operation where
you take the result, like a local patch
of those results of filtering after the non-linearity.
And you compute an average or a max or L2 norm,
or something like this.
And you subsample the result so that the windows over which you compute this aggregation are shifted by more than one pixel.
So here they're shifted by two pixels.
So you get a map that's half the resolution of this one.
And then you repeat the process.
So you get convolutions again.
So this guy is a result of applying convolution kernels
to each of those maps, adding up the result,
passing it through a non-linearity.
And then again, there is pooling and subsampling.
So as you go up the layers, you get
representations that are more global and kind of more
abstract and etc.
And this is really the idea of simple cells and complex cells, complex cells being those pooling operations--
so this is sort of a realization of that.
That's the drawing from Fukushima's paper on the neocognitron, where you had those kinds of simple cells and complex cells.
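The convolution / nonlinearity / pooling pattern just described can be sketched like this in PyTorch; the number of feature maps and kernel sizes are loosely modeled on the kind of small handwriting network discussed next, not an exact reproduction.

    import torch.nn as nn

    conv_block = nn.Sequential(
        nn.Conv2d(1, 6, kernel_size=5),    # "simple cells": local templates swept over the image
        nn.ReLU(),
        nn.MaxPool2d(2),                   # "complex cells": pool and subsample, halving resolution
        nn.Conv2d(6, 16, kernel_size=5),   # second stage of convolutions over the feature maps
        nn.ReLU(),
        nn.MaxPool2d(2),
    )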
So this is a convolutional net.
This is meant to be an animation.
I'm not sure why it's not animating.
But it's not animating.
And not only that, it actually crashed my computer.
All right.
I'm going to have to do something very brief
for just a minute.
OK.
Now it works.
So this is an old convolutional net
trained in the early 90s to recognize handwriting.
And what you can see here is that this is the first layer.
That's the input.
So the first layer, 6 feature maps.
Then pooling subsampling, second layer.
Pooling subsampling, third layer.
And by the time you get here, each unit here, each pixel represents the activation of a unit.
It basically sees the entire input, or at least
a square on the input.
And so a slice through this represents an entire character
essentially in sort of abstract form.
And the good thing we realized pretty quickly with it
is that we could not just use it to recognize single objects,
but also multiple objects.
And that's very important.
So here we-- you basically have multiple copies
of the same convolutional net applied to a sliding window
over the input.
And it's actually very cheap to do this.
You can sort of apply the convolutional net
convolutionally.
It's convolutions all the way.
People sometimes call this a fully convolutional net now.
And at the output, you get a score
for every window and every category.
And here I'm just showing the winning score
with kind of a gray scale to indicate
the score of the category.
And then a very simple post-processing
pulls out the correct interpretation.
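One way to picture applying the net convolutionally, as described: replace the final fully connected classifier with a convolution, so a wider input just produces a grid of scores, one per window position and category. A sketch reusing the hypothetical conv_block above, with made-up sizes:

    import torch
    import torch.nn as nn

    fully_conv_net = nn.Sequential(
        conv_block,                          # feature extractor from the sketch above
        nn.Conv2d(16, 10, kernel_size=5),    # one score map per category instead of a Linear layer
    )
    scores = fully_conv_net(torch.randn(1, 1, 32, 100))   # a wide image strip
    print(scores.shape)   # (1, 10, 1, 18): a score for every window position and every category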
So here, the cool thing is that the system
can recognize objects without prior segmentation.
You don't have to separate the digits before being
able to recognize them.
And that's really important if you
want to be able to apply those things to natural images
where objects appear in the background.
And you can't afford to--
and you can't actually figure out
how to separate them from the background.
So that was kind of an important thing.
And then going forward a number of years,
about almost 10 years to 2003, someone at DARPA came up to us
and said, can you use machine learning, neural nets,
let's say, to drive robots?
And so we built this little track robot here.
It's just a radio controlled track
with two cameras, analog cameras.
And we had this truck being driven
by someone for about 20 minutes, or a total of maybe two hours.
And that person would be instructed
to drive straight and sort of veer off
whenever there was an obstacle.
And you know, after some training, you feed the network with two images from the two cameras.
And then you would just train the network to emulate the steering angle of the human driver.
And you let the robot loose.
And it gets through all this kind of horrible busy Jersey backyard here, driving itself through these obstacles.
So we showed these to DARPA.
And they said, oh, that's great.
We're going to start a program called LAGR
and have six different teams compete.
That would be nice if this slide actually showed.
Here we go.
Six different teams compete.
They will all get the same robot.
And you'll train this robot to--
using machine learning, to figure out
whether it can drive over a particular area or not.
And so we used this convolutional net
that would look at bands in the image
and then label every pixel as to whether it's
traversable or not.
So something like this.
And the cool thing is that you can actually get ground truth, more or less, through stereo vision.
So using a stereo vision system, because this robot has
multiple cameras, you can figure out
if something sticks out of the ground.
But that only works up to about 10 meters.
Beyond that it doesn't work.
So you trained a neural net with the labels collected
from stereo.
And then you run the neural net on the whole image.
And it does this.
It figures out where a path is essentially.
And it figures out that here in the back there is this row of obstacles and the little passageway in between.
And so this thing kind of worked pretty well.
There were again, six different teams competing on this.
We were the only ones to use convolutional nets.
But again, the project started in 2005 and ended in 2008.
And so there's a fast vision system that uses stereo, a slow system that uses stereo, and then a slow vision system as well that uses this neural net.
And then you combine all the results in a map.
And you can do some planning to figure out how
to get to a particular goal.
The map here is centered on the robot.
So it's relatively easy to plan.
And then the system actually trains itself as it goes.
It adapts, collecting labels from the stereo vision.
It learns how to navigate new environments it's never seen before, even around the pesky grad students who try to annoy this poor robot.
[LAUGHTER]
The robot weighs about 100 kilos.
It can probably break their legs.
But they're pretty sure it's not going to do that, because they
actually wrote the code.
This is-- and they trained it.
This was Raia Hadsell, who at that time
was a PhD student with me, who now leads the Robotics Research
Group at Deepmind.
And Pierre Sermanet, who is at Google Brain,
also working on robotics.
So a couple of years later, we realized
we could use the same kind of technology
for not just labeling pixels in an image as to whether they're traversable or not, but also labeling them with categories.
And some datasets started to appear, with maybe a couple thousand images, that allowed us to train a convolutional net to do this.
So again, this is a convolutional net
applied to the whole image.
Each output of the convolutional net is influenced by a window
on the input, which is something like 40 by 40 pixels
at high resolution and 90 by 90 pixels
at half, and 180 by 180 pixels at quarter resolution.
So it sees a big context to make a decision for a single pixel.
But it kind of makes a decision for every pixel.
And the cool thing about this is that we can run this in real time.
So this was implemented on what's called an FPGA, which is sort of programmable hardware.
And it could run at about 20 frames per second classifying
to 33 categories.
And it was far from perfect.
You know, it classified those areas here as sand or desert.
And this is the middle of Manhattan.
So there's no sand I'm aware of.
And it worked pretty well.
So we submitted a paper to CVPR in 2011.
And it was soundly rejected.
And the reviewer comments were either what the hell
is a convolutional net?
Or how is it possible that you get such good results with a technique we've never heard of?
So it's kind of funny.
So afterwards we submitted it to ICML, where it was accepted.
And so the funny thing is back in 2011,
you couldn't get a paper accepted at a computer vision
conference if you use neural nets.
Now you cannot get a paper accepted at CVPR unless you
actually use convolutional nets.
So there's a complete revolution over the next few years.
So that gave some ideas to a few people working on self-driving cars around that time, around 2013-14,
where they realized they could use
those kind of convolutional net based semantic segmentation
techniques to label every pixel in an image
as to whether it's traversable or not, or as to whether it's
a pedestrian or a road or something like this.
So this is some work at Nvidia.
This is work at Mobileye.
Which now belongs to Intel.
And this is a system that--
Mobileye produces systems that were used in the Tesla cars for autonomous driving until mid 2016.
Then the two companies divorced.
They weren't agreeing with each other somehow.
So now Tesla is developing its own system.
Nvidia has a big project on this, which I may come back to.
And then around 2012, the big revolution occurred.
And what that was is the use of very large convolutional nets
implemented on GPUs to run really efficiently
and train on large datasets like the ImageNet dataset
that has a million training samples, 1,000 categories.
And it turns out those things work really,
really well when you have lots of categories
and lots of training samples.
And when you make them big.
And so the first to really make an efficient implementation
of those networks on GPUs were Geoff Hinton
and his students, Alex Krizhevsky and Ilya Sutskever.
And they had presented the result at an Imagenet workshop
at ECCV in Fall 2012.
And then had a paper at NIPS in Winter 2012.
And that basically made the computer vision
field completely change, and basically jump
started the deep learning revolution.
That revolution had started in speech recognition
a couple of years earlier.
And the interesting thing about this
is that we ended up seeing an inflation
in the number of layers that are used
by those convolutional nets.
So this is the VGG network, which
was one of the top performing in 2013.
GoogLeNet-- no, this was 2013.
Then GoogLeNet in 2014, which had even more layers.
And then ResNet.
Kaiming He and his collaborators from Microsoft Research Asia had this idea of having skip connections that basically solved the problem that sometimes, when you train a very deep neural net, some of the layers die.
The weights don't go anywhere.
That kills the entire thing.
So they use those skip connections to prevent the catastrophic bad things from happening if some layers die.
And that turned out to be a very, very good idea that seems deceptively simple.
But in fact, it works really, really well.
And so you can train neural nets with 50 layers,
100 layers, 150 layers.
And they work really well.
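A minimal residual block in the spirit of those skip connections; an illustrative sketch, not the exact ResNet architecture, which also uses batch normalization and other details.

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.conv2(self.relu(self.conv1(x)))
            return self.relu(out + x)   # the skip connection: a dead layer just passes x through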
There are more modern versions of this.
One version, called DenseNet, which is a collaboration between people at FAIR and people at Cornell, is sort of a variant of this designed to run efficiently, etc.
And so one question you might ask
is, why do we need all those layers?
Right? Theoretically, you can approximate any function with only two layers.
So why do you need many layers?
And you know, one possibility is the fact
that the world is compositional.
Images are basically composed of pixels.
And pixels are arranged together to form things like edges and colored blobs, and stuff like that.
And then by detecting combinations of those, you can detect things like circles and corners and gratings.
And then combinations of those form parts of objects.
And combinations of those form objects, et cetera.
So there is this kind of hierarchical nature
of the perceptual world which is sort of captured
by those layered architectures.
So we used to take weeks to train those networks.
And now we can train one of those networks
with basically state of the art performance in about an hour.
On a very large machine with 256 GPU cards in it.
It's actually multiple machines.
Each machine has 8 GPUs.
And you stack them up.
So you can do these kind of things
if you are at Facebook or at Google.
A little more difficult in university environment.
But here are some more recent results on computer vision.
So this is a bit of a snapshot of the state of the art.
This is a model called Mask R-CNN, which
is a system that does not just semantic segmentation, but instance segmentation.
I'm not going to bore you with all the details.
I'm just going to tell you that it beats all the records on standard datasets like COCO.
And here's an example of the results you can get.
So again, it's essentially conceptually very simple,
a convolutional net with some sort of system
that sort of detects regions of interest and then
applies a slightly more complex convolutional net
on those regions of interest.
And the output of the network is not just a category,
but it's a category, the coordinates of a bounding box,
and an image of a mask of the object at the same resolution
as the input.
And so you get for every object, you get the category,
you get the mask of the person or the object,
and you get a bounding box.
And it detects baseball, the dog, the individual people,
even though they all overlap.
So this is instance segmentation, not just
semantic segmentation.
With semantic segmentation, you would have just one big blob here labeled people.
You can detect wine glasses and wine bottles, very important
for French people, computers, you know, et cetera.
Backpacks, umbrellas, sheeps, you can count sheeps.
You know, overlapping cars, things like that.
It works amazingly well.
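For reference, a pretrained Mask R-CNN of this family ships with torchvision; here is a rough usage sketch. The random image and the 0.5 score threshold are placeholders, and this is not necessarily the exact model from the talk.

    import torch
    import torchvision

    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True).eval()

    image = torch.rand(3, 480, 640)        # stand-in for a real RGB image scaled to [0, 1]
    with torch.no_grad():
        output = model([image])[0]         # one dict per input image

    # For every detected instance: a category, a bounding box, and a mask.
    keep = output["scores"] > 0.5          # arbitrary confidence threshold
    boxes = output["boxes"][keep]
    labels = output["labels"][keep]
    masks = output["masks"][keep]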
It's also trained to detect key points on human bodies.
So you can infer the body pose of people in photos and videos.
There's actually-- there's more of this which I can't show you.
But it actually runs at 5 frames per second on a smartphone.
So it's a scaled-down version of this.
And then there are kind of new applications of convolutional nets for 3D data.
So this is a recent competition called ShapeNet, where the dataset consists of 3D objects represented by point clouds from a depth sensor.
And it's been manually segmented into regions or parts.
And the goal here is to essentially label every region
with the correct label.
And what turned out to win this recent competition
was a 3D convolutional net produced by Ben Graham
and Laurens van der Maaten.
So this is the original paper that
describes the idea of a sparse 3D convolutional net.
And there's some other contributors to the system.
It's a library you can download.
It's basically the idea of sort of only doing convolutions
in areas where you have populated voxels,
because in a 3-D environment, most of the voxels are empty.
So you don't want to be computing convolutions
everywhere where there is nothing.
So you just follow the areas where there is something.
And it turns out to be much faster and easier to train.
And they actually won the competition with this technique.
And another application of convolutional nets that's more recent is a system that's actually deployed at Facebook that uses convolutional nets for translation, language translation.
So you feed in a sentence in English.
And it goes through a bunch of convolutions.
And it's actually a gated convolutional network.
So those are gated linear units, which I'm not going
to go into the details of.
There is pointwise multiplication going on here.
And then it goes into this kind of a weird alignment
system that basically produces sort of German words,
word by word, and then kind of lines them up
in an appropriate way.
And so, it's very fast.
It's very efficient.
It works really well.
And this is what is used for translating some pairs of languages on Facebook.
Facebook can translate 2000 pairs of languages.
A number of them are translated using old style phrase based
statistical methods.
A number of them are translated using recurrent neural nets.
And then a small number of them are
translated using this system, which
is now being trained on more and more language pairs.
So a lot of the research that we do at FAIR--
in fact all of it-- is open.
We publish everything we do, generally very quickly, on arXiv.
And we also publish most of our code as open source and so forth.
So these are a few examples of some of the stuff we've deployed and distributed as open source.
I would single out PyTorch.
This is a deep learning framework with a Python front
end.
It is very simple to use.
It's very good for research.
It's more transparent than TensorFlow.
OK.
And there's of course a lot of applications
of those things to medical imaging,
of course, and things like that, which
I'm not personally working on.
But a lot of my colleagues are.
But what's missing about this is two things.
One is, how do we learn reasoning and memory and things
like this?
And the second one is, how do we learn general things
that animals and humans can learn
without being told the name of everything,
without being given labeled data.
So this is a work by a bunch of people from Facebook AI
research in Menlo Park in California.
Justin Johnson was an intern at Facebook from Stanford.
And Fei-Fei Li, his advisor.
And the idea here is can we use deep learning to do things
like visual reasoning?
So could we answer questions like this one.
Is there a matte cube that has the same size as the red metal object?
So you have to read this a few times and sort of figure out really what operation you have to do here.
And so the idea they came up with is very cool.
You take the question.
Are there more cubes than yellow things?
You feed this through a recurrent neural net
that represents this as essentially
a single vector of fixed size.
And then you run this through another recurrent net
that spits out a kind of a representation of a computation
graph.
Think of it as a visual program, which
basically gets instantiated in this graph that has one block.
Those are actually trainable blocks.
OK.
They're all the same architecture.
So one block that is supposed to figure out--
filter all the objects that are yellow.
And another one that filters out the cubes.
One block that counts how many yellow things there are.
This one counts how many cubes there are.
And then it compares the two.
And then figures out the answer.
Right.
And so you don't predefine what those blocks should do.
You initialize it a little bit by heavy supervision,
by specifying what the program here should be,
and which blocks should be assembled,
even though the blocks are not trained initially.
And then you backpropagate the gradients
to get the right answer through this whole thing, including
the convolutional net.
And eventually this thing figures out
what those blocks should do.
Of course, it needs to learn all those keywords and learn how to do reasoning.
But the interesting thing about it
is that it's completely dynamical.
You change the question, it's going to change the graph.
So the graph that you propagate gradient through changes
every time.
And that's why dynamic graphs are so important in deep learning nowadays.
People are so excited about it for things
like natural language understanding.
So dynamic graphs is the situation
where the computational graph that you
use to compute your answer changes when the data changes.
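A toy illustration of that kind of dynamic graph in PyTorch; the loop standing in for a question-dependent stack of blocks is purely illustrative.

    import torch

    def answer(question_length, features, w):
        h = features
        for _ in range(question_length):    # the question changes, so the graph changes
            h = torch.tanh(h @ w)            # a different number of blocks gets stacked
        return h.sum()

    w = torch.randn(8, 8, requires_grad=True)
    loss = answer(question_length=3, features=torch.randn(1, 8), w=w)
    loss.backward()                          # backprop through this particular graph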
There's actually more recent work
along those lines by Aaron Courville at University
of Montreal, where they don't actually
have to specify a program like this.
You just stack multiple blocks.
And it just works.
It's pretty cool.
OK.
So for those statisticians in the room,
since I've been invited by bio-statisticians,
deep learning breaks all the basic rules of statistics.
I mean, not all of them, but some of them, right.
So the models are enormous, often with many, many more parameters than there are training samples.
I mean, so take one of those convolutional nets
for ImageNet.
There is 1 million training samples.
Some of those models have 100 million parameters.
And they still work quite well.
They can often nail the training set perfectly.
And often there is no explicit regularization.
But it still works.
How is that possible?
The loss function is very highly non-convex.
It's got a ridiculously large combinatorial number of saddle points.
But still, you pretty much get the same result every time you
train.
What it tells you is that maybe there are local minima,
but they're all pretty much equivalent.
And in fact, there are experiments
that seem to suggest they're all connected.
There is only one local minimum basically.
I mean, not one.
But essentially one.
Little attention is paid to managing uncertainty
beyond using very simple things like softmax
on the output when you do classification.
But there's a lot of effort spent on computational issues.
Like efficiently implementing all those things, and all
that stuff.
So it's sort of very unusual.
It breaks the rules you see in textbooks,
in statistical textbooks.
And that might be a reason why some people who are more
theoretically oriented had initially a lot of skepticism
towards neural nets.
OK.
But let me switch to kind of the point I really want to make with this talk, which is, where do we go from there?
OK.
So deep learning works very well.
There's a lot of applications we can use it for.
Even if we don't do any research anymore,
just with the technique that we've developed so far,
there's probably a lot of different industries
that are going to be affected by it that we can apply this to.
In fact, there's something that Andrew Ng said recently.
Stop doing research.
Just apply the stuff that we already know.
I don't think it's a good idea.
But I don't think he believes it completely either.
But it is interesting that he said this.
So what are the obstacles really to making significant progress?
Because as I said before, all the stuff you see,
that's not real AI.
And our machines do not learn with the same kind
of efficiency that we observe animals and humans learning
with.
So how do we get machines to learn
how the world works, learn common sense
or something like this?
So that raises the question, going back to the inspiration from biology: does the brain use a single learning algorithm?
Or does it use 50 learning algorithms?
Or maybe 200?
Or maybe it's a complete [INAUDIBLE], the result of evolution.
There's no underlying principle behind it.
It's just a result of millions of years of evolution.
How much prior structure does animal or human learning require for intelligence to emerge in a reasonable amount of time?
All the learning algorithms that people in machine learning and statistics have come up with minimize some sort of objective function--
or optimize some sort of objective function, I should say.
Does the brain optimize an objective function?
What would that function be?
If it optimizes a function, does it
do it by evaluating a gradient?
If it evaluates a gradient, how does it do it?
It probably doesn't do backprop in the way
that we understand it today.
And how does it handle uncertainty
in prediction, which I think is a crucial issue?
So all kinds of questions like this that connect
AI machine learning with neuroscience really.
And one big missing ingredient in AI, or maybe a holy grail,
is common sense.
There's a subarea of AI called commonsense reasoning.
It's not actually a solution to a problem.
It's more of a problem.
And it's a question of how do we get machines to acquire common sense.
So common sense is everyday--
the common sense of everyday things.
That unsupported objects fall.
That some objects are stable.
And some are not.
If I let this guy go, it's going to fall,
even if I put it briefly vertically.
If I take this object, I hide it behind my computer,
you still know it's here.
It hasn't disappeared.
So object permanence.
So those things we learn.
How do we learn the structure of the world?
And one hypothesis perhaps is that our brains
are prediction machines.
They learn to predict all the missing information
from whatever is available today at this time.
And then time passes by.
Or you move your head, or whatever.
And new information becomes available.
And that allows you to train your world model
with the new information.
So if I want to learn that the world is three dimensional,
I'm going to learn it because it's
the best explanation for how the world changes
when I move my head.
My view of the world changes when
I move my head side to side.
And the best explanation for how it changes
is the notion of depth.
So necessarily, if my brain is trained to predict
what the world is going to look like when I move my head,
it's going to have to somehow represent the notion of depth.
Same way if I want to predict--
if I let this go and I stop the movie right there,
then I ask the machine, ask my brain
what's going to happen next?
It's going to predict this guy is going to fall--
he's going to fall down, of course, because of gravity.
So it just needs to wait for time
to pass by to train itself to see
if its prediction was correct.
So that would be predictive learning.
But predicting-- learning to predict
is not just predicting the future
from the present and the past.
It might be also predicting what the blind spot of a retina
contains without even looking.
So if you fixate on a particular place,
there is a particular spot in your visual field where you're
essentially blind because that's where your optic nerve punctures through your retina.
You don't see anything there.
But you don't realize it, because your brain
fills it up essentially.
So things like filling in the visual field at the retinal blind spot, filling in occluded images,
missing segments in speech, predicting
the state of the world from partial textual description,
predicting the consequences of your action,
predicting sequences of action leading to a result.
I mean, all of those are fill in the blanks, if you want.
And common sense, I would surmise,
is the ability to fill in the blanks
through the construction of world models.
Object permanence is something babies learn around
the age of two or three months.
And which is why peekaboo is so funny for little babies,
because you can disappear when you hide your face.
So here's a baby orangutan here.
It's being shown a magic trick.
The guy put an object in the cup.
And then he shakes the cup.
He takes the object out without showing the orangutan.
And then shows the inside of the cup.
And the cup is empty.
And the orangutan rolls on the floor laughing.
OK.
That obviously broke his world model, that objects--
there's object permanence.
Objects don't disappear like that.
And you know, one of three things can happen when your world model is broken.
You laugh--
it's really funny.
Or it's really interesting, and you pay attention, because your world model is wrong.
So you need to learn a new world model basically, because of this new data that you predicted wrongly.
Or something really dangerous might happen that you didn't predict.
And so you're scared.
So that's what happens when your world model is broken.
So I think-- how do we do this in a machine?
How do we get machines to learn all those things about the world?
Learn gravity?
So if you show a baby-- these are special slides I borrowed from Emmanuel Dupoux, who is a cognitive scientist, a developmental cognitive scientist in Paris at Ecole Normale Superieure.
And if you do an experiment like this,
you take this little car here.
And you put it on this support.
And you push it.
And it goes off, and it doesn't fall.
Of course, it's held in the back.
But the baby doesn't see that.
Before six months, the baby says, yeah, sure.
That's the way the world works.
Fine.
No problem.
After eight months, they go like this.
You know, they open their eyes.
And they fixate.
And they say, what's going on?
And they don't say, what's going on,
obviously because they can't talk.
But you know, they look like they're saying,
what's going on.
And so with this kind of technique,
by basically measuring how long you know babies
fixate and observe and open their eyes like crazy,
you can figure out at what stage babies learn things.
And again, this is from Emmanuel Dupoux.
So things like object permanence you learn pretty quickly.
Biological motion, the fact that there
are objects that move by themselves,
others that are inanimate.
You know, you learn that by three months.
Objects that are rigid or not.
Different types of natural categories, chairs, tables, cars, etc.
Stability and support.
And sort of basic intuitive physics, gravity, inertia,
conservation of momentum.
That arrives around 8 months, roughly
between six and eight months.
And there are a bunch of other things like that that happen at various stages.
And this is not learned in supervised mode.
It's not like, babies are told the name of objects.
It's not like they are directed in any way for any of this.
They basically learn this by observation.
They're really not well-developed in sort
of motor control either.
So they don't get to do a huge amount of interaction
with the world.
So there's no way this can be learned through interaction,
by some sort of direct reinforcement learning.
There's other mechanism going on there
where you learn how the world works by observation.
And that's the piece we're missing in our current machine
learning and AI systems.
So in fact, I need to apologize in advance to Michael.
But he knows what I'm going to show, so--
There's three sort of paradigms of learning, right.
There is a reinforcement learning,
where basically the machine at each trial
is given a scalar value to tell it whether it did well enough
or not.
So that works great for games.
Machine does an action.
And it either gets a reward or not.
Or sometimes it has to make a whole sequence of action
before it gets a reward.
And it works great when it's combined with deep learning.
The problem is that it requires a huge amount
of training samples, an enormous amount of training samples.
It's because the amount of information
you give to the machine is extremely small at every trial.
It's very weak.
It's a small amount of information.
Therefore, you need to do this many, many times for it
to learn anything complicated.
Supervised learning, you need a little less samples,
because you give more information every time.
You give it the correct answer.
And so if there are a dozen categories,
that's more than just a single scalar value.
So you need fewer samples to learn similarly complex tasks.
And then in predictive learning or unsupervised learning, you ask the machine to predict basically every future variable from every present variable or past variable, or every unseen variable from every seen variable.
And so there is a lot more information
you ask the machine to predict.
And that's why probably you can learn
a lot more about the structure of the world this way.
So that led me to this completely obnoxious slide,
which I have to show in every slide-- in every talk now.
The analogy between intelligence and chocolate cake,
where the bulk of the cake is basically unsupervised or predictive learning,
because that's where the bulk of the information goes.
The bulk of the information given to the machine
is really in that mode of learning.
And then the icing on the cake is supervised learning.
There is considerably less information
provided to the machine per trial in supervised mode.
And in reinforcement mode there is very little information
given to the machine.
So that's going to be equivalent to the cherry on the cake.
And I've been showing this-- the first time
I showed this slide was actually giving a talk at Deepmind,
where Deepmind is actually the temple of reinforcement
learning.
So it was sort of obnoxious on purpose, a little bit.
But now I kind of fell into that obsession
of showing it in every talk.
So the problem with reinforcement learning,
with pure reinforcement learning, and Michael
will correct me if I'm wrong, is that if you use it
in its purest form, you need so many trials to learn
any kind of complex behavior that if you were
to train a self-driving car to drive, and to learn to not
run off a cliff, it would have to run off a cliff
about 50,000 times before it figures out it's a bad idea.
And then another 50,000 times before it figures out how not to run off a cliff.
And you know it's half of a joke, which is why--
I mean, that's the reason why it works really well for games,
because you can run games very quickly
on many computers at the same time
and at many thousands of frames per second.
But it doesn't really work in the real world,
because you cannot run the real world faster than real time.
That's a thing that sucks about the world.
And then anything you do in the real world can kill you, like running off cliffs.
Maybe it's a good thing that we can't run the real world faster
than real time.
So perhaps what we need is build models of the world
that we can run faster than real time,
and that we can run without the risk of killing ourselves.
And that would be predictive models.
If we ever were to predict before we run off
a cliff that we're going to run off a cliff,
we would not run off a cliff.
And perhaps, that's the way we learn to drive.
We know not to get off the road, because we
know bad things will happen if that's the case.
Reinforcement learning works really well for games.
And there was a smashing demonstration
of how well this works for Atari games
and Go and Doom, and not yet StarCraft, which is
very much work in progress at FAIR and Deepmind
and various other places.
It's very complicated.
But you know, it works really well.
And the latest AlphaGo Zero is pretty amazing in that way.
But again, it's a particularly simple situation
where the number of actions is discrete,
the world is completely observable,
and the reward is fairly clear.
And you can run the environment, which is a go board,
at tens of thousands of frames per second essentially.
It works pretty well, even for games like Doom.
So this is a Doom competition that was
won by the team from Facebook.
And actually teams with Facebook people won two years in a row,
in '16 and '17 using basically deep reinforcement
learning techniques.
So we work on reinforcement learning at Facebook.
It's not--
The cake I showed--
I showed the cake, but you have to notice that this is
a black forest chocolate cake.
And the cherry is not optional on this cake.
In fact, it's got little bits of cherries
all around here inside.
[LAUGHTER]
OK as I said, we also work on StarCraft.
So StarCraft is an extremely challenging situation,
because there is multiple time scales.
There are continuous actions.
It's not fully observable.
You can't tell what your opponent is doing unless you
send scouts to look at it.
So it's very complicated in that sense.
We've done a little bit of reinforcement learning for sort of local micro-management of tactics.
There's actually an open source platform called ELF with MiniRTS from Facebook that is basically a StarCraft-like real time strategy game.
But here is a suggestion.
So I said we need our machines to be able to learn
predictive models of the world.
And this idea is very old.
It goes back to a very old time.
But in particular, to one of Rich Sutton's papers
where he was proposing what he called the Dyna architecture.
And he said the main idea of Dyna
is the old common sense idea that planning is trying things
in your head using an internal model of the world.
And this suggests the existence of a more primitive process for trying things not in your head, but through direct interaction with the world.
So he said here, reinforcement learning is the name we use for this more primitive and direct kind of trying.
And Dyna is the extension of reinforcement learning to include a [INAUDIBLE] world model.
In fact, this distinction doesn't exist today.
All of this is called reinforcement learning.
It's just that the version that has a model
is called model based reinforcement learning.
And the other one is called model free reinforcement
learning.
But it's basically the same, the same thing.
And this idea that you should have a world model, which in optimal control is called a plant simulator or a plant model, but it's the same thing--
this idea of using a predictive world model to be able to reason about what to do, what action to take, is really an old idea in the context of optimal control.
So a typical situation in optimal control, and you can look at classical textbooks going back to the 60s, is you have a model of the world that gives you the state of the world at time t plus 1 as a function of the state at time t and the action you take.
And then the state of the world is sent to an objective function that measures how good the state of the world is.
And so you can run this model of the world.
And through backprop through time and gradient descent
figure out a sequence of commands
that will optimize this objective function over time.
And if your world simulator is differentiable, you can do this through backprop and gradient descent.
If it's not, you have to do things like dynamic programming or something like this.
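A sketch of that kind of gradient-based planning through a differentiable world model, in PyTorch; world_model and cost_fn are placeholders for a learned transition model giving the state at time t plus 1 from the state and action at time t, and a cost on states.

    import torch

    def plan(world_model, cost_fn, s0, horizon=10, steps=100, lr=0.1, action_dim=2):
        actions = torch.zeros(horizon, action_dim, requires_grad=True)   # sequence of commands
        opt = torch.optim.SGD([actions], lr=lr)
        for _ in range(steps):
            s, total_cost = s0, 0.0
            for t in range(horizon):
                s = world_model(s, actions[t])     # roll the model forward in time
                total_cost = total_cost + cost_fn(s)
            opt.zero_grad()
            total_cost.backward()                  # backprop through time, through the model
            opt.step()                             # gradient descent on the action sequence
        return actions.detach()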
So the main problem we're going to have is,
how do we learn this world model?
How do we learn a model that will allow our machine
to predict what the state of the world at time t plus 1
is going to be as a function of the state
at time t and our action, and perhaps actions
of others in the environment.
That's the problem of predictive or unsupervised learning.
And that led me to state that--
oops.
I'm not sure how that happened.
Apologies.
Wow, it went forward by like 10 slides.
So that led me to this statement, that the next revolution in AI will not be supervised.
I stole the concept of this slide
from Alyosha Efros at Berkeley.
And so we have to think about what
would be the architecture of a real intelligent system, a sort
of autonomous intelligence system.
So it would be something like this, an agent that
produces actions on the world.
And the world responds with percepts.
And of course, the world might be--
the world might not care about your action at all.
Or it might care only vaguely.
What the agent is trying to do, the agent
has an internal state which is sent to an objective function.
And the objective function produces
a value that basically tells the agent whether it's happy
or not.
So the objective function is a measure
of unhappiness of that agent.
You get a small value if you're happy, a large value if you
are unhappy.
So what the agent is trying to do
is bring the world into a state that will bring itself
into a mental state that basically this red function
identifies as happy.
And there are models of how animal brains are built that are basically this way, where this is your entire brain, except the basal ganglia.
And that's the basal ganglia.
So basal ganglia is the thing at the bottom of your brain
that basically determines your level of happiness or comfort
or discomfort or pain or things like that.
So inside of this agent, if we believe the argument that I made previously, the system should have some sort of world simulator that allows you to predict what the state of the world is going to be as a consequence of a sequence of actions.
And then two other modules.
These are sort of standard nomenclature in RL.
An actor that produces action proposals that can be
kind of simulated in the world.
And then a critic whose role is to predict
the long term expected value of this objective.
So this guy basically computes emotions.
So if this guy predicts that your objective function is
going to rise up, make you very unhappy or in pain,
that creates fear, essentially.
You don't want to get anywhere near that state.
And this guy predicts what happens.
So this guy predicts this.
This guy doesn't quite predict that.
But this guy actually predicts that as well.
And so now the problem becomes, how do we
train this world simulator?
Because the rest, we kind of know how to do it more or less.
We don't know how to build this.
But if we knew, we could do something like this.
Get the state of the world through your perception module, initialize your world simulator, propose a sequence of actions, and then refine the sequence of actions so as to minimize the expected cost computed by the critic.
And then train the actor to produce this optimal sequence of actions.
And then take the first action.
And then kind of shift everything by one time step.
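That loop, written as a sketch; all the modules here are placeholders, plan is the gradient-based planner sketched earlier, and training the actor is left out.

    def agent_step(observation, perception, world_model, cost_fn, execute):
        s0 = perception(observation)              # state of the world from the perception module
        actions = plan(world_model, cost_fn, s0)  # refine a sequence of actions in the simulator
        execute(actions[0])                       # take only the first action
        # the next call repeats the whole process, shifted by one time step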
So how do we learn forward models of the world?
This is an experiment that was done at Facebook
a couple of years ago by Adam Lerer, Sam Gross, and Rob Fergus
where they put a stack of cubes, this is in a simulator.
This isn't the real world.
And then they observe what actually occurs.
And then they train a convolutional net
to actually predict what's going to happen by kind of learning
the mask of the objects.
And what you get is a pretty accurate prediction
for this tower is going to fall this way.
But fairly fuzzy predictions for, like, tall towers, where it's kind of ambiguous where things are going to fall.
So you get those kinds of fuzzy predictions here, because you can't exactly predict where things are going to fall.
So how do we solve that problem?
I'm going to skip this.
So this is why predictive models are
good for question answering systems and natural language
processing.
But I'm going to skip this in the interest of time.
So, here's the problem we have to deal with.
Those towers can fall in a number of different directions, and we can't really predict just from the look of it which direction they're going to fall in.
So it's kind of--
I don't know if we can find a pen here
or any kind of vertical thing.
I'm going to do it with a piece of paper.
So if I put this piece of paper here on the table,
and I let it go, you can be pretty sure it's going to fall.
But you can't really tell probably which direction
it's going to fall.
Every time I do it, it's probably
going to fall into a different direction.
So you can't really use supervised
learning to train something like this.
Because if I give the initial segment, and then I ask the machine to predict, the machine predicts that.
If that happens, that's fine.
If this happens, then the machine has to predict this instead.
But now the next time over, it's going to predict that.
And so the best thing the machine can predict
is kind of an average of the outcomes, which
is not a good answer.
And so, something like this, where let's say you
observe two variables which have a dependency between them.
And this is pretty elementary for anybody who
works on probabilistic models.
But let's say these are the data points you observe.
Your world consists of two variables.
And these are your observations.
If I give you a particular value of Y2,
you can infer basically two values for Y1.
But if you try to learn this with an L2 least-squares criterion,
you're going to predict something right in the middle,
which is not a good answer.
So you have to predict, somehow be
able to predict one or the other,
but not an average of the two.
Or predict a distribution.
But how do you represent distributions
in high dimensional spaces?
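A tiny numerical illustration of that averaging problem: if the outcome is +1 or -1 with equal probability, a predictor trained with squared error converges to roughly 0, the mean of the two modes, which is not a plausible outcome at all.

    import torch

    prediction = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([prediction], lr=0.1)
    for step in range(2000):
        target = torch.tensor([1.0]) if torch.rand(1).item() < 0.5 else torch.tensor([-1.0])
        loss = (prediction - target).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(prediction)   # ends up near 0, the average of the two modes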
So the unsupervised learning problem
is how do you capture the dependency between things
like this?
And one possible way is to learn a contrast function.
So basically, think of it as an energy function,
or a negative log probability if you are a probabilist.
And these are your data points.
And you want those to have low energy, which
means high probability.
And you want everything else to have higher energy, or lower
probability.
So the blue points are the data that you observe.
The green points are not data.
And you want the energy of the green points
to be higher than the energy of the blue points.
So if you have a parametrised function that
computes this function in the space of Ys,
it's easy enough to tweak its parameters
so that when you see a blue point,
you make the output go down.
But how do you make sure that the value of your function
is higher outside of those regions?
How do you generate those green points?
And that's basically-- there's basically
seven or eight different methods for doing this.
But I'm only going to talk about a couple.
And the first one is adversarial training.
So adversarial-- the basic idea of adversarial training
is basically the scenario I was talking about.
You have a predictor here.
And this predictor looks at the past,
let's say, if you want to do video prediction.
So it looks at the past.
And it has access to a source of random vectors
and is going to produce a prediction.
The precise prediction is going to depend
on the value of this vector.
And as the value of this vector changes,
this prediction goes through a set of plausible outputs,
let's say, represented by this red ribbon here.
So let's say we ask the machine.
We show the machine a small segment of video.
And we ask it, what is the world going
to look like half a second from now?
And the machine predicts this.
It predicts that the pen is going to fall to the back and to the left.
And in fact, we let time pass by.
And what happens is this.
The pen falls to the back and slightly to the right.
So we don't want to punish the machine
for making the wrong decision here, because it's
qualitatively correct.
So what we'd like is we'd like an objective function that
tells us low cost if you are on this red ribbon, high cost
if you are outside.
And that's exactly what I was talking about earlier.
You want a function like this one
that tells you low cost if it's something
that looks reasonable.
High cost if it's not.
So the thing is, we don't know how
to characterize this function.
So we're going to have to learn it.
So adversarial training is you have two functions
you learn, one that predicts and one that tells the system
whether the predictions are good or not.
And basically it works like this.
So you have an initial segment of a video.
For example, if you do video prediction,
the data tells you here is how the video ends.
And you train this contrast function,
called the discriminator, or sometimes the critic actually,
to produce a low output for things that actually occur
in the world.
So those are the two blue points.
So we make the function take a low value for things
that actually occur.
And then you feed this past to the generator.
You have it generate a prediction,
which initially sucks.
And so you feed it to the discriminator.
And you tell the discriminator to produce a large output here.
So these are all of the green points.
Make that large.
And so the next time around, the value the discriminator
produces for those predictions
is going to be higher.
But here is what you do simultaneously.
Simultaneously, you backpropagate gradients
through the discriminator to train the generator
to produce Ys that make the discriminator produce
low outputs.
OK.
So basically, the generator gets information
about how to change its parameters so as
to change its output so that the green points get closer
to the blue points, essentially, to a region
that the discriminator gives low energy to.
So eventually it looks like this, where the green points
match the blue points more or less in distribution
if you're lucky, because those things are kind of finicky.
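A minimal sketch of that two-player update, in an energy-based style
(my own illustration; Generator, Discriminator, data_loader, and the
margin loss here are assumed placeholders, not the actual FAIR code):

```python
import torch

G, D = Generator(), Discriminator()           # hypothetical modules
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
margin = 10.0

for real in data_loader:                      # observed outcomes: the blue points
    z = torch.randn(real.size(0), 100)        # source of random vectors
    fake = G(z)                               # proposed predictions: the green points

    # Critic: push energy down on real data, up (to a margin) on generated data.
    d_loss = D(real).mean() + torch.relu(margin - D(fake.detach())).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: backpropagate through the critic so the green points
    # move toward the region it assigns low energy to.
    g_loss = D(fake).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```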
And it works.
So you can train those things with past frames.
Or you can just train it on images to just generate images
from random vectors.
So this thing has access to a source of random vectors.
If you train this thing on images of bedrooms, you get--
those are non-existing generated bedrooms.
And they all look kind of reasonable,
except maybe for this guy.
It looks like an Austin Powers kind of bedroom, or whatever.
But you know, they all have a bed and windows and dressers
and lights, and stuff like that.
And those are basically a bunch of random numbers coming
into a convolutional net that has been trained
to produce bedroom images.
And they don't look like anything in a training set.
They're different from any training set image.
So there are various versions of those GANs.
There's a whole menagerie of different types of GANs
nowadays.
There are CycleGANs and InfoGANs and WGANs and IWGANs,
and an infinite number of GANs.
There is another family of generative models
of this type called variational autoencoders.
This is when trained on ImageNet.
So this is something called Energy-Based GAN trained
on ImageNet.
And it doesn't actually produce objects.
But it produces things that from far away kind
of look like objects, [INAUDIBLE] abstract.
This is trained on dogs.
It's kind of funny.
I mean, people do much better than this now.
But it's still funny.
OK.
So here is an example of video prediction.
So here it's a convolutional net that looks at 4 frames
and predicts two frames, two future frames.
And it looks at the images at multiple scales.
And it's a pretty complicated
architecture.
And this is the prediction you get
if you train with least square.
So you train this video predictor with least square.
You get blurry predictions.
If you train it with this adversarial training
criterion combined with some others,
you get this kind of prediction, which is considerably sharper.
So the first four frames are observed.
The last two frames, indicated in red
here, are predicted.
And so you get--
the motions basically continue.
And they seem fairly reasonable.
There's a little bit of blurriness.
But it's not too bad.
This is when trained on video segments
from apartments in New York.
So the camera rotates.
And the system has to basically invent
what the room looks like as the camera rotates.
So here is a bookcase.
And this part of the bookcase-- so this is observed.
Now it's predicted.
This part of the bookcase is invented.
So it figures out that a bookcase has to continue.
It figures out that a couch has to continue.
So it captures some regularity of what an apartment in New
York is supposed to look like.
Here is something that maybe is more interesting for people
interested in self-driving cars.
This is a dataset called Cityscapes.
And-- oops.
And this is a system where you take a video sequence,
and you run a semantic segmentation system
on the video sequence.
So what you get is a bunch of maps
which give you a category label
for every pixel.
So much like this, blue is car.
Sidewalk is pink.
And pedestrian is red.
And things like that.
And what this thing predicts is that-- so it
predicts in this case here half a second in the future.
It predicts that pedestrians keep crossing the street.
The car that is turning left keeps turning left.
The scenery keeps moving.
So if you want to work on self-driving cars, it's useful
to have the ability to predict what's going to happen ahead
before it happens.
It might allow you to use this to train,
for example, a reinforcement learning system
without actually crashing, but just by predicting
the crash.
Here's a new model, a more recent one,
just submitted actually, called the error encoding network.
So this one-- in fact, the one that actually works
is slightly different from this one.
But this is a simpler version to explain.
So this one basically trains a model.
So it looks at the past.
It runs through a few layers of a neural net.
It produces an internal state.
And ignore the top for the time being.
Then runs through a generator essentially,
another part of a neural net that produces a prediction,
say a video, another frame in the video.
And you train this using least squares,
or something like this, against what is actually observed.
And then you play a trick.
What you do is you take the difference between those two.
So this is a vector, the vector of the difference
between those two, the target and the prediction.
You feed this to a parametrised trainable function.
And then you feed the output of that function
to the hidden layer.
You add it to the hidden layer.
And you train this guy so that this variable
is going to take a value that minimizes the prediction error.
But this variable only depends on the prediction error.
And so basically, this part of the network,
when this value is set to zero, predicts
whatever is predictable.
And this guy basically parametrizes
whatever is not predictable, which is the residual error,
and figures out how to represent it as a hidden latent variable that
will actually correct that mistake.
So that might represent the--
for example, you observe someone playing a game
and moving something on the screen.
The physics of how things move on the screen
is essentially predictable.
That's Newtonian physics.
But the action that the player uses maybe isn't.
And so that would essentially represent the action
that the player played.
That would be very useful for things like imitation learning,
for example.
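A simplified sketch of that error-encoding idea (illustrative only, like
the simplified version described here; encoder, generator, and phi are
invented module names, not the published model):

```python
import torch
import torch.nn.functional as F

def een_step(encoder, generator, phi, past, target):
    h = encoder(past)                 # internal state computed from the past
    pred0 = generator(h)              # deterministic prediction (latent set to zero)
    residual = target - pred0         # whatever was not predictable
    z = phi(residual)                 # latent code depends only on the prediction error
    pred1 = generator(h + z)          # corrected prediction using the latent code
    loss = F.mse_loss(pred0, target) + F.mse_loss(pred1, target)
    return loss, z

# At test time, sampling different values of z produces the different
# plausible futures, e.g. the different actions a player could have taken.
```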
Here's an example of how this can be used.
And I'm probably going to end here.
So you have to wait a little bit.
So this is a dataset that was produced
by Sergey Levine, [INAUDIBLE] and a few other people
at Berkeley.
So there is an object.
There is a robot arm.
And the robot randomly pokes the object.
So the result is that after being poked,
the object has moved a little bit.
And these are predictions for how the object could
have been moved by the thing.
This is pure pixel prediction, pixel space prediction.
So the system has no notion of object or anything.
These are the predictions it makes.
And each different prediction is generated by different sampling
of the Z variable, the latent variable, or the action
variable.
You can think of this as basically an encoding of what
the robot arm did without actually
having to observe what it did.
So it's action inference if you want.
OK.
I've spoken for long enough, so I'm
going to stop here and take your questions.
Thank you very much.
[APPLAUSE]
AUDIENCE: Hey.
[INAUDIBLE]
Real quick question.
So can you break-- so, let's just think about images.
Are you trying to-- or we use essentially biology and things
we know about the world to segment the image.
What if you took a camera and did
a combinatorial scramble, which is a huge potential scramble.
Does it break everything?
YANN LECUN: It scrambles the pixels?
AUDIENCE: It scrambles the pixels.
YANN LECUN: Yeah.
AUDIENCE: You know, it's combinatorially huge.
YANN LECUN: Yeah, that's right.
So if you do a fixed scramble and you
use a convolutional net, the convolutional net
will have a hard time figuring out the thing,
because it's based on the idea that neighboring pixels are
correlated.
And a local patch of pixels can be represented efficiently
by just those features.
So it probably would have a very hard time.
Now it turns out there's a paper by Pascal [INAUDIBLE]
on [INAUDIBLE] from way back where
they show that if you just-- if you take a collection of images
that you've perturbed through a fixed
permutation of the pixels, you can actually
recover the topology by figuring out
the local correlations between pixels.
So in principle, it would be possible to make this work
if you [? hardwired ?] this.
AUDIENCE: Thank you for giving a talk today.
I'm a big fan of yours, actually.
[INAUDIBLE] talk to me.
And recently the D-Wave Systems quantum computer
is actually deployed in practice right now.
How would you envision quantum computing
affecting deep neural networks in general?
YANN LECUN: Yeah, it's--
if you didn't hear the question, it's
about whether quantum computing will affect deep learning
in some way.
It's not entirely clear to me.
So D-Wave is not actually deployed in practice.
It's experimented with by people.
And there are a few attempts.
But it's not actually used in practice
for commercial deployment, if that's the question.
So the D-Wave System is not a full quantum computer
in the sense that it uses quantum tunneling
for more efficient function optimization.
It's not entirely clear that you need this
at all for any of the tasks that I talked about.
So I think it's still up in the air
whether or not quantum computing will have any effect.
It's possible you could do nearest neighbor much
faster with quantum computing.
It's not even clear to me that you can, but it's possible.
So, it's unclear.
AUDIENCE: So I actually have two questions.
The first question is that [INAUDIBLE]
if the data set is very small, like in the area
of a [INAUDIBLE], but only [INAUDIBLE] maybe X-ray imaging
or even less.
[INAUDIBLE] So I read something about zero-shot,
one-shot, and two-shot [INAUDIBLE].
So what do you think of [INAUDIBLE].
And the second question is, are any of the AI [INAUDIBLE]
developed by Facebook or developed [INAUDIBLE],
[INAUDIBLE].
YANN LECUN: All right.
Yeah, OK.
Let me answer first question first.
So, the small data regime.
There are basically currently two ways to handle it.
One is transfer learning.
So for example, you want to do image recognition.
And you want to do, I don't know,
medical imaging or something like this.
And you don't have enough data.
So one approach is you train your neural net
on a big data set that you actually
have, either with the same type of images,
or even complete different types of images, as long
as the statistics are similar, like ImageNet for example.
You know, it's not the same type of image.
But it's OK.
[INAUDIBLE]
And then you can do transfer learning.
So you take that pre-trained machine.
And then you retrain this machine on your data.
It helps if you just retrain the top two or three
layers, to limit the number of parameters.
That works really well.
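As a hedged illustration (not Facebook's actual pipeline), this is what
that kind of fine-tuning typically looks like with a pre-trained ImageNet
model; the data loader and the two-class setup are placeholders:

```python
# Illustrative transfer learning: freeze a pre-trained backbone and
# retrain only the top layer on a small labeled dataset.
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)   # trained on ImageNet
for p in model.parameters():
    p.requires_grad = False                             # freeze the lower layers

num_classes = 2                                         # e.g. cancer / non-cancer
model.fc = torch.nn.Linear(model.fc.in_features, num_classes)  # new top layer

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for images, labels in small_loader:                     # hypothetical small dataset
    logits = model(images)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```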
So there is actually a service within Facebook
that uses this for the product division within Facebook.
So to give you an idea, there's 2.1 billion users on Facebook.
And the users upload on the order of 1.5 billion photos
every day.
So there's 1.5 billion a day.
Every single one of those photos goes
through four convolutional nets that we know about.
It's probably way more.
But at least these four pre-trained convolutional nets.
So one basically recognizes tags
of various types on the image.
So it recognizes objects.
It recognizes the type of image.
Is this a birthday or a wedding or a landscape or an indoor scene
or a macro photo or whatever.
There's a second one that--
and this is used for feed ranking basically,
to decide whether to show particular images
to particular people who have particular interests.
The second one filters objectionable content.
So basically, violence, pornography, things like that.
The third one generates captions for images,
for the visually impaired.
So that if you're blind and you're on Facebook,
you can get an idea of what's in the picture
by getting this text description.
And then the last one, which is turned on in the US,
but not in many other countries,
not turned on in Europe, does face detection.
So it tags your friends automatically.
So that was for the first question.
Now there's a second answer to the first question.
And the second answer to the first question
is you can use unsupervised training or pre-training.
So basically, you don't just train the system
to classify your medical images into cancer or non-cancer.
You also train it to reconstruct its input.
And that has a regularization effect.
So there are situations, certain types of architectures,
things called ladder networks or stacked [INAUDIBLE] or U-Nets,
where this type of learning actually
helps supervised learning and reduces
the need for labeled data.
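A minimal sketch of using reconstruction as a regularizer when labels are
scarce (my own illustration; the shared encoder, the two heads, and the
weighting alpha are assumptions, not a specific published architecture):

```python
import torch
import torch.nn.functional as F

def combined_loss(encoder, classifier, decoder, x, y, alpha=0.5):
    h = encoder(x)
    class_loss = F.cross_entropy(classifier(h), y)   # supervised term on labeled data
    recon_loss = F.mse_loss(decoder(h), x)           # unsupervised reconstruction term
    return class_loss + alpha * recon_loss           # reconstruction acts as a regularizer
```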
OK.
So that was-- ultimately, I think
that unsupervised learning is going
to solve all of these problems.
Now your second question was about those bots
that there was a big story in the press a few months ago
that said that researchers at Facebook
had created two bots that were supposed to talk
to each other in English.
And they're supposed to cooperate to solve a task.
It's kind of a reinforcement learning type task.
And they ended up using the English language
in ways that were not really initially predicted.
They would use words in a funny way
to communicate with each other.
And so some of the newspapers, right after, all but
said AI is going to kill us all.
Some tabloid published an article saying, oh my god,
Facebook researchers had this project where two bots invented
their own language.
And they had to like unplug the computer in panic mode,
because they were going to take over the world or something.
And it's completely insane, because there
was a blog post about it and a paper that was published.
And it's basically, these people are
interested in natural language understanding.
And they trained those systems to use English.
And they ended up not using English
in a way you would normally use it.
So they said, the experiment failed.
Let's try something else.
It's not like the Hollywood sci-fi movie
where you see these guys grabbing the electronic cards,
and there's sparks flying and all that stuff, right?
Nothing like that.
But it's really funny--
funny in a way, and kind of depressing a little bit--
how some of the press describes those things.
There were a lot of articles in more serious press afterward
that said that's complete bunk, which is good.
AUDIENCE: Thank you.
AUDIENCE: Hi.
I have a comment here.
I have a comment and a question.
First comment is that earlier you said
there are many systems that have many more parameters
than the number of pixels
or whatever you're talking--
YANN LECUN: Samples.
AUDIENCE: Samples.
[INAUDIBLE]
I think from a statistics point of view,
it's the central limit theorem doing its job.
That's my comment.
YANN LECUN: Which theory?
AUDIENCE: Central limit.
YANN LECUN: Oh, central limit theorem.
AUDIENCE: [INAUDIBLE] I think.
But, OK.
My second question is actually related to this.
Are there-- all your examples kind of work.
Are there any theoretical scientists,
computer scientists, working on the foundations
of these kinds of things?
What makes it converge, and what doesn't?
YANN LECUN: Yeah.
I mean, there's a lot of different types of people
working on those questions, some of them
are computer scientists, but many of whom
are either physicists or mathematicians.
So I've been--
I've been involved in an effort for many years
to try to get the applied math and pure math community
interested in those questions.
And I've only been successful in the last year or two.
Same for the physicists.
So basically, there are results in random matrix theory that
can be applied to the understanding
of the landscape of objective functions of those networks.
And it would seem to demonstrate,
to show, that the number of saddle
points in those loss functions is combinatorially large.
But on the other hand, that there are--
although there might be a lot of local minima,
they're all pretty much of the same energy level.
So it doesn't matter which one you find.
And then there is empirical evidence to the fact
that the local minima are extremely degenerate.
So if you move in a large number of dimensions
around those local minima, the objective function
is essentially flat.
And there's a small number of directions where it's not flat.
That depends on the complexity of the problem.
And there's also empirical evidence
that [INAUDIBLE] showed in a paper, which is
that if you take two solutions.
So you start from two random initial conditions.
You train your neural net.
You get two different solutions.
Then you go in a straight line between the two.
And you barely go up.
And if you bend the path just a little bit,
then you can go from one minimum to the other without going up.
So that tends to show that there's basically
only one minimum.
It's very degenerate.
And it's connected everywhere.
The intuition that we have, the usual intuition
of a local minimum in one dimension is completely wrong.
Building a box in a hundred million dimensions
is very hard because you need a lot of walls.
So there's always going to be directions
where you can escape.
And that creates saddle points.
So that's one thing.
And then there is work on generalization ability.
Like why do those things generalize the way they do,
even though they are way overparameterized.
There's an interesting paper.
One of the co-authors is Ben Recht from Berkeley
recently where they showed that you can take a ImageNet style
network, convolutional net.
You set the labels to completely random labels.
And those neural nets can still learn the training
set completely without errors.
One million training samples, they will just nail it,
100% correct.
Of course, the generalization error is chance.
But what that means is that there
is a huge amount of capacity in those networks
that they are able to recruit, if they need to.
But when you train them on things that make sense,
they don't overfit that much.
They do overfit, but not ridiculously.
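A hedged sketch of that random-label experiment (illustrative only; CIFAR-10
stands in for ImageNet here, and the model and schedule are placeholders):

```python
import torch
import torchvision
import torchvision.transforms as T

train = torchvision.datasets.CIFAR10(".", train=True, download=True,
                                      transform=T.ToTensor())
# Replace every label with a random one: any structure in the labels is destroyed.
train.targets = torch.randint(0, 10, (len(train.targets),)).tolist()

loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)
model = torchvision.models.resnet18(num_classes=10)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(200):                  # with enough epochs the network
    for x, y in loader:                   # memorizes the random labels
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()

# Training accuracy approaches 100%, while held-out accuracy stays at chance.
```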
AUDIENCE: Hi.
So it seems like it's very clear that it's
important to have a strong predictive model of the world
to achieve intelligence.
But it also seems like there may be other components
to it, things such as creativity or metacognition.
So do you have any thoughts on how
we might achieve those other parts of intelligence?
YANN LECUN: So metacognition probably
is number 562 in the list of problems
we have to solve, a list that maybe has 1,000 items.
I'm not sure about that.
But creativity, I think those GANs actually exhibits
some level of creativity.
So there are people, for example,
at Rutgers, one of them is actually now
at Facebook, who used GANs to generate paintings,
abstract paintings in particular styles.
And they look really nice.
So that begs the question of,
what does creativity really mean?
We have a couple projects at Facebook
that I can't talk about yet, but soon, that involve also
creating kind of artistic artifacts using
those generative models.
And they look interesting.
People who actually are in the business of creating artifacts
are actually impressed.
AUDIENCE: Hi.
I do some particle physics here.
I'm an undergrad.
And one of the big problems that we're
facing in implementing technologies like this
is that the data we have is collected almost
from a third person perspective where you have access
to all the variable information in three dimensions.
And so it's very hard to take a first person camera view
perspective of an event and try to pick apart what's going on.
What are the major computational challenges--
what's the difference between taking like a camera
view of these scenes and dissecting them
with a convolutional neural net versus somehow finding
an effective way of analyzing three dimensional information?
YANN LECUN: OK.
So a number of different answers there.
So first of all, there is quite a lot of interest
in the use of convolutional nets
in the context of high energy physics,
basically for trajectory filtering essentially,
so filtering events that are interesting.
I'm sure that's the kind of stuff you were thinking of.
I actually gave a talk at CERN maybe a couple years ago,
or a year and a half ago, and met a bunch
of people working on this.
And it's really expanding.
There's a colleague of mine at NYU called Kyle Cranmer who
has been working on this kind of stuff
actually using those GANs.
He's come up with good ideas on characterizing trajectories
of generating models of trajectories.
So that said, very often, those trajectories are in 3D.
And you'd like to be able to basically analyze them in 3D.
So you could use those 3D convolutional nets
that I was talking about earlier in the middle of the talk.
They are sort of efficient for this,
because most of the voxels in a high energy physics experiment
are empty.
So you would like to be able to concentrate the computation
where things are relevant.
That's one thing.
The second thing is that there is
a new set of ideas I didn't talk about called graph
convolutional nets, or spectral networks.
So it's basically the idea that an image, a normal image, you
can think of an image as a function on a grid graph,
on a regular grid.
The pixels form a grid.
You can think of it as a graph where each pixel is connected
to its nearest neighbors.
And that indicates that--
it's just a reflection of the fact
that neighboring pixels are correlated.
Now imagine now that you have data
that comes to you not in kind of a flat grid graph,
but in a weird graph, like a cylinder or something,
like the calorimeter in a high energy physics experiment,
or with some other set of sensors that is non-Euclidean.
You can actually define convolutions in those spaces.
And they're basically diagonal operators
in the eigenbasis of the graph Laplacian, where the graph represents
the neighborhood relationships.
And so people have actually come up with ways
to apply convolutional nets to those non-Euclidean domains.
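As a hedged sketch, here is one common simplified form of a graph
convolution layer (not the spectral formulation he is describing; the
normalization and shapes are my own illustrative assumptions), where the
graph's adjacency matrix plays the role of the pixel neighborhood:

```python
import torch

class GraphConv(torch.nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = torch.nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (num_nodes, in_dim) node features, adj: (num_nodes, num_nodes) adjacency
        a = adj + torch.eye(adj.size(0))          # add self-loops
        deg = a.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))    # symmetric normalization
        a_norm = d_inv_sqrt @ a @ d_inv_sqrt
        return torch.relu(self.lin(a_norm @ x))   # aggregate neighbors, then transform
```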
In fact, there is going to be a tutorial at NIPS
next week on precisely that topic in exactly one
week, Monday next week, which I'm a core speaker on.
But I'm actually going to speak.
There's going to be [INAUDIBLE].
AUDIENCE: You talked about--
sorry.
You talked about systems that both learn and reason.
And it seems to me like you argued that to get a strong AI,
you would need to do both of these things.
Now it seems to me like obviously humans do this.
But humans in a lot of ways are very dumb.
They make a lot of mistakes.
And they're very plastic.
And they need to learn to reason.
Whereas a lot of AI systems and reinforcement learning systems
do something very smart that takes
a lot of computational power.
And it's very much hard coded.
Do you think we'll see a trend towards dumber and more
plastic reasoning systems?
YANN LECUN: So I think most reinforcement--
Michael, correct me if I'm wrong.
But I think most reinforcement learning systems
that people are training today actually
are completely reactive.
They are very simple in terms--
I mean, there's very little actual reasoning.
Other than things like AlphaGo, AlphaGo Zero,
where there is tree exploration in the set of possible futures,
which is used for training.
Once it's trained, it actually just plays
without much tree exploration, actually.
So there's not a huge amount of reasoning there.
And that's a limitation not of reinforcement learning per se,
but of the architectures we use for all of our AI systems.
So I think what we consider
intelligent behavior involves this ability to predict.
In fact, I think the essence of intelligence
really is the ability to predict.
And so if you have a good model of the world that
is accurate for prediction, then you
can use it to plan a sequence of actions ahead
and perhaps moderate uncertainties about it.
And things like this.
So this is what reasoning really is about,
is predicting ahead what's going to happen, not necessarily
in time.
But also sort of simulating, so manipulating models.
Like when you think in your head about mathematics
or various other things, very often,
you have mental models that you manipulate.
They are simulators in a way.
You give them inputs, and they change.
And things like that.
That I think is really the essence
of reasoning and intelligence.
ROSSI LUO: Looking at the clock, it's 5:30.
I'm going to take one last question.
And if you have additional questions,
you can probably just take them briefly to the floor discussions
afterwards.
AUDIENCE: What's-- I'm not that familiar with deep learning
neural nets.
But I'm curious.
If I wanted to learn an object up
to something like affine transformations,
can I do transfer learning to do that?
Can you learn a whole group of transformations,
and then learn an object and then
have the object under those transformations?
YANN LECUN: So yes and no.
So if you take a convolutional net, for example,
and you train it on datasets like ImageNet that
have lots of different instances of the same objects from various
viewpoints and things like this, it learns the notion
of object relatively independently of the viewpoint,
but not completely.
So it has to recognize a dog, whether it's a profile
view or a frontal view.
But if you take the head of the dog upside down,
it probably won't be able to recognize it.
The same way we have a hard time recognizing people
when their faces are upside down.
AUDIENCE: Not exclu-- little rotations,
shears, things like that.
YANN LECUN: Right, right.
So small rotations, shears, and scaling,
that's handled by the pooling operation
in convolutional nets.
AUDIENCE: Right.
But there's nothing, no explicit geometric--
YANN LECUN: No.
There's no explicit 3D geometry.
And there is no real explicit 3D geometry,
except for the fact that whenever a feature is
detected in one location, it's also
detected in other locations.
And the fact that there is this pooling operation
that basically builds in a little bit of resistance--
smoothness to variations of the location
of particular features.
So small variations of the position of elementary features
due to rotation, shear, and things like this,
will actually--
AUDIENCE: You're pooling them.
And that's why you're getting them.
But you're not explicitly modeling.
Same thing with Newtonian physics.
There's no built in physics yet, right?
YANN LECUN: Right.
There's-- no.
No built in physics.
AUDIENCE: Thank you.
ROSSI LUO: The main event I think is over.
And if you have additional questions,
you're welcome to briefly discuss with Professor LeCun
afterwards.
And thanks [INAUDIBLE].
And let's give Professor Yann LeCun a round of applause.
[APPLAUSE]