
  • ROSSI LUO: Good afternoon.

  • Welcome to Brown Biostatistics Seminar.

  • And I'm Rossi Luo, faculty host for today's event.

  • And for those of you new to our departmental seminar,

  • the format is usually a presentation

  • followed by a question and answer session.

  • And because of the size of the crowd today, we

  • are going to also use this red box

  • thing to capture your questions, for videotaping and also

  • to make sure your questions are heard.

  • And today I'm very pleased to introduce Professor Yann LeCun.

  • Professor LeCun is a director of Facebook AI Research,

  • also known as FAIR.

  • And he is also Silver Professor of computer science,

  • neural science, and electrical and computer engineering

  • at New York University.

  • He's also the founding director of NYU Center for Data Science.

  • Before joining NYU, he headed research departments

  • in industry, including AT&T and NEC.

  • Professor LeCun has made extraordinary research

  • contributions in machine learning, computer vision,

  • mobile robotics, and computational neuroscience.

  • Among these, he's a pioneer in developing

  • convolutional neural networks.

  • And he is also a founding father of convolutional nets.

  • And this work contributed to, say,

  • the creation of a new and exploding field in machine learning

  • called deep learning, which is now

  • a core artificial intelligence tool for a wide range

  • of applications, from images to natural text processing.

  • And his research contributions

  • have earned him many honors and awards

  • including the election to the US National

  • Academy of Engineering.

  • Today he will give a seminar titled,

  • How Can Machines Learn as Efficiently as Animals

  • and Humans.

  • I understand some of you actually

  • told me you drove from Boston or other places very far away.

  • So without further ado, let's welcome Professor Yann LeCun

  • for his talk.

  • [APPLAUSE]

  • YANN LECUN: Thank you very much.

  • It's a pleasure to be here.

  • A game I play now occasionally when I give a talk here is I

  • count how many former colleagues from AT&T are in the room.

  • I count at least two.

  • Chris Rose here, Michael Littman.

  • Maybe that's it.

  • That's pretty good, two.

  • Right.

  • So, how can machines learn as efficiently

  • as animals and humans?

  • I have a terrible confession to make.

  • AI systems today suck.

  • [LAUGHTER]

  • Here it is in a slightly less vernacular form.

  • Recently, I gave a talk at a conference at Columbia

  • called the Cognitive Computational Neuroscience

  • Conference.

  • It was the first edition.

  • And there was a keynote.

  • And before me, Josh Tenenbaum gave

  • a keynote where he said this.

  • All of these AI systems that we see now, none of them

  • are real AI.

  • And what he means by this is that none of them

  • actually learns stuff that is as complicated as what

  • humans can learn,

  • or learns stuff as efficiently as

  • animals seem to learn it.

  • So we don't have robots that are nearly as

  • agile as a cat for example.

  • You know, we have machines that can play Go better

  • than any human.

  • But that's kind of not quite the same.

  • And so that tells us there are major pieces of learning

  • that we haven't figured out.

  • Things that animals are able to do that we don't do--

  • that we can't do with our machines.

  • And so, I'm sort of jumping ahead here

  • and telling you the punch line in advance, which

  • is that we need a new paradigm for learning,

  • or a new way of formulating the old paradigms, that

  • will allow machines to learn how the world works the way animals

  • and humans do.

  • So the current paradigm of learning

  • is basically supervised learning.

  • So all the applications of machine learning,

  • AI, deep learning, all the stuff you see in actual real world

  • applications, most of them use supervised learning.

  • There's a tiny number of them that

  • use reinforcement learning.

  • Most of them use some form of supervised learning.

  • And you know, supervised learning, we all--

  • I'm sure most of you in the room know what it is.

  • You want to build a machine that classifies cars from airplanes.

  • You show an image of a car.

  • If a machine says car, you do nothing.

  • If it says airplane, you adjust the knobs on the machine

  • so that the output gets closer to what you want.

  • And then you show an example of an airplane.

  • And you do the same.

  • And then you keep showing images of airplanes and cars,

  • millions of them, thousands of them.

  • You adjust the knobs a little bit every time.

  • And eventually, the knobs settle on a configuration,

  • if you're lucky enough, that will distinguish every car

  • from every airplane, including the ones

  • that the machine has never seen before.

  • That's called the generalization ability.
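
A minimal sketch in PyTorch of the knob-adjusting loop just described. The two-category car/airplane model, the random stand-in images, and all names here are illustrative placeholders, not the actual system from the talk.

```python
import torch
import torch.nn as nn

# Show an example, compare the machine's answer to the label, and nudge the
# knobs (the parameters) a little bit every time, as described above.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 2))  # car vs. airplane
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

fake_data = [(torch.randn(1, 3, 32, 32), torch.tensor([0]))] * 100  # stand-in images
for image, label in fake_data:
    score = model(image)          # the machine says "car" or "airplane"
    loss = loss_fn(score, label)  # discrepancy between output and what we want
    optimizer.zero_grad()
    loss.backward()               # which way should each knob turn?
    optimizer.step()              # adjust the knobs a little bit
```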

  • And what deep learning has brought to the table there,

  • in supervised learning, is the ability

  • to build those machines more or less

  • automatically with very little sort of human input

  • in how the machine needs to be built,

  • except in very general terms.

  • So the limitation of this is that you

  • have to have lots of data that has been labeled by people.

  • And to get a machine to distinguish cars

  • from airplanes, you need to show it

  • thousands of examples.

  • And it's not the case that babies or animals

  • need thousands of examples of each category

  • to be able to recognize them.

  • Now, I should say that even with supervised learning,

  • you could do something called transfer learning, where

  • you train a machine to recognize lots of different objects.

  • And then if you want to add a new object category,

  • you can just retrain with very few samples.

  • And generally it works.

  • And so what that says, what that tells

  • you is that when you train a machine,

  • you kind of figure out a way to represent the world that

  • is independent of the task somehow, even though you train

  • it for a particular task.
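
A hedged sketch of the transfer-learning idea just mentioned, assuming a torchvision ResNet as the pretrained representation; the 5-category head and the choice of model are illustrative, not the talk's actual setup.

```python
import torch.nn as nn
from torchvision import models

# Reuse a representation trained on many categories, keep it fixed, and
# retrain only a small new output layer on very few labeled samples.
net = models.resnet18(pretrained=True)      # representation trained on ImageNet
for p in net.parameters():
    p.requires_grad = False                 # freeze the task-independent features
net.fc = nn.Linear(net.fc.in_features, 5)   # new head for 5 new categories
# Only net.fc is then trained, which is why a handful of samples can suffice.
```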

  • So what did deep learning bring to the table?

  • Deep learning brought to the table

  • the ability to basically train those machines

  • without having to hand craft too many modules of it.

  • The traditional way of doing pattern recognition

  • is you take an image, and you design a feature extractor that

  • turns the image into a list of numbers that can be digested

  • by a learning algorithm, regardless of what

  • your favorite learning algorithm is,

  • linear classifiers, [INAUDIBLE] machines, kernel machines,

  • trees, whatever you want, or neural nets.

  • But you have to preprocess it in a digestible way.

  • And what deep learning has allowed us to do

  • is basically design a learning machine

  • as a cascade of parametrised modules, each of which

  • computes a nonlinear function parametrised

  • by a set of coefficients, and train the whole machine end

  • to end to do a particular task.

  • And this is kind of an old idea.

  • People even in the 60s had the idea

  • that this would be great to come up

  • with learning algorithms that would train multilayer systems

  • of this type.

  • They didn't quite have the right framework, if you want,

  • nor the right computers for it.

  • And so in the 80s, something came up

  • called back propagation with neural nets

  • that allowed us to do this.

  • And I'm going to come to this in a minute.

  • So the next question you can ask of course

  • is what do you put in those boxes?

  • And the simplest thing you can imagine as a nonlinear

  • function-- it has to be nonlinear,

  • because otherwise there's no point in stacking boxes.

  • So the simplest thing you can imagine is take an image,

  • think of it as a vector, essentially.

  • Multiply it by a matrix.

  • The coefficients of this matrix are going to be learned.

  • And you can think of every row of this matrix being

  • used to compute a dot product with an input vector.

  • And that produces basically a weighted sum

  • of the inputs multiplied by those coefficients.

  • That gives you another vector.

  • And you pass each component of this vector

  • through a nonlinearity like this one, for example.

  • Just half-wave rectification.

  • So you have two different steps.

  • Linear, nonlinear.

  • Linear pointwise, nonlinear.

  • Very simple.

  • And you can show that by stacking two layers of this,

  • you can approximate any function you want, as close as you want

  • as long as you have sufficiently many of these guys

  • in the middle by tweaking the parameters of the two layers.
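
A sketch of the two-step recipe above: linear (a learned matrix), pointwise nonlinearity (half-wave rectification, i.e. ReLU), then linear again. The dimensions below are arbitrary illustrations.

```python
import torch

x = torch.randn(64)                                # the input, seen as a vector
W1, b1 = torch.randn(100, 64), torch.zeros(100)    # first layer's coefficients
W2, b2 = torch.randn(10, 100), torch.zeros(10)     # second layer's coefficients

h = torch.relu(W1 @ x + b1)   # weighted sums, then the pointwise nonlinearity
y = W2 @ h + b2               # with enough units in the middle, this two-layer
                              # stack can approximate any reasonable function
```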

  • But in fact, most functions we're interested in

  • are more economically represented by many layers.

  • And so that's the new approach to deep learning, if you want,

  • that changes from the neural nets of 30 years ago,

  • which typically had only two or three layers.

  • The neural nets of today, the deep learning systems of today

  • have anywhere between 20, 50, or 100 layers.

  • OK.

  • So we have linear operators that are

  • parametrized by coefficients.

  • And in supervised learning, we're basically

  • going to define some sort of objective function

  • that's going to measure the discrepancy

  • between the output the machine produces and the output

  • we want.

  • And so the objective function is going to be differentiable.

  • What we're going to do is compute the gradient

  • of the objective function with respect

  • to all the parameters in the machine averaged

  • over a number of training samples.

  • Or if we use stochastic gradient descent,

  • averaged over a small batch of training samples,

  • or even a single sample.

  • And then take one step [INAUDIBLE]

  • to get your gradient using the stochastic gradient update

  • rule.

  • Basically, the parameters are going

  • to kind of go down to a minimum in a stochastic fashion

  • as you train more and more.
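
In symbols, the stochastic gradient update rule he is describing can be written as below, where $w$ are the parameters (the knobs), $\eta$ is a step size, and $B$ is a mini-batch of training samples, possibly a single one; the notation is introduced here for illustration.

```latex
% one step against the gradient of the loss, averaged over the mini-batch B
w \;\leftarrow\; w \;-\; \eta \,\frac{1}{|B|} \sum_{i \in B} \nabla_w\, \ell\big(f(x_i; w),\, y_i\big)
```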

  • So now the next step you have to do

  • is compute the gradient of the objective function

  • with respect to the parameters.

  • And the way you do this is through back propagation.

  • I'm not going to go through this.

  • The mathematical concept on which it's based

  • is incredibly sophisticated.

  • It's called the chain rule.

  • [LAUGHTER]

  • And some people learn this in high school.

  • And it basically comes down to the fact

  • that if you arrange parametrized functions

  • in a graph of computation, which in this case

  • is a very simple one.

  • It's just a linear stack of modules.

  • But it doesn't need to be such a simple graph.

  • It could be any graph.

  • And you compute gradients by propagating signals

  • backwards through this graph.

  • Basically taking the gradient of some cost

  • function you want to minimize with respect

  • to this red variable.

  • And so this gradient is represented

  • by this green variable.

  • And multiplying it by the Jacobian of this box,

  • you get the gradient with respect to the input of that box.

  • This is the chain rule.

  • So it's this guy here.

  • Gradient with respect to the input

  • equals gradient with respect to the output

  • multiplied by Jacobian.

  • Very easy.
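
In symbols, for a box computing $y = f(x)$ inside the graph and a cost $C$ being minimized, the rule he is pointing at is just the following; the notation is introduced here for illustration.

```latex
% gradient w.r.t. the input = gradient w.r.t. the output times the Jacobian
\frac{\partial C}{\partial x} \;=\; \frac{\partial C}{\partial y}\,\frac{\partial y}{\partial x}
```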

  • And so you propagate this backwards through the graph.

  • And the cool thing about this is that you can do this

  • automatically by having a bunch of modules of this type

  • that have been predefined.

  • And you assemble them in a graph.

  • And then automatically you get a gradient back.

  • You don't have to figure out how to compute it.

  • So that's what all of those deep learning frameworks

  • allow you to do.

  • They're very simple to use.

  • Our favorite one is called PyTorch.

  • And you know, there are several Jacobians

  • for each of those boxes.

  • One that propagates gradients to the input,

  • others that propagate to the parameters.

  • And that allows you to compute all the gradients

  • of the objective function, or whatever

  • you want to minimize with respect to all the parameters.
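
A tiny PyTorch illustration of "assemble the graph and the gradients come back automatically"; the shapes and names are arbitrary.

```python
import torch

# autograd records the forward operations, and backward() applies the chain
# rule through every box, both toward inputs and toward parameters.
x = torch.randn(4, 3)
W = torch.randn(2, 3, requires_grad=True)  # a parameter of one "box"
loss = torch.relu(x @ W.T).sum()           # forward pass builds the graph
loss.backward()                            # backprop, done for you
print(W.grad.shape)                        # gradient of the loss w.r.t. W
```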

  • So, OK, back prop.

  • That's an old idea.

  • The basic idea of it actually goes back

  • to Leibniz and Newton, obviously.

  • But more recently, the people in optimal control

  • actually have used things like this,

  • called the adjoint state methods or adjoint system

  • methods, for optimal control; they were invented in the 60s.

  • That's what NASA used to compute rocket trajectories

  • and things of that type.

  • And it wasn't used for learning.

  • It was used for optimal control.

  • But it is a very similar idea.

  • So we think of those variables as being

  • kind of control variables of a rocket,

  • and this being kind of the trajectory

  • of the rocket, if you want.

  • And then people realized you could use this

  • for learning in the late 70s, early 80s,

  • but never quite actually made it work.

  • And it started being used in the late 80s essentially.

  • And that's when the first wave of neural nets--

  • or the second wave a neural nets took off.

  • And it was around 1986, 1987 that people

  • realized you could train multilayer neural

  • nets with this.

  • And then it died in the 90s, the mid 90s.

  • OK.

  • So the next question you can ask is those linear operators

  • are nice.

  • But you know, if my image is a long vector with millions

  • of pixels, I'm not going to multiply it

  • by a matrix that's several million by several million.

  • So you have to organize those linear operators

  • in ways that make them practical for things like images

  • or high dimensional inputs.

  • That's where the idea of convolutional nets comes in.

  • It actually doesn't come from sort of theoretical hypotheses.

  • But it was actually inspired by biology.

  • So I know there are neuroscientists in the room.

  • So this is inspired by Hubel and Wiesel, 1962.

  • Very classical work in neuroscience,

  • Nobel Prize winning work.

  • There were models of--

  • computational models of these basic ideas

  • of Hubel and Wiesel by Fukushima, and his neocognitron

  • model was the inspiration

  • for convolutional nets.

  • And the basic idea is that in the visual cortex, and this

  • is something you can derive from first principles,

  • it's probably a good idea for images to be

  • able to detect local features by basically having a template

  • that you match with the input.

  • And you get a score for how well this thing matches

  • with this one, basically a dot product, the weighted sum

  • of those pixels by those coefficients.

  • And then you swipe this over the image everywhere.

  • And the results are recorded in something

  • we call a feature map here.

  • And that operation is a discrete convolution.

  • But it's very similar to the kind of operation

  • you see, what's called simple cells in the visual cortex

  • do on images, where a particular neuron in the visual cortex

  • is connected to a local neighborhood

  • in the visual field.

  • And sort of detects local features as well.

  • So that's what this first layer is doing.

  • So these are multiple filters.

  • These are the convolution kernels, [INAUDIBLE]

  • filters applied to this image to produce those maps.

  • And then you do what's called a pooling operation where

  • you take the result, like a local patch

  • of those results of filtering after the non-linearity.

  • And you compute an average or a max or L2 norm,

  • or something like this.

  • And you subsample the results so that the windows

  • over which you compute this aggregation

  • are stepped by more than one pixel.

  • So here they're stepped by two pixels.

  • So you get a map that's half the resolution of this one.

  • And then you repeat the process.

  • So you get convolutions again.

  • So this guy is the result of applying convolution kernels

  • to each of those maps, adding up the result,

  • passing it through a non-linearity.

  • And then again, there is pooling and subsampling.

  • So as you go up the layers, you get

  • representations that are more global and kind of more

  • abstract and etc.

  • And this is really the idea of simple cells

  • and complex cells, complex cells being those pooling areas--

  • sort of a realization of this.
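
A hedged sketch of the convolution / pooling stack just described, in PyTorch; the layer sizes are illustrative (they happen to follow the spirit of the early handwriting nets).

```python
import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # templates swiped over the image: 6 feature maps
    nn.ReLU(),                         # the pointwise nonlinearity
    nn.MaxPool2d(2),                   # pooling stepped by 2: half the resolution
    nn.Conv2d(6, 16, kernel_size=5),   # combine maps into more abstract features
    nn.ReLU(),
    nn.MaxPool2d(2),                   # pool and subsample again
)
maps = features(torch.randn(1, 1, 32, 32))
print(maps.shape)  # torch.Size([1, 16, 5, 5]): more global, more abstract
```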

  • That's the-- drawing from Fukushima's paper

  • on the neocognitron where you had

  • those kind of simple cells and complex cells.

  • So this is a convolutional net.

  • This is meant to be an animation.

  • I'm not sure why it's not animating.

  • But it's not animating.

  • And not only that, it actually crashed my computer.

  • All right.

  • I'm going to have to do something very brief

  • for just a minute.

  • OK.

  • Now it works.

  • So this is an old convolutional net

  • trained in the early 90s to recognize handwriting.

  • And what you can see here is that this is the first layer.

  • That's the input.

  • So the first layer, 6 feature maps.

  • Then pooling subsampling, second layer.

  • Pooling subsampling, third layer.

  • And by the time you get here, each unit here, each pixel

  • represents the activation of a unit.

  • It basically sees the entire input, or at least

  • a square on the input.

  • And so a slice through this represents an entire character

  • essentially in sort of abstract form.

  • And the good thing we realized pretty quickly with it

  • is that we could not just use it to recognize single objects,

  • but also multiple objects.

  • And that's very important.

  • So here we-- you basically have multiple copies

  • of the same convolutional net applied to a sliding window

  • over the input.

  • And it's actually very cheap to do this.

  • You can sort of apply the convolutional net

  • convolutionally.

  • It's convolutions all the way.

  • People sometimes call this a fully convolutional net

  • now.

  • And at the output, you get a score

  • for every window and every category.

  • And here I'm just showing the winning score

  • with kind of a gray scale to indicate

  • the score of the category.

  • And then a very simple post-processing

  • pulls out the correct interpretation.
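
A sketch of "applying the convolutional net convolutionally": if the classifier stage is itself a convolution, feeding a wider image yields a score for every window position in one shared pass. The toy architecture below is a stand-in for illustration, not the actual digit reader.

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 8, 5), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(8, 10, 12),   # plays the role of the classifier: 10 digit scores
)
print(net(torch.randn(1, 1, 28, 28)).shape)  # [1, 10, 1, 1]: one window
print(net(torch.randn(1, 1, 28, 84)).shape)  # [1, 10, 1, 29]: a score per window
```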

  • So here, the cool thing is that the system

  • can recognize objects without prior segmentation.

  • You don't have to separate the digits before being

  • able to recognize them.

  • And that's really important if you

  • want to be able to apply those things to natural images

  • where objects appear in the background.

  • And you can't afford to--

  • and you can't actually figure out

  • how to separate them from the background.

  • So that was kind of an important thing.

  • And then going forward a number of years,

  • about almost 10 years to 2003, someone at DARPA came up to us

  • and said, can you use machine learning, neural nets,

  • let's say, to drive robots?

  • And so we built this little truck robot here.

  • It's just a radio controlled truck

  • with two cameras, analog cameras.

  • And we had this truck being driven

  • by someone for about 20 minutes, or a total of maybe two hours.

  • And that person would be instructed

  • to drive straight and sort of veer off

  • whenever there was an obstacle.

  • And you know, he would--

  • after some training, you feed the network

  • with two images from the two cameras.

  • And then you would just train the network

  • to emulate the steering angle of the human driver.

  • And you let the robot loose.

  • And it gets through all this kind of horrible busy Jersey

  • backyard here, driving itself through these obstacles.

  • So we showed these to DARPA.

  • And they said, oh, that's great.

  • We're going to start a program called LAGR

  • and have six different teams compete.

  • That would be nice if this slide actually showed.

  • Here we go.

  • Six different teams compete.

  • They will all get the same robot.

  • And you'll train this robot to--

  • using machine learning, to figure out

  • whether it can drive over a particular area or not.

  • And so we used this convolutional net

  • that would look at bands in the image

  • and then label every pixel as to whether it's

  • traversable or not.

  • So something like this.

  • And the cool thing is that you can actually

  • get ground truth, more or less, through stereo vision.

  • So using a stereo vision system, because this robot has

  • multiple cameras, you can figure out

  • if something sticks out of the ground.

  • But that only works up to about 10 meters.

  • Beyond that it doesn't work.

  • So you trained a neural net with the labels collected

  • from stereo.

  • And then you run the neural net on the whole image.

  • And it does this.

  • It figures out where a path is essentially.

  • And it figures out here in the back

  • there is this row of obstacles in the little passage

  • way in between.

  • And so this thing kind of worked pretty well.

  • There were again, six different teams competing on this.

  • We were the only ones to use convolutional nets.

  • But again, this was 200--

  • the project started in 2005 and ended in 2008.

  • And so there's a fast vision system that

  • uses stereo, a slow system that uses stereo, and then

  • a slow vision system as well that uses this neural net.

  • And then you put the results together. You combine

  • all the results in a map.

  • And you can do some planning to figure out how

  • to get to a particular goal.

  • The map here is centered on the robot.

  • So it's relatively easy to plan.

  • And then the system actually trains itself as it goes.

  • It adapts, collecting labels from the stereo vision.

  • It learns how to navigate new environments it's never seen

  • before, even with the pesky grad students who

  • try to annoy this poor robot.

  • [LAUGHTER]

  • The robot weighs about 100 kilos.

  • It can probably break their legs.

  • But they're pretty sure it's not going to do that, because they

  • actually wrote the code.

  • This is-- and they trained it.

  • This was Raia Hadsell, who at that time

  • was a PhD student with me, who now leads the Robotics Research

  • Group at Deepmind.

  • And Pierre Sermanet, who is at Google Brain,

  • also working on robotics.

  • So a couple of years later, we realized

  • we could use the same kind of technology

  • for not just labeling pixels in an image as to whether it's

  • traversable or not, but also labeled with categories.

  • And some datasets started to appear,

  • you know, maybe with a couple thousand

  • images, that allowed us to train the convolutional net to do

  • this.

  • So again, this is a convolutional net

  • applied to the whole image.

  • Each output of the convolutional net is influenced by a window

  • on the input, which is something like 40 by 40 pixels

  • at high resolution and 90 by 90 pixels

  • at half, and 180 by 180 pixels at quarter resolution.

  • So it sees a big context to make a decision for a single pixel.

  • But it kind of makes a decision for every pixel.

  • And the cool thing about this is that we

  • can run this in real time.

  • So this was implemented on what's

  • called an FPGA, which is sort of programmable hardware.

  • And it could run at about 20 frames per second classifying

  • into 33 categories.

  • And it was far from perfect.

  • You know, it classified those areas here as sand or desert.

  • And this is the middle of Manhattan.

  • So there's no sand I'm aware of.

  • And it worked pretty well.

  • So we submitted a paper to CVPR in 2011.

  • And it was soundly rejected.

  • And the reviewer comments were either what the hell

  • is a convolutional net?

  • Or how is it possible that you get such good results

  • with a technique we've never heard of?

  • So it's kind of funny.

  • So we afterwards submitted it to ICML, where it was accepted.

  • And so the funny thing is back in 2011,

  • you couldn't get a paper accepted at a computer vision

  • conference if you use neural nets.

  • Now you cannot get a paper accepted at CVPR unless you

  • actually use convolutional nets.

  • So there was a complete revolution over the next few years.

  • So that gave some ideas to a few people

  • working on self-driving cars around that time, around 2013-14,

  • where they realized they could use

  • those kind of convolutional net based semantic segmentation

  • techniques to label every pixel in an image

  • as to whether it's traversable or not, or as to whether it's

  • a pedestrian or a road or something like this.

  • So this is some work at Nvidia.

  • This is work at Mobileye.

  • Which now belongs to Intel.

  • And this is a system that--

  • Mobileye produces systems that were used in the Tesla cars

  • for autonomous driving until mid 2016.

  • Then the two companies divorced.

  • They weren't agreeing with each other somehow.

  • So now Tesla is developing its own system.

  • Nvidia has a big project on this, which I may come back to.

  • And then around 2012, the big revolution occurred.

  • And what that was is the use of very large convolutional nets

  • implemented on GPUs to run really efficiently

  • and train on large datasets like the ImageNet dataset

  • that has a million training samples, 1,000 categories.

  • And it turns out those things work really,

  • really well when you have lots of categories

  • and lots of training samples.

  • And when you make them big.

  • And so the first to really make an efficient implementation

  • of those networks on GPUs were Geoff Hinton

  • and his students, Alex Krizhevsky and Ilya Sutskever.

  • And they presented the result at an ImageNet workshop

  • at ECCV in Fall 2012.

  • And then had a paper at NIPS in Winter 2012.

  • And that basically made the computer vision

  • field completely change, and basically jump

  • started the deep learning revolution.

  • That revolution had started in speech recognition

  • a couple of years earlier.

  • And the interesting thing about this

  • is that we ended up seeing an inflation

  • in the number of layers that are used

  • by those convolutional nets.

  • So this is the VGG network, which

  • was one of the top performing in 2013.

  • GoogLeNet-- no, this was 2013.

  • Then GoogLeNet in 2014, which had even more layers.

  • And then ResNet.

  • Kaiming He and his collaborators from Microsoft

  • Research Asia had this idea of having skip connections

  • that basically solved the problem that sometimes,

  • when you train a very deep neural net, some of the layers

  • die.

  • The weights don't go anywhere.

  • That kills the entire thing.

  • So they use those skip connections

  • to prevent catastrophic bad things from happening

  • if some layers die.

  • And that turned out to be a very, very good idea that

  • seems almost too simple.

  • But in fact, it works really, really well.

  • And so you can train neural nets with 50 layers,

  • 100 layers, 150 layers.

  • And they work really well.
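
A minimal sketch of a skip-connection (residual) block in the spirit of ResNet; the channel counts and names are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Even if the two conv layers 'die', x still flows through unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        h = torch.relu(self.conv1(x))
        return torch.relu(x + self.conv2(h))  # the skip connection: add x back
```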

  • There's sort of a more modern version of this.

  • One version called DenseNet, which

  • is a collaboration between people at FAIR

  • and people at Cornell, which is sort of a version of this

  • that is designed to run efficiently, etc.

  • And so one question you might ask

  • is, why do we need all those layers?

  • Right? Theoretically, you can approximate any function

  • with only two layers.

  • Why do you need many layers?

  • And you know, one possibility is the fact

  • that the world is compositional.

  • Images are basically composed of pixels.

  • And pixels are arranged together

  • to form things like edges and colored blobs,

  • and stuff like that.

  • And then by detecting combinations of those,

  • you can detect things like circles and corners

  • and gratings.

  • And then a combination of those forms parts of objects.

  • And a combination of those forms objects, et cetera.

  • So there is this kind of hierarchical nature

  • of the perceptual world which is sort of captured

  • by those layered architectures.

  • So we used to take weeks to train those networks.

  • And now we can train one of those networks

  • with basically state of the art performance in about an hour,

  • on a very large machine with 256 GPU cards in it.

  • It's actually multiple machines.

  • Each machine has 8 GPUs.

  • And you stack them up.

  • So you can do these kind of things

  • if you are at Facebook or at Google.

  • A little more difficult in a university environment.

  • But here are some more recent results on computer vision.

  • So this is a bit of a snapshot of the state of the art.

  • This is a model called Mask R-CNN, which

  • is a system that does not just do semantic segmentation,

  • but instance segmentation.

  • So I'm not going to bore you with all the details.

  • I'm just going to tell you that it beats

  • all the records on standard datasets like COCO.

  • And here's an example of a result you can get.

  • So again, it's essentially conceptually very simple,

  • a convolutional net with some sort of system

  • that sort of detects regions of interest and then

  • applies a slightly more complex convolutional net

  • on those regions of interest.

  • And the output of the network is not just a category,

  • but it's a category, the coordinates of a bounding box,

  • and an image of a mask of the object at the same resolution

  • as the input.

  • And so you get for every object, you get the category,

  • you get the mask of the person or the object,

  • and you get a bounding box.

  • And it detects baseball, the dog, the individual people,

  • even though they all overlap.

  • So this is instance segmentation, not just

  • semantic segmentation.

  • With semantic segmentation, you would have just one big blob here

  • labeled people.

  • You can detect wine glasses and wine bottles, very important

  • for French people, computers, you know, et cetera.

  • Backpacks, umbrellas, sheep, you can count sheep.

  • You know, overlapping cars, things like that.

  • It works amazingly well.

  • It's also trained to detect key points on human bodies.

  • So you can infer the body pose

  • of people in photos and videos.

  • There's actually-- there's more of this which I can't show you.

  • But it actually runs at 5 frames per second on a smartphone.

  • So it's a scaled-down version of this.

  • And then there were kind of new applications

  • of convolutional nets for 3D data.

  • So this is a recent competition called ShapeNet

  • where the dataset consists of 3D objects represented by point

  • clouds from a depth sensor.

  • And it's been manually segmented into regions or parts.

  • And the goal here is to essentially label every region

  • with the correct label.

  • And what turned out to win this recent competition

  • was a 3D convolutional net produced by Ben Graham

  • and Laurens van der Maaten.

  • So this is the original paper that

  • describes the idea of a sparse 3D convolutional net.

  • And there's some other contributors to the system.

  • It's a library you can download.

  • It's basically the idea of sort of only doing convolutions

  • in areas where you have populated voxels,

  • because in a 3D environment, most of the voxels are empty.

  • So you don't want to be computing convolutions

  • everywhere where there is nothing.

  • So you just follow the areas where there is something.

  • And it turns out to be much faster and easier to train.
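
A toy illustration of the sparse idea (not the actual SparseConvNet library): store only the occupied voxels and compute outputs only there. A 1x1x1 "convolution" keeps the sketch short; everything named here is invented for illustration.

```python
import torch

# coordinate -> feature vector, only for voxels that contain something
occupied = {(0, 0, 0): torch.randn(4), (5, 2, 7): torch.randn(4)}
weight = torch.randn(4, 4)   # a 1x1x1 convolution, for simplicity

# Empty space is simply never visited; a real 3x3x3 sparse convolution would
# also gather each active voxel's occupied neighbors.
out = {coord: weight @ feat for coord, feat in occupied.items()}
```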

  • And they actually won the competition with this technique.

  • And another application of convolutional nets

  • that's more recent is a system that's actually

  • deployed at Facebook that uses convolutional nets

  • for translation, language translation.

  • So you feed in a sentence in English.

  • And it goes through a bunch of convolutions.

  • And it's actually a gated convolutional network.

  • So those are gated linear units, which I'm not going

  • to go into the details of.

  • There is pointwise multiplication going on here.

  • And then it goes into this kind of a weird alignment

  • system that basically produces sort of German words,

  • word by word, and then kind of lines them up

  • in an appropriate way.

  • And so, it's very fast.

  • It's very efficient.

  • It works really well.

  • And this is what is used for--

  • for translating some pairs of languages on Facebook.

  • Facebook can translate 2000 pairs of languages.

  • A number of them are translated using old style phrase based

  • statistical methods.

  • A number of them are translated using recurrent neural nets.

  • And then a small number of them are

  • translated using this system, which

  • is now being trained on more and more language pairs.

  • So a lot of the research that we do at FAIR-- in fact,

  • all of it-- is open.

  • We publish everything we do, generally

  • very quickly on arXiv.

  • And we also publish most of our code as open source and so forth.

  • So these are a few examples of some of the stuff we've deployed.

  • We've distributed it open source.

  • I would single out PyTorch.

  • This is a deep learning framework with a Python front

  • end.

  • It is very simple to use.

  • It's very good for research.

  • It's more transparent than TensorFlow.

  • OK.

  • And there's of course a lot of applications

  • of those things to medical imaging,

  • of course, and things like that, which

  • I'm not personally working on.

  • But a lot of my colleagues are.

  • But what's missing about this is two things.

  • One is, how do we learn reasoning and memory and things

  • like this?

  • And the second one is, how do we learn general things

  • that animals and humans can learn

  • without being told the name of everything,

  • without being given labeled data?

  • So this is a work by a bunch of people from Facebook AI

  • research in Menlo Park in California.

  • Justin Johnson was an intern at Facebook from Stanford.

  • And Fei-Fei Li, his advisor.

  • And the idea here is can we use deep learning to do things

  • like visual reasoning?

  • So could we answer questions like this one.

  • Is there a matte cube that has the same size

  • as the red metal object?

  • So you have to read this a few times to sort of figure

  • out really what operation you have to do here.

  • And so the idea they came up with is very cool.

  • You take the question.

  • Are there more cubes than yellow things?

  • You feed this through a recurrent neural net

  • that represents this as essentially

  • a single vector of fixed size.

  • And then you run this through another recurrent net

  • that spits out a kind of a representation of a computation

  • graph.

  • Think of it as a visual program, which

  • basically gets instantiated in this graph that has one block.

  • Those are actually trainable blocks.

  • OK.

  • They're all the same architecture.

  • So one block that is supposed to figure out--

  • filter all the objects that are yellow.

  • And another one that filters out the cubes.

  • One block that counts how many yellow things there are.

  • This one counts how many cubes there are.

  • And then it compares the two.

  • And then figures out the answer.

  • Right.

  • And so you don't predefine what those blocks should do.

  • You initialize it a little bit by heavy supervision,

  • by specifying what the program here should be,

  • and which blocks should be assembled,

  • even though the blocks are not trained initially.

  • And then you backpropagate the gradients

  • to get the right answer through this whole thing, including

  • the convolutional net.

  • And eventually this thing figures out

  • what those blocks should do.
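
A toy sketch of the module-composition idea just described (the block names and shapes are invented for illustration): the predicted "program" is a list of block names, each naming a trainable module, and the blocks are chained into a graph that changes with every question.

```python
import torch
import torch.nn as nn

blocks = nn.ModuleDict({
    "filter_yellow": nn.Linear(16, 16),  # trainable; its role is learned, not coded
    "filter_cube":   nn.Linear(16, 16),
    "count":         nn.Linear(16, 1),
})

def run(features, program):
    h = features
    for name in program:      # assemble the computation graph dynamically
        h = blocks[name](h)
    return h

answer = run(torch.randn(16), ["filter_yellow", "count"])  # program from the question
```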

  • Of course, it will need to learn all those keywords.

  • And learn how to do reasoning.

  • But the interesting thing about it

  • is that it's completely dynamic.

  • You change the question, it's going to change the graph.

  • So the graph that you propagate gradient through changes

  • every time.

  • And that's why dynamic graphs are so important in deep

  • learning nowadays.

  • People are so excited about it for things

  • like natural language understanding.

  • So dynamic graphs is the situation

  • where the computational graph that you

  • use to compute your answer changes when the data changes.
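
Dynamic graphs in one tiny PyTorch example: the operations recorded for backprop depend on the data, so a different input builds a different graph.

```python
import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

if x.sum() > 0:              # data-dependent control flow
    y = (w * x).sum()        # one computational graph...
else:
    y = (w * x * x).sum()    # ...or a different one, chosen at run time
y.backward()                 # autograd follows whichever graph was built
```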

  • There's actually more recent work

  • along those lines by Aaron Courville at University

  • of Montreal, where they don't actually

  • have to specify a program like this.

  • You just stack multiple blocks.

  • And it just works.

  • It's pretty cool.

  • OK.

  • So for those statisticians in the room,

  • since I've been invited by bio-statisticians,

  • deep learning breaks all the basic rules of statistics.

  • I mean, not all of them, but some of them, right.

  • So the models are enormous, often

  • with many, many more parameters than there are training samples.

  • I mean, so take one of those convolutional nets

  • for ImageNet.

  • There are 1 million training samples.

  • Some of those models have 100 million parameters.

  • And they still work quite well.

  • They can often nail the training set perfectly.

  • And often there is no explicit regularization.

  • But it still works.

  • How is that possible?

  • The loss function is very highly non-convex.

  • It's got a ridiculously large combinatorial number

  • of saddle points.

  • But still, you pretty much get the same result every time you

  • train.

  • What it tells you is that maybe there are local minima,

  • but they're all pretty much equivalent.

  • And in fact, there are experiments

  • that seem to suggest they're all connected.

  • There is only one local minimum basically.

  • I mean, not one.

  • But essentially one.

  • Little attention is paid to managing uncertainty

  • beyond using very simple things like softmax

  • on the output when you do classification.

  • But there's a lot of effort spent on computational issues.

  • Like efficiently implementing all those things, and all

  • that stuff.

  • So it's all very unusual.

  • It breaks the rules you see in textbooks,

  • in statistical textbooks.

  • And that might be a reason why some people who are more

  • theoretically oriented had initially a lot of skepticism

  • towards neural nets.

  • OK.

  • But let me switch to kind of the point I really

  • want to make with this talk, which

  • is, where do we go from there?

  • OK.

  • So deep learning works very well.

  • There's a lot of applications we can use it for.

  • Even if we don't do any research anymore,

  • just with the technique that we've developed so far,

  • there's probably a lot of different industries

  • that are going to be affected by it that we can apply this to.

  • In fact, there's something that Andrew Ng said recently.

  • Stop doing research.

  • Just apply the stuff that we already know.

  • I don't think it's a good idea.

  • But I don't think he believes it completely either.

  • But it is interesting that he said this.

  • So what are the obstacles really to making significant progress?

  • Because as I said before, all the stuff you see,

  • that's not real AI.

  • And our machines do not learn with the same kind

  • of efficiency that we observe animals and humans learning

  • with.

  • So how do we get machines to learn

  • how the world works, learn common sense

  • or something like this?

  • So one might ask the question, going back to the inspiration

  • from biology, does the brain use a learning algorithm?

  • Or does it use 50 learning algorithms?

  • Or maybe 200?

  • Or maybe it's complete [INAUDIBLE],

  • the result of evolution.

  • There's no underlying principle behind it.

  • It's just a result of millions of years of evolution.

  • How much prior structure does animal or human learning

  • require for intelligence to emerge

  • in a reasonable amount of time?

  • All the learning algorithms that people in machine learning

  • and statistics have come up with minimize

  • some sort of objective function, or optimize

  • some sort of objective function, I should say.

  • Does the brain optimize an objective function?

  • What would that function be?

  • If it optimizes a function, does it

  • do it by evaluating a gradient?

  • If it evaluates a gradient, how does it do it?

  • It probably doesn't do backprop in the way

  • that we understand it today.

  • And how does it handle uncertainty

  • in prediction, which I think is a crucial issue?

  • So all kinds of questions like this that connect

  • AI machine learning with neuroscience really.

  • And one big missing ingredient in AI, or maybe a holy grail,

  • is common sense.

  • There's a subarea of AI called commonsense reasoning.

  • It's not actually a solution to a problem.

  • It's more of a problem.

  • And it's a question of how do we get machines

  • to acquire common sense.

  • So common sense is everyday--

  • the common sense of everyday things.

  • That supported-- unsupported objects fall.

  • That some objects are stable.

  • And some are not.

  • If I let this guy go, it's going to fall,

  • even if I put it briefly vertically.

  • If I take this object, I hide it behind my computer,

  • you still know it's here.

  • It hasn't disappeared.

  • So object permanence.

  • So those things we learn.

  • How do we learn the structure of the world?

  • And one hypothesis perhaps is that our brains

  • are prediction machines.

  • They learn to predict all the missing information

  • from whatever is available at this time.

  • And then time passes by.

  • Or you move your head, or whatever.

  • And new information becomes available.

  • And that allows you to train your world model

  • with the new information.

  • So if I want to learn that the world is three dimensional,

  • I'm going to learn it because it's

  • the best explanation for how the world changes

  • when I move my head.

  • My view of the world changes when

  • I move my head side to side.

  • And the best explanation for how it changes

  • is the notion of depth.

  • So necessarily, if my brain is trained to predict

  • what the world is going to look like when I move my head,

  • it's going to have to somehow represent the notion of depth.

  • Same way if I want to predict--

  • if I let this go and I stop the movie right there,

  • then I ask the machine, ask my brain

  • what's going to happen next?

  • It's going to predict this guy is going to fall--

  • he's going to fall down, of course, because of gravity.

  • So it just needs to wait for time

  • to pass by to train itself to see

  • if its prediction was correct.

  • So that would be predictive learning.

  • But predicting-- learning to predict

  • is not just predicting the future

  • from the present and the past.

  • It might also be predicting what the blind spot of the retina

  • contains without even looking.

  • So if you fixate on a particular place,

  • there is a particular spot in your visual field where you're

  • essentially blind because that's where

  • your optic nerve punctures

  • through your retina.

  • You don't see anything there.

  • But you don't realize it, because your brain

  • fills it up essentially.

  • So things like filling in the visual field

  • at the retinal blind spot, filling in occluded images,

  • missing segments in speech, predicting

  • the state of the world from partial textual description,

  • predicting the consequences of your action,

  • predicting sequences of action leading to a result.

  • I mean, all of those are fill in the blanks, if you want.

  • And common sense, I would surmise,

  • is the ability to fill in the blanks

  • through the construction of world models.

  • Object permanence is something babies learn around

  • the age of two or three months.

  • And which is why peekaboo is so funny for little babies,

  • because you can disappear when you hide your face.

  • So here's a baby orangutan here.

  • It's being shown a magic trick.

  • The guy puts an object in the cup.

  • And then he shakes the cup.

  • He takes the object out without showing the orangutan.

  • And then shows the inside of the cup.

  • And the cup is empty.

  • And the orangutan rolls on the floor laughing.

  • OK.

  • That obviously broke his world model, that objects--

  • there's object permanence.

  • Objects don't disappear like that.

  • And you know, one of three things

  • can happen when your world model is broken, you laugh.

  • It's really funny.

  • It's really interesting, you pay attention,

  • because your world model is wrong.

  • So you need to learn a new world model basically,

  • because of this new data that you predicted wrongly.

  • Or something really dangerous might

  • happen that you didn't predict.

  • And so you're scared.

  • So that's what happens when your world model is broken.

  • So I think-- how do we do this in a machine?

  • How do we get them to learn all those things about the world?

  • Learn gravity?

  • So if you show a baby-- these are special slides

  • I borrowed from Emmanuel Dupoux, who

  • is a cognitive scientist, a developmental cognitive

  • scientist, in Paris at École Normale Supérieure.

  • And if you do an experiment like this,

  • you take this little car here.

  • And you put it on this support.

  • And you push it.

  • And it goes off, and it doesn't fall.

  • Of course, it's held in the back.

  • But the baby doesn't see that.

  • Before six months, the baby says, yeah, sure.

  • That's the way the world works.

  • Fine.

  • No problem.

  • After eight months, they go like this.

  • You know, they open their eyes.

  • And they fixate.

  • And they say, what's going on?

  • And they don't say, what's going on,

  • obviously because they can't talk.

  • But you know, they look like they're saying,

  • what's going on.

  • And so with this kind of technique,

  • by basically measuring how long, you know, babies

  • fixate and observe and open their eyes like crazy,

  • you can figure out at what stage babies learn things.

  • And again, this is from Emmanuel Dupoux.

  • So things like object permanence you learn pretty quickly.

  • Biological motion, the fact that there

  • are objects that move by themselves,

  • others that are inanimate.

  • You know, you learn that by three months.

  • Objects that are rigid or not.

  • Different types of natural categories: chairs, tables, cars,

  • etc.

  • Stability and support.

  • And sort of basic intuitive physics, gravity, inertia,

  • conservation of momentum.

  • That arrives around 8 months, roughly

  • between six and eight months.

  • And there's a bunch of other things like that

  • that happen at various stages.

  • And this is not learned in supervised mode.

  • It's not like, babies are told the name of objects.

  • It's not like they are directed in any way for any of this.

  • They basically learn this by observation.

  • They're really not well-developed in sort

  • of motor control either.

  • So they don't get to do a huge amount of interaction

  • with the world.

  • So there's no way this can be learned through interaction,

  • by some sort of direct reinforcement learning.

  • There's other mechanism going on there

  • where you learn how the world works by observation.

  • And that's the piece we're missing in our current machine

  • learning and AI systems.

  • So in fact, I need to apologize in advance to Michael.

  • But he knows what I'm going to show, so--

  • There are three sorts of paradigms of learning, right.

  • There is reinforcement learning,

  • where basically the machine at each trial

  • is given a scalar value to tell it whether it did well enough

  • or not.

  • So it's like a grade. For games,

  • the machine does an action.

  • And it either gets a reward or not.

  • Or sometimes it has to make a whole sequence of action

  • before it gets a reward.

  • And it works great when it's combined with deep learning.

  • The problem is that it requires a huge amount

  • of training samples, an enormous amount of training samples.

  • It's because the amount of information

  • you give to the machine is extremely small at every trial.

  • It's very weak.

  • It's a small amount of information.

  • Therefore, you need to do this many, many times for it

  • to learn anything complicated.

  • Supervised learning, you need a little less samples,

  • because you give more information every time.

  • You give it the correct answer.

  • And so if there are a dozen categories,

  • that's more than just a single scalar value.

  • So you need fewer samples to learn similarly complex tasks.

  • And then in predictive learning or unsupervised learning,

  • you ask the machine to predict basically every future

  • variable from every present variable

  • or past variable, or every unseen variable from every seen

  • variable.

  • And so there is a lot more information

  • you ask the machine to predict.

  • And that's why probably you can learn

  • a lot more about the structure of the world this way.
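
A hedged sketch of predictive (unsupervised) learning as fill-in-the-blanks: hide part of each observation and train the machine to predict the hidden part from the visible part, with no human labels and many target variables per sample. Everything here is an illustrative stand-in.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(50, 128), nn.ReLU(), nn.Linear(128, 50))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(100):
    x = torch.randn(32, 100)                 # a batch of observations
    visible, hidden = x[:, :50], x[:, 50:]   # mask out half of each one
    prediction = model(visible)              # predict the unseen from the seen
    loss = ((prediction - hidden) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```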

  • So that led me to this completely obnoxious slide,

  • which I have to show in every slide-- in every talk now.

  • The analogy between intelligence and chocolate cake,

  • where the [INAUDIBLE] of the cake

  • is basically unsupervised or predictive learning,

  • because that's where the bulk of the information goes.

  • The bulk of the information given to the machine

  • is really in that mode of learning.

  • And then the icing on the cake is supervised learning.

  • There is considerably less information

  • provided to the machine per trial in supervised mode.

  • And in reinforcement mode there is very little information

  • given to the machine.

  • So that's going to be equivalent to the cherry on the cake.

  • And I've been showing this-- the first time

  • I showed this slide was actually giving a talk at Deepmind,

  • where Deepmind is actually the temple of reinforcement

  • learning.

  • So it was sort of obnoxious on purpose, a little bit.

  • But now I kind of fell into that obsession

  • of showing it in every talk.

  • So the problem with reinforcement learning,

  • with pure reinforcement learning, and Michael

  • will correct me if I'm wrong, is that if you use it

  • in its purest form, you need so many trials to learn

  • any kind of complex behavior that if you were

  • to train a self-driving car to drive, and to learn to not

  • run off a cliff, it would have to run off a cliff

  • about 50,000 times before it figures out it's a bad idea.

  • And then another 50,000 times before it

  • figures out how not to run off a cliff.

  • And you know, it's half a joke, which is why--

  • I mean, that's the reason why it works really well for games,

  • because you can run games very quickly

  • on many computers at the same time

  • and at many thousands of frames per second.

  • But it doesn't really work in the real world,

  • because you cannot run the real world faster than real time.

  • That's a thing that sucks about the world.

  • And then anything you do real world can kill you,

  • like running off cliffs.

  • Maybe it's a good thing that we can't run the real world faster

  • than real time.

  • So perhaps what we need is build models of the world

  • that we can run faster than real time,

  • and that we can run without the risk of killing ourselves.

  • And that would be predictive models.

  • If we ever were to predict before we run off

  • a cliff that we're going to run off a cliff,

  • we would not run off a cliff.

  • And perhaps, that's the way we learn to drive.

  • We know not to get off the road, because we

  • know bad things will happen if that's the case.

  • Reinforcement learning works really well for games.

  • And there was a smashing demonstration

  • of how well this works for Atari games

  • and Go and Doom, and not yet StarCraft, which is

  • very much work in progress at FAIR and Deepmind

  • and various other places.

  • It's very complicated.

  • But you know, it works really well.

  • And the latest AlphaGo Zero is pretty amazing in that way.

  • But again, it's a particularly simple situation

  • where the number of actions is discrete,

  • the world is completely observable,

  • and the reward is fairly clear.

  • And you can run the environment, which is a go board,

  • at tens of thousands of frames per second essentially.

  • It works pretty well, even for games like Doom.

  • So this is a Doom competition that was

  • won by the team from Facebook.

  • And actually teams with Facebook people won two years in a row,

  • in '16 and '17 using basically deep reinforcement

  • learning techniques.

  • So we work on reinforcement learning at Facebook.

  • It's not--

  • The cake I showed--

  • I showed the cake, but you have to notice that this is

  • a black forest chocolate cake.

  • And the cherry is not optional on this cake.

  • In fact, it's got little bits of cherries

  • all around here inside.

  • [LAUGHTER]

  • OK as I said, we also work on StarCraft.

  • So StarCraft is an extremely challenging situation,

  • because there is multiple time scales.

  • There are continuous actions.

  • It's not fully observable.

  • You can't tell what your opponent is doing unless you

  • send scouts to look at it.

  • So it's very complicated in that sense.

  • We've done a little bit of reinforcement learning

  • for sort of local micro-management of tactics.

  • It's actually an open source platform called ELF, with MiniRTS,

  • from Facebook, that is basically a StarCraft-like real time

  • strategy game.

  • But here is a suggestion.

  • So I said we need our machines to be able to learn

  • predictive models of the world.

  • And this idea is very old.

  • It goes back a very long time.

  • But in particular, to one of Rich Sutton's papers

  • where he was proposing what he called the Dyna architecture.

  • And he said the main idea of Dyna

  • is the old common sense idea that planning is trying things

  • in your head using an internal model of the world.

  • And this suggests the existence of a more primitive process

  • for trying things not in your head,

  • but through direct interaction with the world.

  • So he said here, reinforcement learning

  • is the name we use for this more primitive and direct kind

  • of trying.

  • And Dyna is the extension of reinforcement

  • learning to include a [INAUDIBLE] world model.

  • In fact, this distinction

  • doesn't exist today.

  • All of this is called reinforcement learning.

  • It's just that the version that has a model

  • is called model based reinforcement learning.

  • And the other one is called model free reinforcement

  • learning.

  • But it's basically the same, the same thing.

  • And this idea that you should have a world model which

  • in optimal control is called a plant simulator,

  • but it's the same thing, or a plant model.

  • But this idea that you use a predictive world model

  • to be able to reason about what to do, what action to take,

  • is really an old idea in the context of optimal control.

  • So a typical situation in optimal

  • control, and you can look at classical textbooks going back

  • to the 60s, is you have a model of the world that gives you

  • the state of the world at time t plus 1

  • as a function of the state at time t

  • and the action you take.

  • And then the state of the world is

  • sent to an objective function that

  • measures how good or bad the state of the world is.

  • And so you can run this model of the world.

  • And through backprop through time and gradient descent

  • figure out a sequence of commands

  • that will optimize this objective function over time.

  • And if your world simulator is differentiable,

  • you can do this through backprop and gradient descent.

  • If it's not, you have to do things like dynamic

  • programming or something like this.
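
As a hedged illustration of that optimal-control setup: a minimal PyTorch sketch that rolls a differentiable world model forward and does gradient descent on the action sequence. Here world_model and cost are assumed stand-ins for a learned plant model and an objective function, not any particular library API:

    import torch

    def plan_actions(world_model, cost, s0, action_dim, horizon=20, iters=50, lr=0.1):
        # world_model(state, action) -> next state; cost(state) -> scalar; both differentiable
        actions = torch.zeros(horizon, action_dim, requires_grad=True)
        opt = torch.optim.SGD([actions], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            s, total = s0, torch.zeros(())
            for t in range(horizon):
                s = world_model(s, actions[t])   # roll the plant model forward in time
                total = total + cost(s)          # accumulate the objective over the trajectory
            total.backward()                     # backprop through time, through the model
            opt.step()                           # gradient descent on the command sequence
        return actions.detach()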

  • So the main problem we're going to have is,

  • how do we learn this world model?

  • How do we learn a model that will allow our machine

  • to predict what the state of the world at time t plus 1

  • is going to be as a function of the state

  • at time t and our action, and perhaps actions

  • of others in the environment.

  • That's the problem of predictive or unsupervised learning.

  • And that led me to state that--

  • oops.

  • I'm not sure how that happened.

  • Apologies.

  • Wow, it went forward by like 10 slides.

  • So that led me to this statement: the next revolution in AI

  • will not be supervised.

  • I stole the concept of this slide

  • from Alyosha Efros at Berkeley.

  • And so we have to think about what

  • would be the architecture of a real intelligent system, a sort

  • of autonomous intelligence system.

  • So it would be something like this, an agent that

  • produces actions on the world.

  • And the world responds with percepts.

  • And of course, the world might be--

  • the world might not care about your action at all.

  • Or it might care only vaguely.

  • What the agent is trying to do, the agent

  • has an internal state which is sent to an objective function.

  • And the objective function produces

  • a value that basically tells the agent whether it's happy

  • or not.

  • So the objective function is a measure

  • of unhappiness of that agent.

  • You get a small value if you're happy, a large value if you

  • are unhappy.

  • So what the agent is trying to do

  • is bring the world into a state that will bring itself

  • into a mental state that basically this red function

  • identifies as happy.

  • And there are models of how animal brains are built that are

  • basically this way, where this is your entire brain,

  • except the basal ganglia.

  • And that's the basal ganglia.

  • So basal ganglia is the thing at the bottom of your brain

  • that basically determines your level of happiness or comfort

  • or discomfort or pain or things like that.

  • So inside of this agent, if we believe

  • the argument that I made previously, the system should

  • have some sort of world simulator

  • that allows it to predict what the state of the world is

  • going to be as a consequence of a sequence of actions.

  • And then two other modules.

  • These are sort of standard nomenclature in RL.

  • An actor that produces action proposals that can be

  • kind of simulated in the world.

  • And then a critic whose role is to predict

  • the long term expected value of this objective.

  • So this guy basically computes emotions.

  • So if this guy predicts that your objective function is

  • going to rise up, make you very unhappy or in pain,

  • that creates fear, essentially.

  • You don't want to get anywhere near that state.

  • And this guy predicts what happens.

  • So this guy predicts this.

  • This guy doesn't quite predict that.

  • But this guy actually predicts that as well.

  • And so now the problem becomes, how do we

  • train this world simulator?

  • Because the rest, we kind of know how to do it more or less.

  • We don't know how to build this.

  • But if we knew, we could do something like this.

  • Get the state of the world through your perception module,

  • initialize your world simulator,

  • propose a sequence of actions, and then

  • refine the sequence of actions so as

  • to minimize the expected cost computed by the critic.

  • And then train the actor to produce this optimal sequence

  • of actions.

  • And then take the first action.

  • And then kind of shift everything by one time step.
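
Put together, the loop he describes looks roughly like the following sketch. Everything named here (perceive, refine_plan, actor, env) is a hypothetical placeholder; refine_plan would be something like the gradient-based planner sketched earlier, with the critic supplying the expected long-term cost:

    def control_loop(env, perceive, refine_plan, actor, horizon=10, steps=1000):
        obs = env.observe()
        for _ in range(steps):
            s = perceive(obs)                  # estimate the state of the world
            plan = actor.propose(s, horizon)   # actor proposes a sequence of actions
            plan = refine_plan(s, plan)        # refine it to minimize the critic's expected cost
            actor.train_on(s, plan)            # train the actor to output the refined plan
            obs = env.step(plan[0])            # execute only the first action...
            # ...then shift the whole horizon by one time step and re-plan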

  • So how do we learn forward models of the world?

  • This is an experiment that was done at Facebook

  • a couple of years ago by Adam Lerer, Sam Gross, and Rob Fergus

  • where they put a stack of cubes, this is in a simulator.

  • This isn't the real world.

  • And then they observe what actually occurs.

  • And then they train a convolutional net

  • to actually predict what's going to happen by kind of learning

  • the mask of the objects.

  • And what you get is a pretty accurate prediction

  • that this tower is going to fall this way.

  • But fairly fuzzy predictions for, like, tall towers,

  • where it's kind of ambiguous where things are going to fall.

  • So you get those kind of fuzzy predictions here.

  • Because you can't exactly predict where

  • things are going to fall.

  • So how do we solve that problem?

  • I'm going to skip this.

  • So this is why predictive models are

  • good for question answering systems and natural language

  • processing.

  • But I'm going to skip this in the interest of time.

  • So, here's the problem we have to deal with.

  • Those towers can fall in a number of different directions

  • that we can't really predict just from the look of it

  • which direction they're going to fall into.

  • So it's kind of--

  • I don't know if we can find a pen here

  • or any kind of vertical thing.

  • I'm going to do it with a piece of paper.

  • So if I put this piece of paper here on the table,

  • and I let it go, you can be pretty sure it's going to fall.

  • But you can't really tell probably which direction

  • it's going to fall.

  • Every time I do it, it's probably

  • going to fall in a different direction.

  • So you can't really use supervised

  • learning to train something like this.

  • Because if I give the initial segment,

  • and then I ask the machine to predict, the machine predicts that.

  • If that happens, that's fine.

  • If this happens, then the machine

  • now has to predict this.

  • But now the next time over, it's going to predict that.

  • And so the best thing the machine can predict

  • is kind of an average of the outcomes, which

  • is not a good answer.

  • And so, something like this, where let's say you

  • observe two variables which have a dependency between them.

  • And this is pretty elementary for anybody who

  • works on probabilistic models.

  • But let's say these are the data points you observe.

  • Your world consists of two variables.

  • And these are your observations.

  • If I give you a particular value of Y2,

  • you can infer basically two values for Y1.

  • But if you try to learn this with an L2, least-squares criterion,

  • you're going to predict something right in the middle,

  • which is not a good answer.

  • So you have to somehow be

  • able to predict one or the other,

  • but not an average of the two.

  • Or predict a distribution.

  • But how do you represent distributions

  • in high dimensional spaces?

  • So the unsupervised learning problem

  • is how do you capture the dependency between things

  • like this?
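
A tiny synthetic demonstration of that failure mode, using the two-branched Y1/Y2 dependency he just described: fit a least-squares regressor and it lands right between the branches. Everything below is made-up toy data:

    import numpy as np

    rng = np.random.default_rng(0)
    y2 = rng.uniform(-0.9, 0.9, 2000)
    # for each y2 there are two valid answers: the upper or lower branch of a circle
    sign = np.where(rng.random(2000) < 0.5, 1.0, -1.0)
    y1 = sign * np.sqrt(1.0 - y2**2)

    coeffs = np.polyfit(y2, y1, deg=5)     # least-squares polynomial fit
    print(np.polyval(coeffs, 0.0))         # ~0.0: the average of the two branches,
                                           # a point where there is no data at all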

  • And one possible way is to learn a contrast function.

  • So basically, think of it as an energy function,

  • or negative log probability if you are a probabilist.

  • And these are your data points.

  • And you want those to have low energy, which

  • means high probability.

  • And you want everything else to have higher energy, or lower

  • probability.

  • So the blue points are the data that you observe.

  • The green points are not data.

  • And you want the energy of the green points

  • to be higher than the energy of the blue points.

  • So if you have a parametrised function that

  • computes this function in the space of Ys,

  • it's easy enough to tweak its parameters

  • so that when you see a blue point,

  • you make the output go down.

  • But how do you make sure that the value of your function

  • is higher outside of those regions?

  • How do you generate those green points?

  • There are basically

  • seven or eight different methods for doing this.

  • But I'm only going to talk about a couple.

  • And the first one is adversarial training.

  • So adversarial-- the basic idea of adversarial training

  • is basically the scenario I was talking about.

  • You have a predictor here.

  • And this predictor looks at the past,

  • let's say, if you want to do video prediction.

  • So it looks at the past.

  • And it has access to a source of random vectors

  • and is going to produce a prediction.

  • The precise prediction is going to depend

  • on the value of this vector.

  • And as the value of this vector changes,

  • this prediction goes through a set of plausible outputs,

  • let's say, represented by this red ribbon here.

  • So let's say we asked the machine.

  • We show the machine a small segment of video.

  • And we ask it, what is the world going

  • to look like half a second from now?

  • And the machine predicts this.

  • It predicts that pen is going to fall to the back and the left.

  • And in fact, we let time pass by.

  • And what happens is this.

  • The pen falls to the back and slightly to the right.

  • So we don't want to punish the machine

  • for making the wrong decision here, because it's

  • qualitatively correct.

  • So what we'd like is we'd like an objective function that

  • tells us low cost if you are on this red ribbon, high cost

  • if you are outside.

  • And that's exactly what I was talking about earlier.

  • You want a function like this one

  • that tells you low cost if it's something

  • that looks reasonable.

  • High cost if it's not.

  • So the thing is, we don't know how to

  • characterize these functions.

  • So we're going to have to learn it.

  • So adversarial training is you have two functions

  • you learn, one that predicts and one that tells the system

  • whether the predictions are good or not.

  • And basically it works like this.

  • So you have an initial segment of a video.

  • For example, if you do video prediction,

  • the data tells you here is how the video ends.

  • And you train this contrast function,

  • called the discriminator, or sometimes the critic, actually,

  • to produce a low output for things that actually occur

  • in the world.

  • So those are the two blue points.

  • So we'll make the function take a low value for things

  • that actually occur.

  • And then you feed this past to the generator.

  • You have it generate a prediction,

  • which initially sucks.

  • And so you feed it to the discriminator.

  • And you tell the discriminator to produce a large output

  • for that prediction.

  • So these are the green points.

  • Make that large.

  • And so next time around, the value the discriminator

  • will produce for those predictions

  • is going to be higher.

  • But here is what you do simultaneously.

  • Simultaneously, you backpropagate gradients

  • through the discriminator to train the generator

  • to produce Ys that make the discriminator produce

  • low outputs.

  • OK.

  • So basically, the generator gets information

  • about how to change its parameters so as

  • to change its output so that the green points get closer

  • to the blue points, essentially, to a region

  • that the discriminator gives low energy to.

  • So eventually it looks like this, where the green points

  • match the blue points more or less in distribution

  • if you're lucky, because those things are kind of finicky.

  • And it works.
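
A minimal sketch of that simultaneous update in PyTorch, using the standard binary cross-entropy form of the GAN objective. G, D, and the optimizers are assumed to be defined elsewhere (and D is assumed to output one logit per sample); this illustrates the idea, not the exact setup of the work he describes:

    import torch
    import torch.nn.functional as F

    def gan_step(G, D, opt_g, opt_d, real, z_dim=64):
        n = real.size(0)
        fake = G(torch.randn(n, z_dim))
        # discriminator step: push real samples down in "energy", fakes up
        d_loss = (F.binary_cross_entropy_with_logits(D(real), torch.ones(n, 1))
                  + F.binary_cross_entropy_with_logits(D(fake.detach()), torch.zeros(n, 1)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # generator step: backpropagate THROUGH the discriminator so the
        # generated points move toward the region the discriminator likes
        g_loss = F.binary_cross_entropy_with_logits(D(fake), torch.ones(n, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()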

  • So you can train those things with past frames.

  • Or you can just train it on images to just generate images

  • from random vectors.

  • So this thing has access to a source of random vectors.

  • If you train this thing on images of bedrooms, you get--

  • those are non-existent, generated bedrooms.

  • And they all look kind of reasonable,

  • except maybe for this guy.

  • It looks like an Austin Powers kind of bedroom, or whatever.

  • But you know, they all have a bed and windows and dressers

  • and lights, and stuff like that.

  • And those are basically a bunch of random numbers coming

  • into a convolutional net that has been trained

  • to produce bedroom images.

  • And they don't look like anything in a training set.

  • They're different from any training set image.

  • So there are various versions of those GANs.

  • There's a whole menagerie of different types of GANs

  • nowadays.

  • There are CycleGANs and InfoGANs and WGANs and IWGANs,

  • and an infinite number of GANs.

  • There is another family of generative models,

  • this type, called variational autoencoders.

  • This is what you get when trained on ImageNet.

  • So this is something called an Energy-Based GAN trained

  • on ImageNet.

  • And it doesn't actually produce objects.

  • But it outputs things that from far away kind

  • of look like objects, [INAUDIBLE] abstract.

  • This is trained on dogs.

  • It's kind of funny.

  • I mean, people do much better than this now.

  • But it's still funny.

  • OK.

  • So here is an example of video prediction.

  • So here it's a convolutional net that looks at 4 frames

  • and predicts two frames, two future frames.

  • And it looks at the images at multiple scales.

  • And there's all kinds-- and it's a pretty complicated

  • architecture.

  • And this is the prediction you get

  • if you train with least squares.

  • So you train this video predictor with least squares.

  • You get blurry predictions.

  • If you train it with this adversarial training

  • criterion combined with some others,

  • you get this kind of prediction, considerably sharper.

  • So the first four frames are observed.

  • The last two frames, indicated in red

  • here, are predicted.

  • And so you get--

  • the motions basically continue.

  • And they seem fairly reasonable.

  • There's a little bit of blurriness.

  • But it's not too bad.

  • This is when trained on video segments

  • from apartments in New York.

  • So the camera rotates.

  • And the system has to basically invent

  • what the room looks like as the camera rotates.

  • So here is a bookcase.

  • And this part of the bookcase-- so this is observed.

  • Now it's predicted.

  • This part of the bookcase is invented.

  • So it figures out that a bookcase has to continue.

  • It figures out that a couch has to continue.

  • So it captures some regularity of what an apartment in New

  • York is supposed to look like.

  • Something that maybe is more interesting for people

  • interested in self-driving cars.

  • This is a dataset called Cityscapes.

  • And-- oops.

  • And this is a system where you take a video sequence,

  • and you run a semantic segmentation system

  • on the video sequence.

  • So what you get is a bunch of maps

  • which give you, for every pixel,

  • the label of its category.

  • So much like this, blue is car.

  • Sidewalk is pink.

  • And pedestrian is red.

  • And things like that.

  • And what this thing predicts is that-- so it

  • predicts in this case here half a second in the future.

  • It predicts that pedestrians keep crossing the street.

  • The car that is turning left keeps turning left.

  • The scenery keeps moving.

  • So if you want to work on self-driving cars, it's useful

  • to have the ability to predict what's going to happen ahead

  • before it happens.

  • It might allow you to use this to train

  • for example, a reinforcement learning system

  • without actually crashing, but just by predicting

  • a crash.

  • Here's a new model, a more recent one,

  • just submitted actually, called the Error Encoding Network.

  • So this one-- in fact, the one that actually works

  • is slightly different from this one.

  • But this is a simpler version to explain.

  • So this one basically trains a model.

  • So it looks at the past.

  • It runs through a few layers of a neural net.

  • It produces an internal state.

  • And ignore the top for the time being.

  • Then runs through a generator essentially,

  • another part of a neural net that produces a prediction,

  • say a video, another frame in the video.

  • And you train this using least square,

  • or something like this with what is actually observed.

  • And then you play a trick.

  • What you do is you take the difference between those two.

  • So this is a vector, the vector of the difference

  • between those two, the target and the prediction.

  • You feed this to a parametrised trainable function.

  • And then you feed the output of that function

  • to the hidden layer.

  • You add it to the hidden layer.

  • And you train this guy so that this variable

  • is going to take a value that minimizes the prediction error.

  • But this variable only depends on the prediction error.

  • And so basically, this part of the network,

  • when this value is set to zero, predicts

  • whatever is predictable.

  • And this guy basically parametrises

  • whatever is not predictable, which is the residual error,

  • and figures out how to represent the hidden latent variable that

  • will actually correct that mistake.

  • So that might represent the--

  • for example, you observe someone playing a game

  • and moving something on the screen.

  • The physics of how things move on the screen

  • is essentially predictable.

  • That's Newtonian physics.

  • But the action that the player uses maybe isn't.

  • And so that would essentially represent the action

  • that the player played.

  • That would be very useful for things like imitation learning,

  • for example.
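
Here is a hedged, minimal sketch of that error-encoding idea, not the exact architecture of the paper: a deterministic network predicts, the residual error is encoded into a low-dimensional latent z, and z is added back into the hidden layer to correct the prediction. All layer sizes and names are illustrative:

    import torch
    import torch.nn as nn

    class ErrorEncodingSketch(nn.Module):
        def __init__(self, x_dim, h_dim, y_dim, z_dim):
            super().__init__()
            self.encode = nn.Linear(x_dim, h_dim)   # past -> hidden state
            self.decode = nn.Linear(h_dim, y_dim)   # hidden state -> prediction
            self.phi = nn.Linear(y_dim, z_dim)      # residual error -> latent z
            self.inject = nn.Linear(z_dim, h_dim)   # latent z -> correction of the hidden state

        def forward(self, x, y_target=None):
            h = torch.relu(self.encode(x))
            y0 = self.decode(h)                     # deterministic prediction (z = 0):
                                                    # whatever is predictable
            if y_target is None:
                return y0
            z = self.phi(y_target - y0)             # encode whatever was NOT predictable
            y1 = self.decode(h + self.inject(z))    # corrected prediction
            return y0, y1, z

    # training would minimize ||y_target - y0||^2 + ||y_target - y1||^2;
    # at test time, sampling different z values yields different plausible futures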

  • Here's an example of how this can be used.

  • And I'm probably going to end here.

  • So you have to wait a little bit.

  • So this is a dataset that was produced

  • by Sergey Levine, [INAUDIBLE] and a few other people

  • at Berkeley.

  • So there is an object.

  • There is a robot arm.

  • And the robot randomly pokes the object.

  • So the result is that after being poked,

  • the object has moved a little bit.

  • And these are predictions for how the object could

  • have been moved by the thing.

  • This is pure pixel prediction, pixel space prediction.

  • So the system has no notion of object or anything.

  • These are the predictions it makes.

  • And each different prediction is generated by a different sample

  • of the Z variable, the latent variable, or the action

  • variable.

  • You can think of this as basically an encoding of what

  • the robot arm did without actually

  • having to observe what it did.

  • So it's action inference if you want.

  • OK.

  • I've spoken for long enough, so I'm

  • going to stop here and take your questions.

  • Thank you very much.

  • [APPLAUSE]

  • AUDIENCE: Hey.

  • [INAUDIBLE]

  • Real quick question.

  • So can you break-- so, let's just think about images.

  • Are you trying to-- or we use essentially biology and things

  • we know about the world to segment the image.

  • What if you took a camera and did

  • a combinatorial scramble, which is a huge potential scramble.

  • Does it break everything?

  • YANN LECUN: It scrambles the pixels?

  • AUDIENCE: It scrambles the pixels.

  • YANN LECUN: Yeah.

  • AUDIENCE: You know, it's combinatorially huge.

  • YANN LECUN: Yeah, that's right.

  • So if you do a fixed scramble and you

  • use a convolutional net, the convolutional net

  • will have a hard time figuring out the thing,

  • because it's based on the idea that neighboring pixels are

  • correlated.

  • And a local patch of pixels can be represented efficiently

  • by just those features.

  • So it probably would have a very hard time.

  • Now it turns out there's a paper by Pascal [INAUDIBLE]

  • on [INAUDIBLE] from way back where

  • they show that if you just-- if you take a collection of images

  • that you've perturbed through the fixed

  • permutation of the pixels, you can actually

  • recover the topology by figuring out

  • the local correlations between pixels.

  • So in principle, it would be possible to make this work

  • if you hardwired this.
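
A hedged sketch of that recovery trick: with enough scrambled images, the most-correlated pixel pairs are almost certainly spatial neighbors, which is enough to reconstruct the grid topology. The function below is illustrative, not the method of the paper he cites:

    import numpy as np

    def likely_neighbors(images, k=4):
        # images: array of shape (n_samples, n_pixels), pixel order scrambled but fixed
        corr = np.corrcoef(images, rowvar=False)   # pixel-by-pixel correlation matrix
        np.fill_diagonal(corr, -np.inf)            # a pixel is not its own neighbor
        # for each pixel, the k most correlated pixels are its probable grid neighbors
        return np.argsort(-corr, axis=1)[:, :k]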

  • AUDIENCE: Thank you for giving a talk today.

  • I'm a big fan of yours, actually.

  • [INAUDIBLE] talk to me.

  • And recently, the D-Wave Systems quantum computer

  • has actually been deployed in practice.

  • And how would you envision quantum computing

  • affecting deep neural networks in general?

  • YANN LECUN: Yeah, it's--

  • if you didn't hear the question, it's

  • about whether quantum computing will affect deep learning

  • in some way.

  • It's not entirely clear to me.

  • So D-Wave is not actually deployed in practice.

  • It's experimented with by people.

  • And there are a few attempts.

  • But it's not actually used in practice

  • for commercial deployment, if that's the question.

  • So the D-Wave System is not a full quantum computer

  • in the sense that it uses quantum tunneling

  • for more efficient function optimization.

  • It's not entirely clear that you need this

  • at all for any of the tasks that I talked about.

  • So I think it's still up in the air

  • whether quantum computing will have any effect.

  • It's possible you could do nearest neighbor much

  • faster with quantum computing.

  • It's not even clear to me that you can, but it's possible.

  • So, it's unclear.

  • AUDIENCE: So I actually have two questions.

  • The first question is that [INAUDIBLE]

  • if the data set is very small, like in the area

  • of [INAUDIBLE], but only [INAUDIBLE] maybe X-ray imaging

  • or even less.

  • [INAUDIBLE] So I read something about zero-shot,

  • one-shot, and two-shot [INAUDIBLE].

  • So what do you think of [INAUDIBLE].

  • And the second question is are any of the AI [INAUDIBLE]

  • developed by Facebook or developed [INAUDIBLE],,

  • [INAUDIBLE].

  • YANN LECUN: All right.

  • Yeah, OK.

  • Let me answer first question first.

  • So, the small-data regime.

  • There's basically currently two ways to handle it.

  • One is transfer learning.

  • So for example, you want to do image recognition.

  • And you want to do, I don't know,

  • medical imaging or something like this.

  • And you don't have enough data.

  • So one approach is you train your neural net

  • on a big data set that you actually

  • have, either with the same type of images,

  • or even complete different types of images, as long

  • as the statistics are similar, like ImageNet for example.

  • You know, it's not the same type of image.

  • But it's OK.

  • [INAUDIBLE]

  • And then you can do transfer learning.

  • So you take that pre-trained machine.

  • And then you retrain this machine on your data.

  • It helps if you just retrain the top two or three

  • layers, to limit the number of parameters.

  • That works really well.
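
In PyTorch, that recipe is just a few lines. A hedged sketch; the two-class medical task and the hyperparameters are made up for illustration:

    import torch
    import torchvision

    model = torchvision.models.resnet18(pretrained=True)    # backbone pre-trained on ImageNet
    for p in model.parameters():
        p.requires_grad = False                              # freeze the pre-trained features
    model.fc = torch.nn.Linear(model.fc.in_features, 2)      # fresh top layer, e.g. cancer vs. not
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    # then run an ordinary training loop on the small labeled dataset,
    # updating only the new top layer(s)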

  • So there is actually a service within Facebook

  • that uses this for the product division within Facebook.

  • So to give you an idea, there's 2.1 billion users on Facebook.

  • And the users upload on the order of 1.5 billion photos

  • every day.

  • So there's 1.5 billion a day.

  • Every single one of those photos goes

  • through four convolutional nets that we know about.

  • It goes through way more.

  • But there are these four pre-trained convolutional nets.

  • So one that basically recognizes tags

  • of various types on the image.

  • So recognizes objects.

  • It recognizes the type of images.

  • Is this a birthday or a wedding or landscape or indoor scene

  • or a [? macrophoto ?] or whatever.

  • There's a second one that--

  • and this is used for feed ranking basically,

  • to decide whether to show particular images

  • to particular people who have particular interests.

  • The second one filters objectionable content.

  • So basically, violence, pornography, things like that.

  • The third one generates captions for images,

  • for the visually impaired.

  • So that if you're blind and you're on Facebook,

  • you can get an idea of what's in the picture

  • by getting this text description.

  • And then the last one, which is turned on in the US,

  • but not in many other countries,

  • and not turned on in Europe, does face detection.

  • So it tags your friends automatically.

  • So that was for the first question.

  • Now there's a second answer to the first question.

  • And the second answer to the first question

  • is you can use unsupervised training or pre-training.

  • So basically, you don't just train the system

  • to classify your medical images into cancer or non-cancer.

  • You also train it to reconstruct its input.

  • And that has a regularization effect.

  • So there are situations, certain types of architectures,

  • things called ladder networks, or stacked autoencoders, or U-Net,

  • where this type of learning actually

  • helps supervised learning and reduces

  • the need for labeled data.

  • OK.

  • So that was-- ultimately, I think

  • that unsupervised learning is going

  • to solve all of these problems.

  • Now your second question was about those bots

  • that there was a big story in the press a few months ago

  • that said that researchers at Facebook

  • had created two bots that were supposed to talk

  • to each other in English.

  • And they're supposed to cooperate to solve a task.

  • It's kind of a reinforcement learning type of task.

  • And they ended up using English language

  • in ways that were not really initially predicted.

  • They would use a funny way to use words to express--

  • to communicate with each other.

  • And so some of the newspapers right after almost

  • said AI is going to kill us all.

  • Some tabloid published an article saying, oh my god,

  • Facebook researchers had this project where two bots invented

  • their own language.

  • And they had to like unplug the computer in panic mode,

  • because they were going to take over the world or something.

  • And it's completely insane, because there

  • was a blog post about it and a paper that was published.

  • And it's basically, these people are

  • interested in natural language understanding.

  • And they trained those systems to use English.

  • And they ended up not using English

  • in a way you would normally use it.

  • So they said, the experiment failed.

  • Let's try something else.

  • It's not like the Hollywood sci-fi movie

  • where you see these guys grabbing the electrical cables,

  • and there's sparks flying and all that stuff, right?

  • Nothing like that.

  • But it's really funny how--

  • funny in a way, kind of depressing a little bit,

  • how some of the press describes those things.

  • There were a lot of articles in more serious press afterward

  • that said that's complete bunk, which is good.

  • AUDIENCE: Thank you.

  • AUDIENCE: Hi.

  • I have a comment here.

  • I have a comment and a question.

  • First comment is that earlier you said

  • there are many systems that have many more parameters

  • than the number of pixels

  • or whatever you're talking--

  • YANN LECUN: Samples.

  • AUDIENCE: Samples.

  • [INAUDIBLE]

  • I think from a statistics point of view,

  • it's the central limit theorem doing its job.

  • That's my comment.

  • YANN LECUN: Which theory?

  • AUDIENCE: Central limit.

  • YANN LECUN: Oh, central limit theorem.

  • AUDIENCE: [INAUDIBLE] I think.

  • But, OK.

  • My second question is actually related to this.

  • Are there-- all your examples kind of work.

  • Are there any theoretical scientists,

  • computer scientists, working on the foundations

  • of these kinds of things?

  • What makes it converge, and what doesn't?

  • YANN LECUN: Yeah.

  • I mean, there's a lot of different types of people

  • working on those questions, some of them

  • are computer scientists, but many of whom

  • are either physicists or mathematicians.

  • So I've been--

  • I've been involved in an effort for many years

  • to try to get the applied math and pure math community

  • interested in those questions.

  • And I've only been successful in the last year or two.

  • Same for the physicists.

  • So basically, there are results in random matrix theory that

  • can be applied to the understanding

  • of the landscape of objective functions of those networks.

  • And it would seem to demonstrate,

  • to show that the number of saddle

  • points in those loss functions is combinatorially large.

  • But on the other hand, that there are--

  • although there might be a lot of local minima,

  • they're all pretty much of the same energy level.

  • So it doesn't matter which one you find.

  • And then there is empirical evidence to the fact

  • that the local minima are extremely degenerate.

  • So if you move in a large number of dimensions

  • around those local minima, the objective function

  • is essentially flat.

  • And there's a small number of directions where it's not flat.

  • That depends on the complexity of the problem.

  • And there's also empirical evidence

  • that [INAUDIBLE] showed in a paper, which is

  • that if you take two solutions.

  • So you start from two random initial conditions.

  • You train your neural net.

  • You get two different solutions.

  • Then you go in a straight line between the two.

  • And you barely go up.

  • And if you bend the path just a little bit,

  • then you can go from one minimum to the other without going up.

  • So that tends to show that there's basically

  • only one minimum.

  • It's very degenerate.

  • And it's connected everywhere.
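
That interpolation experiment is easy to sketch: take the flattened weights of two independently trained copies of the same network and evaluate the loss along the straight line between them. A hedged sketch; the model, data, and loss function are placeholders:

    import torch

    def loss_along_line(model, w_a, w_b, loss_fn, x, y, steps=11):
        # w_a, w_b: flat parameter vectors of two solutions found from
        # different random initializations
        losses = []
        for t in torch.linspace(0.0, 1.0, steps):
            w = (1 - t) * w_a + t * w_b                         # a point on the straight path
            torch.nn.utils.vector_to_parameters(w, model.parameters())
            with torch.no_grad():
                losses.append(loss_fn(model(x), y).item())
        return losses   # a tiny barrier suggests the minima are effectively connected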

  • The intuition that we have, the usual intuition

  • of a local minimum in one dimension is completely wrong.

  • Building a box in a hundred million dimensions

  • is very hard because you need a lot of walls.

  • So there's always going to be directions

  • where you can escape.

  • And that creates saddle points.

  • So that's one thing.

  • And then there is work on generalization ability.

  • Like why do those things generalize the way they do,

  • even though they are way overparameterized.

  • There's an interesting recent paper--

  • one of the co-authors is Ben Recht from Berkeley--

  • where they showed that you can take an ImageNet-style

  • network, a convolutional net.

  • You set the labels to completely random labels.

  • And those neural nets can still learn the training

  • set completely without errors.

  • One million training samples, they will just nail it,

  • 100% correct.

  • Of course, the test error is chance.

  • But what that means is that there

  • is a huge amount of capacity in those networks

  • that they are able to recruit, if they need to.

  • But when you train them on things that make sense,

  • they don't overfit that much.

  • They do overfit, but not ridiculously.
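
The experiment he is describing can be sketched in a few lines, assuming a torchvision-style dataset with a .targets attribute (the training loop itself is the usual one and is omitted):

    import torch

    def randomize_labels(dataset, num_classes, seed=0):
        # keep the images, replace every label with a uniformly random one
        g = torch.Generator().manual_seed(seed)
        dataset.targets = torch.randint(0, num_classes, (len(dataset),), generator=g).tolist()
        return dataset

    # train the same convnet with the same hyperparameters on the randomized set:
    # training accuracy still approaches 100%, while test accuracy drops to chance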

  • AUDIENCE: Hi.

  • So it seems like it's very clear that it's

  • important to have a strong predictive model of the world

  • to achieve intelligence.

  • But it also seems like there may be other components

  • to it, things such as creativity or metacognition.

  • So do you have any thoughts on how

  • we might achieve those other parts of intelligence?

  • YANN LECUN: So metacognition probably

  • is number 562 in the list of problems

  • we have to solve, a list that maybe has 1,000 items.

  • I'm not sure about that.

  • But creativity, I think those GANs actually exhibit

  • some level of creativity.

  • So there are people, for example,

  • at Rutgers, one of them is actually now

  • at Facebook, who used GANs to generate paintings,

  • abstract paintings in particular styles.

  • And they look really nice.

  • So that begs the question:

  • what does creativity really mean?

  • We have a couple projects at Facebook

  • that I can't talk about yet, but soon, that involve also

  • creating kind of artistic artifacts using

  • those generative models.

  • And they look interesting.

  • People who actually are in the business of creating artifacts

  • are actually impressed.

  • AUDIENCE: Hi.

  • I do some particle physics here.

  • I'm an undergrad.

  • And one of the big problems that we're

  • facing in implementing technologies like this

  • is that the data we have is collected almost

  • from a third person perspective where you have access

  • to all the available information in three dimensions.

  • And so it's very hard to take a first person camera view

  • perspective of an event and try to pick apart what's going on.

  • What are the major computational challenges--

  • what's the difference between taking like a camera

  • view of these scenes and dissecting them

  • with a convolutional neural net versus somehow finding

  • an effective way of analyzing three dimensional information?

  • YANN LECUN: OK.

  • So a number of different answers there.

  • So first of all, there is quite a lot of interest

  • for the use of convolutional nets

  • in the context of high energy physics,

  • basically for trajectory filtering essentially,

  • so filtering events that are interesting.

  • I'm sure that's the kind of stuff you were thinking of.

  • I actually gave a talk at CERN maybe a couple years ago,

  • or a year and a half ago, and met a bunch

  • of people working on this.

  • And it's really expanding.

  • There's a colleague of mine at NYU called Kyle Cranmer who

  • has been working on this kind of stuff

  • actually using those GANs.

  • He's come up with good ideas on characterizing trajectories,

  • on generative models of trajectories.

  • So that said, very often, those trajectories are in 3D.

  • And you'd like to be able to basically analyze them in 3D.

  • So you could use those 3D convolutional nets

  • that I was talking about earlier, in the middle of the talk.

  • They are sort of efficient for this,

  • because most of the voxels in a high energy physics experiments

  • are empty.

  • So you would like to be able to concentrate the computation

  • where things are relevant.

  • That's one thing.

  • The second thing is that there is

  • a new set of ideas I didn't talk about called graph

  • convolutional nets, or spectral networks.

  • So it's basically the idea that an image, a normal image, you

  • can think of an image as a function on a grid graph,

  • on a regular grid.

  • The pixels form a grid.

  • You can think of it as a graph where each pixel is connected

  • to its nearest neighbors.

  • And that indicates that--

  • it's just a reflection of the fact

  • that neighboring pixels are correlated.

  • Now imagine now that you have data

  • that comes to you not in kind of a flat grid graph,

  • but in a weird graph, like a cylinder or something,

  • like the calorimeter in a high energy physics experiment,

  • or with some other set of sensors that is non-Euclidean.

  • You can actually define convolutions in those spaces.

  • And they're basically diagonal operators

  • in the graph Laplacian where the graph represents

  • the neighborhood relationships.

  • And so people have actually come up with ways

  • to apply convolutional nets to those non-Euclidean domains.
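
A hedged sketch of that spectral construction for a small graph: build the graph Laplacian, move the signal into its eigenbasis, apply a filter that is diagonal there, and come back. For real networks one would use polynomial approximations rather than a full eigendecomposition; this shows only the idea:

    import numpy as np

    def spectral_graph_conv(x, adjacency, spectral_filter):
        # x: a signal on the nodes; adjacency: symmetric 0/1 matrix of the graph
        degree = np.diag(adjacency.sum(axis=1))
        laplacian = degree - adjacency
        eigvals, U = np.linalg.eigh(laplacian)     # the graph Fourier basis
        x_hat = U.T @ x                            # graph Fourier transform
        y_hat = spectral_filter(eigvals) * x_hat   # "convolution" = diagonal operator here
        return U @ y_hat                           # transform back to the nodes

    # e.g. a smoothing filter: spectral_graph_conv(x, A, lambda lam: np.exp(-lam))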

  • In fact, there is going to be a tutorial at NIPS

  • next week on precisely that topic in exactly one

  • week, Monday next week, which I'm a co-speaker on.

  • And I'm actually going to speak.

  • There's going to be [INAUDIBLE].

  • AUDIENCE: You talked about--

  • sorry.

  • You talked about systems that both learn and reason.

  • And it seems to me like you argued that to get a strong AI,

  • you would need to do both of these things.

  • Now it seems to me like obviously humans do this.

  • But humans in a lot of ways are very dumb.

  • They make a lot of mistakes.

  • And they're very plastic.

  • And they need to learn to reason.

  • Whereas a lot of AI systems and reinforcement learning systems

  • do something very smart that takes

  • a lot of computational power.

  • And it's very much hard coded.

  • Do you think we'll see a trend towards dumber and more

  • plastic reasoning systems?

  • YANN LECUN: So I think most reinforcement--

  • Michael, correct me if I'm wrong.

  • But I think most reinforcement learning systems

  • that people are training today actually

  • are completely reactive.

  • They are very simple in terms--

  • I mean, there's very little actual reasoning.

  • Other than things like AlphaGo, AlphaGo Zero,

  • where there is tree exploration in the set of possible futures,

  • which is used for training.

  • Once it's trained, it actually just plays

  • without much tree exploration, actually.

  • So there's not a huge amount of reasoning there.

  • And that's a limitation not of reinforcement learning per se,

  • but of the architectures we use for all of our AI systems.

  • So I think what we consider

  • intelligent behavior involves this ability to predict.

  • In fact, I think the essence of intelligence

  • really is the ability to predict.

  • And so if you have a good model of the world that

  • is accurate for prediction, then you

  • can use it to plan a sequence of actions ahead

  • and perhaps handle uncertainties about it.

  • And things like this.

  • So this is what reasoning really is about,

  • is predicting ahead what's going to happen, not necessarily

  • in time.

  • But also sort of simulating, so manipulating models.

  • Like when you think in your head about mathematics

  • or various other things, very often,

  • you have mental models that you manipulate.

  • They are simulators in a way.

  • You give them inputs, and they change.

  • And things like that.

  • That I think is really the essence

  • of reasoning and intelligence.

  • ROSSI LUO: Looking at the clock, it's 5:30.

  • I'm going to take one last question.

  • And if you have additional questions,

  • you can probably just bring them to the floor

  • discussions afterwards.

  • AUDIENCE: What's-- I'm not that familiar with deep learning

  • neural nets.

  • But I'm curious.

  • If I wanted to learn an object up

  • to something like affine transformations,

  • can I do transfer learning to do that?

  • Can you learn a whole group of transformations,

  • and then learn an object and then

  • have the object under those transformations?

  • YANN LECUN: So yes and no.

  • So if you take a convolutional net, for example,

  • and you train it on datasets like ImageNet that

  • have lots of different instances of the same objects and various

  • and things like this, it learns the notion

  • of object relatively independently of the viewpoint,

  • but not completely.

  • So it has to recognize a dog, whether it's a profile

  • view or a frontal view.

  • But if you take the head of the dog upside down,

  • it probably won't be able to recognize it.

  • The same way we have a hard time recognizing people

  • when their faces are upside down.

  • AUDIENCE: Not exclu-- little rotations,

  • shears, things like that.

  • YANN LECUN: Right, right.

  • So small rotation, shears, and scaling,

  • that that's handled by the pooling operation

  • in convolutional nets.

  • AUDIENCE: Right.

  • But there's nothing, no explicit geometric--

  • YANN LECUN: No.

  • There's no explicit 3D geometry.

  • And there is no real explicit 3D geometry,

  • except for the fact that whenever a feature is

  • detected in one location, it's also

  • detected in other locations.

  • And the fact that there is this pooling operation

  • that basically builds in a little bit of resistance--

  • smoothness to variations of the location

  • of particular features.

  • So small variations of the position of elementary features

  • due to rotation, shear, and things like this,

  • will actually--

  • AUDIENCE: You're pooling them.

  • And that's why you're getting them.

  • But you're not explicitly modeling.

  • Same thing with Newtonian physics.

  • There's no built in physics yet, right?

  • YANN LECUN: Right.

  • There's-- no.

  • No built in physics.

  • AUDIENCE: Thank you.

  • ROSSI LUO: The main event I think is over.

  • And if you have additional questions,

  • you're welcome to briefly discuss with Professor LeCun

  • afterwards.

  • And thanks [INAUDIBLE].

  • And let's give Professor Yann LeCun a round of applause.

  • [APPLAUSE]
