
  • [ MUSIC ]

  • [ APPLAUSE ]

  • BENGIO: Thank you.

  • All right.

  • Thank you for being here and participating in this colloquium.

  • So, I'll tell you about some of the things that are happening in deep learning,

  • but I only have 30 minutes so I'll be kind of quickly going through some subjects

  • and some challenges for scaling up deep learning towards AI.

  • Hopefully you'll have chances to ask me some questions during the panel that follows.

  • One thing I want to mention is I'm writing a book.

  • It's called Deep Learning, and you can already download most of the chapters.

  • These are draft versions of the chapters from my web page.

  • It's going to be an MIT Press book hopefully next year.

  • So, what is deep learning and why is everybody excited about it?

  • First of all, deep learning is just an approach to machine learning.

  • And what's particular about it, as Terry was saying, it's inspired by brains.

  • Inspired in the sense that we're trying to understand some of the principles, computational

  • and mathematical principles that could explain the kind of intelligence based

  • on learning that we see in brains.

  • But from a computer science perspective,

  • the idea is that these algorithms learn representations.

  • So, representation is a central concept in deep learning, and, of course,

  • the idea of learning representations is not new.

  • It was part of the idea behind the original neural nets,

  • like the Boltzmann machine and the back prop from the '80s.

  • But what's new here and what happened about ten years ago is a breakthrough that allowed us

  • to train deeper neural networks, meaning that have multiple levels of representation.

  • And why is that interesting?

  • So already I mentioned that there are some theoretical results showing

  • that you can represent some complicated functions that are the result of many levels

  • of composition efficiently with these deep networks, whereas you might --

  • or in general, you won't be able to represent these kinds of functions

  • with a shallow network that doesn't have enough levels.

  • What does it mean to have more depth?

  • It means that you're able to represent more abstract concepts,

  • and these more abstract concepts allow these machines to generalize better.

  • So, that's the essence of what's going on here.

  • All right.

  • So, the breakthrough happened in 2006 when, for the first time,

  • we were able to train these deeper networks and we used unsupervised learning for that,

  • but it took a few years before these advances made their way

  • to industry and to large scale applications.

  • So, it started around 2010 with speech recognition.

  • By 2012, if you had an Android phone, like this one, well,

  • you had neural nets doing speech recognition in it.

  • And now, of course, it's everywhere.

  • For speech, it's changed the field of speech recognition.

  • Everything uses it, essentially.

  • Then about two years later, 2012, there was another breakthrough using convolutional networks,

  • which are a particular kind of deep networks that had been around for a long time

  • but that had been improved using some of the techniques we discovered in recent years.

  • It really allowed us to make a big impact in the field of computer vision

  • and object recognition, in particular.

  • So, I'm sure Fei-Fei will say a few words later about that event and the role

  • of the ImageNet dataset in this.

  • But what's going on now is that neural nets are going beyond their traditional realm

  • of perception and people are exploring how to use them for understanding language.

  • Of course, we haven't yet solved that problem.

  • This is where a lot of the action is now and, of course,

  • a lot of research and R&D continues in computer vision.

  • Now, for example, expanding to video and many other areas.

  • But I'm particularly interested in the extension of this field in natural language.

  • There are other areas.

  • You've heard about reinforcement learning.

  • There is a lot of action there, robotics, control.

  • So, many areas of AI are now more and more seeing the potential gain coming

  • from using these more abstract systems.

  • So, today, I'm going to go through three of the main challenges that I see

  • for bringing deep learning, as we know it today, closer to AI.

  • One of them is computational.

  • Of course, for a company like IBM and other companies

  • that build machines, this is an important challenge.

  • It's an important challenge because what we've observed is

  • that the bigger the models we are able to train,

  • given the amount of data we currently have, the better they are.

  • So, you know, we just keep building bigger models

  • and hopefully we're going to continue improving.

  • Now, that being said, I think it's not going to be enough so there are other challenges.

  • One of them I mentioned has to do with understanding language.

  • But understanding language actually requires something more.

  • It requires a form of reasoning.

  • So, people are starting to use these recurrent nets you heard about, recurrent networks

  • that can be very deep, in some sense, when you consider time in order

  • to combine different pieces of evidence, in order to provide answers to questions.

  • And essentially, display different forms of reasoning.

  • So, I'll say a few words about that challenge.

  • And finally, maybe one of the most important challenges that's maybe more fundamental even is

  • the unsupervised learning challenge.

  • Up to now, all of the industrial applications of deep learning have exploited supervised learning

  • where we have labeled data: we've said, in that image, there's a cat;

  • in that image, there's a desk, and so on.

  • But there's a lot more data we could take advantage of that's unlabeled,

  • and that's going to be important because all of the information we need to build these AIs has

  • to come from somewhere, and we need enough data, and most of it is not going to be labeled.

  • Right. So, as I mentioned, and I guess as my colleague,

  • Ilya Sutskever from Google keeps saying, bigger is better.

  • At least up to now, we haven't seen the limitations.

  • I do believe that there are obstacles, and bigger is not going to be enough.

  • But clearly, there's an easy path forward with the current algorithms just

  • by making our neural nets a hundred times faster and bigger.

  • So, why is that?

  • Basically, what I see in many experiments with neural nets right now is that they --

  • I'm going to use some jargon here.

  • They under fit, meaning that they're not big enough or we don't train them long enough

  • for them to exploit all of the information that there is in the data.

  • And so they're not even able to learn the data by heart, right,

  • which is the thing we usually want to avoid in machine learning.

  • But that comes almost for free with these networks, and so we just have to press

  • on the pedal of more capacity and we're almost sure to get an improvement here.

  • All right.

  • To just illustrate graphically that we have some room to approach the size of human brains,

  • this picture was made up by my former student, Ian Goodfellow, where we see the sizes

  • of different organisms and neural nets over the years; the DBN here was from 2006.

  • The AlexNet is the breakthrough network of 2012 for computer vision,

  • and the AdamNet is maybe a couple of years old.

  • So, we see that the current technology is maybe between a bee and a frog in terms of size

  • of the networks, for about the same number of synapses per neuron.

  • So, we've almost reached the average number of synapses per neuron you see in natural brains,

  • between a thousand and ten thousand.

  • In terms of number of neurons, we're several orders of magnitude away.

  • So, I'm going to tell you a little bit about a stream of research we've been pushing in my lab,

  • which is more connected to the computing challenge and potentially to hardware

  • implementations: can we train neural nets that have very low precision?

  • So, we had a first paper at ICLR.

  • By the way, ICLR is the deep learning conference, and it happens every year now.

  • Yann LeCun and I started it in 2013, and it's been an amazing success

  • that year and every year since then.

  • We're going to have a third version next May.

  • And so we wanted to know how many bits do you actually require.

  • Of course, people have been asking these kinds of questions for decades.

  • But using the current state-of-the-art neural nets, we found 12 bits,

  • and I can show you some pictures how we got these numbers on different data sets

  • and comparing different ways of representing numbers with fixed point or dynamic fixed point.

  • And also, depending on where you use those bits, you actually need fewer bits

  • in the activations than in the weights.

  • So, you need more precision in the weights.

  • So, that was the first investigation.
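A minimal sketch of what fixed-point and dynamic fixed-point quantization can look like, assuming a simple round-to-nearest scheme and a per-tensor choice of fractional width (both are illustrative assumptions; the 12-bit total matches the number quoted above):

```python
import numpy as np

def to_fixed_point(x, total_bits=12, frac_bits=8):
    """Round to the nearest representable fixed-point value: one sign bit,
    (total_bits - 1 - frac_bits) integer bits, frac_bits fractional bits."""
    step = 2.0 ** -frac_bits                      # smallest representable increment
    max_val = 2.0 ** (total_bits - 1 - frac_bits) - step
    return np.clip(np.round(x / step) * step, -max_val, max_val)

def dynamic_frac_bits(x, total_bits=12):
    """Dynamic fixed point: pick the fractional width per tensor from its
    magnitude, so weights and activations each get their own scaling."""
    int_bits = max(0, int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-12))) + 1)
    return total_bits - int_bits

w = np.random.randn(4, 4).astype(np.float32)
w_q = to_fixed_point(w, frac_bits=dynamic_frac_bits(w))
```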

  • But then we thought -- so that's the --

  • for the weights, that's the number of bits you actually need to keep the information

  • that you are accumulating from many examples.

  • But when you actually run your system during training, especially,

  • maybe you don't need all those bits.

  • Maybe you can get the same effect by introducing noise

  • and discretizing randomly those weights to plus one or minus one.

  • So, that's exactly what we did.

  • The idea is -- the cute idea here is that we can replace a real number by a binary number

  • that has the same expected value by, you know, sampling those two values with a probability

  • such that the expected value is the correct one.

  • And now, instead of having a real number to multiply,

  • we have a bit to multiply, which is easy.

  • It's just an addition.
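A minimal sketch of that stochastic binarization, assuming weights clipped to [-1, 1]: sample +1 with probability (1 + w) / 2 and -1 otherwise, so the expected value of the binary weight is exactly w.

```python
import numpy as np

def binarize_stochastic(w):
    """Sample +1 with probability (1 + w) / 2, else -1, for w in [-1, 1].
    E[result] = w, so on average the computation is unchanged."""
    w = np.clip(w, -1.0, 1.0)
    p_plus = (1.0 + w) / 2.0
    return np.where(np.random.rand(*w.shape) < p_plus, 1.0, -1.0)

w = np.array([[0.3, -0.8], [0.0, 0.95]])
w_b = binarize_stochastic(w)  # every entry is +1 or -1
# A forward pass x @ w_b now needs only additions and subtractions.
```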

  • And why would we do that?

  • Because we want to get rid of multiplications.

  • Multiplication is what takes up most of the surface area on chips built for neural nets.

  • So, we had a first try at this, and this is going to be presented at the next NIPS

  • in the next few weeks in Montreal.

  • And it allows us to get rid of the multiplications in the feed forward computation

  • and in the backward computation where we compute gradients.

  • But one multiplication remained -- even if you discretize the weights,

  • there is another multiplication at the end of the back prop

  • where you don't multiply weights.

  • You multiply activations and gradients.

  • So, if those two things are real valued, you still need regular multiplication.

  • So, we -- yes, so that's going to be in the NIPS paper.

  • But the new thing we did is to get rid of that last multiplication that we need for the update

  • of the weights, so ΔW is the change in the weights,

  • ∂C/∂a is the gradient that's propagated back, and h is the activations.

  • It's some jargon.

  • But anyway, we have to do this multiplication, and so, well, the only thing we need

  • to do is take one of these two numbers and replace it again by a stochastic quantity

  • that is not going to require multiplication.

  • So, instead of binarizing it, we quantize it stochastically to its exponent.

  • In other words, we get rid of the mantissa

  • and represent it on a log scale.

  • So, if you do that, again, you can map the activations

  • to some values that are just powers of two.

  • And now multiplication is just addition.

  • This is an old trick.

  • I mean, the trick of using powers of two is an old trick.

  • The new trick is to do this stochastically so that you actually get the right thing

  • on average, and stochastic gradient descent works perfectly fine.
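A minimal sketch of that stochastic power-of-two quantization (the probability rule here is one natural way to make the expectation come out right, not necessarily the paper's exact scheme):

```python
import numpy as np

def quantize_pow2_stochastic(h):
    """Round |h| to 2**e or 2**(e+1), where e = floor(log2(|h|)),
    choosing the upper power with probability (|h| - lo) / (hi - lo).
    That makes E[q] = |h|, and multiplying by q is just an exponent
    addition (a bit shift) instead of a full multiplication."""
    sign = np.sign(h)
    a = np.abs(h) + 1e-20                 # avoid log2(0)
    e = np.floor(np.log2(a))
    lo, hi = 2.0 ** e, 2.0 ** (e + 1)
    p_up = (a - lo) / (hi - lo)           # unbiased: E[q] = |h|
    q = np.where(np.random.rand(*np.shape(a)) < p_up, hi, lo)
    return sign * q

print(quantize_pow2_stochastic(np.array([0.3, 1.7, 5.0])))  # e.g. [0.25 2. 4.]
```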

  • And so we're running some experiments on a few data sets showing that you get a bit

  • of a slowdown because of the extra noise.

  • The green and yellow curves here are with this trick: binarized weights

  • and stochastically quantized computations.

  • And the good news is, well, it learns even better, actually,

  • because this noise acts as a regularizer.

  • Now, this -- yes, this is pretty good news.

  • Now, why is this interesting?

  • It's interesting for two reasons.

  • One is for hardware implementations, this could be useful.

  • The other reason is that it connects with what the brain does with spikes, right.

  • So, you can think of it this way: if I go back here, when you replace activations

  • by some stochastic binary values that have the right expected value, you're introducing noise.

  • But you're actually not changing the computation of the gradient that much.

  • And so it would be reasonable for brains to use the same trick

  • if they could save on the hardware side.

  • Okay. So now let me move on to my second challenge, which has to do with language and,

  • in particular, language understanding.

  • There's a lot of work to do in this direction,

  • but the progress in the last few years is pretty impressive.

  • Actually, I was part of the beginning of that process of extending the realm

  • of application of neural networks to language.

  • So, in 2000, we had a NIPS paper where we introduced the idea of learning

  • to represent probability distributions over sequences of words.

  • In other words, being able to generate sequences of words that look like English

  • by decomposing the problem in two parts.

  • That's a kind of a central element that you find in neural nets and especially in deep learning,

  • which is think of the problem not as going directly from inputs to outputs,

  • but breaking the problem into two parts.

  • One is the representation part.

  • So, learning to represent words here by mapping each word to a fixed size, real valued vector.

  • And then taking those representations and mapping them to the answers you care about.

  • And here, that's predicting the next word.
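A minimal sketch of that two-part decomposition, with hypothetical sizes and a single softmax layer standing in for the full architecture of the original model:

```python
import numpy as np

vocab_size, dim, context = 10000, 64, 3

# Part 1: representation -- each word id maps to a learned real-valued vector.
embeddings = np.random.randn(vocab_size, dim) * 0.01

# Part 2: prediction -- the concatenated context vectors map to a
# probability distribution over the next word.
W_out = np.random.randn(context * dim, vocab_size) * 0.01

def next_word_probs(context_ids):
    x = embeddings[context_ids].reshape(-1)   # look up and concatenate
    logits = x @ W_out
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

p = next_word_probs([12, 404, 7])             # P(next word | 3-word context)
```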

  • It turned out that those representations of words

  • that we learned have incredibly nice properties and they capture a lot

  • of the semantic aspects of words.

  • And there's been tons and tons of papers

  • to analyze these things, to use them in applications.

  • So, these are called word vectors, word embeddings, and they're used all over the place

  • and are becoming commonplace in natural language processing.

  • In the last couple of years, there's been a kind of an exciting observation

  • about these word embeddings, which is that they capture analogies,

  • even though they were not programmed for that.

  • So, what do I mean?

  • What I mean is that if you take the vectors for each word and do operations on them,

  • like subtract and add them, you can get interesting things coming up.

  • So, for example, if you take the vector for queen and you subtract the vector for king,

  • you get a new vector, and that vector is pretty much aligned with the vector that you get

  • from subtracting the representation for woman from the representation for man.

  • So, that means that you could do something like woman minus man, plus king and get queen, right.

  • So, it can answer the question, you know,

  • what is to woman what king is to man, and it would find queen.
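A toy illustration of that vector arithmetic, with hand-picked 2-D vectors standing in for learned embeddings (real word vectors are learned from data but exhibit the same linear structure):

```python
import numpy as np

def analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?': nearest cosine neighbour of
    emb[b] - emb[a] + emb[c], excluding the three query words."""
    v = emb[b] - emb[a] + emb[c]
    sims = {w: v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
            for w, u in emb.items() if w not in (a, b, c)}
    return max(sims, key=sims.get)

emb = {"man":   np.array([1.0, 0.0]),
       "woman": np.array([1.0, 1.0]),
       "king":  np.array([3.0, 0.0]),
       "queen": np.array([3.0, 1.0])}
print(analogy(emb, "man", "woman", "king"))   # -> queen
```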

  • So, that's interesting, and there are some nice explanations that we're starting