YOSHUA BENGIO: [INAUDIBLE]. Thank you [INAUDIBLE]. So I'll talk about [INAUDIBLE]. I'll talk about representations and learning representations. And the word deep here-- I'll explain what it means.

So my goal is to contribute to building intelligent machines, also known as AI. And how do we get a machine to be smart-- to take good decisions? Well, it needs knowledge. AI researchers from the early days-- the '50s, '60s, '70s-- tried to give the knowledge to the machine-- the knowledge we have explicitly. And it didn't work quite as well as was hoped. One reason is that a lot of our knowledge is not something we can communicate verbally or write down in a program. So that knowledge has to come from somewhere else. And basically what we have found is that you can get that knowledge by observing the world around us. That means learning. OK-- so we need learning for AI.

What is learning? What is machine learning? It's not about learning things by heart-- that's just memorization. What it is about is generalizing from the examples you've seen to new examples. And what I like to tell my students is that it's taking the probability mass that sits on the training examples and somehow guessing where it should go-- which new configurations of the things we see make sense or are plausible. This is what learning is about. It's guesswork. At first we can measure [INAUDIBLE] we can guess. And I'll mention something about dimensionality and geometry that comes up when we think about this [INAUDIBLE]. And one of the messages will be that we can maybe fight this dimensionality problem by allowing the machine to discover underlying causes-- the underlying factors that explain the data. And this is a little bit like [INAUDIBLE] is about.

So let's start from learning-- an easy [INAUDIBLE] of learning. Let's say we observe x,y pairs where x is a number and y is a number. And the stars here represent the examples we've seen of x,y configurations. We want to generalize to new configurations. In other words, in this problem, typically we want to predict a y given a new x. And there's an underlying relationship between y and x-- the expected value of y given x-- which is shown with this purple curve. But we don't know it. That's the problem with machine learning: we're trying to discover something we don't know already. And we can guess some function. This is the predicted or learned function.

So how could we go about this? One of the most basic principles by which machine learning algorithms are able to do this is to assume something very simple about the world around us-- about the data we're getting or the function we're trying to discover. It's just assuming that the function we're trying to discover is smooth, meaning if I know the value of the function at some point x, and I want to know the value at some nearby point x prime, then it's reasonable to assume that the value of the function at x prime is close to its value at x. That's it. You can formalize that and [INAUDIBLE] in many different ways and exploit it in many ways. And what it means here is that if I ask for the value of y at this point, what I'm going to do is look up the values of y that I observed at nearby points and, combining these, make a reasonable guess like this one. And if I do that on problems like this, it's actually going to work quite well. And a large fraction of the applications that we're seeing use this principle.
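A minimal sketch of this smoothness-based generalization-- predict y at a new x by averaging the y values observed at the k nearest training points (the data and the choice of k are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D regression data: an underlying smooth curve plus noise
x_train = rng.uniform(0, 10, size=50)
y_train = np.sin(x_train) + 0.1 * rng.normal(size=50)

def knn_predict(x_new, k=5):
    # smoothness assumption: the value at x_new should be close to the
    # values observed at nearby points, so average the k nearest ones
    dist = np.abs(x_train - x_new)
    nearest = np.argsort(dist)[:k]
    return y_train[nearest].mean()

print(knn_predict(3.0))   # close to sin(3.0), roughly 0.14
```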
And you can get quite far with just this principle. But if we only rely on this principle for generalization, we're going to be in trouble. That's one of the messages I want to explain here. So why are we going to be in trouble? Well, basically we're doing some kind of interpolation. So if I see enough examples-- the green stars here-- to cover the ups and downs of the function I'm trying to learn, then I'm going to be fine. But what if the function I want to learn has many more ups and downs than I can possibly observe through data? Because even Google has a finite number of examples. Even if you have millions or billions of examples, the functions we want to learn for AI are not like this one. The number of configurations of the variables of interest may be exponentially large-- maybe bigger than the number of atoms in the universe. So there's no way we're going to have enough examples to cover all the configurations. For example, think of the number of different English sentences, which is something that Google is interested in.

This problem is illustrated by the so-called curse of dimensionality, where you consider what happens when you have not just one variable but many variables and all of their configurations. How many configurations of N variables do you have? Well, you have an exponential number of configurations. So if I want to learn about a single variable-- say a real-valued variable-- I can just divide its values into intervals, count how many examples fall into each of those bins, and estimate the probability of the different intervals coming up. That's easy, because I only want to know about a small number of different configurations. But if I'm looking at two variables, then the number of configurations may be the square of that-- and with three, even more. And typically I'm going to have hundreds of variables-- if you're thinking about images, thousands, tens of thousands, hundreds of thousands. So it's crazy how many configurations there are. So how do we possibly generalize to new configurations? We cannot just break up this space into small cells and count how many things happen in each cell, because the new examples that we care about-- the new configurations that we're asked about-- might be in some region where we haven't seen any data. So that's the problem of generalizing [INAUDIBLE].
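To put numbers on that counting argument (10 intervals per variable is an arbitrary choice):

```python
import math

bins_per_variable = 10
for d in [1, 2, 10, 100, 1000]:
    # number of distinct cells grows exponentially with the number of variables
    log10_cells = d * math.log10(bins_per_variable)
    print(f"{d:>4} variables -> about 10^{log10_cells:.0f} cells to cover with data")
```

Even a tiny image has hundreds of variables, so no dataset can put an example in more than a vanishing fraction of the cells.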
So there's one thing that can help us, but it's not going to be sufficient. It's something that happens with the data we care about in AI. It's very often the case in vision, in language processing and understanding, and in many other problems, that the set of configurations of variables that are plausible-- that can happen in the real world-- occupies a very small volume of the whole set of possible configurations. So let me give an example. In images, if I choose the pixels in an image randomly-- in other words, if I sample an image from a completely uniform distribution-- I'm going to get things like this. Just [INAUDIBLE]. And I can repeat this for eons and eons, and I'm never going to sample something that looks like a face. So what it means is that faces-- images of faces-- are very rare in the space of images. They occupy a very small volume, much less than what this picture would suggest. And so this is a very important hint. It means that actually the task is to find out where this distribution concentrates.

I have another example here. If you take the image of a four like this one and you apply some geometric transformations to it-- rotating it, scaling it-- you get slightly different images. And if at each point you allow yourself to make any of these transformations, you trace out a so-called manifold-- a surface of possible images. Each point here corresponds to a different image. And the number of different changes you can make is basically the dimensionality of this manifold. So in this case, even though the data lives in a high-dimensional space, the actual variations we care about are of low dimensionality. And knowing that, we can maybe do better in terms of learning. One thing about the curse of dimensionality is that I don't like the name, because it's not really about dimensionality. You can have many dimensions but a very simple function. What really matters is how many variations the function has-- how many ups and downs. We actually have some fairly formal results about this: the number of examples you would need, if you were only relying on the smoothness assumption, is essentially linear in the number of ups and downs of the function [INAUDIBLE].

So let's come back to this idea of learning where to put probability mass. In machine learning, what we have is data. Each example is a configuration of variables, and we know that this configuration occurred in the real world, so we can say that this configuration is plausible. This is the space of configurations I'm showing in 2D. So we know that this configuration is plausible. [INAUDIBLE]. So we can put a peak of probability here, and we can put a peak at every example. The question is how do we take this probability mass and give a little bit of it to other places. In particular, we'd like to put mass in between-- if there really is a manifold that has some structure, and we could discover that structure, it would be great. So the classical machine learning way of doing things is to say that the distribution function-- the function that you're trying to learn, in this case-- is smooth. So if it's very probable here, it must also be probable in the neighborhood. So we can just apply some mathematical operation that will shift some mass from here to the nearby neighbors. Then we get a distribution like this as our model. And that works reasonably well. But it's not the right thing to do. It's putting mass in many directions we don't care about. Instead, what we're going to do is discover that there is something about this data-- there is some structure, there is some abstraction-- that allows us to be very specific about where we're going to put probability mass. And we might discover something like this, which in 2D doesn't look like a big difference. But in high dimensions, the number of directions you're allowed to move here is very small compared to the number of dimensions, and volume grows exponentially with dimension. So you can have a huge gain by guessing properly in which directions things are allowed to move while keeping high probability.
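A minimal sketch of that classical approach-- a Gaussian Parzen-window estimator that spreads a small bump of probability mass around every training example, equally in all directions (the bandwidth and the toy data are made up; this is the kind of smoothing the talk contrasts with manifold-aware models):

```python
import numpy as np

rng = np.random.default_rng(0)

# training examples: configurations we know are plausible
X_train = rng.normal(size=(200, 2))

def parzen_density(x, bandwidth=0.3):
    # put a small Gaussian bump of mass on each training example and average;
    # the mass spreads equally in all directions around every example
    d2 = np.sum((X_train - x) ** 2, axis=1)
    dim = X_train.shape[1]
    norm = (2 * np.pi * bandwidth ** 2) ** (dim / 2)
    return np.mean(np.exp(-d2 / (2 * bandwidth ** 2))) / norm

print(parzen_density(np.zeros(2)))         # high: near the data
print(parzen_density(np.array([5., 5.])))  # essentially zero: far from the data
```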
So, now to the core of this presentation, which is about representation learning. I've talked about learning in general and some of the issues-- some of the challenges-- with applying learning to AI. Now, when you look at how machine learning is applied in industry, what people spend 90% of their time on-- the effort of the engineers-- is not really improving machine learning. They use existing machine learning. But to make the machine learning [INAUDIBLE] work well, they do [INAUDIBLE] feature engineering. That means taking the raw data and transforming it-- extracting some features-- deciding what matters-- throwing away the things that we think don't matter. And that's essentially using humans and our intelligence and our understanding of the problem to figure out the factors that matter-- to figure out the dependencies that matter and so on. So what representation learning is about is trying to do with machines what humans do right now, which is extracting those features-- discovering what a good representation of your data is. And one way to think about it is that the machine is trying to guess not just the features or computations that are useful for us to explain our [INAUDIBLE], but really what the underlying factors are that explain the [INAUDIBLE]-- what the underlying causes are. And the guesses about what these are for a particular example are exactly what we'd like to have as our representation. Of course, this is hard to define, because we don't know what the right factors are-- what the right causes are. This is the objective we have. This is [INAUDIBLE], by the way.

So there is a very important family of algorithms, as [INAUDIBLE] mentioned-- neural networks that have multiple levels like this-- that have been around since at least the '80s. They have multiple layers of computation. And one of the things I've been trying to do is to find some properties they have, that other algorithms may not have, that may be useful, and try to understand why these properties are useful. In particular, there's the notion of depth. The idea of deep learning is that not only are you going to have representations of the data that are learned, but you're going to have multiple levels of representation. And why would it matter to have multiple levels of representation? Because you're going to have low-level and high-level representations, where the high-level representations are going to be more abstract-- more nonlinear-- and capture structure that is less obvious in the data. So what we call deep learning is when the learning algorithm can discover these representations and even decide how many levels there should be. So I mentioned neural networks as the original example of deep learning. What these algorithms do is learn some computation-- some function that takes an input vector and maps it to some output, which could be a vector, through different levels of representation, where each level is composed of units that do a computation inspired by how the neurons in the brain work.
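A minimal sketch of what those multiple levels look like computationally-- each level is a learned linear map followed by a neuron-like nonlinearity applied to the previous level's output (the sizes and the tanh nonlinearity here are arbitrary, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b):
    # one level of representation: linear transform + neuron-like nonlinearity
    return np.tanh(x @ W + b)

x = rng.random(100)                        # raw input (e.g. pixels)
sizes = [100, 50, 20, 5]                   # deeper levels = more abstract
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

h = x
for W, b in params:                        # h1, h2, h3: successive representations
    h = layer(h, W, b)
print(h.shape)                             # (5,) top-level representation
```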
So they have a property that you don't find in many learning algorithms, called distributed representations. So let's first see how the other learning algorithms work-- how they generalize. Remember, this is going to be very similar to when I talked about the smoothness assumption. They rely on this smoothness assumption. Deep learning also relies on this smoothness assumption but introduces additional priors-- additional knowledge, if you will. So when you only rely on the smoothness assumption, the way you work is you essentially take your input space-- [INAUDIBLE] space-- and break it up into regions. For example, this is what happens with clustering, nearest neighbors, SVMs, classical non-parametric statistical algorithms, decision trees and so on. So what happens is that after seeing the data, you break up the input space into regions, and you generalize locally. So if you have a function that outputs something here-- because you've seen an example here, for example-- you can generalize and say, well, in the neighborhood the output is going to be similar, and maybe some kind of interpolation with the neighboring regions is going to be performed. But the crucial point, from a mathematical point of view, is that there's a counting argument here: how many parameters-- how many degrees of freedom-- do we have to define this partition? Well, basically, you need at least one parameter per region. The number of parameters is going to grow with the number of regions. See-- if I want to distinguish two regions, I need to say where the first one is, or how to separate between the two. For example, [INAUDIBLE] specifying the center of each region. So the number of things I have to specify from the data is essentially equal to the number of regions I can distinguish. So you might think, well, there's no other way you could do it, right? I mean, how could you possibly create a new region where you haven't seen any data and distinguish it meaningfully? Well, you can. Let me give you an example.

So this is what happens with distributed representations, which you get with things like factor models, PCA, RBMs, neural nets, sparse coding, and deep learning. What you're going to do is still break up the input space into regions and be able to generalize locally, in the sense that things that are nearby are going to have similar outcomes. But the way you're going to learn that is completely different. So, for example, take this [INAUDIBLE] space. What I'm doing now is breaking it down in different ways that are not mutually exclusive. So, here, what I'm thinking about when I'm building this model is that there are three factors that explain the two inputs I'm seeing. So this is a two-dimensional input space, and I'm breaking the space into different regions. For example, the black line here tells me whether you're on one side of it or the other side. On that side, it's [? T1 ?] equals 1. On the other side, it's [? T1 ?] equals 0. So this is a bit that tells me whether I'm in this set or that set. And I have this other bit that tells me whether I'm in this set or that set. And now you can see that the number of regions I've defined in this way can be much larger than the number of parameters, because the number of parameters only grows with the number of factors-- the number of these [INAUDIBLE]. So by being smart about how we define those regions-- by allowing the [INAUDIBLE] to help us-- you can get a potentially exponential gain in expressive power.

Of course, from the machine learning point of view, this comes with an assumption. The assumption is that when I learn about being on that side or that side, it's meaningful [INAUDIBLE] in some sense-- not quite in a statistical sense-- of what happens with the other configurations-- the other bits. So that makes sense if you think of, OK, these are images, and this bit is telling me, is this a male or a female? This one's telling me, does he wear glasses or not? Is he tall or short, something like that. So if you think about these factors as [INAUDIBLE] meaningful things, usually you can vary them independently, like the causes that explain the world around us. And that's why you're able to generalize. You're assuming something about the world that gives you a kind of exponential power of representation.
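A hypothetical numeric version of that counting argument: a handful of hyperplanes, each costing one parameter vector, jointly carve the input space into far more regions than there are parameters (all sizes here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

n_factors, dim = 20, 10
W = rng.normal(size=(n_factors, dim))            # one hyperplane per factor
b = rng.normal(size=n_factors)

points = rng.uniform(-3, 3, size=(100000, dim))
codes = (points @ W.T + b > 0).astype(int)       # each point -> 20-bit code

n_regions = len(np.unique(codes, axis=0))
n_params = W.size + b.size
print(f"{n_params} parameters define {n_regions} distinct observed regions")
# a local, one-parameter-per-region method would need on the order of
# n_regions parameters to distinguish the same regions
```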
Now, of course, in the real world, the features we care about-- the factors we care about-- are not going to be simple linear functions of the input. So that's one reason why we need deep representations. Otherwise, a single level would be enough.

Let me move on, because time is flying. So this slide is stolen from my brother, Samy, who gave a talk here not long ago, where they used this idea of representations in a very interesting way, where you have data of two different modalities. You have images, and you have text queries-- short sequences of words. And they learn a representation for images, so they map the image to some high-dimensional vector, and they learn a function that represents queries, so they map the query to a point in the same high-dimensional space. And they learn them in such a way that when someone types "dolphin" and then is shown an image of a dolphin and clicks on it, the representation for the image and the representation for the query end up close to each other. And in this way, once you've learned that, you can of course do things like answering new queries you've never seen and finding images that match queries that somehow you haven't seen before.

One question that people outside of machine learning ask when they consider what machine learning people are doing is: this is crazy-- humans can learn from very few examples, and you guys need thousands or millions of examples. I mean, you're doing something wrong. And they're right. So how do humans manage to constantly learn something very complicated from just a few examples? Like, how do students learn something? Well, there are a number of answers. One is that brains don't start from scratch. They have some priors. And in particular, I'm interested in generic priors that allow us to generalize to things that evolution didn't train our species to do, but that we still do very well. So we have some very general purpose priors we are born with, and I'd like to figure out what they are, because we could exploit them as well. Also-- and this is very, very important-- if you asked a newborn to do something, it wouldn't work very well. But of course, an adult has learned a lot of things before you give him a few examples. And so he's transferring knowledge from [INAUDIBLE]. This is crucial. And the way he's doing that is that he's built in his mind representations of the objects-- of the types of the modalities in which the examples are given. And these representations capture the relationships between the factors-- the explanatory factors that explain what is going on in the particular setup of the new task. And he's able to do that from unlabeled data-- from examples that were unrelated to the task we're trying to solve.

So one of the things that humans are able to do is what's called semi-supervised learning. They're able to use examples that are not specifically for the task you care about in order to generalize. They are able to use information about the statistical structure of the things around us to quickly answer new questions. So here, let's say someone gives me just two examples, and we want to discriminate between the green and the blue. A classical algorithm would do something like put a straight line in between. But what if you knew about all these other points that are not [INAUDIBLE] related to your task, but are configurations that are plausible under the [INAUDIBLE] distribution?
So those [INAUDIBLE] ones-- you don't know if they are green or blue. But from the structure here, we guess that these ones are all blue and these ones are all green. And so you would put your decision boundary like this. So we're trying to take advantage of data from other tasks to find something generic about the world-- like [INAUDIBLE] usually happen in this direction-- and use that to quickly generalize from very few examples to new examples.

So among the motivations for learning about depth, there are theoretical motivations, which come from the discovery of families of functions-- mathematical functions-- that can be represented very efficiently if you allow representations with more levels, but that might require exponentially more parameters-- bigger representations-- if you're only allowed one or two levels. Even though one or two levels are enough to represent any function, it might be very inefficient. And of course, there are biological motivations: the brain seems to have this kind of deep architecture. It's especially true of the visual cortex, which is the part we understand best. And the cortex seems to have a generic learning algorithm-- the same principles seem to be at work, in terms of learning, everywhere in the cortex. Finally, there are cognitive motivations [INAUDIBLE]. We learn simpler things first, and then we compose these simpler things to build high-level abstractions. This has been exploited, for example, in the work at [? Stanford-- ?] by [INAUDIBLE] and [INAUDIBLE] and others-- to show how [INAUDIBLE] representations can learn simple things like edges, combine them to form parts, and combine those to form faces and things like that. Another sort of simple motivation is how we program computers. Do we program computers by having one main program that is just a flat bunch of lines of code? Or do we program computers by having functions and subroutines, where subroutines call other subroutines? If we were forced to program the flat way, it wouldn't work very well. But most of machine learning is basically stuck with the flat version-- not in the programs we use but in the structure of the functions that are learned.

And there are also, of course, motivations from looking at what can be achieved by exploiting depth. So I'm stealing this slide from another Google talk, given by Geoff Hinton last summer, which shows how speech recognition-- where the standard approach has been the state-of-the-art for 30 years-- can be substantially improved by exploiting these multiple levels of representation. Even-- and this is something new that impressed me a lot-- even when the amount of data available is huge, there is a gain from using these representation learning algorithms. And this all comes from something that happened in 2006, when first Geoff Hinton, followed by a group here in Montreal and [INAUDIBLE] group at NYU in New York, found that you could actually train these deep neural networks by using a few simple tricks. And the simple trick, essentially, was to train layer by layer using unsupervised learning, although recent work now allows us to train deep networks without this trick, using other tricks.
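A rough sketch of that layer-by-layer idea-- the stacking structure is the point here; the per-layer learner below is just PCA standing in for the RBMs or auto-encoders actually used, and all sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_unsupervised_layer(X, n_hidden):
    # stand-in for one layer of unsupervised learning (an RBM or auto-encoder
    # in the talk); here simply PCA via the SVD, for brevity
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:n_hidden].T                    # (n_in, n_hidden) projection
    return lambda Z: np.tanh((Z - mean) @ W)

X = rng.random((200, 64))                  # unlabeled data
layers, H = [], X
for n_hidden in [32, 16, 8]:               # greedy: one layer at a time
    f = train_unsupervised_layer(H, n_hidden)
    layers.append(f)
    H = f(H)                               # this layer's output becomes the
                                           # next layer's training data
print(H.shape)                             # (200, 8) top-level features
# a supervised classifier (or global fine-tuning) would then be trained on H
```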
This has given rise to lots of industrial interest, as I mentioned-- not only in [INAUDIBLE] but also in [INAUDIBLE], for example. I'm going to talk about some competitions we've won using deep learning. So, last year we won a sort of transfer learning competition, where you try to take representations learned from some data and apply them to other data that relates to similar but different tasks. There was one competition where the results were announced at ICML 2011-- [INAUDIBLE] [? 2011 ?]-- and another one at NIPS 2011. So this is less than a year ago. And what we see in those pictures is how the [INAUDIBLE] improves with more layers. What each of these graphs has on the x-axis is the log of the number of labeled examples used for training the machine. And the y-axis is [INAUDIBLE], essentially, so you want this to be high. And for this task, as you add more levels of representation, you especially get better in the case where you have very few labeled examples-- the thing I was talking about that humans can do so well-- generalizing from very few examples, because the representation has been learned earlier on using lots of other data.

One of the learning algorithms that came out of my lab that has been used for this is called the denoising auto-encoder. And what it does, in principle, is pretty simple. To learn a representation, you take each input example and you corrupt it by, say, setting some of the [INAUDIBLE] to zero [INAUDIBLE]. And then you learn a representation such that you can reconstruct the input. But you want to reconstruct the uncorrupted input-- the clean input-- that's why it's called denoising. So you try to make this as close as possible to the raw, uncorrupted input. We can show that this essentially models the density of the [INAUDIBLE] distribution. And you can learn these representations and stack them on top of each other.

How am I doing with time?
MALE SPEAKER: 6:19.
YOSHUA BENGIO: Huh?
MALE SPEAKER: 6:19.
YOSHUA BENGIO: I have until [? when? ?]
MALE SPEAKER: [? Tomorrow ?] [? morning. ?]
MALE SPEAKER: As long as [INAUDIBLE].
MALE SPEAKER: Just keep going.
YOSHUA BENGIO: OK. [INAUDIBLE].
[INTERPOSING VOICES]
[LAUGHING]
YOSHUA BENGIO: OK, so I [INAUDIBLE] here a connection between those denoising auto-encoders and the manifold learning idea that I was mentioning earlier. So how do these algorithms discover the manifolds-- the regions where the configurations of the variables are plausible-- where the distribution concentrates? We're back to the same picture as before. These are our examples. And what we're trying to do is learn a representation-- a mapping from the input space [INAUDIBLE] here to a new space, such that we can essentially recover the input-- in other words, we don't lose information. But at the same time, because of the denoising part, you can show that what this is trying to do is also throw away information. So it seems crazy: you want to keep all the information, but you also want to throw away information. But there's a catch. You only want to be able to reconstruct these examples-- not necessarily any configuration of inputs. So you're trying to find a function which preserves the information for these guys-- in other words, it behaves like the identity function when applied to these guys. But when you apply it in other places, it's allowed to do anything it wants-- and it's also learning this [? new ?] function.
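A minimal sketch of that denoising auto-encoder training loop, with details the talk doesn't pin down filled in as assumptions (masking noise, tied weights, sigmoid units, a toy binary dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = (rng.random((500, 20)) < 0.3).astype(float)   # toy binary inputs

n_in, n_hid = X.shape[1], 10
W = rng.normal(scale=0.1, size=(n_in, n_hid))     # tied encoder/decoder weights
b = np.zeros(n_hid)                               # hidden bias
c = np.zeros(n_in)                                # reconstruction bias
lr, corruption = 0.1, 0.3

for epoch in range(10):
    for x in X:
        # corrupt: randomly zero out a fraction of the inputs
        x_tilde = x * (rng.random(n_in) > corruption)
        # encode the corrupted input, then decode
        h = sigmoid(x_tilde @ W + b)
        x_hat = sigmoid(h @ W.T + c)
        # cross-entropy gradients, measured against the CLEAN input x
        d_out = x_hat - x
        d_hid = (d_out @ W) * h * (1 - h)
        W -= lr * (np.outer(x_tilde, d_hid) + np.outer(d_out, h))
        b -= lr * d_hid
        c -= lr * d_out
```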
So in order to do that, let's see what happens. Let's consider a particular point here-- a particular example. It needs to be distinguished from its neighbor. In the representation, [INAUDIBLE]. The representation you learn for that guy has to be different enough from that guy's that we can actually recover and distinguish this one from this one-- so that we can learn an inverse mapping, an approximate inverse mapping, from the representation. So that means you have to have a representation which is sensitive to changes in that direction. When I move slightly from here to here, the representation has to change slightly as well. On the other hand, if I move in this direction, then the representation doesn't need to capture that. It could be constant as I move in that direction. In fact, it wants to be constant in all directions. But what's going to happen is that it's going to be constant in all directions except the directions it actually needs in order to reconstruct the data. And in this way-- through the derivatives of this representation function-- you recover the directions of the manifold: the directions where, if I move, I still stay in a region of high probability. That's what the manifold really means. So we can get rid of this direction.

And recently, we came up with an algorithm that you can use to sample from the model. If you have an understanding of the manifold as something that tells you, at each point, these are the directions you're allowed to move in so as to stay in high probability [INAUDIBLE]-- the directions that keep you tangent to the manifold-- then basically the algorithm goes: we are at a point; we move in the directions that our algorithm discovered to be good directions of change-- plausible directions of change. That might correspond to something like taking an image and translating it or rotating it, or doing something like removing part of the image. And then we project back towards the manifold-- it turns out that the reconstruction function does that. And then we iterate that random walk to get samples from the model. And we've applied this to modeling faces and digits.
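One way to see "the directions the representation is sensitive to" numerically is through the Jacobian of the encoder. A sketch of the mechanics, with a random encoder standing in for a trained one (after training, the leading directions would estimate the manifold's tangent directions; here they are meaningless and only the computation is shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 50, 20
W = rng.normal(scale=0.1, size=(n_in, n_hid))   # would come from training
b = np.zeros(n_hid)

x = rng.random(n_in)
h = sigmoid(x @ W + b)

# Jacobian dh/dx: rows = hidden units, columns = input directions
J = (h * (1.0 - h))[:, None] * W.T              # (n_hid, n_in)

# the right singular vectors with the largest singular values are the input
# directions along which the representation changes most -- the model's
# estimate of the tangent directions of the data manifold
_, s, Vt = np.linalg.svd(J)
tangent_directions = Vt[:3]                     # top 3 directions
print(s[:3], tangent_directions.shape)
```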
Now, let's come back to this question of what is a good representation. People in computer vision have used the term invariance a lot. And it's a word that's used a lot when you handcraft features. So, remember, at the beginning I said that the way most machine learning is applied is that you take your raw data and you handcraft features based on your knowledge of what matters and what doesn't matter. For example, if your input is images, you'd like to design features that are going to be insensitive to translations of your input, because typically the category you're trying to detect should not depend on a small translation. So this is the idea of invariant features. But if we want to do unsupervised learning, where no one tells us ahead of time what matters and what doesn't matter-- what the task is going to be-- then how do we know which invariances matter? For example, let's say we're doing speech recognition. If you're doing speech recognition, then you want to be invariant to who the speaker is, and you want to be invariant to what kind of microphone it is and what the volume of the sound is. But if you're doing speaker identification, then you want to be invariant to what the person says, and you want to be very sensitive to the identity of the person. But if someone gives you speech, and you don't know if it's going to be used for recognizing [INAUDIBLE] or for recognizing people, what should you do?

Well, what you should be doing is learning to disentangle factors-- basically, discovering that in speech, the things that matter are the [INAUDIBLE], the person, the microphone, and so on. These are the factors that you'd like to discover automatically. And if you're able to do that, then my claim is that you can essentially get around the curse of dimensionality-- you can solve very hard problems. There's something funny that happens with the deep learning algorithms I was talking about earlier, which is that if you train these representations with purely unsupervised learning, you discover that the features-- the representations they find-- have some form of disentanglement: some of the units in the [INAUDIBLE] are very sensitive to some of the underlying factors. They're very sensitive to one factor and very insensitive to the other factors. So this is what disentangling is about. But nothing told these algorithms what those factors would be in the first place. So something good is happening, but we don't really understand why. And we'd like to understand why.

One of the things that you see in many of these algorithms is the idea of so-called sparse representations. So what is that? Well, up to now, I've talked about representations as just a bunch of numbers that we associate with an input. But one thing we can do is learn representations with the property that many of those numbers happen to be zero, or some other constant value-- but zero is very convenient. And it turns out, when you do that, it helps a lot, at least for some problems. So that's interesting. And I conjecture that it helps us disentangle the underlying factors in problems where, basically, for any example, there are only a few concepts and factors that matter. So in the scene that comes to my eyes right now, of all the concepts that my brain knows about, only a few are relevant to this scene. And it's true of almost any input that comes to my sensors. So it makes sense to have representations that have this property as well-- that even though we have a large number of possible features, most of them are simply not applicable to the current situation. Not applicable, in this case, means zero. So just by forcing many of these features to output "not applicable", somehow we're getting better representations. This has been used in a number of papers. And we've used it with so-called rectifier neural networks, in which the units compute a function like this on top of the usual linear transformation they perform. The result is that this function outputs 0 for negative inputs. So when x here is some weighted sum from the previous layer, what happens is that either the output is a positive real number or the output is 0. So if the input is sort of random and centered around 0, then half of the time those features will output 0. And if you just learn to shift this a little bit to the left, then 80% of the time or 95% of the time the output will be 0. So it's very easy to get sparsity with these kinds of units.
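A quick numeric check of that claim, with made-up centered pre-activations (the bias shifts are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

pre_activations = rng.normal(size=(1000, 500))    # centered around 0

def relu(a, bias=0.0):
    return np.maximum(0.0, a + bias)

for bias in [0.0, -0.5, -1.5]:
    h = relu(pre_activations, bias)
    print(f"bias {bias:+.1f}: {np.mean(h == 0):.0%} of outputs are exactly 0")
# roughly 50%, 69%, 93% -- sparsity comes almost for free with rectifiers
```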
It turns out that these units are sufficient to learn very complicated things. And that was used in particular in a really outstanding system built by Alex Krizhevsky and [INAUDIBLE] and [INAUDIBLE] with Geoff Hinton in Toronto recently, where they obtained amazing results on one of the benchmarks that computer vision people really care about-- [INAUDIBLE] with [? 1,000 ?] classes. So this contains millions of images taken from Google Image search and 1,000 classes that you're trying to classify. So these are images like this, and there are 1,000 different categories you want to detect. And this shows some of the outputs of this model-- [INAUDIBLE] obviously doing well. And they managed to bring the state-of-the-art-- which had been making small incremental changes, from say 27% to 26%-- down to 17% on this particular benchmark. That's pretty amazing.

And one of the tricks they used-- which I've been doing publicity for-- is called dropout. Geoff Hinton is speaking a lot about this this year; next year it will be something else. And it's a very nice trick. Basically, the idea is to add some kind of randomness to the typical neurons we use. So you'd think that randomness hurts, right? If we learn a function-- say, think about the brain doing something-- if you had noise in the computations of the brain, you'd think it hurts. But actually, when you do it during training, it helps. And it helps for reasons that are yet to be completely understood. But the theory is that it prevents the features you learn from depending too much on the presence of the others. So, half of the features will be turned off by this trick. The idea is you take the output of a neuron and you multiply it by 1 or 0, each with probability one-half. So you turn off half of the features [INAUDIBLE]. We do that for all the layers. And at test time, you don't do this kind of thing-- you just multiply by one-half, so it averages out to the same thing. But what happens is that during training, the features learn to be more robust and more independent of each other, and to collaborate in a less fragile way. This is actually similar to the denoising auto-encoder I was talking about earlier, where we introduce corruption noise in the input. But here, you do it at every layer. And somehow, this very simple trick helps a lot in many contexts. So they've tested it on different benchmarks-- these are three image data sets, and also speech-- and in all cases, they've seen improvements.
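A minimal sketch of the dropout trick as just described-- random 0/1 masking during training, scaling by the keep probability at test time (the layer sizes and the rectifier nonlinearity here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_forward(h, W, b, train=True, keep_prob=0.5):
    a = np.maximum(0.0, h @ W + b)          # usual rectifier layer
    if train:
        # training: multiply each unit by 0 or 1, each with probability 1/2,
        # so half the features are turned off for this example
        mask = rng.random(a.shape) < keep_prob
        return a * mask
    # test time: no random masking, just scale by the keep probability
    # so the expected activation matches what was seen during training
    return a * keep_prob

W = rng.normal(scale=0.1, size=(20, 10))
b = np.zeros(10)
x = rng.random(20)
print(layer_forward(x, W, b, train=True))
print(layer_forward(x, W, b, train=False))
```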
Let's get back to the representation learning algorithms. Many of them are based on learning one layer of representation at a time. And one of the algorithms that has been very practical for doing that is called a Restricted Boltzmann Machine, or RBM. As a probability model, it's formalized this way. Basically, we're trying to model the distribution of the vector x, which is typically a vector of bits, but it could be real numbers. And we introduce a vector of bits, h. And we consider the joint distribution of these vectors, given by this formula. We're trying to find the parameters in the formula-- the b, the c, and the W-- so that P of x is as large as possible. It turns out that in this model, which has been very popular for deep learning, you need to sample from the model. In other words, the model represents a distribution, and you'd like to generate examples according to what the model thinks is plausible. And in principle, there are ways to do that-- you can do things like Gibbs sampling. However, what we and others have found is that these sampling algorithms, based on Monte Carlo Markov chain methods, have some quirks. They don't do exactly what we'd like. In particular, we say they don't mix well. So what does that mean?

For example, if you start a chain of samples-- so you're going to create a sequence of samples by making small changes, which is what Monte Carlo Markov chain methods do-- it turns out that you get chains like this, where it stays around the same kind of examples. So it doesn't move to a new category. You'd like your sampler, or your [INAUDIBLE] algorithm, to be able to visit all the plausible configurations and to jump from one region in configuration space to another one, or at least have a chance to visit all the places that matter. But there's a reason why this is happening, and I'm going to try to explain it with a picture.

So, first of all, as I mentioned, MCMC methods move in configuration space by making small steps. You start from a configuration of the variables. Let's say where I'm standing is a configuration of x,y coordinates. And I'm going to make small steps, such that if I'm in a configuration of high probability, I'm going to move to another high probability configuration. And if I'm in a low probability configuration, I'm going to tend to move to a neighboring higher probability configuration. In this way, you stay in high probability regions, and you can, in principle, visit the whole distribution. But you can see there's a problem. If this white thing here is highly probable, and the black thing there is highly probable, and the gray stuff in the middle is very implausible, how could I possibly make small moves to go from here to here-- from the white to the black?

This is illustrated in a picture like this. In this case, this is the density. OK-- so the horizontal axis represents different configurations of the variables of interest, and this is what the model thinks the distribution should be. So it gives high probability in some places. These are what we call modes-- a region that has a peak. And this is another mode. So this has [INAUDIBLE] two modes. The question is, can we go from mode to mode and make sure to visit all the modes? And the MCMC is making small steps. Now, [INAUDIBLE] the MCMC can go through these [INAUDIBLE] regions if they have enough probability-- it can move around here, then quickly go through these, do a lot of steps here, and come back, and in this way sample all the configurations that have [INAUDIBLE] probability. The problem is-- remember, I said at the beginning that there's this geometry, where the [INAUDIBLE] problems have this property: the things we care about-- the images-- occupy a very small volume in the space of configurations of pixels. So the right distribution that we're trying to learn is one that has very big peaks where there's a lot of probability, and in most other places the probability will be tiny, tiny, tiny-- exponentially small. So we're trying to make moves between these modes, but now these modes are separated by vast empty spaces-- deserts of probability-- where it's impossible to cross unless you make huge jumps. So that's a really big problem, because when we consider algorithms like the RBM, what's going on is that for learning, we need to sample from the model. Initially, when the model starts learning, it says, I don't know anything-- I'm assigning a kind of uniform probability to everything. So the model thinks everything is uniform-- the probability is the same for everything-- and we can move everywhere. As it keeps learning, it starts developing these peaks-- these modes. But still there's a way to go from mode to mode and go through reasonably probable configurations.
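To make those small steps concrete, here is a sketch of block Gibbs sampling in a binary RBM, assuming the standard energy E(x,h) = -b.x - c.h - h.Wx with P(x,h) proportional to exp(-E(x,h)) (presumably the formula on the slide, which the transcript doesn't capture). Each step only resamples given the current state-- exactly the kind of local move that struggles to cross the low-probability deserts between modes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden = 20, 10
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))   # would come from training
b = np.zeros(n_visible)                                 # visible biases
c = np.zeros(n_hidden)                                  # hidden biases

def gibbs_chain(x0, n_steps):
    x, samples = x0.copy(), []
    for _ in range(n_steps):
        # block Gibbs: h ~ P(h | x), then x ~ P(x | h)
        p_h = sigmoid(x @ W + c)
        h = (rng.random(n_hidden) < p_h).astype(float)
        p_x = sigmoid(h @ W.T + b)
        x = (rng.random(n_visible) < p_x).astype(float)
        samples.append(x)
    return np.array(samples)

chain = gibbs_chain((rng.random(n_visible) < 0.5).astype(float), 100)
print(chain.shape)     # successive samples differ only by small, local changes
```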
As learning becomes more advanced, you see these peaks emerge, and now it becomes impossible to cross. Unfortunately, we need the sampling to learn these algorithms. So there's a chicken and egg problem: we need the sampling, but if the sampling doesn't work well, the learning doesn't work well, and so we can't make progress. So the one thing I wanted to talk about is a direction of solution for this problem that we are starting to explore in my lab. And it involves exploiting-- guess what-- deep representations. The idea is that instead of making these steps in the original space of the inputs where we observe things, if we did the MCMC in an abstract, high-level representation, maybe things would be easier.

So let's consider, for example, something like images of digits. Suppose we had a machine that had discovered that the factors that matter for these images are something like: OK, is the background black and the foreground white, or vice versa-- so this is one bit that says flip black and white. And what is the category-- zero, one, two, three-- so that's just 10 bits that tell us what the category is. And what's the position of the digit in the image. So these are high-level factors that you could imagine being learned or discovered. And if it discovered these things, then if you represent the image in that space, the MCMC would be much easier. In particular, you could go from, say, one of these guys to these guys directly, simply because maybe these are the zeros and these are the threes, and there's one bit-- or two bits-- that allows you to flip from zero to three. So in the space where my representation has a bit for zero and a bit for three, I just need to flip two bits. And that's easy-- it's a small move in that space. In the space of abstract representations, it's easy to generate data, whereas in the original space it's difficult.

One way to see this visually is to interpolate between examples at different levels of representation. So this is what we've done. If you look in the pixel space, and you interpolate between this picture of a nine and this picture of a three with linear interpolation, what you see is that between the image of a nine and the image of a three, you have to go through images that don't look like anything-- I mean, don't look like a digit-- the things that the model has seen. And so the MCMC, as it walks, goes: oh, that's a plausible thing; oh no, this is not very plausible; this is worse; I'm coming back. So it's never going to go to the other side of this desert of probability. So this is a one-dimensional [? illustration ?] of the example I was trying to explain. Now, what you see with the other two lines is the same thing but at different levels of representation, learned using unsupervised learning. And what you see is that it has learned a representation which has kind of skewed the space, so that somehow I can make lots of small changes and stay in three, and suddenly just a few pixels flip and it becomes a nine, magically. And so I don't have to stay very long in the low probability region. In fact, it's not even so implausible-- all of these moves are rather plausible. So you can smoothly move from mode to mode without having to go through these low probability regions. It's not that it actually discovered those exact bits, but it's discovered something that makes the job of sampling easier. And we've done experiments to validate that.
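A sketch of that interpolation comparison, using random stand-ins for a trained encoder/decoder pair (so only the mechanics are shown, not the learned behavior; sizes and the tied-weight choice are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_pixels, n_hidden = 784, 100
W = rng.normal(scale=0.05, size=(n_pixels, n_hidden))   # would come from training
b, c = np.zeros(n_hidden), np.zeros(n_pixels)

encode = lambda x: sigmoid(x @ W + b)
decode = lambda h: sigmoid(h @ W.T + c)                  # tied weights

x_a, x_b = rng.random(n_pixels), rng.random(n_pixels)   # e.g. a "3" and a "9"
alphas = np.linspace(0, 1, 9)

# pixel-space path: passes through implausible in-between images
pixel_path = [(1 - a) * x_a + a * x_b for a in alphas]

# representation-space path: interpolate the codes, then decode
h_a, h_b = encode(x_a), encode(x_b)
deep_path = [decode((1 - a) * h_a + a * h_b) for a in alphas]

print(np.shape(pixel_path), np.shape(deep_path))         # (9, 784) each
```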
So the general idea is that instead of sampling in the original space, we're going to learn these representations and then do our sampling, which iterates between representations, in that high-level space. And then, once we've found something in that abstract representation, because we have inverse mappings, we can map back to the input space and get, say, the digit we care about or the face we care about. We've applied this, for example, to face images. And what we find is that-- for example, the red here uses a deep representation, and the blue here uses a single-layer RBM-- for the same number of steps, it can visit more modes-- more classes-- by using a deeper representation.

So I'm almost done. What I would like you to keep in mind is that machine learning involves really interesting [INAUDIBLE] challenges that have a geometric nature in particular. And in order to face those challenges, something we found very useful is to allow machines to look for abstraction-- to look for high-level representations of the data. And I think that we've only scratched the surface of this idea-- the algorithms we have now still discover rather low-level abstractions. There's a lot more that could be done if we were able to discover even higher levels of abstraction. Ideally, what the high-level abstractions do is disentangle-- separate out-- the different underlying factors that explain the data-- the factors we don't know but we'd like the machine to discover. Of course, if we know the factors, we can somehow cheat and give them to the machine, by telling it: here are some random variables that we know about, and here are the values of these variables in such and such setting. But you're not going to be able to do that for everything. We need machines that can make sense of the world by themselves, to some extent. So we've shown that more abstract representations give rise to successful transfer-- being able to generalize to new domains, new languages, new classes. And we've won [INAUDIBLE] competitions using these tricks. I'm done. Thank you very much.
[APPLAUSE]
YOSHUA BENGIO: Before I take questions, I would like to thank the members of my team, so [INAUDIBLE]. [INAUDIBLE] of course very fortunate to have for my work. And I'm open to questions. Yes? There's a microphone.
AUDIENCE: Hi. So the problem you mentioned with the Gibbs sampling-- isn't that easily solved by [INAUDIBLE]?
YOSHUA BENGIO: No. We've tried that. And [INAUDIBLE].
AUDIENCE: [INAUDIBLE] be maybe [INAUDIBLE]--
YOSHUA BENGIO: So the problem-- what happens is that if you [INAUDIBLE] restart, what typically happens is that your walk is going to bring you to a few of the modes-- always the same ones. And so you're not going to visit everything.
AUDIENCE: Well, why would it bring you always to the same mode?
YOSHUA BENGIO: Because it's like a dynamical system. So somehow most routes go to Rome, for some reason. [INAUDIBLE].
AUDIENCE: Well, if you start at Paris, you'll go to Paris.
YOSHUA BENGIO: Most routes go to a few big cities. That's what happens in these algorithms. We obviously tried this, because you think, well, that should work. Uh-- and there's a question right here.
AUDIENCE: So I was just wondering about sampling from the model. I thought that was interesting, the way you had different levels of abstraction. You could sort of bridge that space, so you're able to visit more classes. But does that affect the distinction between those classes?
It seems you're bringing those classes closer together. So in terms of [INAUDIBLE]--
YOSHUA BENGIO: We had trouble understanding what was going on there. Because you'd think-- [INAUDIBLE]-- you'd think that if somehow we made the threes and the nines closer, it should now be harder to discriminate between them, whereas before we had this big empty region where we could put our separator. So I don't have a complete answer for this. But it's because we are working with these high-dimensional spaces, where things are not necessarily as our intuition would suggest. What happens, really, is that the manifold-- the region where, say, the threes are-- is a really complicated, curvy surface in high-dimensional space. And so is the one for the nines. And in the original space, these curvy surfaces are intertwined in complicated ways, which means the machinery has a hard time separating nines from threes, even though there's lots of space between nines and threes. It's not like you have nines here and threes there-- it's a complicated thing. And what we do when we move to these high-level representations is flatten those surfaces. And now, if you interpolate between points, you're kind of staying in high probability configurations. So this flattening also means that it's easier to separate them. Even though they may be closer, it's easier, because a simpler surface is enough. This is a conjecture I'm making. I haven't actually seen [INAUDIBLE] spaces like this. This is what my mind makes of it.
AUDIENCE: So my understanding is that you don't even include [INAUDIBLE] algorithm of what the representations [INAUDIBLE].
YOSHUA BENGIO: We would like to have algorithms that can learn from as few cues as possible.
AUDIENCE: So on some of your [? research ?] [INAUDIBLE], my understanding is that you worked on [INAUDIBLE] [? commission. ?]
YOSHUA BENGIO: Yes.
AUDIENCE: Did you try to understand what the representations were that the algorithms produced?
YOSHUA BENGIO: Yes.
AUDIENCE: And did you try from that to understand if there were units where using [INAUDIBLE] similar representations, or whether your representations would be different between these [INAUDIBLE] units?
YOSHUA BENGIO: Well, first of all, we don't know what humans use. But we can guess, right? So, for example, I showed early on the results of modeling faces from Stanford-- and other people have found the same, but I'll use their picture-- that these deep learning algorithms discover representations that, at least for the first two levels, seem to be similar to what we see in the visual cortex. So if you look at V1, [INAUDIBLE] the first major area of the visual cortex where [INAUDIBLE] arrive, you actually see neurons that detect, or are sensitive to, exactly the same kinds of things. And in the second layer-- and V2 is not a layer, really, it's an area-- well, you don't see these things, but you'll see combinations of edges. And [INAUDIBLE] the same group actually compared what the neurons in the brain seem to like, according to neuroscientists, with what the machine learning models discover, and they found some similarities. Another thing you can do, which I mentioned quickly, is that in some cases we know what the factors are-- we know what humans will look for. So you can just try to correlate the features that have been learned with the factors you know humans think are important. So we've done that in the work here with [INAUDIBLE]. That's not the one I want to show you. It is the right guys, but-- sorry. Yes-- this one.
So for example, we've trained models on the problem of sentiment analysis, where you give them a sentence and you try to predict whether the person liked or didn't like something-- like a book or a video or a [INAUDIBLE]-- something you find on the web. And what we found is that when we use purely unsupervised learning-- so it doesn't know that the job is sentiment analysis-- some of the features specialize on the sentiment: is this a more positive or a more negative kind of statement? And some features specialize on the domain, because we trained this across 25 different domains. So some basically detect, or are highly correlated with: is this about books, is this about food, is this about videos, is this about music? So these are underlying factors we know are present in the data, and we find that [INAUDIBLE] features tend to specialize toward these things, much more than the original input.
AUDIENCE: [SPEAKING FRENCH]
YOSHUA BENGIO: So I'm going to summarize your question in English and answer in English [INAUDIBLE]. My understanding of the question is: could we not just use our prior knowledge to build in the properties and structure [INAUDIBLE] of the representations and so on? And my answer to this is, of course-- because this is what we do, and this is what everybody in machine learning does. And it's especially true in computer vision, where we use a lot of prior knowledge from our understanding of how vision works. But my belief is that it's also interesting to see whether machines could discover these things by themselves. Because if you have algorithms that can do that, then we can still use our prior knowledge, but we can also discover other things that we didn't know or that we were not able to formalize. And with these algorithms, actually, it's not too difficult to put in prior knowledge. You can add [INAUDIBLE] variables that correspond to the things you know matter, and you can put extra terms in the [INAUDIBLE] model that correspond to your prior knowledge. You can do all these things, and some people do that. If I work with an industrial partner, I'm going to use all the knowledge that I have, because I want to have something that works in the next six months. But if I want to solve AI, I think it's worth exploring more general purpose methods. They can be combined with prior knowledge. But to make the task of discovering these general purpose methods easier, and to focus on this aspect, I find it interesting to actually avoid lots of very specific human knowledge. Although different researchers look at this differently.
AUDIENCE: [SPEAKING FRENCH]
YOSHUA BENGIO: You can do all kinds of things. There's a lot of freedom to combine our knowledge with learning, in different ways, and it's the subject of many different papers. So we can do that. Sometimes, though, when you put in too much prior knowledge, it hurts the [INAUDIBLE].
AUDIENCE: So how much does the network [INAUDIBLE] matter? And isn't that kind of the biological prior you were talking about in the very early slides? And isn't that the new kind of feature engineering of deep learning-- figuring out the right [INAUDIBLE]?
YOSHUA BENGIO: So the answer is no to both of these questions. [INAUDIBLE]. For example, the size of those layers doesn't matter much. It matters in the sense that they have to be big enough to capture the data.
And regarding the biology-- what-- the--
AUDIENCE: Yeah, because I was thinking, right-- our brain comes wired in certain ways, because we have the visual cortex and [INAUDIBLE] that. So--
YOSHUA BENGIO: There might be things we will learn that we'll be able to exploit-- things that are generic enough in the brain. And we're on the lookout for these things.
MALE SPEAKER: We'll take two more questions.
AUDIENCE: So I'm interested in the results that [INAUDIBLE] variations [INAUDIBLE]. So that reminds me of [INAUDIBLE] we try to avoid [INAUDIBLE]. But [INAUDIBLE] come with a lot of [INAUDIBLE] scheduling and such. So [INAUDIBLE].
YOSHUA BENGIO: No-- I mean, you can see this trick is so simple-- you can [INAUDIBLE] it in one line. So there's nothing complicated in this particular trick. You just add noise in a very dumb way. And somehow, it makes these networks more robust.
AUDIENCE: Well, [INAUDIBLE].
YOSHUA BENGIO: It's different from [INAUDIBLE]. It doesn't serve the same purpose.
AUDIENCE: You're also not reducing [INAUDIBLE]--
AUDIENCE: [INAUDIBLE].
YOSHUA BENGIO: Sorry?
AUDIENCE: When you're trying to [INAUDIBLE] energy [INAUDIBLE].
YOSHUA BENGIO: Yeah-- so you're trying to minimize the error under the [INAUDIBLE] created by this [INAUDIBLE]. So there is [INAUDIBLE].
AUDIENCE: And it's a very rough [? connection ?] at this point. But then for [INAUDIBLE], it's not just [? adding ?] random things. You also have the temperature decreasing toward the end of the [INAUDIBLE].
YOSHUA BENGIO: So, here, it's not happening in the space of parameters that we are trying to optimize. In [INAUDIBLE] you're trying to do some optimization. Here, the noise is used as a regularizer, meaning it's injecting sort of a prior that your neural net should be kind of robust to half of it becoming dead. It's a different way of using [INAUDIBLE].
FEMALE SPEAKER: And the last question.
AUDIENCE: In the beginning, you were talking about priors that humans have in their brain.
YOSHUA BENGIO: Absolutely.
AUDIENCE: OK-- so how do you exploit that in your algorithms? Or do you exploit that? It's a general question [INAUDIBLE]. It's not so precise.
YOSHUA BENGIO: Well, I could speak for hours about that.
AUDIENCE: Yeah, OK-- but how do you exploit that in your algorithms? Or what kind [INAUDIBLE]?
YOSHUA BENGIO: Each prior that we consider basically gives rise to a different answer to your question. So in the case of the sparsity prior I was mentioning, it comes about simply by adding a term in the [INAUDIBLE]-- adding a prior explicitly. That's how we do it. In the case of the prior that the input distribution tells us something about the task, we get it by combining unsupervised learning and supervised learning. And in the case of the prior that there are different abstractions that matter-- different levels of abstraction in the world around us-- we get the prior by adding structure in the model that has these different levels of representation. There are other priors that I didn't mention. For example, one of the priors is [INAUDIBLE] studied as the constancy prior, or slowness: the factors that matter-- that explain the world around us-- some of them change slowly over time. The set of people in this room is not changing very quickly right now. It's constant over time. Eventually, it will change. But there are properties of the world around us that remain the same for many time steps.
And this is a prior which you can also incorporate in your model, by changing the training criterion to say something like: well, some of the features in your representation should stay the same from time step to time step [INAUDIBLE]. This is very easy to put in, and it has been done [INAUDIBLE]. So for each prior, you can think of a way to incorporate it. Basically, changing the structure or changing the training criterion is usually the way we do it.
AUDIENCE: And which kinds of priors do you think we humans have? I guess it's a very complex question.
[INTERPOSING VOICES]
YOSHUA BENGIO: Basically, the question you're asking is the question I'm trying to answer. So I have some guesses, and I mentioned a few already. And our research is basically about finding out what these priors are that are generic and work for many tasks in the world around us.
AUDIENCE: OK. Thank you.
YOSHUA BENGIO: You're welcome.
MALE SPEAKER: [INAUDIBLE] and ask one last question. As some of our practitioners are not machine learning experts, do you think it's worthwhile for us to learn a bit about machine learning--
YOSHUA BENGIO: Absolutely.
[LAUGHTER]
MALE SPEAKER: What's the starting point where we can-- we have a small problem, we want to spend one month working on this kind of thing. Where would be the starting point for us?
YOSHUA BENGIO: You have one month?
[LAUGHING]
MALE SPEAKER: So is there a library or something [INAUDIBLE]?
YOSHUA BENGIO: You should take Geoff Hinton's [INAUDIBLE] course. You can probably do it in one month, because you're [INAUDIBLE].
AUDIENCE: There was also another one by Andrew [INAUDIBLE].
YOSHUA BENGIO: Yes, Andrew [INAUDIBLE] has a very good course. And there are more and more resources on the web to help people get started. There are libraries that people share, like the library from my lab, that you can use to get started quickly. There are all kinds of resources like that.
AUDIENCE: [INAUDIBLE] do it on a standard computer, or do you need a cluster, or--
YOSHUA BENGIO: It helps to have a cluster for training. Once you have the trained model, usually you can run it on your laptop. And the reason you need a cluster is that these algorithms have many knobs to set, and you want to explore many configurations of these knobs. But actually training one model can be done on a regular computer. It's just that you want to try many configurations of these knobs.
MALE SPEAKER: Thank you very much, Yoshua--
YOSHUA BENGIO: You're welcome.
MALE SPEAKER: --for this interesting talk.
[APPLAUSE]
[MUSIC]