>> It's always fun introducing people who need no introduction. But for those of you who don't know Geoff and his work, he pretty much helped create the field of machine learning as it now exists, and he was on the cutting edge back when it was the bleeding edge of statistical machine learning and neural nets, when they first made their resurgence in our lifetime. He has been a constant force pushing the analysis in the field away from just sort of the touchy-feely "let's tweak something until it thinks" and towards building systems that we can understand and that actually do useful things that make our lives better. If you read the talk announcement, you've seen all of his many accomplishments, memberships of various royal societies, etcetera, so I won't list those. Instead of taking up more of his time, I'm just going to hand the microphone over to Geoff. >> HINTON: Thank you. I've got it. So the main aim of neural network research is to make computers recognize patterns better by emulating the way the brain does it. We know the brain learns to extract many layers of features from the sensory data; we don't know how it does it. So it's a sort of joint enterprise of science and engineering. I can give you a two-minute history of neural networks. The first generation was things like Perceptrons, where you had hand-coded features that didn't adapt. So you might put the pixels of an image here, have some hand-coded features, and you'd learn the weights to the decision units--and if you wanted funding, you'd make decision units like that. These were fundamentally limited in what they could do, as Minsky and Papert pointed out in 1969, and so people stopped working on them. Then, sometime later, people figured out how to change the weights of the feature detectors as well as the weights of the decision units. So what you would do is take an image here, go forwards through a feed-forward neural network, compare the answer the network gave with the correct answer, take some measure of that discrepancy, and send it backwards through the net. As you go backwards through the net, you compute the derivatives of that discrepancy with respect to all of the connection strengths here, both these ones and those ones and those ones, and you change all these weights to get closer to the correct answer. That's backpropagation, and it's just the chain rule. It works for non-linear units, so potentially these networks can learn very powerful things, and it was a huge disappointment. I can say that now because I've got something better. Basically, we thought when we got this that we could learn anything: we'd get lots and lots of features, object recognition, speech recognition, it would all be easy. There were some problems. It worked for some things, and [INDISTINCT] can make it work for more or less anything, but in the hands of other people it had its limitations, and something else came along. So there was a temporary digression called kernel methods, where what you do is Perceptrons in a cleverer way. You take each training example and you turn the training example into a feature; basically, the feature is how similar you are to this training example. And then you have a clever optimization algorithm that decides to throw away some of those features and also decides how to weight the ones it keeps.
But when you're finished, you've just got these fixed features, produced according to a fixed recipe that didn't learn, and some weights on those features to make your decision. So it's just a Perceptron. There's a lot of clever math in how you optimize it, but it's just a Perceptron. And what happened was people forgot all of Minsky and Papert's criticisms about Perceptrons not being able to do much. Also, it worked better than backpropagation on quite a few things, which was deeply embarrassing, but that says a lot more about how bad backpropagation was than about how good support vector machines are. So if you ask what's wrong with backpropagation: it requires labeled data, and as some of you here may know, it's easier to get data than labels. As a model of the brain: you've [INDISTINCT] about that many parameters and you [INDISTINCT] for about that many seconds. Actually, twice as many, which is important to some of us. There's not enough information in labels to constrain that many parameters; you need ten to the five bits or bytes per second. There's only one place you're going to get that, and that's the sensory input. So the brain must be building a model of the sensory input, not of labels. The labels don't have enough information. Also, the learning time didn't scale well. You couldn't learn lots of layers. The whole point of backpropagation was to learn lots of layers, and if you gave it like ten layers to learn, it would just take forever. And then there are some neural things I won't talk about. So if we want to overcome these limitations, we want to keep the efficiency of a gradient method for updating the parameters, but instead of trying to learn the probability of a label given an image, where you need the labels, we're just going to try and learn the probability of an image. That is, we're going to try and build a generative model that, if you run it, will produce stuff that looks like the sensory data. Another way to say it: we're going to try and learn to do computer graphics, and once we can do that, computer vision is just going to be inferring how the computer graphics produced this image. So what kind of a model could the brain be using for that? The building blocks I'm going to use are a bit like neurons--they're intended to be a bit like neurons. They're binary stochastic neurons. They get some input, and they output either a one or a zero, so it's easy to communicate, and it's probabilistic. So this is the probability of giving a one as a function of the total input you get, which is your external input plus what you get from other neurons times the weights on the connections. And we're going to hook those up into a little module that I call a restricted Boltzmann Machine. This is the module here: it has a layer of pixels and a layer of feature detectors. So it looks like we're never going to learn lots and lots of layers of feature detectors; it looks like we've thrown out the baby with the bath water and we're now restricted to learning one layer of features, but we'll fix that later. We're going to have very restricted connectivity, hence the name: this is going to be a bipartite graph. The visible units, for now, don't connect to each other, and the hidden units don't connect to each other. The advantage of that is that if I tell you the states of the pixels, these become independent, and so you can update them independently and in parallel.
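To make that concrete, here is a minimal sketch of those binary stochastic units and the parallel update, in Python with NumPy. The function names and the layer sizes (784 pixels, 500 feature detectors) are illustrative choices, not anything from the talk:

```python
import numpy as np

def sigmoid(x):
    # Logistic function: the probability of a unit outputting a one,
    # as a function of its total input.
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_h, rng):
    # Because the graph is bipartite, the hidden units are independent
    # given the pixels, so one matrix product gives every feature
    # detector's total input at once and they can all be sampled in parallel.
    p_h = sigmoid(b_h + v @ W)                        # bias + weighted input from pixels
    h = (rng.random(p_h.shape) < p_h).astype(float)   # stochastic binary decision
    return h, p_h

# Illustrative sizes: 784 pixels, 500 feature detectors.
rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.01, size=(784, 500))    # pixel-to-feature weights
v = (rng.random(784) < 0.5).astype(float)     # a random binary "image"
h, p_h = sample_hidden(v, W, np.zeros(500), rng)
```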
So given some pixels, and given that you know the weights on the connections, you can update all these units in parallel, and you've got your feature activations very simply; there are no lateral interactions there. These networks are governed by an energy function, and the energy function determines the probability of the network adopting particular states, just like in a physical system. These stochastic units will kind of rattle around, and they'll tend to enter low energy states and avoid high energy states. The weights determine the energies linearly. The probabilities are an exponential function of the energies, so the log probabilities are a linear function of the weights, and that makes learning easy. There's a very simple algorithm that Terry Sejnowski and I invented back in 1982. In a general network you can run it, but it's very, very slow. In this restricted Boltzmann Machine, it's much more efficient. And I'm just going to show you what the Maximum Likelihood learning algorithm looks like. That is, suppose you take one of your parameters, the weight on a connection: how do I change that parameter so that when I run this machine in generative mode, in computer graphics mode, it's more likely to generate stuff like the stuff I've observed? Here's what you should do. You should take a data vector, an image, and put it here on the visible units, and then let the visible units, via their current weights, activate the feature detectors. So you provide input to each feature detector, and you make a stochastic decision about whether the feature detector should turn on: lots of positive input, it almost certainly turns on; lots of negative input, it almost certainly turns off. Then, given the binary states of the feature detectors, we reconstruct the pixels from the feature detectors, and we just keep going like that. If we run this chain for a long time--this is called a Markov chain, and this process is called alternating Gibbs sampling--if we go back [INDISTINCT] for a long time, we'll get fantasies from the model. This is the kind of stuff the model would like to produce. These are the things the model shows you when it's in its low energy states given its current parameters. So that's the sort of stuff it believes in; this is the data; and obviously you want to say to it, believe in the data, not your own fantasies. And so we'd like to change the parameters, the weights on the connections, so as to make this more likely and that less likely. And the way to do that is to measure how often a pixel i and a feature detector j are on together when I'm showing you the data vector v, and then measure how often they're on together when the model is just fantasizing. Raise the weights by how often they're on together when it's seeing data, and lower the weights by how often they're on together when it's fantasizing. What that will do is make it happy with the data--low energy--and less happy with its fantasies, and so its fantasies will gradually move towards the data. If its fantasies are just like the data, then these correlations--the probability of pixel i and feature detector j being on together--will be just the same in the fantasies as in the data, and so it'll stop learning. So it's a very simple, local learning rule that a neuron could implement, because it just involves knowing the activity of a neuron and the other neuron it connects to. And that will do Maximum Likelihood learning, but it's slow.
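The rule being described is, in symbols, delta w_ij = epsilon * (<v_i h_j>_data - <v_i h_j>_fantasy). Here is a minimal sketch of the alternating Gibbs chain and that update, continuing the NumPy code above; the biases are held fixed for brevity, and a real run would average the statistics over many training cases:

```python
def sample_visible(h, W, b_v, rng):
    # The downward pass: reconstruct the pixels from the feature detectors.
    p_v = sigmoid(b_v + h @ W.T)
    v = (rng.random(p_v.shape) < p_v).astype(float)
    return v, p_v

def max_likelihood_update(v_data, W, b_v, b_h, rng, n_gibbs=100, lr=0.1):
    # Positive phase: clamp a data vector and measure how often
    # pixel i and feature detector j are on together.
    h_data, _ = sample_hidden(v_data, W, b_h, rng)
    pos = np.outer(v_data, h_data)

    # Negative phase: alternate up and down for a long time so the
    # machine is sampling its own fantasies, then measure the same statistics.
    v, h = v_data, h_data
    for _ in range(n_gibbs):
        v, _ = sample_visible(h, W, b_v, rng)
        h, _ = sample_hidden(v, W, b_h, rng)
    neg = np.outer(v, h)

    # Raise weights on data correlations, lower them on fantasy correlations.
    return W + lr * (pos - neg)
```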
You have to let it settle for like a hundred steps. So I thought about how to make this algorithm go a hundred thousand times faster. The way you do it is, instead of running for a hundred steps, you just run for one step. So now you go up, you come down, and you go up again, and you take this difference in statistics, and that's quite efficient to do. It took me 17 years to figure this out, and in that time computers got a thousand times faster. So the change in the weight now is a learning rate times the difference between statistics measured with data and statistics measured with reconstructions of the data. That's not doing Maximum Likelihood learning, but it works well anyway. So I'm going to show you a little example. We're going to take little images of handwritten digits; this is just a toy example. We're going to put random weights on the connections, then we're going to activate the binary feature detectors given the input they're getting from the pixels, then we're going to reconstruct the image. Initially we'll get a lousy reconstruction--it will be very different from the data, because the weights are random. Then we're going to activate the feature detectors again, and we're going to increment the connections on the data and decrement the connections on the reconstructions, and that is gradually going to learn nice weights for us, as I'll show you--nice connection strengths that will make this a very good model of [INDISTINCT]. It's important to run the algorithm that way around: you increment connection strengths on the data, and you decrement them on the reconstruction, which is really a sort of screwed-up version of the data that's been infected by the prejudices of the model. The model interprets the data in terms of its features and then reconstructs something it would rather see than the data. Now, you could try running a learning algorithm where you take the data, you interpret it, you imagine the data is what you would like to see, and then you learn on that. That's the algorithm George Bush runs, and it doesn't work very well. So, after you've been doing some learning on this for not very long, I'm now showing you 25,000 connection strengths. Each of these is one of the features--take this one, say. That's a feature, and the intensity here shows you the strength of the connections to the pixels. So this feature really wants to have these pixels off, and it really wants to have these pixels on, and it doesn't care much about the other ones; mid-gray means zero. You can see the features are fairly local, and these features are now very good at reconstructing twos--it was trained on twos. So if I show it some twos it never saw before and get it to reconstruct them, you can see it reconstructs them pretty well. The funny pixels here which aren't quite right are because I'm using Vista. So you can see the reconstruction is very like the data; it's not quite identical, but it's a very good reconstruction for a wide variety of twos, and these are ones it didn't see during training, okay. Now, that's not that surprising: if you just copied the pixels up and copied them back, you'd get the same thing, right? So that would work very well. But now I'm going to show it something it didn't train on. And what you have to imagine is that Iraq is made of threes but George Bush thinks it's made of twos, okay? So here's the real data, and this is what George Bush sees.
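In code, the one-step shortcut described a moment ago--go up, come down, go up again, and take the difference in statistics--looks roughly like this. It continues the NumPy sketch above; using probabilities rather than binary samples on the final upward pass is a common implementation choice, not something stated in the talk:

```python
def cd1_update(v_data, W, b_v, b_h, rng, lr=0.1):
    # Up: activate the feature detectors from the data.
    h_data, _ = sample_hidden(v_data, W, b_h, rng)
    # Down: reconstruct the image from those features.
    v_recon, _ = sample_visible(h_data, W, b_v, rng)
    # Up again: activate the feature detectors from the reconstruction.
    _, p_h_recon = sample_hidden(v_recon, W, b_h, rng)

    # Increment on the data, decrement on the reconstruction.
    dW = np.outer(v_data, h_data) - np.outer(v_recon, p_h_recon)
    return W + lr * dW

# One sweep through a set of digit images (shapes illustrative):
# for v in digit_images:
#     W = cd1_update(v, W, b_v, b_h, rng)
```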
That demo is actually inconsistent with my previous joke, because [INDISTINCT] this learning algorithm. Sorry about that. Okay, so you see that it perverts the data into what it would like to believe, which is like what it was trained on. Okay, that was just a toy example. Now what we're going to do is train a layer of features like that, in the way I just showed you. We get these features that are good at reconstructing the data, at least for the kind of data it was trained on. And then we're going to take the activations of those features, we're going to treat those as data, and train another layer, okay. And then we're going to keep doing that, and for reasons that are slightly complicated, and that I will partially explain, this works extremely well. You get more and more abstract features as you go up, and once you've gone up through about three layers, you've got very nice abstract features that are then very good for doing things like classification. But all these features were learned without ever knowing the labels. It can be proved that every time we add another layer, we get a better model of the training data--or, to be more precise, we improve a lower bound on how good a model of the training data we've got. So here's a quick explanation of what's going on. When we learn the weights in this little restricted Boltzmann Machine, those weights define the probability, given a vector here, of reconstructing a particular vector there. So that's the probability of a visible vector given a hidden vector. They also define this whole Markov chain, if you went backwards and forwards many times. And if you went backwards and forwards many times and then looked to see what you got here, you'd get some probability distribution over the hidden vectors, with the weights defining that. So you can think of the weights as defining both a mapping from these vectors of activity over the hidden units to the pixels, to images--that's this term--and, with the same weights, a prior over these vectors of hidden activities. When you learn the next Boltzmann Machine up, you're going to say, "Let's keep this mapping, and let's learn a better model of the posterior that we've got here when we use this mapping," and you keep replacing the prior--the implicit prior defined by these weights--by a better one, which is the model defined by the next Boltzmann Machine. And so what you're really doing is dividing this task into two tasks. One is: find me a distribution that's a little bit simpler than the data distribution. Don't go the whole way and try to find a full model; just find me something a bit simpler than the data distribution, which is going to be easy [INDISTINCT] Boltzmann Machine to model--that's very nonparametric. And then: find me a parametric mapping from that slightly simpler distribution to the data distribution. So I call this creeping parameterization. What you're really doing is like taking the shells off an onion. You've got this distribution you want to model; let's take off one shell, which is this, and get a very similar distribution that's a bit easier to model, plus some parameters that tell us how to turn that one into this one, and then that solves the problem of modeling this distribution. So that's what's going on when you learn these multiple layers. After you've learned, say, three layers, you have a model that's a bit surprising. This is the last restricted Boltzmann Machine we learned (a sketch of the layer-by-layer recipe follows below, then we'll come back to this last machine).
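Here is a minimal sketch of that greedy, layer-by-layer recipe, continuing the NumPy code above. The training schedule and the layer sizes in the usage comment are illustrative guesses, not the talk's actual settings:

```python
def train_rbm(data, n_hidden, rng, epochs=10, lr=0.1):
    # Train a single restricted Boltzmann Machine with the CD-1 update
    # sketched above; returns its weights and hidden biases.
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v in data:
            W = cd1_update(v, W, b_v, b_h, rng, lr)
    return W, b_h

def train_stack(data, layer_sizes, rng):
    # Greedy, layer-by-layer training: learn one layer of features,
    # then treat its activations as "data" for the next machine up.
    weights = []
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(data, n_hidden, rng)
        weights.append(W)
        # Map the data up through the new weights; these activation
        # probabilities become the training data for the next layer.
        data = sigmoid(b_h + data @ W)
    return weights

# e.g. three layers of features, learned without any labels:
# weights = train_stack(digit_images, [500, 500, 2000], rng)
# To generate, you would alternate Gibbs sampling in the top machine and
# then make single directed down-passes through the lower weights.
```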
So here we have this sort of model that says, "To generate from the model, go backwards and forwards." But because we just kept the p of v given h from the previous models, the lower part is a directed model where you just go chunk, chunk to generate. So the right way to generate from this combined model, when you've learned three layers of features, is to take the top two layers and go backwards and forwards for a long time. Fortunately, you don't actually need to generate from it; I'm just telling you how you would if you did. We want this for perception, so really you just need to do perceptual inference, which is chunk, chunk, chunk--it's very fast. But to generate, you'd have to go backwards and forwards for a long time, and then once you've decided on a pattern here, you just go chunk, chunk; that part is directed and easy. So I'm now going to learn a particular model of some handwritten digits, but all the digit classes now. We're going to put in slightly bigger images of handwritten digits from a very standard data set where we know how well other methods do. In fact, it's a data set on which support vector machines beat backpropagation, which was bad news for backpropagation, but we're going to reverse that in a minute. We're going to learn 500 features now instead of 50. Once we've learned those, we're going to take the data, map it through these weights--which are just these weights in the opposite direction--and get some feature vectors. We're going to treat those as data and learn this guy. Then we're going to take these feature vectors and tack on ten label units. So now we need the labels, but I'll get rid of that later. And so we've got a 510-dimensional vector here and we're