
  • >> It's always fun introducing people who need no introduction. But for those of you

  • who don't know Geoff and his work, he pretty much created--he helped create the field of

  • machine learning as it now exists and was on the cutting edge back when it was the bleeding

  • edge of statistical machine learning and neural nets when they first made their resurgence

  • for the first time in our lifetime, and has been a constant force pushing it--pushing

  • the analysis in the field away from just sort of the touchy-feely, let's tweak something

  • until it thinks and towards getting--building systems that we can understand and that actually

  • do useful things that make our lives better. So you--if you read the talk announcement,

  • you've seen all of his many accomplishments and members of various royal societies, etcetera,

  • so I won't list those. I think instead of taking up more of his time, I'm just going

  • to hand the microphone over to Geoff. >> HINTON: Thank you. I've got--I got it.

  • So the main aim of neural network research is to make computers recognize patterns better

  • by emulating the way the brain does it. We know the brain learns to extract many layers

  • of features from the sensory data. We don't know how it does it. So it's a sort of joint

  • enterprise of science and engineering. The first generation of neural networks--I can

  • give you a two-minute history of neural networks. The first generation was things like Perceptrons,

  • where you had hand-coded features that didn't adapt. So you might put an image--the pixels

  • of an image here, have some hand-coded features, and you'd learn the weights to decision units

  • and if you wanted funding, you'd make decision units like that. These were fundamentally

  • limited in what they could do, as Minsky and Papert pointed out in 1969, and so people stopped doing them.
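For concreteness, a first-generation setup of this kind might be sketched as below. This is a minimal illustration, not anything from the talk: the particular hand-coded features and all the names are assumptions; only the decision weights are learned.

```python
import numpy as np

def hand_coded_features(pixels):
    # Fixed, non-adaptive features (illustrative choices, not from the talk):
    # total intensity, left/right asymmetry, top/bottom asymmetry, and a bias.
    img = pixels.reshape(4, 4)
    return np.array([
        img.sum(),
        img[:, :2].sum() - img[:, 2:].sum(),
        img[:2, :].sum() - img[2:, :].sum(),
        1.0,
    ])

def train_perceptron(images, targets, epochs=20, lr=0.1):
    # Only the weights from the features to the decision unit are learned;
    # the features themselves never adapt.
    w = np.zeros(4)
    for _ in range(epochs):
        for x, t in zip(images, targets):
            f = hand_coded_features(x)
            y = 1.0 if f @ w > 0 else 0.0
            w += lr * (t - y) * f
    return w
```

Because the features are fixed, the whole system can only solve problems that happen to be linearly separable in that hand-chosen feature space, which is the fundamental limitation referred to here.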

  • Then sometime later, people figured out how to change the weights of the feature detectors

  • as well as the weights of the decision units. So what you would do is take an image here,

  • you'd go forwards through a feed-forward neural network, you'd compare the answer the network

  • gave with the correct answer, you take some measure of that discrepancy and you send it

  • backwards through the net and as you go backwards through the net, you compute the derivatives

  • for all of the connection strengths here--both those ones and those ones and those ones--of

  • the discrepancy between the correct answer and what you got, and you change all these

  • weights to get closer to the correct answer. That's backpropagation, and it's just the

  • chain rule. It works for non-linear units so potentially, these can learn very powerful

  • things and it was a huge disappointment. I can say that now because I got something better.

  • Basically, we thought when we got this that we could learn anything and we'd get lots

  • and lots of features--object recognition, speech recognition, it would all be easy. There were

  • some problems, it worked for some things, [INDISTINCT] can make it work for more or

  • less anything. But in the hands of other people, it has its limitations and something else

  • came along so there was a temporary digression called kernel methods where what you do is

  • you do Perceptrons in a cleverer way. You take each training example and you turn the

  • training example into a feature. Basically the feature is: how similar are you to this training

  • example. And then, you have a clever optimization algorithm that decides to throw away some

  • of those features and also decides how to weight the ones it keeps. But when you're

  • finished, you just got these fixed features produced according to a fixed recipe that

  • didn't learn and some weights on these features to make your decision. So it's just a Perceptron.
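The kernel idea as just described might be sketched as below: each training example becomes a fixed similarity feature, and only the weights on those features are learned. Everything here is an illustrative assumption: a Gaussian similarity stands in for the kernel, and a simple mistake-driven kernel-perceptron update stands in for the clever SVM optimization the talk alludes to.

```python
import numpy as np

def rbf_similarity(x, example, width=1.0):
    # The feature: how similar is x to this stored training example?
    return np.exp(-np.sum((x - example) ** 2) / (2.0 * width ** 2))

def train_kernel_perceptron(X, t, epochs=50):
    # alpha[j] is the learned weight on the "similarity to example j" feature.
    # A real SVM would choose and weight these features by margin optimization;
    # here a mistake-driven update stands in for that.
    n = len(X)
    alpha = np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            score = sum(alpha[j] * t[j] * rbf_similarity(X[i], X[j]) for j in range(n))
            if t[i] * score <= 0:      # mistake: raise this example's weight
                alpha[i] += 1.0
    return alpha

def predict(x, X, t, alpha):
    score = sum(alpha[j] * t[j] * rbf_similarity(x, X[j]) for j in range(len(X)))
    return 1.0 if score > 0 else -1.0
```

Note that the features are produced by a fixed recipe and never learned; only the weights on them adapt, which is the sense in which it is "just a Perceptron".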

  • There's a lot of clever math to how you optimize it, but it's just a Perceptron. And what happened

  • was people forgot all of Minsky and Papert's criticisms about Perceptrons not being able

  • to do much. Also it worked better than backpropagation in quite a few things which was deeply embarrassing,

  • but that says a lot more about how bad backpropagation was than about how good support vector

  • machines are. So if you ask what's wrong with backpropagation: it requires labeled data,

  • and some of you here may know it's easier to get data than labels. If you have a--there's

  • a model of the brain, you [INDISTINCT] about that many parameters and you [INDISTINCT]

  • for about that many seconds. Actually, twice as many which is important to some of us.

  • There's not enough information in labels to constrain that many parameters. You need ten

  • to the five bits or bytes per second. There's only one place you're going to get that and

  • that's the sensory input. So the brain must be building a model of the sensory input,

  • not of these labels. The labels don't have enough information. Also the learning time

  • didn't scale well. You couldn't learn lots of layers. The whole point of backpropagation

  • was to learn lots of layers and if you gave it like ten layers to learn, it would just

  • take forever. And then there's some neural things I won't talk about. So if you want

  • to overcome these limitations, we want to keep the efficiency of a gradient method for

  • updating the parameters but instead of trying to learn the probability of a label given

  • an image, where you need the labels, we're just going to try and learn the probability

  • of an image. That is, we're going to try and build a generative model that if you run it

  • will produce stuff that looks like the sensory data. In other words, we're going to try and learn

  • to do computer graphics, and once we can do that, then computer vision is just going to

  • be inferring how the computer graphics produced this image. So what kind of a model could

  • the brain be using for that? The building blocks I'm going to use are a bit like neurons.

  • They're intended to be a bit like neurons. They're these binary stochastic neurons.
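A binary stochastic neuron of this kind can be sketched as below. This is a minimal illustration assuming the standard logistic activation; the function names are mine, not the talk's.

```python
import numpy as np

def sigmoid(x):
    # Probability of outputting a 1, as a logistic function of the total input.
    return 1.0 / (1.0 + np.exp(-x))

def binary_stochastic_unit(external_input, other_states, weights, rng):
    # Total input = external input plus the other neurons' binary states
    # times the weights on the connections; the output is a stochastic 0 or 1.
    total = external_input + other_states @ weights
    return 1.0 if rng.random() < sigmoid(total) else 0.0
```

The binary output makes the unit cheap to communicate, and the stochastic decision is what lets networks of these units "rattle around" between states, as described below.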

  • They get some input, and they output either a one or a zero, so it's easy

  • to communicate and it's probabilistic. So this is the probability of giving a one as

  • a function of the total input you get which is your external input plus what you get for

  • other neurons times the weights on the connections. And we're going to hook those up into a little

  • module that I call a restricted Boltzmann Machine. This is the module here, it has a

  • layer of pixels and a layer of feature detectors. So it looks like we're never going to learn

  • lots and lots of layers of feature detectors. It looks like we've thrown out the baby with

  • the bath water and we're now just restricted to learning one layer of features but we'll

  • fix that later. We're going to have very restricted connectivity, hence the name,

  • where this is going to be a bipartite graph. The visible units for now don't connect to

  • each other and the hidden units don't connect to each other. The advantage of that is if

  • I tell you the state of the pixels, these become independent and so you can update them

  • independently and in parallel. So given some pixels and given that you know the weights

  • on the connections, you can update all these units in parallel, and so you've got your

  • feature activations very simply, there's no lateral interactions there. These networks

  • are governed by an energy function and the energy function determines the probability

  • of the network adopting particular states just like in a physical system. These stochastic

  • units will kind of rattle around and they'll tend to enter low energy states and avoid

  • high energy states. The weights determine the energies linearly. The probabilities are

  • an exponential function of the energies, so the probabilities, the log probabilities, are a

  • linear function of the weights, and that makes learning easy. There's a very simple algorithm

  • that Terry Sejnowski and I invented back in 1982. In a general network, you can run

  • it but it's very, very slow. In this restricted Boltzmann Machine, it's much more efficient.
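In the standard notation for a restricted Boltzmann Machine (the bias terms $a_i$, $b_j$ are part of the usual formulation, though not mentioned explicitly in the talk), the energy of a joint configuration and the resulting probabilities are:

```latex
E(\mathbf{v},\mathbf{h}) = -\sum_i a_i v_i \;-\; \sum_j b_j h_j \;-\; \sum_{i,j} v_i h_j w_{ij},
\qquad
p(\mathbf{v},\mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}
```

Since the energy is linear in the weights, the log probabilities are linear in the weights too, which is exactly the property that makes learning easy.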

  • And I'm just going to show you what the Maximum Likelihood Learning Algorithm looks like.

  • That is, suppose you take one of your parameters, a weight on one connection: how do I change

  • that parameter so that when I run this machine in generative mode, in computer graphics mode,

  • it's more likely to generate stuff like the stuff I've observed? And so here's what you

  • should do, you should take a data vector, an image, and you should put it here on the

  • visible units and then you should let the visible units via their current weights activate

  • the feature detectors. So you provide input to each feature detector and you now make

  • a stochastic decision about what the feature detector should turn on. Lots of positive

  • input, it almost certainly turns on, lots of negative input it almost certainly turns

  • off. Then, given the binary state of the feature detectors, we now reconstruct the pixels from

  • the feature detectors and we just keep going like that. And if we run this chain for a

  • long time, this is called a Markov chain, and this process is called alternating Gibbs

  • sampling. If we go backwards and forwards for a long time, we'll get fantasies from the model.
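The alternating Gibbs sampling just described can be sketched as below. This is a minimal illustration with assumed shapes; bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(probs, rng):
    # Stochastic binary decision for every unit in parallel.
    return (rng.random(probs.shape) < probs).astype(float)

def alternating_gibbs(v0, W, steps, rng):
    # W[i, j] connects pixel i to feature detector j.
    # Up: the pixels activate the feature detectors; down: the binary
    # features reconstruct the pixels. Repeating this for a long time
    # produces "fantasies" from the model.
    v = v0
    for _ in range(steps):
        h = sample(sigmoid(v @ W), rng)
        v = sample(sigmoid(h @ W.T), rng)
    return v
```

The bipartite connectivity is what makes each half-step a single parallel update: given the pixels, every feature detector is conditionally independent, and vice versa.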

  • This is the kind of stuff the model would like to produce. These are the things that

  • the model shows you when it's in its low energy states given its current parameters. So that's

  • the sort of stuff it believes in, this is the data and obviously you want to say to

  • it, believe in the data, not your own fantasies. And so we'd like to change the parameters

  • the weights on the connections, so as to make this more likely and that less likely.

  • And the way to do that is to say, measure how often a pixel i and a feature detector

  • j are on together when I'm showing you the data vector v. And then measure how often they're

  • on together when the model is just fantasizing and raise the weights by how often they're

  • on together when it's seeing data and lower the weights by how often they're on together

  • when it's fantasizing. And what that will do is it'll make it happy with the data, low

  • energy, and less happy with its fantasies. And so it will--its fantasies will gradually

  • move towards the data. If its fantasies are just like the data, then these correlations,

  • the probability of pixel i and feature detector j being on together in the fantasies will

  • be just the same as in the data, and so it'll stop learning. So it's a very simple local

  • learning rule that a neuron could implement, because it just involves the activity

  • of a neuron and the other neuron it connects to. And that will do Maximum Likelihood Learning,

  • but it's slow. You have to settle for like a hundred steps. So I thought about how to

  • make this algorithm go a hundred thousand times faster. The way you do it is instead

  • of running for a hundred steps, you just run for one step. So now you go up, you come down

  • and you go up again. And you take this difference in statistics and that's quite efficient to

  • do. It took me 17 years to figure this out and in that time computers got a thousand

  • times faster. So, the change in the weight now is the difference--is a learning rate

  • times the difference between statistics measured with data and statistics measured with reconstructions

  • of the data. That's not doing Maximum Likelihood Learning but it works well anyway. So I'm

  • going to show you a little example, we are going to take a little image where we're going

  • to have handwritten digits; this is just a toy example. We're going to put random weights

  • on the connections then we're going to activate the binary feature detectors given the input

  • they're getting from the pixels, then we're going to reconstruct the image and initially

  • we'll get a lousy reconstruction; this will be very different from the data because they're

  • random weights. And then we're going to activate the feature detectors again and we're going

  • to increment the connections on the data and we're going to decrement the connections on

  • the reconstructions, and that is eventually going to learn nice weights for us, as I'll show you,

  • nice connection strengths that will make this be a very good model of [INDISTINCT]. It's

  • important to run the algorithm this way round: on the data you increment connection

  • strengths, and on your reconstructions--which are really a sort of screwed-up version of the data,

  • infected by the prejudices of the model--you decrement them. So the model kind of interprets the data in terms

  • of its features, then it reconstructs something it would rather see than the data. Now you

  • could try running a learning algorithm where you take the data, you interpret it, you imagine

  • the data is what you would like to see and then you learn on that. That's the algorithm

  • George Bush runs and it doesn't work very well. So, after you've been doing some learning

  • on this for not very long, I'm now showing you 25,000 connection strengths. Each of these

  • is one of the features--take this one, say. That's a feature, and the intensity here shows you

  • the strength of the connection to the pixels. So this feature really wants to have these

  • pixels off and it really wants to have these pixels on and it doesn't care much about the

  • other ones, mid-gray means zero. And you can see the features are fairly local and these

  • features are now very good at reconstructing twos. It was trained on twos. So if I show

  • you--show it some twos it never saw before, and get it to reconstruct them, you can see

  • it reconstructs them pretty well. The funny pixels here which aren't quite right are because

  • I'm using Vista. So you can see the reconstruction is very like the data and the--it's not quite

  • identical but it's a very good reconstruction for a wide variety of twos and these are ones

  • it didn't see during training, okay. Now what I'm going to do--that's not that surprising,

  • if you just copied the pixels and copy them back, you'd get the same thing, right? So

  • that would work very well. But now I'm going to show it something it didn't train on. And

  • what you have to imagine is that Iraq is made of threes but George Bush thinks it's made

  • of twos, okay? So here's the real data and this is what George Bush sees. That's actually

  • inconsistent with my previous joke because [INDISTINCT] this learning algorithm. Sorry

  • about that. Okay, so you see that it perverts the data into what it would like to believe

  • which is like what it's trained on. Okay, that was just a toy example. Now what we're

  • going to do is train a layer of features like that in the way I just showed you. We get

  • these features that are good at reconstructing the data, at least for the kind of data it's

  • trained on. And then we're going to take the activations of those features and we're going

  • to make those data and train another layer, okay. And then we're going to keep doing that

  • and for reasons that are slightly complicated and I will partially explain, this works extremely

  • well. You get more and more abstract features as you go up and once you've gone up through

  • about three layers, you got very nice abstract features that are very good then for doing

  • things like classification. But all these features were learned without ever knowing

  • the labels. It can be proved that every time we add another layer, we get a better model

  • of the training data, or to be more precise, we improve a lower bound on how good a model

  • we got of the training data. So here's a quick explanation on what's going on. When we learn

  • the weights in this little restricted Boltzmann Machine, those weights define the probability

  • of, given a vector here, reconstructing a particular vector there. So that's the probability

  • of a visible vector given a hidden vector. They also define this whole Markov chain,

  • if you went backwards and forwards many times. And so if you went backwards and forwards

  • many times, and then looked to see what you got here, you'll get some probability distribution

  • of the hidden vectors, with the weights defining that. And so you can think of the weights

  • as defining both a mapping from these vectors of activity over the hidden units to the pixels,

  • to images, that's this term, and the same weights define a prior over these patterns of hidden activity.
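One standard way of writing this down: the same weight matrix $W$ defines both factors in

```latex
p(\mathbf{v}) \;=\; \sum_{\mathbf{h}} \, p(\mathbf{h};\, W)\; p(\mathbf{v} \mid \mathbf{h};\, W)
```

where $p(\mathbf{h};\, W)$ is the prior implicitly defined by running the Markov chain for a long time, and $p(\mathbf{v} \mid \mathbf{h};\, W)$ is the directed mapping from hidden activities down to the pixels.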

  • When you learn the next level of Boltzmann Machine up, you're going to say, "Let's keep

  • this, keep this mapping, and let's learn a better model of the posterior that we've got

  • here when we use this mapping," and you keep replacing the implicit prior

  • defined by these weights with a better one, which is the p of v given h defined by the next

  • Boltzmann Machine. And so what you're really doing is dividing this task into two tasks.
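The greedy layer-by-layer procedure described earlier--train a restricted Boltzmann Machine with the one-step (contrastive divergence) shortcut, then treat its feature activations as data for the next one--might be sketched as below. This is a bare-bones illustration: biases, mini-batches, and many practical details are omitted, and all names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=5, lr=0.1, rng=None):
    # One-step contrastive divergence: raise the weights by <v_i h_j>
    # measured on the data, lower them by <v_i h_j> measured on
    # one-step reconstructions of the data.
    rng = rng or np.random.default_rng(0)
    W = 0.01 * rng.normal(size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        h_prob = sigmoid(data @ W)                       # up
        h = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h @ W.T)                       # down
        h_recon = sigmoid(v_recon @ W)                   # and up again
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W

def train_stack(data, layer_sizes):
    # Greedy layer-by-layer: train an RBM, then treat its feature
    # activations as the "data" for the next RBM.
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm_cd1(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)
    return weights
```

Each layer only ever sees the activations produced by the layer below, which is why no labels are needed anywhere in this stage.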

  • One is, find me a distribution that's a little bit simpler than the data distribution. Don't

  • go the whole way to try and find a full model, just find me something a bit simpler than

  • the data distribution. This is going to be easy [INDISTINCT] Boltzmann Machine to model,

  • that's very nonparametric. And then find me a parametric mapping from that slightly simpler

  • distribution to the data distribution. So I call this creeping parameterization. What

  • you're really doing is--it's like taking the shell off an onion, you got this distribution

  • you want to model. Let's take off one shell which is this and get a very similar distribution

  • that's a bit easier to model, and some parameters that tell us how to turn this one into this

  • one and then that's going to solve the problem of modeling this distribution. So that's what's

  • going on when you learn these multiple layers. After you've learned say three layers, you

  • have a model that's a bit surprising. This is the last restricted Boltzmann Machine

  • we learned. So here we have this sort of model that says, "To generate from the model, go

  • backwards and forwards." But because we just kept the p of v given h from the previous

  • models, this is a directed model where you sort of get chunk, chunk to generate. So the

  • right way to generate from this combined model when you've learned three layers of features,

  • is to take the top two layers and go backwards and forwards for a long time. It's fortunate

  • you don't actually need to generate from it, I'm just telling you how you would if you

  • did. We want this for perception, so really you just need to do perceptual inference, which

  • is chunk, chunk, chunk, it's very fast. But to generate, you'd have to go backwards and

  • forwards for a long time and then once you've decided on a pattern here, you go--just go

  • chunk, chunk, that's very directed and easy. So I'm now going to learn a particular model

  • of some handwritten digits but all the digit classes now. So we're going to put slightly

  • bigger images of handwritten digits from a very standard data set where we know how well

  • other methods do. In fact it's a data set on which support vector machines beat backpropagation,

  • which was bad news for backpropagation, but we're going to reverse that in a minute. We're going

  • to learn 500 features now instead of 50. Once we've learned those, we're going to take the

  • data, map it through these weights which are just these weights in the opposite direction,

  • and get some feature vectors. We're going to treat those as data and learn this guy,

  • then we're going to take these feature vectors, we're going to tack on ten labeled units.

  • So now we need the labels, but I'll get rid of that later. And so we've got a 510 dimensional

  • vector here and we're