[MUSIC PLAYING]
SPEAKER 1: All right.
Welcome back, everyone, to an introduction
to Artificial Intelligence with Python.
Now last time, we took a look at machine learning-- a set of techniques
that computers can use in order to take a set of data
and learn some patterns inside of that data, learn how to perform a task,
even if we, the programmers, didn't give the computer explicit instructions
for how to perform that task.
Today, we transition to one of the most popular techniques and tools
within machine learning, known as neural networks.
And neural networks were inspired as early as the 1940s
by researchers who were thinking about how it is that humans learn,
studying neuroscience and the human brain,
and trying to see whether or not we can apply those same ideas to computers as
well, and model computer learning off of human learning.
So how is the brain structured?
Well, very simply put, the brain consists of a whole bunch of neurons,
and those neurons are connected to one another
and communicate with one another in some way.
In particular, if you think about the structure of a biological neural
network-- something like this--
there are a couple of key properties that scientists observed.
One was that these neurons are connected to each other
and receive electrical signals from one another,
that one neuron can propagate electrical signals to another neuron.
And another point is that neurons process
those input signals, and then can be activated, that a neuron becomes
activated at a certain point, and then can propagate further signals
onto neurons in the future.
And so the question then became, could we take this biological idea of how it
is that humans learn-- with brains and with neurons--
and apply that to a machine as well, in effect,
designing an artificial neural network, or an ANN, which
will be a mathematical model for learning that is inspired
by these biological neural networks?
And what artificial neural networks will allow us to do
is they will first be able to model some sort of mathematical function.
Every time you look at a neural network, which we'll see more of later today,
each one of them is really just some mathematical function
that is mapping certain inputs to particular outputs,
based on the structure of the network, that depending
on where we place particular units inside of this neural network,
that's going to determine how it is that the network is going to function.
And in particular, artificial neural networks
are going to lend themselves to a way that we can learn what
the network's parameters should be.
We'll see more on that in just a moment.
But in effect, we want to model this in such a way that it is easy for us
to be able to write some code that allows for the network
to be able to figure out how to model the right mathematical function,
given a particular set of input data.
So in order to create our artificial neural network,
instead of using biological neurons, we're
just going to use what we're going to call units--
units inside of a neural network--
which we can represent kind of like a node in a graph,
which will here be represented just by a blue circle like this.
And these artificial units-- these artificial neurons--
can be connected to one another.
So here, for instance, we have two units that
are connected by this edge inside of this graph, effectively.
And so what we're going to do now is think
of this idea as some sort of mapping from inputs to outputs,
that we have one unit that is connected to another unit,
where we might think of this side as the input and that side as the output.
And what we're trying to do then is to figure out how to solve a problem,
how to model some sort of mathematical function.
And this might take the form of something
we saw last time, which was something like, we
have certain inputs like variables x1 and x2, and given those inputs,
we want to perform some sort of task--
a task like predicting whether or not it's going to rain.
And ideally, we'd like some way, given these inputs x1 and x2,
which stand for some sort of variables to do with the weather,
we would like to be able to predict, in this case,
a Boolean classification-- is it going to rain, or is it not going to rain?
And we did this last time by way of a mathematical function.
We defined some function h for our hypothesis function
that took as input x1 and x2--
the two inputs that we cared about processing-- in order
to determine whether we thought it was going to rain, or whether we thought it
was not going to rain.
The question then becomes, what does this hypothesis function do in order
to make that determination?
And we decided last time to use a linear combination of these input variables
to determine what the output should be.
So our hypothesis function was equal to something
like this: weight 0 plus weight 1 times x1 plus weight 2 times x2.
So what's going on here is that x1 and x2--
those are input variables-- the inputs to this hypothesis function--
and each of those input variables is being
multiplied by some weight, which is just some number.
So x1 is being multiplied by weight 1, x2 is being multiplied by weight 2,
and we have this additional weight-- weight 0--
that doesn't get multiplied by an input variable
at all, that just serves to either move the function up or move the function's
value down.
You can think of this either as a weight that's
just multiplied by some dummy value, like the number 1,
so that multiplying by 1 doesn't really change anything.
Or sometimes you'll see in the literature,
people call this variable weight 0 a "bias,"
so that you can think of these variables as slightly different.
We have weights that are multiplied by the input
and we separately add some bias to the result as well.
You'll hear both of those terminologies used
when people talk about neural networks and machine learning.
So in effect, what we've done here is that in order
to define a hypothesis function, we just need
to decide and figure out what these weights should be,
to determine what values to multiply by our inputs to get some sort of result.
Of course, at the end of this, what we need
to do is make some sort of classification
like raining or not raining, and to do that, we use some sort of function
to define some sort of threshold.
And so we saw, for instance, the step function, which is defined as 1
if the result of multiplying the weights by the inputs is at least 0;
otherwise as 0.
You can think of this line down the middle-- it's kind
of like a dotted line.
Effectively, it stays at 0 all the way up to one point,
and then the function steps--
or jumps up-- to 1.
So it's zero before it reaches some threshold,
and then it's 1 after it reaches a particular threshold.
And so this was one way we could define what
we'll come to call an "activation function," a function that
determines when it is that this output becomes active--
changes to a 1 instead of being a 0.
But we also saw that if we didn't just want a purely binary classification,
if we didn't want purely 1 or 0, but we wanted
to allow for some in-between real number values,
we could use a different function.
And there are a number of choices, but the one that we looked at was
the logistic sigmoid function that has sort of an S-shaped curve,
where we could represent this as a probability--
that may be somewhere in between-- the probability of rain is something like
0.5, and maybe a little bit later the probability of rain is 0.8--
and so rather than just have a binary classification of 0 or 1,
we can allow for numbers that are in between as well.
And it turns out there are many other different types
of activation functions, where an activation function just
takes the output of multiplying the weights together and adding that bias,
and then figuring out what the actual output should be.
Another popular one is the rectified linear unit, otherwise known as ReLU,
and the way that works is that it just takes its input
and outputs the maximum of that input and 0.
So if it's positive, it remains unchanged, but if it's negative,
it goes ahead and levels out at 0.
And there are other activation functions that we can choose as well.
But in short, each of these activation functions,
you can just think of as a function that gets applied to the result of all
of this computation.
We take some function g and apply it to the result of all of that calculation.
And this then is what we saw last time-- the way of defining
some hypothesis function that takes on inputs,
calculates some linear combination of those inputs,
and then passes it through some sort of activation function to get our output.
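As a minimal sketch of this idea in Python-- the function names here are illustrative, not code from the lecture-- a single unit is just a weighted sum of the inputs plus a bias, passed through some activation function g:

```python
import math

# Some of the activation functions discussed above.
def step(x):
    return 1 if x >= 0 else 0          # jumps from 0 to 1 at the threshold

def sigmoid(x):
    return 1 / (1 + math.exp(-x))      # S-shaped curve, output between 0 and 1

def relu(x):
    return max(0, x)                   # negative inputs level out at 0

# A single unit: a linear combination of the inputs plus a bias,
# passed through an activation function g.
def hypothesis(x1, x2, w0, w1, w2, g=step):
    return g(w0 + w1 * x1 + w2 * x2)
```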
And this actually turns out to be the model
for the simplest of neural networks, that we're
going to instead represent this mathematical idea graphically, by using
a structure like this.
Here then is a neural network that has two inputs.
We can think of this as x1 and this as x2.
And then one output, which you can think of as classifying whether or not
we think it's going to rain or not rain, for example,
in this particular instance.
And so how exactly does this model work?
Well, each of these two inputs represents one of our input variables--
x1 and x2.
And notice that these inputs are connected
to this output via these edges, which are
going to be defined by their weights.
So these edges each have a weight associated with them--
weight 1 and weight 2--
and then this output unit, what it's going to do
is it is going to calculate an output based on those inputs
and based on those weights.
This output unit is going to multiply all the inputs by their weights,
add in this bias term, which you can think of as an extra w0 term that
gets added into it, and then we pass it through an activation function.
So this then is just a graphical way of representing the same idea
we saw last time, just mathematically.
And we're going to call this a very simple neural network.
And we'd like for this neural network to be
able to learn how to calculate some function,
that we want some function for the neural network to learn,
and the neural network is going to learn what
should the values of w0, w1, and w2 be.
What should the activation function be in order
to get the result that we would expect?
So we can actually take a look at an example of this.
What then is a very simple function that we might calculate?
Well, if we recall back from when we were looking at propositional logic,
one of the simplest functions we looked at
was something like the or function, that takes two inputs--
x and y-- and outputs 1, otherwise known as true, if either one of the inputs,
or both of them, are 1, and outputs a 0 if both of the inputs are 0, or false.
So this then is the or function.
And this was the truth table for the or function-- that as long
as either of the inputs is 1, the output of the function is 1,
and the only case where the output is 0 is where both of the inputs are 0.
So the question is, how could we take this and train a neural network to be
able to learn this particular function?
What would those weights look like?
Well, we could do something like this.
Here's our neural network, and I'll propose
that in order to calculate the or function,
we're going to use a value of 1 for each of the weights,
and we'll use a bias of negative 1, and then
we'll just use this step function as our activation function.
How then does this work?
Well, if I wanted to calculate something like 0 or 0,
which we know to be 0, because false or false is false, then
what are we going to do?
Well, our output unit is going to calculate
this input multiplied by the weight.
0 times 1, that's 0.
Same thing here.
0 times 1, that's 0.
And we'll add to that the bias, minus 1.
So that'll give us some result of negative 1.
If we plot that on our activation function-- negative 1 is here--
it's before the threshold, which means the output is 0 rather than 1.
It's only 1 after the threshold.
Since negative 1 is before the threshold,
the output that this unit provides is going to be 0.
And that's what we would expect it to be, that 0 or 0 should be 0.
What if instead we had had 1 or 0, where this is the number 1?
Well, in this case, in order to calculate
what the output is going to be, we again have to do this weighted sum.
1 times 1, that's 1.
0 times 1, that's 0.
Sum of that so far is 1.
Add negative 1 to that.
Well, then the output is 0.
And if we plot 0 on the step function, 0 ends up being here--
it's just at the threshold-- and so the output here
is going to be 1, because the output of 1 or 0, that's 1.
So that's what we would expect as well.
And just for one more example, if I had 1 or 1, what would the result be?
Well 1 times 1 is 1.
1 times 1 is 1.
The sum of those is 2.
I add the bias term to that.
I get the number 1.
1 plotted on this graph is way over there.
That's well beyond the threshold.
And so this output is going to be 1 as well.
The output is always 0 or 1, depending on whether or not
we're past the threshold.
And this neural network then models the or function-- a very simple function,
definitely-- but it still is able to model it correctly.
If I give it the inputs, it will tell me what x1 or x2 happens to be.
And you could imagine trying to do this for other functions
as well-- a function like the and function, for instance,
that takes two inputs and calculates whether both x and y are true.
So if x is 1 and y is 1, then the output of x and y is 1,
but in all of the other cases, the output is 0.
How could we model that inside of a neural network as well?
Well, it turns out we could do it in the same way, except instead of negative 1
as the bias, we can use negative 2 as the bias instead.
What does that end up looking like?
Well, if I had 1 and 1, that should be 1, because true and true
is equal to true.
Well, I take 1 times 1.
That's 1.
1 times 1 is 1.
I got a total sum of 2 so far.
Now I add the bias of negative 2, and I get the value 0.
And 0, when I plot it on the activation function, is right at that threshold.
And so the output is going to be 1.
But if I had any other input, for example, like 1 and 0, well,
the weighted sum of these is 1 plus 0.
It's going to be 1.
Minus 2 is going to give us negative 1, and negative 1
is not past that threshold, and so the output is going to be zero.
So those then are some very simple functions
that we can model using a neural network, that has two inputs and one
output, where our goal is to be able to figure out
what those weights should be in order to determine what the output should be.
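To make that concrete, here is a small sketch in Python using the weights from the examples above-- a weight of 1 on each input, with a bias of negative 1 for or and negative 2 for and-- where the function names are just illustrative:

```python
def step(x):
    return 1 if x >= 0 else 0

def or_unit(x1, x2):
    # Weights of 1 on each input and a bias of -1, as in the example above.
    return step(-1 + 1 * x1 + 1 * x2)

def and_unit(x1, x2):
    # Same weights, but a bias of -2 instead.
    return step(-2 + 1 * x1 + 1 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, or_unit(x1, x2), and_unit(x1, x2))
```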
And you could imagine generalizing this to calculate more complex functions as
well, that maybe given the humidity and the pressure,
we want to calculate what's the probability that it's going to rain,
for example.
Or you might want to do a regression-style problem, where
given some amount of advertising and given what month it is maybe,
we want to predict what our expected sales are
going to be for that particular month.
So you could imagine these inputs and outputs being different as well.
And it turns out that in some problems, we're not just going to have two
inputs, and the nice thing about these neural networks is that we can compose
multiple units together-- make our networks more complex--
just by adding more units into this particular neural network.
So the network we've been looking at has two inputs and one output.
But we could just as easily say, let's go ahead
and have three inputs in there, or have even more inputs,
where we could arbitrarily decide how many inputs there
are to our problem, all of which are going to be used to calculate some sort
of output that we care about figuring out the value of.
How then does the math work for figuring out that output?
Well, it's going to work in a very similar way.
In the case of two inputs, we had two weights indicated by these edges,
and we multiplied the weights by the numbers, adding this bias term,
and we'll do the same thing in the other cases as well.
If I have three inputs, you'll imagine multiplying each of these three inputs
by its corresponding weight.
If I had five inputs instead, we're going to do the same thing.
Here, I'm saying sum up from 1 to 5.
xi multiplied by weight i.
So take each of the five input variables,
multiply them by their corresponding weight, and then add the bias to that.
So this would be a case where there are five inputs into this neural network,
for example.
But there could be arbitrarily many nodes
that we want inside of this neural network,
where each time we're just going to sum up
all of those input variables multiplied by the weight,
and then add the bias term at the very end.
And so this allows us to be able to represent
problems that have even more inputs, just by growing
the size of our neural network.
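In code, that generalization is just a dot product-- a short sketch with illustrative names:

```python
def weighted_sum(inputs, weights, bias):
    # Sum of x_i times w_i over however many inputs there are, plus the bias.
    return bias + sum(x * w for x, w in zip(inputs, weights))
```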
Now, the next question we might ask is a question
about how it is that we train these neural networks.
In the case of the or function and the and function,
they were simple enough functions that I could just
tell you what the weights should be,
and you could probably reason through it yourself
what the weights should be in order to calculate the output that you want.
But in general, with functions like predicting sales or predicting
whether or not it's going to rain, these are much trickier
functions to be able to figure out.
We would like the computer to have some mechanism of calculating what it is
that the weights should be-- how it is to set the weights--
so that our neural network is able to accurately model the function
that we care about trying to estimate.
And it turns out that the strategy for doing this,
inspired by the domain of calculus, is a technique called gradient descent.
And what gradient descent is, it is an algorithm for minimizing loss
when you're training a neural network.
And recall that loss refers to how bad our hypothesis function happens to be,
that we can define certain loss functions,
and we saw some examples of loss functions
last time that just give us a number for any particular hypothesis,
saying how poorly does it model the data?
How many examples does it get wrong?
How much worse or less bad is it compared to other hypothesis functions
that we might define?
And this loss function is just a mathematical function,
and when you have a mathematical function,
in calculus, what you could do is calculate
something known as the gradient, which you can think of as like a slope.
It's the direction the loss function is moving at any particular point.
And what it's going to tell us is in which direction
should we be moving these weights in order to minimize the amount of loss?
And so generally speaking-- we won't get into the calculus of it--
but the high-level idea for gradient descent
is going to look something like this.
If we want to train a neural network, we'll
go ahead and start just by choosing the weights randomly.
Just pick random weights for all of the weights in the neural network.
And then we'll use the input data that we have access to
in order to train the network and figure out
what the weights should actually be.
So we'll repeat this process again and again.
The first step is we're going to calculate the gradient based
on all of the data points.
So we'll look at all the data and figure out what the gradient is at the place
where we currently are-- for the current setting of the weights--
meaning, in which direction should we move the weights in order
to minimize the total amount of loss and make our solution better?
And once we've calculated that gradient--
which direction we should move in the loss function--
well, then we can just update those weights according to the gradient,
take a small step in the direction of those weights
in order to try to make our solution a little bit better.
And the size of the step that we take, that's going to vary,
and you can choose that when you're training a particular neural network.
But in short, the idea is going to be take all of the data points,
figure out based on those data points in what direction the weights should move,
and then move the weights one small step in that direction.
And if you repeat that process over and over again,
adjusting the weights a little bit at a time based on all the data points,
eventually, you should end up with a pretty good solution to trying
to solve this sort of problem.
At least that's what we would hope to happen.
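As one concrete sketch of that loop-- here assuming a single sigmoid unit with a cross-entropy loss, whose gradient for each weight is just the prediction minus the label, times the corresponding input; the names train and alpha are illustrative-- gradient descent might look like this:

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Gradient descent for a single sigmoid unit. Each example in `data` is a
# pair ([x1, x2, ...], label). For a sigmoid output with cross-entropy loss,
# the gradient for weight j is (prediction - label) * x_j.
def train(data, n_inputs, alpha=0.1, steps=1000):
    # Start by choosing the weights randomly; index 0 is the bias,
    # which we treat as a weight on a dummy input of 1.
    weights = [random.uniform(-1, 1) for _ in range(n_inputs + 1)]
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for inputs, label in data:            # based on ALL the data points
            x = [1.0] + list(inputs)
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                gradient[j] += (prediction - label) * xi
        # Move each weight one small step in the direction that reduces loss.
        weights = [w - alpha * g for w, g in zip(weights, gradient)]
    return weights
```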
Now as you look at this algorithm, a good question
to ask anytime you're analyzing an algorithm
is, what is going to be the expensive part of doing the calculation?
What's going to take a lot of work to calculate?
And in particular, in the case of gradient descent,
the really expensive part is this "all data points" part right here,
having to take all of the data points and use all of those data
points to figure out what the gradient is at this particular setting of all
of the weights, because odds are, in a big machine learning problem
where you're trying to solve a big problem with a lot of data,
you have a lot of data points to calculate with,
and figuring out the gradient based on all of those data points
is going to be expensive.
And you'll have to do it many times; you'll likely repeat this process
again and again and again, going through all the data points,
taking one small step over and over, as you try and figure
out what the optimal setting of those weights happens to be.
It turns out that we would ideally like to be
able to train our neural networks faster to be able to more quickly converge
to some sort of solution that is going to be a good solution to the problem.
So in that case, there are alternatives to just standard gradient descent,
which looks at all of the data points at once.
We can employ a method like stochastic gradient descent, which will randomly
just choose one data point at a time to calculate the gradient based on,
instead of calculating it based on all of the data points.
So the idea there is that we have some setting of the weights,
we pick a data point, and based on that one data point,
we figure out in which direction should we move all of the weights,
and move the weights in that small direction, then take another data point
and do that again, and repeat this process again and again,
maybe looking at each of the data points multiple times,
but each time, only using one data point to calculate the gradient
and figure out which direction we should move in.
Now just using one data point instead of all of the data points
probably gives us a less accurate estimate
of what the gradient actually is.
But on the plus side, it's going to be much faster to be able to calculate,
that we can much more quickly calculate what the gradient is, based on one data
point, instead of calculating based on all of the data points
and having to do all of that computational work again and again.
So there are trade-offs here between looking at all of the data points
and just looking at one data point.
And it turns out that a middle ground-- and this is also quite popular--
is a technique called mini-batch gradient descent,
where the idea there is, instead of looking at all of the data versus just
a single point, we divide our dataset up into small batches--
groups of data points-- where you can decide how big a particular batch is,
but in short, you're just going to look at a small number of points
at any given time, hopefully getting a more accurate estimate of the gradient,
but also not requiring all of the computational effort needed
to look at every single one of these data points.
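A sketch of the mini-batch variant of that same loop-- again with illustrative names-- only changes which data points contribute to the gradient on each step; with a batch size of 1 this is stochastic gradient descent, and with a batch size equal to the whole data set it reduces to ordinary gradient descent:

```python
import math
import random

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Mini-batch gradient descent for the same single sigmoid unit as before.
def train_minibatch(data, n_inputs, alpha=0.1, batch_size=8, steps=1000):
    weights = [random.uniform(-1, 1) for _ in range(n_inputs + 1)]
    for _ in range(steps):
        # Only look at a small, randomly chosen batch of points each step.
        batch = random.sample(data, min(batch_size, len(data)))
        gradient = [0.0] * len(weights)
        for inputs, label in batch:
            x = [1.0] + list(inputs)
            prediction = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                gradient[j] += (prediction - label) * xi
        weights = [w - alpha * g for w, g in zip(weights, gradient)]
    return weights
```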
So gradient descent then is this technique
that we can use in order to train these neural networks in order
to figure out what the setting of all of these weights
should be, if we want some way to try and get an accurate notion of how it is
that this function should work, some way of modeling how to transform
the inputs into particular outputs.
So far, the networks that we've taken a look at
have all been structured similar to this.
We have some number of inputs-- maybe two or three or five or more--
and then we have one output that is just predicting like rain or no rain,
or just predicting one particular value.
But often in machine learning problems, we don't just care about one output.
We might care about an output that has multiple different values associated
with it.
So in the same way that we could take a neural network
and add units to the input layer, we can likewise add outputs
to the output layer as well.
Instead of just one output, you could imagine we have two outputs,
or we could have like four outputs, for example, where in each case,
as we add more inputs or add more outputs,
if we want to keep this network fully connected between these two layers,
we just need to add more weights, so that now each of these input nodes
has four weights, one associated with each of the four outputs,
and that's true for each of these various different input nodes.
So as we add nodes, we add more weights in order
to make sure that each of the inputs can somehow
be connected to each of the outputs, so that each output
value can be calculated based on what the value of the input happens to be.
So what might a case be where we want multiple different output values?
Well, you might consider that in the case of weather
predicting, for example, we might not just care
whether it's raining or not raining.
There might be multiple different categories of weather
that we would like to categorize the weather into.
With just a single output variable, we can do a binary classification,
like rain or no rain, for instance--
1 or 0-- but it doesn't allow us to do much more than that.
With multiple output variables, I might be
able to use each one to predict something a little different.
Maybe I want to categorize the weather into one
of four different categories, something like,
is it going to be raining or sunny or cloudy or snowy,
and I now have four output variables that
can be used to represent maybe the probability that it is raining,
as opposed to sunny, as opposed to cloudy, or as opposed to snowy.
How then would this neural network work?
Well, we have some input variables that represent some data
that we have collected about the weather.
Each of those inputs gets multiplied by each
of these various different weights.
We have more multiplications to do, but these
are fairly quick mathematical operations to perform.
And then what we get is after passing them
through some sort of activation function in the outputs,
we end up getting some sort of number, where that number, you might imagine,
you can interpret as like a probability, like a probability
that it is one category, as opposed to another category.
So here we're saying that based on the inputs,
we think there is a 10% chance that it's raining, a 60% chance that it's sunny,
a 20% chance that it's cloudy, and a 10% chance that it's snowy.
And given that output, if these represent a probability distribution,
well, then you could just pick whichever one has the highest value--
in this case, sunny--
and say that, well, most likely, we think
that this categorization of inputs means that the output should be sunny,
and that is what we would expect the weather
to be in this particular instance.
So this allows us to do these sort of multi-class classifications,
where instead of just having a binary classification--
1 or 0-- we can have as many different categories as we
want, and we can have our neural network output these probabilities
over which categories are more likely than other categories,
and using that data, we're able to draw some sort of inference
on what it is that we should do.
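In code, that last step is just picking the category with the highest value-- a tiny sketch using the illustrative numbers from above:

```python
# If the four outputs are interpreted as a probability distribution over
# weather categories, the prediction is just the category with the largest value.
outputs = {"rain": 0.1, "sunny": 0.6, "cloudy": 0.2, "snowy": 0.1}
prediction = max(outputs, key=outputs.get)
print(prediction)   # sunny
```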
So this was sort of the idea of supervised machine learning.
I can give this neural network a whole bunch of data--
whole bunch of input data--
corresponding to some label, some output data--
like we know that it was raining on this day,
we know that it was sunny on that day--
and using all of that data, the algorithm
can use gradient descent to figure out what all of the weights
should be in order to create some sort of model that
hopefully allows us a way to predict what
we think the weather is going to be.
But neural networks have a lot of other applications as well.
You can imagine applying the same sort of idea
to a reinforcement learning sort of example as well.
Well, you remember that in reinforcement learning, what we wanted to do
is train some sort of agent to learn what action to take depending on what
state it currently happens to be in.
So depending on the current state of the world,
we wanted the agent to pick one of the actions
that are available to it.
And you might model that by having each of these input variables
represent some information about the state--
some data about what state our agent is currently in--
and then the output, for example, could be
each of the various different actions that our agent could
take-- action 1, 2, 3, and 4, and you might
imagine that this network would work in the same way,
that based on these particular inputs we go ahead
and calculate values for each of these outputs,
and those outputs could model which action is better than other actions,
and we could just choose, based on looking at those outputs, which
actions we should take.
And so these neural networks are very broadly applicable,
that all they're really doing is modeling some mathematical function.
So anything that we can frame as a mathematical function, something
like classifying inputs into various different categories,
or figuring out based on some input state what
action we should take-- these are all mathematical functions that we could
attempt to model by taking advantage of this neural network structure,
and in particular, taking advantage of this technique, gradient descent,
that we can use in order to figure out what the weights should be in order
to do this sort of calculation.
Now how is it that you would go about training a neural network that has
multiple outputs instead of just one?
Well, with just a single output, we could
see what the output for that value should be,
and then update all of the weights that correspond to it.
And when we have multiple outputs, at least in this particular case,
we can really think of this as four separate neural networks,
that really we just have one network here
that has these three inputs, corresponding with these three weights,
corresponding to this one output value.
And the same thing is true for this output value.
This output value effectively defines yet another neural network
that has these same three inputs, but a different set of weights
that correspond to this output.
And likewise, this output has its own set of weights as well,
and the same thing for the fourth output too.
And so if you wanted to train a neural network that had four outputs instead
of just one, in this case where the inputs are directly connected
to the outputs, you could really think of this
as just training four independent neural networks.
We know what the outputs for each of these four
should be based on our input data, and using that data,
we can begin to figure out what all of these individual weights should be,
and maybe there's an additional step at the end to make sure
that we turn these values into a probability distribution,
such that we can interpret which one is better
or more likely than another as a category, or something like that.
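One common choice for that final step-- not named explicitly here, so take it as an assumption-- is the softmax function, which turns arbitrary output values into a probability distribution:

```python
import math

def softmax(values):
    # Exponentiate each value and divide by the total, so the results are
    # all positive and sum to 1.
    exps = [math.exp(v) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1, -1.0]))   # four probabilities summing to 1
```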
So this then seems like it does a pretty good job of taking inputs and trying
to predict what outputs should be, and we'll
see some real examples of this in just a moment as well.
But it's important then to think about what
the limitations of this sort of approach are,
of just taking some linear combination of inputs
and passing it into some sort of activation function.
And it turns out that when we do this in the case of binary classification--
I'm trying to predict like does it belong to one category or another--
we can only predict things that are linearly separable, because we're
taking a linear combination of inputs and using that to define some decision
boundary or threshold.
Then what we get is a situation where, if we have this set of data,
we can learn a line that linearly separates
the red points from the blue points.
But a single unit that is making a binary classification,
otherwise known as a perceptron, can't deal with a situation like this,
where-- we've seen this type of situation before--
where there is no straight line that just
goes straight through the data that will divide the red points away
from the blue points.
It's a more complex decision boundary.
The decision boundary somehow needs to capture the things
inside of the circle, and there isn't really a line
that will allow us to deal with that.
So this is the limitation of the perceptron--
these units that just make these binary decisions based on their inputs--
that a single perceptron is only capable of learning
a linearly separable decision boundary.
All it can do is define a line.
And sure, it can give us probabilities based
on how close to that decision boundary we are,
but it can only really decide based on a linear decision boundary.
And so this doesn't seem like it's going to generalize well to situations
where real-world data is involved, because real-world data often
isn't linearly separable.
It often isn't the case that we can just draw a line through the data
and be able to divide it up into multiple groups.
So what then is the solution to this?
Well, what was proposed was the idea of a multilayer neural network,
that so far, all of the neural networks we've seen have had a set of inputs
and a set of outputs, and the inputs are connected to those outputs.
But in a multi-layer neural network, this is going to be an artificial
neural network that has an input layer still, it has an output layer,
but also has one or more hidden layers in between--
other layers of artificial neurons, or units, that
are going to calculate their own values as well.
So instead of a neural network that looks like this,
with three inputs and one output, you might imagine, in the middle here,
injecting a hidden layer--
something like this.
This is a hidden layer that has four nodes.
You could choose how many nodes or units end up going into the hidden layer,
and you could have multiple hidden layers as well.
And so now each of these inputs isn't directly connected to the output.
Each of the inputs is connected to this hidden layer, and then
all of the nodes in the hidden layer, those are connected to the one output.
And so this is just another step that we can
take towards calculating more complex functions.
Each of these hidden units will calculate its output value,
otherwise known as its activation, based on a linear combination
of all the inputs.
And once we have values for all of these nodes,
as opposed to this just being the output, we do the same thing again--
calculate the output for this node, based
on multiplying each of the values for these units by their weights as well.
So in effect, the way this works is that we start with inputs.
They get multiplied by weights in order to calculate
values for the hidden nodes.
Those get multiplied by weights in order to figure out what
the ultimate output is going to be.
And the advantage of layering things like this is it gives us an ability
to model more complex functions, that instead of just having a single
decision boundary-- a single line dividing the red points from the blue
points--
each of these hidden nodes can learn a different decision boundary,
and we can combine those decision boundaries to figure out what
the ultimate output is going to be.
And as we begin to imagine more complex situations,
you could imagine each of these nodes learning some useful property
or learning some useful feature of all of the inputs
and somehow learning how to combine those features together in order to get
the output that we actually want.
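As a sketch of that two-step computation-- with sigmoid activations everywhere and illustrative parameter names, an assumption made just for concreteness-- the forward pass for one hidden layer might look like this:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Forward pass for a network with one hidden layer: the inputs are combined
# with one set of weights to get each hidden unit's activation, and those
# activations are combined with a second set of weights to get the output.
def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    hidden = [
        sigmoid(b + sum(w * x for w, x in zip(ws, inputs)))
        for ws, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(output_bias + sum(w * h for w, h in zip(output_weights, hidden)))
```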
Now the natural question, when we begin to look at this now,
is to ask the question of, how do we train a neural network
that has hidden layers inside of it?
And this turns out to initially be a bit of a tricky question,
because the input data we are given consists of values for all
of the inputs, plus what the value of the output should be--
what the category is, for example--
but the input data doesn't tell us what the values for all of these nodes
should be.
So we don't know how far off each of these nodes
actually is, because we're only given data for the inputs and the outputs.
The reason this is called the hidden layer
is because the data that is made available to us
doesn't tell us what the values for all of these intermediate nodes
should actually be.
And so the strategy people came up with was to say that if you know what
the error or the loss is on the output node, well,
then based on what these weights are-- if one of these weights is higher than
another--
you can calculate an estimate for how much the error from this node
was due to this part of the hidden layer, or that part of the hidden layer,
based on the values of these weights,
in effect saying that, based on the error from the output,
I can backpropagate the error and figure out
an estimate for what the error is for each of the nodes in the hidden layer as well.
And there's some more calculus here that we won't get into the details of,
but the idea of this algorithm is known as backpropagation.
It's an algorithm for training a neural network
with multiple different hidden layers.
And the idea for this-- the pseudocode for it--
will again be, if we want to run gradient descent with backpropagation,
we'll start with a random choice of weights as we did before,
and now we'll go ahead and repeat the training process again and again.
But what we're going to do each time is now
we're going to calculate the error for the output layer first.
We know the output and what it should be, and we know what we calculated,
so we figure out what the error there is.
But then we're going to repeat, for every layer,
starting with the output layer, moving back into the hidden layer,
then the hidden layer before that if there are multiple hidden layers,
going back all the way to the very first hidden layer,
assuming there are multiple, we're going to propagate the error back one layer--
whatever the error was from the output--
figure out what the error should be in the layer before that, based on what
the values of those weights are.
And then we can update those weights.
So graphically, the way you might think about this
is that we first start with the output.
We know what the output should be.
We know what output we calculated.
And based on that, we can figure out, all right,
how do we need to update those weights, backpropagating
the error to these nodes.
And using that, we can figure out how we should update these weights.
And you might imagine if there are multiple layers,
we could repeat this process again and again
to begin to figure out how all of these weights should be updated.
And this backpropagation algorithm is really
the key algorithm that makes neural networks possible,
and makes it possible to take these multi-level structures
and be able to train those structures, depending
on what the values of these weights are in order to figure out
how it is that we should go about updating those weights in order
to create some function that is able to minimize the total amount of loss,
to figure out some good setting of the weights that will take the inputs
and translate it into the output that we expect.
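As a small, self-contained sketch of that idea-- here assuming one hidden layer, sigmoid activations, a squared-error loss, and per-example updates, all of which are illustrative choices rather than the lecture's own code-- backpropagation might look like this:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Backpropagation for a network with one hidden layer and a single output.
def train(X, y, n_hidden=4, alpha=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    n_inputs = X.shape[1]
    W1 = rng.normal(size=(n_hidden, n_inputs))   # start with random weights
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(size=n_hidden)
    b2 = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            # Forward pass: inputs -> hidden layer -> output.
            hidden = sigmoid(W1 @ x + b1)
            output = sigmoid(W2 @ hidden + b2)
            # Error at the output layer (squared-error loss, sigmoid output).
            delta_out = (output - target) * output * (1 - output)
            # Backpropagate: estimate each hidden node's error via the weights.
            delta_hidden = delta_out * W2 * hidden * (1 - hidden)
            # Update the weights, one small step against the gradient.
            W2 -= alpha * delta_out * hidden
            b2 -= alpha * delta_out
            W1 -= alpha * np.outer(delta_hidden, x)
            b1 -= alpha * delta_hidden
    return W1, b1, W2, b2
```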
And this works, as we said, not just for a single hidden layer,
but you can imagine multiple hidden layers, where each hidden layer--
we just defined however many nodes we want--
where each of the nodes in one layer, we can
connect to the nodes in the next layer, defining more and more complex
networks that are able to model more and more complex types of functions.
And so this type of network is what we might call a deep neural network, part
of a larger family of deep learning algorithms,
if you've ever heard that term.
And all deep learning is about is using multiple layers to be
able to predict and be able to model higher-level features inside
of the input, to be able to figure out what the output should be.
And so the deep neural network is just a neural network that
has multiple of these hidden layers, where we start at the input,
calculate values for this layer, then this layer, then this layer,
and then ultimately get an output.
And this allows us to be able to model more and more sophisticated
types of functions, that each of these layers
can calculate something a little bit different.
And we can combine that information to figure out what the output should be.
Of course, as with any situation of machine learning,
as we begin to make our models more and more complex,
to model more and more complex functions, the risk we run
is something like overfitting.
And we talked about overfitting last time
in the context of training our models to be
able to learn some sort of decision boundary, where overfitting happens
when we fit too closely to the training data, and as a result,
we don't generalize well to other situations as well.
And one of the risks we run with a far more complex neural network that
has many, many different nodes is that we
might overfit based on the input data; we
might grow over-reliant on certain nodes to calculate things purely based
on the input data, in a way that doesn't allow us to generalize very well to the output.
And there are a number of strategies for dealing with overfitting,
but one of the most popular in the context of neural networks
is a technique known as dropout.
And what dropout does is this: when we're training the neural network,
we temporarily remove units--
temporarily remove these artificial neurons
from our network, chosen at random-- and the goal here
is to prevent over-reliance on certain units.
So what generally happens in overfitting is
that we begin to over-rely on certain units inside the neural network
to be able to tell us how to interpret the input data.
What dropout will do is randomly remove some of these units
in order to reduce the chance that we over-rely on certain units,
to make our neural network more robust, to be
able to handle the situations even when we just drop out particular neurons
entirely.
So the way that might work is we have a network like this,
and as we're training it, when we go about trying
to update the weights the first time, we'll
just randomly pick some percentage of the nodes to drop out of the network.
It's as if those nodes aren't there at all.
It's as if the weights associated with those nodes aren't there at all.
And we'll train in this way.
Then the next time we update the weights, we'll pick a different set
and just go ahead and train that way, and then again randomly choose
and train with other nodes that have been dropped out as well.
And the goal of that is that after the training process,
if you train by dropping out random nodes inside of this neural network,
you hopefully end up with a network that's a little bit more robust, that
doesn't rely too heavily on any one particular node,
but more generally learns how to approximate a function in general.
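In a library like TensorFlow's Keras API, which comes up below, dropout is typically added as its own layer between other layers; the layer sizes and the rate of 0.5 here are arbitrary, illustrative choices:

```python
import tensorflow as tf

# A Dropout layer randomly removes a fraction of the previous layer's units
# on each training update (here, half of them), and uses all of them again
# at prediction time.
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
```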
So that then is a look at some of these techniques
that we can use in order to implement a neural network, to get
at the idea of taking this input, passing it
through these various different layers, in order
to produce some sort of output.
And what we'd like to do now is take those ideas and put them into code.
And to do that, there are a number of different machine learning
libraries-- neural network libraries-- that we can use that
allow us to get access to someone's implementation of backpropagation
and all of these hidden layers.
And one of the most popular, developed by Google,
is known as TensorFlow, a library that we
can use for quickly creating neural networks
and modeling them and running them on some sample data
to see what the output is going to be.
And before we actually start writing code,
we'll go ahead and take a look at TensorFlow's Playground, which
will be an opportunity for us just to play around
with this idea of neural networks in different layers,
just to get a sense for what it is that we can do by taking advantage
of a neural network.
So let's go ahead and go into TensorFlow's Playground, which you can
go to by visiting that URL from before.
And what we're going to do now is we're going to try and learn the decision
boundary for this particular output.
I want to learn to separate the orange points from the blue points,
and I'd like to learn some sort of setting of weights
inside of a neural network that will be able to separate those from each other.
The features we have access to, our input data,
are the x value and the y value, so the two values along each of the two axes.
And what I'll do now is I can set particular parameters, like what
activation function I would like to use, and I'll just go ahead
and press Play and see what happens.
And what happens here is that you'll see that just by using these two input
features-- the x value and the y value, with no hidden layers--
just take the input, x and y values, and figure out what the decision boundary
is--
our neural network learns pretty quickly that in order
to divide these two points, we should just use this line.
This line acts as the decision boundary that separates this group of points
from that group of points, and it does it very well.
You can see up here what the loss is.
The training loss is zero, meaning we were
able to perfectly model separating these two points from each other inside
of our training data.
So this was a fairly simple case of trying to apply a neural network,
because the data is very clean; it's very nicely linearly separable.
We can just draw a line that separates all of those points from each other.
Let's now consider a more complex case.
So I'll go ahead and pause the simulation,
and we'll go ahead and look at this data set here.
This data set is a little bit more complex now.
In this data set, we still have blue and orange points
that we'd like to separate from each other,
but there is no single line that we can draw
that is going to be able to figure out how to separate
the blue from the orange, because the blue is located in these two quadrants
and the orange is located here and here.
It's a more complex function to be able to learn.
So let's see what happens if we just try and predict based on those inputs--
the x- and y-coordinates-- what the output should be.
Press Play, and what you'll notice is that we're not really able
to draw much of a conclusion, that we're not
able to very cleanly see how we should divide
the orange points from the blue points, and you don't
see a very clean separation there.
So it seems like we don't have enough sophistication inside of our network
to be able to model something that is that complex.
We need a better model for this neural network.
And I'll do that by adding a hidden layer.
So now I have the hidden layer that has two neurons inside of it.
So I have two inputs that then go to two neurons inside of a hidden layer
that then go to our output, and now I'll press Play, and what you'll notice here
is that we're able to do slightly better.
We're able to now say, all right, these points are definitely blue.
These points are definitely orange.
We're still struggling a little bit with these points up here though,
and what we can do is we can see, for each
of these hidden neurons, what it is exactly
that these hidden neurons are doing.
Each hidden neuron is learning its own decision boundary,
and we can see what that boundary is.
This first neuron is learning, all right,
this line that seems to separate some of the blue points
from the rest of the points.
This other hidden neuron is learning another line
that seems to be separating the orange points in the lower
right from the rest of the points.
So that's why we're able to sort of figure out
these two areas in the bottom region, but we're still not
able to perfectly classify all of the points.
So let's go ahead and add another neuron--
now we've got three neurons inside of our hidden layer--
and see what we're able to learn now.
All right.
Well, now we seem to be doing a better job
by learning three different decision boundaries, where
each of the three neurons inside of our hidden layer
was able to much better figure out how to separate these blue points
from the orange points.
And you can see what each of these hidden neurons is learning.
Each one is learning a slightly different decision boundary,
and then we're combining those decision boundaries together
to figure out what the overall output should be.
And we can try it one more time by adding a fourth neuron there
and try learning that.
And it seems like now we can do even better
at trying to separate the blue points from the orange points,
but we were only able to do this by adding a hidden layer,
by adding some layer that is learning some other boundaries,
and combining those boundaries to determine the output.
And the strength-- the size and thickness of these lines--
indicates how high these weights are, how important each of these inputs
is, for making this sort of calculation.
And we can do maybe one more simulation.
Let's go ahead and try this on a data set that looks like this.
Go ahead and get rid of the hidden layer.
Here now we're trying to separate the blue points
from the orange points, where all the blue points are located, again,
inside of a circle, effectively.
So we're not going to be able to learn a line.
Notice I press Play, and we're really not
able to draw any sort of classification at all,
because there is no line that cleanly separates
the blue points from the orange points.
So let's try to solve this by introducing a hidden layer.
I'll go ahead and press Play.
And all right.
With two neurons in a hidden layer, we're
able to do a little better, because we effectively learned
two different decision boundaries.
We learned this line here, and we learned this line
on the right-hand side.
And right now, we're just saying, all right, well, if it's in-between,
we'll call it blue, and if it's outside, we'll call it orange.
So, not great, but certainly better than before.
We're learning one decision boundary and another, and based on those,
we can figure out what the output should be.
But let's now go ahead and add a third neuron and see what happens now.
I go ahead and train it.
And now, using three different decision boundaries
that are learned by each of these hidden neurons,
we're able to much more accurately model this distinction
between blue points and orange points.
We're able to figure out that, maybe with these three decision boundaries
combined together, we can figure out what the output should be
and how to make that sort of classification.
And so the goal here is just to get a sense
that having more neurons in these hidden layers
allows us to learn more structure in the data,
allows us to figure out what the relevant and important decision
boundaries are.
And then using this backpropagation algorithm,
we're able to figure out what the values of these weights
should be in order to train this network to be
able to classify one category of points away from another category of points
instead.
And this is ultimately what we're going to be trying to do whenever
we're training a neural network.
So let's go ahead and actually see an example of this.
You'll recall from last time that we had this banknotes file that
included information about counterfeit banknotes as opposed
to authentic banknotes, where it had four different values for each banknote
and then a categorization of whether that bank note is considered
to be authentic or a counterfeit note.
And what I wanted to do was, based on that input information,
figure out some function that could calculate
based on the input information what category it belonged to.
And what I've written here in banknotes.py
is a neural network that we'll learn just that, a network that learns,
based on all of the input, whether or not
we should categorize a banknote as authentic or as counterfeit.
The first step is the same as what we saw from last time.
I'm really just reading the data in and getting it into an appropriate format.
And so this is where more of the writing of Python code on your own
comes in, in terms of manipulating this data,
massaging the data into a format that will
be understood by a machine learning library
like scikit-learn or like TensorFlow.
And so here I separate it into a training and a testing set.
And now what I'm doing down below is I'm creating a neural network.
Here I'm using tf, which stands for TensorFlow.
Up above I said, import TensorFlow as tf.
So you have just an abbreviation that we'll often use,
so we don't need to write out TensorFlow every time we want
to use anything inside of the library.
I'm using tf.keras.
Keras is an API, a set of functions that we
can use in order to manipulate neural networks inside of TensorFlow,
and it turns out there are other machine learning
libraries that also use the Keras API.
But here, I'm saying, all right, go ahead and give me
a model that is a sequential model-- a sequential neural network--
meaning one layer after another.
And now I'm going to add to that model what layers I want inside
of my neural network.
So here I'm saying, model.add.
Go ahead and add a dense layer--
and when we say a dense layer, we mean a layer where
each of the nodes inside of the layer
is going to be connected to each node from the previous layer,
so we have a densely connected layer.
This layer is going to have eight units inside of it.
So it's going to be a hidden layer inside of a neural network with eight
different units, eight artificial neurons, each of which
might learn something different.
And I just sort of chose eight arbitrarily.
You could choose a different number of hidden nodes inside of the layer.
And as we saw before, depending on the number of units
there are inside of your hidden layer, more units
means you can learn more complex functions,
so maybe you can more accurately model the training data,
but it comes at a cost.
More units means more weights that you need to figure out how to update,
so it might be more expensive to do that calculation.
And you also run the risk of overfitting on the data if you have too many units,
and you learn to just overfit on the training data.
That's not good either.
So there is a balance, and there's often a testing process,
where you'll train on some data and maybe validate how well you're
doing on a separate set of data--
often called a validation set-- to see, all right, which setting of parameters,
how many layers should I have, how many units
should be in each layer, which one of those
performs the best on the validation set?
So you can do some testing to figure out what these hyperparameters, so-called,
should be equal to.
Next I specify what the input_shape is, meaning what does my input look like?
My input has four values, and so the input shape
is just 4, because we have four inputs.
And then I specify what the activation function is.
And the activation function, again, we can choose.
There are a number of different activation functions.
Here I'm using relu, which you might recall from earlier.
And then I'll add an output layer.
So I have my hidden layer.
Now I'm adding one more layer that will just
have one unit, because all I want to do is predict something
like counterfeit bill or authentic bill.
So I just need a single unit.
And the activation function I'm going to use here
is that sigmoid activation function, which
again was that S-shaped curve that just gave us like a probability of,
what is the probability that this is a counterfeit bill as opposed
to an authentic bill?
So that then is the structure of my neural network-- sequential neural
network that has one hidden layer with eight units inside of it,
and then one output layer that just has a single unit inside of it.
And I can choose how many units there are.
I can choose the activation function.
Then I'm going to compile this model.
TensorFlow gives you a choice of how you would like to optimize the weights--
there are various different algorithms for doing that--
what type of loss function you want to use-- again,
many different options for doing that--
and then how I want to evaluate my model.
Well, I care about accuracy.
I care about how many of my points am I able to classify correctly
versus not correctly of counterfeit or not counterfeit,
and I would like it to report to me how accurate my model is performing.
Then, now that I've defined that model, I
call model.fit to say, go ahead and train the model.
Train it on all the training data, plus all of the training labels--
so labels for each of those pieces of training data--
and I'm saying run it for 20 epochs, meaning go ahead
and go through each of these training points 20 times effectively,
go through the data 20 times and keep trying to update the weights.
If I did it for more, I could train for even longer
and maybe get a more accurate result. But then
after I fit it on all the data, I'll go ahead and just test it.
I'll evaluate my model using model.evaluate,
built into TensorFlow, that is just going to tell me,
how well do I perform on the testing data?
So ultimately, this is just going to give me
some numbers that tell me how well we did in this particular case.
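To make that concrete, here is a minimal sketch of what a model like the one just described might look like using the Keras API in TensorFlow. The optimizer, the loss function, and the randomly generated placeholder data are my own assumptions, standing in for whatever banknotes.py actually uses and for the real banknote measurements.

```python
# Minimal sketch of a sequential network like the one described above.
# The placeholder data below is random; in the real program these arrays
# would come from the banknote dataset.
import numpy as np
import tensorflow as tf

X_training = np.random.rand(1000, 4)              # four measurements per bill
y_training = np.random.randint(0, 2, size=1000)   # 0 = authentic, 1 = counterfeit
X_testing = np.random.rand(300, 4)
y_testing = np.random.randint(0, 2, size=300)

model = tf.keras.models.Sequential()

# Hidden layer: 8 densely connected units, ReLU activation, 4 inputs
model.add(tf.keras.layers.Dense(8, input_shape=(4,), activation="relu"))

# Output layer: 1 unit with a sigmoid activation, giving a probability
model.add(tf.keras.layers.Dense(1, activation="sigmoid"))

# Compile: pick an optimizer and a loss function, and track accuracy
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train for 20 epochs, then evaluate on held-out testing data
model.fit(X_training, y_training, epochs=20)
model.evaluate(X_testing, y_testing, verbose=2)
```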
So now what I'm going to do is go into banknotes
and go ahead and run banknotes.py.
And what's going to happen now is it's going
to read in all of that training data.
It's going to generate a neural network with all my inputs,
my eight hidden units inside my hidden layer,
and then an output unit, and now what it's doing is it's training.
It's training 20 times, and each time, you
can see how my accuracy is increasing on my training data.
It starts off, the very first time, not very accurate,
though better than random, something like 79% of the time,
it's able to accurately classify one bill from another.
But as I keep training, notice this accuracy value improves and improves
and improves, until after I've trained through all of the data points
20 times, it looks like my accuracy is above 99% on the training data.
And here's where I tested it on a whole bunch of testing data.
And it looks like in this case, I was also like 99.8% accurate.
So just using that, I was able to generate a neural network that
can detect counterfeit bills from authentic bills
based on this input data 99.8% of the time, at least
based on this particular testing data.
And I might want to test it with more data
as well, just to be confident about that.
But this is really the value of using a machine learning library
like TensorFlow, and there are others available for Python
and other languages as well, but all I have to do
is define the structure of the network and define the data
that I'm going to pass into the network, and then
TensorFlow runs the backpropagation algorithm
for learning what all of those weights should be,
for figuring out how to train this neural network to be able to,
as accurately as possible, figure out what the output values should
be there as well.
And so this then was a look at what it is that neural networks can do, just
using these sequences of layer after layer after layer,
and you can begin to imagine applying these to much more general problems.
And one big problem in computing, and artificial intelligence more generally,
is the problem of computer vision.
Computer vision is all about computational methods
for analyzing and understanding images, that you might have pictures
that you want the computer to figure out how to deal with,
how to process those images, and figure out how to produce
some sort of useful result out of this.
You've seen this in the context of social media websites
that are able to look at a photo that contains a whole bunch of faces,
and it's able to figure out what's a picture of whom
and label those and tag them with appropriate people.
This is becoming increasingly relevant as we
begin to discuss self-driving cars.
These cars now have cameras, and we would
like for the computer to have some sort of algorithm that
looks at the images and figures out, what
color is the light, what cars are around us and in what direction, for example.
And so computer vision is all about taking an image
and figuring out what sort of computation--
what sort of calculation-- we can do with that image.
It's also relevant in the context of something like handwriting recognition.
This, what you're looking at, is an example of the MNIST dataset--
it's a big dataset just of handwritten digits--
that we could use to, ideally, try and figure out how to predict,
given someone's handwriting, given a photo of a digit that they have drawn,
can you predict whether it's a 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, for example.
So this sort of handwriting recognition is yet another task
that we might want to apply computer vision tools and techniques toward.
This might be a task that we might care about.
So how then can we use neural networks to be
able to solve a problem like this?
Well, neural networks rely upon some sort of input,
where that input is just numerical data.
We have a whole bunch of units, where each one of them
just represents some sort of number.
And so in the context of something like handwriting recognition,
or in the context of just an image, you might
imagine that an image is really just a grid of pixels, a grid of dots,
where each dot has some sort of color, and in the context
of something like handwriting recognition,
you might imagine that if you just fill in each
of these dots in a particular way, you can generate a 2 or an 8,
for example, based on which dots happen to be shaded in and which dots are not.
And we can represent each of these pixel values just using numbers.
So for a particular pixel, for example, 0 might represent entirely black.
Depending on how you're representing color,
it's common to represent color values on a 0-to-255 range,
so that you can represent a color using eight bits for a particular value,
like how much white is in the image?
So 0 might represent all black, 255 might represent entirely white
as a pixel, and somewhere in between might represent some shade of gray,
for example.
But you might imagine not just having a single slider that determines how much
white is in the image, but if you had a color image,
you might imagine three different numerical values-- a red, green,
and blue value--
where the red value controls how much red is in the pixel,
we have one value for controlling how much green is in the pixel,
and one value for how much blue is in the pixel as well.
And depending on how it is that you set these values of red, green, and blue,
you can get a different color.
And so any pixel can really be represented in this case
by three numerical values--
a red value, a green value, and a blue value.
And if you take a whole bunch of these pixels,
assemble them together inside of a grid of pixels, then
you really just have a whole bunch of numerical values
that you can use in order to perform some sort of prediction task.
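As a small illustration of that, here is a sketch that loads an image with the Python Imaging Library (which we'll see again in a moment) and turns it into a grid of numbers; the file name bridge.png is just a placeholder for whatever image you happen to have.

```python
# Sketch: an image is really just a grid of numerical pixel values.
# "bridge.png" is a placeholder path; any image file would work.
from PIL import Image
import numpy as np

image = Image.open("bridge.png").convert("RGB")
pixels = np.array(image)     # shape: (height, width, 3)

print(pixels.shape)          # e.g. (400, 600, 3)
print(pixels[0, 0])          # the [red, green, blue] values of the top-left pixel
```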
And so what you might imagine doing is using the same techniques
we talked about before.
Just design a neural network with a lot of inputs: for each of the pixels,
we might have one input-- or three different inputs in the case of a color image--
that is just connected to a deep neural network,
for example.
And this deep neural network might take all
of the pixels inside of the image of what digit a person drew,
and the output might be like 10 neurons that classify it as a 0 or a 1
or 2 or 3, or just tells us in some way what that digit happens to be.
Now there are a couple of drawbacks to this approach.
The first drawback to the approach is just the size of this input array,
that we have a whole bunch of inputs.
If we have a big image, that is a lot of different channels
we're looking at-- a lot of inputs, and therefore, a lot of weights
that we have to calculate.
And a second problem is the fact that by flattening everything
into just the structure of all the pixels,
we've lost access to a lot of the information about the structure
of the image that's relevant, that really,
when a person looks at an image, they're looking
at particular features of that image.
They're looking at curves.
They're looking at shapes.
They're looking at what things can you identify
in different regions of the image, and maybe put those things together
in order to get a better picture of what the overall image was about.
And by just turning it into pixel values for each of the pixels,
sure, you might be able to learn that structure,
but it might be challenging in order to do so.
It might be helpful to take advantage of the fact that you can use properties
of the image itself-- the fact that it's structured in a particular way--
to be able to improve the way that we learn based on that image too.
So in order to figure out how we can train our neural networks to better
be able to deal with images, we'll introduce a couple of ideas--
a couple of algorithms-- that we can apply that allow us to take the images
and extract some useful information out of that image.
And the first idea we'll introduce is the notion of image convolution.
And what an image convolution is all about is filtering an image,
sort of extracting useful or relevant features out of the image.
And the way we do that is by applying a particular filter that basically combines
the value for every pixel with the values of all of its neighboring pixels,
according to some sort of kernel matrix, which we'll see in a moment,
that is going to allow us to weight these pixels in various different ways.
And the goal of image convolution then is
to extract some sort of interesting or useful features out of an image,
to be able to take a pixel, and based on its neighboring pixels,
maybe predict some sort of valuable information, something
like taking a pixel and looking at its neighboring pixels,
you might be able to predict whether or not
there's some sort of curve inside the image,
or whether it's forming the outline of a particular line or a shape,
for example, and that might be useful if you're
trying to use all of these various different features
to combine them to say something meaningful about an image as a whole.
So how then does image convolution work?
Well, we start with a kernel matrix, and the kernel matrix
looks something like this.
And the idea of this is that given a pixel--
that would be the middle pixel--
we're going to multiply each of the neighboring pixels by these values
in order to get some sort of result by summing up all of the numbers together.
So I take this kernel, which you can think of as like a filter
that I'm going to apply to the image.
And let's say that I take this image.
This is a four-by-four image.
We'll think of it as just a black and white image, where each one is just
a single pixel value, so somewhere between 0 and 255, for example.
So we have a whole bunch of individual pixel values like this,
and what I'd like to do is apply this kernel--
this filter, so to speak--
to this image.
And the way I'll do that is, all right, the kernel is three-by-three.
So you can imagine a five-by-five kernel or a larger kernel too.
And I'll take it and just first apply it to the first three-by-three section
of the image.
And what I'll do is I'll take each of these pixel values
and multiply it by its corresponding value in the filter matrix
and add all of the results together.
So here, for example, I'll say 10 times 0, plus 20 times negative 1, plus 30
times 0, so on and so forth, doing all of this calculation.
And at the end, if I take all these values,
multiply them by their corresponding value in the kernel,
add the results together, for this particular set of nine pixels,
I get the value of 10 for example.
And then what I'll do is I'll slide this three-by-three grid effectively over.
Slide the kernel by one to look at the next three-by-three section.
And here I'm just sliding it over by one pixel,
but you might imagine a different slide length,
or maybe I jump by multiple pixels at a time if you really wanted to.
You have different options here.
But here I'm just sliding over, looking at the next three-by-three section.
And I'll do the same math: 20 times 0, plus 30 times negative 1, plus 40
times 0, plus 20 times negative 1, so on and so forth, plus 30 times 5.
And what I end up getting is the number 20.
Then you can imagine shifting over to this one, doing the same thing,
calculating like the number 40, for example,
and then doing the same thing here and calculating a value there as well.
And so what we have now is what we'll call a feature map.
We have taken this kernel, applied it to each
of these various different regions, and what we get
is some representation of a filtered version of that image.
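Here is a minimal sketch of that sliding-window arithmetic in plain NumPy. The image and kernel values below are made up for illustration rather than copied from the slide, but the loop is doing exactly the multiply-and-sum just described.

```python
# Sketch: slide a 3x3 kernel across a grayscale image, multiplying each
# pixel by the corresponding kernel value and summing, to build a feature map.
import numpy as np

image = np.array([
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [20, 30, 40, 50],
])

kernel = np.array([
    [ 0, -1,  0],
    [-1,  5, -1],
    [ 0, -1,  0],
])

h, w = image.shape
kh, kw = kernel.shape
feature_map = np.zeros((h - kh + 1, w - kw + 1))

for i in range(h - kh + 1):
    for j in range(w - kw + 1):
        region = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(region * kernel)

print(feature_map)   # a 2x2 feature map from a 4x4 image and a 3x3 kernel
```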
And so to give a more concrete example of why it is that this kind of thing
could be useful, let's take this kernel matrix,
for example, which is quite a famous one, that has an 8 in the middle
and then all of the neighboring pixels that get a negative 1.
And let's imagine we wanted to apply that
to a three-by-three part of an image that looks like this,
where all the values are the same.
They're all 20, for instance.
Well, in this case, if you do 20 times 8, and then subtract 20,
subtract 20, subtract 20, for each of the eight neighbors,
well, the result of that is you just get that expression,
which comes out to be 0.
You multiply 20 by 8, but then you subtracted 20 eight times
according to that particular kernel.
The result of all of that is just 0.
So the takeaway here is that when a lot of the pixels are the same value,
we end up getting a value close to 0.
If, though, we had something like this, 20s along this first row,
then 50s in the second row, and 50s in the third row, well,
then when you do this same kind of math--
20 times negative 1, 20 times negative 1, so on and so forth--
then I get a higher value-- a value like 90, in this particular case.
And so the more general idea here is that
by applying this kernel, negative 1s, 8 in the middle,
and then negative 1s, what I get is when this middle value is very
different from the neighboring values--
like 50 is greater than these 20s--
then you'll end up with a value higher than 0.
Like if this number is higher than its neighbors,
you end up getting a bigger output, but if this value is the same as all
of its neighbors, then you get a lower output, something like 0.
And it turns out that this sort of filter
can therefore be used in something like detecting edges in an image,
or when we want to detect the boundaries between various different objects
inside of an image.
I might use a filter like this, which is able to tell
whether the value of this pixel is different from the values
of the neighboring pixels-- if it's greater than the values of the pixels
that happen to surround it.
And so we can use this in terms of image filtering.
And so I'll show you an example of that.
I have here, in filter.py, a file that uses the Python Imaging Library, or PIL,
to do some image filtering.
I go ahead and open an image.
And then all I'm going to do is apply a kernel to that image.
It's going to be a three-by-three kernel, the same kind of kernel
we saw before.
And here is the kernel.
This is just a list representation of the same matrix
that I showed you a moment ago, with its
negative 1, negative 1, negative 1.
The second row is negative 1, 8, negative 1.
The third row is all negative 1s.
And then at the end, I'm going to go ahead and show the filtered image.
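A program along the lines of the filter.py being described might look roughly like this sketch, using PIL's built-in ImageFilter.Kernel; the exact file may differ, but the kernel is the same edge-detection matrix of negative 1s surrounding an 8.

```python
# Sketch of an edge-detection filter using the Python Imaging Library.
import sys
from PIL import Image, ImageFilter

if len(sys.argv) != 2:
    sys.exit("Usage: python filter.py image_file")

# Open the image and apply a 3x3 kernel: -1s everywhere, 8 in the middle
image = Image.open(sys.argv[1]).convert("RGB")
filtered = image.filter(ImageFilter.Kernel(
    size=(3, 3),
    kernel=[-1, -1, -1,
            -1,  8, -1,
            -1, -1, -1],
    scale=1
))

# Show the filtered image, which highlights edges and boundaries
filtered.show()
```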
So if, for example, I go into convolution directory
and I open up an image like bridge.png, this
is what an input image might look like, just an image of a bridge over a river.
Now I'm going to go ahead and run this filter program on the bridge.
And what I get is this image here.
Just by taking the original image and applying that filter
to each three-by-three grid, I've extracted
all of the boundaries, all of the edges inside the image that separate
one part of the image from another.
So here I've got a representation of boundaries
between particular parts of the image.
And you might imagine that if a machine learning algorithm is
trying to learn like what an image is of, a filter like this
could be pretty useful.
Maybe the machine learning algorithm doesn't care about all
of the details of the image.
It just cares about certain useful features.
It cares about particular shapes that are
able to help it determine that based on the image,
this is going to be a bridge, for example.
And so this type of idea of image convolution
can allow us to apply filters to images that
allow us to extract useful results out of those images-- taking an image
and extracting its edges, for example.
You might imagine many other filters that
could be applied to an image that are able to extract particular values as
well.
And a filter might have separate kernels for the red values, the green values,
and the blue values that are all summed together at the end,
such that you could have particular filters looking for,
is there red in this part of the image?
Is there green in other parts of the image?
You can begin to assemble these relevant and useful filters that are
able to do these calculations as well.
So that then was the idea of image convolution-- applying
some sort of filter to an image to be able to extract
some useful features out of that image.
But all the while, these images are still pretty big.
There's a lot of pixels involved in the image.
And realistically speaking, if you've got a really big image,
that poses a couple of problems.
One, it means a lot of input going into the neural network,
but two, it also means that we really have
to care about what's in each particular pixel, whereas realistically,
if you're looking at an image, you often don't care
whether something is in one particular pixel
versus the pixel immediately to the right of it.
They're pretty close together.
You really just care about whether there is
a particular feature in some region of the image,
and maybe you don't care about exactly which pixel it happens to be.
And so there's a technique we can use known as pooling.
And what pooling is, is it means reducing the size of an input
by sampling from regions inside of the input.
So we're going to take a big image and turn it into a smaller image
by using pooling.
And in particular, one of the most popular types of pooling
is called max-pooling.
And what max-pooling does is it pools just by choosing the maximum value
in a particular region.
So, for example, let's imagine I had this four-by-four image,
but I wanted to reduce its dimensions.
I wanted to make a smaller image, so that I have fewer inputs to work with.
Well, what I could do is I could apply a two-by-two max
pool, where the idea would be that I'm going
to first look at this two-by-two region and say, what
is the maximum value in that region?
Well, it's the number 50.
So we'll go ahead and just use the number 50.
And then we'll look at this two-by-two region.
What is the maximum value here?
110.
So that's going to be my value.
Likewise here, the maximum value looks like 20.
Go ahead and put that there.
Then for this last region, the maximum value
was 40, so we'll go ahead and use that.
And what I have now is a smaller representation
of this same original image that I obtained just
by picking the maximum value from each of these regions.
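Here is a tiny NumPy sketch of that two-by-two max-pooling step. The pixel values are chosen so that the regional maxima come out to 50, 110, 20, and 40, matching the walkthrough above.

```python
# Sketch: 2x2 max-pooling reduces a 4x4 grid to a 2x2 grid by keeping
# only the maximum value from each 2x2 region.
import numpy as np

image = np.array([
    [10,  50,  20, 10],
    [30,  20, 110, 20],
    [10,  10,  10, 40],
    [20,  20,  20, 10],
])

pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        region = image[2 * i:2 * i + 2, 2 * j:2 * j + 2]
        pooled[i, j] = region.max()

print(pooled)   # [[ 50. 110.]
                #  [ 20.  40.]]
```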
So again, the advantages here are now I only
have to deal with a two-by-two input instead of a four-by-four,
and you can imagine shrinking the size of an image even more.
But in addition to that, I'm now able to make
my analysis independent of whether a particular value was
in this pixel or that pixel.
I don't care if the 50 was here or here.
As long as it was generally in this region,
I'll still get access to that value.
So it makes our algorithms a little bit more robust as well.
So that then is pooling--
taking the size of the image and reducing it
a little bit by just sampling from particular regions inside of the image.
And now we can put all of these ideas together-- pooling, image convolution,
neural networks-- all together into another type of neural network called
a convolutional neural network, or a CNN, which is a neural network that
uses this convolution step, usually in the context of analyzing an image,
for example.
And so the way that a convolutional neural network works is that we
start with some sort of input image-- some grid of pixels--
but rather than immediately put that into the neural network layers
that we've seen before, we'll start by applying a convolution step, where
the convolution step involves applying a number of different image filters
to our original image in order to get what
we call a feature map, the result of applying some filter to an image.
And we could do this once, but in general, we'll
do this multiple times getting a whole bunch of different feature
maps, each of which might extract some different relevant feature out
of the image, some different important characteristic of the image
that we might care about using in order to calculate what the result should be.
And in the same way that we can train neural networks
to learn the weights between particular units
inside of the neural network,
we can also train neural networks to learn what those filters should be--
what the values of the filters should be--
in order to get the most useful, most relevant information out
of the original image just by figuring out what setting of those filter
values-- the values inside of that kernel--
results in minimizing the loss function and minimizing how poorly
our hypothesis actually performs in figuring out the classification
of a particular image, for example.
So we first apply this convolution step.
Get a whole bunch of these various different feature maps.
But these feature maps are quite large.
There are a lot of pixel values that happen to be here.
And so a logical next step to take is a pooling step,
where we reduce the size of these images by using max-pooling,
for example, extracting the maximum value from any particular region.
There are other pooling methods that exist
as well, depending on the situation.
You could use something like average-pooling,
where instead of taking the maximum value from a region,
you take the average value from a region, which has its uses as well.
But in effect, what pooling will do is it will take these feature maps
and reduce their dimensions, so that we end up
with smaller grids with fewer pixels.
And this then is going to be easier for us to deal with.
It's going to mean fewer inputs that we have to worry about,
and it's also going to mean we're more resilient, more robust,
against potential movements of particular values just by one pixel,
when ultimately, we really don't care about those one pixel differences that
might arise in the original image.
Now after we've done this pooling step, now we have a whole bunch of values
that we can then flatten out and just put
into a more traditional neural network.
So we go ahead and flatten it, and then we
end up with a traditional neural network that
has one input for each of these values in each of these resulting feature
maps after we do the convolution and after we do the pooling step.
And so this then is the general structure of a convolutional network.
We begin with the image, apply convolution,
apply pooling, flatten the results, and then put that
into a more traditional neural network that might itself have hidden layers.
You can have deep convolutional networks that
have hidden layers in between this flattened layer and the eventual output
to be able to calculate various different features of those values.
But this then can help us to be able to use convolution and pooling,
to use our knowledge about the structure of an image,
to be able to get better results, to be able to train our networks faster
in order to better capture particular parts of the image.
And there's no reason necessarily why you can only use these steps once.
In fact, in practice, you'll often use convolution and pooling multiple times
in multiple different steps.
So what you might imagine doing is starting with an image,
first applying convolution to get a whole bunch of maps,
then applying pooling, then applying convolution again,
because these maps are still pretty big.
You can apply convolution to try and extract relevant features
out of this result. Then take those results,
apply pooling in order to reduce their dimensions, and then take that
and feed it into a neural network that maybe has fewer inputs.
So here, I have two different convolution and pooling steps.
I do convolution and pooling once, and then I
do convolution and pooling a second time, each time extracting
useful features from the layer before it, each time using
pooling to reduce the dimensions of what you're ultimately looking at.
And the goal now of this sort of model is that in each of these steps,
you can begin to learn different types of features
of the original image, that maybe in the first step
you learn very low-level features, just learn and look for features like edges
and curves and shapes, because based on pixels in their neighboring values,
you can figure out, all right, what are the edges?
What are the curves?
What are the various different shapes that might be present there?
But then once you have a mapping that just represents
where the edges and curves and shapes happen to be,
you can imagine applying the same sort of process
again to begin to look for higher-level features-- look for objects,
maybe look for people's eyes in facial recognition,
for example, maybe look at more complex shapes like the curves
on a particular number if you're trying to recognize a digit in a handwriting
recognition sort of scenario.
And then after all of that, now that you have
these results that represent these higher-level features,
you can pass them into a neural network, which is really
just a deep neural network that looks like this, where you might imagine
making a binary classification, or classifying into multiple categories,
or performing various different tasks on this sort of model.
So convolutional neural networks can be quite powerful and quite popular
when it comes to trying to analyze images.
We don't strictly need them.
We could have just used a vanilla neural network that just operates with layer
after layer as we've seen before.
But these convolutional neural networks can
be quite helpful, in particular, because of the way they
model the way a human might look at an image,
that instead of a human looking at every single pixel
simultaneously and trying to combine all of them by multiplying them together,
you might imagine that what convolution is really
doing is looking at various different regions of the image
and extracting relevant information and features out
of those parts of the image the same way that a human might
have visual receptors that are looking at particular parts of what they see,
and using those, combining them, to figure out
what meaning they can draw from all of those various different inputs.
And so you might imagine applying this to a situation like handwriting
recognition.
So we'll go ahead and see an example of that now.
I'll go ahead and open up handwriting.py.
Again, what we do here is we first import TensorFlow.
And then, TensorFlow, it turns out, has a few datasets
that are built in-- built into the library
that you can just immediately access.
And one of the most famous datasets in machine learning
is the MNIST dataset, which is just a dataset of a whole bunch of samples
of people's handwritten digits.
I showed you a slide of that a little while ago.
And what we can do is just immediately access that dataset,
which is built into the library, so that if I want to do something like train
on a whole bunch of digits, I can just use the dataset that is provided to me.
Of course, if I had my own dataset of handwritten images,
I can apply the same idea.
I'd first just need to take those images and turn them into an array of pixels,
because that's the way that these are going to be formatted.
They're going to be formatted as, effectively,
an array of individual pixels.
And now there's a bit of reshaping I need to do,
just turning the data into a format that I can put
into my convolutional neural network.
So this is doing things like taking all the values and dividing them by 255.
If you remember, these color values tend to range from 0 to 255.
So I can divide them by 255, just to put them into a 0-to-1 range,
which might be a little bit easier to train on.
And then doing various other modifications to the data, just
to get it into a nice usable format.
But here's the interesting and important part.
Here is where I create the convolutional neural network-- the CNN--
where here I'm saying, go ahead and use a sequential model.
And whereas before I used model.add to say add a layer, add a layer, add a layer,
another way I could define it is just by passing
as input to the sequential neural network a list of all of the layers
that I want.
And so here, the very first layer in my model
is a convolutional layer, where I'm first
going to apply convolution to my image.
I'm going to use 32 different filters-- 32 different filters that my model
is going to learn on the input image, where each filter is
going to be a three-by-three kernel.
So we saw those three-by-three kernels before,
where we could multiply each value in a three-by-three grid by a corresponding
kernel value and add all the results together.
So here I'm going to learn 32 different three-by-three filters.
I can again specify my activation function.
And I specify what my input shape is.
My input shape in the banknotes case was just 4.
I had four inputs.
My input shape here is going to be 28, comma, 28, comma, 1, because it turns out
that's how the MNIST dataset organizes its data:
each handwritten digit is a 28-by-28 pixel grid,
and each one of those images only has one channel value.
These handwritten digits are just black and white,
so it's just a single color value representing
how much black or how much white.
You might imagine that in a color image, if you were doing this sort of thing,
you might have three different channels-- a red,
a green, and a blue channel, for example.
But in the case of just handwriting recognition and recognizing a digit,
we're just going to use a single value for shaded-in or not shaded-in,
and it might range, but it's just a single color value.
And that then is the very first layer of our neural network,
a convolutional layer that will take the input
and learn a whole bunch of different filters
that we can apply to the input to extract meaningful features.
The next step is going to be a max-pooling layer, also built
right into TensorFlow, where this is going
to be a layer that is going to use a pool size of two by two,
meaning we're going to look at two-by-two regions inside of the image,
and just extract the maximum value.
Again, we've seen why this can be helpful.
It'll help to reduce the size of our input.
Once we've done that, we'll go ahead and flatten all of the units just
into a single layer that we can then pass
into the rest of the neural network.
And now, here's the rest of the neural network.
Here, I'm saying, let's add a hidden layer to my neural network with 128
units-- so a whole bunch of hidden units inside of the hidden layer--
and just to prevent overfitting, I can add a dropout to that-- say,
you know what?
When you're training, randomly drop out half of the nodes from this hidden layer,
just to make sure we don't become over-reliant on any particular node.
We begin to really generalize and stop ourselves from overfitting.
So TensorFlow allows us, just by adding a single line,
to add dropout into our model as well, such that when it's training,
it will perform this dropout step in order
to help make sure that we don't overfit on this particular data.
And then finally, I add an output layer.
The output layer is going to have 10 units, one
for each category, that I would like to classify digits into,
so 0 through 9, 10 different categories.
And the activation function I'm going to use here
is called the softmax activation function.
And in short, what the softmax activation function is going to do
is it's going to take the output and turn it
into a probability distribution.
So ultimately, it's going to tell me, what
did we estimate the probability is that this is a 2 versus a 3 versus a 4,
and so it will turn it into that probability distribution for me.
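For a sense of what softmax is doing under the hood, here is a small sketch with made-up numbers: it just exponentiates each raw output and normalizes, so the results are all positive and sum to 1.

```python
# Sketch of the softmax function: raw scores in, probability distribution out.
import numpy as np

def softmax(x):
    exps = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return exps / np.sum(exps)

raw_outputs = np.array([1.0, 3.0, 0.5])   # e.g. scores for three candidate digits
print(softmax(raw_outputs))               # roughly [0.11, 0.82, 0.07]
print(softmax(raw_outputs).sum())         # 1.0
```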
Next up, I'll go ahead and compile my model
and fit it on all of my training data.
And then I can evaluate how well the neural network performs.
And then I've added to my Python program that,
if I've provided a command line argument, like the name of a file,
it's going to go ahead and save the model to a file.
And so this can be quite useful too.
Once you've done the training step, which
could take some time--
going through the data, running backpropagation with gradient descent,
to be able to say, all right, how should we adjust
the weights of this particular model--
you end up calculating values for these weights,
calculating values for these filters, and you'd
like to remember that information, so you can use it later.
And so TensorFlow allows us to just save a model to a file,
such that later if we want to use the model we've learned,
use the weights that we've learned, to make some sort of new prediction
we can just use the model that already exists.
So what we're doing here is after we've done all the calculation,
we go ahead and save the model to a file, such
that we can use it a little bit later.
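Putting those pieces together, a program along the lines of the handwriting.py being described might look roughly like this sketch; the optimizer and loss function are reasonable defaults I've assumed rather than details confirmed from the actual file.

```python
# Sketch of a convolutional neural network for MNIST handwritten digits.
import sys
import tensorflow as tf

# Load the MNIST dataset built into TensorFlow and reshape it
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0       # scale pixels to 0-1
y_train = tf.keras.utils.to_categorical(y_train)         # one value per category
y_test = tf.keras.utils.to_categorical(y_test)
x_train = x_train.reshape(-1, 28, 28, 1)                 # 28x28 grid, 1 channel
x_test = x_test.reshape(-1, 28, 28, 1)

# Convolution, pooling, flattening, then a traditional network with dropout
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation="softmax")
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test, y_test, verbose=2)

# Optionally save the trained model to a file named on the command line
if len(sys.argv) == 2:
    model.save(sys.argv[1])
```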
So for example, if I go into digits, I'm going to run handwriting.py.
I won't save it this time.
We'll just run it and go ahead and see what happens.
What will happen is we need to go through the model
in order to train on all of these samples of handwritten digits.
So the MNIST dataset gives us thousands and thousands
of sample handwritten digits in the same format
that we can use in order to train.
And so now what you're seeing is this training process,
and unlike the banknotes case, where there was much,
much fewer data points--
the data was very, very simple--
here, the data is more complex, and this training process takes time.
And so this is another one of those cases where
when training neural networks, this is why computational power is
so important, that oftentimes, you see people wanting
to use sophisticated GPUs in order to more efficiently be
able to do this sort of neural network training.
It also speaks to the reason why more data can be helpful.
The more sample data points you have, the better
you can begin to do this training.
So here we're going through 60,000 different samples
of handwritten digits.
And I said that we're going to go through them 10 times.
So we're going to go through the dataset 10 times, training each time,
hopefully improving upon our weights with every time
we run through this dataset.
And we can see over here on the right what the accuracy is
each time we go ahead and run this model, that the first time,
it looks like we got an accuracy of about 92% of the digits
correct based on this training set.
We increased that to 96% or 97%.
And every time we run this, we're going to see,
hopefully, the accuracy improve, as we continue to try and use
that gradient descent, that process of trying to run the algorithm
to minimize the loss that we get in order to more accurately predict
what the output should be.
And what this process is doing is it's learning not only the weights,
but it's learning the features to use-- the kernel
matrix to use-- when performing that convolution step, because this
is a convolutional neural network, where I'm first performing
those convolutions, and then doing the more traditional neural network
structure.
This is going to learn all of those individual steps as well.
So here, we see that TensorFlow provides me with some very nice output, telling
me about how many seconds are left in each of these training runs,
which allows me to see just how well we're doing.
So we'll go ahead and see how this network performs.
It looks like we've gone through the dataset seven times.
We're going through an eighth time now.
And at this point, the accuracy is pretty high.
We saw we went from 92% up to 97%.
Now it looks like 98%.
And at this point, it seems like things are starting to level out.
There's probably a limit to how accurate we can ultimately
be without running the risk of overfitting.
Of course, with enough nodes, you could just memorize the input and overfit
upon them.
But we'd like to avoid doing that, and dropout will help us with this.
But now, we see we're almost done finishing our training step.
We're at 55,000.
All right.
We've finished training, and now it's going
to go ahead and test for us on 10,000 samples.
And it looks like on the testing set, we were 98.8% accurate.
So we ended up doing pretty well, it seems,
on this testing set to see how accurately can
we predict these handwritten digits.
And so what we could do then is actually test it out.
I've written a program called recognition.py using PyGame.
If you pass it a model that's been trained,
and I pre-trained an example model using this input data, what we can do
is see whether or not we've been able to train
this convolutional neural network to be able to predict handwriting,
for example.
So I can try just like drawing a handwritten digit.
I'll go ahead and draw like the number 2, for example.
So there's my number 2.
Again, this is messy.
If you tried to imagine how you would write a program with just like ifs
and thens to be able to do this sort of calculation,
it would be tricky to do so.
But here, I'll press Classify, and all right.
It seems it was able to correctly classify that what I drew
was the number 2.
We'll go ahead and reset it.
Try it again.
We'll draw like an 8, for example.
So here is an 8.
I'll press Classify.
And all right.
It predicts that the digit that I drew was an 8.
And the key here is this really begins to show
the power of what the neural network is doing, somehow looking
at various different features of these different pixels,
figuring out what the relevant features are,
and figuring out how to combine them to get a classification.
And this would be a difficult task to provide explicit instructions
to the computer on how to do, like to use a whole bunch of if-thens
to process all of these pixel values to figure out
what the handwritten digit is, like everyone is going to draw
their 8 a little bit differently.
If I drew the 8 again, it would look a little bit different.
And yet ideally, we want to train a network to be robust
enough so that it begins to learn these patterns on its own.
All I said was, here is the structure of the network,
and here is the data on which to train the network,
and the network learning algorithm just tries
to figure out what is the optimal set of weights,
what is the optimal set of filters to use,
in order to be able to accurately classify
a digit into one category or another.
That's going to show the power of these convolutional neural networks.
And so that then was a look at how we can use convolutional neural networks
to begin to solve problems with regards to computer vision, the ability to take
an image and begin to analyze it.
And so this is the type of analysis you might
imagine that's happening in self-driving cars that
are able to figure out what filters to apply to an image to understand what it
is that the computer is looking at, or the same type of idea that
might be applied to facial recognition and social media
to be able to determine how to recognize faces in an image as well.
You can imagine a neural network that, instead of classifying
into one of 10 different digits, could instead classify like, is this person A
or is this person B, trying to tell those people apart just based
on convolution.
And so now what we'll take a look at is yet another type of neural network
that can be quite popular for certain types of tasks.
But to do so, we'll try to generalize and think about our neural network
a little bit more abstractly, that here we have a sample deep neural network,
where we have this input layer, a whole bunch of different hidden layers
that are performing certain types of calculations,
and then an output layer here that just generates some sort of output
that we care about calculating.
But we could imagine representing this a little more simply, like this.
Here is just a more abstract representation of our neural network.
We have some input.
That might be like a vector of a whole bunch of different values as our input.
That gets passed into a network to perform
some sort of calculation or computation, and that network
produces some sort of output.
That output might be a single value.
It might be a whole bunch of different values.
But this is the general structure of the neural network that we've seen.
There is some sort of input that gets fed into the network,
and using that input, the network calculates what the output should be.
And this sort of model for a neural network
is what we might call a feed-forward neural network.
Feed-forward neural networks have connections only in one direction;
they move from one layer to the next layer to the layer
after that, such that the inputs pass through various different hidden layers
and then ultimately produce some sort of output.
So feed-forward neural networks are very helpful for solving
these types of classification problems that we saw before.
We have a whole bunch of input.
We want to learn what setting of weights will allow
us to calculate the output effectively.
But there are some limitations on feed-forward neural networks
that we'll see in a moment.
In particular, the input needs to be of a fixed shape,
like a fixed number of neurons are in the input layer,
and there's a fixed shape for the output,
like a fixed number of neurons in the output layer,
and that has some limitations of its own.
And a possible solution to this--
and we'll see examples of the types of problems we
can solve with this in just a second--
is instead of just a feed-forward neural network where there are only
connections in one direction, from left to right effectively,
across the network, we can also imagine a recurrent neural network,
where a recurrent neural network generates
output that gets fed back into itself as input for future runs of that network.
So whereas in a traditional neural network,
we have inputs that get fed into the network that get fed into the output,
and the only thing that determines the output is based on the original input
and based on the calculation we do inside of the network itself,
this goes in contrast with a recurrent neural network,
where in a recurrent neural network, you can imagine output
from the network feeding back to itself into the network
again as input for the next time that you do the calculations
inside of the network.
What this allows is it allows the network to maintain some sort of state,
to store some sort of information that can
be used on future runs of the network.
Previously, the network just defined some weights,
and we passed inputs through the network, and it generated outputs,
but the network wasn't saving any information based on those inputs
to be able to remember for future iterations or for future runs.
What a recurrent neural network will let us do
is let the network store information that
gets passed back in as input to the network again the next time we try
and perform some sort of action.
And this is particularly helpful when dealing with sequences of data.
So we'll see a real-world example of this right now actually.
Microsoft has developed an AI known as the CaptionBot,
and what the CaptionBot does is it says, I
can understand the content of any photograph,
and I'll try to describe it as well as any human.
I'll analyze your photo, but I won't store it or share it.
And so what Microsoft CaptionBot seems to be claiming to do
is it can take an image and figure out what's in the image
and just give us a caption to describe it.
So let's try it out.
Here, for example, is an image of Harvard Square
and some people walking in front of one of the buildings at Harvard Square.
I'll go ahead and take the URL for that image,
and I'll paste it into CaptionBot, then just press Go.
So CaptionBot is analyzing the image, and then it says,
I think it's a group of people walking in front
of a building, which seems amazing.
The AI is able to look at this image and figure out what's in the image.
And the important thing to recognize here
is that this is no longer just a classification task.
We saw being able to classify images with a convolutional neural network,
where the job was to take the images and then figure out, is it a 0, or a 1,
or a 2; or is that this person's face or that person's face?
What seems to be happening here is the input is an image,
and we know how to get networks to take input of images,
but the output is text.
It's a sentence.
It's a phrase, like "a group of people walking in front of a building."
And this would seem to pose a challenge for our more traditional
feed-forward neural networks, for the reason being
that in traditional neural networks, we just
have a fixed-size input and a fixed-size output.
There are a certain number of neurons in the input to our neural network
and a certain number of outputs for our neural network,
and then some calculation that goes on in between.
But the size of the inputs--
the number of values in the input and the number of values in the output--
those are always going to be fixed based on the structure of the neural network,
and that makes it difficult to imagine how a neural network can
take an image like this and say, you know,
it's a group of people walking in front of the building,
because the output is text.
It's a sequence of words.
Now it might be possible for a neural network to output one word.
One word you could represent as a vector of values,
and you can imagine ways of doing that.
And next time, we'll talk a little bit more about AI
as it relates to language and language processing.
But a sequence of words is much more challenging,
because depending on the image, you might
imagine the output is a different number of words.
We could have sequences of different lengths,
and somehow we still want to be able to generate the appropriate output.
And so the strategy here is to use a recurrent neural network,
a neural network that can feed its own output back into itself
as input for the next time.
And this allows us to do what we call a one-to-many relationship for inputs
to outputs, that in vanilla, more traditional neural networks--
these are what we consider to be one-to-one neural networks--
you pass in one set of values as input, you get one vector of values
as the output--
but in this case, we want to pass in one value as input--
the image-- and we want to get a sequence-- many values--
as output, where each value is like one of these words that gets produced
by this particular algorithm.
And so the way we might do this is we might imagine starting
by providing input the image into our neural network,
and the neural network is going to generate output,
but the output is not going to be the whole sequence of words,
because we can't represent the whole sequence of words
using just a fixed set of neurons.
Instead, the output is just going to be the first word.
We're going to train the network to output
what the first word of the caption should be.
And you could imagine that Microsoft has trained
this by running a whole bunch of training samples through the AI,
giving it a whole bunch of pictures and what the appropriate caption was,
and having the AI begin to learn from that.
But now, because the network generates output
that can be fed back into itself, you can
imagine the output of the network being fed back into the same network--
this here looks like a separate network, but it's really the same network that's
just getting different input--
that this network's output gets fed back into itself,
but it's going to generate another output,
and that other output is going to be like the second word in the caption.
And this recurrent neural network then, this network
is going to generate other output that can be fed back
into itself to generate yet another word, fed back
into itself to generate another word.
And so recurrent neural networks allow us to represent
this sort of one-to-many structure.
You provide one image as input, and the neural network
can pass data into the next run of the network,
and then again and again, such that you could run the network multiple times,
each time generating a different output, still based on that original input.
And this is where recurrent neural networks
become particularly useful when dealing with sequences of inputs or outputs.
My output is a sequence of words, and since I can't very easily
represent outputting an entire sequence of words,
I'll instead output that sequence one word at a time,
by allowing my network to pass information
about what still needs to be said about the photo
into the next stage of running the networks.
So you could run the network multiple times--
the same network with the same weights--
just getting different input each time, first getting input from the image,
and then getting input from the network itself,
as additional information about what additionally
needs to be given in a particular caption, for example.
So this then is a one-to-many relationship
inside of a recurrent neural network.
But it turns out there are other models that we
can use-- other ways we can try and use recurrent neural networks-- to be
able to represent data that might be stored in other forms as well.
We saw how we could use neural networks in order to analyze images,
in the context of convolutional neural networks that take an image,
figure out various different properties of the image,
and are able to draw some sort of conclusion based on that.
But you might imagine that something like YouTube,
they need to be able to do a lot of learning based on video.
They need to look through videos to detect
if there are copyright violations, or they
need to be able to look through videos to maybe identify
what particular items are inside of the video, for example.
And video, you might imagine, is much more difficult
to put as input to a neural network, because whereas with an image
you can just treat each pixel as a different value, videos are sequences.
They're sequences of images, and each sequence might be a different length,
and so it might be challenging to represent
that entire video as a single vector of values
that you could pass in to a neural network.
And so here too, recurrent neural networks
can be a valuable solution for trying to solve this type of problem.
Here, instead of just passing a single input into our neural network,
we could pass in the input one frame at a time, you might imagine,
first taking the first frame of the video, passing it into the network,
and then maybe not having the network output anything at all yet.
Let it take in another input, and this time, pass it into the network,
but the network gets information from the last time
we provided an input into the network.
Then we pass in a third input and then a fourth input,
where each time, the network gets the most recent input,
like each frame of the video, but it also
gets information the network processed from all of the previous iterations.
So on frame number four, you end up getting
the input for frame number four, plus information the network has
calculated from the first three frames.
And using all of that data combined, this recurrent neural network
can begin to learn how to extract patterns from a sequence of data
as well.
And so you might imagine wanting to classify
a video into a number of different genres,
like an educational video, or a music video, or other types of videos.
That's a classification task, where you want
to take as input each of the frames of the video,
and you want to output something like what category
that video happens to belong to.
And you can imagine doing this sort of thing--
this sort of many-to-one learning--
anytime your input is a sequence.
And so input is a sequence in the context of a video.
It could be in the context of, say, someone having typed a message
that you want to be able to categorize,
like if you're trying to take a movie review
and classify it as a positive review or a negative review.
That input is a sequence of words, and the output
is a classification-- positive or negative.
There too, a recurrent neural network might
be helpful for analyzing sequences of words,
and they're quite popular when it comes to dealing with language.
It could even be used for spoken language
as well, that spoken language is an audio waveform that
can be segmented into distinct chunks, and each of those
can be passed in as an input into a recurrent neural network
to be able to classify someone's voice, for instance,
if you want to do voice recognition, to say is this one person
or is this another?
These are also cases where you might want this many-to-one architecture
for a recurrent neural network.
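As a hedged sketch of this many-to-one idea, here is what a small recurrent model for classifying a sequence of words (say, a movie review as positive or negative) might look like in Keras, using an LSTM layer; the vocabulary size, sequence length, layer sizes, and random placeholder data are all arbitrary assumptions for illustration.

```python
# Sketch of a many-to-one recurrent network: a sequence of word indices goes in,
# a single classification (e.g. positive vs. negative review) comes out.
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 10000   # how many distinct words we can represent (assumed)
SEQ_LENGTH = 100     # every review padded or truncated to 100 words (assumed)

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 64),       # turn each word index into a vector
    tf.keras.layers.LSTM(64),                        # process the sequence, carrying state
    tf.keras.layers.Dense(1, activation="sigmoid")   # single output: P(positive)
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Random placeholder data, just to show the shapes involved
x = np.random.randint(0, VOCAB_SIZE, size=(500, SEQ_LENGTH))
y = np.random.randint(0, 2, size=500)
model.fit(x, y, epochs=3)
```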
And then as one final problem, just to take a look
at in terms of what we can do, with these sorts of networks,
imagine what Google Translate is doing.
So what Google Translate is doing is it's taking some text written in one
language and converting it into text written in some other language,
for example, where now this input is a sequence of data--
it's a sequence of words--
and the output is a sequence of words as well.
It's also a sequence.
So here, we want effectively like a many-to-many relationship.
Our input is a sequence, and our output is a sequence as well.
And it's not quite going to work to just say, take each word in the input
and translate it into a word in the output,
because ultimately, different languages put their words in different orders,
and maybe one language uses two words for something,
whereas another language only uses one.
So we really want some way to take this information-- that's input--
encode it somehow, and use that encoding to generate what the output ultimately
should be.
And one of the big advancements
in automated translation technology has been the ability
to use neural networks to do this, instead of older, more traditional methods,
and this has improved accuracy dramatically.
And the way you might imagine doing this is, again,
using a recurrent neural network with multiple inputs and multiple outputs.
We start by passing in all the input.
Input goes into the network.
Another input, like another word, goes into the network,
and we do this multiple times, like once for each word in the input
that I'm trying to translate.
And only after all of that is done, does the network now
start to generate output, like the first word of the translated sentence,
and the next word of the translated sentence, so on and so forth,
where each time the network passes information
to itself by allowing for this model of giving some sort of state
from one run in the network to the next run,
assembling information about all the inputs,
and then passing along information about which part of the output
to generate next.
And there are a number of different types of these sorts
of recurrent neural networks.
One of the most popular is known as the long short-term memory neural
network, otherwise known as LSTM.
But in general, these types of networks can be very, very powerful
whenever we're dealing with sequences, whether those
are sequences of images or especially sequences of words when it comes
towards dealing with natural language.
So those then were just some of the different types of neural networks
that can be used to do all sorts of different computations,
and these are incredibly versatile tools that
can be applied to a number of different domains.
We only looked at a couple of the most popular types of neural networks--
the more traditional feed-forward neural networks,
convolutional neural networks, and recurrent neural networks.
But there are other types as well.
There are adversarial networks, where networks compete with each other
to try and be able to generate new types of data,
as well as other networks that can solve other tasks based on what they happen
to be structured and adapted for.
And these are very powerful tools in machine learning,
from being able to very easily learn based on some set of input data
and to be able to therefore figure out how to calculate
some function, from inputs to outputs.
Whether it's input to some sort of classification, like analyzing an image
and getting a digit, or machine translation where
the input is in one language and the output is in another,
these tools have a lot of applications for machine learning more generally.
Next time, we'll look at machine learning and AI
in particular in the context of natural language.
We talked a little bit about this today, but looking
at how it is that our AI can begin to understand natural language
and can begin to be able to analyze and do useful tasks with
regards to human language, which turns out
to be a challenging and interesting task.
So we'll see you next time.