ml5.js: Train a Neural Network with Pixels as Input

And you thought we were done with the ML5 neural network
tutorials.
But no.
There is one more because I am leading to something.
I am going to-- you will soon see in this playlist
a section on convolutional neural networks.
But before I get to convolutional neural networks,
I want to look at the reasons for having a convolutional layer.
I have to answer this question like, what is a convolution?
I've got to get to that.
But before I get to that, I want to just see why
they exist in the first place.
So I want to start with another scenario
for training your own neural network.
That scenario is an image classifier.
Now you might rightfully be sitting
there saying to yourself, you've done videos
on image classifiers before.
And in fact, I have.
The very beginning of this whole series
was about using a pre-trained model for an image classifier.
And guess what?
That pre-trained model had convolutional layers in it.
So I want to now take the time to unpack what that means more
and look at how you could train your own convolutional neural
network.
Again, first though, let's just think
about how we would make an image classifier
with what we have so far.
We have an image.
And that image is being sent into an ML5 neural network.
And out of that neural network comes either a classification
or regression.
And in fact, we could do an image regression.
And I would love to do that.
But let me start with a classifier
because I think it's a lot simpler to think about
and consider.
So maybe it comes out with one of two things,
either a cat or a dog and some type of confidence score.
I previously zoomed in on the ML5 neural network
and looked at what's inside, right?
We have this hidden layer with some number
of units and an output layer, which, in this case,
would have just two if there's two classes.
Everything is connected, and then there are the inputs.
With PoseNet, you might recall, there were 34 inputs
because there were 17 points on my body,
each with an x, y position.
What are these?
Let's just say, for the sake of argument,
that this image is 10 by 10 pixels.
So I could consider every single pixel
to be an individual input into this ML5 neural network.
But each pixel has three channels,
an R, G, and B. So that would make 100 times three inputs,
300 inputs.
That's reasonable.
So this is actually what I want to implement.
Take the idea of a two-layer neural network
to perform classification, the same thing I've
done in previous videos, but, this time, use as the input
the actual raw pixels.
Can we get meaningful results from just doing that?
After we do that, I want to return here
and talk about why this is inadequate, or, I'm not going
to say inadequate, but how this can be improved on
by adding another layer.
So this layer won't--
sorry.
The inputs will still be there.
We're always going to have the inputs.
The hidden layer will still be there.
And the output layer will still be there.
But I want to insert right in here
something called a convolutional layer.
And I want to do a two dimensional convolutional
layer.
So I will come back.
If you want to just skip to that next video,
if and when it exists, that's when I
will start talking about that.
But let's just get this working as a frame of reference.
I'm going to start with some prewritten code.
All this does, it's a simple p5.js sketch
that opens a connection to the webcam,
resizes it to 10 by 10 pixels, and then
draws a rectangle in the canvas for each and every pixel.
So this could be unfamiliar to you.
How do you look at an image in JavaScript in P5
and address every single pixel individually?
If that's unfamiliar to you, I would refer
to my video on that topic.
It's appearing next to me right now.
You can go take a look at that and then come back here.
But really, this is just looking at every x and y position,
getting the R, G, B values, filling a rectangle,
and drawing it.
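
As a rough illustration of looking up one pixel by its x and y position, here is a minimal sketch. It assumes a flat RGBA array laid out the way the p5.js pixels array is (four values per pixel, pixel density of 1). The helper name getRGB and the tiny test image are made up for this example and are not from the video's code.

```javascript
// Hypothetical helper: read the [r, g, b] color of pixel (x, y) from a flat
// RGBA array that stores 4 values per pixel, row by row.
function getRGB(pixels, x, y, width) {
  const index = (x + y * width) * 4;
  return [pixels[index], pixels[index + 1], pixels[index + 2]];
}

// A made-up 2 by 2 "image": red, green on the top row, blue, white below.
// The fourth value of each pixel is alpha.
const tinyImage = [
  255, 0, 0, 255,   0, 255, 0, 255,
  0, 0, 255, 255,   255, 255, 255, 255,
];

console.log(getRGB(tinyImage, 1, 0, 2)); // the green pixel
console.log(getRGB(tinyImage, 0, 1, 2)); // the blue pixel
```

In the sketch itself, the returned values would feed fill() before drawing each rectangle.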
So what I want to do next is think about,
how do I configure this ML5 neural network,
which expects that 10 by 10 image as its input?
I'm going to make a variable called pixel brain.
And pixel brain will be a new ML5 neural network.
I should have mentioned that you could find the link to the code
that I'm starting with, in case you
wanted to code along with me, both the finished code
and the code I'm starting with will
be in this video's description.
So to create a neural network, I call the neural network
function and give it a set of options.
One thing I should mention is while in all the videos
I've done so far, I've said that you
need to specify the number of inputs
and the number of outputs to configure your neural network.
The truth is ML5 is set up to infer
the total number of inputs and outputs
based on the data you're training it with.
But to be really explicit about things
and make the tutorial as clear as possible,
I'm going to write those into the options.
So how many inputs?
Think about that for a second.
The number of columns times the number of the rows times
R, G, B. Maybe I would have a grayscale image.
Maybe then I wouldn't
need a separate input for R, G, and B. But let's keep all three.
Why not?
I have the 10 by 10 in a variable called video size.
So let's make that video size times video size times three.
Let's just make a really simple classifier that's
like I'm here or not here.
So I'm going to make that two.
The task is classification.
And I want to see debugging when I train the model.
Now I have my pixel brain, my neural network.
Oops.
That should be three.
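
The options described above might look something like the following. This is a sketch rather than the exact code from the video: the buildOptions helper is invented for illustration, and the actual call to ml5.neuralNetwork is left in a comment so the snippet runs without the library loaded.

```javascript
// Width and height of the downsized video, as described in the transcript.
const videoSize = 10;

// Hypothetical helper that assembles the options object.
function buildOptions(size, numClasses) {
  return {
    inputs: size * size * 3, // one input per R, G, and B value of every pixel
    outputs: numClasses,     // "here" or "not here"
    task: 'classification',
    debug: true,             // show the training visualization
  };
}

const options = buildOptions(videoSize, 2);
console.log(options.inputs); // 10 * 10 * 3

// In a p5.js sketch with ml5.js loaded, the network would then be created with:
// const pixelBrain = ml5.neuralNetwork(options);
```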
Let's go with my usual typical, terrible interface,
meaning no interface.
And I'm just going to train the model based on when
I press keys on the keyboard.
So I'll add a key press function.
And then let me be a little goofy here:
I'm just going to say, when I press a key,
add example key.
So I need a new function called add example.
Label.
So basically, I'm going to make the key that I press the label.
So I'm going to press a bunch of keys
when I'm standing in front of the camera
and then press a different key when I'm not standing
in front of the camera.
Now comes the harder work.
I need to figure out how to make an array of inputs
out of all of the pixels.
Luckily for me, this is something
that I have done before.
And in fact, I actually have some code
that I could pull from right in here,
which is looking at how to go through all the pixels
to draw them.
But here's the thing.
I am going to do something to flatten the data.
I am not going to keep the data in its original columns
and rows orientation.
I'm going to take the pixels and flatten them out
into one single array.
Guess what?
This is actually the problem that
convolutional neural networks will address.
It's bad to flatten the data because its spatial arrangement
is meaningful.
I'll start by creating an empty array called inputs.
Then I'll loop through all of the pixels.
And to be safe, I should probably
say video dot load pixels.
The pixels may already be loaded because I'm
doing that for down here.
And I could do something where if I'm drawing them,
I might as well create the data here.
But I'm going to be redundant about it.
And I'm going to say--
ah, but this is weird.
Here's the weird thing.
I thought I wasn't going to talk about the pixel array
in this video and just refer you to the previous one.
But I can't escape it right now.
For every single pixel in an image in p5.js,
there are four spots in the array, a red value,
a green value, a blue value, and an alpha value.
Alpha value for transparency.
The alpha value, I can ignore because it's
going to be 255 for everything.
There's no transparency.
If I wanted to learn transparency,
I could make that an input and have 10 by 10 times 4.
But I don't need to do that here.
So in other words, pixel zero starts here, 0, 1, 2, 3.
And the second pixel starts at index four.
So as I'm iterating over all of the pixels,
I want to move through the array four spaces at a time.
There's a variety of ways I could approach this,
but that's going to make things easiest for me.
So that means right over here, this
should be plus equals four.
Then I can say the red value is video dot pixels index
I. The green value is at I plus one.
And the blue value is at I plus two.
And just to be consistent, I'm going
to just put a plus zero in there so everything lines up nicely.
So that's the R, G, and B values.
Then I want those R, G, and B values
for this particular pixel to go in the inputs array.
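
The loop described here could be sketched as follows. The function name pixelsToInputs and the stand-in pixel data are assumptions for this example. In the sketch the array would come from video.pixels after calling video.loadPixels().

```javascript
// Hypothetical helper: flatten an RGBA pixel array into
// [r, g, b, r, g, b, ...], skipping every alpha value.
function pixelsToInputs(pixels) {
  const inputs = [];
  // Each pixel takes up four spots, so step through four at a time.
  for (let i = 0; i < pixels.length; i += 4) {
    const r = pixels[i + 0];
    const g = pixels[i + 1];
    const b = pixels[i + 2];
    // pixels[i + 3] is alpha and is ignored.
    inputs.push(r, g, b);
  }
  return inputs;
}

// Stand-in for a 10 by 10 video: 100 pixels, 4 values each.
const fakePixels = new Uint8ClampedArray(10 * 10 * 4).fill(255);
const flattened = pixelsToInputs(fakePixels);
console.log(flattened.length); // 100 pixels * 3 channels
```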
The chat is making a very good point,
which is that I have all of the stuff in an array already.
And all I'm really doing is making a slightly smaller array
that's removing every fourth element.
I could do that with the filter function
or some kind of higher order function
or maybe just use the original array.
I'm not really sure why I'm doing it this way.
But I'm going to emphasize this data preparation step.
So I look forward to hearing your comments about this,
and maybe reimplementations that just
use the pixel array directly.
But I'm going to keep it this way for right now.
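
For comparison, the filter idea from the chat might look like this. It is one possible reimplementation, not code from the video, and dropAlpha is a made-up name.

```javascript
// Hypothetical alternative: keep every value except each fourth one (alpha).
// Array.from turns a typed pixel array into a plain array first.
function dropAlpha(pixels) {
  return Array.from(pixels).filter((value, index) => index % 4 !== 3);
}

const twoPixels = [10, 20, 30, 255, 40, 50, 60, 255];
console.log(dropAlpha(twoPixels)); // the six color values, no alpha
```

The result is the same flat list of R, G, B values, just built without an explicit loop.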
So I'm taking the R, G, and B and putting them
all into my new array.
Then the target is just the label,
a single label in an array.
And I can now add this as training data,
pixel brain add data inputs target.
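
Putting the pieces so far together, an add-example step might be sketched like this. The stubBrain object is a stand-in so the snippet runs without a browser. In the sketch it would be the network returned by ml5.neuralNetwork, whose addData method takes the inputs and the target. Passing the network and pixels in as arguments is a choice made for this example only.

```javascript
// Stand-in for the ml5 neural network; it only records what it is given.
const stubBrain = {
  examples: [],
  addData(inputs, target) {
    this.examples.push({ inputs, target });
  },
};

// Hypothetical version of addExample: pixels would be video.pixels,
// and label is the key that was pressed.
function addExample(brain, pixels, label) {
  const inputs = [];
  for (let i = 0; i < pixels.length; i += 4) {
    inputs.push(pixels[i], pixels[i + 1], pixels[i + 2]);
  }
  const target = [label]; // a single label in an array
  brain.addData(inputs, target);
}

const examplePixels = new Array(10 * 10 * 4).fill(128);
addExample(stubBrain, examplePixels, 'a');
console.log(stubBrain.examples[0].inputs.length); // number of inputs stored
console.log(stubBrain.examples[0].target);        // the label, in an array
```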
Let's console log something just to see that this is working.
So I'm going to console log the inputs.
And let's also console log the target,
just to see that something is coming out.
So, a, yeah.
We can see there's an array there.
And there's the a.
And now if I do b, I'm getting a different array with b there.
So I'm going to assume this is working.
I could say inputs dot length to make sure
that that's the right idea.
Yeah.
It's got 300 things in it.
OK.
Next step is to train the model.
So I'm going to say, if the key pressed is T,
don't add an example but rather train the model.
And let's train it over 50 epochs
and have a callback when it's finished training.
Let's also add an option to save the data,
just in case I want to stop and start a bunch of times
and not collect the data again.
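
The key handling described above could be sketched roughly as below. The T key comes from the transcript. Using S to save is an assumption, since the transcript does not say which key saves the data. The recording object stands in for the ml5 network so the example runs on its own. In a p5.js sketch this logic would sit inside keyPressed() and use the global key variable.

```javascript
// Hypothetical dispatcher: decide what to do for a given key.
function handleKey(key, brain, addExample, finishedTraining) {
  if (key === 't') {
    brain.train({ epochs: 50 }, finishedTraining);
  } else if (key === 's') {
    brain.saveData(); // assumed key for saving the collected data
  } else {
    addExample(key);  // any other key becomes the label
  }
}

// Stand-ins that just record what was called.
const log = [];
const recordingBrain = {
  train(options, done) { log.push(`train ${options.epochs}`); done(); },
  saveData() { log.push('save'); },
};
const recordExample = (label) => log.push(`example ${label}`);
const finished = () => log.push('finished');

['h', 'n', 's', 't'].forEach((k) => handleKey(k, recordingBrain, recordExample, finished));
console.log(log);
```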
And I'm ready to go, except I missed something important.
I have emphasized before that when
working with neural networks, it's
important to normalize your data,
to take the data that you're using as inputs or outputs,
look at its range, and standardize it
to some specific range, typically between zero and one
or maybe between negative one and one.
And it is true that ML5 will do this for you.
I could just call normalizeData.
But this is a nice opportunity to show that I can just
do the normalization myself.
For example, I know-- this is another reason
to make a separate array sort of.
I know that the range of any given pixel color
is between zero and 255.
So let me take the opportunity to just divide every R, G,
B value by 255 to squash it, to normalize it
between zero and one.
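
A sketch of that manual normalization, folded into the same flattening loop, might look like this. The function name is made up for the example.

```javascript
// Hypothetical helper: flatten RGBA pixels to R, G, B values and
// scale each one from the 0 to 255 range down to 0 to 1.
function pixelsToNormalizedInputs(pixels) {
  const inputs = [];
  for (let i = 0; i < pixels.length; i += 4) {
    const r = pixels[i + 0] / 255;
    const g = pixels[i + 1] / 255;
    const b = pixels[i + 2] / 255;
    inputs.push(r, g, b);
  }
  return inputs;
}

// One pixel: red 0, green 51, blue 255, alpha 255.
console.log(pixelsToNormalizedInputs([0, 51, 255, 255]));
```

Every value that comes out lies between zero and one, so no separate normalization call is needed for the inputs.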
Let's see if this works.
I'm going to collect it.
So I'm going to press-- this is a little bit silly,
but I'm going to press H for me being
here in front of the camera.
Then I'm going to move off to the side,
and I'm going to use N for not being in front of the camera.
So I'm not here.
And I'm just going to do a little bit right now,
and then I'm going to hit T for train.
And the loss function is going crazy.
But eventually, it gets down.
It's a very small amount of data that I gave it to train.
But we can see that I'm getting a low loss function.
If I had built in the inference stage to the code,
it would start to guess Dan or no Dan.
So let's add that in.
When I'm finished training, then I'll start classifying.
The first thing I need to do if I'm going to classify the video
is pack all of those pixels into an input array again.
Then I can call classify on pixel brain
and add a function to receive the results.
Let's do something fun and have it say hi to me.
So I'm going to make this label a global variable with nothing
in it.
And then I'll say, label equals results label.
After I draw the pixels, let's either write hi or not
write hi.
So just to see that this works, let's make the label H
to start.
It says hi.
Now let's not make it H. And let's go
through the whole process.
Train the model.
And it says hi.
Oh, I forgot to classify the video again
after I get the results.
So it classified it only once.
And I want to then recursively continue
after I get the results to classify the video again.
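
The fix described here, classifying again each time results arrive, could be sketched like this. The callback shape (error, results) and reading results[0].label are assumptions based on the ml5 version in use at the time, and newer releases may differ. The stub network stops answering after three rounds so the demo ends.

```javascript
let label = ''; // global label, empty until the first result arrives

// Hypothetical loop: classify, store the top label, then classify again.
function classifyVideo(brain, getInputs) {
  brain.classify(getInputs(), (error, results) => {
    if (error) {
      console.error(error);
      return;
    }
    label = results[0].label;
    classifyVideo(brain, getInputs); // keep going with the next frame
  });
}

// Stand-in for the trained network.
const stubClassifier = {
  calls: 0,
  classify(inputs, callback) {
    this.calls += 1;
    if (this.calls > 3) return; // end the demo after three rounds
    callback(null, [{ label: 'h', confidence: 0.9 }]);
  },
};

classifyVideo(stubClassifier, () => new Array(300).fill(0.5));
console.log(label, stubClassifier.calls);
```

With the real network the callback fires asynchronously, so the repeated call does not pile up on the call stack.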
Just so we can finish this out, I actually
saved all of the data I collected to a file
called data dot JSON.
And now I can say, pixel brain load data data dot JSON.
And when the data is loaded, then I can train the model.
So now I've eliminated the need to collect
the data every single time.
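
Loading saved data and then training might be sketched as follows. The loadAndTrain wrapper is invented for this example, and the stub stands in for the ml5 network, whose loadData method takes a path and a callback.

```javascript
// Hypothetical wrapper: load the saved examples, then train on them.
function loadAndTrain(brain, path, finishedTraining) {
  brain.loadData(path, () => {
    brain.train({ epochs: 50 }, finishedTraining);
  });
}

// Stand-in network that records each step.
const steps = [];
const stubNetwork = {
  loadData(path, callback) { steps.push(`loaded ${path}`); callback(); },
  train(options, done) { steps.push(`trained ${options.epochs}`); done(); },
};

loadAndTrain(stubNetwork, 'data.json', () => steps.push('ready to classify'));
console.log(steps);
```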
Let's run the sketch.
It's going to train the model.
I don't really even need to see this.
When it gets to the end, hi.
Hooray.
I'm pleased that that worked.
I probably shouldn't, but I just want
to try having three outputs.
So let's try something similar to what
I did in my previous videos using teachable machine
to train an image classifier.
And we'll look at this ukulele, a Coding Train notebook,
and a Rubik's cube.
So let me collect a whole lot of data.
I'm going to press U for ukulele, R for Rubik's cube,
and N for notebook.
Save the data in case I need it later and train the model.
All right, so now ukulele, U, N for notebook.
And can we get an R?
I stood to the side when I was doing the Rubik's cube,
so that is pretty important.
So it's not working so well.
So that's not a surprise.
I don't expect it to work that well.
This is why I want to make another video that
covers how to take this very simplistic approach
and improve upon it by adding something
called a convolutional layer.
So what is a convolution?
What are the elements of a convolutional layer?
How do I add one with the ML5 library?
That's what I'm going to start looking at in the next section
of videos.
But before I go, I can't resist just
doing one more thing because I really
want to look at and demonstrate to you what happens if you
change from using pixel input to perform a classification
to a regression.
So I took code from my previous examples that just demonstrated
how regression in ml5 works, and I
changed the task to regression.
I had to lower the learning rate.
Thank you to the live chat who helped me figure this
out after like over an hour of debugging.
I had to lower the learning rate to get this to work.
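
For the regression variant, the options might change along these lines. The learning rate value shown is purely illustrative, since the transcript does not say what it was lowered to, and a single output is assumed because the example maps to one frequency.

```javascript
const videoSize = 10;

// Assumed options for the regression version of the same network.
const regressionOptions = {
  inputs: videoSize * videoSize * 3,
  outputs: 1,          // assumed: one continuous value, such as a frequency
  task: 'regression',
  learningRate: 0.02,  // illustrative value only; lower than the default
  debug: true,
};

console.log(regressionOptions.task, regressionOptions.learningRate);

// With ml5.js loaded, this would be passed to ml5.neuralNetwork(regressionOptions).
```

If training refuses to settle, the learning rate is one of the first things worth adjusting.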
I trained the model with me standing
in different positions associated
with a different frequency that the p5.sound library played.
And you can see some examples of me training it over here.
And now, I am going to run it and see if it works,
and that'll be the end of this video.
So I had saved the data.
And now it's training the model.
And as soon as it finishes training,
you'll be able to hear.
All right, so I will leave that to you as an exercise.
I'll obviously include the link to the code
for this in the video's description
on the web page on thecodingtrain.com
with this particular video.
I can come back and implement it.
You can go find the link to a Livestream
where I spend over an hour implementing it.
But I'll leave that to you as an exercise.
See if you followed this video and have image classification
working, can you change it to a regression
and have it control something with continuous output?
OK, if you made it this far, [KISSING NOISE] thank you.
And I will be back to start talking
about convolutional neural networks, and what
they mean, in the next video.
[MUSIC PLAYING]