Teachable Machine 3: Sound Classification

  • [TRAIN WHISTLE]

  • Hello, and welcome to another Teachable Machine

  • video tutorial.

  • This time, I'm going to look at sound classification.

  • So I have to say goodbye to my friend the unicorn

  • but hello, again, to the ukulele, hello to the bell,

  • and the train whistle.

  • I am going to look at how I can use

  • Teachable Machine to train a model

  • to recognize different sounds.

  • In the previous video, I looked at image classification.

  • So you can see here, this example

  • is recognizing the guitar.

  • And now I want to make this exact same example but have

  • it show the guitar when I play the ukulele,

  • and have it show the train whistle when I play the train whistle.

  • To do this, instead of starting with an image project,

  • I'm going to start with an audio project.

  • Now, you might remember, in my previous video

  • where I made an image classifier, I talked

  • about the process of transfer learning,

  • how Teachable Machine starts with a base model,

  • in this case, MobileNet, which knows how to classify images

  • into 1,000 categories, removes all those categories,

  • replaces them with those of our own--

  • unicorns, train whistles, rainbows, and a ukulele--

  • and then retrains the model with new images.

  • The sound classifier is going to do exactly the same thing.

  • This time, however, the base model

  • is something called speech commands 18w.

  • The speech commands model is pre-trained to recognize 18

  • words in English that a person might say: the digits 0

  • through 9--

  • that's 10-- up, down, left, right-- that's 14--

  • stop, go, yes, no-- that's 18.

  • It also has categories for unknown words and background

  • noise, so really there are 20, but it's 18w for those 18 words.
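
For anyone coding along, that same base model is available directly in ml5.js. Here is a minimal sketch, assuming ml5.js 0.x and p5.js, that loads SpeechCommands18w as-is, with no transfer learning, and logs whatever it hears; the 0.7 threshold is an arbitrary choice:

```javascript
// Minimal sketch (assumes ml5.js 0.x + p5.js): load the SpeechCommands18w
// base model directly, with no transfer learning, and log its predictions.
let classifier;

function setup() {
  noCanvas();
  // probabilityThreshold is optional; 0.7 is an arbitrary choice
  classifier = ml5.soundClassifier('SpeechCommands18w', {
    probabilityThreshold: 0.7,
  });
  // classify() starts listening to the mic and keeps firing gotResult
  classifier.classify(gotResult);
}

function gotResult(error, results) {
  if (error) {
    console.error(error);
    return;
  }
  // label is one of the 18 words, "_unknown_", or "_background_noise_"
  console.log(results[0].label, results[0].confidence);
}
```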

  • So when I make an audio project, I'm going to use that model,

  • remove all of those labels, and put in my own

  • with my own training sounds instead of images this time.

  • And my labels will be train whistle, bell, and ukulele.

  • So let's get started and make an audio project.

  • The very first thing you need to do

  • when making an audio classifier in Teachable Machine

  • is record some samples or audio examples of background noise.

  • The model will need this during the training

  • process to have something to compare the other classes to.

  • So let's do that.

  • I'm going to attempt to be very quiet, which

  • is quite hard for me, and record background samples.

  • Add Samples, Use Mic, and then I'm

  • going to record a 20-second sample

  • and attempt to meditate during that time.

  • [MUSIC PLAYING]

  • All right.

  • I've got the 20 seconds of background noise.

  • Now the model doesn't take

  • 20 seconds of audio as its input.

  • Any given example presented to the model

  • during the training process has to be 1 second of audio.

  • But I just recorded 20 seconds, so Teachable Machine

  • requires one additional step to prepare that training data.

  • And that is through this button, Extract Sample.

  • So if I click the Extract Sample button,

  • it's going to take that 20 seconds of audio

  • and convert it into 20 samples.

  • Let's just record 20 more seconds of background audio

  • just so I have 40 samples because I

  • want to have a little bit more than the minimum required.
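
Conceptually, that Extract Sample step just chops the long recording into 1-second windows, one training example per second. Here is a rough illustration in plain Web Audio terms; sliceIntoSeconds is a hypothetical helper, not a Teachable Machine API:

```javascript
// Hypothetical sketch of what extraction does conceptually: slice a long
// AudioBuffer into 1-second chunks, one training example per chunk.
function sliceIntoSeconds(audioBuffer) {
  const rate = audioBuffer.sampleRate; // e.g. 44100 samples per second
  const seconds = Math.floor(audioBuffer.duration);
  const examples = [];
  for (let i = 0; i < seconds; i++) {
    const chunk = new Float32Array(rate); // room for one second of mono audio
    audioBuffer.copyFromChannel(chunk, 0, i * rate);
    examples.push(chunk);
  }
  return examples; // a 20-second recording yields 20 examples
}
```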

  • [MUSIC PLAYING]

  • Now I can start adding some of my own audio.

  • Let's begin with the train whistle.

  • So I'm going to change class 2 to train.

  • I'm going to click Use Mic.

  • Now you might notice that my browser is not

  • asking me for any permission to use the mic.

  • Most likely, you're going to see that prompt.

  • I just have it set already to allow mic access.

  • So Use Mic and record two seconds of samples.

  • [TRAIN WHISTLE]

  • Once again, I need to extract those, and I have two samples.

  • Now I need a minimum of eight,

  • so let me add a whole bunch.

  • And I'll speed through this for you.

  • [TRAIN WHISTLE]

  • And I've got 16 audio samples of train whistle sounds.

  • Now we move on and try the bell.

  • Add a class, bell, and Add Samples.

  • [BELL RINGING]

  • Two samples.

  • Here's the thing.

  • I know that I'm going to need--

  • maybe I want to have 16 bell samples.

  • You know what?

  • I'm going to just record for 16 seconds.

  • [BELL RINGING]

  • In some cases, I might want to consider when I'm actually

  • hitting the bell in relation to how it's chopping the audio

  • up into 1-second increments.

  • But let's just see if it works even

  • without me being thoughtful about this.

  • I'm just ringing this bell and recording 16 seconds of it

  • and letting Teachable Machine do its extraction however

  • it's going to do it.

  • And there we go, 18 samples.

  • I'm well above the minimum.

  • I can move on to the ukulele.

  • Another feature under the Settings is a delay.

  • And I want to give myself two seconds from when

  • I hit that Record button to when it starts recording,

  • so it can give me a minute to get set up with the ukulele.

  • Save Settings, record 16 seconds, 2, 1.

  • [UKULELE PLAYING]

  • Now I can extract those, and I've

  • got background noise, train whistle, the bell,

  • and the ukulele.

  • And I'm ready to train the model.

  • Before I start the training process,

  • let me address something.

  • What are these images?

  • And wait a second.

  • Is this actually an image classifier?

  • Because this kind of looks like what we had before.

  • Only the images aren't things from the camera.

  • They are these pictures that somehow

  • appear based on the sound.

  • And in a way, that's kind of true--

  • because these are visualizations

  • of the audio signal, specifically the spectrogram.

  • What are the various amplitudes of the different frequencies

  • of the sound?

  • Is it a very high pitch sound?

  • Is it a very low pitch sound?

  • So that's the actual data.

  • That spectrogram of 1 second of audio

  • is what is being sent into the machine learning model itself.
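
To make that concrete, here is a rough sketch, assuming p5.js with the p5.sound library, that scrolls a live microphone spectrogram across the canvas. It illustrates the idea; it is not Teachable Machine's actual rendering code:

```javascript
// Rough spectrogram sketch (assumes p5.js + p5.sound). Each draw() call
// shifts the image one pixel left, then paints the newest FFT column
// at the right edge: low frequencies at the bottom, high at the top.
let mic, fft;

function setup() {
  createCanvas(512, 256);
  mic = new p5.AudioIn();
  mic.start();
  fft = new p5.FFT(0.8, 256); // smoothing, number of frequency bins
  fft.setInput(mic);
  background(0);
}

function draw() {
  // scroll the existing spectrogram one pixel to the left
  copy(1, 0, width - 1, height, 0, 0, width - 1, height);
  const spectrum = fft.analyze(); // amplitude 0-255 per frequency bin
  for (let i = 0; i < spectrum.length; i++) {
    stroke(spectrum[i]); // brighter pixel = louder at that frequency
    point(width - 1, map(i, 0, spectrum.length, height, 0));
  }
}
```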

  • Let's train the model.

  • (SINGING) Don't switch the tabs.

  • Don't switch the tabs.

  • [INAUDIBLE]

  • [BELL RINGS]

  • Bell.

  • [TRAIN WHISTLE]

  • Train whistle.

  • It works.

  • And now we can take this model and follow the same steps we

  • did with the image classifier.

  • Step 1, export the model.

  • I want to upload it.

  • I can copy that URL.

  • Click.

  • Switch over to my p5.js sketch.

  • Here's my code example from the previous video,

  • which was trained to recognize a train whistle or a rainbow

  • image--

  • train, rainbow, train, rainbow.

  • And I can switch it from an image

  • classifier to a sound classifier.

  • I can change the model URL to my new model URL.

  • I don't need the video anymore, so I can delete that.

  • I'm going to change this to Classify Audio.

  • Unlike with the image classifier,

  • the audio classifier doesn't need

  • you to specifically say which sound input you want to link it to.

  • It's going to default to the microphone.

  • So I can remove this video here.

  • Keep this as gotResults.

  • There's no video to draw.

  • The categories, instead of train, rainbow, unicorn,

  • and ukulele here are, well, train, then I've got bell,

  • no unicorn-- it's so sad--

  • and ukulele.

  • And something else that's different in the audio case

  • is, instead of having to explicitly say now go ahead

  • and classify again, the audio engine is going

  • to just continue listening.

  • So I can get rid of this classifyVideo function.

  • And I can run this sketch.

  • A train already?

  • Wait, wait, wait, wait, wait.

  • So I don't want to make the same mistake

  • I made in the first video.

  • Let's consider what it should display

  • if it doesn't hear anything.

  • I'll just use headphones, so let me put some headphones in.

  • Then, I'm going to actually say if the label is train,

  • put in the train emoji.
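
Putting all of those changes together, the converted sketch might look roughly like this, assuming ml5.js 0.x and p5.js. The model URL is a placeholder for whatever Export and Upload produced, and the emoji choices are stand-ins:

```javascript
// Reconstruction of the converted sketch (assumes ml5.js 0.x + p5.js).
// The model URL is a placeholder; yours comes from Export > Upload.
const modelURL = 'https://teachablemachine.withgoogle.com/models/XXXX/';

let classifier;
let label = 'Background Noise'; // what Teachable Machine calls class 1

function preload() {
  classifier = ml5.soundClassifier(modelURL + 'model.json');
}

function setup() {
  createCanvas(640, 480);
  // No video element this time: the sound classifier defaults to the mic
  // and keeps listening, so classify() is called exactly once.
  classifier.classify(gotResult);
}

function draw() {
  background(0);
  textAlign(CENTER, CENTER);
  textSize(128);
  let emoji = '🎧'; // default display when it only hears background noise
  if (label === 'train') emoji = '🚂';
  if (label === 'bell') emoji = '🔔';
  if (label === 'ukulele') emoji = '🎸'; // no ukulele emoji, guitar stands in
  text(emoji, width / 2, height / 2);
  textSize(24);
  text(label, width / 2, height - 32);
}

function gotResult(error, results) {
  if (error) {
    console.error(error);
    return;
  }
  label = results[0].label;
}
```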

  • Now I'm going to start the sketch,

  • and I'm going to attempt to be very quiet while I do so.

  • [BELL RINGS]

  • [UKULELE PLAYING]

  • Ukulele.

  • [TRAIN WHISTLE]

  • [BELL RINGS]

  • [UKULELE PLAYING]

  • That works.

  • Oh, that's so exciting.

  • Interestingly, I wonder, if I'm talking, what it thinks it is.

  • [BELL RINGS]

  • Hello.

  • This is me talking.

  • I probably was saying things while I was recording

  • those ukulele sounds, or it just matches the most closely

  • because me talking is not background noise.

  • So I could have put another category of just me talking

  • or specific words, oh, so many possibilities.

  • Would it even work to train the model on different chords

  • of the ukulele?

  • My suspicion is that's not going to work particularly

  • well because the quality of the sound of those chords,

  • particularly if they share some of the notes,

  • some of the frequencies, is going to be quite similar.

  • And the base model was trained on human speech.

  • But let's give it a try, and maybe we

  • can control the snake game with different ukulele chords.

  • So I actually just went and trained a model

  • with this idea with four chords--

  • C, G7, F, and A. And you can see it mostly works,

  • or it kind of works.

  • [UKULELE PLAYING]

  • Give me a C. But it's really not getting

  • the clearly distinct, high-confidence

  • scores that I want.

  • Maybe if I try individual notes, it'll work better.

  • The first note on the ukulele is A, the A string.

  • [UKULELE PLAYING]

  • E.

  • [UKULELE PLAYING]

  • C.

  • [UKULELE PLAYING]

  • And finally G.

  • [UKULELE PLAYING]

  • The images of the spectrogram look

  • kind of distinct and different to me,

  • so I view that as a good sign.

  • Let's try training the model.

  • And let's see how it performs.

  • A. Yeah.

  • We got a big bump there in the A confidence score.

  • That's good.

  • Let's try G. Big bump there in the G confidence score.

  • Let's try C, E. So you can see this isn't perfect.

  • These sounds are maybe not as distinct to this model,

  • given what kinds of audio

  • the pre-trained model was trained on, but I'm getting something there.

  • Let's see how well I can control that snake game

  • with just these four notes.

  • If you remember from before--

  • left, right, up, down-- still working.

  • Go this way, left, left.

  • Get that food.

  • Let's try changing it to use the audio classifier.

  • I forgot to export the model.

  • Export, Upload, Copy.

  • [MUSIC PLAYING]

  • Now you need to decide which notes go with which movement.

  • A is left.

  • E is right.

  • C is down.

  • G is up.

  • I made all the same exact changes just

  • to convert this from classifying a video to classifying

  • audio from the mic.
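
Those changes might boil down to a gotResult like the following, assuming the classes are labeled with the note names and that snake.setDir is the direction setter from the earlier snake-game sketch:

```javascript
// Hypothetical label-to-direction mapping for the snake game.
// snake.setDir(x, y) is assumed from the earlier snake sketch.
function gotResult(error, results) {
  if (error) {
    console.error(error);
    return;
  }
  const label = results[0].label;
  if (label === 'A') snake.setDir(-1, 0); // A is left
  if (label === 'E') snake.setDir(1, 0);  // E is right
  if (label === 'C') snake.setDir(0, 1);  // C is down
  if (label === 'G') snake.setDir(0, -1); // G is up
}
```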

  • Let's see what happens.

  • For whatever reason, I typed audioClassifier

  • when it's actually soundClassifier.

  • And you can actually look at the ml5 website

  • to find all the documentation for the soundClassifier.

  • [UKULELE PLAYING]

  • I was so close.

  • I think I have a way of making this work better for my brain.

  • Up is A. Down is G.

  • So those are the outer strings.

  • Then right and left are the inner strings.

  • Up, up, right, left, down.

  • Here we go.

  • [UKULELE PLAYING]

  • Yay.

  • I got the food.

  • So that perhaps wasn't the best solution.

  • Maybe something I could have done

  • was work with those confidence scores in a more intentional

  • way, to make sure that low-confidence classifications

  • didn't disrupt the correct direction of the snake

  • that I had gotten in the first place.
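
One way to act on that idea, under the same assumptions as the previous sketch, is a simple confidence guard; the 0.9 threshold is an arbitrary value to tune by hand:

```javascript
// Same mapping as before, but ignore low-confidence results so a shaky
// classification can't knock the snake off a correct course.
function gotResult(error, results) {
  if (error) {
    console.error(error);
    return;
  }
  const { label, confidence } = results[0];
  if (confidence < 0.9) return; // not sure enough: keep the current direction
  if (label === 'A') snake.setDir(-1, 0);
  if (label === 'E') snake.setDir(1, 0);
  if (label === 'C') snake.setDir(0, 1);
  if (label === 'G') snake.setDir(0, -1);
}
```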

  • I'm trying this one more time with my own speech

  • because I want to see if I can really get

  • finer control over that snake.

  • So right now, I've collected some data of me saying up,

  • down, and a meow sound.

  • And what I've been doing is

  • trying to time my saying of up and down

  • and meow with the 1-second intervals.

  • Let me show you what that looks like.

  • Up, up, down, down, meow, meow, meow, meow, meow.

  • Now I'm going to go and add whistling.

  • [WHISTLING]

  • Time to train this model.

  • Up, down, meow.

  • [WHISTLING]

  • So we can see how this is much more accurate than my attempts

  • with the ukulele chords and notes.

  • Export, Upload, Copy, Paste, change the labels--

  • up, down, right will be meow, and left will be whistle.

  • Meow.

  • [WHISTLING]

  • Down.

  • Meow.

  • [WHISTLING]

  • Down.

  • Meow.

  • Oh, yes.

  • I got the food, and then I died.

  • But I don't care.

  • It worked.

  • But I think you can see, from this example,

  • it works quite well.

  • And hopefully, this opens up a lot of possibilities for you.

  • If you made something, I've got a link

  • in this video's description to a page at thecodingtrain.com

  • where you can submit a URL to a project you've made.

  • And I would love to check it out,

  • share it on a future Livestream.

  • I can't wait to see what kind of stuff people

  • make from my strange projects.

  • And I look forward to seeing you in a future Coding Train video.

  • Goodbye.

  • [TRAIN WHISTLE]

  • That was the first time that something

  • happened when I blew the train whistle

  • at the end of the video.

  • That is great.

  • [BELL RINGS]

  • [MUSIC PLAYING]

  • [TRAIN WHISTLE]
