[TRAIN WHISTLE] Hello, and welcome to another Teachable Machine video tutorial. This time, I'm going to look at sound classification. So I have to say goodbye to my friend the unicorn, but hello, again, to the ukulele, hello to the bell, and the train whistle. I'm going to look at how I can use Teachable Machine to train a model to recognize different sounds. In the previous video, I looked at image classification. So you can see here, this example is recognizing the guitar. And now I want to make this exact same example but have it show the guitar when I play the ukulele and show the train whistle when I play the train whistle. To do this, instead of starting with an image project, I'm going to start with an audio project.

Now, you might remember, in my previous video where I made an image classifier, I talked about the process of transfer learning: how Teachable Machine starts with a base model, in that case MobileNet, which knows how to classify images into 1,000 categories, removes all those categories, replaces them with our own-- unicorns, train whistles, rainbows, and a ukulele-- and then retrains the model with new images. The sound classifier is going to do exactly the same thing. This time, however, the base model is something called SpeechCommands18w. The speech commands model is pre-trained to recognize 18 words in English that a person might say: the digits 0 through 9-- that's 10-- up, down, left, right-- up to 14-- stop, go, yes, no-- that's 18. It also has a category for an unknown word and for background noise, so really there are 20 categories, but it's called 18w for those 18 words. So when I make an audio project, I'm going to use that model, remove all of those labels, and put in my own, with my own training sounds instead of images this time. And my labels will be train whistle, bell, and ukulele.

So let's get started and make an audio project. The very first thing you need to do when making an audio classifier in Teachable Machine is record some samples, or audio examples, of background noise. We'll need this during the training process so the model has something to compare the other classes to. So let's do that. I'm going to attempt to be very quiet, which is quite hard for me, and record background samples. Add Samples, Use Mic, and then I'm going to record a 20-second sample and attempt to meditate during that time. [MUSIC PLAYING]

All right. I've got the 20 seconds of background noise. Now, the model doesn't take 20 seconds of audio as its input. Any given example presented to the model during the training process has to be 1 second of audio. But I just recorded 20 seconds, so Teachable Machine requires one additional step to prepare that training data. And that is this button, Extract Sample. So if I click the Extract Sample button, it's going to take that 20 seconds of audio and convert it into 20 samples. Let's record 20 more seconds of background audio just so I have 40 samples, because I want to have a little bit more than the minimum required. [MUSIC PLAYING]

Now I can start adding some of my own audio. Let's begin with the train whistle. So I'm going to change class 2 to train. I'm going to click Use Mic. Now, you might notice that my browser is not asking me for any permission to use the mic. Most likely, you're going to see that prompt; I just have it set already to allow it. So Use Mic, and record two seconds of samples. [TRAIN WHISTLE] Once again, I need to extract those, and I have two samples.
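(As a quick aside before I add more samples: that SpeechCommands18w base model is also available directly in ml5.js, so you can get a feel for what it already recognizes before any transfer learning happens. Here's a rough sketch of that, assuming ml5.js is loaded alongside p5.js; 'SpeechCommands18w' is the name ml5 uses for this pre-trained model, and the probabilityThreshold option just filters out low-confidence guesses.)

```javascript
let classifier;
let label = 'waiting...';

function setup() {
  createCanvas(400, 200);
  // Load the pre-trained speech commands base model (the 18 words plus
  // "unknown" and "background noise").
  const options = { probabilityThreshold: 0.7 };
  classifier = ml5.soundClassifier('SpeechCommands18w', options, modelReady);
}

function modelReady() {
  // Start listening to the microphone; the callback fires continuously.
  classifier.classify(gotResult);
}

function gotResult(error, results) {
  if (error) {
    console.error(error);
    return;
  }
  // Results are sorted by confidence; results[0] is the top label,
  // e.g. a digit, "up", "down", "left", "right", "stop", "go", etc.
  label = results[0].label;
}

function draw() {
  background(0);
  fill(255);
  textSize(32);
  textAlign(CENTER, CENTER);
  text(label, width / 2, height / 2);
}
```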
Now I need at least eight, so let me add a whole bunch. And I'll speed through this for you. [TRAIN WHISTLE] And I've got 16 audio samples of train whistle sounds.

Now we move on and try the bell. Add a class, bell, and Add Samples. [BELL RINGING] Two samples. Here's the thing: I know that I'm going to need-- maybe I want to have 16 bell samples. You know what? I'm going to just record for 16 seconds. [BELL RINGING] In some cases, I might want to consider when I'm actually hitting the bell in relation to how it's being chopped up into 1-second increments. But let's just see if it works even without me being thoughtful about this. I'm just ringing this bell, recording 16 seconds of it, and letting Teachable Machine do its extraction however it's going to do it. And there we go, 18 samples. I'm well above the minimum.

I can move on to the ukulele. Another feature under the Settings is a delay. I want to give myself two seconds from when I hit that Record button to when it starts recording, so it gives me a moment to get set up with the ukulele. Save Settings, record 16 seconds, 2, 1. [UKULELE PLAYING] Now I can extract those, and I've got background noise, train whistle, the bell, and the ukulele. And I'm ready to train the model.

Before I start the training process, let me address something. What are these images? And wait a second, is this actually an image classifier? Because this kind of looks like what we had before, only the images aren't things from the camera; they are these pictures that somehow appear based on the sound. In a way, that's kind of true, because what these are is visualizations of the audio signal, specifically the spectrogram: what are the amplitudes of the various frequencies in the sound? Is it a very high-pitched sound? Is it a very low-pitched sound? So that's the actual data. That spectrogram of 1 second of audio is what is being sent into the machine learning model itself.

Let's train the model. (SINGING) Don't switch the tabs. Don't switch the tabs. [INAUDIBLE] [BELL RINGS] Bell. [TRAIN WHISTLE] Train whistle. It works.

And now we can take this model and follow the same steps we did with the image classifier. Step 1, export the model. I want to upload it. I can copy that URL, click, and switch over to my p5.js sketch. Here's my code example from the previous video, which was trained to recognize a train whistle or a rainbow image-- train, rainbow, train, rainbow-- and I can switch it from an image classifier to a sound classifier. I can change the model URL to my new model URL. I don't need the video anymore, so I can delete that. I'm going to change this to classify audio. Unlike the image classifier, the sound classifier doesn't need you to specifically say which input you want to link it to; it's going to default to the microphone. So I can remove this video here. Keep this as gotResults. There's no video to draw. The categories, instead of train, rainbow, unicorn, and ukulele, are, well, train, then I've got bell, no unicorn-- it's so sad-- and ukulele. And something else that's different in the audio case is, instead of having to explicitly say now go ahead and classify the video again, the audio engine is going to just continue listening. So I can get rid of this classifyVideo function.

And I can run this sketch. A train already? Wait, wait, wait, wait, wait. I want to not make the same mistake I made in the first video.
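For reference, here's roughly what the converted sketch looks like after all of those changes. This is a sketch of the idea rather than my exact code: the model URL is a placeholder for whatever URL Teachable Machine gives you when you upload your model (with model.json appended, which is what ml5 loads), and the emoji for each label are just my choice of what to draw.

```javascript
// Placeholder for the URL Teachable Machine provides after uploading the model.
let modelURL = 'https://teachablemachine.withgoogle.com/models/XXXXXXXX/';
let classifier;
let label = '';

function preload() {
  // ml5's soundClassifier can load a Teachable Machine audio model directly.
  classifier = ml5.soundClassifier(modelURL + 'model.json');
}

function setup() {
  createCanvas(400, 400);
  // No video element and no classifyVideo() loop: the sound classifier
  // listens to the microphone and keeps calling gotResults on its own.
  classifier.classify(gotResults);
}

function gotResults(error, results) {
  if (error) {
    console.error(error);
    return;
  }
  label = results[0].label;
}

function draw() {
  background(0);
  textSize(128);
  textAlign(CENTER, CENTER);
  // Draw an emoji for each class from the model; background noise draws nothing.
  let emoji = '';
  if (label === 'train') emoji = '🚂';
  else if (label === 'bell') emoji = '🔔';
  else if (label === 'ukulele') emoji = '🎸';
  text(emoji, width / 2, height / 2);
}
```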
Let's consider what it should display if it doesn't hear anything. I'll just use headphones, so let me put some headphones in. Then I'm going to say: if the label is train, put in the train emoji. Now I'm going to start the sketch, and I'm going to attempt to be very quiet while I do so. [BELL RINGS] [UKULELE PLAYING] Ukulele. [TRAIN WHISTLE] [BELL RINGS] [UKULELE PLAYING] That works. Oh, that's so exciting.

Interestingly, I wonder what it thinks it is if I'm talking. [BELL RINGS] Hello. This is me talking. I probably was saying things while I was recording those ukulele sounds, or it just matches most closely because me talking is not background noise. So I could have added another category of just me talking, or specific words-- oh, so many possibilities.

Would it even work to train the model on different chords of the ukulele? My suspicion is that's not going to work particularly well, because the quality of the sound of those chords, particularly if they share some of the notes, some of the frequencies, is going to be quite similar. And the base model was trained on human speech. But let's give it a try, and maybe we can control the snake game with different ukulele chords.

So I went ahead and trained a model on this idea with four chords-- C, G7, F, and A. And you can see it mostly works, or it kind of works. [UKULELE PLAYING] Give me a C. But it's really not getting clearly distinct, high confidence scores for what I want. Maybe if I try individual notes, it'll work better. The first note on the ukulele is A, the A string. [UKULELE PLAYING] E. [UKULELE PLAYING] C. [UKULELE PLAYING] And finally G. [UKULELE PLAYING] The images of the spectrogram look kind of distinct and different to me, so I view that as a good sign. Let's try training the model and see how it performs. A. Yeah, we got a big bump there in the A confidence score. That's good. Let's try G. Big bump there in the G confidence score. Let's try C, E. So you can see this isn't perfect. These sounds are maybe not as distinct to this model, given what kinds of audio the pre-trained base model was trained on, but I'm getting something there.

Let's see how well I can control that snake game with just these four notes. If you remember from before-- left, right, up, down-- still working. Go this way, left, left. Get that food. Let's try changing it to use the audio classifier. I forgot to export the model. Export, Upload, Copy. [MUSIC PLAYING] Now I need to decide which notes go with which movement: A is left, E is right, C is down, G is up. I made all the same exact changes just to convert this from classifying a video to classifying audio from the mic. Let's see what happens. For whatever reason, I typed audioClassifier when it's actually soundClassifier. And you can look at the ml5 website to find all the documentation for the soundClassifier. [UKULELE PLAYING] I was so close.

I think I have a way of making this work better for my brain. Up is A. Down is G. So those are the outer strings. Then right and left are the inner strings. Up, up, right, left, down. Here we go. [UKULELE PLAYING] Yay, I got the food. So that perhaps wasn't the best solution. Maybe something I could have done was work with those confidence scores in a more intentional way, to make sure that low-confidence classifications of the audio didn't disrupt the correct direction the snake was already heading in.
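Roughly sketched, that idea might look like this in the results callback: only change the snake's direction when the top result comes back with a high enough confidence, and otherwise leave the snake on its current path. This assumes a snake object with a setDir(x, y) method like the one in my snake game sketch, uses the first note-to-direction mapping I tried (A left, E right, C down, G up), and the 0.8 threshold is just a number to tune.

```javascript
function gotResult(error, results) {
  if (error) {
    console.error(error);
    return;
  }
  const { label, confidence } = results[0];
  // Ignore low-confidence classifications so a muddy note
  // doesn't knock the snake off the direction it already has.
  if (confidence < 0.8) return;
  if (label === 'A') snake.setDir(-1, 0);      // left
  else if (label === 'E') snake.setDir(1, 0);  // right
  else if (label === 'C') snake.setDir(0, 1);  // down
  else if (label === 'G') snake.setDir(0, -1); // up
}
```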
I'm going to try this one more time with my own speech, because I want to see if I can get finer control over that snake. So right now, I've collected some data of me saying up, down, and a meow sound. And what I've been doing is trying to time my saying of up, down, and meow with the 1-second intervals. Let me show you what that looks like. Up, up, down, down, meow, meow, meow, meow, meow. Now I'm going to go and add whistling. [WHISTLING] Time to train this model. Up, down, meow. [WHISTLING] So we can see how this is much more accurate than my attempts with ukulele chords and notes. Export, Upload, Copy, Paste, change the labels-- up, down, right will be meow, and left will be whistle. Meow. [WHISTLING] Down. Meow. [WHISTLING] Down. Meow. Oh, yes. I got the food, and then I died. But I don't care. It worked. And as I think you can see from this example, it works quite well.

Hopefully, this opens up a lot of possibilities for you. If you made something, there's a link in this video's description to a page at thecodingtrain.com where you can submit a URL to a project you've made. I would love to check it out and share it on a future livestream. I can't wait to see what kind of stuff people make from my strange projects. And I look forward to seeing you in a future Coding Train video. Goodbye. [TRAIN WHISTLE] That was the first time that something happened when I blew the train whistle at the end of the video. That is great. [BELL RINGS] [MUSIC PLAYING]