[DING] Hello, and welcome to another Beginner's Guide to Machine Learning video tutorial. In this video, I'm going to cover the pre-trained model PoseNet. I'm going to look at what PoseNet is, how to use it with the ml5.js library alongside the p5.js library, and how to track your body in the browser in real time. The model I'm looking at, as I mentioned, is called PoseNet.

With any machine learning model that you use, the first question you probably want to ask is: what are the inputs, and what are the outputs? In this case, the PoseNet model expects an image as input. As output, it gives you an array of xy coordinates, and in addition, a confidence score for each one. And what do all these xy coordinates correspond to? They correspond to the keypoints of a PoseNet skeleton.

Now, the PoseNet skeleton isn't necessarily an anatomically correct skeleton. It's a somewhat arbitrary set of 17 points that you can see right over here, from the nose all the way down to the right ankle. The model tries to estimate where those positions are on the human body and gives you xy coordinates, as well as how confident it is about each of those points.

One other important question you should ask yourself, and do some research about, whenever you find yourself using a pre-trained model out of the box, something that somebody else trained, is: who trained that model? Why did they train it? What data was used to train it? And how was that data collected? PoseNet is a bit of an odd case, because the trained model itself is open source. You can use it, you can download it, and there are examples for it in TensorFlow, TensorFlow.js, and ml5.js. But the actual code for training the model, from what I understand or have been able to find, is closed source, so there aren't a lot of details. A data set that's often used for training models around images is COCO, or Common Objects in Context, and it has a lot of labeled images of people striking poses with their keypoints marked. I don't know for a fact whether COCO was used exclusively for training PoseNet, whether it was used partially, or not at all. But your best bet for a starting point for finding out as much as you can about the PoseNet model is to go directly to the source, the GitHub repository for PoseNet; in fact, there's a PoseNet 2.0 coming out. I would also highly suggest you read the blog post "Real-time Human Pose Estimation in the Browser with TensorFlow.js" by Dan Oved, with editing and illustrations by Irene Alvarado and Alexis Gallo. It has a lot of excellent background information about how the model was trained and other relevant details. If you want to learn more about the COCO image data set, I would also point you towards the Humans of AI project by Philipp Schmitt, an artwork and online exhibition that takes a critical look at the data in that data set itself.

If you found your way to this video, most likely you're here because you're making interactive media projects, and PoseNet is a tool you can use to do real-time body tracking very quickly and easily. It's, frankly, pretty amazing that you can do this with just a webcam image.
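For reference, here are those 17 keypoints written out as the part names that ml5's PoseNet reports (this list is my addition, not something shown on screen in the video):

```javascript
// The 17 PoseNet keypoints, from the nose down to the right ankle
const PARTS = [
  'nose',
  'leftEye', 'rightEye',
  'leftEar', 'rightEar',
  'leftShoulder', 'rightShoulder',
  'leftElbow', 'rightElbow',
  'leftWrist', 'rightWrist',
  'leftHip', 'rightHip',
  'leftKnee', 'rightKnee',
  'leftAnkle', 'rightAnkle',
];
```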
So one way to get started, which in my view is one of the easiest, is with the p5 Web Editor and the p5.js library. I have a sketch here which connects to the camera and just draws the video into a canvas. You also want to make sure you have the ml5.js library imported, which happens through a script tag in index.html. Once you've got all that set up, we're ready to start coding.

So I'm going to create a variable called poseNet, and I'm going to say poseNet equals ml5.poseNet. All the ml5 functions are initialized the same way: by referencing the ml5 library, dot, the name of the function, in this case poseNet. Now typically there are some arguments that go here, and we can look up what those arguments are on the documentation page. There we can see there are a few different ways to call the poseNet function. I want to do it the simplest way possible, so I'm just going to give it the video element and a callback for when the model is loaded, which I don't even know that I need. I'll make sure there are no errors and run this again. And we can see "PoseNet is ready," so I know I've got my syntax right: I've called the poseNet function and I've loaded the model.

The way PoseNet works is actually a bit different from everything else in the ml5 library: it works based on event handlers. So I want to set up a pose event by calling the method on(). On 'pose', I want this function to execute: whenever the PoseNet model detects a pose, call this function and give me the results of that pose. I can add that right here in setup, poseNet.on('pose', ...), and I'm going to give it a callback called gotPoses. And now, presumably, every single time it detects a pose, it sees me, it sees my skeleton, and it will log that to the console right here.
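Here's roughly what the sketch looks like at this point, a minimal version of the steps just described (the canvas size and the exact callback names are my own choices):

```javascript
let video;
let poseNet;

function setup() {
  createCanvas(640, 480);
  // Connect to the webcam and hide the default HTML video element
  video = createCapture(VIDEO);
  video.hide();

  // Initialize PoseNet with the video element and a "model ready" callback
  poseNet = ml5.poseNet(video, modelLoaded);

  // Event handler: whenever PoseNet detects a pose, call gotPoses
  poseNet.on('pose', gotPoses);
}

function modelLoaded() {
  console.log('PoseNet is ready');
}

function gotPoses(poses) {
  // For now, just log everything the model gives back
  console.log(poses);
}

function draw() {
  // Draw the current webcam frame to the canvas
  image(video, 0, 0);
}
```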
Now that it's working, I can see a bunch of objects being logged. Let's take a look at what's inside those objects. The p5 console is very useful for basic debugging, but in this case I really want to dive deep into the object I'm logging here, the poses object, so I'm going to open up the actual developer console of the browser. I can see a lot of stuff being logged very, very quickly. I'm going to pick any one of these and unfold it. I can see that I have an array, and the first element of the array is a pose. There can be multiple poses if the model detects more than one person; in this case, there's just one. And I can look at this object: it's got two properties, a pose property and a skeleton property. I definitely want to come back to the skeleton property, but let's start with the pose property. I can unfold that, and we can see, oh my goodness, look at all this stuff in here. First of all, there's a score. I mentioned that each of the xy positions of every keypoint has a confidence score; there is also a confidence score for the entire pose itself. And because the camera is seeing very little of me, it's quite low, just at 30%. Then I can access any one of those keypoints by its name: nose, leftEye, rightEye, and so on, all the way down once again to rightAnkle.

So let's actually draw something based on one of those keypoints. We'll use my nose. I'm going to make the assumption that there's always only going to be a single person; if there were multiple people, I'd want to do this differently. I'm going to, let me hit stop, I'm going to make a variable called pose. Then, in gotPoses, I'm going to check whether it found a pose by checking the length of the array: if the length of the array is greater than zero, I take the first pose from the array and store it in the global variable. But actually, if you remember, the object in the array has two properties, pose and skeleton, so there's a bit of redundant lingo here: I'm going to say pose equals poses[0].pose. This could be a good place to use the confidence score, like only actually use the pose if the confidence is high, but I'm just going to take any pose that it gives me. Then in the draw function, I can draw something based on that pose. So, for example, let me give myself a red nose.

So now if I run the sketch, ah, I got an error. Why did I get that error? The reason is that it hasn't found a pose yet, so there is no nose for it to draw. So I should always check to make sure there is a valid pose first, and only then draw that circle. And there we go: I now have a red dot always following my nose.

If you're following along, pause the video and try to add two more points where your hands are. Now, there isn't actually a hand keypoint, there's a wrist keypoint, but that'll probably work for our purposes. I'll let you try that. [TICKING] [DING] How did that go? OK, I'm going to add it for you now. Let's see if this works. Whoo. This is working terribly. I'm almost kind of getting it right. And there we go. But why is it working so poorly? Well, first of all, I'm only showing it from my waist up, and most likely the model was trained on full-body images. Now I've turned the camera to point at me over here, and I'm further away, and you can see how much more accurate this is, because it sees so much more of my body. I'm able to control where the wrists are and get pretty accurate tracking as I stand further away from the camera.

There are also some other interesting tricks we could try. For example, I could estimate distance from the camera by looking at how far apart the eyes are. So here, I'm storing the right eye and left eye locations in separate variables and then calling the p5 dist() function to see how far apart they are. Then I can take that distance and assign it to the size of the nose, so as I get closer, the nose gets bigger. You almost can't tell, because it's sizing relative to my face, but it gives it more of a realistic appearance of an actual clown nose that's attached, by changing its size according to the proportions of what it's detecting in the face.

You might be asking yourself: well, what if I want to draw all the points that it's tracking? For convenience, I was referencing each point by name: rightEye, leftEye, nose, rightWrist. But there's actually a keypoints array that has all 17 points in it, so I can use that to just loop through everything if that's what I want to do. I can loop through all of the keypoints, get the xy of each one, and then draw a green circle at each location. Oops. That code didn't work, because I forgot that each keypoint is more than just an xy: it's got the confidence score, the name of the part, and a position. So I need pose.keypoints[i].position.x and pose.keypoints[i].position.y.
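Put together, the drawing code described above looks roughly like this (a sketch of the steps in the video; the circle sizes and colors are my own guesses):

```javascript
let pose; // the most recent pose, assuming a single person in frame

function gotPoses(poses) {
  // Only store a pose if the model actually found one
  if (poses.length > 0) {
    pose = poses[0].pose;
  }
}

function draw() {
  image(video, 0, 0);

  // Don't draw anything until a pose has been detected
  if (pose) {
    // Estimate distance from the camera via the eye-to-eye distance
    let eyeR = pose.rightEye;
    let eyeL = pose.leftEye;
    let d = dist(eyeR.x, eyeR.y, eyeL.x, eyeL.y);

    // Red "clown nose" that scales with the eye distance
    fill(255, 0, 0);
    ellipse(pose.nose.x, pose.nose.y, d);

    // Wrists standing in for hands
    fill(0, 0, 255);
    ellipse(pose.rightWrist.x, pose.rightWrist.y, 32);
    ellipse(pose.leftWrist.x, pose.leftWrist.y, 32);

    // Green circles at all 17 keypoints
    for (let i = 0; i < pose.keypoints.length; i++) {
      let x = pose.keypoints[i].position.x;
      let y = pose.keypoints[i].position.y;
      fill(0, 255, 0);
      ellipse(x, y, 16);
    }
  }
}
```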
Now I believe this'll work. And here we go. The only thing I'm not seeing is my ankles. Oh, there we go! It got kind of accurate there. Here's my pose. OK, so you can see I'm getting all the points of my body right now, standing probably about six feet away from the camera.

There's one other aspect of this that I haven't shown you yet. If you've seen demos of PoseNet and some of the examples, the points are connected with lines. On the one hand, you could just memorize to always draw a line from the shoulder to the elbow and from the elbow to the wrist. But PoseNet, based I presume on the confidence scores, will dynamically give you back which parts are connected to which parts. That's in the skeleton property of the object found in the array that was returned to us. So I can add a new global variable called skeleton. (This would've been good for Halloween.) Skeleton equals, let me just stop this for a second, poses[0].skeleton. Then I can loop over the skeleton. The skeleton is actually a two-dimensional array, because in the second dimension it holds the two keypoints that are connected. So I can say a equals skeleton[i][0] and b equals skeleton[i][1], and then I can just draw a line between the two of them. I look at every skeleton connection, get the two parts, part A and part B, and draw a line between the x's and y's of each of them. Make it a bit of a thicker line, give it the color white, and let's see what this looks like. And there we go.
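Here's roughly what those additions look like in code (again a sketch of what's described; the stroke weight is my own choice):

```javascript
let skeleton; // pairs of connected keypoints from the most recent detection

function gotPoses(poses) {
  if (poses.length > 0) {
    pose = poses[0].pose;
    skeleton = poses[0].skeleton;
  }
}

// Inside draw(), after drawing the keypoints:
if (skeleton) {
  for (let i = 0; i < skeleton.length; i++) {
    // Each entry holds the two keypoints that PoseNet considers connected
    let a = skeleton[i][0];
    let b = skeleton[i][1];
    strokeWeight(2);
    stroke(255);
    line(a.position.x, a.position.y, b.position.x, b.position.y);
  }
}
```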
That's pretty much everything you can do with the ml5 PoseNet function. So you might try to do something like make googly eyes; that's something I actually did in a previous video where I looked at an earlier version of PoseNet. You could also look at some of the other examples that demonstrate other aspects. For example, you can actually find the pose in a JPEG that you load, rather than in images from a webcam.

But what I want to do, which I'm going to get to in a follow-up video, is not take the outputs and draw something, but rather take these outputs and feed them as training data into an ml5 neural network. What if I say, hey, every time I make this pose, label that a Y. And every time I make this pose, label that an M, a C, an A, you see where I'm going. Could I create a pose classifier? I can take all of the xy positions, label them, and train a classifier to make guesses about my pose. This is very similar to what I did with the Teachable Machine image classifier. The difference is that with the image classifier, as soon as I move the camera to a different room with different lighting, a different background, and a different person, it's not going to be able to recognize the pose anymore, because it was trained on the raw pixels. This would instead be trained on the relative positions. So in theory, if somebody around the same size as me swapped in, it would recognize their pose. And there's actually a way I could normalize all the data so that it would potentially work for anybody's pose, meaning you could train your own pose classifier that works generically in a lot of different environments.

So if you make something with ml5 PoseNet, or with PoseNet in another environment, please share it with me; I'd love to check it out. You can find the code for everything in this video in the link in this video's description. And I'll see you in the future, in another "Coding Train" ml5 Machine Learning Beginner's, whatever, something video. Goodbye!