I'd like to talk about face detection. All right, so this is the idea: if you've got a picture with one face in it, or many faces in it, how do we find those faces? The standard approach is, "Ah, we'll just use deep learning." Now, you can use deep learning to find faces, but actually the approach that everyone uses isn't deep learning, and it was developed in the early 2000s. So back before deep learning did everything, you kind of had to come up with these algorithms yourself, right? Machine learning was still a thing, so people still used machine learning, but they used it with handcrafted features and small neural networks and other kinds of classifiers to try to do these things.
Now, face detection was, you know, ongoing research at this time. In 2002, Paul Viola and Michael Jones came up with this paper here called "Rapid object detection using a boosted cascade of simple features", and this is a very, very good paper. It's been cited some 17,000 times. And despite the fact that deep learning has kind of taken over everything, in face detection this still performs absolutely fine, right? It's incredibly quick, and if you've got any kind of camera that does some kind of face detection, it's going to be using something very similar to this, right?
So what does it do? Let's talk about that. There's a few problems with face detection. One is that we don't know how big the face is going to be, so it could be very big, could be very small. And another is, you know, maybe you've got a very high-resolution image, and we want to be doing this lots and lots of times a second. So what are we going to do? Look over every tiny bit of image, lots and lots of times, with complicated, um, machine learning that says, you know, is this a face, is this not a face? There's a trade-off between speed and accuracy and false positives and false negatives. It's a total mess. It's very difficult to find faces quickly, right?
This is also considering that, you know, we have different ethnic groups, young people, old people, people who've got glasses on, things like this. So all of this adds up to quite a difficult problem, and yet it's not a problem we worry about anymore, because we can do it, and we can do it because of these guys. They came up with a classifier that uses very, very simple features: one bit of an image subtracted from another bit of an image. On its own that's not very good, but if you have thousands and thousands of those, all giving you a clue that maybe this is a face, you can start to come up with a proper decision.
[offscreen] Is this looking for facial features, then? Is it as simple as looking for a nose and an eye, etc.?
So no, not really, right? Deep learning kind of does that, right? It takes edges and other features and it combines them together into objects, you know, in a hierarchy, and then maybe it finds faces. What this is doing is making very quick decisions about what it is to be a face. So for example, if we're just looking at a grayscale image, right, my eye is arguably slightly darker than my forehead, right, in terms of shadowing, and the pupil's darker, and things like this. So if you just do this bit of image minus this bit of image, my eye is going to produce a different response from this blackboard, right, most of the time. Now, if you do that on its own, that's not a very good classifier, right? It'll get quite a lot of the faces, but it'll also find a load of other stuff as well, where something happens to be darker than something else. That happens all the time. So the question is, "Can we produce a lot of these things all at once and make a decision that way?"
They proposed these very, very simple rectangular features, which are just one part of an image subtracted from another part of an image. So there are a few types of these features.
One of them is a two-rectangle feature, so we have a block of image where we subtract one side from the other side. Their approach is a machine-learning-based approach. Normally, what you would do in machine learning is you would extract... You can't put the whole image in; maybe there's five hundred faces in this image. So we put in something we've calculated from the image, some features, and then we use machine learning to try and classify bits of the image, or the whole image, or something like this. Their contribution was a very quick way to calculate these features and use them to make a face classification, to say there is a face in this block of image or there isn't.
And the features they use are super simple, right? So they're just rectangular features like this. So we've got two rectangles next to each other which, you know, are some amount of pixels. So maybe it's nine pixels here and nine pixels here, or just one pixel and one pixel, or a hundred pixels and a hundred pixels. It's not really important. And we do one subtract the other, right? So essentially we're looking for bits of an image where one bit is darker or brighter than another bit. This is a two-rectangle feature. It can also be oriented the other way, so, you know, like this. We also have three-rectangle features, which are like this, where you're doing sort of maybe the middle subtract the outside, or vice versa. And we have four-rectangle features, which are going to be kind of finding diagonal, sort of corner things. So something like this.
Even if your image is small, right, you're going to have a lot of different possible features, even of these four types. So this four-rectangle feature could just be one pixel each, or each of these could be half the image. It can scale, you know, and move around.
Brady: What determines that?
Mike: Um, so they do all of them, right?
Or at least, they look at all of them originally, and they learn which ones are the most useful for finding a face. This, over a whole image of a face, isn't hugely representative of what a face looks like, right? No one's face has two corners that are darker than the other two corners. That doesn't make sense, right? But maybe over their eye, maybe that makes more sense. I don't know, that's kind of the idea. So they have a training process which narrows down which of these features are useful.
The other problem we've got is that, on an image, calculating large groups of pixels and summing them up is quite a slow process. So they came up with a really nifty idea called an integral image, which makes this way, way faster. So let's imagine we have an image, right, and consider while we're talking about this that we want to kind of calculate these bits of image minus some other bit of image, right? So let's imagine we have an image which is nice and small. It's too small for me to write on, but let's not worry about it. Right, and then let's draw in some pixel values. Fast forward. Look at the state of that. That's a total, total shambles. This is a rubbable-out pen, right? For goodness' sake. Right, okay, so, all right. Let's imagine this is our input image.
We're trying to find a face in it. Now, I can't see one, but obviously this could be quite a lot bigger. And we want to calculate, let's say, one of our two-rectangle features. So maybe we want to do these four pixels up in the top minus the four pixels below it. Now, that's only a few additions: (7 + 7 + 1 + 2) minus (8 + 3 + 1 + 2). But if you're doing this over large sections of image, and thousands and thousands of times to try and find faces, that's not gonna work.
So what Viola and Jones came up with was this integral image, where we pre-compute some of this arithmetic, store it in an intermediate form, and then we can calculate rectangles minus other rectangles really easily. So we do one pass over the image, and every new pixel is the sum of all the pixels above and to the left of it, including it. Right, so this will be something like this: so 1, and 1 + 7 is 8, so this pixel is the sum of these two pixels, and this pixel is going to be all these three. So that's going to be 12... 14... 23, and now we fast forward while I do a bit of math in my head. 8... 17... maybe I did some of these earlier... 24... On a computer this is much, much faster.
The sum of all the pixels is 113. For example, the sum of this 4x4 block is 68. Now, the reason this is useful, bear with me here: if we want to work out what, let's say, the sum of this region is, what we do is we take this one, 113, we subtract this one, minus 64, all right, and this one, minus 71, and that's taken off all of that and all of that. And then we have to add this bit in, because we've taken it off twice, so plus 40. All right, so that's four reads. Now, funnily enough, this is only a four-pixel block, so I've achieved nothing. But if this was a huge, huge image, I've saved a huge amount of time. And the answer to this is 18, which is 6 plus 6 plus 5 plus 1. So the assumption is that I'm not just going to be looking at these pixels one time to do this, right?
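The integral image and the four-read rectangle sum described above can be sketched roughly as follows. This is a minimal illustration, not the code from the paper: the 4x4 pixel values are made up (they are not the numbers drawn in the video), the function names `integral_image`, `rect_sum` and `two_rect_feature` are hypothetical, and the table is padded with an extra row and column of zeros so that the edges need no special cases.

```python
def integral_image(img):
    """Build a summed-area table with one extra zero row and column.

    ii[y][x] holds the sum of all pixels above and to the left of (y, x),
    i.e. the pixels img[0..y-1][0..x-1].
    """
    height, width = len(img), len(img[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0  # running sum of the current row
        for x in range(width):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of a rectangle using four reads of the integral image."""
    bottom, right = top + height, left + width
    # Whole area, minus the part above, minus the part to the left,
    # plus the top-left corner that was taken off twice.
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Two-rectangle feature: a rectangle minus the one directly below it."""
    upper = rect_sum(ii, top, left, height, width)
    lower = rect_sum(ii, top + height, left, height, width)
    return upper - lower


# Hypothetical 4x4 grayscale image (values chosen for illustration only).
image = [
    [1, 7, 4, 2],
    [5, 7, 2, 3],
    [8, 3, 6, 6],
    [1, 2, 5, 1],
]

ii = integral_image(image)
bottom_right_block = rect_sum(ii, 2, 2, 2, 2)       # 6 + 6 + 5 + 1
top_row_minus_next = two_rect_feature(ii, 0, 0, 1, 4)
print(bottom_right_block, top_row_minus_next)
```

The cost of `rect_sum` is the same four reads whether the rectangle covers four pixels or four million, which is the whole point once the table has been computed in a single pass.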
There's lots of places a face could be. I've got to look at lots of combinations of pixels and different regions, so I'm going to be doing huge amounts of pixel addition and subtraction. So let's calculate this integral image once and then use that as a base to do really quick adding and subtracting of regions, right? And so I think, for example, a four-rectangle region is going to take something like nine reads, or something like that, and a little bit of addition. It's very simple.
All right. So now, how do we turn this into a working face detector? Let's imagine we have a picture of a face, which is going to be one of my good drawings again. Now, in this particular algorithm, they look at 24 by 24 pixel regions, but they can also scale up and down a little bit. So let's imagine there's a face here which has, you know, eyes, a nose and a mouth, right, and some hair. Okay, good. Now, as I mentioned earlier, there are probably some features that don't make a lot of sense on this. So, for example, if I take my red pen, subtracting this half of the image from this half: it's not going to represent most faces. It may do when there's a lot of lighting on one side, but it's not very good at distinguishing images that have faces in them from images that don't.
So what they do is they calculate all of the features, right? For a 24 by 24 image, they calculate all 180,000 possible combinations of two-, three- and four-rectangle features, and they work out, for a given data set of faces and not-faces, which one best separates the positives from the negatives, right?
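As a rough sketch of what "which one best separates the positives from the negatives" could mean in code: for each feature, try every threshold and both directions, count the mistakes, and keep the feature with the fewest. The paper does this inside a boosting procedure with weighted examples; the version below is a simplified, unweighted assumption, and the feature names, values and the helpers `best_stump` and `select_best_feature` are all hypothetical.

```python
def best_stump(values, labels):
    """Find the threshold and direction that misclassify the fewest examples.

    values: one feature response per training example.
    labels: True for a face, False for not-a-face.
    Returns (errors, threshold, polarity). With polarity +1 an example is
    called a face when value >= threshold; with -1 when value < threshold.
    """
    candidates = sorted(set(values))
    candidates.append(candidates[-1] + 1)  # a threshold above every value
    best = None
    for threshold in candidates:
        for polarity in (1, -1):
            errors = 0
            for value, is_face in zip(values, labels):
                predicted = value >= threshold if polarity == 1 else value < threshold
                if predicted != is_face:
                    errors += 1
            if best is None or errors < best[0]:
                best = (errors, threshold, polarity)
    return best


def select_best_feature(feature_table, labels):
    """Pick the feature whose best stump makes the fewest mistakes."""
    results = {name: best_stump(vals, labels) for name, vals in feature_table.items()}
    winner = min(results, key=lambda name: results[name][0])
    return winner, results[winner]


# Hypothetical responses of two features on three faces and three non-faces.
labels = [True, True, True, False, False, False]
feature_table = {
    "left_half_minus_right_half": [3, -2, 1, 2, -1, 0],
    "cheeks_minus_eyes": [40, 35, 50, 5, -10, 38],
}

name, (errors, threshold, polarity) = select_best_feature(feature_table, labels)
print(name, errors, threshold, polarity)
```

On this toy data the half-versus-half feature cannot do better than two mistakes, while the cheeks-versus-eyes feature gets down to one, so it is the one selected.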
So let's say you have 10,000 pictures of faces and 10,000 pictures of background: which one feature best says "this is a face, this is not a face"? Right, bearing in mind nothing is going to get it completely right with just one feature. So the first one, it turns out, is something like this. It's a two-rectangle region that works out the difference between the area of the eyes and the area of the cheeks. So it's saying that, on a normal face, your cheeks are generally brighter or darker than your eyes.
So what they do is they say, okay, well, let's start a classifier with just that feature, right, and see how good it is. This is our first feature, feature number one, and we have a pretty relaxed threshold, so if there's anything plausible in this region we'll let it through, right? Which is going to let through all of the faces and a bunch of other stuff as well that we don't want, right? So this is yes. That's okay, right? If it's a no, then we immediately fail that region of image, right? So we've done one test, which is, as we know, about four additions. So we've said, for this region of image, if this passes we'll let it through to the next stage, right, and we'll say, okay, it definitely could be a face. It's not not-a-face. Does that make sense? Yeah, okay.
So let's look at the next feature. The next feature is this one. So it's a three-region feature and it measures the difference between the bridge of the nose and the eyes, right, which may or may not be darker or lighter. All right, so there's a difference there. So this is feature number two, so I'm going to draw that in here, number two. And if that passes we go to the next feature. So this is a sort of binary, they call it a "degenerate decision tree", right, because a decision tree is a binary tree. This is not really, because you immediately stop here; you don't go any further.
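The pass-or-fail chain described above can be sketched as a loop that stops at the first failed test. This is a simplified assumption that follows the spoken description, with one feature per stage; in the paper each stage combines several boosted features. The function `run_cascade`, the feature names, the thresholds and the response values are all hypothetical.

```python
def run_cascade(stages, window):
    """Apply the stages in order and reject at the first one that fails.

    stages: list of (feature_fn, threshold) pairs; a window passes a stage
            when feature_fn(window) >= threshold.
    Returns (is_face, stages_evaluated).
    """
    evaluated = 0
    for feature_fn, threshold in stages:
        evaluated += 1
        if feature_fn(window) < threshold:
            return False, evaluated  # immediate fail: no later stage is computed
    return True, evaluated  # nothing failed, so call it a face


# Hypothetical stages; each window is a dict of precomputed feature responses.
stages = [
    (lambda w: w["cheeks_minus_eyes"], 10),  # relaxed first test
    (lambda w: w["bridge_minus_eyes"], 8),   # more specific second test
]

face_like = {"cheeks_minus_eyes": 30, "bridge_minus_eyes": 12}
blackboard = {"cheeks_minus_eyes": 2, "bridge_minus_eyes": 50}

print(run_cascade(stages, face_like))
print(run_cascade(stages, blackboard))
```

The second value returned shows how much work was done: the blackboard window is thrown away after a single test, and only the plausible window pays for every stage.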
The argument is that every time we calculate one of these features, it takes a little bit of time. The quicker we can say "no, definitely not a face in there", the better. And the only time we ever need to look at all the features, or all of the good ones, is when we think, "okay, there actually could be a face here". So we have less and less general, more and more specific features going forward, right up to about the number, I think it's about six thousand, that they end up using. All right, so we say: does the first one pass? Yes. Does the second one pass? Yes. And we keep going until we get a fail, and if we get all the way to the end and nothing fails, that's a face, right? And the beauty of this is that, for the vast majority of the image, there's almost no computation at all. We just take one look at it, the first feature fails: "Nah, not a face".
They designed a really good way of adding and subtracting different regions of the image, and then they trained a classifier like this to find the best features and the best order to apply those features, which was a nice compromise between always detecting the faces that are there, and false positives, and speed, right? And to give you some idea of what the computational technology was like in 2002: this was presented on a 700 megahertz Pentium 3 and ran at 15 frames a second, which was totally unheard of back then.
Face detection was kind of an offline thing, you know, and that was okay at that time. So this is a really, really cool algorithm, and it's so effective that you still see it used in, you know, your camera phone, and in this camera, and so on, when you just get a little bounding box around the face. And this is still really useful, because you might be doing deep learning on something like face recognition, face ID, something like this, but part of that process is firstly working out where the face is, and why reinvent the wheel when this technique works really, really well?
Detecting Faces (Viola Jones Algorithm) - Computerphile. Published by 林宜悉 on January 14, 2021.