  • I'd like to talk about face detection

  • All right. So this is the idea: if you've got a picture with one face in it or many faces in it,

  • how do we find those faces? And

  • the standard approach is "Ah, we'll just use deep learning"

  • Now you can use deep learning to find faces

  • But actually the approach that everyone uses isn't deep learning and it was developed in the early 2000s

  • So back before deep learning did everything

  • You kind of had to come up with these algorithms yourself, right? Machine learning was still a thing, so people still used machine learning

  • But they used it with handcrafted features and small neural networks and other kinds of classifiers

  • that they tried to use to do these things

  • Now, face detection was, you know, ongoing research at this time

  • In 2002 Paul Viola and Michael Jones came up with this paper here called

  • "Rapid object detection using a boosted cascade of simple features", and this is a very very good paper.

  • It's been cited some 17,000 times

  • And despite the fact that deep learning has kind of taken over everything.

  • In face detection, this still performs absolutely fine, right

  • It's incredibly quick and if you've got any kind of camera that does some kind of face detection

  • It's going to be using something very similar to this, right?

  • So what does it do? Let's talk about that.

  • The problem is, right,

  • There's a few problems with face detection. One is that we don't know how big the face is going to be

  • So it could be very big, could be very small, and another is, you know,

  • Maybe you've got a very high-resolution image. We want to be doing this lots and lots of times a second

  • So what are we going to do? Look over every tiny bit of image lots and lots of times?

  • Complicated, um,

  • Machine learning, that says, you know, is this a face? is this not a face?

  • There's a trade-off between speed and accuracy and false-positives and false-negatives. It's a total mess

  • It's very difficult to find faces quickly, right? This is also considering that, you know, we have different ethnic groups,

  • young, old people, people who've got glasses on, things like this

  • So all of this adds up to quite a difficult problem, and yet it's not a problem

  • we worry about anymore because we can do it and we can do it because of these guys

  • They came up with a classifier that uses very very simple features, one bit of an image subtracted from another bit of an image, and

  • on its own that's not very good, but if you have

  • thousands and thousands of those, all giving you a clue that maybe this is a face, you can start to come up with a proper decision

  • [offscreen] Is this looking for facial features then is it as simple as looking for a nose and an eye and etc?

  • So no, not really, right. So deep learning kind of does that right?

  • It takes edges and other features and it combines them together into objects

  • you know, in a hierarchy and then maybe it finds faces. What this is doing is making very quick decisions about

  • what it is to be a face. So, for example, if we're just looking at a grayscale image,

  • Right, my eye is arguably slightly darker than my forehead, right?

  • In terms of shadowing, and the pupil's darker, and things like this

  • So if you just do this bit of image minus this bit of image

  • My eye is going to produce a different response from this blackboard, right, most of the time

  • Now, if you do that on its own, that's not a very good classifier, right? It'll get

  • quite a lot of the faces

  • But it'll also find a load of other stuff as well where something happens to be darker than something else that happens all the time

  • so the question is "can we produce a lot of these things all at once and make a decision that way?"

  • They proposed these very very simple rectangular features

  • Which are just one part of an image subtracted from another part of an image

  • So there are a few types of these features. One of them is a two-rectangle feature

  • So we have a block of image where we subtract one side from the other side

  • Their approach is a machine learning-based approach

  • Normally, what you would do in machine learning is you would extract --

  • You can't put the whole image in maybe there's five hundred faces in this image

  • So we put in something we've calculated from the image, some features, and then we use machine learning to try and classify

  • bits of the image or the whole image or something like this. Their contribution was a very quick way to

  • calculate these features and use them to make a face classification

  • To say there is a face in this block of image or there isn't

  • And the features they use are super simple, right? So they're just rectangular features like this

  • So we've got two rectangles next to each other which, you know, are some amount of pixels

  • so maybe it's

  • nine pixels here and nine pixels here, or just one pixel and one pixel, or a hundred pixels and a hundred pixels

  • It's not really important.

  • and we do one subtract the other right?

  • So essentially we're looking for bits of an image where one bit is darker or brighter than another bit

  • This is a two rectangle feature. It can also be oriented the other way so, you know like this

  • We also have three rectangle features which are like this where you're doing sort of maybe the middle subtract the outside or vice versa

  • And we have four-rectangle features, which are going to be kind of finding diagonal sort of corner things

  • So something like this
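
To make the feature types just described concrete, here is a minimal Python sketch that computes each one the slow, direct way by summing pixels. The function names, the sign convention (which block is subtracted from which) and the tiny test image are assumptions made for illustration; they are not taken from the talk or the paper.

```python
# Hypothetical helpers for illustration only; names and sign conventions are assumptions.

def region_sum(img, top, left, height, width):
    """Sum the pixels of one rectangle directly (no integral image yet)."""
    return sum(img[r][c]
               for r in range(top, top + height)
               for c in range(left, left + width))

def two_rect_horizontal(img, top, left, height, width):
    """Left block minus the block immediately to its right."""
    return (region_sum(img, top, left, height, width)
            - region_sum(img, top, left + width, height, width))

def two_rect_vertical(img, top, left, height, width):
    """Top block minus the block immediately below it."""
    return (region_sum(img, top, left, height, width)
            - region_sum(img, top + height, left, height, width))

def three_rect(img, top, left, height, width):
    """Middle strip minus the two outer strips (three strips side by side)."""
    left_strip = region_sum(img, top, left, height, width)
    middle = region_sum(img, top, left + width, height, width)
    right_strip = region_sum(img, top, left + 2 * width, height, width)
    return middle - (left_strip + right_strip)

def four_rect(img, top, left, height, width):
    """One diagonal pair of quadrants minus the other diagonal pair."""
    tl = region_sum(img, top, left, height, width)
    tr = region_sum(img, top, left + width, height, width)
    bl = region_sum(img, top + height, left, height, width)
    br = region_sum(img, top + height, left + width, height, width)
    return (tl + br) - (tr + bl)

# A made-up 4x4 grayscale patch with a bright diagonal.
img = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [1, 1, 9, 9],
    [1, 1, 9, 9],
]

print(two_rect_horizontal(img, 0, 0, 2, 2))  # 32
print(two_rect_vertical(img, 0, 0, 2, 2))    # 32
print(three_rect(img, 0, 0, 2, 1))           # -2
print(four_rect(img, 0, 0, 2, 2))            # 64
```

The four-rectangle response is the largest here because the patch was built to have exactly that diagonal pattern.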

  • Even if your image is small, right, you're going to have a lot of different possible features, even of these four types

  • So this four-rectangle feature could just be one pixel each, or each of these could be half the image. It can scale,

  • you know, and move around
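
To get a feel for how many features there are once each shape can scale and move around, the following Python sketch counts every position and size for a set of base shapes inside a 24 by 24 window (the window size mentioned later in the talk). The set of base shapes is an assumption: two-rectangle and three-rectangle features in both orientations, plus the four-rectangle feature.

```python
def count_features(window=24,
                   base_shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count every scaled and shifted copy of each base shape in a square window.

    Each base shape is (width, height) in blocks; a feature's pixel width must be
    a multiple of the base width, and likewise for its height.
    """
    total = 0
    for base_w, base_h in base_shapes:
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                # Number of places a w x h rectangle fits in the window.
                total += (window - w + 1) * (window - h + 1)
    return total

print(count_features())  # 162336
```

With these assumed shapes the count comes out at a little over 160,000, the same order of magnitude as the figure quoted later in the talk; the exact total depends on which shapes and orientations are included.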

  • Brady : What determines that? Mike : Um, so they do all of them, right?

  • Or at least they look at all of them originally

  • And they learn which ones are the most useful for finding a face. This, over a whole image of a face, isn't hugely

  • representative of what a face looks like, right? No one's face has two corners darker than the other two corners

  • That doesn't make sense, right, but maybe over their eye, maybe that makes more sense

  • I don't know, that's kind of the idea. So they have a training process which whittles down

  • which of these features are useful. The other problem we've got is that on an image

  • Calculating large groups of pixels and summing them up is quite a slow process

  • So they came up with a really nifty idea called an integral image which makes this way way faster

  • So let's imagine we have an image

  • Right, and so think -- consider while we're talking about this that we want to kind of calculate these bits of image

  • But minus some other bit of image, right? So let's imagine we have an image which is nice and small

  • It's too small for me to write on but let's not worry about it

  • Right, and then let's draw in some pixel values. Fast forward. Look at the state of that. That's a total shambles

  • This is a rubbable-out pen, right? For goodness sake

  • Right right okay okay so all right so

  • Let's imagine this is our input image. We're trying to find a face in it

  • Now I can't see one

  • But obviously this could be quite a lot bigger, and we want to calculate, let's say, one of our two-rectangle features

  • So maybe we want to do these four pixels up in the top

  • minus the four pixels below it. Now that's only a few additions: 7 + 7 + 1 + 2

  • minus 8 + 3 + 1 + 2

  • But if you're doing this over large sections of image and thousands and thousands of times to try and find faces

  • That's not gonna work

  • So what Viola and Jones came up with was this integral image, where we pre-compute

  • some of this arithmetic for us, store it in an intermediate form, and then we can calculate

  • rectangles minus other rectangles really easily

  • So we do one pass over the image, and every new pixel is the sum of all the pixels

  • above and to the left of it, including it. Right, so this will be something like this

  • so

  • 1 and 1 + 7 is 8 so this pixel is the sum of these two pixels and this pixel is going to be all these three

  • So that's going to be 12... 14... 23

  • and now we fast forward while I do a bit of math in my head

  • 8...17 maybe I did somebody's earlier, 24... On a computer this is much much faster
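
The single pass described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name and the 3x3 pixel values are made up (they are not the numbers on the board in the video).

```python
def integral_image(img):
    """Return ii where ii[r][c] is the sum of img over rows 0..r and columns 0..c."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        row_sum = 0  # running total of the current row up to column c
        for c in range(cols):
            row_sum += img[r][c]
            above = ii[r - 1][c] if r > 0 else 0
            ii[r][c] = row_sum + above
    return ii

# Made-up example values.
img = [
    [1, 7, 4],
    [2, 0, 3],
    [5, 6, 1],
]

for row in integral_image(img):
    print(row)
# [1, 8, 12]
# [3, 10, 17]
# [8, 21, 29]
```

The bottom-right entry (29) is the sum of every pixel in the image, just as 113 is for the image drawn in the video.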

  • The sum of all the pixels is 113. For example, the sum of this 4x4 block is 68.

  • Now, the reason this is useful, bear with me here:

  • if we want to work out what, let's say, the sum of this region is, what we do is we take this one,

  • 113, we subtract this one, minus 64,

  • All right, and this one,

  • minus 71, and that's taken off all of that and all of that, and then we have to add this bit in because it's been

  • taken off twice, so plus 40. All right, so that's four reads. Now funnily enough this is a 4 by 4 block

  • So I've achieved nothing

  • But if this was a huge huge image, I've saved a huge amount of time and the answer to this is 18

  • Which is 6 plus 6 plus 5 plus 1
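
The four-read trick (113 - 64 - 71 + 40 = 18 on the board) can be sketched as follows. This is an illustrative Python version under stated assumptions: the integral image is padded with an extra row and column of zeros so the edges need no special cases, the function names are hypothetical, and the pixel values are made up apart from the bottom-right block, which is set to 6, 6, 5, 1 to echo the example in the talk.

```python
def padded_integral_image(img):
    """ii[r][c] = sum of img over rows < r and columns < c (row 0 and column 0 are zeros)."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of any rectangle using exactly four reads of the integral image."""
    bottom, right = top + height, left + width
    return (ii[bottom][right]    # everything up to the bottom-right corner
            - ii[top][right]     # take off the part above the rectangle
            - ii[bottom][left]   # take off the part to the left
            + ii[top][left])     # add back the corner that was taken off twice

# Made-up image; only the bottom-right 2x2 block mirrors the talk.
img = [
    [1, 7, 4, 2],
    [2, 0, 3, 5],
    [5, 6, 6, 6],
    [3, 2, 5, 1],
]
ii = padded_integral_image(img)

print(rect_sum(ii, 2, 2, 2, 2))  # 18, i.e. 6 + 6 + 5 + 1
print(rect_sum(ii, 0, 0, 4, 4))  # 58, the whole image
```

The cost of `rect_sum` is the same four reads whether the rectangle covers four pixels or four million, which is where the time saving comes from.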

  • So the assumption is that I'm not just going to be looking at these pictures one time to do this, right?

  • There's lots of places a face could be I've got to look at lots of combinations of pixels and different regions

  • So I'm going to be doing huge amounts of pixel addition and subtraction

  • So let's calculate this integral image once and then use that as a base to do really quick

  • Adding and subtracting of regions, right?

  • and so I think, for example, a four-rectangle feature

  • is going to take something like nine reads or something like that and a little bit of addition. It's very simple
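
The "nine reads" estimate can be checked with a short sketch: the four quadrants of a four-rectangle feature share corners, so a 3x3 grid of integral-image values is enough. The Python below is illustrative only; the function names, the zero-padded integral image and the test patch are assumptions.

```python
def padded_integral_image(img):
    """ii[r][c] = sum of img over rows < r and columns < c (zero-padded on top and left)."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def four_rect_feature(ii, top, left, height, width):
    """Diagonal quadrants minus the other diagonal; each quadrant is height x width."""
    r0, r1, r2 = top, top + height, top + 2 * height
    c0, c1, c2 = left, left + width, left + 2 * width
    # Nine reads: the 3x3 grid of corners shared by the four quadrants.
    a, b, c = ii[r0][c0], ii[r0][c1], ii[r0][c2]
    d, e, f = ii[r1][c0], ii[r1][c1], ii[r1][c2]
    g, h, i = ii[r2][c0], ii[r2][c1], ii[r2][c2]
    top_left = e - b - d + a
    top_right = f - c - e + b
    bottom_left = h - e - g + d
    bottom_right = i - f - h + e
    return (top_left + bottom_right) - (top_right + bottom_left)

# Made-up patch with a bright diagonal.
img = [
    [9, 9, 1, 1],
    [9, 9, 1, 1],
    [1, 1, 9, 9],
    [1, 1, 9, 9],
]
ii = padded_integral_image(img)
print(four_rect_feature(ii, 0, 0, 2, 2))  # 64
```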

  • All right. So now how do we turn this into a working face detector? Let's imagine

  • We have a picture of a face, which is going to be one of my good drawings again

  • Now in this particular algorithm, they look at 24 by 24 pixel regions, but they can also scale up and down a little bit

  • So let's imagine there's a face here which has, you know eyes, a nose and a mouth right and some hair

  • Okay, good. Now as I mentioned earlier, there are probably some features that don't make a lot of sense on this

  • So subtracting, for example, if I take my red pen

  • Subtracting this half of image from this half. It's not going to represent most faces

  • It may be when there's a lot of lighting on one side, but it's not very good at distinguishing

  • Images that have faces in and images that don't have faces in

  • So what they do, is they calculate all of the features, right for a 24 by 24 image

  • They calculate all 180,000 possible combinations of 2, 3, and 4 rectangle features and they work out which one

  • For a given data set of faces and not faces, which one best separates the positives from the negatives, right?

  • So let's say you have 10,000 pictures of faces

  • 10,000 pictures of background which one feature best

  • says "this is a face, this is not a face" Right, bearing in mind

  • Nothing is going to get it completely right with just one feature
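
One way to picture "which one feature best separates the positives from the negatives" is a threshold test per feature, keeping the feature whose best threshold makes the fewest mistakes. The Python sketch below is a simplification and an assumption on several counts: the paper's training uses AdaBoost with weighted examples, whereas this version counts unweighted errors, and the feature names and response values are invented for illustration.

```python
def stump_predict(value, threshold, polarity):
    """Say 'face' (1) when the response falls on the chosen side of the threshold."""
    return 1 if polarity * value < polarity * threshold else 0

def best_stump(values, labels):
    """Try every threshold and direction; return (threshold, polarity, errors)."""
    candidates = sorted(set(values))
    thresholds = ([candidates[0] - 1]
                  + [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
                  + [candidates[-1] + 1])
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            errors = sum(stump_predict(v, threshold, polarity) != y
                         for v, y in zip(values, labels))
            if best is None or errors < best[2]:
                best = (threshold, polarity, errors)
    return best

def pick_best_feature(feature_table, labels):
    """feature_table maps a feature name to its response on each training image."""
    results = {name: best_stump(values, labels)
               for name, values in feature_table.items()}
    best_name = min(results, key=lambda name: results[name][2])
    return best_name, results[best_name]

# Invented responses: three face images (label 1) then three background images (label 0).
labels = [1, 1, 1, 0, 0, 0]
feature_table = {
    "eyes_vs_cheeks": [-30, -25, -18, 5, 12, -20],
    "left_vs_right": [3, -4, 10, 2, -6, 8],
}

name, (threshold, polarity, errors) = pick_best_feature(feature_table, labels)
print(name, errors)  # eyes_vs_cheeks 1
```

Even the winning feature still misclassifies one example here, which matches the point that no single feature gets everything right.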

  • So the first one, it turns out, is something like this

  • It's a two-rectangle feature that works out the difference between the area of the eyes and the area of the cheeks

  • So it's asking whether, on a normal face, your cheeks are generally brighter or darker than your eyes

  • So what they do is they say, okay

  • Well, let's start a classifier with just that feature right and see how good it is

  • This is our first feature, feature number one, and we have a pretty relaxed threshold

  • so if there's anything plausible in this region

  • we'll let it through, right, which is going to let through all of the faces and a bunch of other stuff as well that we

  • don't want, right? So if this is a yes, that's okay. If it's a no, then we immediately

  • fail that region of image, right? So we've done one test, which is, as we know, about four additions

  • So we've said for this region of image, if this passes we'll let it through to the next stage

  • Right and we'll say okay it definitely could be a face

  • It's not not-a-face. Does that make sense? Yeah, okay

  • So let's look at the next feature

  • The next feature is this one

  • So it's a three-rectangle feature and it measures the difference between the bridge of the nose and the eyes, right?

  • which may or may not be darker or lighter. All right, so there's a difference there

  • So this is feature number two, so I'm going to draw that in here number two

  • And if that passes we go to the next feature, so this is a sort of binary, they call it "degenerate decision tree"

  • Right, well because the decision tree is a binary tree. This is not really because you immediately stop here

  • you don't go any further. The argument is that

  • Every time we calculate one of these features it takes a little bit of time

  • The quicker we can say "no definitely not a face in there", the better. And the only time we ever need to look at all the features

  • Or all of the good ones is when we think, "okay, that actually could be a face here"

  • So we have less and less general, more and more specific features going forward right up to about the number

  • I think it's about six thousand they end up using. All right, so we say: does the first one pass?

  • Yes. Does the second one pass?

  • Yes, and we keep going until we get a fail and if we get all the way to the end and nothing fails

  • that's a face, right and the beauty of this, is that

  • For the vast majority of the image, there's no computation at all. We just take one look at it, first feature fails

  • "Nah, not a face". They designed a really good way of adding and subtracting different regions of the image

  • And then they trained a classifier like this to find the best features and the best order to apply those features

  • which was a nice compromise between always detecting the faces that are there and false positives and speed right?
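
The pass-or-fail chain described above can be sketched as a loop that stops at the first failure. This Python version is a simplified illustration under stated assumptions: each stage is a single threshold test, as in the talk's description (in the paper a stage can combine several features), and the stage tests, feature names and response values are invented.

```python
def run_cascade(window, stages):
    """Apply stages in order; reject as soon as one fails.

    Returns (is_face, number_of_stages_evaluated).
    """
    for index, stage in enumerate(stages, start=1):
        if not stage(window):
            return False, index  # early exit: no further work on this window
    return True, len(stages)

# Hypothetical stages, each reading a precomputed feature response from the window.
stages = [
    lambda w: w["eyes_vs_cheeks"] < 0,     # relaxed first test
    lambda w: w["nose_bridge_vs_eyes"] > 10,
    lambda w: w["mouth_vs_chin"] < -5,
]

# Invented feature responses for three image windows.
blackboard = {"eyes_vs_cheeks": 4, "nose_bridge_vs_eyes": 0, "mouth_vs_chin": 0}
dark_smudge = {"eyes_vs_cheeks": -12, "nose_bridge_vs_eyes": 3, "mouth_vs_chin": -9}
face = {"eyes_vs_cheeks": -20, "nose_bridge_vs_eyes": 25, "mouth_vs_chin": -8}

print(run_cascade(blackboard, stages))   # (False, 1)
print(run_cascade(dark_smudge, stages))  # (False, 2)
print(run_cascade(face, stages))         # (True, 3)
```

Most windows in a real image behave like `blackboard` and cost only one test, which is why the ordering of the stages matters so much for speed.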

  • And at the time, this was running on, I think, to give you some idea of what the computational technology was like in 2002,

  • This was presented on a 700 megahertz Pentium 3 and ran at 15 frames a second

  • which was totally unheard of back then. Face detection was kind of an offline thing, you know, at that time

  • So this is a really, really cool algorithm and it's so effective

  • that you still see it used in, you know, in your camera phone

  • and in this camera and so on, when you just get a little bounding box around the face and this is still really useful

  • because you might be doing deep learning on something like face recognition, face ID something like this

  • But part of that process is firstly working out where the face is, and why reinvent the wheel when this technique works really really well


Detecting Faces (Viola Jones Algorithm) - Computerphile

  • Published by 林宜悉 on January 14, 2021