I wanted to talk a little bit more about deep learning and some of the slightly larger and more interesting architectures that have been coming along in the last few years. So just a very brief recap, right? We've got videos on this. I'm going to draw my network from the top down this time, so rather than there being a square input image, I'm just going to draw a line, which is the image seen from the top. So you can work your animation magic and sort this all out for me. Brilliant.
So I'm going to be talking about deep learning and convolutional neural networks. A convolutional neural network is one where you have some input, like an image. You filter it using a convolution operation, and then you repeat that process a number of times to learn something interesting about that image, some interesting features. And then you make a classification decision based on it. That is usually what you do, right? So you might decide, well, this one's got a cat in it, or this one's got a dog in it, or this one's got a cat and a dog in it, and that's very exciting.
So, from the top down, because my pen's going to run out of ink if I start trying to draw too many boxes. You've got an input image, and it's quite large usually. So here's an input image and I'm going to draw it like this. This is from the top. So if this is my image, I'm going to go to the top and look at it straight down, which I realize looks sort of like that. Does that work? Now there are three input channels, because of course we usually have red, green and blue, right? So in some sense this is multi-dimensional. We're going to have our little filters, so I'm going to draw a couple of kernels. Let's maybe draw four.
We're going to do a convolution operation using this one on here. It's going to look over all three of these channels, it's going to scan along, and it's going to calculate some kind of feature, like an edge or something like this, and that's going to produce another feature, right? And there are four kernels, each going to do this, so we're going to have four outputs. Don't worry, I'm not going to draw an 800-layer-deep network this way. So each of these gets to look at all three channels. That's a bit of a quirk of deep learning that maybe isn't explained often enough: these kernels have an extra dimension that lets them look across all of the channels. So the next layer along will look at all four of these ones, and so on.
What we also then do, and why not use multiple colors? We then sometimes also spatially downsample. So we take the maximum of a region of pixels, so that we can make the whole thing smaller and fit it better on our graphics card. We're going to downsample this, so it's going to look like this. And then, okay, I'll just do a yellow one. Why not? Can we see yellow on this? We'll soon find out. Yeah. So let's say there's two kernels here, and you can kind of see it. I think we need to go pink here. Pink? Pink! All right, pink, forget yellow. No yellow on white: that was what I was told when I first started using PowerPoint. I like pink. Yeah, that can work. It kind of looks a bit like the red.
So that's going to look at all four of these, and there's two of them, so there are going to be two outputs, right? Just think of it in terms of four inputs, two outputs. So that's going to be sort of like this. I'm just going to go back to my blue and forget the colors now. And you just repeat this process for quite a while, right, depending on the network.
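To make the channel bookkeeping above concrete, here is a minimal sketch in plain Python of the steps just described: four kernels that each look across all three input channels, a max-pooling downsample, then two kernels that each look across all four feature maps. The image size (10 by 10), kernel size (3 by 3), pooling size (2 by 2) and the kernel values are assumptions chosen for illustration and are not taken from the talk; a real network would learn the kernel values and use a library rather than nested loops.

```python
def conv2d(image, kernels):
    """Valid cross-correlation, the operation deep learning libraries call convolution.

    image:   [in_channels][height][width]
    kernels: [out_channels][in_channels][k][k]
    returns: [out_channels][height - k + 1][width - k + 1]
    """
    in_ch, h, w = len(image), len(image[0]), len(image[0][0])
    k = len(kernels[0][0])
    out = []
    for kernel in kernels:  # one output feature map per kernel
        fmap = []
        for y in range(h - k + 1):
            row = []
            for x in range(w - k + 1):
                total = 0.0
                # The extra kernel dimension: sum over every input channel.
                for c in range(in_ch):
                    for dy in range(k):
                        for dx in range(k):
                            total += image[c][y + dy][x + dx] * kernel[c][dy][dx]
                row.append(total)
            fmap.append(row)
        out.append(fmap)
    return out


def max_pool(fmaps, size=2):
    """Keep the maximum of each non-overlapping size x size region."""
    pooled = []
    for fmap in fmaps:
        h, w = len(fmap), len(fmap[0])
        pooled.append([
            [max(fmap[y + dy][x + dx] for dy in range(size) for dx in range(size))
             for x in range(0, w - size + 1, size)]
            for y in range(0, h - size + 1, size)
        ])
    return pooled


def shape(t):
    return (len(t), len(t[0]), len(t[0][0]))


def constant_kernels(out_ch, in_ch, k=3, value=1.0):
    # Placeholder weights for illustration only; real kernels are learned.
    return [[[[value] * k for _ in range(k)] for _ in range(in_ch)]
            for _ in range(out_ch)]


# Assumed toy input: 3 channels (red, green, blue), 10 x 10 pixels.
image = [[[float(x) for x in range(10)] for _ in range(10)] for _ in range(3)]

features = conv2d(image, constant_kernels(4, 3))   # four kernels over three channels
pooled = max_pool(features)                        # spatial downsample
second = conv2d(pooled, constant_kernels(2, 4))    # two kernels over four feature maps

print(shape(features))  # (4, 8, 8)
print(shape(pooled))    # (4, 4, 4)
print(shape(second))    # (2, 2, 2)
```

The number of output feature maps always equals the number of kernels, and each kernel carries one slice per input channel, which is the "four inputs, two outputs" picture from the drawing.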
There are more advanced architectures, like ResNets, that let this become very, very deep, you know, hundreds of layers sometimes, but for the sake of argument let's just say it's into the dozens usually. So we're going to downsample a bit more, and so on, and then we'll get some kind of final feature vector, hopefully a summary of everything that's in all these images, sort of summarized for us. And that's where we do our classification. So we attach a little neural network to this here, and that all connects to all of these, and then this is our reading of whether it's a cat or not. That's the idea.
The problem with this is that the number of connections here is fixed. This is the big drawback of this kind of network. You're using this to do this very interesting feature calculation, and then you've got this fixed number: there's always three here, there's always one here. So this always has to be the same size, which means that this input also always has to be the same size. Let's say 256 pixels by 256 pixels, which is not actually very big. So what tends to happen is that we take the image that we were interested in, we shrink it to 256 by 256, and we put that in. And so when we train our network, we make a decision early on as to what kind of appropriate size we should use. Now, of course, that doesn't really make any sense currently, because we have lots of different sizes of image. Obviously they can't be too big, because we'll run out of RAM, but it would be nice if it was a little bit flexible.
The other issue is that this is actually taking our entire image and summarizing it in one value, so all spatial information is lost, right? You can see that the spatial information is getting lower and lower as we go through this network, to the point where all we care about is "is it a cat?", not "where is the cat?". What if we wanted to find out where the cat was, or
segment the cat, or segment a person, or count a number of people, right? To do that, this isn't going to work, because it always goes down to one. So that's kind of a yes or no, is it? Yeah, yes or no. You could have multiple outputs, if it was yes dog, no cat, you know, different outputs. Sometimes instead of a classification you output an actual value, like the amount of something. But in this case, let's not worry about it.
Now, you've told me that this is an amazing marker, so I'm going to have a go at this. Has anyone ever erased your marker in your videos? I mean, this is a first. Okay, it works, it's just going to take quite a while, because it's stuck. This rubber is tiny. You know, quality markers. All right. There we go.
All right. So the same input still produces this little feature vector, but now, instead of a fixed-size neural network on the end, we're just going to put another convolution of one pixel by one pixel. So it's just a tiny little filter, just one by one, and that's going to scan over here and produce an image of exactly the same size. But this, of course, will be looking at all of these and working out in detail what the object is, so it will have much more information than these ones back here. So, you know, this could be outputting a heat map of where the cats are, or where the dogs are, or the areas of disease in a medical image, or something like this. And so this is called a fully convolutional network, because there are no longer any fully connected or fixed-size layers in this network. So normal deep learning in some sense, or at least up until 2014-2015, predominantly just put a little neural network on the end of this
that was a fixed size. Now we don't do that. And the nice thing is, if we double the size of this input image, I mean, we're using more RAM, but this is going to double, this will double, and in the end this will also double, and we'll just get the exact same result, only bigger. So we can now put in different-sized images. The way this actually works in practice is that your deep learning library, like Caffe2 or PyTorch or TensorFlow, will allocate memory as required. So you put in an input image and it goes, well, okay, with that input image we're going to need to allocate this much RAM to do all this. And so the nice thing is that this can now have information on where the objects are as well as what they are, as pixel output. So we'll show a few examples of semantic segmentation on the screen so you can see the kind of thing we're talking about.
The obvious downside here, which is what I'm going to leave for another video, is that this is very, very small. You know, maybe this is only a few pixels by a few pixels or something like this. Or you haven't done that much downsampling, and so it's not a very deep network and you haven't learned a whole lot. If you are looking for where the cat is in this image, you'd have, kind of, it's down in the bottom left. It would be very, very general. So it would be, you know, this sort of area, and maybe there's something else going on over here. It depends on the resolution of this image.
It looks great with different colors and lines, but what are you actually using this stuff for?
All right, so, I mean, we have to extend this slightly, which I'm, you know, going to postpone for another video, because this is too small for us to be practical, right?
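As a rough illustration of why the one-by-one convolution removes the fixed-size restriction, the sketch below contrasts it with a fully connected head. It uses plain Python, and the feature maps come from a hypothetical `make_features` stand-in rather than a real network; the weights, the two-channel depth and the 4 by 4 and 8 by 8 sizes are all assumptions made for illustration.

```python
def conv1x1(features, weights, bias):
    """Apply a 1x1 convolution.

    features: [channels][height][width]
    weights:  [classes][channels]  (one weight per input channel, per class)
    bias:     [classes]
    returns:  [classes][height][width], the same height and width as the input
    """
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [
                bias[k] + sum(weights[k][c] * features[c][y][x] for c in range(channels))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for k in range(len(weights))
    ]


def fully_connected(features, weights, bias):
    """Classic fixed-size head: one weight per (channel, y, x) position."""
    flat = [v for channel in features for row in channel for v in row]
    if len(flat) != len(weights):
        raise ValueError(f"expected {len(weights)} inputs, got {len(flat)}")
    return bias + sum(w * v for w, v in zip(weights, flat))


def make_features(size):
    # Hypothetical stand-in for the output of the convolutional layers:
    # two channels of size x size values.
    return [[[float(c + 1)] * size for _ in range(size)] for c in range(2)]


weights_1x1 = [[0.5, -0.25]]  # one output class, two input channels (made-up values)
bias_1x1 = [0.1]

# The same 1x1 weights work on both sizes; the heat map grows with the input.
small = conv1x1(make_features(4), weights_1x1, bias_1x1)
large = conv1x1(make_features(8), weights_1x1, bias_1x1)
print(len(small[0]), len(small[0][0]))  # 4 4
print(len(large[0]), len(large[0][0]))  # 8 8

# The fully connected head is sized for a 4x4 feature map and nothing else.
fc_weights = [0.01] * (2 * 4 * 4)
fully_connected(make_features(4), fc_weights, 0.0)
try:
    fully_connected(make_features(8), fc_weights, 0.0)
    fc_accepts_large = True
except ValueError:
    fc_accepts_large = False
print(fc_accepts_large)  # False
```

The one-by-one convolution has one weight per channel and reuses it at every position, so its parameter count does not depend on image size; the fully connected head has one weight per position, which is what pins the input to a single size.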
What we could do is just upsample this. We could use linear or bilinear interpolation to just make this way bigger, like this, and have a bigger output image. It would still be very low resolution: you'd get the rough idea of where something was, but it wouldn't be great, right? So you could use this to find objects that you're looking for. For example, in our lab we're using this for things like analysis of plants: where the wheat is, how many there are. That can be useful in a field to try and work out what the yield or disease problems are going to be. You can do it for medical images: where are the tumors in this image? Segmenting X-ray images. We're also doing it on human pose estimation and face estimation, so, you know, where is the face in this image? Where are the eyes? What shape is the face? This kind of thing. So you can use this for a huge number of things. But we're going to need to extend it a little bit more to get the best out of it, and the extension we'll call an encoder-decoder network.
Are you tidying it up now? What are you doing? It's not neat enough, there's little un-rubbed-out bits. Bear with me, I'll start on the next video in a minute. Yeah. That's as good as it's getting.
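The bilinear upsampling mentioned above can be sketched in a few lines of plain Python. This is one possible version under stated assumptions: it uses the half-pixel-centre convention and clamps at the borders, which is a common choice but not the only one, and the 2 by 2 heat map and the factor of 4 are made-up illustration values.

```python
def upsample_bilinear(heatmap, factor):
    """Enlarge a 2D heat map by an integer factor using bilinear interpolation."""
    h, w = len(heatmap), len(heatmap[0])
    out = []
    for oy in range(h * factor):
        # Map the output pixel centre back into input coordinates, clamped to the edges.
        sy = min(max((oy + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for ox in range(w * factor):
            sx = min(max((ox + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            # Blend horizontally along the two neighbouring rows, then vertically.
            top = heatmap[y0][x0] * (1 - fx) + heatmap[y0][x1] * fx
            bottom = heatmap[y1][x0] * (1 - fx) + heatmap[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Assumed toy heat map: "nothing on the left, cat on the right".
coarse = [[0.0, 1.0],
          [0.0, 1.0]]
fine = upsample_bilinear(coarse, 4)
print(len(fine), len(fine[0]))  # 8 8
```

The output is larger and smooth, but it contains no detail that was not already in the coarse map, which is the low-resolution limitation described in the talk.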
Deep Learning - Computerphile. Level A2 (beginner). Published by 林宜悉 on 14 January 2021.