If there’s one deep net that has completely dominated the machine vision space in recent years, it’s certainly the convolutional neural net, or CNN. These nets are so influential that they’ve made Deep Learning one of the hottest topics in AI today. But they can be tricky to understand, so let’s take a closer look and see how they work.

CNNs were pioneered by Yann LeCun of New York University, who also serves as the director of Facebook's AI group. It is currently believed that Facebook uses a CNN for its facial recognition software. A convolutional net has been the go-to solution for machine vision projects in the last few years. Early in 2015, after a series of breakthroughs by Microsoft, Google, and Baidu, a machine was able to beat a human at an object recognition challenge for the first time in the history of AI.

It’s hard to mention a CNN without touching on the ImageNet challenge. ImageNet is a project that was inspired by the growing need for high-quality data in the image processing space. Every year, the top Deep Learning teams in the world compete with each other to create the best possible object recognition software. Since 2012, when Geoff Hinton’s team took first place in the challenge, every single winner has used a convolutional net as their model. This isn’t surprising, since the error rate of image detection tasks has dropped significantly with CNNs, as seen in this image.

Have you ever struggled while trying to learn about CNNs? If so, please comment and share your experiences.

We’ll keep our discussion of CNNs high level, but if you’re inclined to learn about the math, be sure to check out Andrej Karpathy’s amazing CS231n course notes on these nets.

There are many component layers to a CNN, and we will explain them one at a time. Let’s start with an analogy that will help describe the first component, which is the “convolutional layer”. Imagine that we have a wall, which will represent a digital image.
Also imagine that we have a series of flashlights shining at the wall, creating a group of overlapping circles. The purpose of these flashlights is to seek out a certain pattern in the image, like an edge or a color contrast for example. Each flashlight looks for the exact same pattern as all the others, but they all search in a different section of the image, defined by the fixed region created by the circle of light. When combined together, the flashlights form what’s called a filter. A filter is able to determine if the given pattern occurs in the image, and in what regions. What you see in this example is an 8x6 grid of lights, which together are considered to be one filter.

Now let’s take a look from the top. In practice, flashlights from multiple different filters will all be shining at the same spots in parallel, simultaneously detecting a wide array of patterns. In this example, we have four filters all shining at the wall, each looking for a different pattern. So this particular convolutional layer is an 8x6x4, 3-dimensional grid of these flashlights.

Now let’s connect the dots of our explanation:

- Why is it called a convolutional net? The net uses the technical operation of convolution to search for a particular pattern. While the exact definition of convolution is beyond the scope of this video, to keep things simple, just think of it as the process of filtering through the image for a specific pattern. One important note, though, is that the weights and biases of this layer affect how this operation is performed: tweaking these numbers impacts the effectiveness of the filtering process.
- Each flashlight represents a neuron in the CNN. Typically, neurons in a layer activate or fire. In the convolutional layer, on the other hand, neurons perform this “convolution” operation. We're going to draw a box around one set of flashlights to make things look a bit more organized.
- Unlike the nets we've seen thus far, where every neuron in a layer is connected to every neuron in the adjacent layers, a CNN has the flashlight structure. Each neuron is only connected to the input neurons it "shines" upon. The neurons in a given filter share the same weight and bias parameters. This means that, anywhere on the filter, a given neuron is connected to the same number of input neurons and has the same weights and biases. This is what allows the filter to look for the same pattern in different sections of the image. By arranging these neurons in the same structure as the flashlight grid, we ensure that the entire image is scanned.

The next two layers that follow are ReLU and pooling, both of which help to build up the simple patterns discovered by the convolutional layer. Each node in the convolutional layer is connected to a node that fires, as in other nets. The activation used is called ReLU, or rectified linear unit. CNNs are trained using backpropagation, so the vanishing gradient is once again a potential issue. Because the slope of ReLU is exactly 1 for any positive input (and 0 otherwise), the gradient passing back through active neurons is not shrunk at each step, so it is held more or less constant at every layer of the net. So the ReLU activation allows the net to be properly trained, without harmful slowdowns in the crucial early layers.

The pooling layer is used for dimensionality reduction. CNNs tile multiple instances of convolutional layers and ReLU layers together in a sequence, in order to build more and more complex patterns. The problem with this is that the number of possible patterns becomes exceedingly large. By introducing pooling layers, we ensure that the net focuses on only the most relevant patterns discovered by convolution and ReLU. This helps limit both the memory and processing requirements for running a CNN.

Together, these three layers can discover a host of complex patterns, but the net will have no understanding of what these patterns mean.
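To make the flashlight analogy concrete, here is a minimal sketch of the three layers described so far, written in plain Python. It is an illustration under stated assumptions, not how any particular library implements them: the function names, the 5x5 image, and the 2x2 filter values are all hypothetical, and the filter is hand-picked rather than learned. Strictly speaking, the sliding operation shown is cross-correlation, which is what deep learning libraries commonly compute under the name "convolution".

```python
def convolve2d(image, kernel, bias=0.0):
    """Slide one shared filter over the image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Each output position is one "flashlight": it uses the SAME
            # weights and bias as every other position, but looks at a
            # different patch of the image.
            total = bias
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Rectified linear unit: keep positive responses, zero out the rest."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep only the strongest response in each non-overlapping window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Hypothetical 5x5 image: a bright vertical stripe on a dark background.
image = [[0, 1, 1, 0, 0] for _ in range(5)]

# Hypothetical hand-picked filter that responds to dark-to-bright edges.
kernel = [[-1, 1],
          [-1, 1]]

conv = convolve2d(image, kernel)   # 4x4 map; negative at bright-to-dark edges
activated = relu(conv)             # negatives are zeroed out
pooled = max_pool(activated)       # 4x4 map reduced to 2x2

print(conv[0])    # first row of the raw filter response
print(pooled)     # reduced map that keeps the strongest responses
```

In a real CNN the filter values are not chosen by hand: they are the weights that backpropagation adjusts during training, and a layer holds many such filters at once.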
A fully connected layer is therefore attached to the end of the net in order to equip it with the ability to classify data samples.

Let’s recap the major components of a CNN. A typical deep CNN has three types of layers (convolutional, ReLU, and pooling), all of which are repeated several times. These layers are followed by a few fully connected layers in order to support classification. Since CNNs are such deep nets, they most likely need to be trained using server resources with GPUs.

Despite the power of CNNs, these nets have one drawback. Since they are a supervised learning method, they require a large set of labelled data for training, which can be challenging to obtain in a real-world application.

In the next video, we’ll shift our attention to another important deep learning model: the Recurrent Net.
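The recap above can be sketched as a tiny forward pass: two convolution, ReLU, and pooling stages followed by one fully connected layer that turns the remaining features into class probabilities. This is only a sketch under assumptions. The helper names, the 11x11 image, the filter values, and the fully connected weights are all hypothetical and untrained; training with backpropagation on labelled data is omitted, and softmax is assumed here as a common way to produce class probabilities.

```python
import math


def conv_relu_pool(image, kernel):
    """One stage: shared-filter convolution, then ReLU, then 2x2 max pooling."""
    kh, kw = len(kernel), len(kernel[0])
    rows = len(image) - kh + 1
    cols = len(image[0]) - kw + 1
    # Convolution and ReLU in one pass (stride 1, no padding).
    fmap = [[max(0.0, sum(image[r + i][c + j] * kernel[i][j]
                          for i in range(kh) for j in range(kw)))
             for c in range(cols)]
            for r in range(rows)]
    # 2x2 max pooling over non-overlapping windows.
    return [[max(fmap[r][c], fmap[r][c + 1],
                 fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]


def fully_connected(features, weights, biases):
    """Every output is connected to every input feature."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tiny_cnn(image, kernels, fc_weights, fc_biases):
    """Repeat conv/ReLU/pool once per kernel, then classify."""
    x = image
    for kernel in kernels:
        x = conv_relu_pool(x, kernel)
    flat = [v for row in x for v in row]  # flatten the final feature map
    return softmax(fully_connected(flat, fc_weights, fc_biases))


# Hypothetical 11x11 image: dark left half, bright right half.
image = [[1.0 if c >= 5 else 0.0 for c in range(11)] for _ in range(11)]

# Hypothetical, untrained parameters chosen only for illustration.
kernels = [
    [[-1.0, 1.0], [-1.0, 1.0]],   # stage 1: dark-to-bright vertical edge
    [[0.5, 0.5], [0.5, 0.5]],     # stage 2: local average of stage 1 output
]
fc_weights = [[0.5, 0.5, 0.5, 0.5],       # score for class 0
              [-0.5, -0.5, -0.5, -0.5]]   # score for class 1
fc_biases = [0.0, 0.0]

probabilities = tiny_cnn(image, kernels, fc_weights, fc_biases)
print(probabilities)  # two class probabilities
```

With an 11x11 input and 2x2 filters, the feature map shrinks to 5x5 after the first stage and 2x2 after the second, so the fully connected layer sees just four features. This is the dimensionality reduction that pooling is meant to provide.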
Convolutional Nets - Ep. 8 (Deep Learning SIMPLIFIED). Posted by alex on January 14, 2021.