[VIDEO PLAYBACK] [MUSIC PLAYING] [END PLAYBACK] DAVID J. MALAN: All right. This is CS50, and this is Yale University. Welcome to week seven. So this class marks a transition between the first part of the course, where we have been learning about the magical world of C-- an extremely powerful low-level programming language that has allowed us to solve many problems, indeed-- and a second part of the course, where we move towards more high-level programming languages, such as Python. In fact, this transition had already begun last week when we learned about HTML, CSS, and the like. And we left off, really, by commenting on the fact that this new programming language, Python, can be conveniently used to write something like the back end of google.com or facebook.com or any web service, really, that accepts some parameters, parses them, possibly looks up some data in a database, possibly stores some data in a database, and gets back to the user with dynamic output. So before we get to that end and see how Python can be used to indeed write the back end of a web server, it is instructive to see how Python can be used as a tool, really, to do data analysis, much like a data scientist would do. And this is what we are going to see today, diving into the magic world of machine learning. But first of all, what is machine learning, really? So this is a funny sequence of vignettes that I found online on pythonprogramming.net. And they represent the stereotypical life of programmers/researchers in machine learning. So let's see on the top left here. Well, society thinks that they are creating hordes of robots, possibly with the idea of conquering the world. Right? Friends think that they're hanging out with robots, really. Now, parents-- their parents think that programmers in machine learning spend most of their time in data centers with no windows, apparently. What about other programmers?
Well, they might think that programmers do fancy mathematics. And what about themselves? Well, they typically think that they're involved with some fancy visualization, data analysis. But at the end of the day, what they really do-- and here we are-- is use Python, and not just use Python to implement some algorithm, but really use Python to import some algorithm, as we are seeing here. So we do not know Python yet. But this line of code looks extremely readable, doesn't it? There is English there. It says "from sklearn"-- we don't know what this is yet-- "import"-- I already mentioned svm, support vector machine. It's a function to run a machine-learning algorithm. So we don't know Python yet, but we're already able to decipher, more or less, what is going on. And indeed, sklearn, as we will see, is a so-called module in Python-- a library, if you wish, in C terms-- from which we are importing an algorithm, a function. Now, this line exemplifies a characteristic feature of Python-- namely, its readability. And it is often the case that Python code is referred to as being similar to pseudocode, precisely for the fact that it allows us to express very powerful ideas with a couple of lines that are extremely readable. Now, this very characteristic of Python, together with our familiarity, at this point in the course, with C, is what will allow us today to quickly see how Python can be used as a data processing tool to implement and run machine learning algorithms. Well, so back to the question-- what is machine learning, really? So as the previous sequence of vignettes was suggesting, in popular culture, at least, machine learning is often associated with the world of AI, artificial intelligence. Typically, we think of machines in terms of robots. And literally, there are countless science fiction movies about this theme, typically representing some robots turning evil against humanity, or along those lines.
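That import line can be put in context with a minimal sketch. sklearn and its svm module are real, but the two toy training points below are invented purely for illustration:

```python
# A minimal sketch of importing a machine-learning algorithm rather
# than implementing it: scikit-learn's support vector machine.
from sklearn import svm

# Toy labeled data, invented for illustration: one point per class.
X = [[0, 0], [1, 1]]
y = [0, 1]

clf = svm.SVC()                    # a support vector classifier
clf.fit(X, y)                      # learn from the labeled points
print(clf.predict([[0.9, 0.9]]))   # classify a new, unseen point
```

The point is the one the lecture makes: the algorithm is imported, not written from scratch, and the code reads almost like English.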
Indeed, this is a modern interpretation of an old fear dating back, possibly, to Frankenstein in the 1800s and beyond. But really, if we think of the way machine learning is having an impact on our lives on a day-to-day basis, it's not necessarily related to robots, per se. It has more to do with the following set of applications that we indeed use on a daily basis. So let's just think of search engines, for instance, that allow us to look for whatever we might feel like in the world wide web, and in a matter of literally milliseconds to get back an ordered list of results based on the query that we enter. Think about image recognition, the possibility to catalog, to search for an image based on its subject. Think about speech recognition software. These days, we can all talk to our phone and ask, what's the weather like, or show me the movies around me. Or finally, to round out the four sets of applications I mentioned, the world of natural language processing, the amazing ability to translate a document of text from one language to another in real time, or the ability to, say, infer the meaning, the semantics of a document with no human intervention. So these are just a few of the applications that are indeed backed by machine learning algorithms. And here by the word machine, we really mean an algorithm running, most likely, in the cloud in a data center. And today, we use these applications on a daily basis. And so have you ever wondered what they're all about? How can we get started thinking about designing one of these applications? And this is what we're going to learn today. So in particular, we will be focusing on two applications-- image recognition and natural language processing. But before that, let's go back to week zero. So we have seen this diagram early on in the course. It represents a general-purpose algorithm. The black box is an algorithm that takes some inputs from the left-hand side.
It processes them, and it delivers an output to the user. Now the class of applications that we're going to see today very much fits into this framework. Just think about image recognition, for instance. Well, we might want to have an algorithm that takes as inputs images-- say here an image of a horse, an image of a car-- and is capable of realizing, of recognizing, that there is, indeed, a horse or a car in that image, and gets back to us with strings, such as "this is a horse." Or, well, in the world of natural language processing, think about an algorithm where we could do something like the following-- say I want to pass as an input to this algorithm one of my favorite novels, 1984 by George Orwell. So this is just a part of it-- "BIG BROTHER is watching you." But say that we want the entire book to be fed as input to this algorithm. We want the algorithm to be able to recognize the semantics of the book. So we want the algorithm, with no human intervention, to be able to get back to us and tell us, look, Patrick, this book is about politics. It's about propaganda. It's about privacy. So in both of these two applications-- image recognition and natural language processing-- it is already clear that what the machine learning algorithm is doing is trying to infer some hidden structure in the input, right? And if we phrase the problem like that, well, we have seen something like that earlier on in the course-- in fact, many examples of a scenario where we are given an input with some hidden structure, and we want to design-- you have done that in problem set four-- an algorithm that can decipher, that can crack the input, and gives us an output. So this is Whodunit. And per the specification, we were given an image with some red noise. And you were told to design an algorithm to get back the hidden message. So in this respect, you might think applications such as image recognition share a similar behavior, don't they? But there is a key difference.
Can anyone spot the difference? We're getting started. So one difference in the picture is that instead of just one image, there are two images. And indeed, that is somehow to the point, in the sense that with problem set four, we were given an image with a hidden message, indeed. But we were told the structure of the hidden message. In other words, we were told that the image had a lot of red noise, and we were told that in order to decipher, to really find the hidden message, we should lower the intensity of the red pixels and possibly raise the intensity of the other pixels. But in machine learning applications, this is not typically what we have in mind. So we want an algorithm that can work with any sort of image we may want to feed it with. And this is what brings us to one of the key differences in the class of applications that we are going to see, namely the fact that it is the algorithm itself that is trying to find out the hidden structure in the input. And the way it can do this is by having access to what is called training data. In other words, if we want to design an algorithm that is capable of telling us, Patrick, look, this is a horse, then in order to work, this algorithm needs to have access to a lot of images of horses. And based on these images, it is able to figure out the hidden structure, so that once you feed it, again, an image of a horse, it can say, this is a horse. And this is what we are going to see today, starting precisely from the example of image classification. So what is the example we're going to play with? It's the following-- suppose that we are given a data set of handwritten digits. So there are 10 digits, from zero to nine. And for each of these digits, we are given a collection of handwritten images representing that digit. So for instance, for digit zero, we are given many of these images. And each of them shows a different way of writing the digit zero.
So what we set as our goal for today is to actually design, in this case, an algorithm that can take an image of digit zero or an image of digit six or any other digit and, indeed, get back to us with the string "0" or "6" and so on. And the way we can do that, as we will see, is by having access to so-called training data, that is, by inferring the [INAUDIBLE] structure behind the images we want to use. But before we get to talk about images, let us just abstract a little bit. And let us think that we live in a one-dimensional world, a Lineland, if you wish, where the only thing that we can see is a straight line. Say we go to school, and the first thing that we are told on day one of school is that, OK, I see a point here. The teacher is telling us, well, look, Patrick. This is number zero. Then we go again, day two of school. We see another point. The teacher is telling us, this represents a zero. Day three, we see another point. It's the only thing that we can see in a one-dimensional world. And the teacher is telling us, well, this is a six. And so on. So day four, we see another point. This represents number zero. And so on-- this point represents number six, and this one represents number six as well. So in other words, we have been exposed to a labeled training set of data-- points and the associated labels corresponding to the digits they represent. So say now that we are presented, the next day of school, with a so-called test point over here. And we are asked, what should this number be? What should this point represent? Which digit? Anyone? Number six, indeed. And congratulations, because this is the first example. This is the first machine learning algorithm that we are going to discuss today, easy as it sounds. So there is a lot going on. First of all, just appreciate the fact that I haven't told you anything about the structure. I just presented to you a set of points. I didn't tell you anything about the game.
It's not like in problem set four, where we were told about the specifics of the image to crack. In this case, I was just presenting you with points. And then at a certain moment, I was asking you, OK, what is this point? And as you have guessed, well, this should represent the number six. Why? Well, because somehow, we have figured out that the points representing number zero are grouping on one side of the line. Points representing number six are grouping on another side of the line. And so in our head, what we are doing is looking at the closest point among the ones that we have been exposed to previously-- the point with the minimal distance, the so-called nearest neighbor-- and just labeling the new point that we're given-- the test point-- with the label of the closest point we've been exposed to. So this algorithm is called the nearest neighbor classifier. And indeed, this is what we're going to use to classify handwritten digits. But before we get to that, let's assume that we get promoted. And so in the second year of school, we now see the world in two dimensions, much like a Flatland. And we are exposed to a set of points-- again, a labeled set of points. So this point represents number zero. This is number six, number six, number zero, number six, number zero, six again. And finally, we're presented with a test point. So any guess on what this test point should represent? It's the number zero. And indeed, same reasoning, just done in two dimensions-- nearest neighbor classifier. So in a way, this is very much intuitive. And what we are left with is the question of, OK, can we map an image into a point in a space? Because if we were able to do that, if we can take an image of a digit and just interpret that image as a point in a space, then we could repeat exactly the same procedure I told you about. In this case, the points are really images in this abstract space. But we have access to a labeled training set.
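The nearest neighbor reasoning just described can be sketched in a few lines of Python for Lineland. The lecture doesn't give exact coordinates, so the point locations and labels below are invented for illustration:

```python
# Nearest neighbor classification in one dimension: label a test
# point with the label of the closest training point.
train_points = [1.0, 1.5, 2.0, 7.0, 7.5, 8.0]   # hypothetical Lineland data
train_labels = ["0", "0", "0", "6", "6", "6"]   # digit each point represents

def nearest_neighbor(test_point):
    # Distance to each training point, then the index of the minimum.
    distances = [abs(p - test_point) for p in train_points]
    best = distances.index(min(distances))
    return train_labels[best]

print(nearest_neighbor(6.5))  # closest training point is 7.0, labeled "6"
```

The same few lines work in two dimensions once the distance computation is generalized, which is exactly where the lecture goes next.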
So we know what these digits represent. And then when we are asked, OK, what is this new digit? Well, we apply the same logic, right? And this is what we are going to see. Indeed, the world can be seen as having more than one or two or three dimensions. And much of machine learning is really about interpreting the data we are given as points in some sort of high-dimensional world. And in making this jump, we might feel a little bit like the character in one of my favorite novels, namely, Flatland by Edwin Abbott Abbott. This is the [INAUDIBLE] of the first edition in 1884. The plot of Flatland is the following-- the story describes a two-dimensional world occupied by geometric figures. The narrator is a square named A Square who guides the reader through some of the implications of life in two dimensions. On New Year's Eve, A Square dreams about a visit to a one-dimensional world, Lineland, inhabited by lustrous points, in which he attempts to convince the realm's monarch of a second dimension, but he is unable to do so. Following his vision, A Square is himself visited by a three-dimensional sphere named A Sphere, which he cannot comprehend until he sees Spaceland, a three-dimensional world. Many movies have been produced on this story line. Let's see a trailer of one of those. [VIDEO PLAYBACK] -Imagine a vast plane, a world of only two dimensions on which triangles, squares, pentagons, and other figures live and move freely about. -Configuration makes the man. -Get to your squarical! Now! -You're only a square. -Thanks, brother. -They know nothing of our three-dimensional world. -Such a notion is, of course, absurd and, furthermore, illegal! -But that's all about to change. -Where did you come from? -I come from space, the third dimension. -No, no, no, no, no, no, no, no! No! You're not serious! -Based on the beloved novel by Edwin A. Abbott. Tonight, our world faces a grave threat. [END PLAYBACK] DAVID J. MALAN: Right. And then it goes on. I love it.
It's one of my favorites. You should watch the movie. And so indeed, are we ready to go beyond Lineland, Flatland, and Spaceland? So let's do it. So here I just represented what Lineland looks like. Just one line, right? The only thing that we can see in one dimension is really points. And here I drew two points for us. What we have here is the coordinate system, just to be able to measure distances, so to speak. So in this case, this point is at location one. This is what this one represents. And this point is at location four. So this is the picture in one dimension, Lineland. Flatland we have seen-- a two-dimensional world. Indeed, we can visualize points here. In this case, each point is represented by two coordinates. We have the horizontal coordinate and the vertical coordinate. So coordinate 1, horizontal, and 2, vertical. And so on for the 4, 4. Now we go to Spaceland-- indeed, a three-dimensional world. And even here, it's pretty easy to visualize what points are, really. There are three coordinates. And so we can simply refer to a point with the associated coordinates. So can we go beyond? Indeed. Indeed we can. It's not that easy to draw points or the like in higher dimensions. But we can indeed think of a point, say, in four dimensions, by referring to it by a set of coordinates-- say, 0, 1, 4, 5. Right? So OK, the axes in this reference don't mean much. We're no longer in three dimensions. Still, they just represent sort of an abstract space. So indeed, I cannot draw four axes here, but we can definitely think of a point indexed by four coordinates as a point living in a four-dimensional world. And so on. If we want a point in a five-dimensional world, well, here we are. And so we can go back to the idea, can we map an image from this data set we have access to to a point in a higher-dimensional space?
And as we have seen in problem set four, images are just a collection of pixels-- in this case, for the smiley, just a collection of zeroes and ones, zeroes being associated with the color white, ones with the color black. And even in the data set that we are playing with, each image is simply an eight by eight array of pixels, whereby each pixel in this case is a number between 0 and 16. So in some sense, we have eight by eight-- so it's 64. So it's really a point in a 64-dimensional space, isn't it? So we can really use this idea and interpret images as points in this 64-dimensional space. And now we can indeed run the nearest neighbor classifier we have seen previously. Namely, we are in this 64-dimensional space. We see labeled images that are presented to us. We are exposed to them, if you wish. This represents a six, a six. This represents a zero. And so on, until we are presented with a test point again. Fine. We know how to deal with it. We simply assign to this test point the label of the training point that is closest to it. So the only thing that is missing here is a notion of distance, isn't it? So let's see how we can go about thinking about this. So in a one-dimensional world, it's pretty easy to compute distances. Indeed, the distance between these two points is simply what? This line, right? So it's simply 4 minus 1-- 3. In a two-dimensional world, it's a little bit more complicated, but we are still able to do so. If this is the distance we want to have access to, well, we have the good old Pythagorean theorem. So we first compute this distance-- the difference between the horizontal coordinates of the two points, which is 4 minus 1, so 3. Then we square it. And we add the vertical distance between the two points, which is 4 minus 2-- 2. And then we square it. And then we take the square root. Right? So in this case, well, it will be the square root of 13. So even in a three-dimensional world, it's the same idea.
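The two-dimensional computation just described can be checked directly in Python before generalizing it to more dimensions:

```python
import math

# Distance between the points (1, 2) and (4, 4), per the Pythagorean
# theorem: square the coordinate differences, sum them, take the root.
horizontal = 4 - 1          # difference of horizontal coordinates: 3
vertical = 4 - 2            # difference of vertical coordinates: 2
distance = math.sqrt(horizontal ** 2 + vertical ** 2)
print(distance)             # the square root of 13, about 3.61
```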
In order to get the distance between the two points, we can simply work coordinate-wise. So we can take the distance between the first coordinates, which is a 3, and square it, plus the distance between the second coordinates, which is a 2, squared, plus the distance between the third coordinates, which is a 3, squared again. And we can take the square root of this. So we indeed have a formula to compute distances. And this formula doesn't simply hold in one-, two-, or three-dimensional space. It holds in as many dimensions as we may want to work in. And so we have all the ingredients to run the nearest neighbor classifier at this point. So just to give you an idea of what these distances look like, say we want the distance between these two images, thought of as points in this 64-dimensional space. Well, in this case, if we apply this formula coordinate-wise, we get a distance of 31.98. Say now we want to consider the distance between an image representing digit zero and an image representing digit six. Well, we do the math, and we find out 45.97, which is, indeed, bigger than what we had previously. Previously, we had 31 as the distance between two images representing the same digit. And it should be smaller than the distance between two images representing different digits. So at this point, we are ready to see some Python code. We are going to actually implement the nearest neighbor classifier I just described to you in abstract terms, using this data set. Again, for each digit, we have access to a collection of different images representing that digit. So let's get to see some Python. First of all, let me show you what Python is. We can go to the CS50 IDE and simply fire up the Python interpreter by typing python. And here we are. We're inside the interpreter, so to speak. At this point, we can run Python code. So let's see. We can write x = 3. y = 5. x + y equals-- guess what? 8. Amazing, isn't it? Coming from the world of C, this is totally amazing.
Many things are happening here. First, there is no need to declare variables whatsoever. I didn't write int x = 3. I simply wrote x = 3. And in fact, we might write something like x = 'a'. y = 'b'. And guess what-- x + y is equal to the string 'ab'. So in Python, there is no difference between single quotes and double quotes, as there is in C. And indeed, we do not need to declare the type of a variable. Another key difference with respect to C is the fact that it's immediate. There is no clang or make or the like. There is no need for us to call the compiler. I was simply running the code, and the interpreter was in fact interpreting, or running, the code line by line. So indeed, there is a compiler behind the scenes. But we do not need to get involved with it. This is one of the beauties of Python. And in fact, coming from the world of C, we can read off Python code fairly easily at this point. Now, the syntax is a little bit different. So let's see what a for loop would be. for i in [3, 5, 7]: print(i). And ta-da, this is a for loop. So a few syntax differences, right? First of all, it is more like a for-each loop, where we loop through each element in this array, if you wish. And then there is a colon, which is easy to miss. More interestingly, there are no brackets, no curly brackets. So how does Python know that the block of code we want to execute inside the for loop is the line "print(i)"? Well, it goes by indentation. So if I were to repeat the line of code that I just presented to you but without the indentation here, you would see an error. So in Python, there is no need for curly brackets. It's simply a matter of the indentation you use. And it is good style to use indentation even in C code, so why the curly brackets after all? So OK, I could carry on like this and show you some more Python code within the CS50 IDE. But for the sake of exposition, let me actually close this and go to the following way of presenting code to you. So these are called markdown-style notebooks.
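The interpreter session just described, written out as a short script:

```python
# No type declarations: x and y are just names bound to integers.
x = 3
y = 5
print(x + y)        # 8

# Rebinding the same names to strings is fine; + now concatenates.
x = 'a'
y = "b"             # single and double quotes are interchangeable
print(x + y)        # ab

# A for loop: the colon and the indented body replace C's braces.
for i in [3, 5, 7]:
    print(i)
```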
So the idea is that while the code is indeed run line by line, as Python does, this way of presenting the material allows me to group lines of code together. So here it is, what I was writing earlier in the CS50 IDE-- the same line of code, the same line of code again when it comes to manipulating strings, the for loop that I presented to you, and so on. So as we see again, we don't know the syntax yet, but we can read off what is happening, coming from C. This is an if statement, for instance. Again, there are syntax differences. Indeed, there are no brackets here. There is the colon, and there is the indentation. But we can really decipher what is going on. So this is one of the beauties of Python, and we are going to rely on this beauty today to actually parse, almost line by line, some blocks of code that will allow us to quickly jump into the action and see some meaningful, cool output from machine learning algorithms. So let's do that. We go to the world of supervised learning, namely image recognition. But before we get to that, what is supervised learning? So applications in machine learning typically fit into either supervised learning or unsupervised learning. Today we are going to see both. The example of image recognition fits into the category of supervised learning, as we have access to a labeled, as I mentioned and stressed earlier, data set. So we are presented with a set of points or images with a label associated with each of them. As we have these labels, we're doing supervised learning. So let's start with points, and let's see how Python can be used in Flatland, if you wish. So Python has many built-in data types and functions. Indeed, we are going to see much more of this next week. But when it comes to data processing, data science types of applications, typically people rely on external data structures and functions. And so this is what we are going to do today. And these are the lines of code that I'm writing here.
I'm importing two modules, as they're called-- libraries, if you wish, in the world of C. The first module is numpy. It's one of the main modules for scientific computing. The second module is matplotlib. It will allow us to easily plot graphs. And the third line of code, don't pay attention to it. It is simply required in this notebook style to tell the notebook to print whatever graph we are producing inline with the output. So one of the cool things about Python is that it is an extremely popular language. And being popular, in computer science, is really important, one reason being that there is a lot of code out there that we can simply go and import. If you wish, numpy and matplotlib are along these lines. Someone else has written libraries, modules of code, that we can easily import. Now indeed, we would need to install these modules before we can import them. But for the sake of today, just ignore the fact that we need to install them. So let's just assume that we can easily import these modules with the lines of code that I am presenting to you. OK, let us create a training data set in Flatland. So here is the code. The first line creates a numpy array, meaning an array within this numpy scientific module. So it's very simple. We're creating an array of six points. Each point is a two-dimensional point, so it is indexed by two coordinates. And in the second line, instead, we are creating, in this case, a built-in Python list, as we will see. But it's simply an array of strings, if you wish. And each element in the array Y_train is simply a color, the name of a color. We have three red strings and three blue ones. So shortly, we will plot this collection of points. But before we get to do that, let me just show you some cool features of the Python syntax. So X_train is an array of six points, each of them being a two-dimensional vector, if you wish. But you can also interpret this array as being a two-dimensional array.
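A sketch of those two lines. The exact coordinates below are my reconstruction from the values quoted later in the lecture (the fifth point at 7, 6; horizontal components 1, 2, 3, 5.5; the nearest neighbor distance of about 1.8), so treat them as illustrative:

```python
import numpy as np

# Six two-dimensional training points, as a numpy array
# (coordinates reconstructed, not taken verbatim from the lecture).
X_train = np.array([[1, 1], [2, 2.5], [3, 1.2],
                    [5.5, 6.3], [6.9, 8], [7, 6]])

# One label per point, as a built-in Python list of color names.
Y_train = ['red', 'red', 'red', 'blue', 'blue', 'blue']

print(X_train.shape)   # (6, 2): six points, two coordinates each
```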
And so we can use the syntax that I present to you here. So the print function is simply printing whatever is inside it. And with this line of code, X_train[5, 0], what we are doing is taking the fifth element in the array of points. Python is a zero-indexed language, much like C. So we start counting from zero. And so in this case, the fifth point is the last one in this array-- namely, 7, 6. So we take the fifth point, the last one in that array, and we print its zeroth coordinate. And if we run that, the interpreter outputs 7, which is indeed the horizontal coordinate, if you wish, of the last point in the collection. Now, we can do the same while changing from the horizontal coordinate to the vertical one. And so we get 6, which is here. Now, one other key feature of Python-- we are going to get to see a little bit of it today-- is the so-called slicing syntax. So slicing syntax is a convenient way to extract a collection of elements from an array, as in this case. So in this case, what we are doing-- the first line of code is saying, OK, Python. Just look at all the points in the array. Take the horizontal coordinate, the zeroth coordinate, and just print all the elements in the horizontal coordinate. Indeed, if you were to compare, we have a 1, a 2, a 3, a 5.5. These are all the first, the horizontal, components of each point. And so on with the second component. So this is one of the uses of this slicing syntax. And what we can use this syntax for is to conveniently plot these points. So now we are using-- OK, plt.figure is simply saying, Python, just expect that I'm going to print a figure. And the last line, as well, the .show, is simply saying, just output the plot. So the magic, really, everything, is happening in the line in between-- the scatter function that is contained, again, in the matplotlib module we are importing.
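Indexing and slicing as just described, using the same reconstructed array (the coordinates are stand-ins consistent with the values quoted above):

```python
import numpy as np

X_train = np.array([[1, 1], [2, 2.5], [3, 1.2],
                    [5.5, 6.3], [6.9, 8], [7, 6]])

# Two-dimensional indexing: row 5, column 0 (zero-indexed).
print(X_train[5, 0])   # 7.0, the horizontal coordinate of the last point
print(X_train[5, 1])   # 6.0, its vertical coordinate

# Slicing: all rows, one column -> every horizontal (or vertical)
# coordinate at once.
print(X_train[:, 0])
print(X_train[:, 1])
```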
So this scatter function takes two arrays, an array of horizontal coordinates, if you wish, that we have access to through the slicing syntax, and an array of vertical coordinates. And then we have s equals 170-- simply the size, if you wish, of these points. And the color. This is one of the beauties of Python, again. We are using the slicing syntax again on the label array. Recall that Y_train was an array of colors. So every time we are plotting a point, we are also taking the corresponding color from the array Y_train. So if we do that, just appreciate that: with essentially one line of code, we have a plot here. And there are indeed six points. Three of them are red. Three of them are blue. So this represents our so-called labeled training set. Instead of having digits zero and six, now we have colors red and blue. The idea is the same. So what we can do, we can now add a so-called test point, say at location 3 and 4. So why don't we plot it? We use the same lines of code as before to plot. We add this new line with the new test point. And we have it plotted with the color green. And this is the output. Flatland. So now what we want to do, we want to run this nearest neighbor classifier. And we know how, right? We simply look at the point that is closest to the green point here. And we associate to the green point either the color red or the color blue, depending on whatever color its nearest neighbor has. So in order to do that, we need to define a notion of distance. Well, we know what the distance should look like. We have the mathematical formula. Let's write it down in Python. And so you get to see how we can indeed define functions in Python. So here, define-- and again, this is resembling pseudocode, right? def stands for defining. Just appreciate this. So define a function that we call dist that takes two points and returns the following. So let me just parse for you precisely what we are doing here.
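The plotting described above might look like the following sketch. The training coordinates are reconstructed as before; s=170 and the green test point at (3, 4) come from the lecture's description:

```python
import matplotlib
matplotlib.use("Agg")              # draw off-screen, no window needed
import matplotlib.pyplot as plt
import numpy as np

# Reconstructed training data (illustrative coordinates).
X_train = np.array([[1, 1], [2, 2.5], [3, 1.2],
                    [5.5, 6.3], [6.9, 8], [7, 6]])
Y_train = ['red', 'red', 'red', 'blue', 'blue', 'blue']

plt.figure()
# Horizontal coordinates, vertical coordinates, point size, and one
# color per point taken straight from the label array.
plt.scatter(X_train[:, 0], X_train[:, 1], s=170, c=Y_train)
# The test point at (3, 4), drawn in green.
plt.scatter(3, 4, s=170, c='green')
plt.savefig("flatland.png")        # plt.show() in a notebook instead
```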
So the line of code that we want to understand is the following, where we take two points. These could be, in this case, two-dimensional points, each of them. But later on, we are going to 64-dimensional points. And it returns the Euclidean [INAUDIBLE] distance. So let's parse this. Let's assume that we have two points. y is the test point, 3, 4, and x is one of the training points. And we want to see what this line above is doing with respect to these inputs. So first of all, we can take the difference. And the way Python is thinking about this is taking the difference coordinate-wise. This is, again, one of the magic properties of the Python language-- namely, we can apply a function to an entire vector. It's called vectorization. So whenever there are vectors-- in this case, two-dimensional points or the like, or arrays-- we can apply a function to each element of the array. So in this case, if we take the difference, Python will automatically take the difference of each coordinate at once. So the difference between 1 and 3 is indeed minus 2. And the difference between 1 and 4 is indeed minus 3. Now we can take the power, the second power-- we can square it. And again, what Python is doing with this code is simply working coordinate-wise. So it's taking minus 2 and taking the square of it. That's 4. Taking minus 3 and taking the square of it, that's 9. And so on. Now we can use a function that comes with the numpy module, simply summing the two coordinates. So 4 plus 9 is indeed 13. And then we can take the square root of it. So this is what this single line of code is doing. Just appreciate the beauty behind that. And indeed, we can define a function that simply returns the distance between any two points. And let us just compute, for each point in the training set, the distance with respect to the single test point. So what we are doing here, we are computing. There are six points in the training set, three red and three blue.
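Putting the walk-through of dist back together, the function likely looks like this minimal sketch:

```python
import numpy as np

def dist(x, y):
    # Coordinate-wise difference, squared coordinate-wise, summed,
    # then the square root: the Euclidean distance, via vectorization.
    return np.sqrt(np.sum((x - y) ** 2))

x = np.array([1, 1])    # a training point
y = np.array([3, 4])    # the test point
print(dist(x, y))       # sqrt((-2)**2 + (-3)**2) = sqrt(13), about 3.61
```

Nothing in the function mentions two dimensions, so the very same line works unchanged for 64-dimensional points, which is the whole trick.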
So we are taking the distance from the green point to each of these six points in the training set. And so this is what this block of code is doing. The len function is simply returning the number of points in the X_train array. Then we initialize an array of six zeros. And then this is simply a for loop that is computing the distance, using the dist function we defined, between all these six pairs of points. And this is the output that we print with the function print. So this is an array where each element represents the distance with respect to the test point. Then what we can do to just complete our classifier is choose the point that has the smallest distance. In this case, it will be the second one, as we can see. The distance is 1.8 here. And we can simply print the color associated with that point. And indeed, if we go back to the picture here, we see that point number two here is the closest to the green dot. And so here it is. So just appreciate the beauty. With literally three, four lines of code, we can run a machine learning algorithm. In fact, using some more Pythonic, as it is called, syntax, we can even get down to far fewer lines of code. So OK, this is the case for points. But let's go back to the image classification example. Let us see how we can take exactly the same lines of code that I showed to you and apply them to images in the digits data set. So this is what we are doing. We are importing from the module sklearn-- a common module in machine learning for Python-- a data set that is the digits data set. We call it digits. So the digits data set contains 1,797 images. And there are multiple structures within this database. In particular, there are two arrays-- an array of images, which is called digits.images, and an array of labels, which is called digits.target. So let us see. Let us print the first element in the array digits.images. So this is what it contains.
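The nearest-neighbor loop described above can be sketched as follows. The training coordinates are made up (the transcript does not list them), but they are chosen so that the nearest point is the second one, at distance roughly 1.8, as in the lecture:

```python
import numpy as np

def dist(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# Hypothetical Flatland training set: three red, three blue points.
X_train = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.2],
                    [5.5, 6.3], [6.9, 8.1], [8.0, 8.5]])
Y_train = ["red", "red", "red", "blue", "blue", "blue"]
test = np.array([3.0, 4.0])  # the green test point

# Distance from the test point to every training point.
num = len(X_train)          # number of points in X_train
distance = np.zeros(num)    # an array of six zeros
for i in range(num):
    distance[i] = dist(X_train[i], test)
print(distance)

# The nearest neighbor (smallest distance) decides the predicted color.
nearest = np.argmin(distance)
print(Y_train[nearest])
```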
It is an eight-by-eight collection of pixels that indeed represents the number zero, as we can see by actually plotting it with that line of code that I showed to you, quickly mapping each pixel from 0 to 16 included to an intensity of black. So indeed, we can realize that, well, this is a zero. There are a few zeroes in the middle. And indeed, if we plot it with this line of code, we indeed get the number zero. So this is just the first element, indexed by zero, in the data set. And we can indeed also print the true label that comes with it in digits.target. And it is a zero. So this is the data set. And what we want to do is to run the nearest neighbor classifier on this data. So in particular, what we are doing-- so let's see. This data set, again, looks something like this. What we are doing, we are saying, OK. Let us consider just a subset of the database. So let us take 10 images. And let us consider these 10 images as being the training data that we have access to. Potentially, we could have access to the entire database. But somehow, we want to split. And this is a common practice in machine learning, to split the original data set into multiple subsets in order to test the performance of the algorithm you are implementing. So this is what we are doing. By selecting 10 images, we essentially have this picture, where each point in this 64-dimensional space represents an image of a digit. And the yellow label here represents the true label we have access to in this training set. So now what we can do, we can say, OK. Let us take a test point. So say that-- take a test point here. And let us assume that we only have access to the image of this test point, which is indeed a three. And we want to test the performance of our machine learning algorithm in classifying this point. So indeed, this is the line of code that allows us to take only 10 images out of the original data set.
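Loading the digits data set and inspecting its two arrays can be sketched like this, using sklearn's bundled copy of the data:

```python
from sklearn import datasets

# The digits data set from the lecture: 1,797 labeled 8x8 images.
digits = datasets.load_digits()

print(len(digits.images))      # 1797 images in total
print(digits.images[0].shape)  # (8, 8): a grid of pixel intensities 0-16
print(digits.images[0])        # the pixels of the first image, a zero
print(digits.target[0])        # 0 -- the true label of the first image
```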
I'm using, again, the slicing syntax, but in this case a little bit differently, in the sense that we are selecting elements from zero included to 10 excluded. This is per the specification of the slicing syntax. The rightmost element is excluded. So indeed, if we use this syntax, we can extract 10 images, precisely like in the picture there. And then we can create a test image. We can choose a random number here in the remaining part of the data set, say the image corresponding to index 345. Indeed, we can plot it with the same line of code I presented to you. It is indeed a three. But this is easy for the human eye. So we want to see how good a performance we can get by applying the nearest neighbor classifier. And the lines of code that I'm going to present to you now are precisely the same lines of code that I presented to you earlier in Flatland. So let's see. This is all together. And indeed, we get that the classifier is returning the number three. So the classifier, what it's doing, again, is computing all the distances between this test point and all the points in the training set. And it's choosing the point that has the smallest distance, the nearest neighbor. And it assigns the same label as that point-- and in this case, it is indeed correct. So it may come as a surprise that such a simple algorithm-- two, three lines of code in Python to implement it-- allows us to get such a good result. But how good a result is this, really? Well, we can test it. And indeed, we can plot the true solution. We do have access to the true label, which is indeed a three. So let us test how well we are doing with 100 test images. So what we are doing, instead of just testing the performance of our algorithm with a single image, let us consider a set of 100 images. And let us count how many mistakes the algorithm we just implemented makes. So if we run this code-- I won't parse it for you.
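Putting the slicing and the Flatland code together gives the digit classifier. This is a sketch of the steps described above; numpy does not care that the points are now 64-dimensional (8x8 pixels) instead of 2-dimensional:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Slicing: elements 0 (included) to 10 (excluded) form the training set.
X_train = digits.images[0:10]
Y_train = digits.target[0:10]

# A test image from the remaining part of the data set (index 345,
# as in the lecture).
test = digits.images[345]

def dist(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

# Exactly the same nearest-neighbor code as in Flatland.
distance = np.zeros(len(X_train))
for i in range(len(X_train)):
    distance[i] = dist(X_train[i], test)

print(Y_train[np.argmin(distance)])  # the classifier's answer
print(digits.target[345])            # the true label, for comparison
```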
It's simply, again, starting from a number of errors equal to 0. And then there is a counter here. It's adding plus 1 every time the algorithm outputs something that is different from the truth. So if we run this algorithm over a set of 100 test images, we get that we commit 37 errors. So we get 63 correct answers out of 100, which is pretty good, really, for such a simple algorithm, isn't it? But indeed, much like the way humans learn, machine learning algorithms also get better with the amount of training data they have access to. So in this case, we have just chosen a subset of the original database of 10 images. So what we might try to do is to take a training set which is much bigger and see how well the algorithm does with that. So we can do that. We indeed enlarge the training set. Before, it was from 0 to 10 excluded. Now it is from 0 to 1,000 excluded. So it has 1,000 images. We can run exactly the same code as before over 100 test images. And this time, look-- only three mistakes. It's rather surprising, isn't it? I mean, such a simple algorithm. I described it to you starting from Lineland. And basically, the idea was there. Then it was a matter of coding it up. There was a notion of a point in a higher-dimensional space. There was a notion of a distance. But once we figured that out and coded it up in Python, Python doesn't care about the dimension, as we saw. The same distance function that works with two-dimensional points equally works with 64-dimensional points and higher. And so this is what we achieve-- 97% correctness with, really, five lines of code. So the question is, what if we try the very same algorithm I just presented to you on a database that looks more like what we would like to try it on? So this is a popular database. It's called CIFAR-10.
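The error-counting loop with the enlarged training set can be sketched as follows. Which 100 test images the lecture used is not specified, so the slice below (images 1,000 to 1,100) is an assumption, and the exact error count may differ slightly from the lecture's three:

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

def dist(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def nearest_neighbor(test, X_train, Y_train):
    # Label of the training point closest to the test point.
    distance = np.array([dist(x, test) for x in X_train])
    return Y_train[np.argmin(distance)]

# Enlarged training set: images 0 to 1,000 excluded.
X_train = digits.images[0:1000]
Y_train = digits.target[0:1000]

# Start from zero errors and add 1 whenever the prediction is wrong.
num_errors = 0
for i in range(1000, 1100):
    if nearest_neighbor(digits.images[i], X_train, Y_train) != digits.target[i]:
        num_errors += 1

print(num_errors)
```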
It is, again, the same idea-- it is a labeled database that contains, in this case, tens of thousands of images, each with a label. So we indeed have 10 labels, as before. But now, instead of the labels being numbers from 0 to 9, the labels are things like airplanes, automobiles, birds, dogs, and so on. So this is just one of the data sets that you can find. And indeed, there are websites such as kaggle.com that host competitions where machine learning researchers and programmers try out their algorithms, and there are challenges going on. So this data set was popular a couple of years ago for one of these challenges. A typical challenge could last anything between two or three months to even longer-- a year. So it turns out that if we run the nearest neighbor classifier on this new set of images, the performance is 30%. So it is still much better than random guessing. After all, there are 10 categories. So you might suspect that just by random guessing, you get 10% correct. Indeed, you get 30%. But it's not what we would like, is it? And in fact, there are more advanced algorithms that do something a little bit different. But let us first see what could be an issue with the algorithm that we just ran. So this is a training set for the category zero in the previous data set. And this is a few elements, a few images, from the category horse in the new data set. So one difference that pops to the eye immediately is that these are color pictures. And indeed, they are. So, in fact, instead of being eight by eight pixels, they are 32 by 32. So still rather small pictures, but now each pixel is indeed a triple-- RGB, as we saw-- where each coordinate contains a number from 0 to 255. So in some sense, we are in a higher-dimensional space. It's not just 64-- it's 32 times 32 times 3, which is 3,072. But there is another key difference, isn't there? Anyone? Please. AUDIENCE: The image can be rotated. So you can have a horse that's facing one way or [INAUDIBLE]. DAVID J. MALAN: Great.
This is definitely one of the issues here: viewpoint variation. So we indeed have a sequence of pictures representing horses. But the horses are taken from different angles, in different poses, right? And in fact, there are all sorts of issues here. There are viewpoint variations, illumination conditions, scale variation, deformation, occlusions, and the like. And this is what makes image recognition a really tough challenge for machine learning algorithms. Now, what more sophisticated algorithms do is not just interpret images as collections of pixels, per se. They work on a higher level somehow. So they group pixels together. Instead of looking at one pixel at a time, what is happening, they sort of try to extrapolate, to abstract, some higher-level features. And this is an example. So the digit zero can indeed be represented as having the following four features, which are four arches at the possible angles. And indeed, we can go higher in the hierarchy. So the state-of-the-art algorithms for this class of application, and not only this one, are called deep learning. They do work a bit like I just described. So instead of working with the pixels themselves, as at the bottom of this image, they try to group pixels together. And they try to find patterns when you group a few pixels together. And the first layer of patterns they can extrapolate is edges, for instance. And then from there, you can go another step up. So by grouping edges together, you can come up with objects such as eyes or noses or mouths and the like. And then by grouping this set of objects again, you can get something like a face. So this is indeed the logic. Deep learning is really a game changer. The idea behind this type of technology is not new, in fact. It relies on so-called neural networks that were invented in the '80s or probably even earlier.
But it's only recently-- literally four or five years ago-- that researchers have been able to use this technology to really achieve amazing results. And now this technology is everywhere, not just for image recognition. Google's search engine uses deep learning. Just to name one more-- Facebook. The tagging feature for pictures on Facebook is based on deep learning. Speech recognition software-- Siri, Cortana, OK Google-- that one doesn't have a fancy name yet-- they are all based on deep learning. And this is really a game changer. It has brought a 10% or more increase in the performance of [INAUDIBLE] algorithms in a matter of literally one jump. Something like that was unseen in the field. And indeed, deep learning is beyond the scope of this class. In fact, you really need a lot of computational power in order to run deep learning algorithms. Not just that, you need tons of data. And this is why in the '80s, they couldn't run-- the theory was there, but first of all, we didn't have access to the amount of data we do have access to today, thanks to the World Wide Web. And back then, we didn't have the processing capabilities that we have now. So while deep learning algorithms are beyond what we can actually run in the CS50 IDE, we can indeed use some tool sets that are out there that are based on deep learning. So this is one example-- TensorFlow by Google. So what you can do, you can actually download, within Python, for instance, an algorithm already trained for you. So without training an algorithm ourselves, as we have been doing with the nearest neighbor classifier, we can download a file of 100 megabytes or more for doing image recognition. Another example is the DeepDream generator. So it turns out, as I mentioned, this type of algorithm is able to figure out patterns in the input that we feed it.
So what we can do with the DeepDream generator, we can go online, upload whatever picture we might like-- a picture of ourselves-- then upload a picture of one of our favorite paintings. And the algorithm is capable of recognizing the painting style behind that painting and applying it to the original image we uploaded. And this is why it's called DeepDream, because apparently dreams work by mixing images together. So if you apply this type of technology to that database I showed to you earlier, we get an amazing performance of 95%, which is indeed close to what the human eye can achieve, in this case. So the question is, is 95% enough? Well, it really depends on the application. Just imagine an example-- self-driving cars. As you know, it's a hot area. There are a lot of players out there. Tesla is one of the first players who brought to the market, really, autopilot-like features. So the Autopilot feature by Tesla allows the car to shift lanes on the highway automatically, to speed up or slow down based on the traffic, and so much more. Now, this is just an assistant, and the company makes it clear that someone should always be in control of the car. But it is indeed providing a lot of help. And it turns out that the autopilot features in cars like Tesla's do rely on a variety of technologies, such as GPS, ultrasonic sensors, radars, and so on. But they also rely on forward-facing cameras with image recognition software. And indeed, they use these so-called deep learning technologies these days. What is the problem? Well, let us see a video. [VIDEO PLAYBACK] [MUSIC PLAYING] [END PLAYBACK] DAVID J. MALAN: So indeed, an investigation is going on-- and let me actually close this. As we saw, a driver was on a highway in Florida. He was using the Autopilot feature, apparently relying almost exclusively on it, when a tractor trailer drove perpendicular to it.
And if you read, from tesla.com, a statement that the company released after the accident, which happened a few months ago, I read-- "Neither Autopilot nor the driver noticed the white side of the tractor trailer against a brightly lit sky, so the brake was not applied." So it is indeed an issue with image recognition. Apparently, the color of the trailer was whitish, white. And so against a brightly lit sky, the algorithm, although it performs correctly something like 95% of the time, had some challenges. So these are a few of the challenges. And, in fact, the [INAUDIBLE] will be much interested in this respect. Applying this type of technology to self-driving cars will bring a lot of interesting questions in all fields-- not just computer science, of course, but politics with policies, ethics, philosophy, and the like. All right. That was it for image recognition. So let's now move to the next application-- text clustering. We are going to go a little bit faster on this one. The application we have in mind is the following. Say that I want to design an algorithm that takes as input the following list of movies-- in fact, not just the movie titles, but the IMDB synopses for the movies. So the movies are Robin Hood, The Matrix, The King's Speech, Aladdin, and so on. And I want to design an algorithm that, solely based on these inputs, is capable of clustering, as it is called-- grouping-- these movies into two categories. So if we stare at the list of movies, it might be evident, if you like, that the clustering that we expect is something like this, where we have A Beautiful Mind, The Matrix, The King's Speech in one group and Robin Hood, Aladdin, Finding Nemo in another group. And indeed, if I were to ask you to just group this list of movies into two groups, most likely, this would have been the answer. But your answer would have been based on something different from what the machine will do, as we will see, most likely.
In fact, you might say, OK, these are the two categories, because we know from before that, in a way, Robin Hood, Aladdin, Finding Nemo are really more Disney-like movies, right? They're for kids, if you wish, whereas the others are more action-type movies. But again, this way of categorizing, clustering movies is based on some sort of human learning, whereas the machine only has access to the synopses of the movies. So let's see what will happen if we indeed try to run an algorithm to do this clustering. So as before, we try to abstract. In this case, the set of applications we are discussing now is called unsupervised learning, because contrary to what happened before, where we were presented a list of data-- a list of points, a list of images-- with a label associated with each item, this time, we are simply presented with a set of points. And here it is in Lineland. This is what we will see-- just a set of seven points. And so if you are asked, OK, just split this set of points into two groups, well, we have an easy way of doing that. And once again, I haven't told you anything about the structure to be inferred. I was just presenting you with a set of data points and asking you to split this group of points into k equals 2 categories, groups. So as before, this is indeed a well-known machine learning algorithm. It's called K-means. K there represents the number of clusters we want the algorithm to divide the original data into. And as before, from one dimension, we can go to two dimensions. And if I ask you, OK, split this group of points into two groups-- easy. That's what K-means is doing. What about text now? So before, we somehow had an easy way of mapping an image, a collection of pixels, into a point in a higher-dimensional space. How can we go about doing something like that with text? It's not that easy, is it?
Because if we were able to interpret each of these synopses as a point in a space, then most likely, the picture would look like this. And if I asked you to divide this group of movies into two categories, that would be your answer. So indeed, let's see how we can map some text input into a point in a higher-dimensional space. In order to do so, let me just step back for a moment and describe the following, easier type of example, where as input, we have four strings. So let's read. The first string is, "I love CS50. Staff is awesome, awesome, awesome." String A. String B-- "I have a dog and a cat." String C-- "Best of CS50? Staff. And cakes. OK. CS50 staff." String D-- "My dog keeps chasing my cat. Dogs." OK. Say these are the four strings, and we ask, OK, let's split them into two groups. Most likely, what you will guess is that one cluster, one group, should include string A and string C. And the other cluster should include string B and string D, based on the semantics, on the meaning. So how can we do this? And indeed, if this is the representation in a high-dimensional space, this is what we will get. So the missing step is really this mapping, right? Mapping each of these four strings into a point in a higher-dimensional space. And this is what we can do with the following interpretation. So here, we have the four strings. The first thing that we can do is look at the vocabulary used in these four strings-- namely, extract the words that are used in each of the strings. So if we do not consider so-called stop words-- namely, words such as "I," "is," "a," "and," and the like, which do not provide much meaning-- this is the dictionary that we will extract. There are the words "awesome," "best," "cakes," "cat," and so on. And now if we look at each of the strings, we can indeed map each string into a numerical point by using the so-called bag-of-words interpretation, by which each string is simply represented by its word counts.
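The dictionary extraction and word counting described above can be sketched with sklearn's CountVectorizer, which drops English stop words for us:

```python
from sklearn.feature_extraction.text import CountVectorizer

# The four strings from the lecture.
docs = [
    "I love CS50. Staff is awesome, awesome, awesome.",  # string A
    "I have a dog and a cat.",                           # string B
    "Best of CS50? Staff. And cakes. OK. CS50 staff.",   # string C
    "My dog keeps chasing my cat. Dogs.",                # string D
]

# Build the bag-of-words counts, dropping stop words such as
# "I", "is", "a", "and", which do not provide much meaning.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))  # the 12-word dictionary
print(counts.toarray()[0])             # word counts for string A
```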
So let's see, for instance, the first string. The word "awesome" is used three times. And that's why we have a three there. The word "CS50," in lowercase, is used once. The word "love" is used once. And "staff" is used once. Again, we do not consider the stop words such as "I" and "is." And so on-- the second string can also be represented by a numerical vector like this. And indeed, there are 12 words in this dictionary. So we can indeed think of each of these strings as being a point in a 12-dimensional space, can't we? Now, it's not quite that simple. This is a great first step. But we should also normalize by the length of the string, if you wish, just because if a string is very long, then the raw counts will simply tend to be higher. What we can do easily is just divide each numerical vector by the total number of words in that string. So we get a so-called frequency matrix. So here it is. We have a way to map a string into a point in a high-dimensional space, a 12-dimensional space. And what we can do, we can apply this algorithm, K-means. So let me just show you quickly how this can be done with Python in the realm of unsupervised learning. So now we will move much faster than before. In this case, we are importing the same modules as before, numpy and matplotlib, and creating-- in this case, in the world of Flatland-- an array of seven points, which I here plot with the same exact lines of code as before. But in this case, instead of implementing the machine learning algorithm from scratch, we can do what, stereotypically at least, most machine learning programmers or researchers will do, which is import the K-means algorithm from an external module. So once we have imported this algorithm-- the details of how it works are beyond the scope of this class, but it wouldn't be that difficult, in fact-- we can reasonably run it and say, OK, algorithm, cluster this group of points into two groups. k equals 2.
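Importing and running K-means on seven Flatland points can be sketched like this. The coordinates are made up (the transcript does not give them); they just form two visibly separate groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Seven hypothetical points in Flatland.
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

# Ask the imported algorithm for k = 2 clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # which of the two groups each point is in
print(kmeans.cluster_centers_)  # the "crosses": each group's center of mass
```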
If we do that-- I won't parse line-by-line what is happening-- simply running the algorithm with k equals 2, this would be the output. So indeed, the algorithm is capable of figuring out the two groups of points based on their distance. We can also run the algorithm with k equals 3. After all, we are the ones deciding the number of groups to be created. So if we run with k equals 3-- the same lines of code as before, just changing k from 2 to 3-- we get three groups. And so on. With k equals 7, there are seven points. And here it is. So the crosses that are present in the plot are simply, if you wish, the center of mass, the center of gravity, of each group-- simply the middle of the group. So this is what we can do easily with points in the world of Flatland. And we can, in fact, move, very much like we have done before, to the world of documents. And so this is the collection of strings I presented to you. This is the bag-of-words matrix that we can easily construct using some functions from the external module sklearn. Again, I won't spend much time on parsing this code. I want you to appreciate the fact that really, in a few lines of code, we can get something like this running in Python. So we can have a look at the dictionary, which is what I presented to you earlier-- "awesome," "best," "cakes," and the like. We can get to the frequency matrix, as before. And we can indeed run K-means with k equals 2. And if we do that, we indeed get the output that we expect, meaning the algorithm is capable of figuring out that the two clusters should be divided by the words "dog," "cat," "keeps" on one side and the words "awesome," "staff," and "CS50" on the other. So this is for the simple example of strings. We can move to the more interesting example of movies with the IMDB synopses. Run precisely the same lines of code. Now, the input for the algorithm is the following list of movies with their titles and their synopses from IMDB.
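The whole text pipeline on the four strings can be sketched end to end: bag-of-words counts, normalization into the frequency matrix, then K-means with k = 2:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love CS50. Staff is awesome, awesome, awesome.",  # A
    "I have a dog and a cat.",                           # B
    "Best of CS50? Staff. And cakes. OK. CS50 staff.",   # C
    "My dog keeps chasing my cat. Dogs.",                # D
]

# Bag-of-words counts without stop words, then divide each row by the
# string's total word count to get the frequency matrix from the lecture.
counts = CountVectorizer(stop_words="english").fit_transform(docs).toarray()
freq = counts / counts.sum(axis=1, keepdims=True)

# K-means with k = 2 on the frequency matrix.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(freq)
print(labels)  # A and C end up together; B and D end up together
```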
And we can easily import this Google spreadsheet into Python using the pandas module. Again, I won't spend much time on it. I want to get to the punch line. This is the line of code to import this data set. Here it is, indeed-- if we print this data frame in Python, it's the same table as in the Google spreadsheet. And from this point on, we can take precisely the same code that we applied earlier in the easier example and apply it in this case. And if we do so, just by creating the frequency matrix and running K-means with k equals 2, we get the following output. So indeed, the algorithm is capable of figuring out that one class of movies should include movies such as The King's Speech, Frozen, Aladdin, Cinderella, Robin Hood, and the like. So The King's Speech-- we wouldn't really expect that, right? Frozen, Aladdin, Cinderella, Robin Hood-- OK, kids' movies. But The King's Speech? Well, let's see the other cluster. The other cluster would be Mad Max, The Matrix, No Country For Old Men, and the like. So the way the algorithm is thinking about it, when we map to this higher-dimensional space, is by grouping movies together based on their word counts, as we saw. And so the reason why it puts The King's Speech in the same group as Frozen, Aladdin, Robin Hood, and so on is because of words such as "king," "prince," "duke." So this is the machine learning behind it-- this is what the machine is doing. Again, it might sound a little bit counterintuitive, coming from a human learning point of view. But the machine only has access to those inputs that I showed to you earlier. So let's just wrap up. This was Python. We have seen, indeed, two important applications in the real world-- the image recognition application and the text clustering application. These are just two applications out of countless many that are changing our lives on a daily basis. And in fact-- well, at this point, just to mention something more on this.
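The pandas step can be sketched as follows. The lecture's actual input was a Google spreadsheet of titles and IMDB synopses; the tiny DataFrame and one-line synopses below are made up for illustration, and the pipeline is the same as for the four strings:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

# A tiny stand-in for the lecture's spreadsheet of movies.
df = pd.DataFrame({
    "Title": ["Aladdin", "The Matrix"],
    "Synopsis": [
        "A street urchin finds a magic lamp and a genie.",
        "A hacker discovers reality is a simulation and fights machines.",
    ],
})

# Same pipeline as before: bag-of-words on the synopses, then K-means.
counts = CountVectorizer(stop_words="english").fit_transform(df["Synopsis"])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(counts)
print(dict(zip(df["Title"], labels)))  # each title mapped to a cluster
```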
This should ring a bell at this point in the class. It is, indeed, the pyramid from Mario. And indeed, we can run a machine learning algorithm to play video games such as Mario. How do we do that? Well, as a training set, we can have human players play Mario. And we can have an algorithm watching them and learning from them. Or we can have an algorithm watching another algorithm play. And this is indeed what happened in the case of Go. So Go is a popular board game. It's much like chess. But the number of combinations is astonishingly large. It's more than the number of atoms in the universe. So playing this game has always been considered hard. And it has been believed for a long time to be outside the reach of modern machine learning algorithms. This was the picture up to a few months ago, in March 2016, when an algorithm made by researchers at Google played against one of the world champions of this game. And here he is, the world champion, world master of Go, Lee Sedol. So before a series of five games against the machine, Lee released a statement claiming that he expected to win something like four to one. Indeed, four to one was the final outcome, but it was the other way around. The machine won four games out of five. And what is really amazing-- this is perceived, again, as a game changer. Deep learning algorithms are behind this technology. So much attention has been drawn to the game not just because the machine won, in fact, four out of five, but because during the games, really new plays were made by the algorithm. And the algorithm was trained not just by observing human masters playing the game of Go. It was also trained by looking at itself-- at the algorithm itself-- playing against another algorithm at Go.
So, having access to this set of training data, the algorithm simply came up with new, astonishing moves that not even commentators of the game were able to comment on. And so this is where we leave off, by watching some of the reactions of these commentators while trying to interpret the moves made by the machine. CS50 will be back next week with more on Python and on web servers. [APPLAUSE] [VIDEO PLAYBACK] -And this is what [INAUDIBLE] from the Google team was talking about, is this kind of evaluation, value-- -That's a very surprising move. -I thought it was a mistake. -Well, I thought it was a quick miss. But-- -If it were online Go, we'd call it a clicko. -Yeah, it's a very strange-- something like this would be a more normal move. -OK, you're going to have to-- so do you have to think about this? -This would be kind of a normal move. And locally, white would answer here. -Sure. -But-- [END PLAYBACK] DAVID J. MALAN: Thanks again. [VIDEO PLAYBACK] -All those freshmen were, like, stoked for the first lecture, you know? Like cake, candy, swag, candy. They, like, got their own DJ. -But the first lecture didn't go as planned. -Malan, he had that phone book ready to, like, tear its heart out. -Well, you might first open the phone book roughly to the middle, look down, and-- -And then the dude just let it drop. [MUSIC PLAYING] -Hmm. -Mind if I bum one of those? -Oh, sure. -Dude, there was something in those pages. This was like nothing I'd ever seen before on the YouTubes. -Was there any mention of Rosebud? -Rosebud? Is that, like, a programming language? -[EXHALES] [END PLAYBACK]