字幕表 動画を再生する 英語字幕をプリント Hello, everyone. Welcome back to Mining of Massive Datasets. We're going to continue our discussion of Clustering today, by looking at the k-Means Algorithm. Recall that we previously, looked at another means, way of doing clustering called Agglomerative Hierarchical Clustering. Now, that algorithm worked very well in terms of producing clusters, but unfortunately, it doesn't work very well for large da, datasets because of its computational complexity. The k-Means Algorithm, or the k-Means family of Algorithms actually, is a way of addressing that problem, of creating Algorithms that are computational tractable and work with very large datasets. Now the k-means Algorithm assumes, a Euclidean space and the Euclidean distance. And the first step, is to pick k, the number of clusters. Now, for now let's assume that we pick a number k, and say that's the number of clusters that we finally want. Towards the end, I'll ca, I'll show you how to actually, pick this this value k. For now, let's just assume that k is the given. Now, we're going to initialize k clusters, by picking one point per cluster. For example, we could just pick k points at random, one, and assign one point to each cluster. And that would be the one way of picking the points. Later on we'll examine other ways of doing this much better. Now that we have clusters populated with these k-points picked at random, here's how we're going to proceed. We're really go through all the points in our datasets. And for each point, we're going to place it in the cluster. To whose centroid it's closest. So, you're going to find the cluster with the closest centroid to the data point, and then we'll assign that data point to that cluster. Okay? And we're going to do this for each data point. Now, when we assign a whole bunch of new data points to a cluster, the centroid of a cluster might change. Because of all the new data points that have been added with the cluster. So, we're going to go up, update the locations of the centroids of each of the k clusters, by taking into account the new data points that have been added to those clusters. Now, once we do this, we'll find that the centroids of, of the k clusters have moved, and now a point that was close to one cluster may be closer to another cluster. So, we need to go through and reassign all points to their closest centroid. And sometimes this moves points between the clusters. And notice that we, we may have to do poi, you know, steps 2 and 3 again and again. And it, thi, this is exactly, what we're going to do. We're going to repeat steps 2 and 3, until Convergence. And what we mean by Convergence, is that the k centroids don't move any further, and the point don't move en, any further across the across the centroids. At that point, we have a stable k-means clustering. So let's look at an example. So, here's a set of points and we are going to do obviously, say that k is equal to 2, so we're looking to find two clusters in this dataset. Now in, in Round 1, I'm at random going to pick two points since k equal 2 and call those a centroid of two clusters. Let's say, I pick these two points so the, the point that I've highlighted with the in pink here and the point that I've highlighted in blue here as the two centroids. I've picked this totally arbitrarily and at random. Now, what I'm going to do is I'm going to go through, all the data points and for each data point, I'm going to assign it to the centroid that it's closest to. So, observe that this point for example here, is close to the red cluster. So, I'm going to assign it to the red cluster. Whereas, this point here is closer to the blue cluster than to the red cluster, so I'm going to assign it to the blue cluster. And, you know, when I go through all the data points and do this. Here is the clustering that I end up with. These two points here end up in the you know, in, in the pink cluster here and this set of points are closer to the the blue point here, and so they end up in a blue cluster. So, that's round 1 of k-means clustering. Now, what I'm going to do now is, because I've created these two new clusters I'm going to compute the new centroids. And when I compute the new centroids, the the centroid of the pink cluster moves over a little bit to the left to this new position here. And the centroid of the blue cluster moves to that new position. Now, is because I have new centroids, I'm going to go through in round 2, I'm going to go through each data point again and assign it to it's closest centroid. And when I do that,observe that, point for example this point here is now closer to the the pink cluster than it is to the blue cluster. And so it moves from the blue cluster to the pink cluster. Okay? [SOUND] And so we have this new clustering and the end of round 2. Now that, we have the new clustering, I'm going to recompute the centroids once again, and when I recompute the centroids, the the centroids migrate further. And because the centroids are migrated further, I have to recompute the, the assignment of points to the centroids again. And when I do that these two points here are then are closer to the the pink cluster, so they move to the to the pink cluster and and we end up with the clustering show here at the end of Round 3. Okay? Now, as it turns out, in this case, Round 3 is the last round. a, at this point, the centroids and the, points don't migrate any further, and we have a stable clustering. So, what we have here is the, clustering, the k-means clustering for k is equal to 2, for, for this dataset. Observe that I started from a completely random assignment of you know, of, of centroids and actually migrated the centroids to where they finally ended up. Now, the important question here is, how to we pick the right value of k? In the previous example, we arbitrarily picked k is equal to 2 and that actually turned out to be the right value for that datasets because you can see that there are roughly, two different clusters of data points here. But in general, how do we know what the right value of k is up front? So that we can pick the number of clusters up front. So, since we don't know the right value of k, the obvious answer is to try different values of k. And see what looks good. Now the question, now we have to decide. What do we mean, by what looks good? One obvious answer is to look at the average distance of points from the centroid as k increases. Right? And let's see let's look at an example to see what we mean here. So here's here's an example with a bunch of data points, and they've actually picked k is equal to 2. Because I pi, pick k equal 2, we have two clusters. And you, you can see that, that's a centroid off of the first cluster. And you can see, there's a lot of points that are very far away from the centroid in this case. Right? Now, if, if k is too small, as in this case, then there are going to be lots of points that are a large distance away from the centroid. So, the average distance from the distance is going to be fairly large. Now, I made k equal 3. and, as it turns out, this is the right value for this dataset, and you can see that the, distances to the, centroid shrink quite a lot, when I went from, k equal 2 to k equal 3. That splits the, top cluster into two, and that shrinks the average value, or the, or the average distance of the centroid significantly. Right? And the and but if I increase k, further. Let's say, I make k equal 4 and that splits that cluster, further. Now, that does shrink the average sums of the centroid, but not as much as when we went from k equal 2, to k equal 3. Right? So, when we make k, too large,. It does decrease the average distance of the centroid. But not, by not as much. Now, we can see this very clearly if I plot a graph. Right, if I plot a graph, where the x axis is k, the number of clusters. And the y axis is the average distance to the centroid. You can see that as k increases. The average distance to the centroid keeps falling. But, at some point, it's, it, you know, the, there's a knee of the curve and the average distance to the centroid falls very slowly. The obvious thing to do is to pick a value of k, that's close to the knee of the curve, where you get a more, you know, a fairly low distance to the average distance to the centroid. Without too high, a value of k. So, in this case, the best value of k, is the one that's, that's shown in this, shown in the picture. The final question we need to address, with k-means clustering is the picking of the k initial point to initialize the k clusters. In the examples, that we've seen so far, we've picked the. The initial k points entire completely at random. Now that model not a very bad example, but in general it might not work for quite so bad. If you make for example pick k point that all happen to be in the same cluster. In which case the final clustering, won't reflect the actual clustering of the data. Right, or we might pick points that are outliers that are not near, near any of the near clusters. So, the the final clustering depends on the initial k point that they picked. And so it's important that they pick the right k points to start the clustering from. The first approach, to picking the initial k point is Sampling. So remember that the data set is very large and so we can't actually run a complicated algorithm like hierarchical clustering on it. But what we can do is to sample the data and take a smaller sample of the data. And then using the sample of the data, we can then another algorithm like Hierarchical Agglomerative Clustering, in which we covered in the last lecture. And we can run that algorithm until we obtain k clusters. And then we can pick a point from each of the k clusters. For example, we could pick the point that's closest to the centroid for each cluster. And we could call those, our initial k k centroids. Another approach is to not resort to another clustering algorithm, but to just pick a dispersed set of points. Points that are as far away from each other. In the dataset, as possible. So, one approach to this, is to first pick a, the first point entirely at random. Now, the second point, we pick to be the point that's as far away from the first point as possible, in the dataset. The third point, we pick to be the point that is as far away from both point one and point two as possible. In general, we pick you know, once we've picked the first few points, we pick the next point to be the one who's minimum distance from the already selected points, is as large as possible. And then, we repeat this, until we pick k points. So, this ensures that the k point, that's we've picked, are as far apart or as disperse as possible, you, you know, in the, in the data set. And that gives us a better chance of finding the right clustering. Let's consider the Complexity of k-means clustering. Now in each round, we have to examine each input point exactly, once to find the closest centroid to that point and assign it to that centroid. Now, since there are endpoints and there are k centroids, right? We have to compute the distance of each point from each centroid, so the algorithm is order k times N for endpoints and k clusters. Now each round is it's, it's it's it's a Complexity k times N. Now this is actually, not bad because you know when N is really large and k is fairly small, we have a linear algorithm in N. Each round is actually, linear in N. But the real problem is that, the number of rounds to convergence can be really, really large. There is actually, no theoretical limit on the number of rounds to for the algorithms to convert to converge. And the number of rounds can be really, really large. So, the algorithm could spend actually, a really long time getting to getting to convergence. So the question is, if the dataset is really, really large and we don't want to go through this large number of rounds? Can we actually, do something like k-means clustering in a single pass over the data? Because we have a really large dataset, and we don't want to scan it multiple times. We're going to answer this question in the next segment.