 ## 字幕表 動画を再生する

• Welcome back to Mining of Massive Datasets.

• We're going to continue or

• discussion of clustering by looking at the CURE Algorithm.

• CURE is an acronym that stands for Clustering Using Representatives.

• But before we get to the algorithm itself, let's see why we need it.

• Remember, we've looked at the, BFR algorithm, or

• clustering very large datasets that don't fit in memory.

• The BFR algorithm is great, because you can scan the data in one pass, and

• obtain clusters.

• The problem, though, is that the BFR algorithm makes very

• strong assumptions about the data, and about what the clusters look like.

• The first assumption that the BFR Algorithm makes is that the clusters

• are normally distributed in each dimension.

• That in each dimension there is a fixed centroid and

• a fixed standard deviation that the, that each cluster follows along each dimension.

• The second strong assumption that the BFR Algorithm makes is

• that the axes are fixed.

• So the clusters then, if you follow both these assumptions, the,

• that the clusters are normally distributed in each dimension and the axes are fixed.

• The clusters that are discovered by the BFR Algorithm had this

• the cigar kind of shape that, that you see here on the left.

• it, it, it could either be a horizontal cigar shape or a vertical cigar shape.

• Or a circle, which is kind of a limiting case of, of an ellipse.

• But if your clusters actually are not oriented along the x or

• the y axis in this case, or along the axis in general in the multi-dimensional case,

• but are at an angle to the axis as, as I show in the, in the fig,

• in, in the second picture here.

• that's, that's not okay.

• The BFR Algorithm will not find a cluster that looks like a tilted ellipse.

• It can only find clusters that look like either upright or horizontal ellipses.

• And clusters actually look very, very different, like the picture on

• the extreme right here where there are two clusters and the clusters look kind of

• like crescent moons except in, in opposite directions, those would definitely not be

• found by the BFR Algorithm because they don't look like cigars at all.

• They don't look like ellipses at all or in any dimension.

• So, that's the kind of cluster that will never be found by the BFR Algorithm.

• So the BFR Algorithm,

• even though it's super efficient makes the strong assumptions the clusters are going

• to look like the, the pictures on the extreme left, and not like the other two.

• And we'd like to avoid this assumption, and

• try to find clusters regardless of what they actually look like,

• because we don't control what the clusters look like in the, in the, in the data.

• The CURE Algorithm tries to fix this problem with the with the BFR Algorithm.

• The CURE Algorithm assumes a Euclidean distance.

• I remember a Euclidean distance metric means between any two points, we can

• always find a mid point of two points by taking the average of those two points.

• However, the CURE Algorithm,

• unlike the BFR Algorithm, allows clusters to assume any shape whatsoever.

• There is no restriction on the, on the shape of the clusters.

• So in the CURE Algorithm any of these clusters, the the, the first,

• the second, or the third are perfectly fine.

• The CURE Algorithm works, can find clusters of, of those shapes.

• Now difference between the CURE Algorithm and

• the BFR Algorithm is that the BFR we represent each cluster using its centroid.

• Whereas in the CURE Algorithm, we're going to use instead of a centroid,

• we're going to represent each cluster by a collection of representative points.

• So, instead of representing a cluster by one point,

• we're going to represent it by many points.

• Here's an example of a dataset where the clusters don't look anything at

• all like ellipses or cigars.

• So, this data, on x axis we have the age of faculty members at

• a university like Stanford and on the y axis we have their salaries.

• Now these are, these are, this is not the actual data, but more a representation of

• what the data might look like, although it's based on, on real world experience.

• Now, the, this, the data points marked by h,

• are salaries of faculty members in the humanities.

• Where the, data points marked with an e are salaries of faculty members in,

• in engineering, departments.

• And as you can see, it's apparent from the, from the graph here, that in

• the humanities the, the starting salary is, is somewhat lower than in engineering.

• A humanities, faculty member starts at a much lower salary than

• an engineering faculty member.

• But, as, over time, as their tenure increases,

• the salary of a humanities, faculty member, keeps increasing and

• eventually overtakes the salary of a an engineering faculty member.

• But in the salary of engineering faculty members increases a little bit with their

• tenure, but then kind of flattens out, it doesn't increase beyond that.

• And this is just a phenomenon that has been observed in, in terms of salaries at,

• at most universities and presumably this is because, in,

• in the engineering departments the fields keep changing so

• much that there's a lot of value in bringing in new

• faculty with with new interests and and new you know, new expertise.

• Whereas in the humanities I guess you age better as you age.

• So if you sort of look at the look at these two sets of salaries and

• you try to cluster them, what you really want in an ideal world is, is, is two,

• is two clusters.

• One that you know looks at the engineering salaries,

• and one that looks at the humanities salaries and see it,

• and cleanly separate out these two data points into, into two separate clusters.

• Now when you're looking at the data, remember you do know that some of

• these of salaries corresponding to engineering faculty members and

• some to humanities facilities members so so

• the clustering algorithm doesn't have access to this information but you'd yet

• like it to find these find these clusters in the data.

• Now it's too much to hope that a clustering algorithm can actually

• find these exact clusters because these are overlapping clusters, and

• most clustering algorithm cannot find cluster that actually overlap with

• with each other where a single data point is in two clusters.

• But at the very least, we can hope that the clustering algorithm finds some

• approximation for these clusters.

• For example we might want it to discover one cluster there.

• Another cluster there.

• And a third cluster there.

• So that would be a nice, nice outcome for from from any clustering algorithm.

• And the CURE Algorithm can indeed find clusters of this nature.

• The CURE Algorithm is a two pass algorithm.

• And let's look at the, the first pass of the two pass algorithm.

• In the first pass,

• we sample a random set of points from the, dataset that's given to us.

• Remember the dataset is really large and doesn't fit in memory at all.

• It's sitting on disk somewhere.

• But we're going to randomly sample a point from this very large dataset.

• And we're only going to sample enough points, that, that fit in memory.

• we'll, we've covered techniques for, sampling, in another lecture, so

• you know how to do this already.

• Now, once we have a lot of, sample points that, you know, that

• you've randomly sampled, we're going to use any main memory clustering algorithm.

• So, for example, you can use a hierarchical clustering algorithm that

• we covered in our previous lecture.

• And you can cluster the the sample points that are in memory using the hierarchical

• clustering algorithm to create an initial set of clusters.

• Right?

• And because hierarchical clustering is is, is, is,

• is a complex algorithm, it can actually find clusters of the nature you know,

• it doesn't have any kind of restriction on the kind of clusters that it can find.

• So it can find clusters of any shape.

• Now, once we've actually clustered the sample points, and fi,

• figured out initial sort of clusters, for each of those clusters,

• we're going to pick representative points to represent them.

• So what we're going to do, is we're going to pick a number, k, and

• let's say 4, and we're going to find pick k representative points for each cluster.

• Now our goal is to, you know, take a cluster like this.

• Let's say, here's a cluster that, that we found using hierarchical clustering.

• And we want to find a representative point, that are, you know, as far away

• from each other to sort of get a good coverage of the cluster as possible.

• For example, we might want to, if, if we find the first representative point there,

• you might want to find the second there, the third there, and the fourth there.

• So we have four points that are sort of well distributed in the cluster and

• if they, they cover as much of the cluster's, you know, volume as possible.

• And the, you know, we've sort of in our previous lecture,

• we've discussed how to pick, points that are as dispersed as possible in a cluster.

• The technique is to, for each cluster, first pick, the first point at random, and

• then you pick the second point to be within the cluster but

• be as far away from the first point as possible.

• And you pick the third point to be still within the cluster but

• as far away from points one and two as possible, and so on.

• And we've covered this technique in a, in a previous previous lecture.

• So once we've picked these representative points you know,

• what we're going to do is that we're going to create the synthetic points.

• And the synthetic points are obtained by moving each,

• representative point a certain, distance towards the centroid of the cluster.

• Right? So, we have the cluster.

• We know its centroid.

• And we have these, these k representative points.

• Now, we're going to take each representative point and we're going to

• create a synthetic point, that is obtained by moving the representative point

• a fixed fraction, maybe 20%, towards the centroid of the cluster.

• And the 20% here is a, is a parameter to the algorithm.

• So let's, let's look at an example, that should make this clear.

• So, here are faculty salary data.

• And let's say when you run a hierarchical clustering on a sample of the data,

• let's say that this is in fact a sample of the data and not the actual data.

• When you run a hierarchical clustering on this, let's say the,

• the here is the first cluster that's found.

• Here is the second cluster and here's the third cluster.

• So we end up with these three clusters that are found by

• the hierarchical clustering algorithm.

• Now, once we've found these clustering algorithms,

• let's take one of these clusters and and

• let's say we want to find four remote, or representative point for each cluster.

• And let's start at the cluster to the right.

• And there, there's the first representative point,

• let's say it's the one that there that we've ended up picking.

• Now we want to pick the second representative point to be in this

• cluster, but as far away from the first representative point as possible.

• So that's the point we end up picking.

• The third representative point is going to be,

• still in the cluster but as far away from these two points as possible.

• So we end up with that point there.

• And the fourth representative point similarly ends up being that,

• that point there.

• Now, once you have picked these four remote points for

• the cluster we're going to move these points towards a centroid of the cluster.

• No, we, we don't actually going to affect the data itself.

• But we're going to you know, pick synthetic points that are closer to

• the centroid of a cluster than the remote points that we've actually picked.

• Now the centroid of the,

• of the cluster is somewhere in the middle of the cluster there so each of

• these points is move, going to move towards the center of the circle here.

• So when you move the points 20% towards the centroid, we end up

• with these synthetic points and these synthetic points are manufactured points,

• are going to be the representative points for this cluster.

• And we repeat this process for the other clusters as well.

• So for each cluster we end up with k,

• in this case 4, represented points representing that cluster.

• So that's the first pass of the CURE Algorithm.

• In the, in the second pass, of the, of the algorithm, we scan the whole dataset.

• Remember, so

• far we've been working with, a sample of the dataset that fits in memory.

• Now we go back to the whole dataset and, which is sitting on disc.

• And we re-scan the whole dataset, and visit each point p.

• And, and we're going to take the point p and

• we're going to place it in the cluster that's closest to it.

• And the definition of closest is very simple.

• We're going to find the point you know, the, we're going to take p, and we're

• going to find the, the, among the set of representative points of all the clusters.

• We're going to find the representative point, that's closest to p, and

• we're going to assign p to the cluster, belonging to that representative point.

• And that's it.

• It's a very, very simple algorithm, where we just scan the data in one pass, and

• we place, each point, in, in its closest cluster, which is the cluster,

• with the closest representative point.

• Now the CURE is a, it's a very, very simple algorithm as as we saw.

• It just requires a main memory sample and

• clustered cluster, finding some representative points, and

• then one scan of the entire dataset to place each data point into you know,

• in, into the closest cluster, but surprisingly CURE really works well for

• a large number of used cases in finding, complicated clusters of the,

• of the kind that we saw with the faculty's salaries and so on.

• Our discussion of the CURE Algorithm wraps up our discussion of clustering,

• which we've covered over the past several lectures.

• Remember that the clustering point, problem is one where we're

• given a large set of points with a notion of distance between the points,

• usually in a very high-dimensional data space.

• And we, and our go, goal is to group these points into some number of clusters,

• with the points that are closer together being in the same cluster.