Today we're going to talk about clustering. Do you ever find, when you're on YouTube, that you'll watch a video on something and then suddenly you're being recommended a load of other videos that you hadn't even heard of, that are actually kind of similar? This happens to me. I watched some video on some new type of saw, trying to learn it because, you know, I don't know what I'm doing, and suddenly I'm being recommended videos on turning vases on wood lathes and all kinds of weird stuff. What's happening is that I'm being clustered into groups of people who like and watch those kinds of videos, or those kinds of videos are all being clustered together as similar, right? So clustering is one of the sort of core technologies at the heart of this kind of stuff. In fairness, I did end up watching a bunch of those woodturning videos.
We've talked about the different kinds of datasets you might have, and up to now we've been talking about things like cleaning data, transforming data and reducing data. Now what we want to do is start trying to derive some knowledge. A typical way to do this would be something like a classification algorithm, or maybe a regression algorithm, but today we're going to talk about how you separate out data and group data together when we don't have any labels for that data. This is an example of unsupervised learning.
There are different types of machine learning that we can use. One is unsupervised, which is where we don't have any labels; then we have supervised learning, where we have labels. So, for example, a supervised learning task might be one where you have some labels for your products or your videos and you're trying to classify them, maybe trying to classify videos into a genre. In unsupervised learning we don't have any labels. Maybe we've just got a load of products and we're trying to group them into similar categories: these are all the tools, these are the electronic products, and these are all the toys, right? And maybe we want to do that without having a person go through and click on them all. So why wouldn't we just label everything and then use that to run nice, powerful machine learning algorithms like classification?
Well, sometimes it's just too expensive to produce labels. If you're running a music recommendation system, like the one that powers Spotify, going through and defining what genre everything is by hand is going to take absolutely ages, and it's a waste of time if we can do this automatically. So sometimes you aren't going to have labels for your data: they're too expensive to obtain, too time-consuming, too difficult. Maybe people disagree over what genre particular pieces of music are. In those cases you're not going to have labels, and so clustering is a good option.
Right, so let's group things together with other things that are similar in terms of their attributes, without actually knowing what they are. What we're going to try and do, then, is take all the attributes and all the instances, the objects, and use them to group similar objects together. But the question is: what is similar? Let's think back to the way that we structure our data. We're going to have rows as our instances and columns as our attributes, and another way to think about that is that these are actually data points in some space, where the position of each point, the position of each instance, depends on its attributes.
So, for example, we could have a three-attribute dataset. Maybe we have rows 1, 2, 3 and 4, and we have attributes A, B and C, and each row has a value for A, a value for B and a value for C. This means it's a three-dimensional dataset, with axes A, B and C. So we're going to draw something like this: this is A, this is B, and this is C. Maybe in this dataset instance 1 appears here, 2 over here, 3 over here and 4 over here.
Right, you can imagine these are in some sort of 3D space, and perhaps intuitively we can say, well, OK, 1 is maybe closer to 4 than 2 is, because this is a shorter distance. Of course, a three-dimensional space is hard enough to visualize, but it doesn't matter how many attributes we have: we can still say, OK, in this space these instances are closer to those instances, and then we can start grouping things together. So maybe, actually, 2 is very far away from 3, so we group these two up and group those two up, or something like this.
Typically we're going to use Euclidean distance for this, which is going to be the straight-line distance here between points 1 and 2 in this three-dimensional space.
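As a quick aside (this isn't from the video), the Euclidean distance is just the square root of the summed squared differences between the attributes, and it works the same way in any number of dimensions. In R, with some made-up values:

p1 <- c(1.0, 2.0, 3.0)   # instance 1: its values for attributes A, B, C
p4 <- c(1.5, 2.0, 3.5)   # instance 4
sqrt(sum((p1 - p4)^2))   # straight-line (Euclidean) distance
dist(rbind(p1, p4))      # base R's dist() gives the same answer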
There are obviously going to be questions about how many groups we're grouping them into, and whether 3 is too far away to be in any of these groups. These are things we have to think about, but this applies to any number of dimensions; the only thing holding you back is how fast your computer is and how good your code is. We're going to look at two different clustering algorithms. The first is going to be k-means, and then we're going to look at PAM, which is slightly different. Now, we've talked about k-means on Computerphile before, but we're going to talk about it here for this course.
So, just to keep things simple for my arm and my drawing, I'm going to stick to two dimensions here. We've got some sort of data which looks like this, and maybe some more data over here, and to us it's easy to say, well, maybe there's two groups. But we want to formalize this process, and you've got to consider that while in two dimensions it might be quite clear that there are two groups, if you've got an n-dimensional space, maybe a thousand or ten thousand dimensions, picking out where the groups are is not something you want to be doing by hand.
So what k-means does is split the data into K groups. I have to specify K straight away, and K is 2 in this case, because I think there are two classes here. Now, if I get that wrong, obviously that's a problem; we'll talk about that later.
What we're going to do is pick two random points in this space, let's say this one here and this one here. So we've got two classes, and we're going to assign each of the data points based on whichever of these means is closer. These are the center points for our new groups. Generally speaking this is going to be Euclidean distance, so essentially a circle in this case. The red one is going to cover sort of this area, and the blue one's going to come around kind of like this, and these points will probably be red because they're slightly closer. So, now that all of these are red, what we're going to do is label them all red.
I can only do one iteration of this on paper, because now I've painted all over my picture. We start by assigning all of the points, and we might already be finished, but let's imagine we're not. What we want to do next is re-evaluate the positions of our clusters based on this information. So we take the mean, the average position, of this group here, the red group, and we can say, well, OK, it's sort of bang in the middle here. So we get rid of this one (let me get the cap off this pen... oh, it worked) and here's my new center position. The blue one, which I'm going to have to scribble out, is going to move to about there, something like this.
So that's iteration one. Now that we've recalculated these center points, the boundary between what's going to be classified as blue and what's going to be classified as red is going to move this way a little bit, so I guess we're going to re-evaluate, and this point is going to become blue. This is an iterative process: we keep recalculating these means based on the points that have moved back and forth between the two groups, and eventually these means should converge and stop moving around as things settle down.
And usually this actually happens pretty quickly, even in a high-dimensional space; k-means is a very popular algorithm.
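To make the loop concrete, here's a toy version of k-means in R. This is my own sketch, not code from the video, and it ignores edge cases like a cluster ending up empty:

toy_kmeans <- function(X, k, iters = 100) {
  # start from k random data points as the initial means
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (i in seq_len(iters)) {
    # assignment step: each point joins its nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    cluster <- max.col(-d)
    # update step: each center moves to the mean of the points assigned to it
    new_centers <- t(sapply(1:k, function(j)
      colMeans(X[cluster == j, , drop = FALSE])))
    if (all(abs(new_centers - centers) < 1e-9)) break  # means stopped moving
    centers <- new_centers
  }
  list(centers = centers, cluster = cluster)
}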
It does have a few drawbacks, though. One is outliers. Let's imagine we had a single point way over here, an outlier. Hopefully you got rid of most of your outliers in the previous video, but if you haven't, and you've got an outlier here that you weren't expecting, then what's going to happen is that it gets assigned, in the first iteration, to blue. It's going to pull the mean of that group this way, which means more of the points are going to be assigned red, and red's mean is going to move this way as well; it's just going to drag the means around and cause a bit of a problem. We might get away with it in this case, but you can imagine that if you've got a large, high-dimensional space and you're trying to fit lots and lots of clusters, getting the means in the wrong position could cause a bit of instability and cause the wrong things to be classified and clustered together.
There are a couple more issues. One is that where you start your means on the first iteration is obviously quite important: if you place them at random, there's a chance you're going to put one right up here, and things could take a lot longer to converge, or could settle on some clustering that you're not happy with. So this outlier is going to be a problem; it's going to make k-means struggle slightly.
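One practical note, not mentioned in the video: R's built-in kmeans() has an nstart argument that runs the algorithm from several random starting positions and keeps the best result, which softens the bad-initialization problem (X here is a placeholder for your data):

km <- kmeans(X, centers = 2, nstart = 25)  # try 25 random starts, keep the best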
So as an alternative we can use what's called PAM, or partitioning around medoids, or k-medoids, whatever you want to call it. Instead of calculating a mean for each cluster and moving those means around, what we're going to do is use actual points from our cluster. We start off exactly like k-means, but instead of picking two random positions we pick two random points. So, for example, we'll pick this red one here and this blue one here, and these are treated exactly like the means in k-means: we cluster our data around these two points, and then we calculate an error for each cluster, which is the distance from the chosen point to all the other points we assigned to that cluster. Hopefully you can imagine that if this point has been chosen in the middle of a cluster, the distances will be quite small, because everything is tightly bound together; if it's way over here, an outlier, it's going to have a huge error, because the distance to all of these points is massive.
Then what we do is pick a cluster at random and move its center to another point. So, OK, we were here; let's move to here. We repartition our data and calculate a new error, the distances to all the points in our new clusters, based on this new position we just picked. If it's better, we permanently move our center point there; if it's not, we go back to where we were before, pick a new cluster at random and a new point at random, and repeat the process. So in k-means you move all the means at once: however many clusters you've got, you move all the means at the same time, because when you repartition the data all the means move around, and then you repartition again and repeat like that. In PAM you just move one mean, one exemplar or medoid, at a time. So let's say you pick the red one first: you move that, then maybe you pick the red one again and move that, and then it's blue's turn and you move that.
Obviously this is going to take a little while, but over time, hopefully, what will happen is you find that more and more of the time you try a move and it doesn't work, because it would just increase the error, because you've already settled on something good. Also, eventually, if you take long enough doing this, you're going to have visited all your points, and then you might as well stop anyway. So typically what you would do is stop after you fail a number of times to move somewhere better, because at that point you've really found somewhere pretty good.
This neatly avoids our problem of outliers, because this one over here won't affect the position of this cluster: if we ever chose it to be a center, it would be immediately discarded, because the error would be so large, as opposed to it actually affecting the mean and pulling the cluster in that direction.
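For reference, PAM is implemented in R's cluster package; this call is my sketch of how you'd use it, not something run in the video:

library(cluster)               # install.packages("cluster") if needed
pm <- pam(iris[, 1:4], k = 3)  # k-medoids with 3 clusters
pm$medoids                     # the actual data points chosen as centers
pm$clustering                  # the cluster assigned to each instance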
So there's one last problem, and that is: how did we get this 2? I said that I thought there were two clusters in this data, and happily there were, and that worked out really nicely. But if you've got a huge dataset, there's no way to guess how many clusters there are going to be, or if you do guess, maybe that's not the optimal number of clusters. For example, if you're trying to cluster up songs on Spotify, how many clusters is that? I have no idea; lots. So you put 80 in and it's OK, but is that right? Should you go up? Should you do 100, or should you do 60? I don't know.
So there are approaches like DBSCAN which bring in the concept of a neighborhood and have the ability to increase or decrease the number of clusters as appropriate for your data. What's going to happen is it'll say: this looks good, but if we split this in two and had two clusters here instead, that would be a better fit. So these are very useful techniques you can use if you want something a little bit more powerful.
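If you want to try it, DBSCAN lives in R's dbscan package; the eps (neighborhood radius) and minPts values below are just illustrative and would need tuning for real data:

library(dbscan)                                   # install.packages("dbscan") if needed
db <- dbscan(iris[, 1:4], eps = 0.5, minPts = 5)
db$cluster                                        # cluster per point; 0 means noise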
Now, it wouldn't be a data analysis course if we didn't look at the iris dataset at least once. This is a classic dataset everyone uses, and it's good for clustering: nice and small, and we can have a look at it easily. In this dataset we've got three different species of flower: setosa, versicolor and virginica. We've got four attributes: sepal length, sepal width, petal length and petal width. Just for this occasion I looked up what a sepal is, and it's the green bit that covers the flower when it's folded up. Now, I don't know much about these flowers, but they are subtly different, and one of them is a little bit more different than the others, so it makes for a good clustering problem, because we're hoping for three distinct clusters.
The iris dataset is one of the ones that's built into R: you can literally call data(iris) and it'll load it up for you. Now, let's have a quick look at what we've got, because there's a lovely function in R called pairs, which shows us a load of scatter plots of the different attributes against each other.
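The call is roughly this (pairs() is base R; coloring by species is my assumption, to match what's shown on screen):

data(iris)                              # load the built-in dataset
pairs(iris[, 1:4], col = iris$Species)  # scatter plot of every attribute pair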
If I run this... this is only going to work for a few attributes before the whole thing becomes very difficult to look at. So we've got things like sepal length and sepal width and the relationships between them, and the points are colored by the class of flower. You can see that, of the three classes, one of them is actually quite different a lot of the time, and then the other two, the red and green classes here, have quite a lot of overlap, so clustering those is going to be a little bit more difficult, bearing in mind we're using four dimensions to do it, not the two you're seeing in any individual scatter plot. OK, so let's just start off with standard k-means.
So we're going to call km3 the result of k-means with three clusters (there's a function for this in R) on the iris dataset: all of the rows, columns 1 to 4, so not the species of the plant, because we're not going to cluster on that; three clusters; and we're going to allow it up to 400 iterations. K-means will stop early if it isn't improving, but if it keeps going, maybe it's just bouncing back and forth a little bit and it's time to stop. That did not take very long.
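Reconstructing that call from the narration, it's something like:

km3 <- kmeans(iris[, 1:4], centers = 3, iter.max = 400)  # columns 1-4 are the measurements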
The object returned by the kmeans function has an integer vector telling us which of our instances has been assigned to which cluster (so all of these first ones have been assigned to cluster 2), and the centers of all of our clusters as well. Remember that in R, when we have a data frame like iris, we can add other columns to it, so we're going to add our k-means result back into our iris data frame so we can keep track of it.
So we're going to say iris$km3 is equal to km3$cluster, and that's going to go in there. OK, so let's put it in a table, and we'll have a look at how our clusters match up to the actual flowers: it's going to be a table of the iris species versus the iris clusters from k-means.
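In code, as dictated:

iris$km3 <- km3$cluster        # store the cluster assignments as a new column
table(iris$Species, iris$km3)  # species versus assigned cluster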
All right, so if we have a look at that, the first thing we'll see is that it doesn't immediately make much sense, because, for example, setosa, which is our class 1 in some sense, has been assigned to cluster 3. So what we're going to do is reorder these columns so that the correct classifications run down the diagonal, much like a confusion matrix. We have a helper function to do that, which we're going to call.
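That helper isn't shown in the video, but a hypothetical version for a small table could just try every column order and keep the one that maximizes the diagonal:

reorder_diag <- function(tab) {
  # brute force over column permutations; fine for a 3x3 table
  perms <- rbind(c(1,2,3), c(1,3,2), c(2,1,3), c(2,3,1), c(3,1,2), c(3,2,1))
  scores <- apply(perms, 1, function(p) sum(diag(tab[, p])))
  tab[, perms[which.max(scores), ]]
}
reorder_diag(table(iris$Species, iris$km3))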
If we look at the result, we can see an 89% classification accuracy. There were 50 of each plant in this dataset: 48 of the versicolor plants have been correctly assigned to their own cluster, but two of them ended up in the cluster with the virginicas; and finally, of the virginica, 36 of 50 have been correctly assigned, and 14 have been incorrectly clustered in with the versicolors. So it worked pretty well; it's not perfect. Bear in mind that if you really want to separate out plants, maybe you need more than four dimensions; maybe you can't completely tell what a plant is based on just four measurements. Some of these plants are similar enough that the clustering isn't very well defined.
So perhaps we can make our life a little bit easier by using principal component analysis to do dimensionality reduction, or just to reframe our data onto different axes, to get a better clustering result. We're going to do a very similar thing: we're going to run PCA on the iris dataset, project our points into that new principal component space, and then take only the first two dimensions, principal component 1 and principal component 2, as we covered in the principal component video. Then we're going to apply k-means to those results rather than the original data. So what we've done is transform our four dimensions of sepal width, sepal length, petal length and petal width onto our principal component axes, and then we've discarded the last two axes and kept just two.
So I'm going to run that; that didn't take very long. OK, we're going to assign that back to our iris dataset, just like we did with the results of k-means, and then we can bring up another table and see how the results compare, and then we'll order that again by the diagonal.
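Putting those steps together in R (scaling inside prcomp is my assumption; the exact arguments aren't shown in the video):

pca <- prcomp(iris[, 1:4], scale. = TRUE)            # project onto principal components
pc2 <- pca$x[, 1:2]                                  # keep only PC1 and PC2
km3_pca <- kmeans(pc2, centers = 3, iter.max = 400)  # cluster in the reduced space
iris$km3_pca <- km3_pca$cluster
reorder_diag(table(iris$Species, iris$km3_pca))      # compare against the species again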
The results were almost exactly the same: I think it was 88% versus 89%, something like that. You can see that one extra versicolor was put into cluster 2 when it shouldn't have been, but this is with only two dimensions instead of four. So we've halved the number of dimensions, but by using PCA we've got almost exactly the same result.

For datasets that you don't have labels for, maybe because the labels are too hard to get or you don't know what they would be, I think clustering is a good way to group up data and start to derive some knowledge. The knowledge we can derive is which items are similar to each other, which products in our database are similar to each other, so that we can start using them in a recommender system: what movies are like each other, what songs are like each other, what flowers are like each other.
So the idea is that we're clustering data up, and by doing that we can look at these clusters and start to gain some knowledge. Don't forget also that each of these instances could have a prediction attached as well: this one here, with attribute one, could have, let's say, a label, like whether we did play tennis, or this person is healthy, or this person has this disease. It depends on your data.