Data Analysis 7: Clustering - Computerphile

  • Today we're going to talk about clustering

  • Do you ever find that when you're on YouTube, you'll watch a video on something, and then suddenly you're being recommended a load of other videos

  • that you hadn't even heard of, that are actually kind of similar? This happens to me

  • I watched some video on some new type of saw, trying to learn it because, you know, I don't know what I'm doing, and suddenly I'm

  • being recommended videos on turning vases on wooden lathes and all kinds of weird stuff

  • And what's happening is I'm being clustered into groups of people

  • Who are liking those and watching those kind of videos or these kind of videos are all being clustered together as similar, right?

  • So clustering is one of the sort of core technologies at the heart of this kind of stuff. In fairness

  • I did end up watching a bunch of those woodturning videos

  • We've talked about the different kinds of datasets you might have, right, and up to now we've been talking about things like cleaning data

  • transforming data and reducing data

  • Now what we want to do is start trying to derive some knowledge. Now, sort of a typical way to do

  • this would be something like a classification algorithm or maybe a regression algorithm

  • But today we're going to talk about how do you separate out data and group data together?

  • When we don't have any labels for that data, so this is an example of unsupervised learning

  • There are different types of machine learning that we can use. One is unsupervised. This is where we don't have any labels

  • Then we have supervised learning where we have labels

  • so for example a supervised learning task might be one where you have some labels for your products or your videos and you're

  • trying to classify them like that. So maybe you're trying to classify videos into a genre, right? In

  • unsupervised learning we don't have any labels

  • Maybe we've just got a load of products and we're trying to group them into similar categories

  • So these are all the tools and these will be electronic products

  • And these are all the toys

  • right

  • and maybe we want to do that without having to have a person go through and click on them all. So why wouldn't we just

  • label everything and then use that to perform nice powerful machine learning algorithms like classification?

  • Well, sometimes it's just too expensive to produce labels. If you're running a music recommendation system

  • like, I don't know, Spotify, maybe going through and defining what genre everything is by hand is going to take

  • absolutely ages and be a waste of time, right, if we can do this automatically. So sometimes you aren't gonna have labels for data

  • And it's too expensive to obtain. It's too time-consuming. It's too difficult

  • Maybe people disagree over what genre two pieces of music are, so in that case you aren't going to have labels, and so

  • Clustering is a good option

  • right

  • Let's group things together with things that are similar in terms of their attributes without actually knowing what they are

  • What we're going to try and do, then, is take all the attributes of all the instances, or

  • objects, and use them to group similar objects together. But the question is, what is similar?

  • Let's think back to the way that we structure our data: we're going to have rows as our instances and we're going to have columns

  • as our attributes. And another way,

  • remember, to think about that is that these actually are data points in some space

  • where the position of each point, or the position of each instance, depends on the attributes

  • So, for example, we could have a three attribute data set

  • So maybe we have rows 1, 2, 3, 4 and we have attributes A, B and C, right?

  • So, I don't know, maybe row 1 has a value for A, a value for B and a value for C

  • So this means this is a sort of three-dimensional dataset with our axes A, B and C

  • So we're going to have something like this, and then C goes off this way

  • So this is A, this is B, and this is C

  • So maybe in this data set, you know instance 1 appears here and 2 over here and 3 over here and 4 over here

  • Right, you just imagine

  • these are in some sort of 3D space, you know. Perhaps intuitively we can say, well, okay

  • 1 is maybe closer to 4 than 2 is, because this is a shorter distance

  • But of course a three-dimensional space is hard to visualize, and it doesn't matter how many attributes we have

  • We can still say well ok, in this space. These instances are closer to these instances and then we can start grouping things together

  • So maybe, I mean, actually 2 is very far away from 3

  • So maybe we sort of group these two up and group these two up or something like this

  • So typically we're going to use Euclidean distance for this,

  • right, which is going to be this distance here between points 1 and 2 in this three-dimensional space
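
As a minimal sketch of that idea in R, with invented values: four instances as rows, three attributes as columns, and dist() computing the pairwise Euclidean distances in that three-dimensional space.

```r
# Four instances (rows 1-4) with three attributes A, B and C.
# The values are invented purely for illustration.
df <- data.frame(A = c(1, 8, 9, 2),
                 B = c(2, 7, 8, 1),
                 C = c(1, 6, 9, 2))

# dist() computes pairwise Euclidean distances by default, so we can
# see which instances sit close together in this attribute space.
dist(df)
```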

  • There's obviously going to be questions about how many groups we are grouping them into

  • Is 3 too far away to be in any of these groups?

  • These are things we have to think about, but this applies to any number of dimensions

  • The only thing holding you back is really just how fast your computer is. We're gonna look at two different

  • clustering algorithms, right? The first is going to be k-means, and then we're going to look at PAM

  • Alright, which is slightly different. Now,

  • we've talked about k-means on Computerphile before, but we're going to talk about it here for this course

  • So just, you know, to keep things simple for my hand drawing, I'm gonna stick to two dimensions here, and we've got some sort

  • Of data, right which is sort of looks like this and there may be some more data over here

  • And you know to us we can sort of say well maybe there's two groups

  • But we want to sort of formalize this process and you've got to consider that in two dimensions

  • Maybe it's quite clear that there's two groups

  • if you've got a sort of

  • n-dimensional space maybe a thousand dimensions or ten thousand dimensions

  • Picking out where the groups are is not something you want to be doing by hand

  • So what k-means does is it splits some data into K groups, right?

  • So I'm going to specify that straight away: K is 2 in this case, because I think there are two classes here

  • Now if I get that wrong, obviously, that's a problem. We'll talk about that later

  • But what we're gonna do is we're gonna pick two random points in this space. So let's say this one here and

  • This one here

  • So we've got two classes and we're going to start to assign each of these points based on whichever of these means is closer

  • So these are the center points for our new groups, generally speaking. Obviously, this is going to be Euclidean distance,

  • so essentially a circle in this case

  • So we're going to sort of look sort of like this and the blue one's going to come around

  • So kind of like this kind of like this and these will probably be red because they're slightly closer

  • So now, all of these are red. What we're going to do is we're going to label these all red

  • I can only do one iteration of this because I've now painted all over my picture. We start by assigning all of them. Now

  • we might be finished. But let's imagine we're not. What we want to try and do is

  • re-evaluate where the positions of our clusters are based on this information

  • So we take the mean, or the average position, of this group here, the red group, and we can say, well, okay,

  • it's sort of bang in the middle here. So we get rid of this one. I'm gonna try this other pen. Oh, it worked

  • Here's my new center position here

  • Right, the blue one, which I'm going to have to scribble out, is going to move to about there, something like this

  • So that's iteration one. Right, now we've calculated these center points,

  • so this boundary between what's going to be classified as blue and what's going to be classified as red is kind of going to

  • move this way a little bit. So I guess we're going to maybe re-evaluate, and this is going to become blue

  • So this is going to be an iterative process

  • we're going to keep recalculating these means based on the points that have moved back and forth between these two groups and

  • Eventually, these means should begin to converge and stop moving around as things settle down

  • And usually this actually happens pretty quickly, even in a high-dimensional space. K-means is a very popular algorithm
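
A from-scratch sketch of the loop just described, not the built-in kmeans() used later in the video; the names are my own, and it doesn't guard against a cluster ending up empty.

```r
# Bare-bones k-means: assign every point to its nearest mean, move each
# mean to the average of its points, repeat until nothing changes.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  means <- x[sample(nrow(x), k), , drop = FALSE]  # k random starting points
  assignment <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every mean
    d <- sapply(seq_len(k), function(j) colSums((t(x) - means[j, ])^2))
    new_assignment <- max.col(-d)        # index of the nearest mean
    if (identical(new_assignment, assignment)) break  # converged
    assignment <- new_assignment
    for (j in seq_len(k))                # recompute each mean
      means[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = means)
}
```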

  • It's got a few drawbacks. One is that, let's imagine we had a single point way over here, an outlier. Right, now,

  • hopefully you've got rid of most of your outliers in the previous video

  • But if you haven't and you've got an outlier here that you weren't expecting

  • Then what's going to happen is this is going to be assigned, in the first iteration, to blue

  • It's going to pull the mean of this group this way

  • which means that more of them are going to be assigned red and

  • Red is going to go this way as well and it's just going to move the means around and cause a bit of a problem

  • We might get away with it in this case

  • But you can imagine if you've got a large high dimensional space and you're trying to cluster lots and lots of clusters

  • Getting the means in the wrong position could cause a bit of instability and cause the wrong things to be classified and clustered together

  • There's a couple more issues. One is that, you know,

  • where you start your means on the first iteration is obviously quite important. If you place it at random,

  • there's a chance you're going to put it right up here, and things could take a lot longer to converge, or could settle on some

  • clustering that you're not happy with. So this outlier is going to be a problem, right?

  • It's going to make K means struggle slightly
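
For what it's worth, R's built-in kmeans() has an nstart argument that mitigates exactly this: it reruns the algorithm from several random starting configurations and keeps the best run. A small sketch on invented data:

```r
# Two invented 2-D blobs
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# nstart = 25 tries 25 random initialisations and keeps the run with
# the lowest total within-cluster sum of squares, so one unlucky
# starting position doesn't decide the final clustering.
km <- kmeans(x, centers = 2, nstart = 25)
km$tot.withinss
```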

  • So as an alternative we can use something which is called PAM, or Partitioning Around Medoids, or k-medoids,

  • whatever you want to call it. Instead of calculating a mean for our cluster and moving those means around, what we're going to do is

  • Use actual points from our cluster

  • So what we do is we start off exactly the same as k-means but instead of picking two random positions we pick two random points

  • So for example, what we'll do is we'll pick this red one here and we'll pick this blue one here now

  • These are treated exactly like the means in k-means

  • So we then cluster our data around these two points, and then we calculate an error for each cluster

  • That is, the total distance from all the other points we assigned

  • to that cluster. So you can imagine, hopefully, if this point has been chosen in the middle of a cluster, then the distance will be

  • quite small, because everything will be tightly bound together. If it's way over here, as an outlier,

  • It's going to be a huge error because the distance to all of these points is massive

  • So then what we do is we pick a group at random and we move the center to another point

  • So we say, okay, we were here,

  • let's move to here, and we

  • repartition our data and we calculate a new error, the distance to all the points in our new clusters, based on this new position that we just picked

  • And if it's better, we permanently move our center point there

  • if it's not we go back to where we were before we pick a new cluster at random and a new point at random and

  • we repeat this process. So in k-means you move both means; in fact, however many

  • clusters you've got, you're going to move all the means at the same time, right?

  • Because you repartition the data, all the means are going to move around, and then you repartition the data again and you repeat. In

  • PAM, you just move one mean, or one

  • exemplar, or medoid, at a time

  • So let's say you pick the red one first

  • You move that, and maybe pick the red one again and you move that, and then it's blue's turn, and you move that

  • And obviously this is gonna take a little while to do over time

  • Hopefully what will happen is you find that

  • more and more of the time, you try a move and it doesn't work, because it just increases the error, because you've settled on something

  • pretty good

  • and

  • Also, eventually, if you take long enough doing this, you're gonna visit all your points, and then you might as well stop as well

  • so typically

  • What you would do is stop after you

  • Fail a number of times to move somewhere better

  • because really you've actually found somewhere pretty good

  • This neatly avoids our problem of outliers, because this one here won't affect the position of this cluster: if we ever chose it to

  • be a center, it would be immediately discarded, because the error is so large

  • As opposed to it actually affecting the mean and pulling this cluster this direction
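
In R, PAM is available as pam() in the cluster package (one of R's recommended packages). A minimal sketch on invented data, with an outlier thrown in to show that it never drags the centers around:

```r
library(cluster)  # provides pam()

# Two invented 2-D blobs plus one extreme outlier
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2),
           c(40, 40))  # the outlier, far from everything

p <- pam(x, k = 2)
p$medoids            # actual data points chosen as cluster centers
table(p$clustering)  # the outlier joins a cluster but is never its medoid
```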

  • So there's one last problem, and that is the problem of how did we get this 2? I

  • said that I thought there were two clusters in this data, and happily there were, and that worked out really nicely

  • But if you've got, you know a huge data set

  • there's no way to guess how many clusters there are going to be

  • Or if you do guess, maybe that's not the optimal number of clusters

  • So for example, if you're trying to cluster up songs on Spotify, I mean, how many clusters is that?

  • I have no idea. Like, lots. So you put 80 in and it's okay,

  • but is that right? Should you go up? Should you do 100, or should you do 60? I don't know

  • So there are approaches like DBSCAN which will try and bring in the concept of a neighborhood, and have the ability to increase or

  • Decrease the number of clusters as appropriate for your data. All right

  • So what's going to happen is they'll say this looks good

  • But if we split this in two and had two clusters here instead

  • that would be a better fit, right? So these are very useful

  • techniques that you can use if you want something a little bit more powerful
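
One way to try this in R is the third-party dbscan package (an assumption that it's installed; it isn't part of base R). Rather than taking a number of clusters, it takes a neighbourhood radius and a density threshold:

```r
library(dbscan)  # third-party package: install.packages("dbscan")

# The same two invented blobs as before
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# No k here: eps is the neighbourhood radius and minPts the minimum
# number of points for a region to count as dense. DBSCAN decides the
# number of clusters itself and labels sparse points 0 (noise).
db <- dbscan(x, eps = 1, minPts = 5)
table(db$cluster)
```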

  • Now, it wouldn't be a data analysis course if we didn't look at the iris dataset at least once. This is a classic

  • dataset everyone uses, and it's good for clustering: nice and small. And if we have a look at this dataset,

  • we've got three different species of flower. We've got setosa, versicolor and

  • virginica. We've got four attributes: sepal length, sepal width, petal length, petal width. Just for this occasion,

  • I looked up what a sepal is, and it's the green bit that covers the flower when it's folded up. Right, now,

  • I don't know much about these flowers, but they are subtly different. One of them is a little bit more different than the others

  • So it makes for a good clustering problem because we're hoping for three distinct clusters

  • The iris dataset is one of the ones that's built into R; you can literally call data(iris) and

  • it'll load it up for you. Now,

  • let's have a quick look at what we've got, because there's a lovely function in R called pairs,

  • which just shows us a load of scatter plots of different attributes

  • so if I run this
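
The call being run is presumably along these lines (colouring by species is my assumption about how the plot was produced):

```r
data(iris)  # load the built-in iris dataset

# One scatter plot for every pair of the four measurement columns,
# coloured by species so the three classes stand out.
pairs(iris[, 1:4], col = iris$Species)
```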

  • This is only going to work for a few attributes before the whole thing becomes very difficult to look at

  • So we've got things like sepal length sepal width and the correlations of these and these are colored by the different class of flower

  • So you can see that of the three classes, one of them is actually quite different a lot of the time, and then some of

  • them, these red and green classes, have got quite a lot of overlap

  • So clustering those is going to be a little bit more difficult, bearing in mind we're using four dimensions to do it,

  • Not the two you're seeing in any individual scatter plot. Okay. So let's just start off with standard k-means

  • so we're going to call km3, which is k-means with three clusters

  • kmeans, there's a function for this in R, on the iris dataset, columns 1 to 4

  • so not the species of plant

  • We're not going to cluster on that. Three clusters, and we're going to allow it to go 400 iterations

  • K-means will stop early if it isn't improving, but if it keeps going, maybe it's just going back and forth a little bit

  • and it's time to stop. That did not take very long

  • This object returned by the k-means function is going to have an integer

  • determining which of our

  • instances have been assigned to which cluster, so all of these first ones have been assigned to cluster 2, and

  • the centers for all of our clusters as well
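
Reconstructed from this description, the call and the two parts of the result being inspected look roughly like this:

```r
# Cluster the four measurement columns (not the species label) into
# three groups, allowing up to 400 iterations before giving up.
km3 <- kmeans(iris[, 1:4], centers = 3, iter.max = 400)

km3$cluster  # integer cluster assignment for each of the 150 flowers
km3$centers  # the final center (mean) of each of the three clusters
```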

  • So remember that in R, once we have a data frame like this iris one, we can add other

  • columns to it

  • So we're going to just add our k-means result back into our iris data frame so we can keep track of it

  • So we're going to say iris$km3

  • is equal to km3$cluster

  • That's gonna be in there. Okay, so let's put it in a table

  • We'll have a look at how our clusters match up to our actual flowers

  • So it's going to be a table of the iris species versus the iris clusters from k-means

  • Alright, so if we have a look at that

  • the first thing we'll see is that it doesn't immediately make much sense, because, for example, setosa, which is our class 1 in some

  • sense, has been assigned to cluster 3. So what we're going to do is we're going to reorder these columns so that the

  • correct classifications are down the diagonal, much like a confusion matrix. So we have a function to do that that we're going to call
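
The first two steps are as spoken; the course's reordering function isn't shown in the video, so the helper below is a hypothetical stand-in that permutes the columns so each row's largest count lands on the diagonal:

```r
iris$km3 <- km3$cluster               # keep the assignments with the data
tab <- table(iris$Species, iris$km3)  # species vs. cluster number

# Hypothetical helper: assumes the row-wise maxima pick out a
# permutation of the columns, as they do here.
reorder_diagonal <- function(tab) tab[, apply(tab, 1, which.max)]
reorder_diagonal(tab)
```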

  • If we look at this result, we can see that the results are an 89%

  • classification accuracy. There were 50 of each plant in this dataset; 48 of the versicolors have been correctly assigned to cluster 2 together,

  • but two of them were in cluster 1 along with the virginicas. And finally, the virginicas: 36 of 50 have been

  • correctly assigned to cluster 1, and 14 have been incorrectly clustered into cluster 2. Right, so it worked pretty well. It's not perfect

  • Bearing in mind, if you really want to separate out plants, maybe you need more than four dimensions

  • Maybe you can't absolutely tell what a plant is based on just four dimensions

  • All right, some of these plants are similar enough, but the clustering isn't very well defined

  • So perhaps we can make our life a little bit easier by using principal component analysis to do dimensionality reduction

  • or just to reframe our data onto some different axes, to get a better clustering result. So we're going to do a very similar thing

  • We're going to run PCA on

  • The iris dataset and we're going to project our points into that new

  • Principal component space and then we're going to take only the first two dimensions

  • So this is principal component 1 and principal component 2 as we covered in the principal component video

  • Then what we're going to do is we're going to apply k-means to those results rather than the original data

  • So what we've done is we've transformed our 4 dimensions of sepal width sepal length

  • Petal length and petal width onto our principal component axes and then we've discarded the last two and kept just two axes

  • So I'm going to run that. That didn't take very long

  • Ok

  • We're going to assign that back to our iris dataset just like we did with the results of k-means, and then we can bring

  • up another table and see how the results compare, and then we'll order that again by the diagonal
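
A sketch of that whole sequence, assuming prcomp() with its defaults (whether the data is scaled first isn't stated in the video):

```r
# Project the four measurements onto principal component axes
pca <- prcomp(iris[, 1:4])

# Keep only the first two principal components as our new data
pc2 <- pca$x[, 1:2]

# Same k-means as before, but on the 2-D projection
km3_pca <- kmeans(pc2, centers = 3, iter.max = 400)

# Store it and compare against the species, exactly as before
iris$km3_pca <- km3_pca$cluster
table(iris$Species, iris$km3_pca)
```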

  • Results were almost exactly the same. I think it was 88% 89% something like this

  • You can see that one extra

  • versicolor was put into cluster 2 when it shouldn't have been, but this is with only two dimensions instead of four

  • So we've halved the number of dimensions, but by using PCA

  • we've got almost the exact same result. For datasets that

  • you don't have labels for, maybe the labels are too hard to get, or you don't know what they would be,

  • I think clustering is a good way to group up data and start to derive some knowledge. The knowledge we can derive is which items

  • are similar to each other, which products in our database are similar to each other, so that we can start using them for a

  • recommender system. You know, what movies are like each other, what songs are like each other, what flowers are like each other?

  • So the idea is that we're clustering data up, and by doing that we can look at these clusters and start to gain some knowledge

  • Don't forget also that each of these could have a prediction as well,

  • so this one here, attribute 1, is going to have, let's say, a label: did we play tennis, or this person is

  • healthy, or this person has this disease. It depends on your data
