Principal component analysis is perhaps the most widely used data reduction technique on the planet
Everyone uses it but here's the thing. It doesn't actually do data reduction
Principal component analysis is the idea of trying to find a different view for our data in which we can separate it better
And I'll show an example on a piece of paper
And the idea is that what we want to try and do is reframe our data
Maybe move it around so that we can better separate things out, better cluster things, perhaps it's better for machine learning
Now as a side effect of this
in PCA, we also order our
Axes by the most to least useful in some sense. So then we can perform a separate data reduction
technique later by taking the slightly less useful axes away, or in this case,
dimensions or attributes of our data. PCA is commonly pitched as a data reduction technique. Actually, it's a data transformation technique
It just makes our data more amenable to reduction later
So let's imagine we have some attributes and we know that some are correlated, some are not correlated
The problem is that maybe we don't want to just delete some of the attributes. Maybe
0.65 correlation is, I mean,
it's good
but it doesn't mean we definitely want to delete attribute two and keep only attribute one. On the other hand, maybe
we do need to reduce the number of dimensions we've got, or maybe we just want to try and make our data
More amenable to things like clustering. So let's look at a quick example
Typically PCA is done over many dimensions when we've got lots and lots of attributes
I'm just going to show two because obviously it starts to break down when I try and draw that many on the page
So if we have two attributes
And what we want to try and do is work out what the contribution of each of these is to our data set
Which of these is useful, which of these is not useful. And now obviously if we had many dimensions, you know,
seven hundred, ten thousand, we can still apply the same technique. So maybe we have some data that's like this
We have some data points over here and perhaps we have a little gap and maybe some data over here and in general
Our data is kind of increasing like this. So this means that attribute one and attribute two are positively correlated to some extent
but maybe the correlation is not so strong that we just want to delete attribute two. What we want to try and do is
transform our data into a way where these are more useful. Imagine that you've got some data that looks a bit like this
But if we rotate our data
We take a different view we can see there's actually two
objects and then we can separate them out, and maybe if you were to rotate again you could see there were four objects and
so on. This is the idea: what PCA is going to do is find new axes for this data
that separate it better. For PCA to work, what we would start by doing is standardizing our data
So all of our dimensions, attribute one, attribute two, all of the attributes, are going to be centered around zero
And they're going to have a standard deviation of one
PCA will not really work at all if you have widely different scales for your data
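As a rough sketch of what that standardization step looks like in R, on a made-up two-attribute data frame (the names and values here are just illustrative, not from the video):

```r
# Made-up data frame with two attributes on very different scales
data <- data.frame(attribute1 = c(2, 4, 6, 8, 10),
                   attribute2 = c(100, 220, 310, 390, 500))

# scale() centres each column on 0 and rescales it to standard deviation 1
standardised <- scale(data)

colMeans(standardised)       # each column mean is now (numerically) 0
apply(standardised, 2, sd)   # each column standard deviation is now 1
```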
So what we want to try and do is find a direction or an axis through these two attributes
That separates out our data better than individual attributes
So let's see how this data looks just from attribute one
If we trace down this way
you can see that it's sort of got this amount of spread in attribute one and they're kind of
dotted around like this, and they sort of go all the way along like this, so you can't really see anything on here
Meaningful about these two groups, right?
And of course the more dimensions you have the more this could be a problem
Similarly for attribute two, if we trace along here it goes from this range to this range
This is the variance of attribute two, like the range, and we can see that roughly speaking
the data is as spread out in attribute one as it is in attribute two. That spread is about the same, and both of them
are kind of useful for looking at our data, but not really, because again
we have an equal distribution of points all the way along here. So that's not hugely useful
All right. So if we look at just attribute one, that's not hugely helpful
If we look at just attribute two, that's not hugely helpful either. So what can we do?
well
what we want to try and do is find a new axis, like some new attribute, that fits through this data like this and can
really separate everything out, because the spread of this data is actually diagonal in some sense, not this way or this way
So what principal component analysis is going to do is find this principal component, this axis, through our data like this
such that when we look at the spread of our data, it's maximized, right?
So the data is as spread out as we can find it
And this is going to happen over any number of attributes
So attribute one here,
attribute two, attribute three, attribute four, all the way to attribute n, when we've got maybe 700 or 800 or
1000. So at the moment we're just fitting one principal component, this is one line through our two-dimensional data
There's going to be more principal components later, right?
But what we want to do is we want to pick the direction through this data
However many attributes it has, that has the most spread. So how do we measure this?
There's really two goals, which are exactly the same. One is to maximize the variance
So we find a direction for this line such that these points at the very edge are farthest apart
The other one is that we minimize the error
so we take this error from here, this distance, this distance from all these points to our new axis, and we minimize it, so
You can imagine if we do this for all our points we can get the sum of the squared
distances from these points to this line and then as we move this line around
sometimes it's going to be better,
sometimes it's not. If we have a line that goes like this,
some of these distances are going to be very large like this and that's going to be a higher amount of error
So what we'll find is that if we do this our first principal component will sit through whichever direction in the data
minimizes these distances and by definition
Maximizes this spread which makes this axis super useful if we use this axis now as our new X and we rotate this whole page
All our data is lovely and separated. And actually we have two distinct clusters in this data set, right?
So that's what we're going to do
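To make that concrete, here's a small R sketch with made-up two-dimensional data standing in for the drawn example; after standardizing, the spread along the first principal component comes out bigger than along either original attribute:

```r
set.seed(1)

# Made-up, positively correlated data standing in for the drawn example
attribute1 <- rnorm(200)
attribute2 <- attribute1 + rnorm(200, sd = 0.4)
d <- scale(cbind(attribute1, attribute2))

pca <- prcomp(d)

var(d[, 1])       # 1 after standardization
var(d[, 2])       # 1 after standardization
var(pca$x[, 1])   # larger than either: most of the spread is diagonal
```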
Now as I mentioned PCA doesn't typically reduce the number of attributes from two to one just like that
We're going to have another principal component which represents the second most variance, orthogonally, so at ninety degrees
So that's going to be this one here
We find the first principal component which maximizes variance and then we find the next one along that maximizes variance in the next direction
Now if there were multiple dimensions we'd keep applying this process
We keep finding new axes for our data that systematically show more and more of a spread of our data
But crucially we're ordering this by the amount of variance that they represent
So this is PC one or principal component one. This is principal component two and
Principal component one is always going to have the most
varied data in it, principal component two the next most, three the next most, all the way to the end with the least
so a natural
Side-effect of this process is that we're going to have new axes through our data
and we're going to have the same number of axes as there are original dimensions in our data
But they're going to get less and less useful in terms of the variance of our data as we go forward
So PC one is going to be the most important
most of our data is spread out across
PC one, PC two a little bit less spread out, PC three a little bit less still, all the way down to PC n all the way
Down here if you wanted to perform dimensionality reduction because you felt you had too many dimensions to your data
you could just for example keep the first 10 principal components project your data into that space and
Still retain most of the information
we won't go into the mathematics of how to calculate these principal components because you can find it very easily online, and R has a
lovely function to do it for us
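In R that function would typically be prcomp. Continuing the small sketch from before, it gives the components ordered by the variance they capture, plus the transformed data, and keeping just the first few columns is the dimensionality reduction step mentioned above:

```r
pca <- prcomp(d)   # d is the standardized matrix from the sketch above

# Components are ordered by the proportion of variance they capture
summary(pca)

# The transformed (rotated) data: one column per principal component
scores <- pca$x

# Dimensionality reduction: keep only the first k components,
# e.g. the first 10 on a data set with hundreds of attributes
k <- min(10, ncol(scores))
reduced <- scores[, 1:k, drop = FALSE]
```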
I wanted to focus on intuitively what PCA does but how we will actually project these points onto these new
axes and rotate the whole thing is
Each of these principal components is going to be a weighted sum of all the attributes
So for example PC one is going to be some amount of attribute one
Added to some amount of attribute two now in this case because it sort of goes off at sort of 45 degrees
It's going to be about the same but you could imagine if your data was like this
It'll be mostly attribute one and a little bit of attribute two if it was like this
it'll be mostly attribute two and a little bit of attribute one
All right. Now, of course, for n-dimensional data, where we have many more dimensions than I can draw on the page,
the principle is exactly the same: some amount of attribute one, attribute two, attribute three, and so on all the way to the end
Right and that's going to project our points straight onto this line through that data
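Continuing the sketch, those weights live in the rotation (loadings) matrix that prcomp returns, and projecting a point really is just that weighted sum:

```r
# Weights (loadings) of each original attribute in PC one;
# for this roughly 45-degree example they are about equal
pca$rotation[, 1]

# Projecting the first point onto the new axes is the weighted sum,
# which matches the first row of pca$x (up to tiny numerical error)
as.numeric(d[1, ] %*% pca$rotation)
pca$x[1, ]
```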
So when we talk about minimizing the error you can imagine
Rotating this about the center of these points here like this
And as you do this these red lines are going to change in length
And it's going to settle on the very center line where these lengths are minimized
Right and as it happens that also maximizes the variance of these points here because of the fact that this mathematics is based around
eigenvectors and eigenvalues
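For the curious, that's why the same components drop out of the eigenvectors and eigenvalues of the covariance matrix; a quick check on the sketch data:

```r
# Eigen-decomposition of the covariance matrix of the standardized data
e <- eigen(cov(d))

e$values    # variances along each component, same as pca$sdev^2
e$vectors   # the component directions, same as pca$rotation
            # (possibly with some signs flipped)
```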
PC two is always going to come out orthogonal, or in this case at 90 degrees, to PC one. Now
this is true of however many dimensions you've got. Every single new axis that appears, or new vector,
a new principal component, is going to come out orthogonal to the ones before
until you run out of dimensions and you can't do it anymore
We've already reached the most we can fit in on this two-dimensional plane
We've got one here and we've got another one orthogonal to it
There are no other lines I can draw for that to be true
Right, but obviously if we had more attributes, that would be the case
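You can see that orthogonality in the sketch too: the rotation matrix times its own transpose comes out as the identity, to numerical precision.

```r
# Principal component directions are orthogonal unit vectors,
# so this product is (numerically) the identity matrix
round(t(pca$rotation) %*% pca$rotation, digits = 10)
```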
So the reason that it's so important to scale your data appropriately is that you're trying to find the direction for your data that
maximizes the variance. Now, if one of your dimensions is much, much bigger than the others, of course
that one is the one that's going to maximize the variance
If you've got salary that's between nought and
10,000 and all your others are between nought and 1
then your first principal component is going to be predominantly salary, because that's the most important thing as far as it
knows. This is why it's so important to standardize your data first
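Here's a small illustration of that in R, with made-up salary-like numbers; without scaling, PC one is essentially just the big column, and after scaling both columns contribute:

```r
set.seed(2)

# One attribute between 0 and 10,000, the other between 0 and 1
salary <- runif(200, min = 0, max = 10000)
other  <- runif(200, min = 0, max = 1)
raw    <- cbind(salary, other)

prcomp(raw)$rotation[, 1]                 # PC one is almost entirely salary
prcomp(raw, scale. = TRUE)$rotation[, 1]  # after scaling, both contribute
```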
We're going to continue to use our music data set for this video now for those of you
who've forgotten, this data set is a set of music files that are freely available online,
where we've got the metadata, the genres or titles, for different tracks, and then for those tracks
We've also calculated some features about the actual audio for example
temporal features, how loud they are,
how fast the music is, how upbeat it is, whether you could dance to it, this kind of thing
Apparently danceability is a measurable trait
These features have been generated by two different libraries, one's called librosa,
which is freely available online, and the other was Echo Nest,
which are the features at the core of Spotify and how it does its music recommender system and its playlists
So let's load the data set
So I'm going to read it in. It takes quite a long time to load
It would probably be faster if it wasn't in a CSV; you've got to remember, if your files are in CSV
you've got to actually parse them all and work out whether they're numerical or text, you know, for every cell. Okay, so we've got
13,000 instances or rows in our data and we've got
751 attributes or dimensions to our data. So these are going to include features from both librosa
and Echo Nest, and the other metadata of these tracks
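A rough sketch of that loading step in R (the file name here is just a placeholder, the video doesn't spell out the actual path):

```r
# Placeholder file name -- substitute the actual path to the music CSV
music <- read.csv("music.csv")

dim(music)   # should come out around 13,000 rows by 751 columns
```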
So we're going to select just the Echo Nest features for this part,
just because it's a little bit easier to have fewer dimensions to look at. This would work just as well
On all the other features as long as they're numeric
So we're going to select: echonest is equal to the music data frame, all of the rows, and just the Echo Nest columns,
which are 528 to the end, and then we're going to standardize all this data now
So we're going to center it around 0, a mean of 0 and a standard deviation of 1, using the scale function
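Put together, that selection and standardization might look like this (column positions as described in the video, assuming those columns are all numeric):

```r
# Echo Nest features are described as columns 528 through to the last column
echonest <- music[, 528:ncol(music)]

# Centre every column on a mean of 0 with a standard deviation of 1
echonest <- scale(echonest)
```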
now take a minute to