People need to learn to use standardized measures for things. So take me,
for example. When I drive anywhere, I drive in miles and I drive in miles per hour.
My fuel economy is measured in miles per gallon, but of course I don't pump fuel in gallons,
I pump it in litres.
And then when I run anywhere, so short distances, I run in kilometres and I run in kilometres per hour.
So I'm using two different systems there. And any short distances I'm measuring are going to be in metres, not feet, right?
So if I'm measuring, let's say,
around my house for painting, I'm going to measure in square metres so I know how much paint to buy. But then
if I'm selling a house, or I'm buying a house,
I'm going to be looking at the size of the house in square feet again. Why? Who knows. British people.
If I'm baking anything, it's going to be weight in grams or kilograms going into the recipe,
but if I'm weighing myself it's going to be in stones and
pounds. But of course a ton for me would be a metric tonne, not an imperial ton. And
as I said, I measure fuel in litres, and most of my liquids are measured in litres, except of course for beer and milk,
which are in pints. So this is the kind of problem
you're going to be dealing with when you're looking at data and you're trying to transform your data into a usable form.
Maybe the data is coming from different sources and
none of it goes together. You need standardized units and standardized scales so we can go on and analyze it.
Let's think back.
What we're doing is we're trying to prepare our data into the
densest, most clean format so that we can apply modelling or machine learning or some kind of statistical
test to work out what's going on and draw knowledge from our data. So this is going to be an iterative process.
We're going to be cleaning the data,
we're going to transform the data, and then we're going to reduce the data. And transforming data is what we're going to do today.
So let's imagine that you've cleaned your data. So we've got rid of as many missing values as possible,
hopefully all of them. We've deleted instances and attributes that just weren't going to work out for us.
Now what we're going to try and do is transform our data so that everything's on the same scale and
everything makes sense together. And if we're bringing datasets from different places,
we need to also make sure that the units are the same and everything makes sense.
There's no point in trying to use machine learning or clustering or any other mechanism
to draw knowledge from our data if our data is all wrong.
So today we're going to be looking at census data. Now census data is kind of a classic example of the kind of data you
might look at in data analysis. It's got lots of different kinds of attributes, things that are going to need cleaning up and transforming.
So we're back in R. We're going to read the census data using census <- read.csv.
So we've downloaded some census data that represents samples from the US population.
To begin with we're going to read that in, and you can see that we've got 32,000 observations and 15 attributes or variables.
So what are these attributes? Let's have a quick look at just a little bit of it so we can see the kind
of thing we're looking at. So we're going to say head of census, and that's just going to produce the first few rows
so we can kind of see the kind of data. So you can see we've got age,
we've got what working classification that person has, their educational level, a
numerical representation of that, whether they're married or not, this kind of thing.
So there's a lot of different kinds of data here. Some of it's going to be nominal.
So for example, this work class, state government, private employee, that's a nominal value.
We might have ordinal values or ratio values or interval values.
All right,
we're gonna have to delve in a little bit closer to find out what these are. Now
what we do to transform this data into a usable format for clustering or machine learning
is going to depend on exactly what the types of these columns are and what we want to do with them.
So let's look at just a couple of the attributes and see what we can do with them, right?
We're going to use a process called codification. The idea is that maybe things like random forests or
multi-layer perceptrons, you know, neural networks, aren't going to be very amenable to putting in text-based inputs,
and what we want to do is try and replace these attributes with a numerical score.
All right
So let's look at, just for example, work class and also, for example,
the educational level, so education. Now work class is the kind of class of worker that we're looking at here.
So for example a state worker, or someone in the private sector, or someone that worked in a school, or something like this. Now
this is a nominal value. That means there's no order to this data at all.
We can't say that someone in state is higher or lower than someone in private, and we also can't say that, let's say,
state is two times more or less than some other one. That makes no sense at all.
So what can we do? We can replace this with numbers.
So let's say we could replace private with zero and state with one and,
you know, self-employed with two and so on, right? And that's a perfectly reasonable thing to do, but it's still nominal data.
So what we can't do is then calculate a mean and
say, ah, the mean is halfway between private and public. That doesn't make any sense. Just because something has been replaced by a numerical score
doesn't mean that it actually represents something that we can quantify in that way, right? It's still nominal data.
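The codification step described above can be sketched in a few lines. The video works in R; this is a Python sketch using only the standard library, and the sample values and the variable names (workclass, codes, encoded) are made up for illustration. It assigns each category an arbitrary integer code, then shows that the mode still maps back to a real category, whereas the mean is just a by-product of whichever codes happened to be chosen.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample of the work class attribute (values are illustrative).
workclass = ["Private", "State-gov", "Private", "Self-emp", "Private", "State-gov"]

# Assign each distinct category an arbitrary integer code, in order of first appearance.
codes = {}
for value in workclass:
    codes.setdefault(value, len(codes))

encoded = [codes[value] for value in workclass]

# The mode is meaningful for nominal data: it is simply the most common category.
most_common_code, count = Counter(encoded).most_common(1)[0]
decode = {code: name for name, code in codes.items()}
most_common_name = decode[most_common_code]

# The mean can be computed, but it only reflects the arbitrary coding.
meaningless_mean = mean(encoded)

print(codes)
print(encoded)
print(most_common_name, count)
print(meaningless_mean)
```

Swapping the codes around (say, Private as 2 and Self-emp as 0) would change the mean but not the most common category, which is the point being made about nominal data.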
Okay, so the best advice I can give is feel free to codify your data into easy-to-read numbers,
but just bear in mind that you can calculate the mode, just like,
you know, the most common, but you can't calculate the median and you can't calculate the mean. Another example would be something like the
educational level. Now theoretically this is ordinal data, so we could say that someone with an undergraduate degree
is maybe slightly higher, in terms of the amount of time they spent in education, than someone with a high school diploma.
But we don't know exactly what the distance is.
What's the distance between, let's say, a high school diploma and a degree, and then a PhD,
and so on, an MD and things like this?
We can represent these
using numbers, and probably in order, right? So we could say that zero is no
education and one is sort of the end of primary school and two is the end of high school and so on and so forth.
But again, it's difficult to calculate distances between these things.
We don't know that high school is two times more than primary school and half of a degree or something like that.
That doesn't really make sense.
So again, you might be able to calculate a median on this or a mode, but you can't calculate an average.
You can't say the average level of education is halfway between high school and undergraduate. That doesn't make any sense either.
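For ordinal data the same idea can be sketched with an explicit ordering. This is a Python sketch, not the R used in the video, and the list of levels and the sample values are assumptions for illustration. It uses median_low from the standard library so that the result is always one of the observed levels rather than a point halfway between two of them.

```python
from statistics import median_low

# Hypothetical ordered scale; the gaps between levels are NOT assumed to be equal.
levels = ["None", "Primary", "High school", "Bachelors", "Doctorate"]
rank = {name: position for position, name in enumerate(levels)}

# Illustrative sample of education values.
education = ["High school", "Bachelors", "Primary", "High school", "Doctorate"]
ranks = sorted(rank[e] for e in education)

# median_low always returns an actual data point, so it maps back to a real level.
median_level = levels[median_low(ranks)]
print(median_level)
```

The ranks only encode order. Nothing here claims that a doctorate is "twice" a high school diploma, so a mean of the ranks is deliberately not computed.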
So for any kind of attribute that is nominal or possibly ordinal and is sort of represented using text,
we can codify this so that it's more amenable to things like decision trees, depending on the library you're using, right?
But you just have to be careful. All machine learning
algorithms will take any number you give them, and you just have to be careful that this makes sense to do.
So what you would do is you would go through your data and you'd begin to systematically replace appropriate attributes with numerical versions of themselves,
remembering all the time that they don't necessarily represent true numbers, you know, in a ratio or interval format.
So for any text-based value, we're going to start to replace these, possibly with numerical scores. What about the numerical values?
Well, they might be okay, but the issue is going to be one of scale.
You might find, for example, in this census data that one of the
dimensions or one of the attributes is much, much larger than another one. So for example, this data set has hours per week,
which is obviously going to be somewhere between nought and maybe 60 or 70 hours for someone that's got,
you know, a very strong work ethic, and
salary, right? Or salary or income or any other measure of, you know,
monetary gain. Now obviously hours per week is going to be in the tens and
salary could be into the tens of thousands, maybe even the hundreds of thousands.
Those scales are not even close to being the same. That means if you're doing clustering or machine learning on this kind of data,
you're going to be finding the salary is kind of overbearing everything, right?
So it's going to be very easy for your clustering to find differences in salary, and it's harder for it to spot differences in hours
because they're so small in comparison.
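A small worked example makes the scale problem concrete. This Python sketch uses two made-up people and the Euclidean distance that many clustering methods rely on; the numbers are purely illustrative. One person works three times as many hours as the other, yet almost all of the distance between them comes from a small difference in salary.

```python
import math

# Two hypothetical people: (hours per week, salary). Values are illustrative.
a = (20, 30000)
b = (60, 30500)

hours_gap = abs(a[0] - b[0])      # 40 hours: a big difference in working pattern
salary_gap = abs(a[1] - b[1])     # 500: a small difference in pay

# Euclidean distance between the two unscaled points.
distance = math.dist(a, b)

# Share of the squared distance that comes from salary alone.
salary_share = salary_gap**2 / (hours_gap**2 + salary_gap**2)
print(round(distance, 1), round(salary_share, 3))
```

On these numbers, over 99% of the squared distance is due to salary, which is why the attributes need to be brought onto a common scale first.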
Right. So we need to start to bring everything onto the same scale. The more attributes you have, which is another way of saying the
more dimensions you have to your data,
then the further everything is going to be spread around. If we can scale all of these values to between, let's say, around
0 and 1, then everything gets more tightly controlled in the middle,
and so it gets much easier to do
clustering or machine learning or any kind of analysis we want.
So let's look back at our data and see what we can do to try and scale some of this into the right range.
So we're going to look back at the head of our data again.
So our numerical values are things like the capital gain and the capital loss, which I guess is presumably how much money they've made or
lost that year, probably normalized on some scale,
and then things like the hours per week that they work and their salary, which in this case is greater than or less than
50,000. So let's have a quick look at the kind of range of values
we're looking at here, so we can see if scaling's even necessary.
Maybe we got lucky and the person did it before they sent us the data.
So we're going to apply a function across all the columns and we're going to calculate the range of the data.
So this is going to be apply on the census data, dimension 2,
so that's all of our columns, and we're going to use the range function for this. And this is going to tell us, okay,
so for example the age ranges from 17 to 90, the educational level from 1 to 16.
It gives you the range for things like nominal values as well, but they don't really make any sense.
I mean work class ranges from question mark to without pay, you know, it's meaningless. And then so for example capital gain
ranges from zero to nearly one hundred
thousand, and capital loss from zero
to four thousand, and finally the hours per week ranges from 1 to 99.
So you can see that the capital gain is many orders of magnitude larger in scale than the hours per week.
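The per-column range check can be sketched as follows. The video does this in R with apply over the columns; this is a plain Python equivalent, where the helper name column_ranges and the three sample rows are assumptions made for illustration (the minimum and maximum values are chosen to echo the ranges quoted above).

```python
# Hypothetical rows standing in for a few census columns (values are illustrative).
rows = [
    {"age": 39, "capital_gain": 2174, "hours_per_week": 40},
    {"age": 17, "capital_gain": 0, "hours_per_week": 1},
    {"age": 90, "capital_gain": 99999, "hours_per_week": 99},
]

def column_ranges(records):
    """Return {column: (min, max)} for every column in a list of dict rows."""
    columns = records[0].keys()
    return {c: (min(r[c] for r in records), max(r[c] for r in records)) for c in columns}

ranges = column_ranges(rows)
for name, (low, high) in ranges.items():
    print(name, low, high)
```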
We're going to need to try and scale this data. What we'll begin by doing, to make our lives a little bit easier,
is just focus on the numerical attributes, right, so we don't have to worry about the nominal values, which we've not codified yet.
We're going to select all the columns from the data where they are numeric. So that's this line here, and then down here.
So we're going to sapply, that applies over each of the fields,
is.numeric, and that's going to give us a
logical list that says true or false depending on whether those columns are numeric.
What we're doing here is selecting from this list any that are true and then finding their names.
So what are the names of the columns that are numeric?
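The column selection can be sketched like this. In the video it is done in R with sapply and is.numeric; the Python below is an analogous sketch, and the helper numeric_columns and the sample rows are hypothetical. It keeps only the columns whose values are all numbers.

```python
from numbers import Number

# Hypothetical rows mixing numeric and text attributes (values are illustrative).
rows = [
    {"age": 39, "workclass": "State-gov", "hours_per_week": 40},
    {"age": 50, "workclass": "Private", "hours_per_week": 13},
]

def numeric_columns(records):
    """Names of columns whose values are all numbers (booleans excluded)."""
    def is_numeric(value):
        return isinstance(value, Number) and not isinstance(value, bool)
    return [c for c in records[0] if all(is_numeric(r[c]) for r in records)]

numeric_names = numeric_columns(rows)

# Keep only the numeric columns in each row.
numeric_rows = [{c: r[c] for c in numeric_names} for r in rows]
print(numeric_names)
```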
So let's have a look at just the range of these attributes to make our life a little bit easier.
So I'm gonna run this line,
and so this is a simplified version of what I was just showing. You can see that capital gain is
massive compared to the hours per week,
for example.
Let's have a look at the standard deviation.
Recall that the standard deviation
is, roughly speaking, the typical distance from the mean, so it kind of gives us an idea of the spread of some data.
Like, is it very tight and everyone's roughly the same, or is it very spread out and there are huge
deviations? And the answer is there's pretty huge deviations. So the age has a standard deviation of 13,
so obviously
that means that most people are going to be kind of in the middle and, on average,
they're going to be 13 years younger or older. But you can see that things like capital gain have a
7,000 standard deviation, which is a huge amount. To give you some idea what we're aiming for,
it's very common to standardize this kind of data so that the standard deviation is 1, right? So
7,000 is much too big. Let's plot an example that gives you some idea of what the kind of problem is when we have these massive
ranges. So I'm going to plot here a graph of age vs. capital gain, right?
We know age goes between about one and a hundred and capital gain is much, much larger.
So if I run this, basically the figure makes no sense at all, because the capital gain ranges from zero to one hundred
thousand, and as a few people are earning right at the top of the scale, everything is sort of squished down at the bottom. We can't see anything
that's going on. There's no way of telling whether the capital gain of an individual is related to their age.
I mean it probably is, because retired people and people who are very young perhaps earn slightly less.
We can't really see that here because it's just too compressed, right?
We need to start trying to bring these things together so that we can perform better analysis.
What we're going to do is create a new data frame with just the numerical attributes,
so we want to focus on just those to make our life a little bit easier, and then we're going to write a normalize function to
move all our data to between 0 and 1, and we will do this per attribute.
So for example,
if you've got some data which goes between a minimum and a maximum
and we want to scale this data to between 0 and 1,
all we need to do is first of all take away the minimum, and that's going to move everything to be from 0
to max minus min, and then we're going to divide by this distance here.
So this is max minus min, and if we divide by this, everything is going to go from 0 to 1.
So that's exactly what we're doing in this function here.
We've got a function of x and it subtracts the minimum of x and then divides by the difference between the maximum and the minimum. All
right. So this is very standard. So I'm going to run this. R lets you write functions like this and then use them in
applications over data. So we're going to calculate a normalized census data set, which is we're going to apply
over dimension 2 this normalize function
we just wrote, and then now if we look at the range we'll see that our range is now
between 0 and 1 for all of our data, which is exactly what we want.
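The min-max normalization described above, (x - min) / (max - min), can be sketched per attribute as below. The video writes this function in R; this is a Python sketch with made-up values. The handling of a constant column (where max equals min) is an assumption added here to avoid dividing by zero, not something covered in the video.

```python
def normalize(values):
    """Min-max scale a list of numbers to the range [0, 1]."""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # Assumption: a constant column is mapped to all zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / span for v in values]

ages = [17, 35, 90]                      # illustrative values
capital_gain = [0, 5000, 99999]

norm_ages = normalize(ages)
norm_gain = normalize(capital_gain)
print(norm_ages, norm_gain)
```

After this step both attributes run from 0 to 1, whatever their original units were.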
Normalization is a perfectly good way of handling your data.
If everything is between 0 and 1 we have fewer problems with the scale of things being way off. Now
some statistical techniques, like PCA, that we're going to talk about in another video,
require standardized data. That's data that's centred around zero:
it has a mean of zero and a standard deviation of one. Now, we can standardize data pretty easily in the same way.
Actually, we don't need to write our own function for this. The scale function in R performs this for us.
So we're going to take the census data's numerical attributes and we're going to call the scale function, and that's going to take all
of the attributes and centre them around their mean, so that means the mean will become close to zero, and it's going to divide them
all by the standard deviation so their standard deviation becomes one.
So if we run that and then we have a look at the mean of this data,
so for example here, we calculate the mean, you can see that the mean values are very, very close to zero.
That's 10 to the minus 17 or something like that, very, very small. And if we look at the standard deviation,
similarly, they're all going to be 1. All right, so this is now standardized data.
This is a very good thing to do
if you want to use your data in some kind of machine learning algorithm or some kind of clustering.
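Standardization can be sketched in the same style. The video uses R's built-in scale function; the Python below is a hand-written sketch of the same idea, subtracting the mean and dividing by the standard deviation, using the sample standard deviation (n - 1 denominator) from the standard library. The function name standardize and the sample values are illustrative.

```python
from statistics import mean, stdev

def standardize(values):
    """Centre on the mean and divide by the sample standard deviation (n - 1)."""
    centre = mean(values)
    spread = stdev(values)
    return [(v - centre) / spread for v in values]

hours = [40, 13, 40, 45, 60, 20]         # illustrative values
z_hours = standardize(hours)

# The mean should be zero up to floating-point error, and the spread should be 1.
print(mean(z_hours), stdev(z_hours))
```

As in the video, the mean may not print as exactly zero. A tiny value such as something times 10 to the minus 17 is floating-point rounding, not a sign that the data is off-centre.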
Let's imagine now that we want to join some data sets together.
So we've standardized data, everything's between 0 and 1 or it's centred around 0 with a standard deviation of 1, and we've codified some attributes.
What happens if we get other data from other sources? You can imagine that census data from the US might be a bit useful,
but maybe we want census data from Spain or from the UK or from another country.
Can we join all of these together to get a bigger, more useful data set?
Now the thing to think about when you're doing this is just to make sure that everything makes sense.
Right, are the scales the same? Are they all normalized or are none of them normalized?
Because otherwise, what you're going to be doing is you're going to be adding, you know,
pay between nought and a hundred thousand to something between nought and one. Nothing makes any sense anymore.
You're gonna wreck your data. So let's have a look at this on the census data set.
We have some Spanish census data in a very similar format to our census data from the United States.
Let's have a quick look, so I'm going to read the CSV file of Spain data.
Let's remind ourselves of the columns that we had in our census data from the United States. These are the numerical columns.
So we have age,
education number, capital gain, capital loss, this kind of thing.
Let's look at the Spanish data set to see if we can just join the two together.
So I'm gonna run head of Spain, that's going to give us the first few rows, and you
can see that some of the stuff in there is as it was before, so things like what their level of education is,
whether they work in the private sector or the public sector. Right, we're going to need to remove these things to create just the numerical
attributes. And the other problem is, if you look carefully,
you'll see that the capital gain in the Spanish dataset is in euros, not in dollars. Now, that's a huge problem.
They're not massively different, obviously,
they're on the same order of magnitude,
but we don't want to be jamming capital gain in euros next to dollars because those two scales are not the same, right?
So what we need to do first is scale this data using some kind of exchange rate.
So here what we're going to do is we're going to create a new column in Spain.
So given the Spain data frame, we're going to say the Spain capital gain is equal to the
euro capital gain times 1.13, which is the exchange rate we're going to use. Now
it's quite important in this kind of situation not just to look up the exchange rate online.
You've got to consider that this might have been collected a while ago.
What was the exchange rate when this data was collected? Right, these are things you're going to have to think about.
So let's run that line and let's do the same thing for the capital loss.
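The unit conversion can be sketched as below. This is a Python sketch rather than the R used in the video. The 1.13 rate is the one quoted above, while the column names (capital_gain_eur and so on) and the sample rows are hypothetical. The converted values go into new columns so that the original euro amounts stay available for checking.

```python
# Assumed rate from the video: 1 euro = 1.13 US dollars. Check the rate for the
# period when the data was actually collected before reusing this number.
EUR_TO_USD = 1.13

# Hypothetical Spanish rows with amounts in euros (column names are illustrative).
spain = [
    {"age": 41, "capital_gain_eur": 1000.0, "capital_loss_eur": 0.0},
    {"age": 29, "capital_gain_eur": 0.0, "capital_loss_eur": 200.0},
]

for row in spain:
    # Add dollar columns alongside the originals so the conversion stays traceable.
    row["capital_gain"] = row["capital_gain_eur"] * EUR_TO_USD
    row["capital_loss"] = row["capital_loss_eur"] * EUR_TO_USD

print(spain[0]["capital_gain"], spain[1]["capital_loss"])
```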
We're going to keep just the numerical attributes of our census data and of the Spanish data,
and we're also going to add another column, that is, what country they come from.
Otherwise
we're not going to know. So we're going to use the cbind function to combine the census data's numerical attributes and
the native country, which in this case will be the United States.
We're going to do the exact same thing for the Spain data, which will be basically exactly the same, except obviously
we're also going to have Spain as the native country, and
then we're going to use the rbind function to just join those two tables together.
Now that will only work if those two datasets have the exact same attributes.
New census is not found.
What did I do wrong? So I had a typo. So let's join these two together using rbind.
There we go. And so our united dataset now has the combined observations for the United States and Spain.
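The two joining steps can be sketched as follows. The video uses R's cbind to add the country column and rbind to stack the tables; this Python sketch mimics both with plain lists of dictionaries. The helper names with_country and stack, and the sample rows, are hypothetical. The explicit column check reflects the point that stacking only makes sense when both tables have exactly the same attributes.

```python
def with_country(records, country):
    """Copy each row and add a native_country column (the column-binding step)."""
    return [{**r, "native_country": country} for r in records]

def stack(first, second):
    """Concatenate two tables, refusing to do so if their columns differ."""
    if set(first[0]) != set(second[0]):
        raise ValueError("tables must have exactly the same attributes")
    return first + second

us = [{"age": 39, "capital_gain": 2174.0}]        # illustrative rows
spain = [{"age": 41, "capital_gain": 1130.0}]

united = stack(with_country(us, "United-States"), with_country(spain, "Spain"))
print(len(united), united[1]["native_country"])
```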
What you wouldn't want to do is just join them together and just leave it at that. Right, you want to perhaps have a little
look at some plots to make sure that the distributions of the data you've just joined together make sense. For example,
right, if the United States data has a nice broad distribution of different ages,
we want to make sure that the Spanish data has that same distribution.
Otherwise, you're kind of going to skew your data set.
So for example,
let's have a look at roughly whether the levels of capital gain are
approximately the same for both the United States and the Spanish data set. So I'm gonna use ggplot for this.
We're gonna plot a bar chart where we've colour-coded United States and Spain, and you can see that broadly speaking
there's a lot kind of around zero or less than 50k, and then there's a few a little bit above.
All right, so that looks broadly speaking the same distribution. I'm fairly happy with that.
This is gonna be a judgement call when you get your own data.
So I'll clear the screen and then let's have a look at the next plot.
So the next plot is going to be capital loss versus the native country. Let's make sure those distributions are the same.
So it's plotting there, and broadly speaking again,
yes,
the majority are down the bottom and then there's a few United States ones and a couple of Spanish ones up at the top as
well. Again, it's not a disaster. That's probably OK. Finally, let's have a look at ages by native country.
So if we plot this we can see two very, very similar distributions.
You can see that it's essentially a bell curve, maybe slightly skewed towards older participants,
for the United States, and very, very similar for Spain. This is okay. If we
hypothesized that capital gain, capital loss and salary were something to do with your age,
then it would make sense to have the two data sets that you're joining together have very similar distributions in this regard.
So let's look at one more data set, from Denmark. All right, so it's the same thing, same format.
We're gonna read the CSV and we're going to have a look at just the top few rows to make sure it's in the same
format, so that's using the head function, and you can see actually we've already removed the
nominal and other text attributes from here and we've just got the
numerical ones, and actually also
capital gain and capital loss are already in dollars in this data set, so we don't have to perform a conversion. So we can use
rbind to put these two things together and now we just need to check the distributions are the same.
So again,
we're going to plot the age
against the native country and see if these have the same
distributions. And actually you can see this isn't looking too good. The United States and the Spanish datasets have very similar
distributions, but the participants, or the people who have been polled, from Denmark are much, much older on average, right?
This could have an effect on things like capital gain, so I wouldn't necessarily feel comfortable
just joining this data set in without thinking about it a little bit more closely.
All right, so whenever you're joining data sets like this, taking data from different sources,
think carefully
to make sure that it's fair and what you're doing is a reasonable
concatenation of datasets.
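The distribution check is done visually with ggplot in the video. As a rough numerical complement, the Python sketch below summarizes age per country and flags any source whose median age is far from a reference source. All of the rows are made up, and the 10-year threshold and the choice of the United States as the reference are assumptions for illustration. As the video says, whether a difference matters is a judgement call.

```python
from collections import defaultdict
from statistics import mean, median

# Illustrative combined rows; the Danish ages are deliberately older.
united = [
    {"native_country": "United-States", "age": 30},
    {"native_country": "United-States", "age": 42},
    {"native_country": "United-States", "age": 55},
    {"native_country": "Spain", "age": 31},
    {"native_country": "Spain", "age": 44},
    {"native_country": "Spain", "age": 53},
    {"native_country": "Denmark", "age": 58},
    {"native_country": "Denmark", "age": 66},
    {"native_country": "Denmark", "age": 71},
]

# Group the ages by the country each row came from.
ages_by_country = defaultdict(list)
for row in united:
    ages_by_country[row["native_country"]].append(row["age"])

# Per-country (mean, median) summary of age.
summary = {c: (mean(a), median(a)) for c, a in ages_by_country.items()}

# Hypothetical rule of thumb: flag sources whose median age is more than
# 10 years away from the reference source. The threshold is a judgement call.
reference_median = summary["United-States"][1]
flagged = [c for c, (_, med) in summary.items() if abs(med - reference_median) > 10]
print(summary, flagged)
```

A summary like this does not replace looking at the plots, since two sources can share a median and still have very different shapes.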
And actually these are the features that power Spotify's recommender system and numerous others. So we've got things like acousticness,
how acoustic does it sound, from a zero to a one. We've got instrumentalness,
I'm not convinced that's a word. Speechiness, to what extent is it speech or not speech, and then things like ten