Variance, Standard Deviation, Coefficient of Variation

There are many ways to quantify variability. However, we will focus on the most common
ones: variance, standard deviation, and coefficient of variation.
In the field of statistics, we will typically use different formulas when working with population
data and sample data.
Let’s think about this for a bit.
When you have the whole population, each data point is known so you are 100% sure of the
measures you are calculating.
When you take a sample of this population and you compute a sample statistic, it is
interpreted as an approximation of the population parameter.
Moreover, if you extract 10 different samples from the same population, you will most likely get 10
different values of that statistic.
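To see this effect in action, here is a minimal Python sketch. The population (the integers 1 to 1000), the sample size of 30 and the fixed seed are all assumptions made for illustration; none of them come from the lecture.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: the integers 1..1000 (an assumption for illustration).
population = list(range(1, 1001))
population_mean = statistics.mean(population)

# Draw 10 samples of 30 observations each and record every sample mean.
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(10)
]

print("population mean:", population_mean)
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i:2d} mean: {m:.2f}")
```

Each sample mean should land near the population mean without matching it exactly, which is the sense in which a sample statistic only approximates the population parameter.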
Statisticians have solved the problem by adjusting the algebraic formulas for many statistics
to reflect this issue.
Therefore, we will explore both population and sample formulas, as they are both used.
You must be asking yourself why there is just one formula each for the mean, median and mode.
Well, actually, the sample mean is the average of the sample data points, while the population
mean is the average of the population data points.
Technically there are two different formulas, but they are computed in the same way.
Okay, now.
After this short clarification, it’s time to get onto variance.
Variance measures the dispersion of a set of data points around their mean value.
Population variance, denoted by sigma squared, is equal to the sum of squared differences
between the observed values and the population mean, divided by the total number of observations.
Sample variance, on the other hand, is denoted by s squared and is equal to the sum of squared
differences between observed sample values and the sample mean, divided by the number
of sample observations minus 1.
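Written out, the two definitions look like this. The symbols N (population size), n (sample size), mu (population mean) and x-bar (sample mean) are the usual textbook notation and are assumed here, since the lecture only names sigma squared and s squared.

```latex
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2
\qquad\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
```

The numerators have the same form; the only structural difference is the denominator, N for a population and n minus 1 for a sample.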
Alright.
*** When you are getting acquainted with statistics,
it is hard to grasp everything right away.
Therefore, let’s stop for a second to examine the formula for the population and try to
clarify its meaning.
The main part of the formula is its numerator, so that’s what we want to comprehend.
The sum of the squared differences between the observations and the mean.
Hmm… so, the closer a number is to the mean, the lower the result we will obtain, right?
And the further away from the mean it lies, the larger this difference.
Easy.
But why do we square the differences?
Squaring the differences has two main purposes.
First, by squaring the numbers, we always get non-negative values.
Without going too deep into the mathematics of it, it is intuitive that dispersion cannot
be negative.
Dispersion is about distance and distance cannot be negative.
If, on the other hand, we calculated the differences and did not square them, we
would obtain both positive and negative values that, when summed, would cancel out, leaving
us with no information about the dispersion.
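The cancellation is easy to check with a short Python sketch. The five values below are made up for illustration and are not taken from the lecture.

```python
# Illustrative values only (an assumption, not the lecture's data).
data = [2, 4, 6, 8, 10]
mean = sum(data) / len(data)

# Signed differences from the mean: some negative, some positive.
deviations = [x - mean for x in data]

# Squared differences: every term is non-negative.
squared_deviations = [d ** 2 for d in deviations]

print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared_deviations))
```

The raw deviations add up to zero no matter how spread out the data are, while the squared ones keep the information about the spread.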
Second, squaring amplifies the effect of large differences.
For example, if the mean is 0 and you have an observation of 100, the squared difference
is 10,000!
Alright, enough dry theory.
It is time for a practical example.
We have a population of five observations – 1, 2, 3, 4 and 5.
Let’s find its variance.
We start by calculating the mean: 1+2+3+4+5 divided by 5 equals 3.
Then we apply the formula we just saw: 1 minus 3 squared, plus, 2 minus 3 squared, plus,
3 minus 3, squared, plus, 4 minus 3, squared, plus, 5 minus 3, squared.
All of these components have to be divided by 5.
When we do the math, we get 2.
So, the population variance of the data set is 2.
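The same calculation can be sketched in Python, first step by step and then cross-checked against `statistics.pvariance` from the standard library. The variable names are just illustrative choices.

```python
import statistics

population = [1, 2, 3, 4, 5]

# Step 1: the mean of the population.
mean = sum(population) / len(population)

# Step 2: squared differences between each observation and the mean.
squared_diffs = [(x - mean) ** 2 for x in population]

# Step 3: divide the sum by the number of observations.
population_variance = sum(squared_diffs) / len(population)

print("mean:", mean)
print("population variance:", population_variance)

# Cross-check with the standard library's population variance.
print("statistics.pvariance:", statistics.pvariance(population))
```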
But what about the sample variance?
This would only be suitable if we were told that these five observations were a sample
drawn from a population.
So, let’s imagine that’s the case.
The sample mean is once again 3.
The numerator is the same, but the denominator is going to be 4, instead of 5, giving us
a sample variance of 2.5.
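As a quick sketch, here is the sample version in Python. It reuses the same five numbers, now treated as a sample, and compares the manual result with `statistics.variance`, which divides by n minus 1.

```python
import statistics

sample = [1, 2, 3, 4, 5]
n = len(sample)

sample_mean = sum(sample) / n

# Same numerator as in the population case.
sum_of_squares = sum((x - sample_mean) ** 2 for x in sample)

# The only change: divide by n - 1 instead of n.
sample_variance = sum_of_squares / (n - 1)

print("numerator:", sum_of_squares)
print("divided by n:", sum_of_squares / n)
print("divided by n - 1:", sample_variance)

# Cross-check with the standard library's sample variance.
print("statistics.variance:", statistics.variance(sample))
```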
To conclude the variance topic, we should interpret the result.
Why is the sample variance bigger than the population variance?
In the first case, we knew the population, that is, we had all the data and we calculated
the variance.
In the second case, we were told that 1, 2, 3, 4 and 5 was a sample, drawn from a bigger
population.
Imagine the population of this sample were these 9 numbers: 1, 1, 1, 2, 3, 4, 5, 5 and
5.
Clearly, the numbers are the same, but there is a concentration around the two extremes
of the data set – 1 and 5.
The variance of this population is about 2.89.
So, our sample variance has rightfully corrected upwards in order to reflect the higher potential
variability.
This is the reason why there are different formulas for sample and population data.
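If you want to check this comparison yourself, a small Python sketch is enough. It uses the five-number sample and the imagined nine-number population from the example, and prints the uncorrected figure, the corrected figure and the population variance side by side.

```python
import statistics

sample = [1, 2, 3, 4, 5]
population = [1, 1, 1, 2, 3, 4, 5, 5, 5]

# Treating the sample as if it were the whole population (divide by n).
uncorrected = statistics.pvariance(sample)

# Sample variance (divide by n - 1).
corrected = statistics.variance(sample)

# Variance of the imagined population (divide by N).
true_variance = statistics.pvariance(population)

print("uncorrected sample figure:", uncorrected)
print("corrected sample variance:", corrected)
print("population variance:", round(true_variance, 2))
```

In this example the corrected figure lands closer to the population value than the uncorrected one does. This is a single illustration, not a proof that the correction helps for every possible sample.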
*** While variance is a common measure of data
dispersion, in most cases the figure you will obtain is pretty large and hard to compare
as the unit of measurement is squared.
The easy fix is to calculate its square root and obtain a statistic known as standard deviation.
In most analyses you perform, standard deviation will be much more meaningful than variance.
As we saw in the previous lecture, there are different measures for the population and
sample variance.
Consequently, there are also population and sample standard deviations.
The formulas are: the square root of the population variance and square root of the sample variance
respectively.
I believe there is no need for an example of the calculation, right?
If you have a calculator in your hands, you’ll be able to do the job.
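For anyone who would rather let a script do the job, here is a minimal Python sketch. It takes the square roots of the two variances of the numbers 1 to 5, then cross-checks them against `statistics.pstdev` and `statistics.stdev`.

```python
import math
import statistics

data = [1, 2, 3, 4, 5]

# Square root of the population variance (variance divided by n).
population_sd = math.sqrt(statistics.pvariance(data))

# Square root of the sample variance (variance divided by n - 1).
sample_sd = math.sqrt(statistics.variance(data))

print("population standard deviation:", round(population_sd, 4))
print("sample standard deviation:", round(sample_sd, 4))

# The standard library also offers both measures directly.
print("statistics.pstdev:", round(statistics.pstdev(data), 4))
print("statistics.stdev:", round(statistics.stdev(data), 4))
```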
Alright.
The other measure we still have to introduce is the coefficient of variation.
It is equal to the standard deviation, divided by the mean.
Another name for the term is relative standard deviation.
This is an easy way to remember its formula – it is simply the standard deviation relative
to the mean.
As you probably guessed, there are population and sample formulas once again.
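A possible way to express this in Python is sketched below. The helper name `coefficient_of_variation` and its `sample` flag are hypothetical and chosen for illustration; they do not come from the lecture or from any library.

```python
import statistics


def coefficient_of_variation(data, sample=True):
    """Standard deviation divided by the mean (hypothetical helper)."""
    mean = statistics.mean(data)
    if mean == 0:
        # The ratio is undefined when the mean is zero.
        raise ValueError("coefficient of variation is undefined for a zero mean")
    # Pick the sample or the population standard deviation.
    sd = statistics.stdev(data) if sample else statistics.pstdev(data)
    return sd / mean


data = [1, 2, 3, 4, 5]
print("sample CV:", round(coefficient_of_variation(data, sample=True), 4))
print("population CV:", round(coefficient_of_variation(data, sample=False), 4))
```

The guard against a zero mean is a design choice of this sketch: dividing by a mean of zero has no meaningful interpretation, so the helper refuses rather than returning a misleading number.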
So, standard deviation is the most common measure of variability for a single data set.
But why do we need yet another measure such as the coefficient of variation?
Well, comparing the standard deviations of two data sets measured in different units or on
different scales can be misleading, but comparing their coefficients of variation is not.
An old saying goes: “Tell me, I’ll forget.
Show me, I’ll remember.
Involve me, I’ll understand.”
To make sure you remember, here’s an example of a comparison between standard deviations.
Let’s take the prices of pizza at 10 different places in New York.
They range from 1 to 11 dollars.
Now, imagine that you only have Mexican pesos and to you the prices look more like 18.81
pesos to 206.91 pesos, given the exchange rate of 18.81 pesos for one dollar.
Let’s combine our knowledge so far and find the standard deviations and coefficients of
variation of these two data sets.
First, we have to see if this is a sample or a population.
Are there only 10 restaurants in New York?
Of course not; this is obviously a sample drawn from all the restaurants in the city.
Then we have to use the formulas for sample measures of variability.
Second, we have to find the mean.
The mean in dollars is equal to 5.5 and the mean in pesos to 103.46.
The third step of the process is finding the sample variance.
Following the formula that we showed earlier, we can obtain 10.72 dollars squared and 3793.69
pesos squared.
The respective sample standard deviations are 3.27 dollars and 61.59 pesos.
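These steps can be sketched in Python. The transcript does not list the individual prices, so the ten dollar values below are hypothetical: they were picked to match the stated range (1 to 11 dollars) and mean (5.5) and should not be read as the lecture's actual data. The `summarize` helper is also just an illustrative name.

```python
import statistics

RATE = 18.81  # pesos per dollar, as given in the lecture

# Hypothetical prices (an assumption): ten values from 1 to 11 with mean 5.5.
dollars = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]
pesos = [price * RATE for price in dollars]


def summarize(prices):
    """Sample mean, sample variance and sample standard deviation."""
    return {
        "mean": statistics.mean(prices),
        "variance": statistics.variance(prices),  # divides by n - 1
        "sd": statistics.stdev(prices),           # square root of the above
    }


usd = summarize(dollars)
mxn = summarize(pesos)

for label, stats in (("dollars", usd), ("pesos", mxn)):
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

Converting the prices multiplies the mean and the standard deviation by the exchange rate and the variance by the exchange rate squared, which is why the peso figures look so much larger.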
Let’s make a couple of observations.
First, variance gives results in squared units, while standard deviation gives them in the original units.
This is the main reason why professionals prefer to use standard deviation as the main
measure of variability.
It is directly interpretable.
Squared dollars means nothing even in the field of statistics.
Second, we got standard deviations of 3.27 and 61.59 for the same pizza at the same 10
restaurants in New York City.
Seems wrong, right?
Don’t worry.
It is time to use our last tool – the coefficient of variation.
Dividing the standard deviations by the respective means, we get the two coefficients of variation.
The result is the same – 0.60.
Notice that it is not dollars, pesos, dollars squared or pesos squared.
It is just 0.60.
This shows us the great advantage that the coefficient of variation gives us.
Now, we can confidently say that the two data sets have the same relative variability, which was
what we expected beforehand.
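A short Python sketch shows why the two coefficients agree. The ten dollar prices are hypothetical stand-ins chosen to match the stated range and mean, since the transcript does not list the actual values, and `sample_cv` is an illustrative helper name.

```python
import statistics

RATE = 18.81  # pesos per dollar, as given in the lecture

# Hypothetical prices (an assumption): ten values from 1 to 11 with mean 5.5.
dollars = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11]
pesos = [price * RATE for price in dollars]


def sample_cv(data):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(data) / statistics.mean(data)


cv_usd = sample_cv(dollars)
cv_mxn = sample_cv(pesos)

print("CV in dollars:", round(cv_usd, 2))
print("CV in pesos:", round(cv_mxn, 2))
```

The exchange rate multiplies both the standard deviation and the mean, so it cancels in the ratio. That is why the coefficient of variation carries no unit and does not change when the currency does.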
Let’s recap what we have learned so far.
There are three main measures of variability – variance, standard deviation and coefficient
of variation.
Each of them has different strengths and applications.
You should feel confident using all of them as we are getting closer to more complex statistical
topics.
Thanks for watching!