Let's review a little bit of everything we learned so far
and hopefully it'll make everything fit together
a little bit better.
Then we'll do a bunch of calculations with real numbers
and I think it'll really hit the point home.
So, first of all if we're dealing with a-- let me
actually write down, let me make some columns.
So if we're dealing with-- let's see, we could call it the
concept and then we'll call it whether we're dealing with
a population or a sample.
So the first statistical concept we came up with was the
notion of the mean or the central tendency and we learned
that that was one way to measure the average or central
tendency of a data set.
The other ways were the median and the mode.
But the mean tends to show up a lot more, especially when we
start talking about variances and, as we'll do in this video,
the standard deviation.
But the mean of a population we learned-- we use the greek
letter Mu-- is equal to the sum of each of the data points
in the population.
That's an i.
Let me make sure it looks like an i.
So you're going to sum up each of those data points.
You're going to start with the first one and you're going
to go to the Nth one.
We're assuming that there are N data points in the population.
And then you divide by the total number that you have.
And this is like the average that you're used to taking
before you learned any of the statistics stuff.
You add up all the data points and you divide by
the number there are.
The sample is the same thing.
We just use a slightly different terminology.
The mean of a sample-- and I'll do it in a different
color-- just write it as x with a line on top.
And that's equal to the sum of all the data
points in the sample.
So each of the xi in the sample.
But we're assuming the sample is something
less than the population.
So you start with the first one still.
And then you go to the lower case n where we assume that
lowercase n is less than the big N.
If this was the same thing then we're actually taking the
average or we're taking the mean of the entire population.
And then you divide by the number of data
points you added.
You get to n.
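Written out in notation, the two means just described might look like the following. This is a sketch that takes capital N as the population size and lowercase n as the sample size, matching the "big N" and "lowercase n" distinction above:

```latex
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\qquad\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad n < N
```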
Then we said OK, how far-- this gives us the central tendency.
It's one measure of the central tendency.
But what if we wanted to know how good of an indicator this
is for the population or for the sample?
Or, on average, how far are the data points from this mean?
And that's where we came up with the concept of variance.
And I'll arbitrarily switch colors again.
Variance.
And in a population the variable or the notation for
variance is the sigma squared.
This means variance.
And that is equal to-- you take each of the data points.
You find the difference between that and the mean that
you calculate up there.
You square it so you get the squared difference.
And then you essentially take the average of all of these.
You take the average of all of these squared distances.
So that's-- so you take the sum from i is equal to 1 to
N and you divide it by N.
That's the variance.
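In notation, the population variance just described could be written as below. This is a sketch, again assuming capital N for the number of data points in the population:

```latex
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
```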
And then the variance of a sample-- and this was a
little bit more interesting and we talked a little bit
about it in the last video.
You actually want to provide a-- you want to estimate the
variance of the population when you're taking the
variance of a sample.
And in order to provide an unbiased estimate you do
something very similar to here but you end up
dividing by n minus 1.
So let me write that down.
So the variance of a population-- I'm sorry, the
variance of a sample, or sample variance, or unbiased sample
variance, since that's why we're going to divide by n minus 1.
That's denoted by s squared.
What you do is you take the difference between each of the
data points in the sample minus the sample mean.
We assume that we don't know the population mean.
Maybe we did.
If we knew the population mean we actually wouldn't have to do
the unbiased thing that we're going to do here in
the denominator.
But when you have a sample the only way to kind of figure out
the population mean is to estimate it with sample mean.
So we assume that we only have the sample mean.
And you're going to square those and then you're going to
sum them up from i is equal to 1 to i is equal to n because
you have n data points.
And if you want an unbiased estimator you divide
by n minus 1.
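The unbiased sample variance described here can be sketched in notation like this, with lowercase n as the number of data points in the sample and x-bar as the sample mean:

```latex
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```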
And we talked a little bit before about why you want this to be
an n minus 1 instead of an n.
And actually in a couple of videos I'll actually
prove this to you.
One, I'll prove it maybe experimentally using Excel and
then I'll-- which wouldn't be a proof, it'll just give you a
little bit of intuition-- and then I'll actually prove
it a little bit more formally later on.
But you don't have to worry about it right now.
The next thing we'll learn is something that you've probably
heard a lot of, especially sometimes in class, teachers
talk about the standard deviation of a test or-- it's
actually probably one of the most used words in statistics.
I think a lot of people unfortunately maybe overuse it or
maybe use it without fully appreciating everything
that it involves.
But the goal is that we'll eventually, hopefully, appreciate
all that it involves soon.
But the standard deviation-- and once you know variance it's
actually quite straightforward.
It's the square root of the variance.
So the standard deviation of a population is written as sigma
which is equal to the square root of the variance.
And now I think you understand why a variance is written
as sigma squared.
And that is equal to just the square root of all that.
It's equal to the square root-- I'll probably run out of
space-- of all of that.
So the sum-- I won't write at the top or the bottom, that
makes it messy-- of xi minus mu squared, everything over N.
And then if you wanted the standard deviation of a
sample-- and it actually gets a little bit interesting because
the standard deviation of a sample, s, which is equal to the
square root of the variance of a sample-- it actually turns
out that s is not an unbiased estimator for sigma--
and I don't want to get too technical right now--
even though s squared is actually a very good estimate of sigma squared.
The expected value of s squared is going to be sigma squared.
And I'll go into more depth on expected values in the future.
But it turns out that the expected value of s is not quite
the same as sigma.
But you don't have to worry about it for now.
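For reference, the two standard deviations can be sketched in notation as the square roots of the two variances. Here s is assumed as the symbol for the sample standard deviation, following from s squared for the sample variance:

```latex
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
\qquad\qquad
s = \sqrt{s^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
```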
So why even talk about the standard deviation?
Well, one, the units work out a little better.
Let's say all of our data points were measured
in meters, right?
If we were taking a bunch of measurements of length then
the units of the variance would be meters squared.
Right?
Because we're taking meters minus meters.
This would be a meter.
Then you're squaring.
You're getting meters squared.
And that's kind of a strange concept if you say, you know, the
average dispersion from the center is in meters squared.
Well first, when you take the square root of it you get
this-- you get something that's again in meters.
So you're kind of saying, oh well the standard deviation
is x or y meters.
And then we'll learn in a little bit that if you can actually
model your data as a bell curve, or if you assume that your data
has the distribution of a bell curve, then this tells you some
interesting things about the probability of
finding a data point within one or two standard deviations
of the mean.
But anyway, I don't want to get too technical right now.
Let's just calculate a bunch.
Let's calculate.
Let's see, if I had numbers 1, 2, 3, 8, and 7.
And let's say that this is a population.
So what would its mean be?
So I have 1 plus 2 plus 3.
So it's 3 plus 3 is 6.
6 plus 8 is 14.
14 plus 7 is 21.
So the mean of this population-- you sum up
all the data points.
You get 21 divided by the total number of data
points, 1, 2, 3, 4, 5.
21 divided by 5 which is equal to what?
4.2.
Fair enough.
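The same arithmetic can be checked with a few lines of Python. This is a minimal sketch; the variable names `data`, `total`, and `mu` are just illustrative choices:

```python
# The five numbers from the example, treated as the whole population.
data = [1, 2, 3, 8, 7]

total = sum(data)        # 1 + 2 + 3 + 8 + 7
mu = total / len(data)   # divide by the number of data points

print(total, mu)  # prints: 21 4.2
```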
Now we want to figure out the variance.
And we're assuming that this is the entire population.
So the variance of this population is going to be equal
to the sum of the squared differences of each of
these numbers from 4.2.
I'm going to have to get my calculator out.
So it's going to be 1 minus 4.2 squared plus 2 minus 4.2
squared plus 3 minus 4.2 squared plus 8 minus
4.2 squared plus 7 minus 4.2 squared.
And it's going to be all of that-- I know it looks a little
bit funny-- divided by the number of data points we
have-- divided by 5.
So let me take the calculator out.
All right.
Here we go.
Actually maybe I should have used the graphing
calculator that I have.
Let me see if I can get this thing-- if I could get this.
There you go.
Yeah, I think the graphing one will be better because
I can see everything that I'm writing.
OK, so let me clear this.
So I want to take 1 minus 4.2 squared plus 2 minus 4.2
squared plus 3 minus 4.2 squared plus 8 minus 4.2
squared, where I'm just taking the sum of the squared
distances from the mean, one more, plus
7 minus 4.2 squared.
So that's the sum.
The sum is 38.8.
So the numerator is going to be equal to 38.8, and we divide that by 5.
So this is the sum of the squared distances, right?
Each of these-- just so you can relate to the formula-- each
of that is xi minus the mean squared.
And so if we take the sum of all of them-- this numerator is
the sum of each of the xi minus the mean squared from
i equals 1 to n.
And that ended up to be 38.8.
And I just calculated it like that.
I just took each of the data points minus the mean,
squared, added them all up, and I got 38.8.
And I went and divided by n which is 5.
So this n up here is actually also 5.
Right?
And so 38.8 divided by 5 is 7.76.
So the variance-- let me scroll down a little bit-- the
variance is equal to 7.76.
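Here is a small Python sketch of the calculator steps above, computing each squared distance from the mean and then averaging. The names `squared_diffs`, `sum_sq`, and `pop_var` are made up for illustration:

```python
# The five numbers from the example, treated as the whole population.
data = [1, 2, 3, 8, 7]
N = len(data)
mu = sum(data) / N  # population mean, 4.2

# Squared distance of each data point from the mean.
squared_diffs = [(x - mu) ** 2 for x in data]

sum_sq = sum(squared_diffs)  # the numerator
pop_var = sum_sq / N         # population variance: divide by N

# Rounded to hide floating-point noise.
print(round(sum_sq, 2), round(pop_var, 2))  # prints: 38.8 7.76
```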
Now if this was a sample of a larger distribution, if this
was a sample-- if the 1, 2, 3, 8, and 7, weren't the
population-- if it was a sample from a larger population,
instead of dividing by 5 we would have divided by 4.
And we would have gotten the variance as 38.8 divided by n
minus 1, which is divided by 4.
So then we would have gotten the variance-- we would have
gotten the sample variance 9.7 if you divided by n
minus 1 instead of n.
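The difference between dividing by n and by n minus 1 can also be seen with Python's standard `statistics` module, where `pvariance` divides by the number of data points and `variance` divides by one less. This is a sketch, and the variable names are illustrative:

```python
import statistics

data = [1, 2, 3, 8, 7]

# Treating the data as the whole population: divide by 5.
pop_var = statistics.pvariance(data)

# Treating the data as a sample: divide by 5 - 1 = 4.
sample_var = statistics.variance(data)

print(round(pop_var, 2), round(sample_var, 2))
```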
But anyway, don't worry about that right now.
That's just a change of n.
But once you have the variance, it's very easy to figure
out the standard deviation.
You just take the square root of it.
The square root of 7.76-- 2.78-something.
Let's say 2.79 is the standard deviation.
So this gives us some measure of, on average, how far
the numbers are away from the mean which was 4.2.
And it gives it in kind of the units of the
original measurement.
Anyway, I'm all out of time.
I'll see you in the next video.
Or actually, let's figure out-- we said if this was a sample,
if those numbers were a sample and not the population, then
we figured out that the sample variance was 9.7.
And so then the sample standard deviation is just going to
be the square root of that.
The square root of 9.7, which would be 3.1.
3.11.
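Both square roots can be double-checked in Python. This is a minimal sketch that starts from the two variances already worked out, 7.76 and 9.7; the names `pop_sd` and `sample_sd` are illustrative:

```python
import math

# Square roots of the population variance and the sample variance.
pop_sd = math.sqrt(7.76)
sample_sd = math.sqrt(9.7)

print(round(pop_sd, 2), round(sample_sd, 2))  # prints: 2.79 3.11
```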
Anyway, hopefully that makes it a little bit more concrete.
We've been dealing with these sigma notation variables
and all that so far.
So when you actually do it with numbers you see it's
hopefully not that difficult.
Anyway, see you in the next video.