Welcome to this next class in Pattern Recognition. We have been looking at density estimation. So, let us briefly recall what we have been doing in the last couple of classes. We have been looking at how to estimate a density, given iid samples from that density.
We first looked at the maximum likelihood estimation method. For the last couple of classes, we have been considering Bayesian estimation of a density function. So, given a density f(x | theta), where theta is the parameter, we are considering the Bayesian estimate of the parameter theta. As I have already told you, the main difference between maximum likelihood estimation and Bayesian estimation is that, in Bayesian estimation, the parameter theta, which may be a scalar or a vector, is itself viewed as a random variable. And because we view it as a random variable, it has a prior density, which captures our initial uncertainty. Our knowledge, or lack of knowledge, about the specific value of the parameter is captured through a prior density f(theta). So, f(theta) gives us some idea about what we think are the possible values of the parameter. Given the prior density, we are going to use the data likelihood, that is, f(D | theta), to calculate the posterior density f(theta | D). Once again, I would like to draw your attention
to a caution about the notation. For simplicity, we are using the same symbol f for the densities of all kinds of random variables. So, f(theta), f(D | theta), f(theta | D): all of these are densities, and purely as mathematical notation they look the same because f is the same symbol. But we are using f only as a notation to denote a density, and which random variable's density it is should be clear from the context.
Thus, f(x | theta) is the density of x conditioned on the parameter theta, f(theta) is the density of the parameter theta, and so on; f(theta | D) is the conditional density of theta, conditioned on the data D. So, even though we are using the same symbol f for all densities, I hope you understand that the f used at different times is a different function; it refers to densities of different random variables, and just to keep the notation uncluttered we are calling all of them f. So, once again, essentially we start with a prior density f(theta) for the parameter theta, then use the data likelihood f(D | theta) to calculate the posterior f(theta | D). We have seen a couple of examples of this process earlier.
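For reference, the prior-to-posterior step being recalled here can be written out as follows; this is just the standard Bayes rule for densities, with D = {x_1, ..., x_n} the iid data:

```latex
\[
f(\theta \mid D) \;=\; \frac{f(D \mid \theta)\, f(\theta)}{\int f(D \mid \theta')\, f(\theta')\, d\theta'},
\qquad
f(D \mid \theta) \;=\; \prod_{i=1}^{n} f(x_i \mid \theta).
\]
```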
And the idea of Bayesian estimation, as you would have seen by now, is to choose the right kind of prior. The right kind of prior for us is what is called the conjugate prior. The conjugate prior is that prior density which results in the posterior density belonging to the same class of densities. For example, as we saw when we were estimating the mean of a Gaussian density, where the variance is assumed known, if we choose a Gaussian density for the prior, then the posterior is also a Gaussian density.
So, for that particular estimation problem, the conjugate prior happens to be Gaussian. Similarly, for the Bernoulli problem, where we have to estimate the parameter p, namely the probability of the random variable taking value 1, the appropriate prior density turns out to be the beta density; appropriate in the sense that, if I take the prior density to be beta, then the posterior also becomes beta. Such a prior is called a conjugate prior. The conjugate prior is that prior density which results in the posterior density also being of the same class of densities, and the use of a conjugate prior makes the calculation of the posterior density easier.
As we have seen, say, in the case of the Bernoulli parameter considered earlier, if we start with a Beta(a_0, b_0) density for the prior, then the posterior is also a beta density, with possibly some other parameters a_n and b_n.
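For reference, the beta density recalled here has the standard form, with a and b as its two parameters and Gamma the gamma function:

```latex
\[
f(p) \;=\; \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)}\; p^{\,a-1}(1-p)^{\,b-1},
\qquad 0 \le p \le 1 .
\]
```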
So, calculation of the posterior density is simply a matter of parameter updation: given the parameters of the prior, we update them into the parameters of the posterior. Having obtained the posterior density, we can finally use either the mean or the mode of the posterior density as the final estimate; we have seen examples of both. Or we can calculate f(x | D), that is, the actual class conditional density conditioned on the data, by integrating over the posterior density, and we have seen an example of that also. So, in this class, we will look at a couple more examples of Bayesian estimation and then close the topic of Bayesian estimation. As you would have seen by now, Bayesian estimation is a little more complicated, mainly because you have to choose the right kind of prior, and different kinds of estimation problems make different priors the conjugate.
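As a minimal computational sketch of the parameter-updation idea for the Bernoulli/beta case recalled above: the function name and the specific numbers below are only illustrative, and the update a_n = a_0 + (number of 1s), b_n = b_0 + (number of 0s) is the standard conjugate update assumed here.

```python
import numpy as np

def beta_posterior(a0, b0, data):
    """Conjugate update for a Bernoulli parameter p with a Beta(a0, b0) prior.

    data: sequence of 0/1 observations (iid Bernoulli(p) samples).
    Returns the posterior parameters (a_n, b_n).
    """
    data = np.asarray(data)
    n_ones = int(data.sum())        # number of observations equal to 1
    n_zeros = len(data) - n_ones    # number of observations equal to 0
    return a0 + n_ones, b0 + n_zeros

# Example: start from a Beta(2, 2) prior and observe 10 Bernoulli samples.
a_n, b_n = beta_posterior(2, 2, [1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
post_mean = a_n / (a_n + b_n)            # posterior mean, one possible estimate of p
post_mode = (a_n - 1) / (a_n + b_n - 2)  # posterior mode, valid when a_n, b_n > 1
print((a_n, b_n), post_mean, post_mode)
```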
So, we will start with another example; this is the multinomial example. That is, we consider estimating the mass function of a discrete random variable Z, which takes one of M possible values, say a_1 to a_M, where p_i is the probability that Z takes the value a_i. So, essentially, Z takes value a_1 with probability p_1, a_2 with probability p_2, and so on up to a_M with probability p_M, and we want to estimate p_1, p_2, ..., p_M, given a sample of n iid realizations of Z. We already considered this problem in the maximum likelihood case, and there I told you that this is particularly important in a certain class of pattern recognition problems, especially those to do with, say, text classification, where discrete random variables as features, that is, features that take only finitely many values, are important.
We have seen how to do maximum likelihood estimation to obtain the p_1, p_2, ..., p_M that characterize the mass function of this discrete random variable Z. Now, we will look at how to do the same thing using Bayesian estimation. I hope the problem is clear; since this was already done earlier, we will quickly review the notation that we used earlier.
So, as earlier, we represent any realization of Z by an M-dimensional Boolean vector x; x has M components, x^1 through x^M, written with superscripts. The transpose there is because, as I said, all vectors for us are column vectors. Each component x^i of this vector is either 0 or 1, and the sum of all of them is 1. That means x takes only the unit-vector values (1, 0, 0, ..., 0), (0, 1, 0, ..., 0) and so on; the idea is that if Z takes the i-th value a_i, then we represent it by the vector whose i-th component is 1 and all others are 0.
We have already seen that this is an interesting and useful representation in the maximum likelihood case, and we use the same representation here. And now p_i turns out to be the probability that the i-th component of this vector is 1, which is what I wrote there: p_i = P(x^i = 1). Because p_i is the probability that Z takes the i-th value a_i, and when Z takes the i-th value a_i, the i-th component of x becomes 1.
Also, as I told you last time, the reason why we use superscripts to denote the components of x is that subscripts of x are used to denote the different data items; the data is x_1 to x_n, and that is why we use superscripts for the components of a particular data item. We have also seen last time that, for this x, the mass function with the single vector parameter p is f(x | p) = product over i = 1 to M of p_i^{x^i}, because in any given x only one component is 1, and that is the one that survives in this product. So, if x is (1, 0, 0, ..., 0), then for that x, f(x | p) becomes p_1 to the power 1 times all the other p_i to the power 0, that is, p_1. Thus, this correctly represents the mass function that we are interested in, and p is the parameter of the mass function that we need to estimate.
As usual, our data has n samples, x_1 to x_n, and each sample x_i is a vector with M components, shown by superscripts. Each component is either 0 or 1, and in each data item, that is, in each M-vector x_i, if I sum all the components, the sum is 1. That simply means, because each component is 0 or 1 and the sum is 1, that exactly one component is 1 and all others are 0. So, this is the nature of our representation, and we have n such data items.
Given such data, we want to estimate p_1, p_2, ..., p_M; thus, essentially, what we are doing is estimating the parameters of a multinomial distribution. As you know, the binomial is relevant when there is a random experiment, repeated independently, that takes only two values, success or failure. In the multinomial case, it is the same thing: independent realizations of a random experiment that takes more than two values, say M values.
If a random variable takes M different values, I can think of it as a random experiment which can result in one of M possible outcomes. So, n samples from that random variable are like a multinomial situation: I repeat, n times, a random experiment that can result in one of M possible outcomes. Some of the repetitions will result in the first outcome, some in the second outcome, and so on. So, I know the outcome of each repetition, and given those, we want to estimate the multinomial parameters p_1, p_2, ..., p_M. Now, because we are in the Bayesian context, the first question we have to answer is: what is the conjugate prior in this case? As we have already seen from our earlier examples, for this we should examine the form of the data likelihood. We already have our model: for this x, we have the mass function that we have seen earlier. So, given this mass function, what is the data likelihood? That is easy to calculate.
So, f(D | p) is the product over i = 1 to n of f(x_i | p). Now we can substitute for f(x_i | p); it is once again the product over j of p_j^{x_i^j}. If I interchange the order of the products over i and j, then the product over i of p_j^{x_i^j} can be written as the product over j of p_j^{n_j}, where n_j is the summation over i of x_i^j. What does that mean? In any given x_i, the j-th component is 1 if that particular realization of Z is the j-th value of Z.
So, this n_j tells you, out of the n samples, how many times Z has taken the j-th value; for example, n_1 + n_2 + ... + n_M will be equal to n, the total number of samples. Out of the n samples, the first value of Z has come n_1 times, the second value of Z has come n_2 times, or, looking at it as a multinomial distribution, the first outcome has occurred n_1 times, the second outcome has occurred n_2 times, and so on. So, in terms of these n_j's, the data likelihood is given by the product over j of p_j^{n_j}. Now, if this is the data likelihood, when we multiply it with a prior, we should get another expression of the same form as the prior; so, what should our prior be?
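Writing out the manipulation just described:

```latex
\[
f(D \mid p) \;=\; \prod_{i=1}^{n} f(x_i \mid p)
            \;=\; \prod_{i=1}^{n} \prod_{j=1}^{M} p_j^{\,x_i^j}
            \;=\; \prod_{j=1}^{M} p_j^{\,n_j},
\qquad n_j = \sum_{i=1}^{n} x_i^j, \quad \sum_{j=1}^{M} n_j = n .
\]
```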
So, this is the data likelihood, and hence we expect the prior to be a density proportional to a product of p_j to the power a_j. If the prior is proportional to the product of p_j^{a_j}, and I then multiply it with the data likelihood, I get another product of p_j to the power of some a_j', so once again the posterior will belong to the same class of densities as the prior. Let us also remember that p is a vector parameter: p has M components, all of them probabilities, so they are greater than or equal to 0, and the sum of the p_i is equal to 1, because p_1 is the probability that Z takes the first value and so on; this is needed for p to define the mass function of Z. So, we need a density defined over all p that satisfy these constraints, and that density should have a form which is a product of p_j to the power a_j.
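That is, if the prior has the product form just indicated, the exponents simply add when we multiply by the data likelihood:

```latex
\[
f(p \mid D) \;\propto\; f(D \mid p)\, f(p)
\;\propto\; \Big(\prod_{j=1}^{M} p_j^{\,n_j}\Big)\Big(\prod_{j=1}^{M} p_j^{\,a_j}\Big)
\;=\; \prod_{j=1}^{M} p_j^{\,n_j + a_j},
\]
```

so the posterior again has the same product form, with the exponents a_j replaced by a_j + n_j.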
Such a density is what is known as the Dirichlet density. If you remember, when there are only two outcomes, that is, in the Bernoulli case, the conjugate prior happened to be what is called the beta density; we will see that the Dirichlet density is a kind of generalization of the beta density. The Dirichlet density is given by f(p) = gamma(a_1 + a_2 + ... + a_M), divided by gamma(a_1) gamma(a_2) ... gamma(a_M), times the product over j = 1 to M of p_j^{a_j - 1}.
Here, gamma is the gamma function that we have already seen when we discussed the beta density: gamma(a) is the integral from 0 to infinity of x^{a-1} e^{-x} dx. The parameters of this Dirichlet density, the a_j's, that is a_1, a_2, ..., a_M, are all assumed to be greater than or equal to 1.
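In formula form, the Dirichlet density just stated is:

```latex
\[
f(p) \;=\; \frac{\Gamma(a_1 + a_2 + \cdots + a_M)}{\Gamma(a_1)\,\Gamma(a_2)\cdots\Gamma(a_M)}
      \prod_{j=1}^{M} p_j^{\,a_j - 1},
\qquad
\Gamma(a) = \int_0^{\infty} x^{a-1} e^{-x}\, dx, \quad a_j \ge 1 .
\]
```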
Also, this density takes this value only for those p that satisfy: all components greater than or equal to 0, and the sum of the components equal to 1; outside of those p's, the density is 0. That means the density is concentrated on that subset of R^M which satisfies p_i >= 0 and summation of p_i equal to 1, which is essentially what is called a simplex. Those of you who know what a simplex is will recognize it; but anyway, even if you do not know what a simplex is, the point is simply that this density is non-zero only for those p's that satisfy p_i >= 0 and summation of p_i equal to 1; otherwise, the density value is 0.
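As a minimal computational sketch of how this prior combines with the one-hot multinomial data described earlier: the function name and example numbers are only illustrative, the update a_j -> a_j + n_j follows the exponent-adding argument above, and the posterior-mean formula used at the end is the standard Dirichlet mean, stated here as an assumption rather than derived in this lecture.

```python
import numpy as np

def dirichlet_posterior_params(prior_a, X):
    """Conjugate update for the multinomial probabilities p = (p_1, ..., p_M).

    prior_a : Dirichlet prior parameters (a_1, ..., a_M).
    X       : (n, M) array of one-hot rows; row i has a single 1 in the column
              corresponding to the value taken by Z in the i-th sample.
    Multiplying the likelihood prod_j p_j**n_j by the prior (proportional to
    prod_j p_j**(a_j - 1)) gives prod_j p_j**(a_j + n_j - 1), that is, a
    Dirichlet density with parameters a_j + n_j.
    """
    X = np.asarray(X)
    n_j = X.sum(axis=0)                  # n_j = number of times the j-th value occurred
    return np.asarray(prior_a, dtype=float) + n_j

# Example with M = 3 possible values and n = 5 one-hot samples.
prior_a = [1.0, 1.0, 1.0]                # a "flat" Dirichlet prior on the simplex
X = [[1, 0, 0],
     [0, 1, 0],
     [0, 1, 0],
     [0, 0, 1],
     [0, 1, 0]]
post_a = dirichlet_posterior_params(prior_a, X)
post_mean = post_a / post_a.sum()        # standard Dirichlet mean: a_j / sum_k a_k
print(post_a, post_mean)
```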
So, that is the Dirichlet density. If M is equal to 2, this becomes gamma of a 1 plus