## 字幕表 動画を再生する

• Welcome to this next class in Pattern Recognition, we have been looking at density estimation.

• So, let us briefly recall, what we have been doing in the last couple of classes.

• We have been looking at, how to estimate densities for given IID samples, for a particular density.

• We first looked at the maximum likelihood estimation method. For the last couple of

• classes, we have been considering the Bayesian estimation of density function. So, given

• a density function f x given theta where, theta is the parameter, we are considering

• Bayesian estimate for the parameter theta. As I have already told you, the main difference

• between the maximum likelihood estimation and the Bayesian estimation is that, in the

• Bayesian estimation, we look at the parameter theta, which may be a vector or a scalar,

• the parameter theta itself is viewed as a random variable. And because, we view it as

• a random variable, it has a prior density, which captures our initial uncertainty.

• Our knowledge or lack of knowledge about the specific value of the parameter is captured

• through a prior density f theta. So, f theta gives us some idea about, what we think or

• the possible values for the parameter. Given the given the prior density, we are going

• to use the data likelihood that is, f D given theta to calculate the posterior density f

• theta given D. Once again, I would like to drive your attention

• to a caution on the notation, for simplicity, the densities of all kind of random variables

• we are using the same symbol f. So, f of theta, f of D given theta, f of theta given D, all

• these are densities, purely as mathematical notation looks same because, f is the same

• function. But, we are we are using f as a notation to denote density and density of,

• which random variable it is, is clear from context.

• Thus, f x given theta is the density of x conditioned on theta, which is the parameter

• f of theta, is the density of the parameter theta and so on, f theta given D is the conditional

• density of theta, conditioned on data D. So, even though we are using the same symbol f,

• for all densities so, I hope you understand that, the f used in different times is a different

• function. It refers to densities of different random variables and just to keep the notation

• uncluttered, we are calling it as the f. So, once again, essentially we start with a prior

• density f of theta, for the parameter theta then, use the data likelihood of D given theta

• to calculate the posterior f theta given D, we have seen a couple of examples of this

• process earlier.

• And the idea of Bayesian estimation by now, you would have seen is, to choose a right

• kind of prior, the right kind of prior for us is, what is called the conjugate prior.

• The conjugate prior is that prior density, which results in the posterior density belonging

• to the same class of densities. For example, as we saw when, we are estimating the mean

• of a Gaussian random variable or mean of a Gaussian density where, the variance is assumed

• known, we choose Gaussian density for the prior then, the posterior is also Gaussian

• density. So, for that particular estimation problem,

• the prior density happens to be Gaussian similarly, for a Bernoulli problem where, we have to

• estimate the parameter p namely, the probability of the random variability taken value 1. For

• parameter p, the appropriate prior density turns out to be beta appropriate in the sense

• that, if I take prior density to be beta then, the posterior also becomes beta, so such a

• prior is called a conjugate prior right. The conjugate prior is that prior density,

• which results in the posterior density also, to be of the same class of densities, the

• use is of conjugate prior makes the process of calculation of posterior density easier.

• As we have seen let us say, in the case of the Bernoulli parameter, that we have seen

• earlier, if we start with some beta a 0, for the beta a 0 b 0 for the prior then, the posterior

• is also beta density with possibly some other parameters a and b n.

• So, calculation of the posterior density is simply a matter of parameter updation, given

• the parameters of the prior now, we update them into parameters of the posterior. Having

• obtained the posterior density, we can finally use either the mean or the mode of the posterior

• density at the final estimate. We have seen examples of both or we can also calculate

• f of x given D, that is the actual class conditional density, conditioned on data by integrating

• the posterior density and we have seen example of that also. So, this class, we will look

• at a couple of more examples of Bayesian estimation and then, closed Bayesian estimation. As you

• would have by now seen, Bayesian estimation is a little more complicated mainly because,

• you have to choose the right kind of prior and different kind of estimation problems

• make different priors of the conjugate.

• So, we will start with another example, this is the multinomial example that is, we consider

• estimating the mass function of a discrete random variable, which takes one of M possible

• values say, a 1 to a M where, p i is probability Z takes the value a i. So, essentially Z takes

• value a 1 with probability p 1, a 2 with probability p 2, a M with probability p M and we want

• to estimate this p 1, p 2, p M, given a sample of n iid realizations of Z. We already considered

• this problem in in the maximum likelihood case and there we told you, this is particularly

• important in certain class of pattern recognition problem. Especially those to do with say,

• the text classification and so on where, the discrete random variables for feature that

• is, features that take only finitely many values are important.

• We have seen, how to do the maximum likelihood estimation for obtaining this p 1, p 2’s

• that, p 1, p 2, p M that characterize the mass function of this discrete random variable

• Z. Now, we will look at, how to do the same thing using Bayesian estimation, I hope the

• problem is clear, this already is done earlier so, we will we will quickly review the notation

• that we used earlier.

• So, as earlier, we represent any realization of Z by M-dimensional Boolean vector x, x

• has M components x superscript 1, x superscript M. The transpose there is because, as I said,

• all vectors for us are column vectors, each of these components of this vector x, x superscript

• i is either 0 or 1, and summation of all of them is 1. That means, essentially x takes

• only the unit vectors 1 0 0 0, 0 1 0 0’s and so on, the idea is that, if z takes the

• i th value a i then, we represent it by a vector where, i th component is 1 and all

• others are 0. We have already seen that, this is a interesting

• and useful representation for maximum likelihood with this same, we use the same representation

• here. And now, p i turn out to be, the probability that the i th component of this vector is

• 1 that is what, I wrote there p x subscript i is equal to 1. Because, p i is the probability

• with where, Z takes the i th value a i and when Z takes the i th value a i, the i th

• component of x i becomes 1 where, i th component of x becomes 1.

• Also, as I told you last time, the reason, why we use the superscripts to denote the

• components of x is because, subscripts of x are used as to denote different data or

• data is x 1 to x n that is why, we are using superscripts to denote the components of particular

• data. So, we also seen last time that, for this x, the mass function with the single

• vector parameter p, is product i is equal to 1 to M p i x i because, in any given x,

• only one component of x is 1 and that is the one that, survives this product. So, if x

• is 1 0 0 0 then, for that x, f of x given p will become p 1 to the power 1 and p to

• the power 0 and so on that is, p 1. So thus, this correctly represents the mass function,

• that we are interested in and p is the parameter of the mass , that we need to estimate.

• As usual, our data has n samples x 1 to x n, and each sample x i is a vector of M components,

• the components are shown by superscripts. And each component is either 0 or 1, and in

• each data it is M x i that is, each M vector x i, if I sum all the components, it becomes

• 1. That simply means, because, each component is on 0 1 and sum is 1 means, exactly one

• component is 1 and all others are 0 so, this is the nature of our representation and we

• have n such data items. So, given such data, we want to estimate p

• 1, p 2, p M thus essentially, what we are doing is, we are estimating the parameters

• of a multinomial distribution. As you know, a multinomial binomial takes only binomial

• is important, when there is a random experiment that is repeated, which takes only two values

• success or failure. In the multinomial case, it is the same thing independent realizations

• of a random experiment that takes more than two values say, M values.

• If a random variable takes M different values, I can think of it as a random experiment,

• which can result in one of M possible outcomes right. So, many samples from that random variable

• are like, I have a multinomial distribution that is, I repeat a random experiment, that

• can take one of M possible outcomes, n number of times. So, some of which will be well result

• in first outcome and so on, some of which will results in second outcome and so on.

• So, I know for each repetition, what outcome has come and given those things, we want to

• estimate the multinomial parameters p 1, p 2, p M. Now because, we are in the Bayesian

• context, the first question we have to answer is, what is the conjugate prior in this case

• right. Now, as we have already seen from our earlier examples, for this we should examine

• the form of the data likelihood. So, we already have our model, we for this x, we have the

• mass function, that we that we have seen earlier that is the mass function so, given this mass

• function, what is the data likelihood, that is easy to calculate.

• So, f D given p, p is product of this over i is equal to 1 to n now, we can substitute

• for f of x i, if I substitute for f of x i, f of x i is once again product p j x i j.

• Now, if I interchange i and j summation then, p j to the power x i j product over i can

• be written as product over j p j to the power n j where, n j is summation over i x i j.

• What does that mean, in any given x i, the j th component is 1, if that particular outcome

• of Z represents the j th value of Z. So, this n j tells you, out of the n, how

• many times Z taken the j th value so, for example, n 1 plus n 2 plus n M will be equal

• to n, the total number of samples. So, out of n samples, n 1 times the first value of

• Z has come, n 2 times the second value of Z has come or looking at as a multinomial

• distribution, n 1 times the first outcome has occurred, n 2 times the second outcome

• has occurred and so on. So, now, in terms of this n’s, the data likelihood is given

• by product over j, p j to the power n j. Now, if this is the data likelihood, we multiply

• this with a prior and we should get another expression of the same form, as the prior

• so, what should be our prior.

• So, this is the data likelihood so, we expect the prior to have a some density, this proportional

• to a product of p j to the power a j. If the prior is proportional to product of p j to

• the power a j and then, I multiply with data likelihood, I get another product of p j to

• the power some a j prime so, once again that posterior will belong to the same density

• of the prior right. Let us still remember that, p is a vector parameter, p has M components

• with all of them are probabilities so, they are greater than or equal to 0. And sum of

• p a is equal to 1 because, p 1 is the probability of that Z takes first value and so on so,

• this is needed for the mass function of Z. So, we need a density, which which is a density

• defined over all p, that satisfy this and that should have a form, which is product

• p j to the power a j.

• Now, such a density is, what is known as a Dirichlet density, if you remember when there

• are only two outcomes when you looking at Bernoulli, the the prior happen to be what

• is called the beta density. So, we will see that, the Dirichlet density is a kind of generalization

• of the beta density so, the Dirichlet density is given by f of p, is gamma of a 1 plus a

• 2 plus a M by gamma of a 1 into gamma of a 2 into gamma of a M, product j is equal to

• 1 to M p j to the power a j minus 1. Where, this gamma is the gamma function, that

• we already seen when we discuss the beta function the beta density, gamma function gamma of

• a is integral 0 to infinity x to the power a minus 1, e power minus x d x. In this, the

• the parameters of this Dirichlet density or this a j’s that is that is, a 1 a 2 a M,

• all of them are assumed to be greater than equal to 1.

• Also, this density has this value, only for those p that satisfy all components greater

• than or equal to 0 or some of the components is 1, outside of those p’s the density is

• 0. That means, the density is concentrated on that subset of power M, which satisfies

• p i greater than or equal to 0 and summation p equal to 1, which is essentially called

• as simplex. Those you know what is simplex is, they are simplex but, anyway even, if

• you do not know what is simplex is, this density is non-zero only for those p’s that satisfy

• p i greater than or equal to 0, summation p i is equal to 1 otherwise, the density value

• is 0.

• So, that is the Dirichlet density so, if M is equal to 2, this becomes gamma a 1 plus

• a 2 by gamma a 1 into gamma a 2, and p 1 to the power a 1 minus 1 and p 2 to the power

• a 2 minus 1 and that is the beta density. So, when M is equal to 2, this density becomes

• the beta density and this Dirichlet density happens to be the conjugate prior here. So,

• as you can see, if I am estimating the parameter of a Bernoulli density then, my conjugate

• prior happens to be beta. Whereas, if Iam estimating parameters for

• a multinomial distribution rather than binomial one then, the prior happens to be Dirichlet,

• which is a kind of a neat generalization of the beta density to, a to more than two case.

• Of course, this is a strange expression and we have first show that, this is a density

• on that particular set of p, that we we mentioned. Using the using similar methods, as in the

• case of beta density we can show that, this is a density, the other thing that we want

• is, just like in the beta density case, ultimately because, my posterior will be a Dirichlet

• density. We need to know the movements of the Dirichlet density so that, I I can correctly

• use my posterior to get my estimates.

• Once again without proofs, I I just put down these movements so, if p 1, p 2, p M have

• joint densities as Dirichlet, with parameters a 1, a 2, a M then, expected value of any

• component say, p j is a j by a 0 where, a 0 is a 1 plus a 2 plus a M. Variance of p

• j happens to be a 0 into a j into a 0 minus a j, by a 0 square into a 0 plus 1 and similarly,

• this is the co variance. We do not know the co variance but, this kind of gives us all

• the movements upto the up to order 2. So, for example, if you are going to use the use

• the mean of the posterior rather or final estimate, we would need this formula.

• So, with this let us now go on in this case, compute the posterior density, if by taking

• prior as Dirichlet, the posterior density we have, to for the posterior density, as

• we know is f p given D is proportional to the product of f D given p, into f p f D given

• p is the data likelihood, f p is the prior we have taken prior to be Dirichlet. We already

• have an expression for the likelihood so, if you substitute those two so, this is the

• expression for the likelihood, product p j to the power n j.

• And let us say, we have taken the we have taken the prior to be Dirichlet with parameters

• a 1, a 2, a M then, this becomes the prior so, this product can now be written as, product

• over p j of n j plus a j minus 1. So, obviously the reason, why we chosen this particular

• prior is that, the posterior will belong to same class. So, indeed posterior belongs to

• the same class right, this is product is proportional to product of p j to the power something.

• So, if my prior is Dirichlet with parameters a 1, a 2, a M then, the posterior is Dirichlet

• with parameters n j plus a j where, the n j’s come from the data, This is once again

• very similar to, what happened in the Bernoulli case.

• Thus, the posterior is also Dirich let with parameters n j plus a j so, if we take for

• example, the mean of the posterior as our final Bayesian estimate, we already seen,

• what the mean is a j by sum so, we know summation n j. So, it will be n j plus a j by summation

• over j n j plus a j, summation over j n j is n, that we have already seen and a 0 is

• the notation we have given for summation a j’s.

• So, the the bayesian density, which is taken as the mean of the posterior turns out to

• be n j plus a j by n plus a 0. Let us recall, that the ML estimate for this was n j by n

• right. And the ML estimate is very easy to see, we are asking, what is the probability

• that Z takes the j th value or what is the probability that, the j th outcome occurs

• that is, equal to the number of times j th outcome occurred by the total number of samples

• right, n j means summation over i, x i j is the number of times the j th value has occurred.

• So, n j by n was, as we have already derived was the ML estimate here, instead of it being

• n j by n, it becomes n j plus a j by n plus a 0 where, a j and a 0, which is a 1 plus

• a 2 plus a M are determined by our choice of the prior right, our choice of prior determines,

• what the value a j are. So, just like in the case of the Bernoulli parameters, the nature

• of the Bayesian estimate is same so, you can think of the prior as saying, that before

• I collect data in my mind because, I have some idea of, what the values of p j’s are.

• I have coded them to say that, if I have done a zero repetitions of Z, they are fictitious

• repetitions a 1 of them would give me first value, a 2 of them will give me second value,

• a M of them will give me third value. So, I can choose the a 0 as well as a 1 to a M

• based on my idea of, what these numbers p 1 to p M are. So then, the final estimate

• is the actual in the data, how many times j has occurred plus how many time j has occurred

• in the fictitious trials, divided by the total number of actual trials and the fictitious

• trials. So, when data when the like in the Bernoulli