Placeholder Image

字幕表 動画を再生する

  • Welcome to this next class in Pattern Recognition, we have been looking at density estimation.

  • So, let us briefly recall, what we have been doing in the last couple of classes.

  • We have been looking at, how to estimate densities for given IID samples, for a particular density.

  • We first looked at the maximum likelihood estimation method. For the last couple of

  • classes, we have been considering the Bayesian estimation of density function. So, given

  • a density function f x given theta where, theta is the parameter, we are considering

  • Bayesian estimate for the parameter theta. As I have already told you, the main difference

  • between the maximum likelihood estimation and the Bayesian estimation is that, in the

  • Bayesian estimation, we look at the parameter theta, which may be a vector or a scalar,

  • the parameter theta itself is viewed as a random variable. And because, we view it as

  • a random variable, it has a prior density, which captures our initial uncertainty.

  • Our knowledge or lack of knowledge about the specific value of the parameter is captured

  • through a prior density f theta. So, f theta gives us some idea about, what we think or

  • the possible values for the parameter. Given the given the prior density, we are going

  • to use the data likelihood that is, f D given theta to calculate the posterior density f

  • theta given D. Once again, I would like to drive your attention

  • to a caution on the notation, for simplicity, the densities of all kind of random variables

  • we are using the same symbol f. So, f of theta, f of D given theta, f of theta given D, all

  • these are densities, purely as mathematical notation looks same because, f is the same

  • function. But, we are we are using f as a notation to denote density and density of,

  • which random variable it is, is clear from context.

  • Thus, f x given theta is the density of x conditioned on theta, which is the parameter

  • f of theta, is the density of the parameter theta and so on, f theta given D is the conditional

  • density of theta, conditioned on data D. So, even though we are using the same symbol f,

  • for all densities so, I hope you understand that, the f used in different times is a different

  • function. It refers to densities of different random variables and just to keep the notation

  • uncluttered, we are calling it as the f. So, once again, essentially we start with a prior

  • density f of theta, for the parameter theta then, use the data likelihood of D given theta

  • to calculate the posterior f theta given D, we have seen a couple of examples of this

  • process earlier.

  • And the idea of Bayesian estimation by now, you would have seen is, to choose a right

  • kind of prior, the right kind of prior for us is, what is called the conjugate prior.

  • The conjugate prior is that prior density, which results in the posterior density belonging

  • to the same class of densities. For example, as we saw when, we are estimating the mean

  • of a Gaussian random variable or mean of a Gaussian density where, the variance is assumed

  • known, we choose Gaussian density for the prior then, the posterior is also Gaussian

  • density. So, for that particular estimation problem,

  • the prior density happens to be Gaussian similarly, for a Bernoulli problem where, we have to

  • estimate the parameter p namely, the probability of the random variability taken value 1. For

  • parameter p, the appropriate prior density turns out to be beta appropriate in the sense

  • that, if I take prior density to be beta then, the posterior also becomes beta, so such a

  • prior is called a conjugate prior right. The conjugate prior is that prior density,

  • which results in the posterior density also, to be of the same class of densities, the

  • use is of conjugate prior makes the process of calculation of posterior density easier.

  • As we have seen let us say, in the case of the Bernoulli parameter, that we have seen

  • earlier, if we start with some beta a 0, for the beta a 0 b 0 for the prior then, the posterior

  • is also beta density with possibly some other parameters a and b n.

  • So, calculation of the posterior density is simply a matter of parameter updation, given

  • the parameters of the prior now, we update them into parameters of the posterior. Having

  • obtained the posterior density, we can finally use either the mean or the mode of the posterior

  • density at the final estimate. We have seen examples of both or we can also calculate

  • f of x given D, that is the actual class conditional density, conditioned on data by integrating

  • the posterior density and we have seen example of that also. So, this class, we will look

  • at a couple of more examples of Bayesian estimation and then, closed Bayesian estimation. As you

  • would have by now seen, Bayesian estimation is a little more complicated mainly because,

  • you have to choose the right kind of prior and different kind of estimation problems

  • make different priors of the conjugate.

  • So, we will start with another example, this is the multinomial example that is, we consider

  • estimating the mass function of a discrete random variable, which takes one of M possible

  • values say, a 1 to a M where, p i is probability Z takes the value a i. So, essentially Z takes

  • value a 1 with probability p 1, a 2 with probability p 2, a M with probability p M and we want

  • to estimate this p 1, p 2, p M, given a sample of n iid realizations of Z. We already considered

  • this problem in in the maximum likelihood case and there we told you, this is particularly

  • important in certain class of pattern recognition problem. Especially those to do with say,

  • the text classification and so on where, the discrete random variables for feature that

  • is, features that take only finitely many values are important.

  • We have seen, how to do the maximum likelihood estimation for obtaining this p 1, p 2’s

  • that, p 1, p 2, p M that characterize the mass function of this discrete random variable

  • Z. Now, we will look at, how to do the same thing using Bayesian estimation, I hope the

  • problem is clear, this already is done earlier so, we will we will quickly review the notation

  • that we used earlier.

  • So, as earlier, we represent any realization of Z by M-dimensional Boolean vector x, x

  • has M components x superscript 1, x superscript M. The transpose there is because, as I said,

  • all vectors for us are column vectors, each of these components of this vector x, x superscript

  • i is either 0 or 1, and summation of all of them is 1. That means, essentially x takes

  • only the unit vectors 1 0 0 0, 0 1 0 0’s and so on, the idea is that, if z takes the

  • i th value a i then, we represent it by a vector where, i th component is 1 and all

  • others are 0. We have already seen that, this is a interesting

  • and useful representation for maximum likelihood with this same, we use the same representation

  • here. And now, p i turn out to be, the probability that the i th component of this vector is

  • 1 that is what, I wrote there p x subscript i is equal to 1. Because, p i is the probability

  • with where, Z takes the i th value a i and when Z takes the i th value a i, the i th

  • component of x i becomes 1 where, i th component of x becomes 1.

  • Also, as I told you last time, the reason, why we use the superscripts to denote the

  • components of x is because, subscripts of x are used as to denote different data or

  • data is x 1 to x n that is why, we are using superscripts to denote the components of particular

  • data. So, we also seen last time that, for this x, the mass function with the single

  • vector parameter p, is product i is equal to 1 to M p i x i because, in any given x,

  • only one component of x is 1 and that is the one that, survives this product. So, if x

  • is 1 0 0 0 then, for that x, f of x given p will become p 1 to the power 1 and p to

  • the power 0 and so on that is, p 1. So thus, this correctly represents the mass function,

  • that we are interested in and p is the parameter of the mass , that we need to estimate.

  • As usual, our data has n samples x 1 to x n, and each sample x i is a vector of M components,

  • the components are shown by superscripts. And each component is either 0 or 1, and in

  • each data it is M x i that is, each M vector x i, if I sum all the components, it becomes

  • 1. That simply means, because, each component is on 0 1 and sum is 1 means, exactly one

  • component is 1 and all others are 0 so, this is the nature of our representation and we

  • have n such data items. So, given such data, we want to estimate p

  • 1, p 2, p M thus essentially, what we are doing is, we are estimating the parameters

  • of a multinomial distribution. As you know, a multinomial binomial takes only binomial

  • is important, when there is a random experiment that is repeated, which takes only two values

  • success or failure. In the multinomial case, it is the same thing independent realizations

  • of a random experiment that takes more than two values say, M values.

  • If a random variable takes M different values, I can think of it as a random experiment,

  • which can result in one of M possible outcomes right. So, many samples from that random variable

  • are like, I have a multinomial distribution that is, I repeat a random experiment, that

  • can take one of M possible outcomes, n number of times. So, some of which will be well result

  • in first outcome and so on, some of which will results in second outcome and so on.

  • So, I know for each repetition, what outcome has come and given those things, we want to

  • estimate the multinomial parameters p 1, p 2, p M. Now because, we are in the Bayesian

  • context, the first question we have to answer is, what is the conjugate prior in this case

  • right. Now, as we have already seen from our earlier examples, for this we should examine

  • the form of the data likelihood. So, we already have our model, we for this x, we have the

  • mass function, that we that we have seen earlier that is the mass function so, given this mass

  • function, what is the data likelihood, that is easy to calculate.

  • So, f D given p, p is product of this over i is equal to 1 to n now, we can substitute

  • for f of x i, if I substitute for f of x i, f of x i is once again product p j x i j.

  • Now, if I interchange i and j summation then, p j to the power x i j product over i can

  • be written as product over j p j to the power n j where, n j is summation over i x i j.

  • What does that mean, in any given x i, the j th component is 1, if that particular outcome

  • of Z represents the j th value of Z. So, this n j tells you, out of the n, how

  • many times Z taken the j th value so, for example, n 1 plus n 2 plus n M will be equal

  • to n, the total number of samples. So, out of n samples, n 1 times the first value of

  • Z has come, n 2 times the second value of Z has come or looking at as a multinomial

  • distribution, n 1 times the first outcome has occurred, n 2 times the second outcome

  • has occurred and so on. So, now, in terms of this n’s, the data likelihood is given

  • by product over j, p j to the power n j. Now, if this is the data likelihood, we multiply

  • this with a prior and we should get another expression of the same form, as the prior

  • so, what should be our prior.

  • So, this is the data likelihood so, we expect the prior to have a some density, this proportional

  • to a product of p j to the power a j. If the prior is proportional to product of p j to

  • the power a j and then, I multiply with data likelihood, I get another product of p j to

  • the power some a j prime so, once again that posterior will belong to the same density

  • of the prior right. Let us still remember that, p is a vector parameter, p has M components

  • with all of them are probabilities so, they are greater than or equal to 0. And sum of

  • p a is equal to 1 because, p 1 is the probability of that Z takes first value and so on so,

  • this is needed for the mass function of Z. So, we need a density, which which is a density

  • defined over all p, that satisfy this and that should have a form, which is product

  • p j to the power a j.

  • Now, such a density is, what is known as a Dirichlet density, if you remember when there

  • are only two outcomes when you looking at Bernoulli, the the prior happen to be what

  • is called the beta density. So, we will see that, the Dirichlet density is a kind of generalization

  • of the beta density so, the Dirichlet density is given by f of p, is gamma of a 1 plus a

  • 2 plus a M by gamma of a 1 into gamma of a 2 into gamma of a M, product j is equal to

  • 1 to M p j to the power a j minus 1. Where, this gamma is the gamma function, that

  • we already seen when we discuss the beta function the beta density, gamma function gamma of

  • a is integral 0 to infinity x to the power a minus 1, e power minus x d x. In this, the

  • the parameters of this Dirichlet density or this a j’s that is that is, a 1 a 2 a M,

  • all of them are assumed to be greater than equal to 1.

  • Also, this density has this value, only for those p that satisfy all components greater

  • than or equal to 0 or some of the components is 1, outside of those p’s the density is

  • 0. That means, the density is concentrated on that subset of power M, which satisfies

  • p i greater than or equal to 0 and summation p equal to 1, which is essentially called

  • as simplex. Those you know what is simplex is, they are simplex but, anyway even, if

  • you do not know what is simplex is, this density is non-zero only for those p’s that satisfy

  • p i greater than or equal to 0, summation p i is equal to 1 otherwise, the density value

  • is 0.

  • So, that is the Dirichlet density so, if M is equal to 2, this becomes gamma a 1 plus

  • a 2 by gamma a 1 into gamma a 2, and p 1 to the power a 1 minus 1 and p 2 to the power

  • a 2 minus 1 and that is the beta density. So, when M is equal to 2, this density becomes

  • the beta density and this Dirichlet density happens to be the conjugate prior here. So,

  • as you can see, if I am estimating the parameter of a Bernoulli density then, my conjugate

  • prior happens to be beta. Whereas, if Iam estimating parameters for

  • a multinomial distribution rather than binomial one then, the prior happens to be Dirichlet,

  • which is a kind of a neat generalization of the beta density to, a to more than two case.

  • Of course, this is a strange expression and we have first show that, this is a density

  • on that particular set of p, that we we mentioned. Using the using similar methods, as in the

  • case of beta density we can show that, this is a density, the other thing that we want

  • is, just like in the beta density case, ultimately because, my posterior will be a Dirichlet

  • density. We need to know the movements of the Dirichlet density so that, I I can correctly

  • use my posterior to get my estimates.

  • Once again without proofs, I I just put down these movements so, if p 1, p 2, p M have

  • joint densities as Dirichlet, with parameters a 1, a 2, a M then, expected value of any

  • component say, p j is a j by a 0 where, a 0 is a 1 plus a 2 plus a M. Variance of p

  • j happens to be a 0 into a j into a 0 minus a j, by a 0 square into a 0 plus 1 and similarly,

  • this is the co variance. We do not know the co variance but, this kind of gives us all

  • the movements upto the up to order 2. So, for example, if you are going to use the use

  • the mean of the posterior rather or final estimate, we would need this formula.

  • So, with this let us now go on in this case, compute the posterior density, if by taking

  • prior as Dirichlet, the posterior density we have, to for the posterior density, as

  • we know is f p given D is proportional to the product of f D given p, into f p f D given

  • p is the data likelihood, f p is the prior we have taken prior to be Dirichlet. We already

  • have an expression for the likelihood so, if you substitute those two so, this is the

  • expression for the likelihood, product p j to the power n j.

  • And let us say, we have taken the we have taken the prior to be Dirichlet with parameters

  • a 1, a 2, a M then, this becomes the prior so, this product can now be written as, product

  • over p j of n j plus a j minus 1. So, obviously the reason, why we chosen this particular

  • prior is that, the posterior will belong to same class. So, indeed posterior belongs to

  • the same class right, this is product is proportional to product of p j to the power something.

  • So, if my prior is Dirichlet with parameters a 1, a 2, a M then, the posterior is Dirichlet

  • with parameters n j plus a j where, the n j’s come from the data, This is once again

  • very similar to, what happened in the Bernoulli case.

  • Thus, the posterior is also Dirich let with parameters n j plus a j so, if we take for

  • example, the mean of the posterior as our final Bayesian estimate, we already seen,

  • what the mean is a j by sum so, we know summation n j. So, it will be n j plus a j by summation

  • over j n j plus a j, summation over j n j is n, that we have already seen and a 0 is

  • the notation we have given for summation a j’s.

  • So, the the bayesian density, which is taken as the mean of the posterior turns out to

  • be n j plus a j by n plus a 0. Let us recall, that the ML estimate for this was n j by n

  • right. And the ML estimate is very easy to see, we are asking, what is the probability

  • that Z takes the j th value or what is the probability that, the j th outcome occurs

  • that is, equal to the number of times j th outcome occurred by the total number of samples

  • right, n j means summation over i, x i j is the number of times the j th value has occurred.

  • So, n j by n was, as we have already derived was the ML estimate here, instead of it being

  • n j by n, it becomes n j plus a j by n plus a 0 where, a j and a 0, which is a 1 plus

  • a 2 plus a M are determined by our choice of the prior right, our choice of prior determines,

  • what the value a j are. So, just like in the case of the Bernoulli parameters, the nature

  • of the Bayesian estimate is same so, you can think of the prior as saying, that before

  • I collect data in my mind because, I have some idea of, what the values of p j’s are.

  • I have coded them to say that, if I have done a zero repetitions of Z, they are fictitious

  • repetitions a 1 of them would give me first value, a 2 of them will give me second value,

  • a M of them will give me third value. So, I can choose the a 0 as well as a 1 to a M

  • based on my idea of, what these numbers p 1 to p M are. So then, the final estimate

  • is the actual in the data, how many times j has occurred plus how many time j has occurred

  • in the fictitious trials, divided by the total number of actual trials and the fictitious

  • trials. So, when data when the like in the Bernoulli

  • case, if the data is small then, we do not go very wrong because, our paired beliefs

  • we will ensure that, p j’s do not go into unnatural values. For example, p j breaking

  • 1 or 0, when data is very small but, as n increases for any fixed a j and a 0, as n

  • increases ultimately, this becomes n j by n. So, asymptotically the Bayesian estimate

  • will be same as the maximum likelihood estimate and hence, it will be consistent. But, once