Mod-03 Lec-08 Bayesian Estimation examples; the exponential family of densities and ML estimates

  • Welcome to this next class in Pattern Recognition. We have been looking at density estimation.

  • Let us briefly recall what we have been doing in the last couple of classes.

  • We have been looking at how to estimate densities given IID samples from a particular density.

  • We first looked at the maximum likelihood estimation method. For the last couple of classes, we have been considering the Bayesian estimation of the density function. So, given a density function f x given theta, where theta is the parameter, we are considering a Bayesian estimate for the parameter theta.

  • As I have already told you, the main difference between maximum likelihood estimation and Bayesian estimation is that, in Bayesian estimation, the parameter theta, which may be a vector or a scalar, is itself viewed as a random variable. And because we view it as a random variable, it has a prior density, which captures our initial uncertainty.

  • Our knowledge, or lack of knowledge, about the specific value of the parameter is captured through a prior density f theta. So, f theta gives us some idea about what we think are the possible values for the parameter. Given the prior density, we use the data likelihood, that is, f D given theta, to calculate the posterior density f theta given D.

  • Once again, I would like to draw your attention to a caution on the notation. For simplicity, we are using the same symbol f for the densities of all kinds of random variables. So, f of theta, f of D given theta, f of theta given D are all densities; purely as mathematical notation they look the same because f is the same symbol. But we are using f simply as notation for a density, and which random variable it is the density of is clear from context.

  • Thus, f x given theta is the density of x conditioned on theta, which is the parameter; f of theta is the density of the parameter theta; and f theta given D is the conditional density of theta, conditioned on the data D. So, even though we are using the same symbol f for all densities, I hope you understand that the f used at different times is a different function. It refers to densities of different random variables, and just to keep the notation uncluttered we call all of them f.

  • So, once again, essentially we start with a prior density f of theta for the parameter theta, then use the data likelihood f of D given theta to calculate the posterior f theta given D. We have seen a couple of examples of this process earlier.

  • And the idea of Bayesian estimation, as you would have seen by now, is to choose the right kind of prior. The right kind of prior for us is what is called the conjugate prior.

  • The conjugate prior is that prior density which results in the posterior density belonging to the same class of densities. For example, as we saw when estimating the mean of a Gaussian density where the variance is assumed known, if we choose a Gaussian density for the prior, then the posterior is also a Gaussian density.

  • So, for that particular estimation problem, the conjugate prior happens to be Gaussian. Similarly, for the Bernoulli problem, where we have to estimate the parameter p, namely the probability of the random variable taking the value 1, the appropriate prior density turns out to be the beta density; appropriate in the sense that if I take the prior density to be beta, then the posterior also becomes beta. Such a prior is called a conjugate prior.

  • The conjugate prior is that prior density which results in the posterior density also being of the same class of densities, and the use of a conjugate prior makes the process of calculating the posterior density easier.

  • As we have seen, say, in the case of the Bernoulli parameter, if we start with a beta density with parameters a 0 and b 0 for the prior, then the posterior is also a beta density, with possibly some other parameters a n and b n.

  • So, calculation of the posterior density is simply a matter of parameter updating: given the parameters of the prior, we update them into the parameters of the posterior. Having obtained the posterior density, we can finally use either the mean or the mode of the posterior density as the final estimate; we have seen examples of both. Or we can calculate f of x given D, that is, the actual class conditional density conditioned on the data, by integrating with respect to the posterior density, and we have seen an example of that also.

  • So, this class, we will look at a couple more examples of Bayesian estimation and then close Bayesian estimation. As you would have seen by now, Bayesian estimation is a little more complicated, mainly because you have to choose the right kind of prior, and different kinds of estimation problems make different priors the conjugate one.
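
As a concrete reminder of the parameter-updating view summarized above, here is a minimal sketch of the beta-Bernoulli update recalled in this lecture. The function name and the particular prior values a0 = b0 = 2 are illustrative assumptions; the update rule itself (add the number of ones to a 0 and the number of zeros to b 0, then use the posterior mean or mode) is the one derived in the earlier class.

    # Minimal sketch: Bayesian update for a Bernoulli parameter with a beta prior.
    # Prior: Beta(a0, b0). Posterior after data D: Beta(a0 + #ones, b0 + #zeros).
    def beta_bernoulli_posterior(data, a0=2.0, b0=2.0):
        ones = sum(data)
        zeros = len(data) - ones
        a_n, b_n = a0 + ones, b0 + zeros
        post_mean = a_n / (a_n + b_n)              # mean of the posterior
        post_mode = (a_n - 1.0) / (a_n + b_n - 2.0)  # mode (valid when a_n, b_n > 1)
        return a_n, b_n, post_mean, post_mode

    # Example: 10 samples with 7 ones.
    print(beta_bernoulli_posterior([1, 1, 1, 0, 1, 1, 0, 1, 1, 0]))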

  • So, we will start with another example, the multinomial example. That is, we consider estimating the mass function of a discrete random variable Z which takes one of M possible values, say a 1 to a M, where p i is the probability that Z takes the value a i. So, essentially, Z takes value a 1 with probability p 1, a 2 with probability p 2, and so on up to a M with probability p M, and we want to estimate p 1, p 2, up to p M, given a sample of n IID realizations of Z.

  • We already considered this problem in the maximum likelihood case, and there we told you this is particularly important in a certain class of pattern recognition problems, especially those to do with, say, text classification, where discrete random variables as features, that is, features that take only finitely many values, are important.

  • We have seen how to do the maximum likelihood estimation for obtaining the p 1, p 2, up to p M that characterize the mass function of this discrete random variable Z. Now, we will look at how to do the same thing using Bayesian estimation. I hope the problem is clear; this was already set up earlier, so we will quickly review the notation that we used.

  • So, as earlier, we represent any realization of Z by an M-dimensional Boolean vector x; x has M components, x superscript 1 to x superscript M. The transpose there is because, as I said, all vectors for us are column vectors. Each component x superscript i of this vector is either 0 or 1, and the sum of all of them is 1. That means x takes only the unit-vector values 1 0 0 0, 0 1 0 0, and so on. The idea is that, if Z takes the i th value a i, then we represent it by the vector whose i th component is 1 and all others are 0.

  • We have already seen that this is an interesting and useful representation for maximum likelihood; we use the same representation here. And now p i turns out to be the probability that the i th component of this vector is 1, which is what I wrote there: the probability that the i th component of x equals 1 is p i. Because p i is the probability with which Z takes the i th value a i, and when Z takes the i th value a i, the i th component of x becomes 1.

  • Also, as I told you last time, the reason why we use superscripts to denote the components of x is that subscripts of x are used to denote different data items; the data is x 1 to x n. That is why we use superscripts to denote the components of a particular data item.

  • We also saw last time that, for this x, the mass function with the single vector parameter p is the product over i equal to 1 to M of p i to the power x superscript i, because in any given x only one component is 1, and that is the one that survives the product. So, if x is 1 0 0 0, then for that x, f of x given p becomes p 1 to the power 1 times the other p i's to the power 0, that is, p 1. So, this correctly represents the mass function that we are interested in, and p is the parameter of the mass function that we need to estimate.
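
To make the representation concrete, here is a small sketch of the one-hot encoding and of evaluating the mass function just described; the function names and the example probability vector are illustrative, while the encoding of Z = a i and the product form of f of x given p are exactly as in the lecture.

    import numpy as np

    def one_hot(value_index, M):
        # Represent Z = a_i by an M-vector with a 1 in position i and 0 elsewhere.
        x = np.zeros(M)
        x[value_index] = 1.0
        return x

    def mass_function(x, p):
        # f(x | p) = prod_j p_j ** x^j ; only the component with x^j = 1 survives.
        return np.prod(p ** x)

    p = np.array([0.5, 0.3, 0.2])   # example parameter vector (M = 3)
    x = one_hot(1, 3)               # Z took its second value
    print(mass_function(x, p))      # 0.3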

  • As usual, our data has n samples x 1 to x n, and each sample x i is a vector of M components, the components being shown by superscripts. Each component is either 0 or 1, and for each data item x i, that is, each M-vector x i, if I sum all the components, I get 1. That simply means, because each component is 0 or 1 and the sum is 1, exactly one component is 1 and all others are 0. So, this is the nature of our representation, and we have n such data items.

  • So, given such data, we want to estimate p 1, p 2, up to p M. Thus, essentially, what we are doing is estimating the parameters of a multinomial distribution. As you know, a binomial is relevant when a random experiment that takes only two values, success or failure, is repeated. In the multinomial case, it is the same thing: independent realizations of a random experiment that takes more than two values, say M values.

  • If a random variable takes M different values, I can think of it as a random experiment which can result in one of M possible outcomes. So, many samples from that random variable are like a multinomial distribution: I repeat, n times, a random experiment that can take one of M possible outcomes. Some of the repetitions result in the first outcome, some result in the second outcome, and so on.

  • So, I know for each repetition which outcome has occurred, and given those, we want to estimate the multinomial parameters p 1, p 2, up to p M. Now, because we are in the Bayesian context, the first question we have to answer is, what is the conjugate prior in this case?

  • As we have already seen from our earlier examples, for this we should examine the form of the data likelihood. We already have our model: for this x, we have the mass function seen earlier. So, given this mass function, what is the data likelihood? That is easy to calculate.

  • So, f of D given p is the product over i equal to 1 to n of f of x i given p. Now, we can substitute for f of x i; f of x i is once again a product over j of p j to the power x i superscript j. If I interchange the products over i and j, the product over i of p j to the power x i superscript j can be written as the product over j of p j to the power n j, where n j is the sum over i of x i superscript j.

  • What does that mean? In any given x i, the j th component is 1 if that particular realization of Z took the j th value of Z. So, n j tells you, out of the n samples, how many times Z has taken the j th value; for example, n 1 plus n 2 plus up to n M is equal to n, the total number of samples. So, out of the n samples, n 1 times the first value of Z has occurred, n 2 times the second value has occurred, and so on; or, looking at it as a multinomial distribution, n 1 times the first outcome has occurred, n 2 times the second outcome has occurred, and so on.

  • So, in terms of these n j's, the data likelihood is given by the product over j of p j to the power n j. Now, if this is the data likelihood, we multiply it with a prior and we should get another expression of the same form as the prior. So, what should our prior be?

  • Given this data likelihood, we expect the prior to be some density proportional to a product of p j to the power a j. If the prior is proportional to a product of p j to the power a j, and I multiply it with the data likelihood, I get another product of p j to the power some a j prime, so once again the posterior will belong to the same class of densities as the prior.

  • Let us also remember that p is a vector parameter: p has M components, all of them probabilities, so they are greater than or equal to 0, and the sum of the p i is equal to 1, because p 1 is the probability that Z takes the first value and so on; this is needed for p to define a mass function for Z. So, we need a density defined over all p that satisfy this, and it should have a form which is a product of p j to the power a j.

  • Such a density is what is known as a Dirichlet density. If you remember, when there are only two outcomes, when we were looking at the Bernoulli case, the prior happened to be what is called the beta density. We will see that the Dirichlet density is a kind of generalization of the beta density. The Dirichlet density is given by f of p equal to gamma of a 1 plus a 2 plus up to a M, divided by gamma of a 1 into gamma of a 2 into up to gamma of a M, times the product over j equal to 1 to M of p j to the power a j minus 1.

  • Here gamma is the gamma function that we have already seen when we discussed the beta density: gamma of a is the integral from 0 to infinity of x to the power a minus 1 times e to the power minus x d x. The parameters of this Dirichlet density are the a j's, that is, a 1, a 2, up to a M, and all of them are assumed to be greater than or equal to 1.

  • Also, this density takes this value only for those p that satisfy all components greater than or equal to 0 and sum of the components equal to 1; outside of those p's the density is 0. That means the density is concentrated on that subset of R to the power M which satisfies p i greater than or equal to 0 and summation of p i equal to 1, which is what is called a simplex. Those of you who know what a simplex is will recognize it; but even if you do not, the point is simply that this density is non-zero only for those p's that satisfy p i greater than or equal to 0 and summation of p i equal to 1, and otherwise the density value is 0.
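
Written out in symbols, the Dirichlet density just described in words is:

    \[
      f(p \mid a_1,\dots,a_M)
        = \frac{\Gamma(a_1+\dots+a_M)}{\Gamma(a_1)\cdots\Gamma(a_M)}
          \prod_{j=1}^{M} p_j^{\,a_j-1},
      \qquad p_j \ge 0,\ \sum_{j=1}^{M} p_j = 1,
    \]
    and f(p) = 0 outside this simplex.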

  • So, that is the Dirichlet density. If M is equal to 2, this becomes gamma of a 1 plus a 2, divided by gamma of a 1 into gamma of a 2, times p 1 to the power a 1 minus 1 times p 2 to the power a 2 minus 1, and that is the beta density. So, when M is equal to 2, this density becomes the beta density, and the Dirichlet density happens to be the conjugate prior here. As you can see, if I am estimating the parameter of a Bernoulli density, then my conjugate prior happens to be beta.

  • Whereas, if I am estimating the parameters of a multinomial distribution rather than a binomial one, the prior happens to be Dirichlet, which is a neat generalization of the beta density to the more-than-two-outcomes case.

  • Of course, this is an unfamiliar expression, and we should first show that it is a density on that particular set of p that we mentioned. Using methods similar to those for the beta density, we can show that this is a density. The other thing we need is that, just as in the beta density case, because my posterior will be a Dirichlet density, we need to know the moments of the Dirichlet density so that I can correctly use my posterior to get my estimates.

  • Once again without proofs, I just put down these moments. If p 1, p 2, up to p M have a joint density that is Dirichlet with parameters a 1, a 2, up to a M, then the expected value of any component, say p j, is a j by a 0, where a 0 is a 1 plus a 2 plus up to a M. The variance of p j happens to be a j into a 0 minus a j, divided by a 0 square into a 0 plus 1, and similarly there is an expression for the covariance. We will not need the covariance here, but this gives us all the moments up to order 2. For example, if we are going to use the mean of the posterior as our final estimate, we would need this formula.
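
In symbols, the moments just quoted are the standard Dirichlet moments; the covariance formula, which the lecture only points to on the slide, is included here for completeness:

    \[
      E[p_j] = \frac{a_j}{a_0}, \qquad
      \operatorname{Var}(p_j) = \frac{a_j (a_0 - a_j)}{a_0^2 (a_0 + 1)}, \qquad
      \operatorname{Cov}(p_i, p_j) = \frac{-\,a_i a_j}{a_0^2 (a_0 + 1)}\ (i \ne j),
      \qquad a_0 = \sum_{j=1}^{M} a_j .
    \]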

  • So, with this, let us now compute the posterior density in this case, taking the prior to be Dirichlet. The posterior density, as we know, is f of p given D, proportional to the product of f of D given p and f of p; f of D given p is the data likelihood and f of p is the prior, which we have taken to be Dirichlet. We already have an expression for the likelihood, so let us substitute the two: the likelihood is the product over j of p j to the power n j.

  • And say we have taken the prior to be Dirichlet with parameters a 1, a 2, up to a M; then the product can now be written as the product over j of p j to the power n j plus a j minus 1. Obviously, the reason we chose this particular prior is that the posterior belongs to the same class, and indeed it does: it is proportional to a product of p j to the power something.

  • So, if my prior is Dirichlet with parameters a 1, a 2, up to a M, then the posterior is Dirichlet with parameters n j plus a j, where the n j's come from the data. This is once again very similar to what happened in the Bernoulli case.

  • Thus, the posterior is also Dirichlet with parameters n j plus a j. So, if we take, for example, the mean of the posterior as our final Bayesian estimate, we have already seen what the mean is: a parameter divided by the sum of the parameters. So, the estimate of p j will be n j plus a j, divided by the sum over j of n j plus a j; the sum over j of the n j is n, as we have already seen, and a 0 is the notation we have given for the sum of the a j's.

  • So, the Bayesian estimate of p j, taken as the mean of the posterior, turns out to be n j plus a j, divided by n plus a 0. Let us recall that the ML estimate for this was n j by n. And the ML estimate is very easy to interpret: we are asking, what is the probability that Z takes the j th value, or what is the probability that the j th outcome occurs, and that is estimated by the number of times the j th outcome occurred divided by the total number of samples; n j, which is the sum over i of x i superscript j, is the number of times the j th value has occurred.

  • So, n j by n, as we have already derived, was the ML estimate. Here, instead of being n j by n, it becomes n j plus a j, divided by n plus a 0, where a j and a 0, which is a 1 plus a 2 plus up to a M, are determined by our choice of the prior; our choice of prior determines what the values a j are. So, just as with the Bernoulli parameter, the nature of the Bayesian estimate is the same. You can think of the prior as saying that, before I collect data, I have in my mind some idea of what the values of the p j's are.

  • I have encoded that idea by saying that I have done a 0 fictitious repetitions of Z; a 1 of them gave me the first value, a 2 of them gave me the second value, and so on, a M of them gave me the M th value. So, I can choose a 0, as well as a 1 to a M, based on my idea of what the numbers p 1 to p M are. The final estimate is then the number of times the j th value has actually occurred in the data, plus the number of times it has occurred in the fictitious trials, divided by the total number of actual trials plus fictitious trials.

  • So, just as in the Bernoulli case, when the data is small we do not go very wrong, because our prior beliefs ensure that the p j's do not take unnatural values, for example a p j becoming 1 or 0 when the data is very small. But as n increases, for any fixed a j and a 0, this estimate ultimately becomes n j by n. So, asymptotically the Bayesian estimate will be the same as the maximum likelihood estimate and hence it will be consistent. But once again, as in the Bernoulli case, the prior allows my initial beliefs to properly moderate the data, especially when the data is not very large.
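
Here is a minimal sketch of this Dirichlet-multinomial update, assuming the data are supplied as one-hot rows as described above. The function name and the example prior counts are illustrative; the update n j plus a j and the posterior-mean estimate n j plus a j over n plus a 0 are the formulas derived above.

    import numpy as np

    def dirichlet_posterior_mean(X, a):
        # X: n x M array of one-hot rows; a: prior Dirichlet parameters (a_1, ..., a_M).
        n_counts = X.sum(axis=0)                 # n_j = number of times value j occurred
        post_params = n_counts + a               # posterior is Dirichlet(n_j + a_j)
        p_hat = post_params / post_params.sum()  # (n_j + a_j) / (n + a_0)
        return post_params, p_hat

    # Example: 6 samples of a 3-valued variable, prior counts a = (2, 2, 2).
    X = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]])
    print(dirichlet_posterior_mean(X, np.array([2.0, 2.0, 2.0])))

With few samples the estimate is pulled toward the prior proportions; as n grows it approaches the ML estimate n j by n.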

  • Now, let us look at another example. Last class we considered estimating the mean of a normal distribution where we assumed the variance to be known. Now, let us do it the other way round: we want to estimate the variance of a normal distribution, and we assume the mean to be known. It might look a little strange to you; when we did the ML estimation, we did not have to take so much trouble, we directly did one example to estimate both the mean and the variance of a one-dimensional Gaussian distribution.

  • That is because, in the ML case, it is a very straightforward thing. Here, for each kind of parameter the corresponding conjugate prior would be different; for example, when we wanted to estimate the mean, the conjugate prior was Gaussian. Some of you may be thinking that, because we were estimating a Gaussian density, the conjugate prior happened to be Gaussian; that is not true.

  • If you want to estimate the variance of a Gaussian where the mean is assumed known, the conjugate prior cannot be Gaussian, because the variance as a parameter can take only positive values, so its density cannot be Gaussian. Then we may jump to the conclusion that maybe it is the exponential density; the exponential is a density that is non-zero only when the random variable takes positive values.

  • But the exponential is only a special case; we will see that, in general, the prior here is not the exponential, and the exponential is only a very special case of it. Also, the prior will not be on the variance itself; as it turns out, for this case it is better to take 1 by the variance as the parameter. It is often denoted by nu and is often called the precision. While I have not done the vector case, there the inverse of the covariance matrix sigma is called the lambda matrix, and that is called the precision matrix.

  • In the scalar case, of course, we simply take 1 by the variance as the precision, which is often denoted by nu. So, in terms of the parameter nu, the normal density model is written with a factor root nu: normally it is 1 by sigma root 2 pi, and 1 by sigma is root nu, so this becomes root nu by root 2 pi, times the exponential of minus half of x minus mu whole square by sigma square; and since 1 by sigma square is nu, it is written as the exponential of minus nu by 2 into x minus mu whole square. Note that we are assuming mu is known, so nu is the only unknown parameter we are conditioning on.

  • And we need to estimate nu, remembering that nu is always positive; so, for example, when we choose a prior density, we choose one that is 0 for negative nu. To find what the prior density should be, we have to look at the data likelihood.

  • to look at the data likelihood. So, let us look at the data likelihood, the

  • data likelihood is given by in terms of nu, as equal to 1 to n, f of x i given nu, f of

  • x given nu is this. So, if I take a product, this will give me nu to the power root nu

  • to the power n that is, nu to the power n by 2. This 1 by root 2 pi to the power n that

  • is, 2 pi to the power minus n by 2 so, I have a two pi to the power minus n by 2 term, I

  • have a nu to the power n by 2 term. And then, when I take a product of this over x i, it

  • will become exponential minus nu by 2 into sum of this.

  • So, exponential minus nu by 2 into sum over i x i minus mu whole square so now ,to ask

  • what should be the right prior, we should ask what kind of function is this of nu, viewed

  • as a function of nu, what kind of function is this. So, we essentially have an exponential

  • nu into something term and we have a nu to the power something term right. So, the conjugate

  • prior should be something, this proportional to nu power something and that, should be

  • proportional to product of a power of nu and an exponential of a linear function of nu.

  • Because, the data density is some constant into nu to the power something and exponential

  • minus some some k times nu. So, if the prior also has nu to the power something and exponential

  • some constant into nu, I mean then, the product will once again be nu to the power something

  • into exponentials of constant nu. So, the prior should be proportional to a product

  • of a power of nu and an exponential of a linear function in nu. And such a density transfer

  • to be, what is known as a gamma density, such a prior would be what is called the gamma

  • density.

  • So, let us look at the gamma density. The gamma density is given by the density function f of nu equal to 1 by gamma of a, times b to the power a, times nu to the power a minus 1, times e to the power minus b nu, where a and b are parameters. The gamma here is once again the gamma function; as a matter of fact, the gamma function comes out of making this a density: by a simple integration we can show this to be a density, because the integral turns out to be the gamma function.

  • The gamma density has two parameters a and b; a comes in nu to the power a minus 1, and b comes in e to the power minus b nu. The b to the power a is needed so that the density integrates to 1. So, the power of nu is controlled by a and the exponential of the linear function of nu is controlled by b; these two are the parameters. The mean of the gamma density is a by b and the mode is a minus 1 by b.

  • If I choose a to be 1, then the density turns out to be b into e to the power minus b nu, because when a is 1, gamma of 1 turns out to be 1 by a straightforward integration. So, when a is equal to 1 it is simply b e to the power minus b nu, which is nothing but the exponential density; so the exponential density is a special case of the gamma density with a equal to 1.

  • So, let us take the prior to be gamma with parameters a 0 and b 0. That means it becomes nu to the power a 0 minus 1 times e to the power minus b 0 nu; those are the two important terms, the rest is constant.
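
Written out, the gamma density and the moments quoted above are:

    \[
      f(\nu \mid a, b) = \frac{b^{a}}{\Gamma(a)}\, \nu^{a-1} e^{-b\nu}, \quad \nu > 0,
      \qquad E[\nu] = \frac{a}{b}, \qquad \text{mode} = \frac{a-1}{b},
    \]
    and with a = 1 this reduces to the exponential density b e^{-b\nu}.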

  • So, the posterior density becomes f of nu given D, proportional to f of D given nu into f of nu; f of D given nu is the likelihood above and f of nu is the gamma prior, with a equal to a 0 and b equal to b 0.

  • So, we get this: from f of D given nu, forgetting the constants and keeping only the nu terms, we have nu to the power n by 2 times the exponential of minus nu by 2 into the sum of x i minus mu whole square; and from f of nu, I get nu to the power a 0 minus 1 times the exponential of minus b 0 nu. The reason we chose this as the prior is that the two nu-power terms combine into nu to the power something, and the exponential terms combine into an exponential of something times nu.

  • If you do that, it becomes nu to the power a 0 plus n by 2 minus 1, times the exponential of minus b 0 nu minus nu by 2 into the sum. So, we once again have nu to the power something times the exponential of minus a linear function of nu; the posterior, as expected, is once again a gamma density. Now, what kind of gamma density is it? The gamma density with parameters a and b, forgetting the constants, is proportional to nu to the power a minus 1 times e to the power minus b nu. So, for the posterior density, the a parameter is a 0 plus n by 2, and the b parameter is b 0 plus half into this sum.

  • So, if we say that the posterior is a gamma density with parameters a n and b n, then a n will be a 0 plus n by 2, and b n will be b 0 plus half into this summation. I can rewrite this summation: 1 by n into the sum over i equal to 1 to n of x i minus mu whole square is the maximum likelihood estimate of the variance; call it sigma square hat ML.

  • Then this summation is n times sigma square hat ML, so I can write b n as b 0 plus n by 2 times sigma square hat ML. So, if I choose the prior to be gamma with parameters a 0 and b 0, then the posterior will be gamma with parameters a n and b n, where a n turns out to be a 0 plus n by 2, and b n turns out to be b 0 plus n by 2 times sigma square hat ML, sigma square hat ML being the maximum likelihood estimator of the variance in this case.

  • So, recall that sigma square hat ML is the maximum likelihood estimate of the variance.

  • Now, suppose I want to take the mean of the posterior as the final estimate. As we know, for the gamma density with parameters a and b, the mean is a by b. Here the posterior density is gamma with parameters a n and b n, so the mean will be a n by b n. So, our Bayesian estimate for nu, nu hat, would be a n by b n, that is, a 0 plus n by 2, divided by b 0 plus n by 2 times sigma square hat ML.

  • Remember that nu is 1 by sigma square; so, if I did not have the a 0 and b 0, nu hat would be 1 by sigma square hat ML, the same as the maximum likelihood estimate, because nu is 1 by sigma square and the estimate of 1 by sigma square would be 1 by sigma square hat ML. Now, a 0 and b 0 are determined by our choice of prior: we are choosing a gamma density as the prior, and the kind of gamma density we want is what determines the values a 0 and b 0.

  • Now, what can we say about this estimate? Once again, as n tends to infinity, nu hat converges to 1 by sigma square hat ML, because as n tends to infinity, n by 2 dominates both a 0 and b 0, so the fraction essentially becomes n by 2 divided by n by 2 times sigma square hat ML, that is, 1 by sigma square hat ML.

  • Also note the variance of the posterior: for a gamma density with parameters a n and b n, the variance is a n by b n square. Here a n grows like n whereas b n square grows like n square, so the variance goes to 0 as n tends to infinity. So, as n tends to infinity, the posterior essentially concentrates at its mean, and the mean is 1 by sigma square hat ML; once again, just as we expect, the Bayesian estimate is consistent. But at any small sample size, it is determined not only by the data, that is, by 1 by sigma square hat ML, but also by the a 0 and b 0 we choose for the prior density, which is gamma with parameters a 0 and b 0.
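
A minimal sketch of this gamma-prior update for the precision, assuming the mean mu is known; the particular prior values a 0 and b 0 and the function name are illustrative, while the updates a n = a 0 + n/2, b n = b 0 + n times sigma square hat ML / 2, and the estimate nu hat = a n / b n are the formulas derived above.

    import numpy as np

    def precision_posterior(x, mu, a0=2.0, b0=1.0):
        # Gamma(a0, b0) prior on the precision nu = 1/sigma^2, mean mu assumed known.
        n = len(x)
        sigma2_ml = np.mean((x - mu) ** 2)   # ML estimate of the variance
        a_n = a0 + n / 2.0                   # posterior gamma parameters
        b_n = b0 + n * sigma2_ml / 2.0
        nu_hat = a_n / b_n                   # posterior mean = Bayesian estimate of nu
        return a_n, b_n, nu_hat, 1.0 / sigma2_ml

    rng = np.random.default_rng(0)
    x = rng.normal(loc=0.0, scale=2.0, size=1000)   # true precision = 1/4
    print(precision_posterior(x, mu=0.0))           # nu_hat close to the ML value 1/sigma2_ml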

  • So, we have now seen Bayesian estimation for either the mean or the variance of a Gaussian: the mean we saw last time, the variance just now. When I want to estimate only the mean, assuming the variance is known, the prior turns out to be Gaussian. When I want to estimate only the variance, assuming the mean is known, the prior turns out to be gamma. So, what if I want to estimate both mean and variance? Now I have two parameters, and once again, from our experience in estimating the variance, we choose nu as one of the parameters in the density model.

  • So, my density model now is f of x given mu and nu; remember that nu is 1 by sigma square. If both mu and nu are unknown, we need a prior that is a joint density on mu and nu. We already know that, if nu is known and only mu is unknown, the prior density is Gaussian, and if mu is known and nu is unknown, the prior density is gamma. So, the joint density should be some combination of Gaussian and gamma. The algebra turns out to be a little cumbersome, so I will not give you all of it; I will just give you the final expression.

  • So, the conjugate prior would be what is called the Gaussian-gamma density. To describe it, note that any joint density of two random variables mu and nu can be written as the product of the marginal of nu and the conditional of mu given nu; that is true of any pair of random variables, and this is how we model it. The Gaussian-gamma density is given here: f of nu is the first factor, and it is the gamma density we have already seen. So, f of nu is gamma with parameters a 0 and b 0, and the conditional density of mu given nu is essentially a Gaussian with nu as its precision, or 1 by nu as its variance.

  • More precisely, the conditional of mu given nu is a Gaussian in the variable mu, with its own mean mu 0 and with a variance that is a function of the conditioning random variable: it is not directly 1 by nu but 1 by c 0 nu. Thus the marginal of nu is a gamma density, and the conditional density of mu, conditioned on nu, is a Gaussian. Actually, by looking at the data likelihood, we can see that this is the kind of dependence we need: the nu dependence can always be expressed as nu to the power something times an exponential that is linear in nu, whereas the mu dependence can only be expressed by something that is a function of both nu and mu; that is why we have to model the prior with this kind of factorization.
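
Written out, the factorization just described is (with the normalizing constants dropped, as the next remark explains):

    \[
      f(\mu, \nu) \;=\; f(\nu)\, f(\mu \mid \nu)
      \;\propto\; \nu^{a_0 - 1} e^{-b_0 \nu}\;
        \sqrt{\nu}\, \exp\!\Big(-\frac{c_0 \nu}{2} (\mu - \mu_0)^2\Big),
    \]
    that is, \nu \sim \mathrm{Gamma}(a_0, b_0) and \mu \mid \nu \sim \mathcal{N}\big(\mu_0,\ 1/(c_0 \nu)\big).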

  • Note that I write an equality here even though, by now, we know that we do not need the actual constants and are only looking at the form of these densities. We have seen enough examples by now, so I have started abusing the notation and simply put an equality even though it is not really one.

  • This expression is not really a density as written: there will always be some normalizing constant, one constant to make the gamma factor a density and another to make the Gaussian factor a proper normal density. But by now we know these constants do not matter, so we abuse notation by not writing them. Also, strictly we would have some relation between c 0, a 0, and b 0, but it really does not matter; we can choose a slightly bigger class of densities as the conjugate prior.

  • Of course, this prior density is quite involved, and doing the Bayesian estimation with it is not easy, so I will skip the details; you can sit and do the algebra, which is more complicated. But ultimately you get similar-looking final expressions: essentially, for the estimate of mu, what we get is a convex combination of the sample mean and something that depends on the prior parameters.

  • And similarly, for the estimate of nu, it will be some factor involving 1 by sigma square hat ML and something that depends on your a 0 and b 0, in such a way that, as n tends to infinity, once again the ML estimates and the Bayesian estimates become the same. So, I have just given you the prior for this case, but we will not actually derive the final estimates. I have also not done any multidimensional examples; they are not conceptually any more difficult than the one-dimensional normal we did. But obviously, as you can see, compared to maximum likelihood, obtaining Bayesian estimates involves a lot more algebra, and the algebraic notation becomes more cumbersome, so we will skip that.

  • So, we will simply say that we can obtain Bayesian estimates like this for many standard densities. But there is one point that is by now evident: obtaining maximum likelihood estimates and obtaining Bayesian estimates is not the same. For maximum likelihood estimates, almost mechanically, I can calculate the likelihood function, differentiate, equate to 0, find the maximum, and I get the estimates. For a Bayesian estimate, I have to properly choose the right kind of prior, which is the conjugate prior for that particular problem, and only then are the expressions amenable to simplification.

  • And then I have to look at the parameters of the posterior density, and based on those, I obtain my Bayesian estimates. As we have seen, the conjugate prior depends on the form of f of x given theta; as a matter of fact, for the same density in our mind, say a Gaussian, the prior changes depending on the parameterization and on which parameters we think are unknown.

  • For example, if we think only the mean of the Gaussian is unknown, the conjugate prior is Gaussian; if we think only the variance is unknown and we express the variance in terms of the precision parameter, the conjugate prior happens to be gamma. If we think both the mean and the precision are unknown, the conjugate prior density turns out to be that Gaussian-gamma density I told you about. So, the conjugate prior depends very much on the form of f of x given theta, and the procedure is certainly a little more involved than maximum likelihood estimation.

  • So, what is it that we gain from this; why is the maximum likelihood estimate not sufficient? We have already seen the answer when we started Bayesian estimation: the reason we came to Bayesian estimation is that the maximum likelihood estimate blindly believes the data. Suppose we have some prior information about the kind of values the parameter can take, and our first few data items happen to be bad, or we have very little data; in maximum likelihood there is no way to bring such partial information about the parameter to bear on the final estimate.

  • The prior density gives us this flexibility: it allows us to incorporate whatever knowledge we may have about the parameter, in the form of a conjugate prior. And as we have seen in the final expressions, it always yields an estimate in which the small-sample problems are automatically handled, by combining the part that comes only from the data with the part that comes from the prior. So, at small sample sizes, our prior beliefs moderate us and keep us from jumping to too drastic conclusions based on the data; that is the essence of Bayesian estimation.

  • Now, we will move on to a few more related issues in estimation. We have seen two specific methods of estimation, and we have seen how to carry them out for different densities; for example, we have derived ML and Bayesian estimates for a few standard densities. Now, let us look at a few more generic issues about estimation. The first thing we will do is look at a generic representation that covers many densities, a single form for a density.

  • By now, I suppose you have become familiar with the fact that, as I said right at the beginning when we started on estimation, we use the word density to mean either a density or a mass function, depending on whether the random variable is continuous or discrete; we use density in a generic sense. So, we will first look at a representation of a density function, in terms of some parameters, that captures most of the standard densities; such a form is called the exponential family of densities.

  • It is a very important idea because, as we shall see later on, for the exponential family of densities ML estimation becomes very straightforward. We get generic ML estimates for all densities within the exponential family: for all of them, we can write down one generic set of equations to solve to get the ML estimates, and we do not have to treat each density individually. And, equally importantly, looking at this generic form of a density function allows us to introduce an important notion in estimation, which is called the sufficient statistic.

  • So, first let us look at what we call the exponential family of densities. Suppose you have a density model for a random variable x with parameters eta; eta could be a single parameter or many parameters, and with many parameters we think of it as a parameter vector. We write the density model f x given eta as h of x, some function only of x, multiplied by g of eta, some function only of the parameter eta, multiplied by the exponential of eta transpose u x, where u x is a vector of functions of x.

  • So, given x, I can form some functions of it, u 1 x, u 2 x, and so on; for example, u 1 x could be x, u 2 x could be x square, and so on. So, u x is a vector of functions of x, and the density is in the exponential family if it can be written as a product of a term involving only x, a term involving only eta, and a term that involves both eta and x in this very special way, the exponential of eta transpose u x, where u is a vector of given functions of x.

  • The reason why it is called the exponential family is that I can always write the whole thing as an exponential of something: I can write it as the exponential of eta transpose u x plus l n h x plus l n g eta, because the h x factor can be brought inside the exponential by writing it as l n h x, and similarly g eta as l n g eta; the exponential of l n h x is h x and the exponential of l n g eta is g eta. Because it can be written like this, it is called an exponential family.

  • The important thing is that many standard densities, Bernoulli, binomial, Poisson, gamma, beta, Gaussian, exponential, can all be put in this form. Of course, among the standard densities, the notable one that cannot be written in this form is the uniform density; except for the uniform density, almost all the standard densities can be put in this form.
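
Written out, the two equivalent forms just described are:

    \[
      f(x \mid \eta) \;=\; h(x)\, g(\eta)\, \exp\!\big(\eta^{\mathsf T} u(x)\big)
      \;=\; \exp\!\big(\eta^{\mathsf T} u(x) + \ln h(x) + \ln g(\eta)\big).
    \]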

  • So, let us look at a simple example of how to put a density in the standard exponential-family form; let us consider the Bernoulli distribution. The mass function with parameter p is p to the power x into 1 minus p to the power 1 minus x. This is obviously not yet in the form of a factor involving only x, multiplied by a factor involving only the parameter, multiplied by a factor like the exponential above. But the point is that, by thinking of something else as the parameter instead of p, we will be able to put it in this form.

  • So, let us see that. Starting with this, I can always write p to the power x into 1 minus p to the power 1 minus x as the exponential of the l n of that. If I take the l n and put it inside the exponential, the l n of this becomes x l n p plus 1 minus x into l n of 1 minus p; so that is what I did: the exponential of x l n p plus 1 minus x into l n 1 minus p. Now, there is a 1 into l n 1 minus p term, and the exponential of l n 1 minus p is 1 minus p, so let me take that factor out; that is the 1 minus p.

  • Now, I am left with x l n p minus x l n 1 minus p, which I can write as x into l n of p by 1 minus p; so I can write the whole thing as 1 minus p into the exponential of x l n p by 1 minus p. The 1 minus p I can further write as 1 by 1 plus p by 1 minus p.

  • One can now immediately see that, if I call l n of p by 1 minus p the parameter, say eta, then the first factor depends only on eta, and the second factor is the exponential of eta times a function of x, namely x itself. I still have to write the first factor in terms of eta, which is easy: I can always write p by 1 minus p as the exponential of l n p by 1 minus p. So, I can write this as 1 by 1 plus exponential eta, into the exponential of eta x, where eta is l n p by 1 minus p.

  • So, I can write my Bernoulli mass function as 1 by 1 plus exponential eta, into the exponential of eta x, where eta is l n of p by 1 minus p.

  • So, what does this mean? This is exactly in the form h x into g eta into the exponential of eta transpose u x: h x is 1, since there is no factor that depends only on x; g eta is the factor 1 by 1 plus exponential eta; and for the exponential of eta transpose u x, I have got eta times x, and since eta is a scalar here, I can simply take u x to be x. So, if I take eta as l n p by 1 minus p, h x as 1, g eta as 1 by 1 plus exponential eta, and u x equal to x, then it is in this form, where the transpose is of course redundant because eta happens to be one-dimensional.
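
Collecting the chain of rewrites just described in one place:

    \[
      p^{x}(1-p)^{1-x}
        = \exp\big(x\ln p + (1-x)\ln(1-p)\big)
        = (1-p)\,\exp\!\Big(x \ln\frac{p}{1-p}\Big)
        = \frac{1}{1+e^{\eta}}\, e^{\eta x},
      \qquad \eta = \ln\frac{p}{1-p},
    \]
    so that h(x) = 1, g(\eta) = 1/(1+e^{\eta}), and u(x) = x.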

  • This means that the Bernoulli density belongs to the exponential family, and in the same way all the standard densities can be put in this general framework. Note, for example, that if I want to represent the Bernoulli mass function with p as the parameter, it is not in this generic form h x into g eta into the exponential; but if, instead of using p as the parameter, I use l n of p by 1 minus p as the parameter, then I can put it in this form.

  • After all, if you give me eta, that is, l n p by 1 minus p, I can calculate p, and if you give me p, I can calculate l n p by 1 minus p; the transformation from eta to p is one-to-one and invertible. So, whether I think of eta or p as the parameter does not matter, but if I think of eta as the parameter, the mass function comes into a very standard form. That is why this particular eta is sometimes called the natural parameter for the Bernoulli.

  • In the same way, many other densities can be put in the exponential form. For example, for the Gaussian, normally I write it with mu and sigma square as the two parameters. But we can also write it as: some function of x, which here is essentially a constant; some factor that depends only on the parameters, which I call eta 1 and eta 2; and then the exponential of eta 1 times some function of x plus eta 2 times some function of x, where eta 1 happens to be mu by sigma square, eta 2 happens to be minus 1 by 2 sigma square, u 1 x happens to be x, and u 2 x happens to be x square.

  • The algebra involved is a little more than the algebra involved in showing this for the Bernoulli density, but once again it is just algebra. Starting from the usual expression, one can show that it is the same as this exponential-family expression with the following identifications: eta 1 is mu by sigma square, eta 2 is minus 1 by 2 sigma square, u 1 x is equal to x, and u 2 x is equal to x square.
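
Written out, one consistent way of expressing this identification is the following; the explicit form of g of eta is not spelled out in the lecture, but it follows from collecting the remaining factors, so treat it as a reconstruction:

    \[
      \frac{1}{\sigma\sqrt{2\pi}} \exp\!\Big(-\frac{(x-\mu)^2}{2\sigma^2}\Big)
      \;=\; \underbrace{\frac{1}{\sqrt{2\pi}}}_{h(x)}\;
            \underbrace{\sqrt{-2\eta_2}\,\exp\!\Big(\frac{\eta_1^2}{4\eta_2}\Big)}_{g(\eta)}\;
            \exp\big(\eta_1 x + \eta_2 x^2\big),
      \qquad \eta_1 = \frac{\mu}{\sigma^2},\ \ \eta_2 = -\frac{1}{2\sigma^2}.
    \]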

  • So, with these identifications, this density once again assumes the form h x, g eta, exponential eta transpose u x. As I said, h x can be thought of as a constant function, this factor is the g eta function, and this is the exponential of eta 1 u 1 x plus eta 2 u 2 x, where eta 1, eta 2, u 1, u 2 are as given. So, once again we have the form h x into g eta into the exponential of eta transpose u x; the Gaussian is also in the exponential family, and like this we can show that almost all standard densities belong to the exponential class of densities.

  • Now, what is the use of showing that many of these densities belong to the exponential family? The main utility is that, as I said, we get a very standard, generic form for the maximum likelihood estimate.

  • And also, it means the following. When a density is in this form, if I take the data likelihood, it is the n-fold product of this form, that is, the product of the h x i's, times g eta to the power n; and when I multiply the exponential of eta transpose u x 1 by the exponential of eta transpose u x 2 and so on, ultimately I get the exponential of eta transpose times the summation of u x i. So, these functions u x, or rather the quantity summation over i of u x i, are the only way the data affects the data likelihood. This form therefore gives us a very interesting, generic way in which the data enters the data likelihood, and hence a standard method for calculating maximum likelihood estimates for all such densities.
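
A minimal sketch of this observation, using the Bernoulli form derived earlier: the log-likelihood depends on the data only through the sum of u of x i, so two data sets with the same sum give exactly the same likelihood. The function name is illustrative.

    import numpy as np

    # For an exponential-family density, the data enters the (log-)likelihood only
    # through sum_i u(x_i). Illustrated for the Bernoulli case derived above,
    # where u(x) = x and eta = ln(p / (1 - p)).
    def bernoulli_log_likelihood(data, eta):
        n = len(data)
        sum_u = np.sum(data)                 # sufficient statistic sum_i u(x_i)
        return eta * sum_u - n * np.log(1.0 + np.exp(eta))

    d1 = np.array([1, 1, 0, 1, 0, 1])        # two data sets with the same sum_i x_i
    d2 = np.array([1, 0, 1, 1, 1, 0])
    eta = np.log(0.6 / 0.4)
    print(bernoulli_log_likelihood(d1, eta) == bernoulli_log_likelihood(d2, eta))  # True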

  • So, in the next class, we will look at a few examples of the exponential family of densities, and at how viewing all of them as members of the exponential family allows us to obtain maximum likelihood estimates in a generic fashion; and then we will introduce the notion of what is called a sufficient statistic.

  • Thank you.
