In statistics, logistic regression, or logit regression, is a type of probabilistic statistical
classification model. It is used to predict the outcome of a categorical dependent variable
(such as a binary response) based on one or more predictor variables. That is, it is used in
estimating the parameters of a qualitative response model.
The probabilities describing the possible outcomes of a single trial are modeled, as
a function of the explanatory variables, using a logistic function. Frequently "logistic
regression" is used to refer specifically to the problem in which the dependent variable
is binary—that is, the number of available categories is two—while problems with more
than two categories are referred to as multinomial logistic regression or, if the multiple categories
are ordered, as ordered logistic regression. Logistic regression measures the relationship
between a categorical dependent variable and one or more independent variables, which are
usually continuous, by using probability scores as the predicted values of the dependent variable.
As such it treats the same set of problems as does probit regression using similar techniques.
Fields and examples of applications
Logistic regression was put forth in the 1940s
as an alternative to Fisher's 1936 classification method, linear discriminant analysis. It is
used extensively in numerous disciplines, including the medical and social science fields.
For example, the Trauma and Injury Severity Score, which is widely used to predict mortality
in injured patients, was originally developed by Boyd et al. using logistic regression.
Logistic regression might be used to predict whether a patient has a given disease, based
on observed characteristics of the patient. Another example might be to predict whether
an American voter will vote Democratic or Republican, based on age, income, gender,
race, state of residence, votes in previous elections, etc. The technique can also be
used in engineering, especially for predicting the probability of failure of a given process,
system or product. It is also used in marketing applications such as prediction of a customer's
propensity to purchase a product or cease a subscription, etc. In economics it can be
used to predict the likelihood of a person's choosing to be in the labor force, and a business
application would be to predict the likelihood of a homeowner defaulting on a mortgage. Conditional
random fields, an extension of logistic regression to sequential data, are used in natural language
processing.
Basics
Logistic regression can be binomial or multinomial. Binomial or binary logistic regression deals
with situations in which the observed outcome for a dependent variable can have only two
possible types. Multinomial logistic regression deals with situations where the outcome can
have three or more possible types. In binary logistic regression, the outcome is usually
coded as "0" or "1", as this leads to the most straightforward interpretation. If a
particular observed outcome for the dependent variable is the noteworthy possible outcome
it is usually coded as "1" and the contrary outcome as "0". Logistic regression is used
to predict the odds of being a case based on the values of the independent variables.
The odds are defined as the probability that a particular outcome is a case divided by
the probability that it is a noncase. Like other forms of regression analysis, logistic
regression makes use of one or more predictor variables that may be either continuous or
categorical data. Unlike ordinary linear regression, however, logistic regression is used for predicting
binary outcomes of the dependent variable rather than continuous outcomes. Given this
difference, it is necessary that logistic regression take the natural logarithm of the
odds of the dependent variable being a case to create a continuous criterion as a transformed
version of the dependent variable. Thus the logit transformation is referred to as the
link function in logistic regression—although the dependent variable in logistic regression
is binomial, the logit is the continuous criterion upon which linear regression is conducted.
The logit of success is then fit to the predictors using linear regression analysis. The predicted
value of the logit is converted back into predicted odds via the inverse of the natural
logarithm, namely the exponential function. Therefore, although the observed dependent
variable in logistic regression is a zero-or-one variable, the logistic regression estimates
the odds, as a continuous variable, that the dependent variable is a success. In some applications
the odds are all that is needed. In others, a specific yes-or-no prediction is needed
for whether the dependent variable is or is not a case; this categorical prediction can
be based on the computed odds of a success, with predicted odds above some chosen cut-off
value being translated into a prediction of a success.
Logistic function, odds ratio, and logit
An explanation of logistic regression begins with an explanation of the logistic function,
which always takes on values between zero and one:

F(t) = 1 / (1 + e^(−t))

and viewing t as a linear function of an explanatory variable x (that is, t = β0 + β1·x), the logistic function can be written
as:

F(x) = 1 / (1 + e^(−(β0 + β1·x)))

This will be interpreted as the probability of the dependent variable equalling a "success"
or "case" rather than a failure or non-case. We also define the inverse of the logistic
function, the logit:

g(F(x)) = ln( F(x) / (1 − F(x)) ) = β0 + β1·x

and equivalently:

F(x) / (1 − F(x)) = e^(β0 + β1·x)
A graph of the logistic function is shown in Figure 1. The input is the value of t and
the output is F(t). The logistic function is useful because it can take an input with any value
from negative infinity to positive infinity, whereas the output is confined to values between
0 and 1 and hence is interpretable as a probability. In the above equations, g refers to the logit
function of some given linear combination of the predictors, ln denotes the natural logarithm,
F(x) is the probability that the dependent variable equals a case, β0 is the intercept from the linear
regression equation, β1·x is the regression coefficient β1 multiplied by some value x of the predictor,
and base e denotes the exponential function. The formula for F(x) illustrates that the probability
of the dependent variable equaling a case is equal to the value of the logistic function
of the linear regression expression. This is important in that it shows that the value
of the linear regression expression can vary from negative to positive infinity and yet,
after transformation, the resulting expression for the probability ranges between 0 and 1.
The equation for g(F(x)) illustrates that the logit is equivalent to the linear regression expression.
Likewise, the next equation illustrates that the odds of the dependent variable equaling
a case is equivalent to the exponential function of the linear regression expression. This
illustrates how the logit serves as a link function between the probability and the linear
regression expression. Given that the logit ranges between negative infinity and positive
infinity, it provides an adequate criterion upon which to conduct linear regression and
the logit is easily converted back into the odds.
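A minimal sketch of these probability, odds, and logit conversions in Python (the coefficient values below are made up for illustration):

```python
import math

def logistic(t):
    # Logistic function: maps any real t to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    # Logit (log-odds): the inverse of the logistic function.
    return math.log(p / (1.0 - p))

# Illustrative (made-up) coefficients for a single predictor x.
beta0, beta1 = -3.0, 0.5
x = 7.0

t = beta0 + beta1 * x          # linear predictor
p = logistic(t)                # probability of a "case"
odds = p / (1.0 - p)           # equals exp(t)

print(round(p, 3), round(odds, 3), round(logit(p), 3))  # logit(p) recovers t = 0.5
```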
Multiple explanatory variables
If there are multiple explanatory variables,
then the above expression β0 + β1·x can be revised to β0 + β1·x1 + β2·x2 + ... + βm·xm. Then when this is used in the equation relating
the logged odds of a success to the values of the predictors, the linear regression will
be a multiple regression with m explanators; the parameters βj for all j = 0, 1, 2, ..., m
are all estimated.
Model fitting
Estimation
Maximum likelihood estimation
The regression coefficients are usually estimated using maximum likelihood estimation. Unlike
linear regression with normally distributed residuals, it is not possible to find a closed-form
expression for the coefficient values that maximizes the likelihood function, so an iterative
process must be used instead, for example Newton's method. This process begins with
a tentative solution, revises it slightly to see if it can be improved, and repeats
this revision until improvement is minute, at which point the process is said to have
converged. In some instances the model may not reach
convergence. When a model does not converge this indicates that the coefficients are not
meaningful because the iterative process was unable to find appropriate solutions. A failure
to converge may occur for a number of reasons: having a large proportion of predictors to
cases, multicollinearity, sparseness, or complete separation.
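A minimal sketch of the iterative estimation described above (Newton's method, equivalently iteratively reweighted least squares, for the logistic log-likelihood), assuming NumPy; the function and variable names are illustrative:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25, tol=1e-8):
    """Newton's method / IRLS for logistic regression (minimal sketch).

    X is an (n, m+1) design matrix whose first column is all ones
    (the intercept pseudo-variable); y is an array of 0/1 outcomes.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities
        W = p * (1.0 - p)                     # Bernoulli variances used as weights
        grad = X.T @ (y - p)                  # gradient of the log-likelihood
        hess = X.T @ (X * W[:, None])         # observed information matrix
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:        # converged: improvement is minute
            break
    return beta
```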
Having a large proportion of variables to cases results in an overly conservative Wald
statistic and can lead to nonconvergence. Multicollinearity refers to unacceptably high
correlations between predictors. As multicollinearity increases, coefficients remain unbiased but
standard errors increase and the likelihood of model convergence decreases. To detect
multicollinearity amongst the predictors, one can conduct a linear regression analysis
with the predictors of interest for the sole purpose of examining the tolerance statistic
used to assess whether multicollinearity is unacceptably high.
Sparseness in the data refers to having a large proportion of empty cells. Zero cell
counts are particularly problematic with categorical predictors. With continuous predictors, the
model can infer values for the zero cell counts, but this is not the case with categorical
predictors. The reason the model will not converge with zero cell counts for categorical
predictors is because the natural logarithm of zero is an undefined value, so final solutions
to the model cannot be reached. To remedy this problem, researchers may collapse categories
in a theoretically meaningful way or may consider adding a constant to all cells.
Another numerical problem that may lead to a lack of convergence is complete separation,
which refers to the instance in which the predictors perfectly predict the criterion
– all cases are accurately classified. In such instances, one should reexamine the data,
as there is likely some kind of error. Although not a precise number, as a general
rule of thumb, logistic regression models require a minimum of 10 events per explanatory
variable.
Minimum chi-squared estimator for grouped data
While individual data will have a dependent
variable with a value of zero or one for every observation, with grouped data one observation
is on a group of people who all share the same characteristics; in this case the researcher
observes the proportion of people in the group for whom the response variable falls into
one category or the other. If this proportion is neither zero nor one for any group, the
minimum chi-squared estimator involves using weighted least squares to estimate a linear
model in which the dependent variable is the logit of the proportion: that is, the log
of the ratio of the fraction in one group to the fraction in the other group.
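A minimal sketch of this grouped-data weighted least squares estimator, assuming NumPy; the group covariates and counts below are made up:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])          # one covariate value per group
n = np.array([50, 60, 55, 70])              # group sizes
successes = np.array([10, 21, 30, 49])      # observed successes per group

p = successes / n                            # observed proportion per group (neither 0 nor 1)
logit = np.log(p / (1.0 - p))                # dependent variable: logit of the proportion
w = n * p * (1.0 - p)                        # WLS weights (inverse approximate variances)

X = np.column_stack([np.ones_like(x), x])    # design matrix with intercept
W = np.diag(w)
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ logit)
print(beta)                                  # estimated intercept and slope
```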
Evaluating goodness of fit
Goodness of fit in linear regression models
is generally measured using the R2. Since this has no direct analog in logistic regression,
various methods including the following can be used instead.
Deviance and likelihood ratio tests
In linear regression analysis, one is concerned
with partitioning variance via the sum of squares calculations – variance in the criterion
is essentially divided into variance accounted for by the predictors and residual variance.
In logistic regression analysis, deviance is used in lieu of sum of squares calculations.
Deviance is analogous to the sum of squares calculations in linear regression and is a
measure of the lack of fit to the data in a logistic regression model. Deviance is calculated
by comparing a given model with the saturated model – a model with a theoretically perfect
fit. This computation is called the likelihood-ratio test:

D = −2 ln( likelihood of the fitted model / likelihood of the saturated model )

In the above equation D represents the deviance and ln represents the natural logarithm. The
likelihood ratio of the fitted model to the saturated model is at most one, so its logarithm
is negative; multiplying by negative two produces a positive value with an approximate chi-squared
distribution. Smaller values indicate better fit as the fitted model deviates less from
the saturated model. When assessed upon a chi-square distribution, nonsignificant chi-square
values indicate very little unexplained variance and thus, good model fit. Conversely, a significant
chi-square value indicates that a significant amount of the variance is unexplained.
Two measures of deviance are particularly important in logistic regression: null deviance
and model deviance. The null deviance represents the difference between a model with only the
intercept and the saturated model. And, the model deviance represents the difference between
a model with at least one predictor and the saturated model. In this respect, the null
model provides a baseline upon which to compare predictor models. Given that deviance is a
measure of the difference between a given model and the saturated model, smaller values
indicate better fit. Therefore, to assess the contribution of a predictor or set of
predictors, one can subtract the model deviance from the null deviance and assess the difference
on a chi-square distribution with degree of freedom equal to the difference in the number
of parameters estimated. Let

D_null = −2 ln( likelihood of the null model / likelihood of the saturated model )

D_model = −2 ln( likelihood of the fitted model / likelihood of the saturated model )

Then

D_null − D_model ≈ chi-squared, with degrees of freedom equal to the difference in the number of parameters estimated.
If the model deviance is significantly smaller than the null deviance then one can conclude
that the predictor or set of predictors significantly improved model fit. This is analogous to the
F-test used in linear regression analysis to assess the significance of prediction.
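A minimal sketch of these deviance calculations for ungrouped 0/1 data (where the saturated model's log-likelihood is zero), assuming NumPy and SciPy; names are illustrative:

```python
import numpy as np
from scipy.stats import chi2

def binary_log_likelihood(y, p):
    # Bernoulli log-likelihood of outcomes y (0/1) under predicted probabilities p.
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def deviance_test(y, p_model):
    """Likelihood-ratio (deviance) test of a fitted model against the null model.

    For ungrouped binary data the saturated model's log-likelihood is 0,
    so each deviance is simply -2 times the model's log-likelihood.
    """
    y = np.asarray(y, dtype=float)
    p_null = np.full_like(np.asarray(p_model, dtype=float), y.mean())  # intercept-only model
    d_null = -2 * binary_log_likelihood(y, p_null)
    d_model = -2 * binary_log_likelihood(y, p_model)
    lr_stat = d_null - d_model                # approximately chi-squared
    return d_null, d_model, lr_stat

# Example usage (p_model would come from a fitted logistic regression):
# d_null, d_model, lr = deviance_test(y, p_model)
# p_value = chi2.sf(lr, df=number_of_predictors)
```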
Pseudo-R2s
In linear regression the squared multiple
correlation, R2 is used to assess goodness of fit as it represents the proportion of
variance in the criterion that is explained by the predictors. In logistic regression
analysis, there is no agreed upon analogous measure, but there are several competing measures
each with limitations. Three of the most commonly used indices are examined here, beginning
with the likelihood ratio R2, R2_L:

R2_L = (D_null − D_model) / D_null
This is the most analogous index to the squared multiple correlation in linear regression.
It represents the proportional reduction in the deviance wherein the deviance is treated
as a measure of variation analogous but not identical to the variance in linear regression
analysis. One limitation of the likelihood ratio R2 is that it is not monotonically related
to the odds ratio, meaning that it does not necessarily increase as the odds ratio increases
and does not necessarily decrease as the odds ratio decreases.
The Cox and Snell R2 is an alternative index of goodness of fit related to the R2 value
from linear regression. The Cox and Snell index is problematic as its maximum value
is .75, when the variance is at its maximum. The Nagelkerke R2 provides a correction to
the Cox and Snell R2 so that the maximum value is equal to one. Nevertheless, the Cox and
Snell and likelihood ratio R2s show greater agreement with each other than either does
with the Nagelkerke R2. Of course, this might not be the case for values exceeding .75 as
the Cox and Snell index is capped at this value. The likelihood ratio R2 is often preferred
to the alternatives as it is most analogous to R2 in linear regression, is independent
of the base rate and varies between 0 and 1.
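A minimal sketch computing these three indices from model and null log-likelihoods, assuming ungrouped binary data (so the saturated log-likelihood is zero) and the standard Cox and Snell and Nagelkerke formulas:

```python
import numpy as np

def pseudo_r2(ll_model, ll_null, n_obs):
    """Common pseudo-R2 indices from log-likelihoods (minimal sketch).

    ll_model: log-likelihood of the fitted model
    ll_null:  log-likelihood of the intercept-only (null) model
    n_obs:    number of observations
    Assumes ungrouped binary data, so the likelihood-ratio R2 reduces to
    1 - ll_model / ll_null.
    """
    r2_l = 1.0 - ll_model / ll_null                                   # likelihood ratio R2
    r2_cs = 1.0 - np.exp((2.0 / n_obs) * (ll_null - ll_model))        # Cox and Snell R2
    r2_n = r2_cs / (1.0 - np.exp((2.0 / n_obs) * ll_null))            # Nagelkerke R2 (max 1)
    return {"likelihood_ratio": r2_l, "cox_snell": r2_cs, "nagelkerke": r2_n}
```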
A word of caution is in order when interpreting pseudo-R2 statistics. The reason these indices
of fit are referred to as pseudo R2 is because they do not represent the proportionate reduction
in error as the R2 in linear regression does. Linear regression assumes homoscedasticity,
that the error variance is the same for all values of the criterion. Logistic regression
will always be heteroscedastic – the error variances differ for each value of the predicted
score. For each value of the predicted score there would be a different value of the proportionate
reduction in error. Therefore, it is inappropriate to think of R2 as a proportionate reduction
in error in a universal sense in logistic regression.
Hosmer–Lemeshow test
The Hosmer–Lemeshow test uses a test statistic
that asymptotically follows a chi-squared distribution to assess whether or not the observed event
rates match expected event rates in subgroups of the model population.
Evaluating binary classification performance
If the estimated probabilities are to be used
to classify each observation of independent variable values as predicting the category
that the dependent variable is found in, the various methods below for judging the model's
suitability in out-of-sample forecasting can also be used on the data that were used for
estimation—accuracy, precision, recall, specificity and negative predictive value.
In each of these evaluative methods, an aspect of the model's effectiveness in assigning
instances to the correct categories is measured.
Coefficients
After fitting the model, it is likely that researchers will want to examine the contribution
of individual predictors. To do so, they will want to examine the regression coefficients.
In linear regression, the regression coefficients represent the change in the criterion for
each unit change in the predictor. In logistic regression, however, the regression coefficients
represent the change in the logit for each unit change in the predictor. Given that the
logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential
function of the regression coefficient – the odds ratio. In linear regression, the significance
of a regression coefficient is assessed by computing a t-test. In logistic regression,
there are several different tests designed to assess the significance of an individual
predictor, most notably the likelihood ratio test and the Wald statistic.
Likelihood ratio test
The likelihood-ratio test discussed above
to assess model fit is also the recommended procedure to assess the contribution of individual
"predictors" to a given model. In the case of a single predictor model, one simply compares
the deviance of the predictor model with that of the null model on a chi-square distribution
with a single degree of freedom. If the predictor model has a significantly smaller deviance,
then one can conclude that there is a significant association between the "predictor" and the
outcome. Although some common statistical packages do provide likelihood ratio test
statistics, without this computationally intensive test it would be more difficult to assess
the contribution of individual predictors in the multiple logistic regression case.
To assess the contribution of individual predictors one can enter the predictors hierarchically,
comparing each new model with the previous to determine the contribution of each predictor.
(There is considerable debate among statisticians regarding the appropriateness of so-called
"stepwise" procedures. They do not preserve the nominal statistical properties and can
be very misleading.)[1]
Wald statistic
Alternatively, when assessing the contribution of individual predictors in a given model,
one may examine the significance of the Wald statistic. The Wald statistic, analogous to
the t-test in linear regression, is used to assess the significance of coefficients. The
Wald statistic is the ratio of the square of the regression coefficient to the square
of the standard error of the coefficient and is asymptotically distributed as a chi-square
distribution.
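As a quick numerical illustration of this ratio (the coefficient and standard error below are made up), assuming SciPy:

```python
from scipy.stats import chi2

beta_hat, se_beta = 0.9, 0.35          # illustrative coefficient and its standard error
wald = beta_hat**2 / se_beta**2        # Wald statistic: squared coefficient over squared SE
p_value = chi2.sf(wald, df=1)          # compare to a chi-squared distribution with 1 df
print(round(wald, 2), round(p_value, 4))
```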
Although several statistical packages report the Wald statistic to assess the contribution
of individual predictors, the Wald statistic has limitations. When the regression coefficient
is large, the standard error of the regression coefficient also tends to be large, increasing
the probability of Type-II error. The Wald statistic also tends to be biased when data
are sparse.
Case-control sampling
Suppose cases are rare. Then we might wish to sample them more frequently than their
prevalence in the population. For example, suppose there is a disease that affects 1
person in 10,000 and to collect our data we need to do a complete physical. It may be
too expensive to do thousands of physicals of healthy people in order to get data on
only a few diseased individuals. Thus, we may evaluate more diseased individuals. This
is also called unbalanced data. As a rule of thumb, sampling controls at a rate of five
times the number of cases is sufficient to get enough control data.
If we form a logistic model from such data, then, if the model is correct, the parameters are
all correct except for the intercept β0. We can correct β0 if we know the true prevalence as follows:

β0* = β0 + ln( π / (1 − π) ) − ln( p / (1 − p) )

where π is the true prevalence, p is the prevalence in the sample, and β0* is the corrected intercept.
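A minimal sketch of this intercept correction in Python (the numbers are illustrative):

```python
import math

def correct_intercept(beta0_sample, true_prev, sample_prev):
    # Adjust the intercept fitted on case-control (oversampled) data so that
    # predicted probabilities reflect the true prevalence in the population.
    return (beta0_sample
            + math.log(true_prev / (1 - true_prev))
            - math.log(sample_prev / (1 - sample_prev)))

# Illustrative numbers: disease prevalence 1 in 10,000, but cases make up
# one sixth of the sample (five controls per case, as in the rule of thumb).
print(correct_intercept(beta0_sample=-1.2, true_prev=1e-4, sample_prev=1/6))
```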
Formal mathematical specification
There are various equivalent specifications
of logistic regression, which fit into different types of more general models. These different
specifications allow for different sorts of useful generalizations.
Setup
The basic setup of logistic regression is
the same as for standard linear regression. It is assumed that we have a series of N observed
data points. Each data point i consists of a set of m explanatory variables x1,i ... xm,i,
and an associated binary-valued outcome variable Yi, i.e. it can assume only the two possible
values 0 or 1. The goal of logistic regression is to explain the relationship between the
explanatory variables and the outcome, so that an outcome can be predicted for a new
set of explanatory variables. Some examples:
The observed outcomes are the presence or absence of a given disease in a set of patients,
and the explanatory variables might be characteristics of the patients thought to be pertinent.
The observed outcomes are the votes of a set of people in an election, and the explanatory
variables are the demographic characteristics of each person. In such a case, one of the
two outcomes is arbitrarily coded as 1, and the other as 0.
As in linear regression, the outcome variables Yi are assumed to depend on the explanatory
variables x1,i ... xm,i.
Explanatory variables
As in the above examples, the explanatory variables may be of any type:
real-valued, binary, categorical, etc. The main distinction is between continuous variables
and discrete variables. Discrete variables referring to more than two possible choices
are typically coded using dummy variables, that is, separate explanatory variables taking
the value 0 or 1 are created for each possible value of the discrete variable, with a 1 meaning
"variable does have the given value" and a 0 meaning "variable does not have that value".
For example, a four-way discrete variable of blood type with the possible values "A,
B, AB, O" can be converted to four separate two-way dummy variables, "is-A, is-B, is-AB,
is-O", where only one of them has the value 1 and all the rest have the value 0. This
allows for separate regression coefficients to be matched for each possible value of the
discrete variable.
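A minimal sketch of the dummy coding just described, in plain Python (the helper name is illustrative):

```python
# One-hot (dummy) coding of a four-way discrete variable, as in the blood-type
# example above; no external libraries assumed.
levels = ["A", "B", "AB", "O"]

def dummy_code(blood_type):
    # Returns the four indicator variables is-A, is-B, is-AB, is-O.
    return {f"is-{level}": int(blood_type == level) for level in levels}

print(dummy_code("AB"))   # {'is-A': 0, 'is-B': 0, 'is-AB': 1, 'is-O': 0}
```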
Outcome variables
Formally, the outcomes Yi are described as being Bernoulli-distributed data, where each
outcome is determined by an unobserved probability pi that is specific to the outcome at hand,
but related to the explanatory variables. This can be expressed in any of the following
equivalent forms:

Yi | x1,i, ..., xm,i ~ Bernoulli(pi)

E[ Yi | x1,i, ..., xm,i ] = pi

Pr(Yi = y | x1,i, ..., xm,i) = pi if y = 1, and 1 − pi if y = 0

Pr(Yi = y | x1,i, ..., xm,i) = pi^y (1 − pi)^(1−y)
The meanings of these four lines are: The first line expresses the probability distribution
of each Yi: Conditioned on the explanatory variables, it follows a Bernoulli distribution
with parameters pi, the probability of the outcome of 1 for trial i. As noted above,
each separate trial has its own probability of success, just as each trial has its own
explanatory variables. The probability of success pi is not observed, only the outcome
of an individual Bernoulli trial using that probability.
The second line expresses the fact that the expected value of each Yi is equal to the
probability of success pi, which is a general property of the Bernoulli distribution. In
other words, if we run a large number of Bernoulli trials using the same probability of success
pi, then take the average of all the 1 and 0 outcomes, then the result would be close
to pi. This is because doing an average this way simply computes the proportion of successes
seen, which we expect to converge to the underlying probability of success.
The third line writes out the probability mass function of the Bernoulli distribution,
specifying the probability of seeing each of the two possible outcomes.
The fourth line is another way of writing the probability mass function, which avoids
having to write separate cases and is more convenient for certain types of calculations.
This relies on the fact that Yi can take only the value 0 or 1. In each case, one of the
exponents will be 1, "choosing" the value under it, while the other is 0, "canceling
out" the value under it. Hence, the outcome is either pi or 1 − pi, as in the previous
line.
Linear predictor function
The basic idea of logistic regression is to use the mechanism already developed for linear
regression by modeling the probability pi using a linear predictor function, i.e. a
linear combination of the explanatory variables and a set of regression coefficients that
are specific to the model at hand but the same for all trials. The linear predictor
function for a particular data point i is written as:

f(i) = β0 + β1·x1,i + ... + βm·xm,i

where β0, ..., βm are regression coefficients indicating the relative effect of a particular explanatory
variable on the outcome. The model is usually put into a more compact
form as follows: The regression coefficients β0, β1, ..., βm
are grouped into a single vector β of size m + 1.
For each data point i, an additional explanatory pseudo-variable x0,i is added, with a fixed
value of 1, corresponding to the intercept coefficient β0.
The resulting explanatory variables x0,i, x1,i, ..., xm,i are then grouped into a single
vector Xi of size m + 1. This makes it possible to write the linear
predictor function as follows:

f(i) = β · Xi

using the notation β · Xi for a dot product between two vectors.
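A minimal sketch of this compact form, assuming NumPy (the coefficient and data values are illustrative):

```python
import numpy as np

# Coefficients beta = (beta0, beta1, ..., betam) grouped into one vector.
beta = np.array([-1.0, 0.8, 0.3])            # illustrative values, m = 2

# One data point's explanatory variables x1,i and x2,i, with the pseudo-variable
# x0,i = 1 prepended to match the intercept coefficient beta0.
x_i = np.array([1.0, 2.5, -0.7])

linear_predictor = beta @ x_i                # dot product beta . X_i
p_i = 1.0 / (1.0 + np.exp(-linear_predictor))
print(linear_predictor, p_i)
```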
As a generalized linear model
The particular model used by logistic regression,
which distinguishes it from standard linear regression and from other types of regression
analysis used for binary-valued outcomes, is the way the probability of a particular
outcome is linked to the linear predictor function:

logit( E[ Yi | x1,i, ..., xm,i ] ) = logit(pi) = ln( pi / (1 − pi) ) = β0 + β1·x1,i + ... + βm·xm,i

Written using the more compact notation described above, this is:

logit( E[ Yi | Xi ] ) = logit(pi) = ln( pi / (1 − pi) ) = β · Xi
This formulation expresses logistic regression as a type of generalized linear model, which
predicts variables with various types of probability distributions by fitting a linear predictor
function of the above form to some sort of arbitrary transformation of the expected value
of the variable. The intuition for transforming using the logit
function was explained above. It also has the practical effect of converting the probability
to a variable that ranges over (−∞, +∞) — thereby matching
prediction function on the right side of the equation.
Note that both the probabilities pi and the regression coefficients are unobserved, and
the means of determining them is not part of the model itself. They are typically determined
by some sort of optimization procedure, e.g. maximum likelihood estimation, that finds
values that best fit the observed data, usually subject to regularization conditions that
seek to exclude unlikely values, e.g. extremely large values for any of the regression coefficients.
The use of a regularization condition is equivalent to doing maximum a posteriori estimation,
an extension of maximum likelihood. Whether or not regularization is used, it is usually
not possible to find a closed-form solution; instead, an iterative numerical method must
be used, such as iteratively reweighted least squares or, more commonly these days, a quasi-Newton
method such as the L-BFGS method. The interpretation of the βj parameter estimates
is as the additive effect on the log of the odds for a unit change in the jth explanatory
variable. In the case of a dichotomous explanatory variable, for instance gender, e^β is the estimate
of the odds of having the outcome for, say, males compared with females.
An equivalent formula uses the inverse of the logit function, which is the logistic
function, i.e.:

E[ Yi | Xi ] = pi = logit⁻¹( β · Xi ) = 1 / (1 + e^(−β · Xi))

The formula can also be written as a probability distribution:

Pr(Yi = y | Xi) = pi^y (1 − pi)^(1−y)
As a latent-variable model
The above model has an equivalent formulation
as a latent-variable model. This formulation is common in the theory of discrete choice
models, and makes it easier to extend to certain more complicated models with multiple, correlated
choices, as well as to compare logistic regression to the closely related probit model.
Imagine that, for each trial i, there is a continuous latent variable Yi* that is distributed
as follows:

Yi* = β · Xi + ε

where

ε ~ Logistic(0, 1)

i.e. the latent variable can be written directly in terms of the linear predictor function
and an additive random error variable that is distributed according to a standard logistic
distribution. Then Yi can be viewed as an indicator for
whether this latent variable is positive:

Yi = 1 if Yi* > 0, and 0 otherwise.
The choice of modeling the error variable specifically with a standard logistic distribution,
rather than a general logistic distribution with the location and scale set to arbitrary
values, seems restrictive, but in fact it is not. It must be kept in mind that we can
choose the regression coefficients ourselves, and very often can use them to offset changes
in the parameters of the error variable's distribution. For example, a logistic error-variable
distribution with a non-zero location parameter μ is equivalent to a distribution with a
zero location parameter, where μ has been added to the intercept coefficient. Both situations
produce the same value for Yi* regardless of settings of explanatory variables. Similarly,
an arbitrary scale parameter s is equivalent to setting the scale parameter to 1 and then
dividing all regression coefficients by s. In the latter case, the resulting value of
Yi* will be smaller by a factor of s than in the former case, for all sets of explanatory
variables — but critically, it will always remain on the same side of 0, and hence lead
to the same Yi choice. (Note that this predicts that the irrelevancy
of the scale parameter may not carry over into more complex models where more than two
choices are available.) It turns out that this formulation is exactly
equivalent to the preceding one, phrased in terms of the generalized linear model and
without any latent variables. This can be shown as follows, using the fact that the
cumulative distribution function of the standard logistic distribution is the logistic function,
which is the inverse of the logit function, i.e.

Pr(ε < x) = logit⁻¹(x) = 1 / (1 + e^(−x))

Then:

Pr(Yi = 1 | Xi) = Pr(Yi* > 0 | Xi) = Pr(β · Xi + ε > 0) = Pr(ε > −β · Xi) = Pr(ε < β · Xi) = logit⁻¹( β · Xi ) = pi

(using the symmetry of the logistic distribution in the second-to-last step).
This formulation — which is standard in discrete choice models — makes clear the
relationship between logistic regression and the probit model, which uses an error variable
distributed according to a standard normal distribution instead of a standard logistic
distribution. Both the logistic and normal distributions are symmetric with a basic unimodal,
"bell curve" shape. The only difference is that the logistic distribution has somewhat
heavier tails, which means that it is less sensitive to outlying data.
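A small simulation check of this latent-variable equivalence, assuming NumPy (coefficients and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Drawing Y* = beta.X + eps with standard logistic errors and thresholding at 0
# reproduces Pr(Y = 1) = 1 / (1 + exp(-beta.X)).
beta = np.array([0.5, -1.0])
x = np.array([1.0, 0.8])                      # includes the intercept pseudo-variable
lin_pred = beta @ x

eps = rng.logistic(loc=0.0, scale=1.0, size=200_000)
y = (lin_pred + eps > 0).astype(int)          # indicator that the latent variable is positive

print(y.mean())                               # empirical Pr(Y = 1)
print(1.0 / (1.0 + np.exp(-lin_pred)))        # logistic(beta.X), should be close
```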
As a two-way latent-variable model
Yet another formulation uses two separate latent variables:

Yi0* = β0 · Xi + ε0
Yi1* = β1 · Xi + ε1

where

ε0 ~ EV1(0,1)
ε1 ~ EV1(0,1)

where EV1(0,1) is a standard type-1 extreme value distribution: i.e. its density is

f(x) = e^(−x) e^(−e^(−x))

Then

Yi = 1 if Yi1* > Yi0*, and 0 otherwise.
This model has a separate latent variable and a separate set of regression coefficients
for each possible outcome of the dependent variable. The reason for this separation is
that it makes it easy to extend logistic regression to multi-outcome categorical variables, as
in the multinomial logit model. In such a model, it is natural to model each possible
outcome using a different set of regression coefficients. It is also possible to motivate
each of the separate latent variables as the theoretical utility associated with making
the associated choice, and thus motivate logistic regression in terms of utility theory. This
is the approach taken by economists when formulating discrete choice models, because it both provides
a theoretically strong foundation and facilitates intuitions about the model, which in turn
makes it easy to consider various sorts of extensions.
The choice of the type-1 extreme value distribution seems fairly arbitrary, but it makes the mathematics
work out, and it may be possible to justify its use through rational choice theory.
It turns out that this model is equivalent to the previous model, although this seems
non-obvious, since there are now two sets of regression coefficients and error variables,
and the error variables have a different distribution. In fact, this model reduces directly to the
previous one with the following substitutions:

β = β1 − β0
ε = ε1 − ε0
An intuition for this comes from the fact that, since we choose based on the maximum
of two values, only their difference matters, not the exact values — and this effectively
removes one degree of freedom. Another critical fact is that the difference of two type-1
extreme-value-distributed variables is a logistic distribution, i.e. if

ε = ε1 − ε0, with ε0 ~ EV1(0,1) and ε1 ~ EV1(0,1), then ε ~ Logistic(0,1).

We can demonstrate the equivalence as follows:

Pr(Yi = 1 | Xi) = Pr(Yi1* > Yi0* | Xi) = Pr(β1 · Xi + ε1 > β0 · Xi + ε0) = Pr(ε1 − ε0 > −(β1 − β0) · Xi) = Pr(ε > −β · Xi) = Pr(ε < β · Xi) = logit⁻¹( β · Xi )
Example
As an example, consider a province-level election
where the choice is between a right-of-center party, a left-of-center party, and a secessionist
party. We would then use three latent variables, one for each choice. Then, in accordance with
utility theory, we can then interpret the latent variables as expressing the utility
that results from making each of the choices. We can also interpret the regression coefficients
as indicating the strength that the associated factor has in contributing to the utility
— or more correctly, the amount by which a unit change in an explanatory variable changes
the utility of a given choice. A voter might expect that the right-of-center party would
lower taxes, especially on rich people. This would give low-income people no benefit, i.e.
no change in utility; would cause moderate benefit for middle-income people; and would
cause significant benefits for high-income people. On the other hand, the left-of-center
party might be expected to raise taxes and offset it with increased welfare and other
assistance for the lower and middle classes. This would cause significant positive benefit
to low-income people, perhaps weak benefit to middle-income people, and significant negative
benefit to high-income people. Finally, the secessionist party would take no direct actions
on the economy, but simply secede. A low-income or middle-income voter might expect basically
no clear utility gain or loss from this, but a high-income voter might expect negative
utility, since he/she is likely to own companies, which will have a harder time doing business
in such an environment and probably lose money. These intuitions can be expressed as follows:
This clearly shows that separate sets of regression coefficients need
to exist for each choice. When phrased in terms of utility, this can be seen very easily.
Different choices have different effects on net utility; furthermore, the effects vary
in complex ways that depend on the characteristics of each individual, so there need to be separate
sets of coefficients for each characteristic, not simply a single extra per-choice characteristic.
Even though income is a continuous variable, its effect on utility is too complex for it
to be treated as a single variable. Either it needs to be directly split up into ranges,
or higher powers of income need to be added so that polynomial regression on income is
effectively done.
As a "log-linear" model
Yet another formulation combines the two-way latent variable formulation above with the
original formulation higher up without latent variables, and in the process provides a link
to one of the standard formulations of the multinomial logit.
Here, instead of writing the logit of the probabilities pi as a linear predictor, we
separate the linear predictor into two, one for each of the two outcomes:

ln Pr(Yi = 0) = β0 · Xi − ln Z
ln Pr(Yi = 1) = β1 · Xi − ln Z
Note that two separate sets of regression coefficients have been introduced, just as
in the two-way latent variable model, and the two equations appear in a form that writes
the logarithm of the associated probability as a linear predictor, with an extra term
at the end. This term, as it turns out, serves as the normalizing factor ensuring that the
result is a distribution. This can be seen by exponentiating both sides:

Pr(Yi = 0) = (1/Z) e^(β0 · Xi)
Pr(Yi = 1) = (1/Z) e^(β1 · Xi)
In this form it is clear that the purpose of Z is to ensure that the resulting distribution
over Yi is in fact a probability distribution, i.e. it sums to 1. This means that Z is simply
the sum of all un-normalized probabilities, and by dividing each probability by Z, the
probabilities become "normalized". That is:

Z = e^(β0 · Xi) + e^(β1 · Xi)

and the resulting equations are

Pr(Yi = 0) = e^(β0 · Xi) / ( e^(β0 · Xi) + e^(β1 · Xi) )
Pr(Yi = 1) = e^(β1 · Xi) / ( e^(β0 · Xi) + e^(β1 · Xi) )

Or generally:

Pr(Yi = c) = e^(βc · Xi) / Σ_h e^(βh · Xi)
This shows clearly how to generalize this formulation to more than two outcomes, as
in multinomial logit. In order to prove that this is equivalent
to the previous model, note that the above model is overspecified, in that Pr(Yi = 0) and Pr(Yi = 1) cannot
be independently specified: rather Pr(Yi = 0) + Pr(Yi = 1) = 1, so knowing one automatically determines the other. As
a result, the model is nonidentifiable, in that multiple combinations of β0 and β1
will produce the same probabilities for all possible explanatory variables. In fact, it
can be seen that adding any constant vector C to both of them will produce the same probabilities:

Pr(Yi = 1) = e^((β1 + C) · Xi) / ( e^((β0 + C) · Xi) + e^((β1 + C) · Xi) ) = e^(β1 · Xi) / ( e^(β0 · Xi) + e^(β1 · Xi) )

since the common factor e^(C · Xi) cancels from numerator and denominator.
As a result, we can simplify matters, and restore identifiability, by picking an arbitrary
value for one of the two vectors. We choose to set β0 = 0. Then,

e^(β0 · Xi) = e^(0 · Xi) = 1

and so

Pr(Yi = 1) = e^(β1 · Xi) / ( 1 + e^(β1 · Xi) ) = 1 / ( 1 + e^(−β1 · Xi) ) = pi
which shows that this formulation is indeed equivalent to the previous formulation.
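A small numerical check of this reduction, assuming NumPy (the coefficient vectors below are random illustrative values):

```python
import numpy as np

# A two-class softmax over the scores beta0.x and beta1.x equals the logistic
# of (beta1 - beta0).x, and adding a constant vector C to both coefficient
# vectors leaves the probabilities unchanged.
rng = np.random.default_rng(1)
x = rng.normal(size=3)
beta0, beta1, C = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)

def softmax_p1(b0, b1, x):
    scores = np.array([b0 @ x, b1 @ x])
    z = np.exp(scores).sum()                 # normalizing factor Z
    return np.exp(scores[1]) / z             # Pr(Y = 1)

p_softmax = softmax_p1(beta0, beta1, x)
p_shifted = softmax_p1(beta0 + C, beta1 + C, x)
p_logistic = 1.0 / (1.0 + np.exp(-(beta1 - beta0) @ x))
print(p_softmax, p_shifted, p_logistic)      # all three agree
```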
Note that most treatments of the multinomial logit model start out either by extending
the "log-linear" formulation presented here or the two-way latent variable formulation
presented above, since both clearly show the way that the model could be extended to multi-way
outcomes. In general, the presentation with latent variables is more common in econometrics
and political science, where discrete choice models and utility theory reign, while the
"log-linear" formulation here is more common in computer science, e.g. machine learning
and natural language processing.
As a single-layer perceptron
The model has an equivalent formulation

pi = 1 / (1 + e^(−(β0 + β1·x1,i + ... + βk·xk,i)))
This functional form is commonly called a single-layer perceptron or single-layer artificial
neural network. A single-layer neural network computes a continuous output instead of a
step function. The derivative of pi with respect to X = (x1, ..., xk) is computed from the
general form:

y = 1 / (1 + e^(−f(X)))

where f(X) is an analytic function in X. With this choice, the single-layer neural network
is identical to the logistic regression model. This function has a continuous derivative,
which allows it to be used in backpropagation. This function is also preferred because its
derivative is easily calculated:

dy/dX = y (1 − y) df/dX
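A quick numerical check of this derivative identity, assuming NumPy and an illustrative choice of f:

```python
import numpy as np

# Check dy/dx = y(1 - y) f'(x) for y = 1 / (1 + exp(-f(x))), with f(x) = 2x + 1.
f = lambda x: 2.0 * x + 1.0
y = lambda x: 1.0 / (1.0 + np.exp(-f(x)))

x0, h = 0.3, 1e-6
numerical = (y(x0 + h) - y(x0 - h)) / (2 * h)     # central finite difference
analytic = y(x0) * (1 - y(x0)) * 2.0              # y(1 - y) * f'(x), with f'(x) = 2
print(numerical, analytic)                         # should agree closely
```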
In terms of binomial data
A closely related model assumes that each
i is associated not with a single Bernoulli trial but with ni independent identically
distributed trials, where the observation Yi is the number of successes observed, and
hence follows a binomial distribution:

Yi ~ Binomial(ni, pi)

An example of this distribution is the fraction of seeds that germinate after ni are planted.
In terms of expected values, this model is expressed as follows:

pi = E[ Yi / ni | Xi ]

so that

logit( E[ Yi / ni | Xi ] ) = logit(pi) = ln( pi / (1 − pi) ) = β · Xi

Or equivalently:

Pr(Yi = yi | Xi) = C(ni, yi) pi^yi (1 − pi)^(ni − yi)
This model can be fit using the same sorts of methods as the above more basic model.
Bayesian logistic regression
In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients,
usually in the form of Gaussian distributions. Unfortunately, the Gaussian distribution is
not the conjugate prior of the likelihood function in logistic regression; in fact,
the likelihood function is not an exponential family and thus does not have a conjugate
prior at all. As a result, the posterior distribution is difficult to calculate, even using standard
simulation algorithms. There are various possibilities:
Don't do a proper Bayesian analysis, but simply compute a maximum a posteriori point estimate
of the parameters. This is common, for example, in "maximum entropy" classifiers in machine
learning. Use a more general approximation method such
as the Metropolis–Hastings algorithm. Draw a Markov chain Monte Carlo sample from
the exact posterior by using the independent Metropolis–Hastings algorithm with a heavy-tailed
multivariate candidate distribution found by matching the mode and curvature at the
mode of the normal approximation to the posterior and then using the Student’s t shape with
low degrees of freedom. This is shown to have excellent convergence properties.
Use a latent variable model and approximate the logistic distribution using a more tractable
distribution, e.g. a Student's t-distribution or a mixture of normal distributions.
Do probit regression instead of logistic regression. This is actually a special case of the previous
situation, using a normal distribution in place of a Student's t, mixture of normals,
etc. This will be less accurate but has the advantage that probit regression is extremely
common, and a ready-made Bayesian implementation may already be available.
Use the Laplace approximation of the posterior distribution. This approximates the posterior
with a Gaussian distribution. This is not a terribly good approximation, but it suffices
if all that is desired is an estimate of the posterior mean and variance. In such a case,
an approximation scheme such as variational Bayes can be used.
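For the first option above (a maximum a posteriori point estimate under Gaussian priors), here is a minimal sketch assuming NumPy and SciPy; this is simply L2-regularized maximum likelihood, and the names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def map_logistic(X, y, prior_var=1.0):
    """MAP point estimate for logistic regression with independent Gaussian
    priors N(0, prior_var) on the coefficients (minimal sketch)."""
    def neg_log_posterior(beta):
        z = X @ beta
        # Bernoulli log-likelihood in a numerically stable form.
        log_lik = np.sum(y * z - np.logaddexp(0.0, z))
        log_prior = -0.5 * np.sum(beta**2) / prior_var
        return -(log_lik + log_prior)

    beta0 = np.zeros(X.shape[1])
    return minimize(neg_log_posterior, beta0, method="BFGS").x
```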
Gibbs sampling with an approximating distribution
As shown above, logistic regression is equivalent
to a latent variable model with an error variable distributed according to a standard logistic
distribution. The overall distribution of the latent variable is also a logistic distribution,
with the mean equal to β · Xi (the linear predictor). This model considerably simplifies the application of techniques such
as Gibbs sampling. However, sampling the regression coefficients is still difficult, because of
the lack of conjugacy between the normal and logistic distributions. Changing the prior
distribution over the regression coefficients is of no help, because the logistic distribution
is not in the exponential family and thus has no conjugate prior.
One possibility is to use a more general Markov chain Monte Carlo technique, such as the Metropolis–Hastings
algorithm, which can sample arbitrary distributions. Another possibility, however, is to replace
the logistic distribution with a similar-shaped distribution that is easier to work with using
Gibbs sampling. In fact, the logistic and normal distributions have a similar shape,
and thus one possibility is simply to have normally distributed errors. Because the normal
distribution is conjugate to itself, sampling the regression coefficients becomes easy.
In fact, this model is exactly the model used in probit regression.
However, the normal and logistic distributions differ in that the logistic has heavier tails.
As a result, it is more robust to inaccuracies in the underlying model or to errors in the
data. Probit regression loses some of this robustness.
Another alternative is to use errors distributed as a Student's t-distribution. The Student's
t-distribution has heavy tails, and is easy to sample from because it is the compound
distribution of a normal distribution with variance distributed as an inverse gamma distribution.
In other words, if a normal distribution is used for the error variable, and another latent
variable, following an inverse gamma distribution, is added corresponding to the variance of
this error variable, the marginal distribution of the error variable will follow a Student's
t-distribution. Because of the various conjugacy relationships, all variables in this model
are easy to sample from. The Student's t-distribution that best approximates
a standard logistic distribution can be determined by matching the moments of the two distributions.
The Student's t-distribution has three parameters, and since the skewness of both distributions
is always 0, the first four moments can all be matched, using the following equations:

s² · ν / (ν − 2) = π² / 3 (matching the variances)
6 / (ν − 4) = 6 / 5 (matching the excess kurtoses)

This yields the following values:

ν = 9, s = √(7π²/27) ≈ 1.60
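These values are derived here from the standard moment formulas for a scaled Student's t (variance s²·ν/(ν−2), excess kurtosis 6/(ν−4)) and the standard logistic (variance π²/3, excess kurtosis 6/5), rather than quoted; a quick check in Python:

```python
import math

# Matching the variance and excess kurtosis of a scaled Student's t
# (scale s, degrees of freedom nu) to the standard logistic distribution:
#   variance:        s**2 * nu / (nu - 2) = pi**2 / 3
#   excess kurtosis: 6 / (nu - 4)         = 6 / 5
nu = 4 + 5                      # from 6/(nu - 4) = 6/5  =>  nu = 9
s = math.sqrt((math.pi**2 / 3) * (nu - 2) / nu)
print(nu, round(s, 4))          # 9 and roughly 1.5996
```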
The following graphs compare the standard logistic distribution with the Student's t-distribution
that matches the first four moments using the above-determined values, as well as the
normal distribution that matches the first two moments. Note how much closer the Student's
t-distribution agrees, especially in the tails. Beyond about two standard deviations from
the mean, the logistic and normal distributions diverge rapidly, but the logistic and Student's
t-distributions don't start diverging significantly until more than 5 standard deviations away.
(Another possibility, also amenable to Gibbs sampling, is to approximate the logistic distribution
using a mixture density of normal distributions.)
Extensions
There are large numbers of extensions: Multinomial logistic regression handles the
case of a multi-way categorical dependent variable. Note that the general case of having
dependent variables with more than two values is termed polytomous regression.
Ordered logistic regression handles ordinal dependent variables.
Mixed logit is an extension of multinomial logit that allows for correlations among the
choices of the dependent variable. An extension of the logistic model to sets
of interdependent variables is the conditional random field.
Model suitability
A way to measure a model's suitability is
to assess the model against a set of data that was not used to create the model. The
class of techniques is called cross-validation. This holdout model assessment method is particularly
valuable when data are collected in different settings or when models are assumed to be
generalizable. To measure the suitability of a binary regression
model, one can classify both the actual value and the predicted value of each observation
as either 0 or 1. The predicted value of an observation can be set equal to 1 if the estimated
probability that the observation equals 1 is above 1/2, and set equal to 0 if the estimated
probability is below 1/2. Here logistic regression is being used as a binary classification model.
There are four possible combined classifications:
prediction of 0 when the holdout sample has a 0
prediction of 0 when the holdout sample has a 1
prediction of 1 when the holdout sample has a 0
prediction of 1 when the holdout sample has a 1
These classifications are used to calculate accuracy, precision, recall, specificity and negative predictive value:
Accuracy = fraction of observations with correct predicted classification
Precision = fraction of predicted positives that are correct
Negative predictive value = fraction of predicted negatives that are correct
Recall = fraction of observations that are actually 1 with a correct predicted classification
Specificity = fraction of observations that are actually 0 with a correct predicted classification
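A minimal sketch computing these quantities from holdout outcomes and predicted probabilities, assuming NumPy (names are illustrative):

```python
import numpy as np

def classification_metrics(y_true, p_hat, cutoff=0.5):
    """Holdout evaluation metrics for a binary (logistic) classifier,
    following the definitions above."""
    y_pred = (np.asarray(p_hat) > cutoff).astype(int)
    y_true = np.asarray(y_true)

    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),                   # predicted positives that are correct
        "negative predictive value": tn / (tn + fn),   # predicted negatives that are correct
        "recall": tp / (tp + fn),                      # actual 1s correctly classified
        "specificity": tn / (tn + fp),                 # actual 0s correctly classified
    }
```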
See also
Logistic function
Discrete choice
Jarrow–Turnbull model
Limited dependent variable
Multinomial logit model
Ordered logit
Hosmer–Lemeshow test
Brier score
MLPACK - contains a C++ implementation of logistic regression
References
Further reading
Agresti, Alan. Categorical Data Analysis. New York: Wiley-Interscience. ISBN 0-471-36093-7.
Amemiya, T. Advanced Econometrics. Harvard University Press. ISBN 0-674-00560-0.
Balakrishnan, N. Handbook of the Logistic Distribution. Marcel Dekker, Inc. ISBN 978-0-8247-8587-1.
Greene, William H. Econometric Analysis, fifth edition. Prentice Hall. ISBN 0-13-066189-9.
Hilbe, Joseph M. Logistic Regression Models. Chapman & Hall/CRC Press. ISBN 978-1-4200-7575-5.
Howell, David C. Statistical Methods for Psychology, 7th ed. Belmont, CA: Thomson Wadsworth. ISBN 978-0-495-59786-5.
Peduzzi, P.; J. Concato; E. Kemper; T.R. Holford; A.R. Feinstein. "A simulation study of the number of events per variable in logistic regression analysis". Journal of Clinical Epidemiology 49: 1373–1379. doi:10.1016/s0895-4356(96)00236-3. PMID 8970487.
External links
Econometrics Lecture on YouTube by Mark Thoma
Logistic Regression Interpretation
Logistic Regression tutorial
Using open source software for building Logistic Regression models