Statistics review 14: Logistic regression
Jonathan Ball
Issue date 2005.
This review introduces logistic regression, which is a method for modelling the dependence of a binary response variable on one or more explanatory variables. Continuous and categorical explanatory variables are considered.
Keywords: binomial distribution, Hosmer–Lemeshow test, likelihood, likelihood ratio test, logit function, maximum likelihood estimation, median effective level, odds, odds ratio, predicted probability, Wald test
Introduction
Logistic regression provides a method for modelling a binary response variable, which takes values 1 and 0. For example, we may wish to investigate how death (1) or survival (0) of patients can be predicted by the level of one or more metabolic markers. As an illustrative example, consider a sample of 2000 patients whose levels of a metabolic marker have been measured. Table 1 shows the data grouped into categories according to metabolic marker level, and the proportion of deaths in each category is given. The proportions of deaths are estimates of the probabilities of death in each category. Figure 1 shows a plot of these proportions. It suggests that the probability of death increases with the metabolic marker level. However, it can be seen that the relationship is nonlinear and that the probability of death changes very little at the high or low extremes of marker level. This pattern is typical because proportions cannot lie outside the range from 0 to 1. The relationship can be described as following an 'S'-shaped curve.
Table 1. Relationship between level of a metabolic marker and survival
Figure 1. Proportion of deaths plotted against the metabolic marker group mid-points for the data presented in Table 1.
Logistic regression with a single quantitative explanatory variable
The logistic or logit function is used to transform an 'S'-shaped curve into an approximately straight line and to change the range of the proportion from 0–1 to -∞ to +∞.
The logit function is defined as the natural logarithm (ln) of the odds [ 1 ] of death. That is,

logit(p) = ln[p/(1 - p)]

where p is the probability of death.
Figure 2 shows the logit-transformed proportions from Fig. 1 . The points now follow an approximately straight line. The relationship between probability of death and marker level x could therefore be modelled as follows:
Figure 2. Logit(p) plotted against the metabolic marker group mid-points for the data presented in Table 1.
logit(p) = a + bx
Although this model looks similar to a simple linear regression model, the underlying distribution is binomial and the parameters a and b cannot be estimated in exactly the same way as for simple linear regression. Instead, the parameters are usually estimated using the method of maximum likelihood, which is discussed below.
Binomial distribution
When the response variable is binary (e.g. death or survival), then the probability distribution of the number of deaths in a sample of a particular size, for given values of the explanatory variables, is usually assumed to be binomial. The probability that the number of deaths in a sample of size n is exactly equal to a value r is given by nCr p^r (1 - p)^(n - r), where nCr = n!/(r!(n - r)!) is the number of ways r individuals can be chosen from n and p is the probability of an individual dying. (The probability of survival is 1 - p.)
For example, using the first row of the data in Table 1, the probability that seven deaths occurred out of 182 patients is given by 182C7 p^7 (1 - p)^175. If the probability of death is assumed to be 0.04, then the probability that seven deaths occurred is 182C7 × 0.04^7 × 0.96^175 = 0.152. This probability, calculated on the assumption of a binomial distribution with parameter p = 0.04, is called a likelihood.
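This single likelihood is easy to verify numerically. A minimal sketch using scipy (not part of the original review) computes the same binomial probability:

```python
from scipy.stats import binom

# Likelihood of observing 7 deaths among 182 patients if p = 0.04
likelihood = binom.pmf(k=7, n=182, p=0.04)
print(round(likelihood, 3))  # approximately 0.152
```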
Maximum likelihood estimation
Maximum likelihood estimation involves finding the value(s) of the parameter(s) that give rise to the maximum likelihood. For example, again we shall take the seven deaths occurring out of 182 patients and use maximum likelihood estimation to estimate the probability of death, p. Figure 3 shows the likelihood calculated for a range of values of p. From the graph it can be seen that the value of p giving the maximum likelihood is close to 0.04. This value is the maximum likelihood estimate (MLE) of p. Mathematically, it can be shown that the MLE in this case is 7/182.
Figure 3. Likelihood for a range of values of p. MLE, maximum likelihood estimate.
In more complicated situations, iterative techniques are required to find the maximum likelihood and the associated parameter values, and a computer package is required.
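For this single-parameter example the maximum can also be located by brute force over a grid of candidate values of p. The sketch below is purely illustrative (it is not the iterative algorithm a statistical package would use) and confirms that the maximum is at 7/182:

```python
import numpy as np
from scipy.stats import binom

# Evaluate the likelihood of 7 deaths in 182 patients over a grid of p values
p_grid = np.linspace(0.001, 0.2, 2000)
likelihood = binom.pmf(7, 182, p_grid)

p_mle = p_grid[np.argmax(likelihood)]
print(f"grid MLE of p = {p_mle:.4f}")   # close to 7/182
print(f"7/182         = {7/182:.4f}")   # 0.0385
```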
The model logit(p) = a + bx is equivalent to the following:

p = e^(a + bx)/(1 + e^(a + bx))
When the explanatory variable x increases by one unit, from x to x + 1, the odds of death change from e^a e^(bx) to e^a e^(b(x + 1)) = e^a e^(bx) e^b. The odds ratio (OR) is therefore e^a e^(bx) e^b/(e^a e^(bx)) = e^b. The odds ratio e^b has a simpler interpretation in the case of a categorical explanatory variable with two categories; in this case it is just the odds ratio for one category compared with the other.
Estimates of the parameters a and b are usually obtained using a statistical package, and the output for the data summarized in Table 1 is given in Table 2. From the output, b = 1.690 and the odds ratio OR = e^b = 5.4. This indicates that, for example, the odds of death for a patient with a marker level of 3.0 are 5.4 times those of a patient with a marker level of 2.0.
Table 2. Output from a statistical package for logistic regression on the example data
CI, confidence interval; df, degrees of freedom; OR, odds ratio; SE, standard error.
Predicted probabilities
The model can be used to calculate the predicted probability of death (p) for a given value of the metabolic marker. For example, patients with metabolic marker levels of 2.0 and 3.0 have the following respective predicted probabilities of death:

p = e^(a + 2.0b)/(1 + e^(a + 2.0b)) = 0.300 and p = e^(a + 3.0b)/(1 + e^(a + 3.0b)) = 0.700

where a and b take their estimated values from Table 2.
The corresponding odds of death for these patients are 0.300/(1 - 0.300) = 0.428 and 0.700/(1 - 0.700) = 2.320, giving an odds ratio of 2.320/0.428 = 5.421, as above.
The metabolic marker level at which the predicted probability equals 0.5 – that is, at which the two possible outcomes are equally likely – is called the median effective level (EL50). Solving the equation

logit(0.5) = a + bx = 0

gives x = EL50 = -a/b
For the example data, EL 50 = 4.229/1.690 = 2.50, indicating that at this marker level death or survival are equally likely.
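These calculations can be reproduced from the fitted coefficients. The sketch below assumes a = -4.229 and b = 1.690 (the values implied by the EL50 calculation and Table 2); it is an illustration in Python rather than output from the original analysis:

```python
import numpy as np

a, b = -4.229, 1.690          # intercept and slope assumed from the fitted model

def predicted_p(x):
    """Predicted probability of death at marker level x."""
    return np.exp(a + b * x) / (1 + np.exp(a + b * x))

p2, p3 = predicted_p(2.0), predicted_p(3.0)
odds2, odds3 = p2 / (1 - p2), p3 / (1 - p3)

print(f"p at x=2: {p2:.3f}, p at x=3: {p3:.3f}")   # about 0.300 and 0.700
print(f"odds ratio: {odds3 / odds2:.2f}")          # about exp(b) = 5.4
print(f"EL50 = -a/b = {-a / b:.2f}")               # about 2.50
```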
Assessment of the fitted model
After estimating the coefficients, there are several steps involved in assessing the appropriateness, adequacy and usefulness of the model. First, the importance of each of the explanatory variables is assessed by carrying out statistical tests of the significance of the coefficients. The overall goodness of fit of the model is then tested. Additionally, the ability of the model to discriminate between the two groups defined by the response variable is evaluated. Finally, if possible, the model is validated by checking the goodness of fit and discrimination on a different set of data from that which was used to develop the model.
Tests and confidence intervals for the parameters
The Wald statistic
Wald χ² statistics are used to test the significance of individual coefficients in the model and are calculated as follows:

Wald χ² = (coefficient/SE of coefficient)²
Each Wald statistic is compared with a χ² distribution with 1 degree of freedom. Wald statistics are easy to calculate but their reliability is questionable, particularly for small samples. For data that produce large estimates of the coefficient, the standard error is often inflated, resulting in a lower Wald statistic, and therefore the explanatory variable may be incorrectly assumed to be unimportant in the model. Likelihood ratio tests (see below) are generally considered to be superior.
The Wald tests for the example data are given in Table 2 . The test for the coefficient of the metabolic marker indicates that the metabolic marker contributes significantly in predicting death.
The constant has no simple practical interpretation but is generally retained in the model irrespective of its significance.
Likelihood ratio test
The likelihood ratio test for a particular parameter compares the likelihood of obtaining the data when the parameter is zero (L 0 ) with the likelihood (L 1 ) of obtaining the data evaluated at the MLE of the parameter. The test statistic is calculated as follows:
-2 × ln(likelihood ratio) = -2 × ln(L0/L1) = -2 × (ln L0 - ln L1)
It is compared with a χ² distribution with 1 degree of freedom. Table 3 shows the likelihood ratio test for the example data obtained from a statistical package and again indicates that the metabolic marker contributes significantly in predicting death.
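Both tests amount to comparing a statistic with a χ² distribution on 1 degree of freedom. The sketch below uses hypothetical values for the coefficient, its standard error and the two log-likelihoods (the actual values are in Tables 2 and 3, which are not reproduced here):

```python
from scipy.stats import chi2

# Hypothetical values for illustration only
coef, se = 1.690, 0.25          # estimated coefficient and its standard error
lnL0, lnL1 = -320.0, -250.0     # log-likelihoods without and with the marker

wald = (coef / se) ** 2
lr = -2 * (lnL0 - lnL1)

print(f"Wald chi-square = {wald:.2f}, P = {chi2.sf(wald, df=1):.4f}")
print(f"LR   chi-square = {lr:.2f}, P = {chi2.sf(lr, df=1):.4f}")
```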
Table 3. Likelihood ratio test for inclusion of the variable marker in the model
Goodness of fit of the model
The goodness of fit or calibration of a model measures how well the model describes the response variable. Assessing goodness of fit involves investigating how close values predicted by the model are to the observed values.
When there is only one explanatory variable, as for the example data, it is possible to examine the goodness of fit of the model by grouping the explanatory variable into categories and comparing the observed and expected counts in the categories. For example, for each of the 182 patients with metabolic marker level less than one, the predicted probability of death was calculated using the formula

p = e^(a + bx)/(1 + e^(a + bx))

with a and b set to their estimated values, where x is the metabolic marker level for an individual patient. This gives 182 predicted probabilities, from which the arithmetic mean was calculated, giving a value of 0.04. This was repeated for all metabolic marker level categories. Table 4 shows the predicted probabilities of death in each category and also the expected number of deaths, calculated as the predicted probability multiplied by the number of patients in the category. The observed and the expected numbers of deaths can be compared using a χ² goodness of fit test, providing the expected number in any category is not less than 5. The null hypothesis for the test is that the numbers of deaths follow the logistic regression model. The χ² test statistic is given by

χ² = Σ (observed number - expected number)²/expected number

where the sum is taken over the marker level categories.
Table 4. Relationship between level of a metabolic marker and predicted probability of death
The test statistic is compared with a χ² distribution where the degrees of freedom are equal to the number of categories minus the number of parameters in the logistic regression model. For the example data the χ² statistic is 2.68 with 9 - 2 = 7 degrees of freedom, giving P = 0.91, suggesting that the numbers of deaths are not significantly different from those predicted by the model.
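The calculation is easy to reproduce once the observed and expected counts per category are available. The counts below are placeholders rather than the values from Table 4:

```python
import numpy as np
from scipy.stats import chi2

# Placeholder observed and expected numbers of deaths in nine marker categories
observed = np.array([7, 12, 30, 55, 80, 70, 45, 30, 10])
expected = np.array([7.3, 13.1, 28.0, 52.5, 81.2, 71.4, 44.0, 28.9, 10.5])

chi_sq = np.sum((observed - expected) ** 2 / expected)
df = len(observed) - 2                      # categories minus fitted parameters
print(f"chi-square = {chi_sq:.2f}, df = {df}, P = {chi2.sf(chi_sq, df):.2f}")
```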
The Hosmer–Lemeshow test
The Hosmer–Lemeshow test is a commonly used test for assessing the goodness of fit of a model and allows for any number of explanatory variables, which may be continuous or categorical. The test is similar to a χ² goodness of fit test and has the advantage of partitioning the observations into groups of approximately equal size, and therefore there are less likely to be groups with very low observed and expected frequencies. The observations are grouped into deciles based on the predicted probabilities. The test statistic is calculated as above using the observed and expected counts for both the deaths and survivals, and has an approximate χ² distribution with 8 (= 10 - 2) degrees of freedom. Calibration results for the model from the example data are shown in Table 5. The Hosmer–Lemeshow test (P = 0.576) indicates that the numbers of deaths are not significantly different from those predicted by the model and that the overall model fit is good.
Table 5. Contingency table for Hosmer–Lemeshow test
χ² test statistic = 6.642 (goodness of fit based on deciles of risk); degrees of freedom = 8; P = 0.576.
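For models with several or continuous predictors the same grouping idea can be coded directly. A minimal Hosmer–Lemeshow sketch (assuming numpy/pandas arrays y of 0/1 outcomes and p_hat of predicted probabilities) might look like:

```python
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow goodness-of-fit test based on deciles of predicted risk."""
    data = pd.DataFrame({"y": y, "p": p_hat})
    data["decile"] = pd.qcut(data["p"], q=groups, duplicates="drop")

    stat = 0.0
    for _, g in data.groupby("decile", observed=True):
        obs_events, exp_events = g["y"].sum(), g["p"].sum()
        obs_non, exp_non = (1 - g["y"]).sum(), (1 - g["p"]).sum()
        stat += (obs_events - exp_events) ** 2 / exp_events
        stat += (obs_non - exp_non) ** 2 / exp_non

    dof = data["decile"].nunique() - 2
    return stat, dof, chi2.sf(stat, dof)
```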
Further checks can be carried out on the fit for individual observations by inspection of various types of residuals (differences between observed and fitted values). These can identify whether any observations are outliers or have a strong influence on the fitted model. For further details see, for example, Hosmer and Lemeshow [ 2 ].
R² for logistic regression
Most statistical packages provide further statistics that may be used to measure the usefulness of the model and that are similar to the coefficient of determination (R²) in linear regression [ 3 ]. The Cox & Snell and the Nagelkerke R² are two such statistics. The values for the example data are 0.44 and 0.59, respectively. The maximum value that the Cox & Snell R² attains is less than 1. The Nagelkerke R² is an adjusted version of the Cox & Snell R² and covers the full range from 0 to 1, and therefore it is often preferred. The R² statistics do not measure the goodness of fit of the model but indicate how useful the explanatory variables are in predicting the response variable and can be referred to as measures of effect size. The value of 0.59 indicates that the model is useful in predicting death.
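Both statistics can be computed from the log-likelihoods of the null (intercept-only) and fitted models. The log-likelihood values in the sketch below are hypothetical, since the article does not report them:

```python
import numpy as np

def pseudo_r2(lnL0, lnL1, n):
    """Cox & Snell and Nagelkerke R-squared from null and fitted log-likelihoods."""
    cox_snell = 1 - np.exp(2 * (lnL0 - lnL1) / n)
    max_cox_snell = 1 - np.exp(2 * lnL0 / n)   # upper bound of the Cox & Snell R2
    nagelkerke = cox_snell / max_cox_snell
    return cox_snell, nagelkerke

# Hypothetical log-likelihoods for a sample of 2000 patients
print(pseudo_r2(lnL0=-1386.0, lnL1=-810.0, n=2000))
```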
Discrimination
The discrimination of a model – that is, how well the model distinguishes patients who survive from those who die – can be assessed using the area under the receiver operating characteristic curve (AUROC) [ 4 ]. The value of the AUROC is the probability that a patient who died had a higher predicted probability than did a patient who survived. Using a statistical package to calculate the AUROC for the example data gave a value of 0.90 (95% C.I. 0.89 to 0.91), indicating that the model discriminates well.
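Given the observed outcomes and the model's predicted probabilities, the AUROC can be obtained in one line with scikit-learn. The tiny arrays below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Illustrative data: observed outcomes (1 = died) and predicted probabilities
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])
p_hat = np.array([0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.80, 0.90])

print(f"AUROC = {roc_auc_score(y, p_hat):.2f}")
```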
When the goodness of fit and discrimination of a model are tested using the data on which the model was developed, they are likely to be over-estimated. If possible, the validity of the model should be assessed by carrying out tests of goodness of fit and discrimination on a different data set from the original one.
Logistic regression with more than one explanatory variable
We may wish to investigate how death or survival of patients can be predicted by more than one explanatory variable. As an example, we shall use data obtained from patients attending an accident and emergency unit. Serum metabolite levels were investigated as potentially useful markers in the early identification of those patients at risk for death. Two of the metabolic markers recorded were lactate and urea. Patients were also divided into two age groups: <70 years and ≥70 years.
Like ordinary regression, logistic regression can be extended to incorporate more than one explanatory variable, which may be either quantitative or qualitative. The logistic regression model can then be written as follows:
logit(p) = a + b 1 x 1 + b 2 x 2 + ... + b i x i
where p is the probability of death and x 1 , x 2 ... x i are the explanatory variables.
The method of including variables in the model can be carried out in a stepwise manner going forward or backward, testing for the significance of inclusion or elimination of the variable at each stage. The tests are based on the change in likelihood resulting from including or excluding the variable [ 2 ]. Backward stepwise elimination was used in the logistic regression of death/survival on lactate, urea and age group. The first model fitted included all three variables and the tests for the removal of the variables were all significant as shown in Table 6 .
Table 6. Tests for the removal of the variables for the logistic regression on the accident and emergency data
Therefore all the variables were retained. For these data, forward stepwise inclusion of the variables resulted in the same model, though this may not always be the case because of correlations between the explanatory variables. Several models may produce equally good statistical fits for a set of data and it is therefore important when choosing a model to take account of biological or clinical considerations and not depend solely on statistical results.
The output from a statistical package is given in Table 7 . The Wald tests also show that all three explanatory variables contribute significantly to the model. This is also seen in the confidence intervals for the odds ratios, none of which include 1 [ 5 ].
Table 7. Coefficients and Wald tests for logistic regression on the accident and emergency data
From Table 7 the fitted model is:
logit(p) = -5.716 + (0.270 × lactate) + (0.053 × urea) + (1.425 × age group)
Because there is more than one explanatory variable in the model, the interpretation of the odds ratio for one variable depends on the values of other variables being fixed. The interpretation of the odds ratio for age group is relatively simple because there are only two age groups; the odds ratio of 4.16 indicates that, for given levels of lactate and urea, the odds of death for patients in the ≥70 years group is 4.16 times that in the <70 years group. The odds ratio for the quantitative variable lactate is 1.31. This indicates that, for a given age group and level of urea, for an increase of 1 mmol/l in lactate the odds of death are multiplied by 1.31. Similarly, for a given age group and level of lactate, for an increase of 1 mmol/l in urea the odds of death are multiplied by 1.05.
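A short sketch reproduces these interpretations from the fitted equation; the patient values used at the end are made up for illustration:

```python
import numpy as np

def predicted_p(lactate, urea, age_group):
    """Predicted probability of death (age_group: 0 = <70 years, 1 = >=70 years)."""
    z = -5.716 + 0.270 * lactate + 0.053 * urea + 1.425 * age_group
    return 1 / (1 + np.exp(-z))

# Odds ratios implied by the coefficients
print(f"OR lactate   = {np.exp(0.270):.2f}")   # about 1.31 per 1 mmol/l
print(f"OR urea      = {np.exp(0.053):.2f}")   # about 1.05 per 1 mmol/l
print(f"OR age group = {np.exp(1.425):.2f}")   # about 4.16 for >=70 vs <70

# Hypothetical patient: aged >=70, lactate 3.0 mmol/l, urea 8.0 mmol/l
print(f"predicted probability of death = {predicted_p(3.0, 8.0, 1):.3f}")
```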
The Hosmer–Lemeshow test results (χ² = 7.325, 8 degrees of freedom, P = 0.502) indicate that the goodness of fit is satisfactory. However, the Nagelkerke R² value was 0.17, suggesting that the model is not very useful in predicting death. Although the contribution of the three explanatory variables in the prediction of death is statistically significant, the effect size is small.
The AUROC for these data gave a value of 0.76 (95% C.I. 0.69 to 0.82), indicating that the discrimination of the model is only fair.
Assumptions and limitations
The logistic transformation of the binomial probabilities is not the only transformation available, but it is the easiest to interpret, and other transformations generally give similar results.
In logistic regression no assumptions are made about the distributions of the explanatory variables. However, the explanatory variables should not be highly correlated with one another because this could cause problems with estimation.
Large sample sizes are required for logistic regression to provide sufficient numbers in both categories of the response variable. The more explanatory variables, the larger the sample size required. With small sample sizes, the Hosmer–Lemeshow test has low power and is unlikely to detect subtle deviations from the logistic model. Hosmer and Lemeshow recommend sample sizes greater than 400.
The choice of model should always depend on biological or clinical considerations in addition to statistical results.
Logistic regression provides a useful means for modelling the dependence of a binary response variable on one or more explanatory variables, where the latter can be either categorical or continuous. The fit of the resulting model can be assessed using a number of methods.
Abbreviations
AUROC = area under the receiver operating characteristic curve; C.I. = confidence interval; ln = natural logarithm; logit = natural logarithm of the odds; MLE = maximum likelihood estimate; OR = odds ratio; ROC = receiver operating characteristic curve.
Competing interests
The author(s) declare that they have no competing interests.
References

1. Kirkwood BR, Sterne JAC. Essential Medical Statistics. 2nd ed. Oxford, UK: Blackwell Science Ltd; 2003.
2. Hosmer DW, Lemeshow S. Applied Logistic Regression. 2nd ed. New York, USA: John Wiley and Sons; 2000.
3. Bewick V, Cheek L, Ball J. Statistics review 7: Correlation and regression. Crit Care. 2003;7:451–459. doi: 10.1186/cc2401.
4. Bewick V, Cheek L, Ball J. Statistics review 13: Receiver operating characteristic (ROC) curves. Crit Care. 2004;8:508–512. doi: 10.1186/cc3000.
5. Bewick V, Cheek L, Ball J. Statistics review 11: Assessing risk. Crit Care. 2004;8:287–291. doi: 10.1186/cc2908.
12.1 - Logistic Regression
Logistic regression models a relationship between predictor variables and a categorical response variable. For example, we could use logistic regression to model the relationship between various measurements of a manufactured specimen (such as dimensions and chemical composition) to predict if a crack greater than 10 mils will occur (a binary variable: either yes or no). Logistic regression helps us estimate a probability of falling into a certain level of the categorical response given a set of predictors. We can choose from three types of logistic regression, depending on the nature of the categorical response variable:
Binary Logistic Regression :
Used when the response is binary (i.e., it has two possible outcomes). The cracking example given above would utilize binary logistic regression. Other examples of binary responses could include passing or failing a test, responding yes or no on a survey, and having high or low blood pressure.
Nominal Logistic Regression :
Used when there are three or more categories with no natural ordering to the levels. Examples of nominal responses could include departments at a business (e.g., marketing, sales, HR), type of search engine used (e.g., Google, Yahoo!, MSN), and color (black, red, blue, orange).
Ordinal Logistic Regression :
Used when there are three or more categories with a natural ordering to the levels, but the ranking of the levels do not necessarily mean the intervals between them are equal. Examples of ordinal responses could be how students rate the effectiveness of a college course (e.g., good, medium, poor), levels of flavors for hot wings, and medical condition (e.g., good, stable, serious, critical).
Particular issues with modelling a categorical response variable include nonnormal error terms, nonconstant error variance, and constraints on the response function (i.e., the response is bounded between 0 and 1). We will investigate ways of dealing with these in the binary logistic regression setting here. Nominal and ordinal logistic regression are not considered in this course.
The multiple binary logistic regression model is the following:
\[\begin{align}\label{logmod} \pi(\textbf{X})&=\frac{\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k})}{1+\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k})}\notag \\ & =\frac{\exp(\textbf{X}\beta)}{1+\exp(\textbf{X}\beta)}\\ & =\frac{1}{1+\exp(-\textbf{X}\beta)}, \end{align}\]
where here \(\pi\) denotes a probability and not the irrational number 3.14....
- \(\pi\) is the probability that an observation is in a specified category of the binary Y variable, generally called the "success probability."
- Notice that the model describes the probability of an event happening as a function of X variables. For instance, it might provide estimates of the probability that an older person has heart disease.
- The numerator \(\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k})\) must be positive, because it is a power of a positive value ( e ).
- The denominator of the model is (1 + numerator), so the answer will always be less than 1.
- With one X variable, the theoretical model for \(\pi\) has an elongated "S" shape (or sigmoidal shape) with asymptotes at 0 and 1, although in sample estimates we may not see this "S" shape if the range of the X variable is limited.
For a sample of size n , the likelihood for a binary logistic regression is given by:
\[\begin{align*} L(\beta;\textbf{y},\textbf{X})&=\prod_{i=1}^{n}\pi_{i}^{y_{i}}(1-\pi_{i})^{1-y_{i}}\\ & =\prod_{i=1}^{n}\biggl(\frac{\exp(\textbf{X}_{i}\beta)}{1+\exp(\textbf{X}_{i}\beta)}\biggr)^{y_{i}}\biggl(\frac{1}{1+\exp(\textbf{X}_{i}\beta)}\biggr)^{1-y_{i}}. \end{align*}\]
This yields the log likelihood:
\[\begin{align*} \ell(\beta)&=\sum_{i=1}^{n}[y_{i}\log(\pi_{i})+(1-y_{i})\log(1-\pi_{i})]\\ & =\sum_{i=1}^{n}[y_{i}\textbf{X}_{i}\beta-\log(1+\exp(\textbf{X}_{i}\beta))]. \end{align*}\]
Maximizing the likelihood (or log likelihood) has no closed-form solution, so a technique like iteratively reweighted least squares is used to find an estimate of the regression coefficients, $\hat{\beta}$.
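A bare-bones Newton–Raphson (equivalently, iteratively reweighted least squares) sketch in Python/numpy shows what the software is doing behind the scenes; this is a teaching illustration, not the exact algorithm Minitab uses:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Estimate logistic regression coefficients by Newton-Raphson.

    X is an n x p design matrix that already contains a column of ones.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        pi = 1 / (1 + np.exp(-X @ beta))          # current fitted probabilities
        W = np.diag(pi * (1 - pi))                # weights
        # Newton step: beta <- beta + (X'WX)^{-1} X'(y - pi)
        beta += np.linalg.solve(X.T @ W @ X, X.T @ (y - pi))
    return beta

# Tiny synthetic check: estimates should land near (-0.5, 1.2)
rng = np.random.default_rng(0)
x = rng.normal(size=100)
p_true = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))
y = rng.binomial(1, p_true)
X = np.column_stack([np.ones_like(x), x])
print(fit_logistic(X, y))
```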
To illustrate, consider data published on n = 27 leukemia patients. The data ( leukemia_remission.txt ) has a response variable of whether leukemia remission occurred (REMISS), which is given by a 1.
The predictor variables are cellularity of the marrow clot section (CELL), smear differential percentage of blasts (SMEAR), percentage of absolute marrow leukemia cell infiltrate (INFIL), percentage labeling index of the bone marrow leukemia cells (LI), absolute number of blasts in the peripheral blood (BLAST), and the highest temperature prior to start of treatment (TEMP).
The following output shows the estimated logistic regression equation and associated significance tests:
- Select Stat > Regression > Binary Logistic Regression > Fit Binary Logistic Model.
- Select "REMISS" for the Response (the response event for remission is 1 for this data).
- Select all the predictors as Continuous predictors.
- Click Options and choose Deviance or Pearson residuals for diagnostic plots.
- Click Graphs and select "Residuals versus order."
- Click Results and change "Display of results" to "Expanded tables."
- Click Storage and select "Coefficients."
Coefficients

Term       Coef    SE Coef   95% CI             Z-Value  P-Value  VIF
Constant   64.3    75.0      (-82.7, 211.2)     0.86     0.391
CELL       30.8    52.1      (-71.4, 133.0)     0.59     0.554    62.46
SMEAR      24.7    61.5      (-95.9, 145.3)     0.40     0.688    434.42
INFIL      -25.0   65.3      (-152.9, 103.0)    -0.38    0.702    471.10
LI         4.36    2.66      (-0.85, 9.57)      1.64     0.101    4.43
BLAST      -0.01   2.27      (-4.45, 4.43)      -0.01    0.996    4.18
TEMP       -100.2  77.8      (-252.6, 52.2)     -1.29    0.198    3.01
The Wald test is the test of significance for individual regression coefficients in logistic regression (recall that we use t -tests in linear regression). For maximum likelihood estimates, the ratio
\[\begin{equation*} Z=\frac{\hat{\beta}_{i}}{\textrm{s.e.}(\hat{\beta}_{i})} \end{equation*}\]
can be used to test $H_{0}: \beta_{i}=0$. The standard normal curve is used to determine the $p$-value of the test. Furthermore, confidence intervals can be constructed as
\[\begin{equation*} \hat{\beta}_{i}\pm z_{1-\alpha/2}\textrm{s.e.}(\hat{\beta}_{i}). \end{equation*}\]
Estimates of the regression coefficients, $\hat{\beta}$, are given in the Coefficients table in the column labeled "Coef." This table also gives coefficient p -values based on Wald tests. The index of the bone marrow leukemia cells (LI) has the smallest p -value and so appears to be closest to a significant predictor of remission occurring. After looking at various subsets of the data, we find that a good model is one which only includes the labeling index as a predictor:
Coefficients

Term      Coef   SE Coef  95% CI          Z-Value  P-Value  VIF
Constant  -3.78  1.38     (-6.48, -1.08)  -2.74    0.006
LI        2.90   1.19     (0.57, 5.22)    2.44     0.015    1.00
Regression Equation

P(1) = exp(Y')/(1 + exp(Y'))
Y' = -3.78 + 2.90 LI
Since we only have a single predictor in this model we can create a Binary Fitted Line Plot to visualize the sigmoidal shape of the fitted logistic regression curve:
Odds, Log Odds, and Odds Ratio
There are algebraically equivalent ways to write the logistic regression model:
The first is
\[\begin{equation}\label{logmod1} \frac{\pi}{1-\pi}=\exp(\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k}), \end{equation}\]
which is an equation that describes the odds of being in the current category of interest. By definition, the odds for an event are π/(1 - π), where π is the probability of the event. For example, if you are at the racetrack and there is an 80% chance that a certain horse will win the race, then its odds are 0.80/(1 - 0.80) = 4, or 4:1.
The second is
\[\begin{equation}\label{logmod2} \log\biggl(\frac{\pi}{1-\pi}\biggr)=\beta_{0}+\beta_{1}X_{1}+\ldots+\beta_{k}X_{k}, \end{equation}\]
which states that the (natural) logarithm of the odds is a linear function of the X variables (and is often called the log odds ). This is also referred to as the logit transformation of the probability of success, \(\pi\).
The odds ratio (which we will write as $\theta$) between the odds for two sets of predictors (say $\textbf{X}_{(1)}$ and $\textbf{X}_{(2)}$) is given by
\[\begin{equation*} \theta=\frac{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(1)}}}{(\pi/(1-\pi))|_{\textbf{X}=\textbf{X}_{(2)}}}. \end{equation*}\]
For binary logistic regression, the odds of success are:
\[\begin{equation*} \frac{\pi}{1-\pi}=\exp(\textbf{X}\beta). \end{equation*}\]
By plugging this into the formula for $\theta$ above and setting $\textbf{X}_{(1)}$ equal to $\textbf{X}_{(2)}$ except in one position (i.e., only one predictor differs by one unit), we can determine the relationship between that predictor and the response. The odds ratio can be any nonnegative number. An odds ratio of 1 serves as the baseline for comparison and indicates there is no association between the response and predictor. If the odds ratio is greater than 1, then the odds of success are higher for higher levels of a continuous predictor (or for the indicated level of a factor). In particular, the odds increase multiplicatively by $\exp(\beta_{j})$ for every one-unit increase in $\textbf{X}_{j}$. If the odds ratio is less than 1, then the odds of success are less for higher levels of a continuous predictor (or for the indicated level of a factor). Values farther from 1 represent stronger degrees of association.
For example, when there is just a single predictor, \(X\), the odds of success are:
\[\begin{equation*} \frac{\pi}{1-\pi}=\exp(\beta_0+\beta_1X). \end{equation*}\]
If we increase \(X\) by one unit, the odds ratio is
\[\begin{equation*} \theta=\frac{\exp(\beta_0+\beta_1(X+1))}{\exp(\beta_0+\beta_1X)}=\exp(\beta_1). \end{equation*}\]
To illustrate, the relevant output from the leukemia example is:
Odds Ratios for Continuous Predictors

     Odds Ratio  95% CI
LI   18.1245     (1.7703, 185.5617)
The regression parameter estimate for LI is $2.89726$, so the odds ratio for LI is calculated as $\exp(2.89726)=18.1245$. The 95% confidence interval is calculated as $\exp(2.89726\pm z_{0.975}*1.19)$, where $z_{0.975}=1.960$ is the $97.5^{\textrm{th}}$ percentile from the standard normal distribution. The interpretation of the odds ratio is that for every increase of 1 unit in LI, the estimated odds of leukemia remission are multiplied by 18.1245. However, since LI appears to fall between 0 and 2, it may make more sense to say that for every 0.1 unit increase in LI, the estimated odds of remission are multiplied by $\exp(2.89726\times 0.1)=1.336$. Then
- At LI=0.9, the estimated odds of leukemia remission is $\exp\{-3.77714+2.89726*0.9\}=0.310$.
- At LI=0.8, the estimated odds of leukemia remission is $\exp\{-3.77714+2.89726*0.8\}=0.232$.
- The resulting odds ratio is $\frac{0.310}{0.232}=1.336$, which is the ratio of the odds of remission when LI=0.9 compared to the odds when LI=0.8.
Notice that $1.336\times 0.232=0.310$, which demonstrates the multiplicative effect by $\exp(0.1\hat{\beta_{1}})$ on the odds.
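These calculations are simple to reproduce; the sketch below uses the reported estimates $\hat{\beta}_0=-3.77714$ and $\hat{\beta}_1=2.89726$ and is for illustration only:

```python
import numpy as np

b0, b1 = -3.77714, 2.89726

def odds(li):
    """Estimated odds of remission at a given LI value."""
    return np.exp(b0 + b1 * li)

print(f"OR per 1-unit increase in LI:   {np.exp(b1):.4f}")       # 18.1245
print(f"OR per 0.1-unit increase in LI: {np.exp(0.1 * b1):.3f}") # 1.336
print(f"odds at LI=0.9: {odds(0.9):.3f}")                        # 0.310
print(f"odds at LI=0.8: {odds(0.8):.3f}")                        # 0.232
print(f"ratio:          {odds(0.9) / odds(0.8):.3f}")            # 1.336
```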
Likelihood Ratio (or Deviance) Test
The likelihood ratio test is used to test the null hypothesis that any subset of the $\beta$'s is equal to 0. The number of $\beta$'s in the full model is k + 1, while the number of $\beta$'s in the reduced model is r + 1. (Remember the reduced model is the model that results when the $\beta$'s in the null hypothesis are set to 0.) Thus, the number of $\beta$'s being tested in the null hypothesis is \((k+1)-(r+1)=k-r\). Then the likelihood ratio test statistic is given by:
\[\begin{equation*} \Lambda^{*}=-2(\ell(\hat{\beta}^{(0)})-\ell(\hat{\beta})), \end{equation*}\]
where $\ell(\hat{\beta})$ is the log likelihood of the fitted (full) model and $\ell(\hat{\beta}^{(0)})$ is the log likelihood of the (reduced) model specified by the null hypothesis evaluated at the maximum likelihood estimate of that reduced model. This test statistic has a $\chi^{2}$ distribution with \(k-r\) degrees of freedom. Statistical software often presents results for this test in terms of "deviance," which is defined as \(-2\) times log-likelihood. The notation used for the test statistic is typically $G^2$ = deviance (reduced) – deviance (full).
This test procedure is analogous to the general linear F test procedure for multiple linear regression. However, note that when testing a single coefficient, the Wald test and likelihood ratio test will not in general give identical results.
To illustrate, the relevant software output from the leukemia example is:
Deviance Table

Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   1    8.299     8.299        8.30    0.004
LI           1    8.299     8.299        8.30    0.004
Error       25   26.073     1.043
Total       26   34.372
Since there is only a single predictor for this example, this table simply provides information on the likelihood ratio test for LI ( p -value of 0.004), which is similar but not identical to the earlier Wald test result ( p -value of 0.015). The Deviance Table includes the following (a short verification sketch follows the list):
- The null (reduced) model in this case has no predictors, so the fitted probabilities are simply the sample proportion of successes, \(9/27=0.333333\). The log-likelihood for the null model is \(\ell(\hat{\beta}^{(0)})=-17.1859\), so the deviance for the null model is \(-2\times-17.1859=34.372\), which is shown in the "Total" row in the Deviance Table.
- The log-likelihood for the fitted (full) model is \(\ell(\hat{\beta})=-13.0365\), so the deviance for the fitted model is \(-2\times-13.0365=26.073\), which is shown in the "Error" row in the Deviance Table.
- The likelihood ratio test statistic is therefore \(\Lambda^{*}=-2(-17.1859-(-13.0365))=8.299\), which is the same as \(G^2=34.372-26.073=8.299\).
- The p -value comes from a $\chi^{2}$ distribution with \(2-1=1\) degrees of freedom.
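A minimal sketch checking the arithmetic from the reported deviances:

```python
from scipy.stats import chi2

dev_reduced, dev_full = 34.372, 26.073   # null and fitted model deviances
G2 = dev_reduced - dev_full              # likelihood ratio statistic
print(f"G^2 = {G2:.3f}, P = {chi2.sf(G2, df=1):.3f}")   # 8.299, P close to 0.004
```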
When using the likelihood ratio (or deviance) test for more than one regression coefficient, we can first fit the "full" model to find deviance (full), which is shown in the "Error" row in the resulting full model Deviance Table. Then fit the "reduced" model (corresponding to the model that results if the null hypothesis is true) to find deviance (reduced), which is shown in the "Error" row in the resulting reduced model Deviance Table. For example, the relevant Deviance Tables for the Disease Outbreak example on pages 581-582 of Applied Linear Regression Models (4th ed) by Kutner et al are:
Full model:
Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   9   28.322   3.14686       28.32    0.001
Error       88   93.996   1.06813
Total       97  122.318
Reduced model:
Source      DF  Adj Dev  Adj Mean  Chi-Square  P-Value
Regression   4   21.263    5.3159       21.26    0.000
Error       93  101.054    1.0866
Total       97  122.318
Here the full model includes four single-factor predictor terms and five two-factor interaction terms, while the reduced model excludes the interaction terms. The test statistic for testing the interaction terms is \(G^2 = 101.054-93.996 = 7.058\), which is compared to a chi-square distribution with \(10-5=5\) degrees of freedom to find the p -value = 0.216 > 0.05 (meaning the interaction terms are not significant at a 5% significance level).
Alternatively, select the corresponding predictor terms last in the full model and request the software to output Sequential (Type I) Deviances. Then add the corresponding Sequential Deviances in the resulting Deviance Table to calculate \(G^2\). For example, the relevant Deviance Table for the Disease Outbreak example is:
Source         DF  Seq Dev  Seq Mean  Chi-Square  P-Value
Regression      9   28.322    3.1469       28.32    0.001
Age             1    7.405    7.4050        7.40    0.007
Middle          1    1.804    1.8040        1.80    0.179
Lower           1    1.606    1.6064        1.61    0.205
Sector          1   10.448   10.4481       10.45    0.001
Age*Middle      1    4.570    4.5697        4.57    0.033
Age*Lower       1    1.015    1.0152        1.02    0.314
Age*Sector      1    1.120    1.1202        1.12    0.290
Middle*Sector   1    0.000    0.0001        0.00    0.993
Lower*Sector    1    0.353    0.3531        0.35    0.552
Error          88   93.996    1.0681
Total          97  122.318
The test statistic for testing the interaction terms is \(G^2 = 4.570+1.015+1.120+0.000+0.353 = 7.058\), the same as in the first calculation.
Goodness-of-Fit Tests
Overall performance of the fitted model can be measured by several different goodness-of-fit tests. Two tests that require replicated data (multiple observations with the same values for all the predictors) are the Pearson chi-square goodness-of-fit test and the deviance goodness-of-fit test (analogous to the multiple linear regression lack-of-fit F-test). Both of these tests have statistics that are approximately chi-square distributed with c - k - 1 degrees of freedom, where c is the number of distinct combinations of the predictor variables. When a test is rejected, there is a statistically significant lack of fit. Otherwise, there is no evidence of lack of fit.
By contrast, the Hosmer-Lemeshow goodness-of-fit test is useful for unreplicated datasets or for datasets that contain just a few replicated observations. For this test the observations are grouped based on their estimated probabilities. The resulting test statistic is approximately chi-square distributed with c - 2 degrees of freedom, where c is the number of groups (generally chosen to be between 5 and 10, depending on the sample size).
Goodness-of-Fit Tests

Test             DF  Chi-Square  P-Value
Deviance         25       26.07    0.404
Pearson          25       23.93    0.523
Hosmer-Lemeshow   7        6.87    0.442
Since there is no replicated data for this example, the deviance and Pearson goodness-of-fit tests are invalid, so the first two rows of this table should be ignored. However, the Hosmer-Lemeshow test does not require replicated data so we can interpret its high p -value as indicating no evidence of lack-of-fit.
The calculation of R² used in linear regression does not extend directly to logistic regression. One version of R² used in logistic regression is defined as
\[\begin{equation*} R^{2}=\frac{\ell(\hat{\beta_{0}})-\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})-\ell_{S}(\beta)}, \end{equation*}\]
where $\ell(\hat{\beta_{0}})$ is the log likelihood of the model when only the intercept is included and $\ell_{S}(\beta)$ is the log likelihood of the saturated model (i.e., a model that fits the data perfectly). This R² ranges from 0 to 1, with 1 being a perfect fit. With unreplicated data, $\ell_{S}(\beta)=0$, so the formula simplifies to:
\[\begin{equation*} R^{2}=\frac{\ell(\hat{\beta_{0}})-\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})}=1-\frac{\ell(\hat{\beta})}{\ell(\hat{\beta_{0}})}. \end{equation*}\]
Model Summary

Deviance R-Sq  Deviance R-Sq(adj)    AIC
       24.14%              21.23%  30.07
Recall from above that \(\ell(\hat{\beta})=-13.0365\) and \(\ell(\hat{\beta}^{(0)})=-17.1859\), so:
\[\begin{equation*} R^{2}=1-\frac{-13.0365}{-17.1859}=0.2414. \end{equation*}\]
Note that we can obtain the same result by simply using deviances instead of log-likelihoods since the $-2$ factor cancels out:
\[\begin{equation*} R^{2}=1-\frac{26.073}{34.372}=0.2414. \end{equation*}\]
Raw Residual
The raw residual is the difference between the actual response and the estimated probability from the model. The formula for the raw residual is
\[\begin{equation*} r_{i}=y_{i}-\hat{\pi}_{i}. \end{equation*}\]
Pearson Residual
The Pearson residual corrects for the unequal variance in the raw residuals by dividing by the standard deviation. The formula for the Pearson residuals is
\[\begin{equation*} p_{i}=\frac{r_{i}}{\sqrt{\hat{\pi}_{i}(1-\hat{\pi}_{i})}}. \end{equation*}\]
Deviance Residuals
Deviance residuals are also popular because the sum of squares of these residuals is the deviance statistic. The formula for the deviance residual is
\[\begin{equation*} d_{i}=\pm\sqrt{2\biggl[y_{i}\log\biggl(\frac{y_{i}}{\hat{\pi}_{i}}\biggr)+(1-y_{i})\log\biggl(\frac{1-y_{i}}{1-\hat{\pi}_{i}}\biggr)\biggr]}. \end{equation*}\]
Here are the plots of the Pearson residuals and deviance residuals for the leukemia example. There are no alarming patterns in these plots to suggest a major problem with the model.
Hat Values

The hat matrix serves a similar purpose as in the case of linear regression – to measure the influence of each observation on the overall fit of the model – but the interpretation is not as clear due to its more complicated form. The hat values (leverages) are given by
\[\begin{equation*} h_{i,i}=\hat{\pi}_{i}(1-\hat{\pi}_{i})\textbf{x}_{i}^{\textrm{T}}(\textbf{X}^{\textrm{T}}\textbf{W}\textbf{X})^{-1}\textbf{x}_{i}, \end{equation*}\]
where W is an $n\times n$ diagonal matrix with the values of $\hat{\pi}_{i}(1-\hat{\pi}_{i})$ for $i=1 ,\ldots,n$ on the diagonal. As before, we should investigate any observations with $h_{i,i}>3p/n$ or, failing this, any observations with $h_{i,i}>2p/n$ that are very isolated.
Studentized Residuals
We can also report Studentized versions of some of the earlier residuals. The Studentized Pearson residuals are given by
\[\begin{equation*} sp_{i}=\frac{p_{i}}{\sqrt{1-h_{i,i}}} \end{equation*}\]
and the Studentized deviance residuals are given by
\[\begin{equation*} sd_{i}=\frac{d_{i}}{\sqrt{1-h_{i, i}}}. \end{equation*}\]
Cook's Distances
An extension of Cook's distance for logistic regression measures the overall change in fitted logits due to deleting the $i^{\textrm{th}}$ observation. It is defined by:
\[\begin{equation*} \textrm{C}_{i}=\frac{p_{i}^{2}h _{i,i}}{(k+1)(1-h_{i,i})^{2}}. \end{equation*}\]
Fits and Diagnostics for Unusual Observations

     Observed
Obs  Probability    Fit  SE Fit          95% CI   Resid  Std Resid  Del Resid        HI
  8        0.000  0.849   0.139  (0.403, 0.979)  -1.945      -2.11      -2.19  0.149840

Obs  Cook's D     DFITS
  8      0.58  -1.08011  R

R  Large residual
The residuals in this output are deviance residuals, so observation 8 has a deviance residual of \(-1.945\), a studentized deviance residual of \(-2.19\), a leverage (h) of \(0.149840\), and a Cook's distance (C) of 0.58.
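The diagnostics above can be assembled by hand once the fitted probabilities and design matrix are available. The sketch below assumes numpy arrays X (including the intercept column), y and the fitted coefficients beta; it follows the formulas given earlier:

```python
import numpy as np

def logistic_diagnostics(X, y, beta):
    """Pearson/deviance residuals, leverages, studentized residuals and Cook's distances."""
    pi = 1 / (1 + np.exp(-X @ beta))
    raw = y - pi
    pearson = raw / np.sqrt(pi * (1 - pi))

    # Deviance residuals (binary y, so one of the two log terms is always zero)
    dev = np.sign(raw) * np.sqrt(-2 * (y * np.log(pi) + (1 - y) * np.log(1 - pi)))

    W = np.diag(pi * (1 - pi))
    H = W @ X @ np.linalg.inv(X.T @ W @ X) @ X.T   # hat matrix
    h = np.diag(H)                                 # leverages

    stud_pearson = pearson / np.sqrt(1 - h)
    stud_dev = dev / np.sqrt(1 - h)
    k_plus_1 = X.shape[1]                          # number of parameters
    cooks = pearson**2 * h / (k_plus_1 * (1 - h) ** 2)
    return pearson, dev, h, stud_pearson, stud_dev, cooks
```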
Logistic Regression in Machine Learning
Logistic regression is a supervised machine learning algorithm used for classification tasks, where the goal is to predict the probability that an instance belongs to a given class. It is a statistical method that models the relationship between one or more independent variables and a binary outcome. This article explores the fundamentals of logistic regression, its types and its implementation.
What is Logistic Regression?
Logistic regression is used for binary classification, where we use the sigmoid function, which takes the independent variables as input and produces a probability value between 0 and 1.
For example, with two classes, Class 0 and Class 1, if the value of the logistic function for an input is greater than 0.5 (the threshold value) then the instance belongs to Class 1; otherwise it belongs to Class 0. It is referred to as regression because it is an extension of linear regression, but it is mainly used for classification problems.
Key Points:
- Logistic regression predicts the output of a categorical dependent variable. Therefore, the outcome must be a categorical or discrete value.
- It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
- In Logistic regression, instead of fitting a regression line, we fit an “S” shaped logistic function, which predicts two maximum values (0 or 1).
- The sigmoid function is a mathematical function used to map the predicted values to probabilities.
- It maps any real value to a value within the range 0 to 1. The output of logistic regression must lie between 0 and 1 and cannot go beyond this range, so it forms a curve like the “S” form.
- The S-form curve is called the Sigmoid function or the logistic function.
- In logistic regression, we use the concept of a threshold value, which determines whether the prediction is 0 or 1: values above the threshold are mapped to 1, and values below the threshold are mapped to 0.
Types of Logistic Regression

On the basis of the categories, Logistic Regression can be classified into three types:
- Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variables, such as 0 or 1, Pass or Fail, etc.
- Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as “cat”, “dogs”, or “sheep”
- Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as “low”, “Medium”, or “High”.
Assumptions of Logistic Regression

We will explore the assumptions of logistic regression, as understanding them is important to ensure that the model is being applied appropriately. The assumptions include:
- Independent observations: Each observation is independent of the others, meaning the outcomes of different observations are unrelated.
- Binary dependent variable: The dependent variable is assumed to be binary or dichotomous, i.e. it can take only two values. For more than two categories, the softmax function is used (multinomial logistic regression).
- Linear relationship between independent variables and log odds: The relationship between the independent variables and the log odds of the dependent variable should be linear.
- No outliers: There should be no extreme outliers in the dataset.
- Large sample size: The sample size should be sufficiently large.
Terminologies involved in Logistic Regression
Here are some common terms involved in logistic regression:
- Independent variables: The input characteristics or predictor factors applied to the dependent variable’s predictions.
- Dependent variable: The target variable in a logistic regression model, which we are trying to predict.
- Logistic function: The formula used to represent how the independent and dependent variables relate to one another. The logistic function transforms the input variables into a probability value between 0 and 1, which represents the likelihood of the dependent variable being 1 or 0.
- Odds: The ratio of something occurring to something not occurring. It is different from probability, which is the ratio of something occurring to everything that could possibly occur.
- Log-odds: The log-odds, also known as the logit function, is the natural logarithm of the odds. In logistic regression, the log odds of the dependent variable are modeled as a linear combination of the independent variables and the intercept.
- Coefficient: The logistic regression model’s estimated parameters, show how the independent and dependent variables relate to one another.
- Intercept: A constant term in the logistic regression model, which represents the log odds when all independent variables are equal to zero.
- Maximum likelihood estimation : The method used to estimate the coefficients of the logistic regression model, which maximizes the likelihood of observing the data given the model.
How Does Logistic Regression Work?

The logistic regression model passes the continuous output of a linear regression function through a sigmoid function, which maps any real-valued combination of the independent variables to a value between 0 and 1. This function is known as the logistic function.
Let the independent input features be:
[Tex]X = \begin{bmatrix} x_{11} & … & x_{1m}\\ x_{21} & … & x_{2m} \\ \vdots & \ddots & \vdots \\ x_{n1} & … & x_{nm} \end{bmatrix}[/Tex]
and the dependent variable is Y having only binary value i.e. 0 or 1.
[Tex]Y = \begin{cases} 0 & \text{ if } Class\;1 \\ 1 & \text{ if } Class\;2 \end{cases} [/Tex]
Then apply the linear function to the input variables X:
[Tex]z = \left(\sum_{i=1}^{n} w_{i}x_{i}\right) + b [/Tex]
Here [Tex]x_i [/Tex] is the ith observation of X, [Tex]w = [w_1, w_2, w_3, \cdots,w_m] [/Tex] is the vector of weights (coefficients), and b is the bias term, also known as the intercept. This can be written more compactly as a dot product plus the bias:
[Tex]z = w\cdot X +b [/Tex]
Everything discussed above is simply linear regression.
Sigmoid Function
Now we use the sigmoid function where the input will be z and we find the probability between 0 and 1. i.e. predicted y.
[Tex]\sigma(z) = \frac{1}{1+e^{-z}} [/Tex]
Sigmoid function
As the figure above shows, the sigmoid function converts continuous input data into a probability, i.e. a value between 0 and 1.
- [Tex]\sigma(z) [/Tex] tends towards 1 as [Tex]z\rightarrow\infty [/Tex]
- [Tex]\sigma(z) [/Tex] tends towards 0 as [Tex]z\rightarrow-\infty [/Tex]
- [Tex]\sigma(z) [/Tex] is always bounded between 0 and 1
where the probability of being a class can be measured as:
[Tex]P(y=1) = \sigma(z) \\ P(y=0) = 1-\sigma(z) [/Tex]
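A tiny numpy sketch of the sigmoid, the resulting class probabilities and the 0.5 threshold (illustrative only):

```python
import numpy as np

def sigmoid(z):
    """Map any real value z to a probability between 0 and 1."""
    return 1 / (1 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
p1 = sigmoid(z)                    # P(y = 1)
p0 = 1 - p1                        # P(y = 0)
labels = (p1 >= 0.5).astype(int)   # apply the 0.5 threshold
print(p1.round(3), labels)
```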
Logistic Regression Equation
The odds are the ratio of something occurring to something not occurring. They differ from probability, which is the ratio of something occurring to everything that could possibly occur. So the odds will be:
[Tex]\frac{p(x)}{1-p(x)} = e^z[/Tex]
Applying the natural log to the odds, the log odds will be:
[Tex]\begin{aligned} \log \left[\frac{p(x)}{1-p(x)} \right] &= z \\ \log \left[\frac{p(x)}{1-p(x)} \right] &= w\cdot X +b \\ \frac{p(x)}{1-p(x)}&= e^{w\cdot X +b} \;\;\cdots\text{Exponentiate both sides} \\ p(x) &=e^{w\cdot X +b}\cdot (1-p(x)) \\ p(x) &=e^{w\cdot X +b}-e^{w\cdot X +b}\cdot p(x) \\ p(x)+e^{w\cdot X +b}\cdot p(x)&=e^{w\cdot X +b} \\ p(x)\,(1+e^{w\cdot X +b}) &=e^{w\cdot X +b} \\ p(x)&= \frac{e^{w\cdot X +b}}{1+e^{w\cdot X +b}} \end{aligned}[/Tex]
then the final logistic regression equation will be:
[Tex]p(X;b,w) = \frac{e^{w\cdot X +b}}{1+e^{w\cdot X +b}} = \frac{1}{1+e^{-(w\cdot X +b)}}[/Tex]
Likelihood Function for Logistic Regression
The predicted probabilities will be:
- for y=1 The predicted probabilities will be: p(X;b,w) = p(x)
- for y = 0 The predicted probabilities will be: 1-p(X;b,w) = 1-p(x)
[Tex]L(b,w) = \prod_{i=1}^{n}p(x_i)^{y_i}(1-p(x_i))^{1-y_i}[/Tex]
Taking natural logs on both sides
[Tex]\begin{aligned}\log(L(b,w)) &= \sum_{i=1}^{n} y_i\log p(x_i)\;+\; (1-y_i)\log(1-p(x_i)) \\ &=\sum_{i=1}^{n} y_i\log p(x_i)+\log(1-p(x_i))-y_i\log(1-p(x_i)) \\ &=\sum_{i=1}^{n} \log(1-p(x_i)) +\sum_{i=1}^{n}y_i\log \frac{p(x_i)}{1-p(x_i)} \\ &=-\sum_{i=1}^{n} \log\bigl(1+e^{w\cdot x_i+b}\bigr) +\sum_{i=1}^{n}y_i (w\cdot x_i +b) \end{aligned}[/Tex]
Gradient of the log-likelihood function
To find the maximum likelihood estimates, we differentiate w.r.t w,
[Tex]\begin{aligned} \frac{\partial \ell(b,w)}{\partial w_j}&=-\sum_{i=1}^{n}\frac{e^{w\cdot x_i+b}}{1+e^{w\cdot x_i+b}} x_{ij} +\sum_{i=1}^{n}y_{i}x_{ij} \\&=-\sum_{i=1}^{n}p(x_i;b,w)\,x_{ij}+\sum_{i=1}^{n}y_{i}x_{ij} \\&=\sum_{i=1}^{n}\bigl(y_i -p(x_i;b,w)\bigr)x_{ij} \end{aligned} [/Tex]
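Setting this gradient to zero has no closed-form solution, so the weights are found numerically. Below is a small gradient-ascent sketch based directly on the derived gradient; it is illustrative only and not the solver a library such as scikit-learn uses by default:

```python
import numpy as np

def fit_logistic_gd(X, y, lr=0.1, n_iter=5000):
    """Fit weights w and bias b by gradient ascent on the log-likelihood."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ w + b)))   # sigmoid(w.x + b)
        grad_w = X.T @ (y - p)               # derived gradient for the weights
        grad_b = np.sum(y - p)               # gradient for the bias
        w += lr * grad_w / len(y)
        b += lr * grad_b / len(y)
    return w, b

# Small synthetic check
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.5 + rng.normal(scale=0.5, size=200) > 0).astype(int)
print(fit_logistic_gd(X, y))
```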
Binomial Logistic Regression
The target variable can have only two possible types, “0” or “1”, which may represent “win” vs “loss”, “pass” vs “fail”, “dead” vs “alive”, etc. In this case, the sigmoid function is used, as discussed above.
First, import the libraries the model requires. This Python code shows how to use the breast cancer dataset to implement a logistic regression model for classification.
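The original code listing is not reproduced in this extract, so the sketch below is a reconstruction along the same lines using scikit-learn; the exact accuracy depends on the train/test split and solver settings:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the binary-outcome breast cancer dataset
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=10000)   # raise max_iter so the solver converges
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Logistic Regression model accuracy (in %):",
      accuracy_score(y_test, y_pred) * 100)
```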
Logistic Regression model accuracy (in %): 95.6140350877193
Multinomial Logistic Regression:
Target variable can have 3 or more possible types which are not ordered (i.e. types have no quantitative significance) like “disease A” vs “disease B” vs “disease C”.
In this case, the softmax function is used in place of the sigmoid function. Softmax function for K classes will be:
[Tex]\text{softmax}(z_i) =\frac{ e^{z_i}}{\sum_{j=1}^{K}e^{z_{j}}}[/Tex]
Here, K represents the number of elements in the vector z, and i and j index the elements of the vector.
Then the probability for class c will be:
[Tex]P(Y=c | \overrightarrow{X}=x) = \frac{e^{w_c \cdot x + b_c}}{\sum_{k=1}^{K}e^{w_k \cdot x + b_k}}[/Tex]
In multinomial logistic regression, the output variable can have more than two possible discrete outputs. Consider the digits dataset.
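As with the binary example, the original code is not shown in this extract; a comparable scikit-learn sketch using the digits dataset (exact accuracy depends on the split and solver) is:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load the 10-class digits dataset
X, y = load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# With a multiclass target, LogisticRegression (lbfgs solver) fits a
# multinomial (softmax) model by default
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

print("Logistic Regression model accuracy (in %):",
      accuracy_score(y_test, clf.predict(X_test)) * 100)
```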
Logistic Regression model accuracy(in %): 96.52294853963839
How to Evaluate a Logistic Regression Model

We can evaluate the logistic regression model using the following metrics (a short scikit-learn sketch follows the list):
- Accuracy: Accuracy provides the proportion of correctly classified instances. [Tex]Accuracy = \frac{True \, Positives + True \, Negatives}{Total} [/Tex]
- Precision: Precision focuses on the accuracy of positive predictions. [Tex]Precision = \frac{True \, Positives }{True\, Positives + False \, Positives} [/Tex]
- Recall (Sensitivity or True Positive Rate): Recall measures the proportion of correctly predicted positive instances among all actual positive instances. [Tex]Recall = \frac{ True \, Positives}{True\, Positives + False \, Negatives} [/Tex]
- F1 Score: F1 score is the harmonic mean of precision and recall. [Tex]F1 \, Score = 2 * \frac{Precision * Recall}{Precision + Recall} [/Tex]
- Area Under the Receiver Operating Characteristic Curve (AUC-ROC): The ROC curve plots the true positive rate against the false positive rate at various thresholds. AUC-ROC measures the area under this curve, providing an aggregate measure of a model’s performance across different classification thresholds.
- Area Under the Precision-Recall Curve (AUC-PR): Similar to AUC-ROC, AUC-PR measures the area under the precision-recall curve, providing a summary of a model’s performance across different precision-recall trade-offs.
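Continuing from the binary breast cancer sketch above (so clf, X_test and y_test are assumed to exist), these metrics can be computed with scikit-learn:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

y_pred = clf.predict(X_test)                 # class labels at the 0.5 threshold
y_prob = clf.predict_proba(X_test)[:, 1]     # predicted probability of class 1

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))
print("AUC-PR   :", average_precision_score(y_test, y_prob))
```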
Precision-Recall Tradeoff and Threshold Setting

Logistic regression becomes a classification technique only when a decision threshold is brought into the picture. The choice of threshold is a very important aspect of logistic regression and depends on the classification problem itself.
The choice of threshold is mainly driven by the values of precision and recall. Ideally, we want both precision and recall to be 1, but this is seldom the case (a short sketch after the list below shows how moving the threshold shifts this tradeoff).
In the case of a Precision-Recall tradeoff , we use the following arguments to decide upon the threshold:
- Low Precision/High Recall: In applications where we want to reduce the number of false negatives without necessarily reducing the number of false positives, we choose a decision threshold that gives a low value of precision and a high value of recall. For example, in a cancer diagnosis application, we do not want any affected patient to be classified as not affected, even if some patients are wrongfully flagged as having cancer. This is because the absence of cancer can be confirmed by further medical tests, whereas the presence of the disease cannot be detected in a patient who has already been screened out.
- High Precision/Low Recall: In applications where we want to reduce the number of false positives without necessarily reducing the number of false negatives, we choose a decision value that has a high value of Precision or a low value of Recall. For example, if we are classifying customers whether they will react positively or negatively to a personalized advertisement, we want to be absolutely sure that the customer will react positively to the advertisement because otherwise, a negative reaction can cause a loss of potential sales from the customer.
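As mentioned above, the sketch below (again reusing clf, X_test and y_test from the breast cancer example) shows how moving the threshold shifts precision and recall:

```python
from sklearn.metrics import precision_score, recall_score

y_prob = clf.predict_proba(X_test)[:, 1]

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(y_test, y_pred):.3f}, "
          f"recall={recall_score(y_test, y_pred):.3f}")
```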
The difference between linear regression and logistic regression is that linear regression outputs a continuous value that can take any real number, whereas logistic regression predicts the probability that an instance belongs to a given class.
Logistic Regression – Frequently Asked Questions (FAQs)
What is logistic regression in machine learning?
Logistic regression is a statistical method for developing machine learning models in which the dependent variable is binary. It is used to describe data and the relationship between one dependent variable and one or more independent variables.
What are the three types of logistic regression?
Logistic regression is classified into three types: binary, multinomial, and ordinal. They differ in execution as well as theory. Binary logistic regression is concerned with two possible outcomes, such as yes or no. Multinomial logistic regression is used when the outcome has three or more unordered categories, while ordinal logistic regression is used when the categories are ordered.
Why is logistic regression used for classification problems?
Logistic regression is easier to implement, interpret, and train. It classifies unknown records very quickly. When the dataset is linearly separable, it performs well. Model coefficients can be interpreted as indicators of feature importance.
What distinguishes Logistic Regression from Linear Regression?
While Linear Regression is used to predict continuous outcomes, Logistic Regression is used to predict the likelihood of an observation falling into a specific category. Logistic Regression employs an S-shaped logistic function to map predicted values between 0 and 1.
What role does the logistic function play in Logistic Regression?
Logistic Regression relies on the logistic function to convert the output into a probability score. This score represents the probability that an observation belongs to a particular class. The S-shaped curve assists in thresholding and categorising data into binary outcomes.
Companion to BER 642: Advanced Regression Methods
Chapter 10 Binary Logistic Regression
10.1 Introduction
Logistic regression is a technique used when the dependent variable is categorical (or nominal). Examples: 1) consumers make a decision to buy or not to buy, 2) a product may pass or fail quality control, 3) there are good or poor credit risks, and 4) an employee may be promoted or not.
Binary logistic regression - determines the impact of multiple independent variables presented simultaneously to predict membership of one or other of the two dependent variable categories.
Since the dependent variable is dichotomous we cannot predict a numerical value for it, so the usual least-squares approach of minimizing the squared deviations around a line of best fit is inappropriate (such deviations cannot be calculated meaningfully from binary variables).
Instead, logistic regression employs binomial probability theory, in which there are only two values to predict: the probability (p) that the outcome is 1 rather than 0, i.e. that the event/person belongs to one group rather than the other.
Logistic regression forms a best fitting equation or function using the maximum likelihood (ML) method, which maximizes the probability of classifying the observed data into the appropriate category given the regression coefficients.
Like multiple regression, logistic regression provides a coefficient ‘b’, which measures each independent variable’s partial contribution to variations in the dependent variable.
The goal is to correctly predict the category of outcome for individual cases using the most parsimonious model.
To accomplish this goal, a model (i.e. an equation) is created that includes all predictor variables that are useful in predicting the response variable.
10.2 The Purpose of Binary Logistic Regression
- The logistic regression predicts group membership
Since logistic regression calculates the probability of success over the probability of failure, the results of the analysis are in the form of an odds ratio.
Logistic regression determines the impact of multiple independent variables presented simultaneously to predict membership of one or other of the two dependent variable categories.
- The logistic regression also provides the relationships and strengths among the variables
Assumptions of (Binary) Logistic Regression
Logistic regression does not assume a linear relationship between the dependent and independent variables.
- Logistic regression assumes a linear relationship between the independent variables and the log odds of the dependent variable.
The independent variables need not be interval, nor normally distributed, nor linearly related, nor of equal variance within each group.
- Homoscedasticity is not required. The error terms (residuals) do not need to be normally distributed.
The dependent variable in logistic regression is not measured on an interval or ratio scale.
- The dependent variable must be dichotomous (2 categories) for binary logistic regression.
The categories (groups) as a dependent variable must be mutually exclusive and exhaustive; a case can only be in one group and every case must be a member of one of the groups.
Larger samples are needed than for linear regression because maximum likelihood coefficients are large-sample estimates. A minimum of 50 cases per predictor is recommended (Field, 2013).
Hosmer, Lemeshow, and Sturdivant (2013) suggest a minimum sample of 10 observations per independent variable in the model, but caution that 20 observations per variable should be sought if possible.
Leblanc and Fitzgerald (2000) suggest a minimum of 30 observations per independent variable.
10.3 Log Transformation
The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality.
- Log transformations and square root transformations move skewed distributions closer to normality, so what we are about to do is common.
This log transformation of the p values to a log distribution enables us to create a link with the normal regression equation. The log distribution (or logistic transformation of p) is also called the logit of p or logit(p).
In logistic regression, a logistic transformation of the odds (referred to as the logit) serves as the dependent variable:
\[\log(\text{odds})=\operatorname{logit}(P)=\ln \left(\frac{P}{1-P}\right)\] If we take the above dependent variable and add a regression equation for the independent variables, we get a logistic regression:
\[\operatorname{logit}(P)=a+b_{1} x_{1}+b_{2} x_{2}+b_{3} x_{3}+\ldots\] As in least-squares regression, the relationship between logit(P) and X is assumed to be linear.
10.4 Equation
\[P=\frac{\exp \left(a+b_{1} x_{1}+b_{2} x_{2}+b_{3} x_{3}+\ldots\right)}{1+\exp \left(a+b_{1} x_{1}+b_{2} x_{2}+b_{3} x_{3}+\ldots\right)}\] In the equation above:
P = the probability that a case is in a particular category,
exp = the exponential function (with base e, approximately 2.72),
a = the constant (or intercept) of the equation and,
b = the coefficient (or slope) of the predictor variables.
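A small numeric sketch of this equation in R; the intercept, slopes, and predictor values below are hypothetical and chosen only for illustration.

```r
# Hypothetical intercept, slopes, and predictor values
a <- -1.5
b <- c(0.8, 0.3)
x <- c(1.2, 2.0)

linpred <- a + sum(b * x)               # a + b1*x1 + b2*x2
P <- exp(linpred) / (1 + exp(linpred))  # probability of being in the category
P                                       # equivalently, plogis(linpred)
```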
10.5 Hypothesis Test
In logistic regression, two hypotheses are of interest:
the null hypothesis, which is when all the coefficients in the regression equation take the value zero, and
the alternate hypothesis that the model currently under consideration is accurate and differs significantly from the null of zero, i.e. gives significantly better than the chance or random prediction level of the null hypothesis.
10.6 Likelihood Ratio Test for Nested Models
The likelihood ratio test is based on the -2LL statistic. It is a test of the significance of the difference between the -2LL for the researcher's model with predictors and the -2LL for the baseline model with only a constant in it; this difference is called the model chi-square.
Significance at the .05 level or lower means the researcher’s model with the predictors is significantly different from the one with the constant only (all ‘b’ coefficients being zero). It measures the improvement in fit that the explanatory variables make compared to the null model.
Chi square is used to assess significance of this ratio.
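As a sketch of how this test can be carried out in R (the data frame dat, outcome y, and predictors x1 and x2 are hypothetical placeholders), the constant-only model and the researcher's model are fitted and compared with a chi-square test on the change in deviance (-2LL).

```r
# Hypothetical data frame `dat` with a binary outcome y and predictors x1, x2
null_model <- glm(y ~ 1,       data = dat, family = binomial)
full_model <- glm(y ~ x1 + x2, data = dat, family = binomial)

# Model chi-square: the difference in -2LL (deviance) between the two models
anova(null_model, full_model, test = "Chisq")
```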
10.7 R Lab: Running Binary Logistic Regression Model
10.7.1 Data Explanations (data set: class.sav)
A researcher is interested in how variables such as GRE (Graduate Record Exam scores), GPA (grade point average) and prestige of the undergraduate institution affect admission into graduate school. The response variable, admit/don't admit, is a binary variable.
This dataset has a binary response (outcome, dependent) variable called admit, which is equal to 1 if the individual was admitted to graduate school, and 0 otherwise.
There are three predictor variables: GRE, GPA, and rank. We will treat the variables GRE and GPA as continuous. The variable rank takes on the values 1 through 4. Institutions with a rank of 1 have the highest prestige, while those with a rank of 4 have the lowest.
10.7.2 Explore the data
We can get basic descriptives for the entire data set by using summary. To get the standard deviations, we use sapply to apply the sd function to each variable in the dataset.
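A sketch of these steps, assuming the admission data have already been read into a data frame called mydata (for example from the class.sav file mentioned above); the import line is only an assumption about how the file might be loaded.

```r
# mydata <- haven::read_sav("class.sav")   # assumed import; adjust to your own file
summary(mydata)                            # basic descriptives for every variable
sapply(mydata, sd)                         # standard deviation of each variable
```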
Before we run a binary logistic regression, we need to check a two-way contingency table of the categorical outcome and predictors. We want to make sure there are no cells with zero counts.
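A sketch of that check, using the same mydata data frame:

```r
# Cross-tabulate the outcome against the categorical predictor;
# every cell should contain a non-zero count
xtabs(~ admit + rank, data = mydata)
```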
10.7.3 Running a Logistic Regression Model
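A sketch of fitting the model, again assuming the mydata data frame from above; mylogit is the model object name used in the remainder of this chapter.

```r
mydata$rank <- factor(mydata$rank)    # rank 1 becomes the reference category

mylogit <- glm(admit ~ gre + gpa + rank,
               data = mydata, family = "binomial")
summary(mylogit)
```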
In the output above, the first thing we see is the call; this is R reminding us what model we ran, what options we specified, and so on.
Next we see the deviance residuals, which are a measure of model fit. This part of output shows the distribution of the deviance residuals for individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit.
The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. Both gre and gpa are statistically significant, as are the three terms for rank. The logistic regression coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable.
How do we interpret these coefficients?
For every one unit change in gre, the log odds of admission (versus non-admission) increases by 0.002.
For a one unit increase in gpa, the log odds of being admitted to graduate school increases by 0.804.
The indicator variables for rank have a slightly different interpretation. For example, having attended an undergraduate institution with rank of 2, versus an institution with a rank of 1, changes the log odds of admission by -0.675.
Below the table of coefficients are fit indices, including the null and deviance residuals and the AIC. Later we show an example of how you can use these values to help assess model fit.
Why are the coefficient values for rank (B) different from the SPSS output? In R, glm automatically uses rank 1 as the reference group, whereas in our SPSS example we set rank 4 as the reference group.
We can test for an overall effect of rank using the wald.test function from the aod library. The order in which the coefficients are given in the table of coefficients is the same as the order of the terms in the model. This is important because the wald.test function refers to the coefficients by their order in the model. In the call to wald.test, b supplies the coefficients, Sigma supplies the variance-covariance matrix of the error terms, and Terms tells R which terms in the model are to be tested; in this case, terms 4, 5, and 6 are the three terms for the levels of rank.
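A sketch of that overall test:

```r
library(aod)

# Terms 4 to 6 are the coefficients for rank 2, rank 3, and rank 4
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), Terms = 4:6)
```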
The chi-squared test statistic of 20.9, with three degrees of freedom is associated with a p-value of 0.00011 indicating that the overall effect of rank is statistically significant.
We can also test additional hypotheses about the differences in the coefficients for the different levels of rank. Below we test that the coefficient for rank=2 is equal to the coefficient for rank=3. The first line of code below creates a vector l that defines the test we want to perform. In this case, we want to test the difference (subtraction) of the terms for rank=2 and rank=3 (i.e., the 4th and 5th terms in the model). To contrast these two terms, we multiply one of them by 1, and the other by -1. The other terms in the model are not involved in the test, so they are multiplied by 0. The second line of code below uses L=l to tell R that we wish to base the test on the vector l (rather than using the Terms option as we did above).
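A sketch of the contrast described above:

```r
# 1 and -1 pick out the rank=2 and rank=3 terms (the 4th and 5th coefficients);
# all other coefficients are multiplied by 0
l <- cbind(0, 0, 0, 1, -1, 0)
wald.test(b = coef(mylogit), Sigma = vcov(mylogit), L = l)
```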
The chi-squared test statistic of 5.5 with 1 degree of freedom is associated with a p-value of 0.019, indicating that the difference between the coefficient for rank=2 and the coefficient for rank=3 is statistically significant.
You can also exponentiate the coefficients and interpret them as odds-ratios. R will do this computation for you. To get the exponentiated coefficients, you tell R that you want to exponentiate (exp), and that the object you want to exponentiate is called coefficients and it is part of mylogit (coef(mylogit)). We can use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.
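A sketch of those commands:

```r
exp(coef(mylogit))                               # exponentiated coefficients (odds ratios)

# Odds ratios and their 95% confidence intervals, bound together column-wise
exp(cbind(OR = coef(mylogit), confint(mylogit)))
```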
Now we can say that for a one unit increase in gpa, the odds of being admitted to graduate school (versus not being admitted) increase by a factor of 2.23.
For more information on interpreting odds ratios see our FAQ page: How do I interpret odds ratios in logistic regression? Link:
Note that while R produces it, the odds ratio for the intercept is not generally interpreted.
You can also use predicted probabilities to help you understand the model. Predicted probabilities can be computed for both categorical and continuous predictor variables. In order to create predicted probabilities, we first need to create a new data frame containing the values we want the independent variables to take on when making predictions.
We will start by calculating the predicted probability of admission at each value of rank, holding gre and gpa at their means.
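A sketch of creating that data frame, holding gre and gpa at their means and letting rank take each of its four levels:

```r
newdata1 <- with(mydata,
                 data.frame(gre  = mean(gre),
                            gpa  = mean(gpa),
                            rank = factor(1:4)))
newdata1
```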
These objects must have the same names as the variables in your logistic regression above (e.g. in this example the mean for gre must be named gre). Now that we have the data frame we want to use to calculate the predicted probabilities, we can tell R to create them. The first line of code below is quite compact, so we will break it apart to discuss what the various components do. The newdata1$rankP part tells R that we want to create a new variable in the data frame newdata1 called rankP; the rest of the command tells R that the values of rankP should be predictions made using the predict() function. The options within the parentheses tell R that the predictions should be based on the analysis mylogit, with values of the predictor variables coming from newdata1, and that the type of prediction is a predicted probability (type = "response"). The second line of code lists the values in the data frame newdata1. Although not particularly pretty, this is a table of predicted probabilities.
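A sketch of the two lines of code described above:

```r
newdata1$rankP <- predict(mylogit, newdata = newdata1, type = "response")
newdata1
```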
In the above output we see that the predicted probability of being accepted into a graduate program is 0.52 for students from the highest prestige undergraduate institutions (rank=1), and 0.18 for students from the lowest ranked institutions (rank=4), holding gre and gpa at their means.
Now we are going to do something that does not exist in our SPSS section.
The code to generate the predicted probabilities (the first line below) is the same as before, except we are also going to ask for standard errors so we can plot a confidence interval. We get the estimates on the link scale and back transform both the predicted values and confidence limits into probabilities.
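A sketch of this step; the grid of gre values in newdata2 is illustrative (100 points from 200 to 800 for each level of rank) rather than taken from the original code.

```r
newdata2 <- with(mydata,
                 data.frame(gre  = rep(seq(from = 200, to = 800, length.out = 100), 4),
                            gpa  = mean(gpa),
                            rank = factor(rep(1:4, each = 100))))

# Predictions on the link (log-odds) scale, with standard errors
preds <- predict(mylogit, newdata = newdata2, type = "link", se.fit = TRUE)

# Back-transform the fitted values and the confidence limits to probabilities
newdata3 <- cbind(newdata2,
                  PredictedProb = plogis(preds$fit),
                  LL = plogis(preds$fit - 1.96 * preds$se.fit),
                  UL = plogis(preds$fit + 1.96 * preds$se.fit))
head(newdata3)
```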
It can also be helpful to use graphs of predicted probabilities to understand and/or present the model. We will use the ggplot2 package for graphing.
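A sketch of the plot, assuming the newdata3 data frame built in the previous step:

```r
library(ggplot2)

ggplot(newdata3, aes(x = gre, y = PredictedProb)) +
  geom_ribbon(aes(ymin = LL, ymax = UL, fill = rank), alpha = 0.2) +
  geom_line(aes(colour = rank))
```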
We may also wish to see measures of how well our model fits. This can be particularly useful when comparing competing models. The output produced by summary(mylogit) included indices of fit (shown below the coefficients), including the null and deviance residuals and the AIC. One measure of model fit is the significance of the overall model. This test asks whether the model with predictors fits significantly better than a model with just an intercept (i.e., a null model). The test statistic is the difference between the residual deviance for the model with predictors and the null model. The test statistic is distributed chi-squared with degrees of freedom equal to the differences in degrees of freedom between the current and the null model (i.e., the number of predictor variables in the model). To find the difference in deviance for the two models (i.e., the test statistic) we can use the command:
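A sketch of those commands for the fitted mylogit model:

```r
with(mylogit, null.deviance - deviance)       # test statistic (model chi-square)
with(mylogit, df.null - df.residual)          # degrees of freedom

with(mylogit, pchisq(null.deviance - deviance,
                     df.null - df.residual,
                     lower.tail = FALSE))     # p-value for the overall model
```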
10.8 Things to consider
Empty cells or small cells: You should check for empty or small cells by doing a crosstab between categorical predictors and the outcome variable. If a cell has very few cases (a small cell), the model may become unstable or it might not run at all.
Separation or quasi-separation (also called perfect prediction) is a condition in which the outcome does not vary at some levels of the independent variables. See our page FAQ: What is complete or quasi-complete separation in logistic/probit regression and how do we deal with them? for information on models with perfect prediction. Link
Sample size: Both logit and probit models require more cases than OLS regression because they use maximum likelihood estimation techniques. It is sometimes possible to estimate models for binary outcomes in datasets with only a small number of cases using exact logistic regression. It is also important to keep in mind that when the outcome is rare, even if the overall dataset is large, it can be difficult to estimate a logit model.
Pseudo-R-squared: Many different measures of pseudo-R-squared exist. They all attempt to provide information similar to that provided by R-squared in OLS regression; however, none of them can be interpreted exactly as R-squared in OLS regression is interpreted. For a discussion of various pseudo-R-squareds see Long and Freese (2006) or our FAQ page What are pseudo R-squareds? Link
Diagnostics: The diagnostics for logistic regression are different from those for OLS regression. For a discussion of model diagnostics for logistic regression, see Hosmer and Lemeshow (2000, Chapter 5). Note that diagnostics done for logistic regression are similar to those done for probit regression.
10.9 Supplementary Learning Materials
Agresti, A. (1996). An introduction to categorical data analysis. New York, NY: Wiley & Sons.
Burns, R. P., & Burns, R. (2008). Business research methods & statistics using SPSS. SAGE Publications.
Field, A. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). Los Angeles, CA: Sage Publications.
Data files from Link1, Link2, & Link3.