Logistic regression with SPSS examples

Establishing association between dependent and independent variables


  1. 1. Dr. Gaurav Kamboj Deptt. of Community Medicine PGIMS, Rohtak Logistic Regression
  2. 2. Introduction Types of regression Regression line and equation Logistic regression Relation between probability, odds ratio and logit Purpose Uses Assumptions Logistic regression equation Interpretation of log odd and odds ratio Example CONTENTS
  3. 3. Introduction: REGRESSION is a measure of the average relationship between two or more variables, expressed in the original units of the data. There are many types of regression; the most common in medical research is LOGISTIC REGRESSION.
  4. 4. Introduction: SIMPLE LINEAR REGRESSION uses one independent variable to explain and/or predict the outcome of Y; its general form is Y = α + βX + e. MULTIPLE LINEAR REGRESSION uses two or more independent variables to predict the outcome.
  5. 5. The equation of the straight line is given by the regression equation. Population regression equation: Y = α + βX + e. Sample regression equation: Y = a + bX. Here ‘α’ or ‘a’ is the intercept; ‘β’ or ‘b’ is the slope of the line, which measures the change in Y for a unit change in X; ‘e’ is the regression residual (error).
  6. 6. Types of Regression Models. . .
  7. 7. Logistic Regression: used to analyze relationships between a CATEGORICAL dependent variable and metric or categorical independent variables. Often chosen if the predictor/independent variables are a mix of continuous and categorical variables. ln[p/(1-p)] = α + β1X1 + β2X2 + β3X3 + ... + βtXt + e. The estimated probability is: p = 1/[1 + exp(-(α + β1X1 + β2X2 + β3X3 + ... + βtXt))] • p is the probability that the event Y occurs, p(Y=1) • p/(1-p) is the odds of the event • ln[p/(1-p)] is the log odds, or "logit"
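As a quick illustration of these two formulas, here is a minimal Python sketch; the intercept, coefficients and predictor values below are made up for illustration, not taken from the slides.

```python
import math

# Hypothetical coefficients and predictor values (not from the slides).
alpha = -1.0           # intercept
betas = [0.8, 0.5]     # coefficients b1, b2 for predictors X1, X2
x = [2.0, 1.0]         # one observation's predictor values

# The logit (log odds): alpha + b1*X1 + b2*X2
logit = alpha + sum(b * xi for b, xi in zip(betas, x))

odds = math.exp(logit)           # p / (1 - p)
p = 1 / (1 + math.exp(-logit))   # estimated probability that Y = 1

print(f"logit = {logit:.2f}, odds = {odds:.2f}, p = {p:.2f}")
```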
  8. 8. Logistic regression equation: each predictor (IV) is given a coefficient ‘b’ which measures its independent contribution to variations in the DV; the DV can take on only one of two values, 0 or 1. What we want to predict from a knowledge of relevant IVs and coefficients is therefore not a numerical value of the DV as in linear regression, but rather the probability (p) that it is 1 rather than 0 (i.e. that the case belongs to one group rather than the other).
  9. 9. When And Why: used because having a categorical outcome variable violates the assumption of linearity in normal regression. Logistic regression does not assume a linear relationship between the DV and the IVs; predictors do not have to be normally distributed; and it makes no assumptions of normality, linearity, or homogeneity of variance for the independent variables.
  10. 10. Figure: Linear regression (Marks vs. Study Hours, with a passing-marks cut-off) compared with logistic regression (Result: Pass/Fail vs. Study Hours).
  11. 11. Types of Logistic Regression. Binary logistic regression model: used to model a binary response (e.g. yes or no). Ordinal (ordered) logistic regression model (ordinal multinomial logistic model): used to model an ordered response (e.g. low, medium, or high). Nominal (unordered) logistic regression model (polytomous, polychotomous, or multinomial): used to model a multilevel response with no ordering (e.g. eye color with levels brown, green, and blue).
  12. 12. Relation between probability, odds ratio and logit
  13. 13. Example: 100 participants are randomized to a new or standard treatment (50 subjects in each treatment group). Are the chances of success equal for each treatment group? Success: New 20, Standard 10, Total 30. Failure: New 30, Standard 40, Total 70. Total: New 50, Standard 50, Total 100.
  14. 14. How to measure the chances of success? The probability of success: Pnew = Pr(success | new treatment) = 20/50 = 40%; Pst = Pr(success | std. treatment) = 10/50 = 20%. The odds of success: Onew = Pnew/(1-Pnew) = 20/30 = 0.67; Ost = Pst/(1-Pst) = 10/40 = 0.25. The natural logarithm of the odds of success (the LOGIT): LOGITnew = log(20/30) = -0.41 (new treatment); LOGITst = log(10/40) = log(0.25) = -1.39 (std. treatment).
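The same arithmetic in a short Python sketch, using the 2x2 table above (only the standard library is needed):

```python
import math

# The 2x2 table from the slide: 50 subjects per treatment arm.
success_new, failure_new = 20, 30
success_std, failure_std = 10, 40

# Probabilities of success
p_new = success_new / (success_new + failure_new)   # 20/50 = 0.40
p_std = success_std / (success_std + failure_std)   # 10/50 = 0.20

# Odds of success = p / (1 - p), i.e. successes / failures
odds_new = success_new / failure_new                 # 20/30 ≈ 0.67
odds_std = success_std / failure_std                 # 10/40 = 0.25

# LOGITs: natural log of the odds
logit_new = math.log(odds_new)                       # ≈ -0.41
logit_std = math.log(odds_std)                       # ≈ -1.39

print(p_new, p_std, odds_new, odds_std, logit_new, logit_std)
```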
  15. 15. The odds ratio is one way to capture inequality in the chances of success. OR = Onew/Ost = (20/30)/(10/40) = 0.67/0.25 = 2.67. If OR = 1 then the chances of success are the same in each group, which means Pnew = Pst or Onew = Ost. The null hypothesis is H0: OR = 1 versus the alternative Ha: OR ≠ 1. In this case, the odds of success are 2.67 times higher for the new treatment compared with the standard one.
  16. 16. Simple logistic regression: the probability of success can be represented via the odds or LOGITs of success. From the above example, LOGITnew = -0.41 (new treatment) and LOGITst = -1.39 (standard treatment), so the difference between the log odds is 0.98. We can combine these two log odds for the different groups into one formula: log(odds) = -1.39 + 0.98*(treatment is new) (an example of simple logistic regression).
  17. 17. Simple logistic regression: in this logistic regression, -1.39 and 0.98 are the regression coefficients; -1.39 is called the model intercept, and 0.98 is the treatment effect, i.e. the difference between the LOGITs.
  18. 18. Simple logistic regression: LOGIT = -1.39 + 0.98*(treatment is new). If the treatment is ‘standard’ then LOGIT = -1.39 + 0.98*0 = -1.39, odds = Ost = exp(-1.39) = 0.25 and Pst = 20%. If the treatment is ‘new’ then LOGIT = -1.39 + 0.98*1 = -0.41, odds = Onew = exp(-0.41) = 0.67 and Pnew = 40%.
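A small Python sketch of this back-transformation from the fitted logit to odds and probabilities, using the intercept (-1.39) and treatment effect (0.98) from the slides:

```python
import math

# Fitted model from the slides: LOGIT = -1.39 + 0.98 * (treatment is new)
intercept, treatment_effect = -1.39, 0.98

for is_new in (0, 1):                       # 0 = standard, 1 = new treatment
    logit = intercept + treatment_effect * is_new
    odds = math.exp(logit)                  # back-transform logit to odds
    p = odds / (1 + odds)                   # and odds to probability
    label = "new" if is_new else "standard"
    print(f"{label}: logit = {logit:.2f}, odds = {odds:.2f}, p = {p:.2f}")

# The antilog of the treatment coefficient recovers the odds ratio (~2.67)
print("odds ratio =", round(math.exp(treatment_effect), 2))
```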
  19. 19. Simple logistic regression: if we take the antilog of 0.98, exp(0.98) = 2.67, the odds ratio! This 2.67 differs from 1, which means we have a significant increase in the odds of treatment success (the chi-square p-value was < 5%).
  20. 20. Purpose of logistic regression: the crucial limitation of linear regression is that it cannot deal with DVs that are dichotomous and categorical. Logistic regression employs binomial probability theory, in which there are only two values to predict: that the probability (p) is 1 rather than 0, i.e. that the event/person belongs to one group rather than the other. Logistic regression forms a best-fitting equation or function using the maximum likelihood method, which maximizes the probability of classifying the observed data into the appropriate category given the regression coefficients.
  21. 21. Like ordinary regression, logistic regression provides a coefficient ‘b’, which measures each IV’s partial contribution to variations in the DV. To accomplish this goal, a model (i.e. an equation) is created that includes all predictor variables that are useful in predicting the response variable. Variables can, if necessary, be entered into the model in the order specified by the researcher in a stepwise fashion like regression. Purpose of logistic regression
  22. 22. Uses of logistic regression: the first use is the prediction of group membership. Since logistic regression calculates the probability of success over the probability of failure, the results of the analysis are in the form of an ODDS RATIO. It also provides knowledge of the relationships and strengths among the variables (e.g. marrying the boss’s daughter puts you at a higher probability for job promotion than undertaking five hours of unpaid overtime each week).
  23. 23. Binary Logistic Regression Methods. Simultaneous method: all independents are included at the same time. Hierarchical method: variables are entered in blocks; blocks should be based on past research or the theory being tested (a good method). Stepwise method (forward conditional in SPSS): variables are selected in the order in which they maximize the statistically significant contribution to the model.
  24. 24. The minimum number of cases per independent variable is 10. For preferred case-to-variable ratios, we will use 20 to 1 for simultaneous and hierarchical logistic regression and 50 to 1 for stepwise logistic regression. Sample size requirements
  25. 25. Assumptions: 1. Assumes a linear relationship between the LOGIT of the DV and the IVs; however, it does not assume a linear relationship between the actual dependent and independent variables. 2. The sample is ‘large’: reliability of estimation declines when there are only a few cases, and a minimum of 50 cases per predictor is recommended. 3. IVs are not linear functions of each other. 4. A normal distribution is not necessary or assumed for the dependent variable. 5. Homoscedasticity is not necessary for each level of the independent variables.
  26. 26. Figure: the logistic distribution. P(Y=1) is an S-shaped function of x; transformed, however, the “log odds”, ln[p/(1-p)], are linear in x.
  27. 27. Interpreting log odds and the odds ratio: in SPSS the b coefficients are located in column ‘B’ of the ‘Variables in the Equation’ table. Logistic regression calculates changes in the log odds of the dependent, not changes in the dependent value itself. The odds can range from 0 to infinity and tell you how much more likely it is that an observation is a member of the target group rather than a member of the other group. SPSS calculates the antilog of each B, the odds ratio, and presents it as EXP(B) in the same ‘Variables in the Equation’ table.
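A hedged sketch of the B-to-EXP(B) relationship; the coefficient values below are hypothetical (chosen only so that the family-size EXP(B) comes out near the value of about 11 reported later), not the actual SPSS output:

```python
import math

# Hypothetical 'B' values as they might appear in the SPSS
# 'Variables in the Equation' table (not the actual output).
b = {"famsize": 2.40, "mortgage": -0.005, "constant": -18.0}

# EXP(B) is the antilog of each B: the factor by which the odds of the
# modelled event change for a one-unit increase in that predictor.
exp_b = {name: math.exp(value) for name, value in b.items()}
print(exp_b)   # famsize comes out near 11, mortgage near 1
```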
  28. 28. Diagnosis of LR. -2 Log-likelihood: compares the fit of two models, i.e. how well one model fits relative to the other; the lower the value, the better the fit of the alternative. Chi-square test: based on the difference between the base model and the proposed model; a larger difference (with p < 0.05) means the proposed model is better, otherwise the two models are the same. Classification table: a table showing how many observations have been predicted correctly; the higher the correct prediction, the better.
  29. 29. Likelihood Ratio Test. What is it? It checks whether the fuller model is better than the base model. Based on: the log-likelihood function (reported as -2 log-likelihood), which measures the discrepancy between the observed and predicted values. Interpretation: the lower the -2 log-likelihood, the better.
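A minimal sketch of the likelihood-ratio (model chi-square) calculation; the two -2 log-likelihood values are hypothetical, chosen so the drop roughly matches the model chi-square of about 24 with 2 df quoted later:

```python
from scipy.stats import chi2

# Hypothetical -2 log-likelihoods of the base (constant-only) model and
# the fuller model, with k extra predictors in the fuller model.
neg2ll_base, neg2ll_full, k = 55.05, 30.95, 2

# The likelihood-ratio (model chi-square) statistic is the drop in -2LL.
lr_stat = neg2ll_base - neg2ll_full        # 24.10 with 2 df
p_value = chi2.sf(lr_stat, df=k)           # upper-tail chi-square p-value

print(f"LR chi-square = {lr_stat:.2f}, df = {k}, p = {p_value:.5f}")
```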
  30. 30. Wald Test. What is it? It gives the “importance” of the contribution of each variable in the model. Based on: the chi-square distribution at 1 df. Interpretation: the higher the value, the more “important” the variable.
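A minimal sketch of a Wald test for a single coefficient; the coefficient and standard error are hypothetical values, chosen only so the p-value lands near the family-size result (p ≈ .013) reported later:

```python
from scipy.stats import chi2

# Hypothetical coefficient and standard error from the output table.
b, se = 2.40, 0.97

wald = (b / se) ** 2            # Wald statistic ~ chi-square with 1 df
p_value = chi2.sf(wald, df=1)

print(f"Wald = {wald:.2f}, p = {p_value:.3f}")   # roughly p = .013
```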
  31. 31. Measures of the Proportion of Variance. What is it? A measure of the proportion of variation explained. Based on: a comparison of the log-likelihoods of the base and proposed models. Measures: Cox & Snell’s R2 (does not attain 1 for the perfect model) and Nagelkerke’s R2 (attains 1 for the perfect model). Interpretation: the higher the better (the value lies between 0 and 1).
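A sketch of how these two pseudo R-squares can be computed from the log-likelihoods of the constant-only and fitted models; the log-likelihoods and sample size below are hypothetical, picked only to be roughly consistent with the Nagelkerke value of 0.737 quoted later:

```python
import math

# Hypothetical log-likelihoods of the constant-only (ll0) and fitted (ll1)
# models, and the sample size n.
ll0, ll1, n = -20.7, -8.7, 30

# Cox & Snell R-square: based on the likelihood ratio; its maximum is < 1.
cox_snell = 1 - math.exp((2 / n) * (ll0 - ll1))

# Nagelkerke R-square rescales Cox & Snell so a perfect model reaches 1.
max_cox_snell = 1 - math.exp((2 / n) * ll0)
nagelkerke = cox_snell / max_cox_snell

print(f"Cox & Snell R2 = {cox_snell:.3f}, Nagelkerke R2 = {nagelkerke:.3f}")
```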
  32. 32. The Hosmer-Lemeshow Goodness-of-Fit Test. What is it? It asks how well your model fits the data. Based on: comparing observed and model-predicted counts across groups of cases, producing a p-value. Interpretation: if the p-value is low (< .05), you reject the model; if it is high, the model passes the test.
  33. 33. Interpreting the Logistic Model. Logit: with a one-unit increase in x, the log(odds) of success increases by 1.3 units on average. Odds ratio: with a one-unit increase in x, the odds of success are multiplied by e^1.3, i.e. by about 3.67. Probability: the model gives the probability of success for a particular value of x.
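The slide's b = 1.3 illustration as a short Python sketch; the intercept used for the probability lines is a hypothetical value added for illustration:

```python
import math

# The slide's illustration: a logit coefficient of 1.3 for predictor x.
b = 1.3

# A one-unit increase in x adds 1.3 to the log odds, i.e. multiplies the
# odds of success by exp(1.3) ≈ 3.67.
print(round(math.exp(b), 2))

# The probability for a given x also depends on the intercept
# (a hypothetical value of -2.0 is used here).
intercept = -2.0
for x in (0, 1, 2):
    p = 1 / (1 + math.exp(-(intercept + b * x)))
    print(x, round(p, 3))
```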
  34. 34. SPSS Example: data from a survey of home owners conducted by an electricity company about an offer of roof solar panels with a 50% subsidy from the state government as part of the state’s environmental policy. The variables are household income (in thousands of dollars), age, monthly mortgage, size of family household, and whether the householder would take or decline the offer. 1. Click Analyze >> Regression >> Binary Logistic. 2. Select the grouping variable (the variable to be predicted), which must be a dichotomous measure, and place it into the Dependent box. 3. Enter your predictors (IVs) into the Covariates box; these are ‘family size’ and ‘mortgage’.
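For readers who want to reproduce this kind of analysis outside SPSS, here is a hedged Python sketch using statsmodels on made-up data; the variable names (famsize, mortgage, take_offer) and the generated values are assumptions for illustration, since the survey data are not reproduced in the slides:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Made-up data standing in for the survey; famsize, mortgage and
# take_offer are assumed variable names, not the original file.
rng = np.random.default_rng(0)
n = 30
df = pd.DataFrame({
    "famsize": rng.integers(1, 7, size=n),        # household size
    "mortgage": rng.normal(1200, 300, size=n),    # monthly mortgage
})
true_logit = -4 + 1.2 * df["famsize"] - 0.0005 * df["mortgage"]
df["take_offer"] = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# Dependent = take_offer; covariates = famsize and mortgage plus a
# constant, mirroring steps 1-3 of the SPSS dialogue described above.
X = sm.add_constant(df[["famsize", "mortgage"]])
result = sm.Logit(df["take_offer"], X).fit()

print(result.summary())          # coefficients, Wald z, p-values
print(np.exp(result.params))     # Exp(B): the odds ratios
```

statsmodels also exposes the fitted and null log-likelihoods (result.llf, result.llnull), which feed the likelihood-ratio test and pseudo R-squares described above.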
  35. 35. In SPSS, the model is always constructed to predict the group with higher numeric code. • If responses are coded 1 for Yes and 2 for No, SPSS will predict membership in the No category. • If responses are coded 1 for No and 2 for Yes, SPSS will predict membership in the Yes category. We will refer to the predicted event for a particular analysis as the modeled event.
  36. 36. Logistic regression dialogue box
  37. 37. 4. If there are any categorical predictor variables, click the “Categorical” button and enter them (there are none in this example).
  38. 38. 5. Click the Options button and select Classification plots, Hosmer-Lemeshow goodness of fit, and Casewise listing of residuals, and select Outliers outside 2 sd. Retain the default entries for probability of stepwise, classification cutoff and maximum iterations. 6. Click Continue, then OK.
  39. 39. Options dialogue box
  40. 40. Interpretation of printout tables: the first table to take note of is the Classification table in Block 0 (Beginning Block). Block 0 presents the results with only the constant included, before any coefficients (i.e. those relating to family size and mortgage) are entered into the equation. The table suggests that if we knew nothing about our variables and always guessed that a person would take the offer, we would be correct 53.3% of the time.
  41. 41. Variables not in the equation: the ‘Variables not in the Equation’ table tells us whether each IV would improve the model. The answer is yes for both variables, with family size slightly better than mortgage size; both are significant and, if included, would add to the predictive power of the model. If they had not been significant and able to contribute to the prediction, the analysis would obviously terminate at this point.
  42. 42. Model chi-square: the overall significance is tested using what SPSS calls the Model Chi-square, which is derived from the likelihood of observing the actual data under the assumption that the model that has been fitted is accurate. In our case the model chi-square has 2 degrees of freedom, a value of 24.096 and a probability of p < 0.001. The indication is that the constant-only model is a poor fit and that the predictors have a significant effect, creating an essentially different (better) model. So we need to look closely at the predictors and, from later tables, determine whether one or both are significant predictors.
  43. 43. Model Summary: Cox and Snell’s R-square attempts to imitate the multiple R-square based on the ‘likelihood’, but its maximum can be (and usually is) less than 1.0. The Nagelkerke modification, which does range from 0 to 1, is a more reliable measure of the relationship. Nagelkerke’s R2 is part of the SPSS output in the ‘Model Summary’ table and is the most-reported of the R-squared estimates. In this case it is 0.737, indicating a moderately strong relationship of 73.7% between the predictors and the prediction.
  44. 44. Examples of approximate R2 values. Figure: R2 = 1, a perfect linear relationship between x and y; 100% of the variation in y is explained by variation in x.
  45. 45. Examples of approximate R2 values. Figure: 0 < R2 < 1, a weaker linear relationship between x and y; some but not all of the variation in y is explained by variation in x.
  46. 46. Examples of approximate R2 values. Figure: R2 = 0, no linear relationship between x and y; the value of y does not depend on x (none of the variation in y is explained by variation in x).
  47. 47. Hosmer and Lemeshow statistic: if the p-value for the H-L goodness-of-fit test is greater than .05, as we want for well-fitting models, we fail to reject the null hypothesis that there is no difference between the observed and model-predicted values, implying that the model’s estimates fit the data at an acceptable level. That is, well-fitting models show non-significance on the H-L goodness-of-fit test.
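A hedged sketch of how the Hosmer-Lemeshow statistic can be computed by hand (SPSS does this internally); the hosmer_lemeshow function name and the decile grouping are illustrative choices, not SPSS's exact implementation:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Group cases by deciles of predicted risk, then compare observed
    and expected event counts in each group (chi-square, groups-2 df)."""
    order = np.argsort(y_prob)
    y_true = np.asarray(y_true)[order]
    y_prob = np.asarray(y_prob)[order]

    stat = 0.0
    for idx in np.array_split(np.arange(len(y_prob)), groups):
        observed = y_true[idx].sum()      # observed events in the group
        expected = y_prob[idx].sum()      # expected events in the group
        n_g = len(idx)
        # Pearson-style contribution covering both events and non-events
        stat += (observed - expected) ** 2 / (expected * (1 - expected / n_g))

    return stat, chi2.sf(stat, df=groups - 2)
```

Called with the model's observed outcomes and predicted probabilities, a p-value above .05 from this sketch would be read exactly as described above.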
  48. 48. Classification table: in the Classification table, the columns are the two predicted values of the dependent, while the rows are the two observed (actual) values of the dependent. In this study, 87.5% were correctly classified in the take-offer group and 92.9% in the decline-offer group; overall, 90% were correctly classified. This is a considerable improvement on the 53.3% correct classification with the constant-only model, so we know that the model with predictors is a significantly better model. The benchmark we will use to characterize a logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone.
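A short sketch of how such a classification table is built from predicted probabilities with the default 0.5 cut-off; the y_true and y_prob arrays are made-up data for illustration:

```python
import numpy as np

# Made-up observed outcomes and predicted probabilities, classified
# with the default 0.5 cut-off.
y_true = np.array([1, 1, 0, 0, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.9, 0.7, 0.2, 0.4, 0.6, 0.1, 0.8, 0.55, 0.65, 0.3])
y_pred = (y_prob >= 0.5).astype(int)

# Rows: observed (0 = decline, 1 = take); columns: predicted (0, 1).
table = np.zeros((2, 2), dtype=int)
for obs, pred in zip(y_true, y_pred):
    table[obs, pred] += 1

accuracy = np.trace(table) / table.sum()    # overall % correctly classified
print(table)
print(f"overall correct = {accuracy:.0%}")
```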
  49. 49. Variables in the Equation: in this case we note that family size contributed significantly to the prediction (p = .013) but mortgage did not (p = .075). The EXP(B) value associated with family size is 11.007: when family size is raised by one unit (one person) the odds are about 11 times as large, and therefore householders are about 11 times more likely to belong to the take-offer group.
  50. 50. Effect size: the odds ratio is a measure of effect size. The ratio of the odds ratios of the independents is the ratio of the relative importance of the independent variables in terms of their effect on the dependent variable’s odds. In this example family size is about 11 times as important as monthly mortgage in determining the decision.
  51. 51. Thank You
