Establishing association between dependent and independent variables


DISCRIMINANT ANALYSIS

is usually employed with a categorical dependent variable, where all of the predictors are continuous and nicely distributed;

LOGIT ANALYSIS

is usually employed if all of the predictors are categorical;


This assumption (homoscedasticity) means that the variance around the regression line is the same for all values of the predictor variable (X).

The likelihood-ratio test compares the maximized value of the likelihood function for the fuller model (L1) over the maximized value of the likelihood function for the simpler model (L0). This log transformation of the likelihood ratio yields a chi-squared statistic.

The Wald test divides each coefficient by its standard error to calculate a z statistic. This z value is then squared, yielding a Wald statistic with a chi-square distribution.

Wald estimates give the "importance" of the contribution of each variable in the model: the higher the value, the more "important" the variable.

- 1. Dr. Gaurav Kamboj, Dept. of Community Medicine, PGIMS, Rohtak. Logistic Regression
- 2. CONTENTS: Introduction; Types of regression; Regression line and equation; Logistic regression; Relation between probability, odds ratio and logit; Purpose; Uses; Assumptions; Logistic regression equation; Interpretation of log odds and odds ratio; Example.
- 3. REGRESSION is the measure of the average relationship between two or more variables in terms of the original units of the data. There are different types of regression. Among many types of regression, the most common in medical research is LOGISTIC REGRESSION. Introduction
- 4. The general form of each type of regression is: SIMPLE LINEAR REGRESSION uses one independent variable to explain and/or predict the outcome of Y (Y = α + βX + e); MULTIPLE LINEAR REGRESSION uses two or more independent variables to predict the outcome. Introduction
- 5. The equation of the straight line is given by the regression equation. Population regression equation: Y = α + βX + e. Sample regression equation: Y = a + bx. Here 'α' or 'a' is the intercept; 'β' or 'b' is the slope of the line, which measures the amount of change in y for a unit change in x; 'e' is the regression residual/error.
- 6. Types of Regression Models. . .
- 7. Used to analyze relationships between a CATEGORICAL dependent variable and metric or categorical independent variables. Often chosen if the predictor/independent variables are a mix of continuous and categorical variables. ln[p/(1-p)] = α + β1X1 + β2X2 + β3X3 + ... + βtXt + e. The estimated probability is: p = 1/[1 + exp(-(α + β1X1 + β2X2 + β3X3 + ... + βtXt))] • p is the probability that the event Y occurs, p(Y=1) • p/(1-p) is the "odds" of the event • ln[p/(1-p)] is the log odds, or "logit". Logistic Regression
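The logit and estimated-probability formulas on this slide can be sketched in Python; this is a minimal illustration, not part of the original slides:

```python
import math

def logit(p):
    # log odds, ln[p/(1-p)], of a probability p
    return math.log(p / (1 - p))

def predicted_probability(linear_predictor):
    # p = 1/[1 + exp(-(a + b1*x1 + ... + bt*xt))]
    return 1 / (1 + math.exp(-linear_predictor))
```

The two functions are inverses: `predicted_probability(logit(p))` returns `p` for any probability strictly between 0 and 1.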
- 8. Each predictor (IV) is given a coefficient 'b' which measures its independent contribution to variations in the DV; the DV can only take on one of two values: 0 or 1. What we want to predict from a knowledge of relevant IVs and coefficients is therefore not a numerical value of a DV as in linear regression, but rather the probability (p) that it is 1 rather than 0 (belonging to one group rather than the other). Logistic regression equation
- 9. When and why: used because having a categorical outcome variable violates the assumption of linearity in normal regression. Does not assume a linear relationship between DV and IV. Predictors do not have to be normally distributed. Logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables.
- 10. (Figure: linear regression plots marks against study hours as a straight line; logistic regression plots a pass/fail result against study hours, with passing marks as the cutoff.)
- 11. Binary logistic regression model: Used to model a binary response—e.g. yes or no. Ordinal (ordered) logistic regression model (ordinal multinomial logistic model.) Used to model an ordered response—e.g. low, medium, or high. Nominal (unordered) logistic regression model (polytomous, polychotomous, or multinomial) Used to model a multilevel response with no ordering—e.g. eye color with levels brown, green, and blue. Types Of Logistic Regression
- 12. Relation between probability, odds ratio and logit
- 13. Example: 100 participants are randomized to a new or standard treatment (50 subjects to each treatment group). Are the chances of success equal for each treatment group?
Success: New 20, Standard 10, Total 30
Failure: New 30, Standard 40, Total 70
Total: New 50, Standard 50, Total 100
- 14. How to measure the chances of success? The probability of success: Pnew = Pr(success | new treatment) = 20/50 = 40%; Pst = Pr(success | std. treatment) = 10/50 = 20%. The odds of success: Onew = Pnew/(1-Pnew) = 20/30 = 0.67; Ost = Pst/(1-Pst) = 10/40 = 0.25. The natural logarithm of the odds of success (= LOGIT): LOGITnew = log(20/30) = -0.41 (new treatment); LOGITst = log(10/40) = log(0.25) = -1.39 (std. treatment).
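The probabilities, odds, and logits in this worked example can be reproduced directly:

```python
import math

p_new = 20 / 50               # probability of success, new treatment: 0.40
p_std = 10 / 50               # probability of success, standard treatment: 0.20
o_new = p_new / (1 - p_new)   # odds of success, new: 20/30 ≈ 0.67
o_std = p_std / (1 - p_std)   # odds of success, standard: 10/40 = 0.25
logit_new = math.log(o_new)   # ≈ -0.41
logit_std = math.log(o_std)   # ≈ -1.39
odds_ratio = o_new / o_std    # ≈ 2.67
```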
- 15. OR = Onew/Ost = (20/30)/(10/40) = 0.67/0.25 = 2.67. If OR = 1 then the success chances are the same in each group, which means Pnew = Pst or Onew = Ost. The null hypothesis is H0: OR = 1 vs the alternative Ha: OR ≠ 1. In this case, the odds of success are 2.67 times higher for the new treatment compared to the standard one. The odds ratio is one way to capture inequality in the chances of success.
- 16. The probability of success can be represented via odds or LOGITs of success. From the above example, LOGITnew = -0.41 (new treatment) and LOGITst = -1.39 (standard treatment), so the difference between the log odds = 0.98. We can combine these two log odds for different groups into one formula: log(odds) = -1.39 + 0.98*(treatment is new) (an example of simple logistic regression). Simple logistic regression
- 17. In this logistic regression -1.39 and 0.98 are regression coefficients -1.39 is called the model intercept 0.98 is the treatment effect or the difference between LOGITs Simple logistic regression
- 18. LOGIT = -1.39 + 0.98*(treatment is new). If treatment is 'standard' then LOGIT = -1.39 + 0.98*0 = -1.39, odds = Ost = exp(-1.39) = 0.25 and Pst = 20%. If treatment is 'new' then LOGIT = -1.39 + 0.98*1 = -0.41, odds = Onew = exp(-0.41) = 0.67 and Pnew = 40%. Simple logistic regression
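The back-transformation on this slide, from LOGIT to odds to probability, can be checked numerically (the coefficients are taken from the treatment example above):

```python
import math

intercept, treatment_effect = -1.39, 0.98  # LOGIT coefficients from the example

probs = []
for is_new in (0, 1):                      # 0 = standard treatment, 1 = new treatment
    log_odds = intercept + treatment_effect * is_new
    odds = math.exp(log_odds)              # 0.25 (standard) and 0.67 (new), rounded
    probs.append(odds / (1 + odds))        # convert odds back to a probability
```

Rounded, the recovered probabilities are 20% for the standard treatment and 40% for the new one, matching the observed success rates.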
- 19. If we apply the antilog to 0.98 then exp(0.98) ≈ 2.67, the odds ratio! This 2.67 is different from 1, which means we have a significant increase in the odds of treatment success (the chi-square p-value was < 5%). Simple logistic regression
- 20. The crucial limitation of linear regression is that it cannot deal with DVs that are dichotomous and categorical. Logistic regression employs binomial probability theory in which there are only two values to predict: that the probability (p) is 1 rather than 0, i.e. the event/person belongs to one group rather than the other. Logistic regression forms a best-fitting equation or function using the maximum likelihood method, which maximizes the probability of classifying the observed data into the appropriate category given the regression coefficients. Purpose of logistic regression
- 21. Like ordinary regression, logistic regression provides a coefficient ‘b’, which measures each IV’s partial contribution to variations in the DV. To accomplish this goal, a model (i.e. an equation) is created that includes all predictor variables that are useful in predicting the response variable. Variables can, if necessary, be entered into the model in the order specified by the researcher in a stepwise fashion like regression. Purpose of logistic regression
- 22. The first is the prediction of group membership. Since logistic regression calculates the probability of success over the probability of failure, the results of the analysis are in the form of an ODDS RATIO. It also provides knowledge of the relationships and strengths among the variables (e.g. marrying the boss’s daughter puts you at a higher probability for job promotion than undertaking five hours unpaid overtime each week). Uses of logistic regression
- 23. Methods. Simultaneous method: all independents are included at the same time. Hierarchical method: variables entered in blocks; blocks should be based on past research, or the theory being tested (a good method). Stepwise method (forward conditional in SPSS): variables are selected in the order in which they maximize the statistically significant contribution to the model. Binary Logistic Regression
- 24. The minimum number of cases per independent variable is 10. For preferred case-to-variable ratios, we will use 20 to 1 for simultaneous and hierarchical logistic regression and 50 to 1 for stepwise logistic regression. Sample size requirements
- 25. 1. Assumes a linear relationship between the LOGIT of the DV and the IVs; however, it does not assume a linear relationship between the actual dependent and independent variables. 2. The sample is 'large': reliability of estimation declines when there are only a few cases, and a minimum of 50 cases per predictor is recommended. 3. IVs are not linear functions of each other. 4. A normal distribution is not necessary or assumed for the dependent variable. 5. Homoscedasticity is not necessary for each level of the independent variables. Assumptions
- 26. (Figure: the logistic distribution; P(Y=1) plotted against x is S-shaped. Transformed, however, the log odds ln[p/(1-p)] are linear in x.)
- 27. In SPSS the b coefficients are located in column 'B' in the 'Variables in the Equation' table. Logistic regression calculates changes in the log odds of the dependent, not changes in the dependent value. The odds value can range from 0 to infinity and tells you how much more likely it is that an observation is a member of the target group rather than a member of the other group. SPSS calculates this odds ratio, exp(B), for us and presents it as EXP(B) in the results printout in the 'Variables in the Equation' table. Interpreting log odds and the odds ratio
- 28. Diagnosis of LR. -2 Log likelihood: compares the fit of two models, i.e. how well one model fits compared to the other; the lower the value, the better the fit of the alternative. Chi-square test: the difference between the base model and the proposed model; a larger difference is better, and p < 0.05 means the alternative is better (otherwise both models are the same). Classification table: a table showing how many observations have been predicted correctly; the higher the correct prediction, the better.
- 29. Likelihood Ratio Test. What is it? It checks whether the fuller model is better than the base model. Based on: the log-likelihood function (-2 log likelihood), which measures the discrepancy between the observed and predicted values. Interpretation: the lower the value, the better.
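The -2 log-likelihood quantity above can be sketched for binary outcomes as follows; this is an illustrative helper, not SPSS output:

```python
import math

def neg2_log_likelihood(outcomes, probabilities):
    # -2 * log-likelihood of observed 0/1 outcomes given the model's
    # predicted probabilities of success; lower values mean a better fit
    ll = sum(math.log(p) if y == 1 else math.log(1 - p)
             for y, p in zip(outcomes, probabilities))
    return -2 * ll
```

The likelihood-ratio chi-square is then the difference in -2 log-likelihood between the base model and the fuller model.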
- 30. Wald Test. What is it? It gives the "importance" of the contribution of each variable in the model. Based on: a chi-square distribution at 1 df. Interpretation: the higher the value, the more "important" the variable.
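The Wald statistic is just the squared z value of a coefficient over its standard error. In this sketch the coefficient 0.98 comes from the treatment example; the standard error of 0.40 is a hypothetical value chosen only for illustration:

```python
b, se = 0.98, 0.40  # coefficient from the example; SE is hypothetical
z = b / se          # z statistic
wald = z ** 2       # Wald statistic, compared against a chi-square with 1 df
```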
- 31. Measures of the Proportion of Variance: Cox & Snell's R² and Nagelkerke's R². What are they? Measures of the proportion of variation explained. Based on: a comparison of the log-likelihoods of the base and proposed models. Interpretation: the higher the better (the value is between 0 and 1); Cox & Snell's R² does not attain 1 for a perfect model, while Nagelkerke's R² does.
- 32. The Hosmer-Lemeshow Goodness-of-Fit Test. What is it? A test of how well your model fits the data. Based on: a statistic that produces a p-value. Interpretation: if the p-value is low (< .05), you reject the model; if it is high, your model passes the test.
- 33. Interpreting the Logistic Model. Logit: with a one-unit increase in x, the log odds of success will increase by 1.3 units on average. Odds ratio: with a one-unit increase in x, the odds of success will increase by a factor of e^1.3 ≈ 3.67. Probability: the model gives the probability of success for a particular value of x.
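The conversion from the logit scale to the odds-ratio scale above is just exponentiation (1.3 is the slide's illustrative coefficient):

```python
import math

b = 1.3                    # the slide's example logit coefficient
or_per_unit = math.exp(b)  # odds ratio per one-unit increase in x, ≈ 3.67
```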
- 34. Data from a survey of home owners conducted by an electricity company about an offer of roof solar panels with a 50% subsidy from the state government as part of the state's environmental policy. The variables involve household income measured in units of a thousand dollars, age, monthly mortgage, size of family household, and whether the householder would take or decline the offer. 1. Click Analyze >> Regression >> Binary Logistic. 2. Select the grouping variable (the variable to be predicted), which must be a dichotomous measure, and place it into the Dependent box. 3. Enter your predictors (IVs) into the Covariates box. These are 'family size' and 'mortgage'. SPSS Example
- 35. In SPSS, the model is always constructed to predict the group with higher numeric code. • If responses are coded 1 for Yes and 2 for No, SPSS will predict membership in the No category. • If responses are coded 1 for No and 2 for Yes, SPSS will predict membership in the Yes category. We will refer to the predicted event for a particular analysis as the modeled event.
- 36. Logistic regression dialogue box
- 37. 4. If there are any categorical predictor variables, click the "Categorical" button and enter them (there are none in this example).
- 38. 5. Click on the Options button and select Classification Plots, Hosmer-Lemeshow Goodness of Fit, and Casewise Listing of Residuals, and select Outliers Outside 2 SD. Retain the default entries for probability of stepwise, classification cutoff and maximum iterations. 6. Continue, then OK.
- 39. Option dialogue box
- 40. The first table to take note of is the Classification table in Block 0: Beginning Block. Block 0 presents the results with only the constant included, before any coefficients (i.e. those relating to family size and mortgage) are entered into the equation. The table suggests that if we knew nothing about our variables and guessed that a person would take the offer, we would be correct 53.3% of the time. Interpretation of printout tables
- 41. The Variables Not in the Equation table tells us whether each IV improves the model. The answer is yes for both variables, with family size slightly better than mortgage size, as both are significant and if included would add to the predictive power of the model. If they had not been significant and able to contribute to the prediction, then termination of the analysis would obviously occur at this point. Variables not in the equation
- 42. The overall significance is tested using what SPSS calls the Model Chi-square, which is derived from the likelihood of observing the actual data under the assumption that the model that has been fitted is accurate. In our case the model chi-square has 2 degrees of freedom, a value of 24.096 and a probability of p < 0.001. Thus the indication is that the model with the predictors fits significantly better than the model containing only the constant, indicating that the predictors do have a significant effect and create essentially a different model. So we need to look closely at the predictors and from later tables determine if one or both are significant predictors. Model chi-square
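The reported model chi-square can be checked against its p-value without a statistics library, because for exactly 2 degrees of freedom the chi-square survival function has the closed form exp(-x/2):

```python
import math

chi_square, df = 24.096, 2   # model chi-square and df reported by SPSS
assert df == 2               # the closed form below holds only for 2 df
# p-value = survival function of a chi-square distribution with 2 df
p_value = math.exp(-chi_square / 2)
```

The result is on the order of 10^-6, well below 0.001, consistent with the significant model chi-square reported above.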
- 43. Cox and Snell's R-square attempts to imitate multiple R-square based on 'likelihood', but its maximum can be (and usually is) less than 1.0. The Nagelkerke modification, which does range from 0 to 1, is a more reliable measure of the relationship. Nagelkerke's R² is part of the SPSS output in the 'Model Summary' table and is the most-reported of the R-squared estimates. In this case it is 0.737, indicating a moderately strong relationship of 73.7% between the predictors and the prediction. Model Summary
- 44. (Figure: examples of approximate R² values. R² = 1: a perfect linear relationship between x and y; 100% of the variation in y is explained by variation in x.)
- 45. (Figure: 0 < R² < 1: a weaker linear relationship between x and y; some but not all of the variation in y is explained by variation in x.)
- 46. (Figure: R² = 0: no linear relationship between x and y; the value of y does not depend on x, so none of the variation in y is explained by variation in x.)
- 47. If the p-value for the H-L goodness-of-fit test is greater than .05, as we want for well-fitting models, we fail to reject the null hypothesis that there is no difference between observed and model-predicted values, implying that the model's estimates fit the data at an acceptable level. That is, well-fitting models show non-significance on the H-L goodness-of-fit test. Hosmer and Lemeshow statistic
- 48. In the Classification table, the columns are the two predicted values of the dependent, while the rows are the two observed (actual) values of the dependent. In this study, 87.5% were correctly classified for the take-offer group and 92.9% for the decline-offer group. Overall, 90% were correctly classified. This is a considerable improvement on the 53.3% correct classification with the constant model, so we know that the model with predictors is a significantly better model. The benchmark that we will use to characterize a logistic regression model as useful is a 25% improvement over the rate of accuracy achievable by chance alone. Classification table
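The classification percentages above are consistent with the following 30-case reconstruction; these counts are inferred from the reported percentages, not taken from the original data:

```python
# counts inferred from the reported percentages (assumed n = 30 homeowners)
correct_take, total_take = 14, 16        # 14/16 = 87.5% of takers correct
correct_decline, total_decline = 13, 14  # 13/14 ≈ 92.9% of decliners correct
overall = (correct_take + correct_decline) / (total_take + total_decline)
# constant-only model: always guess the larger (take-offer) group
baseline = total_take / (total_take + total_decline)
```

With these counts, overall accuracy is 27/30 = 90% and the constant-only baseline is 16/30 ≈ 53.3%, matching the slide.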
- 49. In this case, we note that family size contributed significantly to the prediction (p = .013) but mortgage did not (p = .075). The EXP(B) value associated with family size is 11.007. Hence when family size is raised by one unit (one person) the odds are 11 times as large, and therefore householders are 11 times more likely to belong to the take-offer group. Variables in the Equation
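EXP(B) acts multiplicatively on the odds: each extra family member multiplies the odds of taking the offer by 11.007. The baseline odds below are a hypothetical value chosen only for illustration:

```python
exp_b = 11.007                 # EXP(B) for family size, from the SPSS output
baseline_odds = 0.5            # hypothetical baseline odds, illustration only
odds_plus_one = baseline_odds * exp_b       # odds with one extra family member
odds_plus_two = baseline_odds * exp_b ** 2  # odds with two extra family members
```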
- 50. The odds ratio is a measure of effect size. The ratio of odds ratios of the independents is the ratio of relative importance of the independent variables in terms of effect on the dependent variable’s odds. In this example family size is 11 times as important as monthly mortgage in determining the decision. Effect size
- 51. Thank You
