
# Linear regression without tears

### Linear regression without tears

1. Digg Data: Linear Regression without Tears. Ankit Sharma, Digg Data, www.diggdata.in
2. Content

    - What is Regression Analysis
    - When to use regression
    - Intuition behind linear regression (machine learning)
    - Simple Linear Regression
    - Multivariate Linear Regression
    - Performance Analysis
    - ANOVA
    - Goodness of fit
    - Confidence & Prediction bands
    - Assumptions

    Friday, November 22, 2013 WITHOUT TEARS SERIES, DIGG DATA
3. What is Regression Analysis? In statistics, regression analysis is a statistical process for estimating the relationships among variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. Regression analysis is widely used for prediction and forecasting. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships.
4. When to use regression? Regression analysis is used to describe the relationship between:

    - a single response variable, Y; and
    - one or more predictor variables, X1, X2, …, Xp:
        - p = 1: simple regression
        - p > 1: multiple regression (called "multivariate regression" in these slides)

    The response variable Y must be a continuous variable. The predictor variables X1, …, Xp can be continuous, discrete or categorical.
5. The Meaning of the Term "Linear"

    - Linearity in the variables: The first meaning of linearity is that the conditional expectation of Y, $E(Y \mid X_i)$, is a linear function of $X_i$; the regression curve in this case is a straight line. Under this meaning, $E(Y \mid X_i) = \beta_1 + \beta_2 X_i^2$ is not a linear function, because $X_i$ appears squared.
    - Linearity in the parameters: The second interpretation of linearity is that the conditional expectation of Y, $E(Y \mid X_i)$, is a linear function of the parameters, the β's; it may or may not be linear in the variable X. Under this meaning, $E(Y \mid X_i) = \beta_1 + \beta_2 X_i^2$ is a linear (in the parameters) regression model.

    All the models shown in the figure are thus linear regression models, that is, models linear in the parameters.
6. The Meaning of the Term "Linear" (contd.) Now consider the model $E(Y \mid X_i) = \beta_1 + \beta_2^2 X_i$. This is an example of a nonlinear (in the parameters) regression model, because $\beta_2$ enters squared. From now on the term "linear" regression will always mean a regression that is linear in the parameters, the β's (that is, the parameters are raised to the first power only).
7. Intuition: Linear Regression
8. Hypothesis (for one variable). [Figure: the training set plotted as price (in 1000s of dollars) against size (feet²). Diagram: the training set feeds a learning algorithm, which outputs a hypothesis h that maps the size of a house to an estimated price.]
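9. Cost function

    - Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
    - How to choose the $\theta$'s?
    - Cost function: $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x_i) - y_i\right)^2$
    - Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$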
10. Left panel: $h_\theta(x)$ (for fixed $\theta_1$, this is a function of x). Right panel: $J(\theta_1)$ (a function of the parameter $\theta_1$). For $\theta_1 = 1$ and the data points (1, 1), (2, 2), (3, 3): $J(1) = \frac{1}{2m}\left[(1-1)^2 + (2-2)^2 + (3-3)^2\right] = 0$
11. Left panel: $h_\theta(x)$ (for fixed $\theta_1$, this is a function of x). Right panel: $J(\theta_1)$ (a function of the parameter $\theta_1$). For $\theta_1 = 0.5$ and the same data points, with m = 3: $J(0.5) = \frac{1}{2m}\left[(0.5-1)^2 + (1-2)^2 + (1.5-3)^2\right] = \frac{3.5}{6} \approx 0.58$
12. Left panel: $h_\theta(x)$ (for fixed $\theta_1$, this is a function of x). Right panel: $J(\theta_1)$ (a function of the parameter $\theta_1$). The goal is $\min_{\theta_1} J(\theta_1)$.
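The toy calculation on the three preceding slides can be reproduced in a few lines of R. This is a minimal sketch, assuming the data points (1, 1), (2, 2), (3, 3) implied by the terms in the sums and a hypothesis without intercept, h(x) = θ1·x; the names `cost`, `theta1_grid` and `best` are illustrative and do not come from the slides.

```r
# Toy training set implied by the slides (assumption): y equals x at x = 1, 2, 3
x <- c(1, 2, 3)
y <- c(1, 2, 3)
m <- length(x)

# Cost J(theta1) = 1/(2m) * sum((theta1 * x - y)^2), hypothesis without intercept
cost <- function(theta1) sum((theta1 * x - y)^2) / (2 * m)

cost(1)    # zero: the line passes through every point
cost(0.5)  # 3.5 / 6, roughly 0.58

# Evaluate J on a grid (the curve in the right-hand panel) and plot it
theta1_grid <- seq(-0.5, 2.5, by = 0.01)
J <- sapply(theta1_grid, cost)
plot(theta1_grid, J, type = "l", xlab = "theta1", ylab = "J(theta1)")

# Locate the minimizer numerically
best <- optimize(cost, interval = c(-0.5, 2.5))$minimum
```

Because J is a parabola in θ1, a one-dimensional optimizer is enough here; with two parameters the same idea leads to the contour plot on the next slide.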
13. Contour plot
14. Linear Regression in R: Single Predictor
15. Data cleaning & preprocessing. Prior to any analysis, the data should always be inspected for:

    - Data-entry errors
    - Missing values
    - Outliers
    - Unusual distributions
    - Changes in variability
    - Clustering
    - Non-linear bivariate relationships
    - Unexpected patterns
    - …

    Numerical summaries:

    - 5-number summaries
    - Correlations
    - …

    Graphical summaries:

    - Boxplots
    - Histograms
    - Scatterplots
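The inspection steps listed above map onto a handful of base R calls. The sketch below uses R's built-in `cars` data set (columns `speed` and `dist`) as a stand-in, which is an assumption made for illustration; the names `n_missing` and `r_xy` are hypothetical.

```r
# Built-in 'cars' data stands in for the data set under study (assumption)
data(cars)

n_missing <- sum(is.na(cars))        # count missing values
summary(cars)                        # five-number summary plus mean, per column
r_xy <- cor(cars$speed, cars$dist)   # linear correlation between the two columns

boxplot(cars)                        # spread and potential outliers
hist(cars$dist)                      # shape of the distribution
plot(dist ~ speed, data = cars)      # bivariate relationship, clustering, patterns
```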
16. Simple Linear Regression

    Objective: Describe the relationship between two variables, say X and Y, as a straight line; that is, Y is modeled as a linear function of X.

    The variables:

    - X: explanatory variable (horizontal axis)
    - Y: response variable (vertical axis)

    After data collection, we have pairs of observations (X1, Y1), …, (Xn, Yn):

    | X | Y |
    |---|---|
    | X1 | Y1 |
    | X2 | Y2 |
    | … | … |
    | Xn | Yn |
17. Simple LR model

    The regression of variable Y on variable X is given by:

    $$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \dots, n$$

    where:

    - Random error: $\epsilon_i \sim N(0, \sigma^2)$, independent
    - Linear function: $\beta_0 + \beta_1 x_i = E(Y \mid X = x_i)$
    - Unknown parameters:
        - $\beta_0$ (intercept): point at which the line intercepts the y-axis
        - $\beta_1$ (slope): increase in Y per unit change in X

    Estimation of unknown parameters: We want to find the equation of the line that "best" fits the data. It means finding $\beta_0$ and $\beta_1$ such that the fitted values of $y_i$, given by $\hat{y}_i = \beta_0 + \beta_1 x_i$, are as "close" as possible to the observed values $y_i$.

    Residuals: The difference between the observed value $y_i$ and the fitted value $\hat{y}_i$ is called the residual and is given by $e_i = y_i - \hat{y}_i$.

    Least squares method: A usual way of calculating $\beta_0$ and $\beta_1$ is based on the minimization of the sum of the squared residuals, or residual sum of squares (RSS):

    $$RSS = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - \beta_0 - \beta_1 x_i)^2$$
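Minimizing the RSS has a well-known closed-form solution for a single predictor: the slope is the ratio of the sum of cross-products to the sum of squares of x, and the intercept makes the line pass through the point of means. The sketch below checks this against `lm()`; it uses the built-in `cars` data as a stand-in (an assumption), and `b0`, `b1` and `rss` are illustrative names.

```r
# Built-in 'cars' data stands in for the data set under study (assumption)
data(cars)
x <- cars$speed
y <- cars$dist

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Residual sum of squares at the minimizing values
rss <- sum((y - b0 - b1 * x)^2)

# The same fit through lm(); coef(fit) agrees with c(b0, b1)
fit <- lm(dist ~ speed, data = cars)
coef(fit)
```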
18. Simple LR in R

    ```r
    > # Download the data from a URL
    > production <- read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/production.txt", header=T, sep="")
    > # Analyze the data
    > head(production)
      Case RunTime RunSize
    1    1     195     175
    2    2     215     189
    3    3     243     344
    4    4     162      88
    5    5     185     114
    6    6     231     338
    > table(is.na(production))

    FALSE
       60
    > str(production)
    'data.frame':   20 obs. of  3 variables:
     $ Case   : int  1 2 3 4 5 6 7 8 9 10 ...
     $ RunTime: int  195 215 243 162 185 231 234 166 253 196 ...
     $ RunSize: int  175 189 344 88 114 338 271 173 284 277 ...
    > attach(production)
    The following object is masked from production (position 3):

        Case, RunSize, RunTime
    > # Let's plot the data
    > plot(RunTime~RunSize)
    > # Fit the regression model using lm()
    > production.lm <- lm(RunTime~RunSize, data=production)
    > # Use the function summary() to get some results
    > summary(production.lm)

    Call:
    lm(formula = RunTime ~ RunSize, data = production)

    Residuals:
        Min      1Q  Median      3Q     Max
    -28.597 -11.079   3.329   8.302  29.627

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept) 149.74770    8.32815   17.98 6.00e-13 ***
    RunSize       0.25924    0.03714    6.98 1.61e-06 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 16.25 on 18 degrees of freedom
    Multiple R-squared:  0.7302,    Adjusted R-squared:  0.7152
    F-statistic: 48.72 on 1 and 18 DF,  p-value: 1.615e-06

    > # Plot a line fitting the model
    > abline(production.lm)
    > production <- data.frame(production, fitted.value=fitted(production.lm), residual=resid(production.lm))
    > head(production)
      Case RunTime RunSize fitted.value    residual
    1    1     195     175     195.1152  -0.1152469
    2    2     215     189     198.7447  16.2553496
    3    3     243     344     238.9273   4.0726679
    4    4     162      88     172.5611 -10.5610965
    5    5     185     114     179.3014   5.6985827
    6    6     231     338     237.3719  -6.3718734
    ```
19. Multivariate Linear Regression
20. Multivariate Linear Regression

    Objective: Generalize the simple regression methodology in order to describe the relationship between a response variable Y and a set of predictors X1, X2, …, Xp in terms of a linear function.

    The variables:

    - X1, …, Xp: explanatory variables
    - Y: response variable

    After data collection, we have n sets of observations (X11, …, X1p, Y1), …, (Xn1, …, Xnp, Yn):

    | X1 | … | Xp | Y |
    |---|---|---|---|
    | X11 | … | X1p | Y1 |
    | X21 | … | X2p | Y2 |
    | … | … | … | … |
    | Xn1 | … | Xnp | Yn |
21. Polynomial regression. [Figure: price (y) plotted against size (x).]
22. Multivariate LR model

    The model is given by:

    $$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i, \quad i = 1, \dots, n$$

    where:

    - Random error: $\epsilon_i \sim N(0, \sigma^2)$, independent
    - Linear function: $\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} = E(y \mid x_{i1}, \dots, x_{ip})$
    - Unknown parameters:
        - $\beta_0$: intercept (the expected value of y when all predictors are zero)
        - $\beta_k$: regression coefficient of the k-th predictor

    Estimation of unknown parameters: We want to find the linear function that "best" fits the data. It means finding $\beta_0$ and the $\beta_k$ such that the fitted values of $y_i$, given by $\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$, are as "close" as possible to the observed values $y_i$.

    Residuals: The difference between the observed value $y_i$ and the fitted value $\hat{y}_i$ is called the residual and is given by $e_i = y_i - \hat{y}_i$.

    Least squares method: A usual way of calculating $\beta_0, \beta_1, \dots, \beta_p$ is based on the minimization of the sum of the squared residuals, or residual sum of squares (RSS):

    $$RSS = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2$$
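With several predictors, the least squares coefficients solve the normal equations (X'X)β = X'y, where X is the design matrix with a leading column of ones. The sketch below illustrates this on R's built-in `mtcars` data; the choice of data set and of the predictors `wt` and `hp` is an assumption made for illustration only.

```r
# Built-in 'mtcars' data; the response and predictors are chosen for illustration
data(mtcars)

# Design matrix: a column of ones (intercept) followed by the predictors
X <- model.matrix(~ wt + hp, data = mtcars)
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# The same model through lm(), which uses a QR decomposition internally
fit <- lm(mpg ~ wt + hp, data = mtcars)
cbind(normal_equations = as.vector(beta_hat), lm = coef(fit))
```

In practice `lm()` is preferable to solving the normal equations directly, because the QR route is numerically more stable when predictors are nearly collinear.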
23. Performance measurement
24. Analysis of Variance (ANOVA). The total sample variability (TSS) splits into two parts, TSS = SSreg + RSS:

    - SSreg: variability explained by the model
    - RSS: unexplained (or error) variability

    ```r
    > anova(production.lm)
    Analysis of Variance Table

    Response: RunTime
              Df  Sum Sq Mean Sq F value    Pr(>F)
    RunSize    1 12868.4 12868.4  48.717 1.615e-06 ***
    Residuals 18  4754.6   264.1
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    ```

    The ANOVA table gives us the following information:

    - The degrees of freedom
    - The sum of squares
    - The mean square
    - The F ratio
    - The p-value
25. ANOVA (contd.)

    - Select a model: y = β0 + β1x1 + β2x2 + β3x3 + … + ε
    - Use sample data to estimate the unknown parameters
    - Evaluate how useful the model is

    If we want to test the usefulness of a particular term in our model, we would perform a t-test and look at the p-value for that term. However, if we wanted to test whether any of the terms in our model are useful in predicting y, we would use the F-test. The F-test is a test of the hypothesis:

    - H0: β1 = β2 = … = βk = 0
    - H1: At least one of the coefficients is non-zero

    Note 1: H0 always includes all of the parameters except the y-intercept β0.

    Note 2: this test has the general set-up

    - H0: None of the explanatory variables is helping
    - H1: At least one of the explanatory variables is helping

    which shares the general format seen throughout the last couple of chapters:

    - H0: Model not useful
    - H1: Model useful

    Once we know the test statistic of our F-test, we will often want to determine whether it is significant. As in all our tests, if our test statistic is more extreme (i.e. greater) than our critical value, we reject H0. By rejecting H0 we are saying that our model is significantly better than just estimating y with avg(y).
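The F ratio and its p-value can be recomputed by hand from the sums of squares and degrees of freedom printed in the ANOVA table for the production data. The sketch below does this with base R's `pf()` and `qf()`; the 5% significance level is an assumption, and the variable names are illustrative.

```r
# Values copied from the ANOVA table for the production data
ss_reg <- 12868.4; df_reg <- 1    # explained by the model
rss    <- 4754.6;  df_res <- 18   # unexplained (residual)

# F ratio: mean square of the model over mean square of the residuals
f_stat <- (ss_reg / df_reg) / (rss / df_res)

# p-value: upper tail of the F distribution with (df_reg, df_res) degrees of freedom
p_value <- pf(f_stat, df_reg, df_res, lower.tail = FALSE)

# Critical value at the 5% level (assumed) and the resulting decision
f_crit <- qf(0.95, df_reg, df_res)
reject_h0 <- f_stat > f_crit
```

Here `f_stat` reproduces the 48.717 shown in the table, and H0 is rejected because the statistic is far above the critical value.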
26. The Coefficient of Correlation

    - The correlation coefficient (denoted r) is a measure of the strength of the linear relationship between x and y. It will always be between -1 and 1.
    - If r is near -1 or 1, then there is a strong linear relationship.
    - If r is near 0, then there is little or no linear relationship.
    - A positive correlation occurs when an increase in one variable is typically accompanied by an increase in the other variable.
    - A negative correlation occurs when an increase in one variable is typically accompanied by a decrease in the other variable.

    $$r = \frac{SS_{XY}}{\sqrt{SS_{XX}\, SS_{YY}}}$$
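The correlation coefficient is the sum of cross-products of the centered variables divided by the square root of the product of their sums of squares. The sketch below computes it that way and compares the result with `cor()`; the built-in `cars` data is a stand-in chosen for illustration (an assumption).

```r
# Built-in 'cars' data stands in for the data set under study (assumption)
data(cars)
x <- cars$speed
y <- cars$dist

# Sums of squares and cross-products of the centered variables
ss_xy <- sum((x - mean(x)) * (y - mean(y)))
ss_xx <- sum((x - mean(x))^2)
ss_yy <- sum((y - mean(y))^2)

# Pearson correlation coefficient; agrees with cor(x, y)
r <- ss_xy / sqrt(ss_xx * ss_yy)
```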
27. Measuring Goodness of Fit

    Coefficient of determination, r²:

    - Represents the proportion of the total sample variability explained by the regression model.
    - For simple linear regression, the r² statistic corresponds to the square of the correlation between Y and X.
    - Indicates how well the model fits the data.

    $$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$$

    Adjusted r² (r²adj):

    - The adjusted r² takes into account the number of degrees of freedom and is preferable to r².

    Important note: Neither r² nor r²adj gives a direct indication of how well the model will perform in the prediction of a new observation.

    About 100(r²)% of the sample variation in y can be explained by (or attributed to) using x to predict y in the straight-line model. Ideally r² will be close to 1.
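Both goodness-of-fit measures can be recovered from the sums of squares in the ANOVA table for the production data (n = 20 observations, one predictor). The sketch below uses the standard degrees-of-freedom correction for the adjusted value; the variable names are illustrative.

```r
# Values copied from the ANOVA table for the production data
ss_reg <- 12868.4          # variability explained by the model
rss    <- 4754.6           # unexplained variability (SSE)
tss    <- ss_reg + rss     # total sample variability (SS_yy)
n <- 20                    # number of observations
p <- 1                     # number of predictors

# Coefficient of determination
r2 <- 1 - rss / tss

# Adjusted version: each sum of squares is divided by its degrees of freedom
r2_adj <- 1 - (rss / (n - p - 1)) / (tss / (n - 1))
```

These reproduce the `Multiple R-squared: 0.7302` and `Adjusted R-squared: 0.7152` lines of the `summary()` output for the same model.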
28. Confidence & Prediction bands

    - Confidence bands reflect the uncertainty about the regression line (how well the line is determined).
    - Prediction bands also include the uncertainty about future observations.

    Attention: These limits rely strongly on the assumption of normally distributed errors with constant variance and should not be used if this assumption is violated for the data being analyzed.

    [Figure: RunTime plotted against RunSize, with the fitted line, confidence bands and prediction bands.]

    ```r
    > predict(production.lm, interval="confidence")
            fit      lwr      upr
    1  195.1152 187.2000 203.0305
    2  198.7447 191.0450 206.4443
    …
    20 167.3762 154.4448 180.3077
    > predict(production.lm, interval="prediction")
            fit      lwr      upr
    1  195.1152 160.0646 230.1659
    2  198.7447 163.7421 233.7472
    …
    20 167.3762 130.8644 203.8881
    ```

    ```r
    # Create a new data frame containing the values of X
    # at which we want the predictions to be made
    pred.frame <- data.frame(RunSize=seq(55, 345, by=10))
    # Confidence bands
    pc <- predict(production.lm, interval="confidence", newdata=pred.frame)
    # Prediction bands
    pp <- predict(production.lm, interval="prediction", newdata=pred.frame)
    require(graphics)
    # Standard scatterplot with extended limits; the y-limits must cover
    # the response (RunTime) and the prediction bands
    plot(RunSize, RunTime, ylim=range(RunTime, pp, na.rm=T))
    pred.Size <- pred.frame$RunSize
    # Add curves: fitted line plus lower and upper bands
    matlines(pred.Size, pc, lty=c(1,2,2), lwd=1.5, col=1)
    matlines(pred.Size, pp, lty=c(1,3,3), lwd=1.5, col=1)
    ```
29. Validity of regression model. For all data sets, the fitted regression is the same: $\hat{y} = 3.0 + 0.5x$. All models have r² = 0.67 and $\hat{\sigma} = 1.24$, and the slope coefficients are significant at the < 1% level.
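The figures quoted on this slide match Anscombe's quartet, which ships with R as the `anscombe` data frame (columns `x1` to `x4` and `y1` to `y4`). Assuming that is the data behind the missing figure, the sketch below fits the four regressions and collects the statistics that turn out to be nearly identical; `fits`, `coefs`, `r2` and `sigma_hat` are illustrative names.

```r
# Anscombe's quartet: four data sets with almost identical regression output
data(anscombe)

# Fit y_k ~ x_k for k = 1, ..., 4
fits <- lapply(1:4, function(k) {
  lm(reformulate(paste0("x", k), response = paste0("y", k)), data = anscombe)
})

coefs     <- t(sapply(fits, coef))                        # intercept and slope per data set
r2        <- sapply(fits, function(f) summary(f)$r.squared)
sigma_hat <- sapply(fits, function(f) summary(f)$sigma)   # residual standard error

round(cbind(coefs, r2 = r2, sigma = sigma_hat), 2)

# Plotting the four data sets shows how different they really are
op <- par(mfrow = c(2, 2))
for (k in 1:4) {
  plot(anscombe[[paste0("x", k)]], anscombe[[paste0("y", k)]],
       xlab = paste0("x", k), ylab = paste0("y", k))
  abline(fits[[k]])
}
par(op)
```

Identical summary statistics therefore say nothing about whether the straight-line model is valid, which is why the residual plots on the following slides matter.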
30. Residual plots

    - A residual plot is a graph that shows the residuals on the vertical axis and the independent variable on the horizontal axis.
    - If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate.

    The first plot shows a random pattern, indicating a good fit for a linear model. The other plot patterns are non-random (U-shaped and inverted U), suggesting a better fit for a non-linear model.
31. Residual plots: residuals vs. X
32. Residual plots: residuals vs. fitted values
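Both kinds of residual plot take only a few lines in base R. The sketch below uses the built-in `cars` data as a stand-in (an assumption) and draws the residuals against the predictor and against the fitted values, each with a horizontal reference line at zero.

```r
# Built-in 'cars' data stands in for the data set under study (assumption)
data(cars)
fit <- lm(dist ~ speed, data = cars)
res <- resid(fit)

# Residuals vs. X: look for curvature or a funnel shape
plot(cars$speed, res, xlab = "speed", ylab = "Residuals")
abline(h = 0, lty = 2)

# Residuals vs. fitted values: same diagnostics, also usable with many predictors
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

By construction, least squares residuals sum to zero and are uncorrelated with the predictor, so any visible pattern reflects structure the straight line has missed.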
33. Influential point

    Outliers: Data points that diverge in a big way from the overall pattern are called outliers. There are four ways that a data point might be considered an outlier:

    - It could have an extreme X value compared to other data points.
    - It could have an extreme Y value compared to other data points.
    - It could have extreme X and Y values.
    - It might be distant from the rest of the data, even without extreme X or Y values.

    Influential points: An influential point is an outlier that greatly affects the slope of the regression line.
34. Influential point (contd.)

    Leverage/influential points:

    - Good leverage points have their standardized residuals within the interval [-2, 2].
    - Outliers are leverage points whose standardized residuals fall outside the interval [-2, 2].

    How to deal with them:

    - Remove invalid data points, if they look unusual or are different from the rest of the data.
    - Fit a different regression model, if the model is not valid for the data:
        - higher-order terms
        - transformation
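Standardized residuals and leverages are available directly from a fitted model through `rstandard()` and `hatvalues()`. The sketch below applies the [-2, 2] rule from the slide to the built-in `cars` data (a stand-in, chosen for illustration); the leverage cut-off of twice the average leverage is a common rule of thumb and an assumption, not something the slides state.

```r
# Built-in 'cars' data stands in for the data set under study (assumption)
data(cars)
fit <- lm(dist ~ speed, data = cars)

std_res <- rstandard(fit)   # standardized residuals
lev     <- hatvalues(fit)   # leverages (diagonal of the hat matrix)

# Rule from the slide: standardized residuals outside [-2, 2] flag outliers
outlier <- abs(std_res) > 2

# Assumed rule of thumb: leverage above twice the average leverage
high_leverage <- lev > 2 * mean(lev)

# Inspect flagged observations before deciding what to do with them
cars[outlier | high_leverage, ]
```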
35. Normality & constant variance of errors. The normality and constant variance assumptions are necessary for inference:

    - hypothesis testing
    - confidence intervals
    - prediction intervals

    To check them:

    - Check the Normal Q-Q plot of the standardized residuals.
    - Check the Standardized Residuals vs. X plot.

    When these assumptions do not hold, we can try to correct the problem using data transformations.
36. Normality & constant variance check

    ```r
    > production.lm <- lm(RunTime~RunSize, data=production)
    > # Residual plots
    > plot(production.lm)
    ```
37. Cook's distance. The Cook's distance statistic, D, combines the effects of leverage and the magnitude of the residual. It is used to evaluate the impact of a given observation on the estimated regression coefficients. D > 1 indicates undue influence. The Cook's distance plot is obtained by applying the function plot() to the linear model object.
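Cook's distance can be read off a fitted model with `cooks.distance()`, or rebuilt from the standardized residuals and leverages to see how the two ingredients combine. The sketch below does both on the built-in `cars` data, used here as a stand-in (an assumption); `d_manual` and `n_coef` are illustrative names.

```r
# Built-in 'cars' data stands in for the data set under study (assumption)
data(cars)
fit <- lm(dist ~ speed, data = cars)

# Cook's distance as computed by R
d <- cooks.distance(fit)

# The same quantity from its ingredients:
# D_i = (r_i^2 / k) * h_i / (1 - h_i), with r_i the standardized residual,
# h_i the leverage and k the number of estimated coefficients
r <- rstandard(fit)
h <- hatvalues(fit)
n_coef <- length(coef(fit))
d_manual <- (r^2 / n_coef) * h / (1 - h)

# Observations with undue influence according to the D > 1 rule
which(d > 1)

# Cook's distance plot from the model object
plot(fit, which = 4)
```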
38. Transformation. When to use transformation? Transformations can be used to correct for:

    - non-constant variance
    - non-linearity
    - non-normality

    There are many ways to transform variables to achieve linearity for regression analysis. Some common methods are summarized below.
39. Assumptions for Simple LR. There are four principal assumptions which justify the use of linear regression models for purposes of prediction:

    - I. Linearity of the relationship between dependent and independent variables: Y = β0 + β1X + ϵ
    - II. Independence of the errors (no serial correlation)
    - III. Homoscedasticity (constant variance) of the errors
        - a) versus time
        - b) versus the predictions (or versus any independent variable)
    - IV. Normality of the error distribution

    If any of these assumptions is violated (i.e., if there is nonlinearity, serial correlation, heteroscedasticity, and/or non-normality), then the forecasts, confidence intervals, and economic insights yielded by a regression model may be inefficient or seriously biased or misleading.

    What can go wrong? Violations:

    - In the linear regression model: linearity (e.g. quadratic relationship or higher-order terms)
    - In the residual assumptions:
        - non-normal distribution
        - non-constant variances
        - dependence
        - outliers

    Checks:

    - Residuals vs. each predictor variable: nonlinearity suggests higher-order terms in that variable
    - Residuals vs. fitted values: variance increasing with the response suggests a transformation
    - Residuals Q-Q norm plot: deviation from a straight line indicates non-normality
40. Violations of linearity. These are extremely serious: if you fit a linear model to data which are nonlinearly related, your predictions are likely to be seriously in error, especially when you extrapolate beyond the range of the sample data.

    How to detect: Plot

    - observed vs. predicted values, or
    - residuals vs. predicted values.

    Look carefully for evidence of a "bowed" pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.

    How to fix:

    - Consider applying a nonlinear transformation to the dependent and/or independent variables. For example, if the data are strictly positive, a log transformation may be feasible.
    - Another possibility to consider is adding another regressor which is a nonlinear function of one of the other variables. For example, if you have regressed Y on X, and the graph of residuals versus predicted suggests a parabolic curve, then it may make sense to regress Y on both X and X^2 (i.e., X-squared). The latter transformation is possible even when X and/or Y have negative values, whereas logging may not be.
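Adding a squared regressor keeps the model linear in the parameters, so `lm()` handles it directly. The sketch below uses simulated data with a known quadratic relationship; the data-generating values, the seed and the variable names are all assumptions chosen for illustration.

```r
# Simulated data with a parabolic relationship (all values are assumptions)
set.seed(1)
x <- seq(-3, 3, length.out = 60)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(60, sd = 0.5)

# Straight-line fit: the residuals show a bowed pattern
fit_lin <- lm(y ~ x)
plot(fitted(fit_lin), resid(fit_lin), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Regress y on both x and x^2; I() protects the power inside the formula
fit_quad <- lm(y ~ x + I(x^2))

# F-test comparing the nested models
anova(fit_lin, fit_quad)
```

Note that x takes negative values here, which is exactly the situation where a log transformation of x would not be available.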
41. Violations of homoscedasticity. Violations of homoscedasticity make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. In particular, if the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients.

    How to detect: Plots of

    - residuals vs. time, and
    - residuals vs. predicted value.

    Check for residuals that are getting larger (i.e., more spread out) either as a function of time or as a function of the predicted value. (To be really thorough, you might also want to plot residuals versus some of the independent variables.)

    How to fix:

    - In time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth, perhaps magnified by a multiplicative seasonal pattern. Some combination of logging and/or deflating will often stabilize the variance in this case.
    - A simple fix would be to work with shorter intervals of data in which volatility is more nearly constant.
    - Heteroscedasticity can also be a byproduct of a significant violation of the linearity and/or independence assumptions, in which case it may also be fixed as a byproduct of fixing those problems.
42. Violations of normality

    - They compromise the estimation of coefficients and the calculation of confidence intervals. Sometimes the error distribution is "skewed" by the presence of a few large outliers.
    - Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates.
    - Calculation of confidence intervals and various significance tests for coefficients are all based on the assumption of normally distributed errors.
    - If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.

    How to detect: The best test for normally distributed errors is a normal probability plot of the residuals.

    - This is a plot of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on this plot should fall close to the diagonal line.
    - A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in the same direction).
    - An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis, i.e., there are either too many or too few large errors in both directions.

    How to fix: Violations of normality often arise either because (a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear transformation of variables might cure both problems. In some cases, the problem with the residual distribution is mainly due to one or two very large errors. Such values should be scrutinized closely: are they genuine (i.e., not the result of data entry errors), are they explainable, are similar events likely to occur again in the future, and how influential are they in your model-fitting results? (The "influence measures" report is a guide to the relative influence of extreme observations.) If they are merely errors or if they can be explained as unique events not likely to be repeated, then you may have cause to remove them. In some cases, however, it may be that the extreme values in the data provide the most useful information about values of some of the coefficients and/or provide the most realistic guide to the magnitudes of forecast errors.
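A normal probability plot of the residuals takes two calls in base R, `qqnorm()` and `qqline()`. The sketch below draws it for the standardized residuals of a model fitted to the built-in `cars` data, used here as a stand-in (an assumption); `influence.measures()` is the base R function that produces an influence report of the kind mentioned above.

```r
# Built-in 'cars' data stands in for the data set under study (assumption)
data(cars)
fit <- lm(dist ~ speed, data = cars)
z <- rstandard(fit)   # standardized residuals

# Normal Q-Q plot: sample quantiles against theoretical normal quantiles
qq <- qqnorm(z, main = "Normal Q-Q plot of standardized residuals")
qqline(z)             # reference line through the first and third quartiles

# Influence report for every observation (DFBETAS, DFFITS, Cook's distance, hat values)
infl <- influence.measures(fit)
summary(infl)         # prints only the potentially influential observations
```

Points that bend away from the reference line at one end suggest skewness, while an S-shape suggests tails that are heavier or lighter than those of a normal distribution.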
43. Thank you!