The Tobit model is a statistical model used to analyze censored or limited dependent variables. It accounts for data where the dependent variable is left-censored, only observing values above a cutoff. The model estimates the relationship between independent variables and an underlying latent dependent variable that is observed only when it exceeds zero. Tobit regression can be used when the dependent variable is limited, such as wages being limited by minimum wage, or donation amounts. It is estimated using maximum likelihood to account for censored observations.
TOBIT ANALYSIS
Rajender Parsad and Sanju
I.A.S.R.I., Library Avenue, New Delhi – 110 012
rajender@iasri.res.in; san.iss26@gmail.com
The Tobit model is a statistical model proposed by James Tobin (1958) to describe the
relationship between a non-negative dependent variable $y_i$ and an independent variable (or
vector) $x_i$. The word Tobit is formed from Tobin by adding “it” to it. The tobit model can be
described in terms of a latent variable $y^*$. Suppose that $y_i^*$ is observed if $y_i^* > 0$ and
is not observed if $y_i^* \le 0$. Then the observed $y_i$ is defined as
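$$
y_i =
\begin{cases}
y_i^* = \beta x_i + u_i & \text{if } y_i^* > 0 \\
0 & \text{if } y_i^* \le 0
\end{cases}
\qquad u_i \sim \mathrm{IIDN}(0, \sigma^2)
$$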
This is known as the tobit model. It is also called a censored regression model, because
some observations on $y_i^*$ (those for which $y_i^* \le 0$) are censored. Our objective is to
estimate the parameters β and σ. In other words, the latent variable y* is observed only
if y* > 0; the actual dependent variable is y = max(0, y*). For
example, let Y be the amount of money that an individual spends on tobacco, given his or her
characteristics X. Then Y > 0 if the individual is a smoker, and Y = 0 if not.
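For reference, β and σ in a model of this kind are usually estimated by maximum likelihood. A standard form of the log-likelihood for censoring at zero is sketched below in the notation used above; the symbols φ and Φ (the standard normal density and distribution function) are not defined in the original text and are introduced here only for illustration.

$$
\ln L(\beta, \sigma) = \sum_{y_i > 0} \left[ -\ln \sigma + \ln \phi\!\left( \frac{y_i - \beta x_i}{\sigma} \right) \right] + \sum_{y_i = 0} \ln \Phi\!\left( -\frac{\beta x_i}{\sigma} \right)
$$

The first sum is the contribution of the uncensored observations, which enter through the normal density. The second sum is the contribution of the censored observations, which enter only through the probability $P(y_i^* \le 0) = \Phi(-\beta x_i / \sigma)$.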
More generally, a censored regression model is designed to estimate linear
relationships between variables when there is either left- or right-censoring in the dependent
variable (also known as censoring from below and above, respectively). Censoring from
above takes place when all cases with a value at or above some threshold take on the value
of that threshold, so that the true value might be equal to the threshold, but it might also be
higher. In the case of censoring from below, values that fall at or below some threshold
are censored.
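The two kinds of censoring described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `censor`, the thresholds 0 and 800, and the latent values are hypothetical and do not come from the original text.

```python
def censor(value, lower=None, upper=None):
    """Return the recorded value and a label for the kind of censoring applied."""
    if lower is not None and value <= lower:
        return lower, "left"    # censoring from below: record the lower threshold
    if upper is not None and value >= upper:
        return upper, "right"   # censoring from above: record the upper threshold
    return value, "none"        # value is observed as it is


# Hypothetical latent values, censored below at 0 and above at 800
latent = [-120.0, 0.0, 350.0, 800.0, 915.0]
for v in latent:
    observed, kind = censor(v, lower=0.0, upper=800.0)
    print(f"latent={v:7.1f}  observed={observed:7.1f}  censoring={kind}")
```

A recorded value equal to a threshold only tells us that the true value lies at or beyond that threshold, which is why the label is kept alongside the value.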
The tobit model has been used in a large number of applications where the dependent variable is
observed to be zero for some individuals in the sample (automobile expenditures, medical
expenditures, hours worked, wages, etc.). The model is meant for a metric dependent variable
that is “limited” in the sense that we observe it only if it is above or below some cut-off level.
For example:
wages may be limited from below by the minimum wage
the donation amount given to charity
“top coding” of income at, say, $300,000
time use and leisure activity of individuals
However, on careful scrutiny we find that the censored regression model (tobit model) is
inappropriate for the analysis of these problems. The tobit model is applicable only in those
situations where the latent variable can, in principle, take negative values, but these negative
values are not observed because of censoring.
To explain this model, consider data on housing expenditure in relation to income for a
cross section of 30 families. Our interest is in finding out the amount of money a person
or family spends on a house in relation to socioeconomic variables. If a consumer does not
purchase a house, we obviously have no data on housing expenditure for that consumer; we
have such data only on consumers who actually purchase a house.
Thus consumers are divided into two groups: one consisting of, say, n1 consumers about
whom we have information on the regressors (say, income, number of people in the family,
mortgage interest rate, etc.) as well as the regressand (amount of expenditure on housing), and
another consisting of n2 consumers about whom we have information only on the regressors
but not on the regressand.
We should not estimate the regression using only the n1 observations. The OLS estimates of the
parameters obtained from the subset of n1 observations will be biased as well as inconsistent;
that is, they are biased even asymptotically. The bias arises from the fact that if we consider
only the n1 observations and omit the others, there is no guarantee that E(ui) will be
zero, and without E(ui) = 0 we cannot guarantee that the OLS estimates will be
unbiased.
[Figure: expenditure on housing (Y) plotted against income (X). Dots: both expenditure and
income data available. Crosses (x), lying on the income axis: expenditure data not available,
but income data available.]
As the figure shows, if Y is not observed (because of censoring), all such observations (= n2),
denoted by crosses, will lie on the horizontal axis. If Y is observed, the observations (= n1),
denoted by dots, will lie in the X-Y plane. If we estimate a regression line based on the n1
observations only, the resulting intercept and slope coefficients are bound to be different from
those obtained if all the (n1 + n2) observations were taken into account.
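The effect described above can be illustrated with a small simulation. The sketch below is not the housing data of the example: the intercept, slope, error standard deviation, sample size and income range are all hypothetical values chosen for illustration, and the helper `ols` is a plain least-squares fit written for this sketch.

```python
import random


def ols(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


random.seed(1)
TRUE_INTERCEPT, TRUE_SLOPE, SIGMA = -20.0, 1.0, 10.0  # hypothetical parameters

income = [random.uniform(0.0, 50.0) for _ in range(5000)]
latent = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0.0, SIGMA) for x in income]

# The n1 group: expenditure is observed only when the latent value is positive
n1_pairs = [(x, y) for x, y in zip(income, latent) if y > 0]

a_full, b_full = ols(income, latent)
a_n1, b_n1 = ols([x for x, _ in n1_pairs], [y for _, y in n1_pairs])

print(f"all observations (latent): intercept={a_full:.2f}, slope={b_full:.2f}")
print(f"n1 observations only:      intercept={a_n1:.2f}, slope={b_n1:.2f}")
```

Under these assumptions the line fitted to the n1 observations alone is flatter than the line fitted to the full latent data, because low-income families enter the n1 group only when their error term happens to be large.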
There is sometimes confusion about the difference between the truncated model and the censored
model. With censored variables, all of the observations are in the dataset, but we don't know
the "true" values of some of them. In the censored model we have observations on the
explanatory variable $x_i$ for all individuals; it is only the dependent variable $y_i^*$ that is missing
for some individuals. In the truncated model, we have no data on either $y_i^*$ or $x_i$ for some
individuals, because no samples are drawn if $y_i^*$ is below or above a certain level.
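The distinction between a censored and a truncated sample can be made concrete with a short sketch. The latent model, the limit of 0 and the sample size below are hypothetical and serve only to show which rows each kind of sample retains.

```python
import random

random.seed(7)
LIMIT = 0.0  # hypothetical censoring / truncation point

# Hypothetical population of (x_i, y*_i) pairs with y* = -1 + 0.5 x + u
population = []
for _ in range(1000):
    x = random.uniform(0.0, 10.0)
    population.append((x, -1.0 + 0.5 * x + random.gauss(0.0, 1.0)))

# Censored sample: every x_i is kept; y*_i at or below the limit is recorded as the limit
censored = [(x, max(LIMIT, y)) for x, y in population]

# Truncated sample: individuals with y*_i at or below the limit are never drawn at all
truncated = [(x, y) for x, y in population if y > LIMIT]

n_at_limit = sum(1 for _, y in censored if y == LIMIT)
print("population size :", len(population))
print("censored sample :", len(censored), "rows,", n_at_limit, "recorded at the limit")
print("truncated sample:", len(truncated), "rows")
```

The censored sample has as many rows as the population, whereas the truncated sample is shorter by exactly the number of rows that were recorded at the limit.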
To estimate a Tobit model in SAS, we can use either the QLIM procedure of SAS/ETS or the
LIFEREG procedure of SAS/STAT. QLIM stands for qualitative and limited dependent
variable. An example of Tobit analysis using QLIM is also given at
http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_qlim_sect034.htm
Many problems of this kind are available in the literature. The following example
is taken from the website http://www.ats.ucla.edu/stat/sas/dae/tobit.htm.
Example 1: Consider the situation in which we have a measure of academic aptitude (scaled
200-800) which we want to model using reading and math test scores, as well as the type of
program the student is enrolled in (academic, general, or vocational). The students who
answer all questions on the academic aptitude test correctly receive a score of 800, even
though it is likely that these students are not "truly" equal in aptitude. The same is true of
students who answer all of the questions incorrectly: all such students would have a score of
200, although they may not all be of equal aptitude. In this dataset, however,
the lowest value of academic aptitude is 352. No student received a score of 200 (i.e. the
lowest score possible), meaning that although censoring from below was possible, it
does not occur in the dataset.
Solution:
“Here the academic aptitude variable is denoted by apt, the reading and math test scores are
read and math respectively. The variable prog is the type of program the student is in, it is a
categorical (nominal) variable that takes on three values, academic (prog = 1), general (prog
= 2), and vocational (prog = 3).”
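/* Assumed definition of the pro. format used below; labels follow the text above */
proc format;
value pro 1 = 'academic' 2 = 'general' 3 = 'vocational';
run;
data sastobit;
input id read math prog apt;
format prog pro.;
cards;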
1 34 40 3 352
2 39 33 3 449
3 63 48 2 648
4 44 41 2 501
5 47 43 2 762
6 47 46 2 658
7 57 59 2 800
8 39 52 2 613
9 48 52 3 531
10 47 49 1 528
11 34 45 2 584
12 37 45 3 610
13 47 39 3 586
14 47 54 2 769
15 39 44 3 402
;
/* the remaining observations (up to id 200) are not reproduced in this copy */
run;
proc means data = sastobit maxdec=2 nonobs;
class prog;
vars apt read math;
run;
The results are given in Table 1.1.
Table 1.1
prog         Variable   N     Mean     Std Dev   Minimum   Maximum
academic     apt        45    639.02   78.63     454.00    800.00
             read       45    49.76    9.23      28.00     68.00
             math       45    50.02    7.44      35.00     63.00
general      apt        105   677.76   88.21     462.00    800.00
             read       105   56.16    9.59      34.00     76.00
             math       105   56.73    8.73      38.00     75.00
vocational   apt        50    561.72   92.76     352.00    800.00
             read       50    46.20    8.91      31.00     68.00
             math       50    46.42    7.95      33.00     75.00
To depict the distribution of apt as a histogram, use the following statements:
proc sgplot data = sastobit noautolegend;
histogram apt;
density apt /type = normal lineattrs=(color=blue);
run;
The results are presented in Figure 1.1.
Figure 1.1
Looking at the above histogram showing the distribution of apt, we can see the censoring in
the data: there are far more cases with scores of 775 to 800 than one would expect
from the rest of the distribution. Next, fit a normal distribution to the apt data using
the following statements:
proc univariate data=sastobit noprint;
histogram apt / midpoints=350 to 800 by 1 normal ;
run;
The results are presented in Tables 2.1 and 2.2 and Figure 2.1.
Table 2.1
Parameters for Normal Distribution
Parameter   Symbol   Estimate
Mean        Mu       640.035
Std Dev     Sigma    99.21903
Table 2.2
Goodness-of-Fit Tests for Normal Distribution
Test                 Statistic           p Value
Kolmogorov-Smirnov   D      0.05607262   Pr > D      0.126
Cramer-von Mises     W-Sq   0.07955220   Pr > W-Sq   0.216
Anderson-Darling     A-Sq   0.93599049   Pr > A-Sq   0.019
At the α = 0.05 significance level, the Kolmogorov-Smirnov and Cramer-von Mises tests do not
reject the hypothesis that the normal distribution with mean μ = 640.035 and standard deviation
σ = 99.21903 provides a good model for the distribution of academic aptitude, whereas the
Anderson-Darling test (p = 0.019) does reject it.
Figure 2.1
In the histogram above, the midpoints option is used to produce a histogram in which each unique
value of apt has its own bar, by specifying bins from 350 (the minimum
of apt is 352) to a maximum of 800 in units of 1. The spike on the far right of the histogram is the
bar for cases where apt = 800; the height of this bar relative to all the others clearly shows the
excess number of cases with this value. To study the correlation between read, math and apt,
one can use the following statements; the results are given in Table 3.1 and Figure 3.1.
ods graphics on;
proc corr data = sastobit nosimple;
var read math apt;
run;
ods graphics off;
Table 3.1
Pearson Correlation Coefficients, N = 200
Prob > |r| under H0: Rho=0
       read      math      apt
read   1.00000   0.66228   0.64512
                 <.0001    <.0001
math   0.66228   1.00000   0.73327
       <.0001              <.0001
apt    0.64512   0.73327   1.00000
       <.0001    <.0001
Figure 3.1
The collection of cases at the top of the bottom row of the scatter plots is due to the
censoring in the distribution of apt.
The QLIM Procedure
proc qlim data = sastobit ;
class prog;
model apt = read math prog;
endogenous apt ~ censored (ub=800);
run;
In the above, the class statement identifies prog (the programme in which the
student is enrolled) as a categorical variable. Here “1” denotes the academic program, “2”
the general program and “3” the vocational program. The model statement specifies
that apt should be modeled using read, math, and prog. The endogenous statement specifies
that the outcome variable apt is censored, with an upper bound of 800 (i.e. ub=800). The
results are given in Tables 4.1, 4.2, 4.3 and 4.4.
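To make concrete what a fit with an upper bound maximises, the following Python sketch evaluates a tobit log-likelihood for data right-censored at ub. It is a minimal illustration under the usual normal-error assumption, not a description of how PROC QLIM is implemented; the function names and the small input values in the usage lines are hypothetical.

```python
import math


def norm_logpdf(z):
    """Log of the standard normal density at z."""
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)


def norm_sf(z):
    """Upper-tail probability 1 - Phi(z) of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tobit_loglik(y, xb, sigma, ub):
    """Log-likelihood of a tobit model that is right-censored at ub.

    y     : observed responses (equal to ub when censored)
    xb    : linear predictors, one per observation
    sigma : standard deviation of the error term
    """
    total = 0.0
    for yi, xbi in zip(y, xb):
        if yi >= ub:
            # Censored: we only know that the latent value is at least ub
            total += math.log(norm_sf((ub - xbi) / sigma))
        else:
            # Uncensored: ordinary normal density of the residual
            total += norm_logpdf((yi - xbi) / sigma) - math.log(sigma)
    return total


# Hypothetical usage: one uncensored and one censored observation
print(tobit_loglik([648.0, 800.0], [645.0, 790.0], sigma=65.0, ub=800.0))
```

A maximum likelihood routine would search over the regression coefficients (which determine xb) and sigma for the values that make this quantity as large as possible.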
Table 4.1
Summary Statistics of Continuous Responses
Variable   Mean      Standard Error   Type       Lower Bound   Upper Bound   N Obs Lower Bound   N Obs Upper Bound
apt        640.035   99.219030        Censored                 800                               17
Table 4.1 provides a summary of the number of left- and right-censored values: 17 observations
are right-censored at the upper bound of 800, and no lower bound is specified.
Table 4.2
Class Level Information
Class Levels Values
prog 3 academic general vocational
The class level information shows that prog is a classification variable with three levels:
academic, general and vocational (coded 1, 2 and 3).
Table 4.3
Model Fit Summary
Number of Endogenous Variables 1
Endogenous Variable apt
Number of Observations 200
Log Likelihood -1041
Maximum Absolute Gradient 8.40561E-7
Number of Iterations 26
Optimization Method Quasi-Newton
AIC 2094
Schwarz Criterion 2114
Table 4.3 labelled Model Fit Summary includes information on the number of observations
(200), the number of iterations it took the model to converge, the final log likelihood, and the
AIC and Schwarz Criterion (also known as the BIC).
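The AIC and Schwarz Criterion in Table 4.3 can be checked against the usual definitions, AIC = -2 ln L + 2k and BIC = -2 ln L + k ln(n). The sketch below assumes k = 6 estimated parameters (intercept, read, math, two prog dummies and _Sigma) and uses the rounded log likelihood printed in the table, so it is only an approximate check.

```python
import math

log_lik = -1041.0  # rounded value printed in Table 4.3
n_obs = 200        # number of observations from Table 4.3
n_params = 6       # assumption: intercept, read, math, 2 prog dummies, _Sigma

aic = -2.0 * log_lik + 2.0 * n_params
bic = -2.0 * log_lik + n_params * math.log(n_obs)

print(round(aic), round(bic))  # prints: 2094 2114
```

Both rounded values agree with the AIC and Schwarz Criterion reported in the table.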
Table 4.4
Parameter Estimates
Parameter         DF   Estimate     Standard Error   t Value   Approx Pr > |t|
Intercept         1    163.422155   30.408580        5.37      <.0001
read              1    2.697939     0.618806         4.36      <.0001
math              1    5.914484     0.709818         8.33      <.0001
prog academic     1    46.143900    13.724195        3.36      0.0008
prog general      1    33.429162    12.955628        2.58      0.0099
prog vocational   0    0            .                .         .
_Sigma            1    65.676720    3.481423         18.86     <.0001
The coefficients for read and math are statistically significant, as are the terms for
prog="academic" and prog="general" (with prog="vocational" as the reference category).
Tobit regression coefficients are interpreted in a similar manner to OLS regression
coefficients; however, the linear effect is on the uncensored latent variable, not on the
observed outcome. A one unit increase in read is associated with a 2.7 point increase in the
predicted value of apt. A one unit increase in math is associated with a 5.9 point increase in
the predicted value of apt. The terms for prog have a slightly different interpretation. The
predicted value of apt is 46.14 points higher for students in an academic program
(prog="academic") than for students in a vocational program (prog="vocational"). The
predicted value of apt is 33.43 points higher for students in a general program
(prog="general") than for students in a vocational program (prog="vocational").
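The interpretation above can be traced through the fitted linear predictor. The sketch below plugs the estimates from Table 4.4 into the latent-variable equation; the function name `latent_apt` and the student with read = 50 and math = 50 are hypothetical and used only for illustration.

```python
# Estimates copied from Table 4.4; vocational is the reference category
COEF = {
    "Intercept": 163.422155,
    "read": 2.697939,
    "math": 5.914484,
    "academic": 46.143900,
    "general": 33.429162,
    "vocational": 0.0,
}


def latent_apt(read, math_score, prog):
    """Linear predictor for the uncensored (latent) aptitude score."""
    return (COEF["Intercept"] + COEF["read"] * read
            + COEF["math"] * math_score + COEF[prog])


# Hypothetical student with read = 50 and math = 50 in each program
for prog in ("academic", "general", "vocational"):
    print(f"{prog:10s} {latent_apt(50, 50, prog):.2f}")
```

The difference between the academic and vocational rows equals the prog academic coefficient, and raising read by one point raises every row by the read coefficient. Because the model is censored at 800, a linear predictor above 800 refers to the latent score and not to a score that could actually be recorded.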
In the “Parameter Estimates” table there are seven rows. The first six of these rows
correspond to the estimated vector of regression coefficients β. The last one is called
_Sigma, which corresponds to the estimate of the error standard deviation σ.
We can test the overall effect of prog by testing whether the coefficients for
prog="academic" and prog="general" are simultaneously equal to 0. To do this we add a test
statement to the proc qlim code. To find out how SAS names the dummy variables for a
class variable, it is usually a good idea to output the parameter estimates as a data set (in this
example, named t) and print it to see how SAS names these variables internally.
In our example, we see that SAS has appended the value label to prog in naming the dummy
variables for prog. The results obtained are given in Tables 5.1 and 5.2.
proc qlim data = sastobit outest=t;
class prog;
model apt = read math prog;
endogenous apt ~ censored (ub=800);
run;
proc print data = t noobs;
run;
Table 5.1
_NAME_   _TYPE_   _STATUS_      Intercept   read      math      progacademic   proggeneral   progvocational   _Sigma
         PARM     0 Converged   163.422     2.69794   5.91448   46.1439        33.4292       .                65.6767
         STD      0 Converged   30.409      0.61881   0.70982   13.7242        12.9556       .                3.4814
proc qlim data =sastobit ;
class prog;
model apt = read math prog;
endogenous apt ~ censored (ub=800);
test 'prog' progacademic = 0,
proggeneral = 0;
run;
Table 5.2
Test Results
Test Type Statistic Pr > ChiSq Label
'prog' Wald 11.96 0.0025 progacademic = 0 , proggeneral = 0
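The p-value in Table 5.2 can be checked by hand. The test imposes two restrictions, so the Wald statistic is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability has the closed form exp(-x/2). The function name below is hypothetical; the statistic 11.96 is taken from the table.

```python
import math


def chi2_sf_2df(stat):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


p_value = chi2_sf_2df(11.96)
print(f"{p_value:.4f}")  # prints: 0.0025
```

The result agrees with the Pr > ChiSq value of 0.0025 reported for the joint test of the two prog coefficients.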
We may also wish to evaluate how well our model fits. This can be particularly useful when
comparing competing models. One method of assessing model fit is to compare the predicted
values based on the tobit model to the observed values in the dataset. Below we use proc qlim
to generate predicted values along with the data via the output statement. Then proc corr is
used to estimate the correlation between the predicted and observed values of apt. The
predicted values are given in Table 6.1.
proc qlim data=sastobit ;
model apt = read math prog;
endogenous apt ~ censored (ub=800);
output out = temp1 predicted;
run;
proc print data=temp1;
run;
Table 6.1
Obs   id    read   math   prog   apt   P_apt
1     1     34     40     3      352   493.356
2     2     39     33     3      449   464.504
3     3     63     48     2      648   645.855
4     4     44     41     2      501   550.096
5     5     47     43     2      762   570.686
6     6     47     46     2      658   589.025
7     7     57     59     2      800   696.371
8     8     39     52     2      613   603.400
9     9     48     52     3      531   605.742
10    10    47     49     1      528   630.112
11    11    34     45     2      584   546.670
...
199   199   52     50     2      558   627.416
200   200   68     75     2      800   800.000
proc corr data = temp1 nosimple;
var apt p_apt;
run;
The correlation between observed and predicted values is given in Table 6.2 and the scatter
plot in Figure 6.1.
Table 6.2
Pearson Correlation Coefficients, N = 200
Prob > |r| under H0: Rho=0
        apt       P_apt
apt     1.00000   0.78094
                  <.0001
P_apt   0.78094   1.00000
        <.0001
Figure 6.1
The output from proc corr gives the correlation between the predicted and observed values of
apt, which is 0.78094. Squaring this value gives the squared multiple correlation, which
indicates that the predicted values share about 61% (0.78094^2 = 0.6099) of their variance
with the observed values of apt.
Some Important Points
Below is a list of some analysis methods you may have encountered. Some of the methods
listed are quite reasonable while others have either fallen out of favor or have limitations.
One can analyze these data using OLS regression. OLS regression will treat the 800 as the
actual values and not as the upper limit of the top academic aptitude. A limitation of this
approach is that when the variable is censored, OLS provides inconsistent estimates of the
parameters, meaning that the coefficients from the analysis will not necessarily approach the
"true" population parameters as the sample size increases.
There is sometimes confusion about the difference between truncated data and censored data.
With censored variables, all of the observations are in the dataset, but we don't know the
"true" values of some of them. With truncation some of the observations are not included in
the analysis because of the value of the variable. When a variable is censored, regression
models for truncated data provide inconsistent estimates of the parameters.
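The point about OLS treating the censored value of 800 as an actual score can be illustrated by simulation. The sketch below does not use the aptitude data: the intercept, slope, error standard deviation and predictor distribution are hypothetical values chosen so that a modest share of the latent scores exceeds the upper limit.

```python
import random


def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


random.seed(3)
UB = 800.0         # upper limit, as in the aptitude example
TRUE_SLOPE = 7.0   # hypothetical slope of the latent relationship

x = [random.gauss(52.0, 10.0) for _ in range(5000)]
latent = [300.0 + TRUE_SLOPE * xi + random.gauss(0.0, 65.0) for xi in x]
observed = [min(UB, yi) for yi in latent]  # scores above 800 are recorded as 800

slope_latent = ols_slope(x, latent)
slope_observed = ols_slope(x, observed)
share_censored = sum(1 for yi in observed if yi == UB) / len(observed)

print(f"share censored at {UB:.0f}: {share_censored:.3f}")
print(f"OLS slope on latent scores  : {slope_latent:.2f}")
print(f"OLS slope on observed scores: {slope_observed:.2f}")
```

Under these assumptions the slope estimated from the observed scores is smaller than the slope estimated from the latent scores, and increasing the sample size does not remove the gap, which is what inconsistency means here.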
References:
SAS Data Analysis Examples Tobit Analysis at
http://www.ats.ucla.edu/stat/sas/dae/tobit.htm
Tobin, James (1958), "Estimation of relationships for limited dependent
variables", Econometrica (The Econometric Society) 26 (1): 24–36, doi:10.2307/190738
http://en.wikipedia.org/wiki/Tobit_model
http://www.ats.ucla.edu/stat/stata/dae/tobit.htm
http://support.sas.com/documentation/cdl/en/etsug/60372/HTML/default/viewer.htm#etsug_qlim_sect034.htm