Date: Thu, 11 Apr 2002 12:14:52 -0700
Reply-To: Matthias Kehder <matthiaskehder@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Matthias Kehder <matthiaskehder@YAHOO.COM>
Subject: Re: Tobit model by Heckman two stage estimator
In-Reply-To: <FB2C8BA000B7D5119ABB00508BC76298AB810A@msbmail.usc.edu>
Content-Type: text/plain; name="HECKMAN.TXT"
/* This macro is supported by the author (see below), not by SAS
Institute.
*/
options noovp linesize=75;
title 'Heckman Two-Step Selection Correction Estimation';
/******************************************************************
Sample Selection-Corrected Estimation ("Heckit")
Programmmer:
David A. Jaeger
Statistical Consultant | Internet: davej@umich.edu
Population Studies Center |
The University of Michigan |
1225 South University Ave. |
Ann Arbor, MI 48104 |
Program Date: 15 January 1993
Program Version: SAS1.3
Notice:
This program is provided on an as-is basis. While this program
has been thoroughly tested, no guarantee, expressed or implied,
is made that the results produced by this program are correct.
No telephone or fax support whatsoever will be provided for this
program by its author. Questions or comments via electronic mail
are welecomed; however, a response is not guaranteed.
Program Revisions:
January 1993 (1.3): Revised IML code to make more efficient
use of memory.
Background:
Heckman (1979) discusses the bias that results from using
nonrandomly selected samples when estimating behavioral
relationships as "omitted variables" bias. He proposes
a simple consistent method to estimate these models,
using a bivariate normal model for the selection equation,
and ordinary least squares to estimate the behavioral
equation with the selected sample.
Greene (1981) notes that the standard errors in the OLS
stage that are typically computed can either be smaller
or larger than the correct standard errors, not
just smaller as Heckman had asserted. He then derives
a simple-to-compute formula for the correct variance-
covariance matrix of the OLS estimates.
Description:
This program uses PROC PROBIT, PROC REG, and PROC IML
to consistently estimate the parameters and their
standard errors in a Heckman selection-correction model.
PROC PROBIT and PROC REG will consistently estimate the
parameters of the model but the standard errors in the
second stage (OLS) reported by PROC REG will be inconsis-
tent. PROC IML is then used to estimate consistent
standard errors for the second stage. Because SAS will
not save the variance-covariance matrix from the first
stage (probit), this must also be calculated using PROC
IML for use in calculating the standard errors for the
OLS estimates in the second stage.
Use:
The code requires very little modification. Change the
variable names in the macro declarations that appear after
this introduction to reflect the names of the dependent
and independent variables in the first (probit) and second
(OLS) stages of the estimation. If your selection criterion
is different, modify the values of the macros %slct and %nonslct
as well (it is unlikely that you'll want to do this), and possibly
the sort before the probit. Modify the first DATA step (where
dataset a is created) to access your data. The rest of the program
should be left unchanged.
References:
Heckman, James. "Sample Selection Bias as a Specification Error",
{\sl Econometrica}, Vol 47, No 1., January 1979, pp. 153-161.
Greene, William. "Sample Selection Bias as a Specification Error:
Comment", {\sl Econometrica}, Vol. 49, No. 3, May 1981,
pp. 795-798.
Greene, William. {\sl Econometric Analysis}, pp. 677-678, 744-747.
*****************************************************************/
/******************************************************************/
/* Macros */
/* */
/* While you may be unfamiliar with SAS macros, their use here */
/* allows you not to modify any of estimation code that follows */
/* slct: value of the selection variable that indicates the */
/* "selected" sample. Usually this is "1", as below, */
/* but you may want to modify it. */
%macro slct;
1
%mend slct;
/* nonslct: value of the selection variable that indicates the */
/* "unselected" sample. Usually this is "0", as below */
/* but you may want to modify it. */
%macro nonslct;
0
%mend nonslct;
/* prbtlhs: dependent variable for first stage (probit) */
/* replace "sel" below with your variable */
%macro prbtlhs;
sel
%mend prbtlhs;
/* prbtrhs: independent variables for first stage (probit) */
/* replace "x1" below with your variable(s) */
/* you can use more than one line for them */
%macro prbtrhs;
x1
%mend prbtrhs;
/* olslhs: dependent variable for second stage (ols) */
/* replace "dep" below with your variable */
%macro olslhs;
dep
%mend olslhs;
/* olsrhs: independent variables for second stage (ols) */
/* replace "x2" below with your variable(s) */
/* you can use more than one line for them */
%macro olsrhs;
x2
%mend olsrhs;
/* Modify this DATA step to access your data. */
/* Do _not_ change the name of dataset a */
/* The KEEP option keeps only variables that will be used in */
/* the estimation procedure; it may be deleted if so desired */
data a(keep=%prbtlhs %prbtrhs %olslhs %olsrhs);
infile '~/heckit/data'; /* example data */
/* if you've already */
/* created a SAS data */
/* you don't need an */
/* INFILE */
input dep sel x1 x2; /* if you already have a */
/* SAS data set, created */
/* you'll need a SET */
/* statement instead of */
/* INPUT */
/* Print out descriptive statistics of all variables in */
/* dataset a. This is a good habit to get into. */
/* This procedure can be deleted if so desired */
title2 'Means of All Variables Used in Estimation';
proc means;
/***************************************************************/
/* You shouldn't need to modify anything below this point */
/***************************************************************/
/* need to sort data to get coefficients with right signs */
proc sort data=a; by descending %prbtlhs;
/***** First Stage: Probit *****/
/* note use of order=data to get coefficients with right signs */
/* will save predicted gammaw's to dataset imr for calculation of
inverse mills ratio; note that all variables from dataset a
will also be saved in dataset imr */
proc probit order=data ;
class %prbtlhs;
model %prbtlhs=%prbtrhs / covb;
output out=imr xbeta=gammaw;
title2 'First Stage: Probit Estimates of Selection';
run;
/* Next we create the Inverse Mills' Ratio, as well as some
variables we'll need to calculate the Var-Cov Matrix of the
Probit Estimates and the OLS Estimates */
data x(keep=intercep %prbtrhs) /* variables for both x and w should */
w(keep=intercep %prbtrhs) /* be the same */
xstar(keep=intercep %olsrhs lambda)
delta(keep=delta)
h(keep=h)
b(keep=%olslhs %olsrhs lambda);
/* the retain below just gets the variables in proper order */
retain intercep %prbtrhs %olsrhs %olslhs;
set imr;
/* create inverse mills ratio */
if (%prbtlhs eq %slct) then
lambda=(1/sqrt(2*3.141592654)*exp(-1*gammaw**2/2))/probnorm(gammaw);
else if (%prbtlhs eq %nonslct) then
lambda=(1/sqrt(2*3.141592654)*exp(-1*gammaw**2/2))/
(probnorm(gammaw)-1);
else lambda=.;
/* create intercep for use in cross-product matrices */
intercep=1;
/* create h for estimating asy. var-cov matrix of probit
coefficients */
h=lambda**2+lambda*gammaw;
/* create delta for estimating asy. var-cov matrix of ols
coefficients; this is a little redundant, but makes the
notation easier to follow */
delta=h;
if (%prbtlhs eq %slct) then do; /* output datasets with */
output delta; /* selected observations */
output w; /* for calculating OLS */
output xstar; /* standard errors */
output b;
end;
output x; /* output datasets with */
output h; /* all observations for */
/* calculating probit */
/* standard errors */
/***** Second Stage: OLS *****/
/* Run only on selected sample */
/* Note that selection is done in the above DATA step */
/* we could have also done it here with a WHERE clause */
proc reg data=b outest=olsest;
model %olslhs=%olsrhs lambda;
output out=err residual=e;
title2 'Second Stage: OLS Estimates of Model';
run;
/***** Estimate Consistent Standard Errors of OLS Stage *****/
title2 'Consistent Estimates of Standard Errors for Second Stage (OLS)';
proc iml;
/* First, calculate asymptotic variance-covariance matrix of
the probit estimates. SAS isn't very friendly and doesn't
allow us to save them from the probit estimation.
Be sure to check these estimates against those produced by
the probit procedure above.
See Greene, Econometric Analysis, pp. 677-678 for formulae. */
use x;
read all var _all_ into x;
use h;
read all var _all_ into h;
k=ncol(x);
n=nrow(h);
invsig=J(k,k,0);
do i= 1 to n;
invsig=invsig+J(k,k,h[i,])#(x[i,]`*x[i,]);
end;
sig=inv(invsig);
prbtnm={INTERCEP %prbtrhs};
print,"Asymptotic Variance-Covariance Matrix",
"of First Stage (Probit) Coefficients",
sig[r=prbtnm c=prbtnm format=12.6];
free x h invsig;
/* Now estimate the selection-corrected standard error for the Second
Stage (OLS) */
/* Get estimate of coefficient on lambda from olsest, the dataset
containing the ols estimates; SAS is nice in that we don't need
to keep track of which element of the beta vector has the coefficient,
since it's named. */
use olsest;
read all var{lambda} into theta;
/* deltas */
use delta var{delta};
read all var{delta} into delta;
deltabar=sum(delta)/nrow(delta);
/* residuals */
use err var{e};
read all var{e} into e;
/* calculate adjusted standard error */
sigsqe=e`*e/nrow(e)+theta**2*deltabar;
sige=sqrt(sigsqe);
print,"Standard Error of Second Stage (OLS)",
"Corrected for Selection",
sige[format=12.4];
numrowe=nrow(e);
free e ;
/* calculate rho squared */
rhosq=theta**2/sigsqe;
rho=(theta/abs(theta))*sqrt(rhosq);
print,"Corrlection of Disturbance in Regression",
"and Selection Criterion (Rho)",rho[format=8.4];
use xstar;
read all var _all_ into xstar;
use w;
read all var _all_ into w;
/* Calculate Consistent Standard Errors
See Greene, Econometric Analysis, pp. 744-747 for formulae */
delcol=delta;
do i=1 to ncol(w)-1;
delcol=delcol||delta;
end;
cdeltaw=delcol#w;
free delcol;
delcol=delta;
do i=1 to ncol(xstar)-1;
delcol=delcol||delta;
end;
cdeltaxs=delcol#xstar;
free delcol;
/**** Version 1.3 (January 1993) ****/
/* cdeltaw=capdelta*w, cdeltaxs=capdelta*xstar */
/* where capdelta=diag(delta). capdelta is */
/* n x n, wherease cdelw is n x ncol(w) and */
/* cdelxstr is n x ncol(xstar). This */
/* reduces memory use. */
/***********************************************/
Q=rhosq*(xstar`*cdeltaw)*sig*(w`*cdeltaxs);
Irhosqd=1-rhosq*delta;
delcol=Irhosqd;
free delta;
do i=1 to ncol(xstar)-1;
delcol=delcol||Irhosqd;
end;
Irsdltxs=delcol#xstar;
free Irhosqd delcol;
/**** Version 1.3 (January 1993) ****/
/* Irsdltxs=(ident(nrow(capdelta))-rwhosq*capdelta)*xstar */
/* again, this is an n x nrow(xstar) matrix, rather than */
/* needing capdelta, which is n x n. */
/**********************************************************/
asyvcov=sigsqe*inv(xstar`*xstar)*
(xstar`*Irsdltxs+ Q)*
inv(xstar`*xstar);
olsnm={INTERCEP %olsrhs LAMBDA};
print ,"Consistent Asymptotic Covariance Matrix of Estimates",
"in Second Stage (OLS)",asyvcov[r=olsnm c=olsnm format=12.6];
asyse=sqrt(vecdiag(asyvcov));
use olsest;
read all var{INTERCEP %olsrhs LAMBDA} into coeff;
variable=coeff`||asyse||(coeff`/asyse)||
2*(1-probt(abs(coeff`/asyse),numrowe-nrow(coeff)));
colnm={"Coeff." "Std. Err." "T-Ratio" "P Value"};
print ,,,"Parameter Estimates and ",
"Consistent Asymptotic Standard Errors of Estimates",
"in Second Stage (OLS)",variable[r=olsnm c=colnm format=12.4];
quit;
endsas;