| Date: | Wed, 21 Jul 2010 12:36:37 -0700 |
| Reply-To: | Steve Denham <stevedrd@YAHOO.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Steve Denham <stevedrd@YAHOO.COM> |
| Subject: | Re: Weighted Least Squares Question in SAS |
| In-Reply-To: | <995375.41905.qm@web120417.mail.ne1.yahoo.com> |
| Content-Type: | text/plain; charset=iso-8859-1 |
Jon,
It is not the "perfectly correlated" values that drive the R**2 value.
Consider the following:
data temp;
input x y w1 w2 w3;
cards;
1 1 1 1 1
2 2 1 1 .1
3 4 1 .1 1
;
run;
proc reg data=work.temp;
weight w1;
model y=x;
run;
quit;
proc reg data=work.temp;
weight w2;
model y=x;
run;
quit;
proc reg data=work.temp;
weight w3;
model y=x;
run;
quit;
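Here w1 leaves all three observations equally weighted, w2 underweights the
last observation (as in your second run), and w3 underweights the middle value.
And what is the R**2 for w3? It is 0.9947, the highest of the three.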
The catch is that a regression line is determined by its endpoints, much more
than the midpoints, especially when you have a small sample size.
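One way to see how much pull each observation has is to ask PROC REG for its influence diagnostics. This is only a sketch, assuming the INFLUENCE option on the MODEL statement and the work.temp data set created above. The "Hat Diag H" column in the output is the leverage of each observation, and the endpoints should show larger values than the middle point.

```sas
/* Sketch: print leverage (Hat Diag H) for each observation. */
/* Assumes work.temp from the data step above; no WEIGHT     */
/* statement, so this is the ordinary unweighted fit.        */
proc reg data=work.temp;
model y=x / influence;
run;
quit;
```

Adding a WEIGHT statement to the same step shows how the diagnostics shift once an observation is down-weighted.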
Consider for example one more regression, where we stretch the x axis out to 10,
and leave the three weights as before:
data temp2;
input x y w1 w2 w3;
cards;
1 1 1 1 1
2 2 1 1 .1
10 4 1 .1 1
;
run;
proc reg data=work.temp2;
weight w1;
model y=x;
run;
quit;
proc reg data=work.temp2;
weight w2;
model y=x;
run;
quit;
proc reg data=work.temp2;
weight w3;
model y=x;
run;
quit;
R**2 values are now 0.9472, 0.7879, and 0.9909.
By underweighting the observation at the extreme x value, we "miss" its y value
with our predicted value and increase the residual error, thus decreasing the
R**2. By underweighting the mid value, we increase the accuracy at the ends,
but not as much as in the first data set, where the extreme value does not have
as much leverage (0.9909 here versus 0.9947 there).
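For anyone who wants to check these numbers by hand, here is a sketch of the arithmetic. It assumes the usual weighted R**2 for a one-predictor model with an intercept, built from weighted means and weighted sums of squares. The symbols are my own notation, not labels from the PROC REG output.

```latex
\begin{align*}
\bar{x}_w &= \frac{\sum_i w_i x_i}{\sum_i w_i}, &
\bar{y}_w &= \frac{\sum_i w_i y_i}{\sum_i w_i} \\
S_{xx} &= \sum_i w_i (x_i - \bar{x}_w)^2, &
S_{yy} &= \sum_i w_i (y_i - \bar{y}_w)^2 \\
S_{xy} &= \sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w), &
R^2 &= \frac{S_{xy}^2}{S_{xx}\, S_{yy}}
\end{align*}

% Data set temp2: x = (1, 2, 10), y = (1, 2, 4)
\begin{align*}
w_1 = (1, 1, 1):\quad   & S_{xx} = 48.6667,\; S_{xy} = 14.6667,\; S_{yy} = 4.6667,\; R^2 = 0.9472 \\
w_2 = (1, 1, 0.1):\quad & S_{xx} = 7.3810,\; S_{xy} = 2.5238,\; S_{yy} = 1.0952,\; R^2 = 0.7879 \\
w_3 = (1, 0.1, 1):\quad & S_{xx} = 41.6667,\; S_{xy} = 13.6667,\; S_{yy} = 4.5238,\; R^2 = 0.9909
\end{align*}
```

Note how down-weighting the far point (w2) collapses the weighted spread in x from 48.67 to 7.38, while down-weighting the middle point (w3) barely changes it.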
All in all, weighted least squares is a muckety bog that hides many dangers. If
you are aware of them, you get to use the shortcut across the island, but if
not, you will end up being stuck. I know some professor of mine used that
analogy.
Steve Denham
Associate Director, Biostatistics
MPI Research, Inc.
----- Original Message ----
From: Jon Matthews <jmatthews7101@YAHOO.COM>
To: SAS-L@LISTSERV.UGA.EDU
Sent: Wed, July 21, 2010 1:42:38 PM
Subject: Weighted Least Squares Question in SAS
Hi,
I am using SAS to create a weighted least squares regression, and I've run into
a question about the coefficient of determination when using weighted least
squares regression.
Here is some code I wrote:
data work.temp;
input x y w;
cards;
1 1 1
2 2 1
3 4 1
;
run;
proc reg data=work.temp;
weight w;
model y=x;
run;
quit;
Since the weights are all 1, this is the same as unweighted regression and this
gives me an R-squared of .9643. Note that in my data, the first two
observations are perfectly correlated while the third is not. Now, if I
re-weight the last observation to place less weight on it since it's not
perfectly correlated with the others and rerun the weighted least squares
regression, I get a lower R-squared:
data work.temp;
input x y w;
cards;
1 1 1
2 2 1
3 4 .1
;
run;
proc reg data=work.temp;
weight w;
model y=x;
run;
quit;
R-squared now equals .9391.
This does not seem intuitive. Since I'm now underweighting the only
non-perfectly correlated observation, shouldn't R-squared improve or am I
missing something?
Thanks for any insight.