Date: Wed, 21 Jul 2010 12:36:37 -0700
From: Steve Denham
List: SAS(r) Discussion
Subject: Re: Weighted Least Squares Question in SAS
To: Jon Matthews
Message-ID: <995375.41905.qm@web120417.mail.ne1.yahoo.com>
Content-Type: text/plain; charset=iso-8859-1

Jon,

It is not the "perfectly correlated" values that drive the R**2 value.

Consider the following:

data temp;
input x y w1 w2 w3;
cards;
1 1 1 1 1
2 2 1 1 .1
3 4 1 .1 1
;
run;

proc reg data=work.temp;
weight w1;
model y=x;
run; quit;

proc reg data=work.temp;
weight w2;
model y=x;
run; quit;

proc reg data=work.temp;
weight w3;
model y=x;
run; quit;

where w1 weights all three observations equally, w2 underweights the extreme third observation as in your example, and w3 instead underweights the middle value. And what is the R**2 for the w3 run? It is 0.9947, the highest of the three.

The catch is that a regression line is determined far more by its endpoints than by its midpoints, because the extreme x values carry the most leverage, especially when the sample size is small.
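One way to make the leverage point concrete (this cross-check is my addition, not part of the original SAS exchange) is to compute the hat values for simple regression, h_i = 1/n + (x_i - xbar)^2 / Sxx, for both x configurations used below:

```python
# Sketch (not SAS): leverage values h_i for simple linear regression,
# h_i = 1/n + (x_i - xbar)^2 / Sxx.
# Large h_i means observation i pulls the fitted line harder.

def leverages(xs):
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    return [1 / n + (x - xbar) ** 2 / sxx for x in xs]

print(leverages([1, 2, 3]))   # endpoints ~0.83 each, midpoint ~0.33
print(leverages([1, 2, 10]))  # the stretched point x=10 reaches ~0.99
```

The endpoints dominate even with equally spaced x values, and stretching one x value out to 10 gives that observation nearly all of the leverage.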

Consider, for example, one more set of regressions, where we stretch the x axis out to 10 and leave the three weight columns as before:

data temp2;
input x y w1 w2 w3;
cards;
1 1 1 1 1
2 2 1 1 .1
10 4 1 .1 1
;
run;

proc reg data=work.temp2;
weight w1;
model y=x;
run; quit;

proc reg data=work.temp2;
weight w2;
model y=x;
run; quit;

proc reg data=work.temp2;
weight w3;
model y=x;
run; quit;

R**2 values are now 0.9472, 0.7879, and 0.9909.
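For readers without SAS at hand, the quoted R**2 values can be reproduced outside PROC REG. The sketch below is my addition, assuming the standard weighted least squares R**2 definition (1 - weighted SSE over weighted corrected total SS, with weighted means throughout), which is what PROC REG reports with an intercept in the model:

```python
# Cross-check (plain Python, not SAS): weighted simple linear regression
# and the weighted R**2 = 1 - SSE_w / CSS_w used by PROC REG.

def weighted_r2(xs, ys, ws):
    sw = sum(ws)
    xbar = sum(w * x for w, x in zip(ws, xs)) / sw
    ybar = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - xbar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - xbar) * (y - ybar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum(w * (y - (intercept + slope * x)) ** 2
              for w, x, y in zip(ws, xs, ys))
    css = sum(w * (y - ybar) ** 2 for w, y in zip(ws, ys))
    return 1 - sse / css

ys = [1, 2, 4]
for xs in ([1, 2, 3], [1, 2, 10]):
    for ws in ([1, 1, 1], [1, 1, 0.1], [1, 0.1, 1]):
        print(xs, ws, round(weighted_r2(xs, ys, ws), 4))
# x = 1,2,3:  0.9643, 0.9391, 0.9947
# x = 1,2,10: 0.9472, 0.7879, 0.9909
```

These match the six R**2 values quoted in this thread.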

By underweighting the extreme (x,y) observation, we "miss" its y value with the predicted value, which inflates the residual error and thus lowers R**2. By underweighting the middle value, we improve the accuracy at the ends, though not by as much as when the extreme value has less leverage.

All in all, weighted least squares is a muckety bog that hides many dangers. If you are aware of them, you get to take the shortcut across the bog; if not, you end up stuck. I believe a professor of mine used that analogy.

Steve Denham
Associate Director, Biostatistics
MPI Research, Inc.

----- Original Message ---- From: Jon Matthews <jmatthews7101@YAHOO.COM> To: SAS-L@LISTSERV.UGA.EDU Sent: Wed, July 21, 2010 1:42:38 PM Subject: Weighted Least Squares Question in SAS

Hi,

I am using SAS to run a weighted least squares regression, and I've run into a question about the coefficient of determination (R-squared). Here is some code I wrote:

data work.temp;
input x y w;
cards;
1 1 1
2 2 1
3 4 1
;
run;

proc reg data=work.temp;
weight w;
model y=x;
run; quit;

Since the weights are all 1, this is the same as unweighted regression, and it gives me an R-squared of .9643. Note that in my data, the first two observations are perfectly correlated while the third is not. Now, if I re-weight the last observation to place less weight on it, since it is not perfectly correlated with the others, and rerun the weighted least squares regression, I get a lower R-squared:

data work.temp;
input x y w;
cards;
1 1 1
2 2 1
3 4 .1
;
run;

proc reg data=work.temp;
weight w;
model y=x;
run; quit;

R-squared now equals .9391.

This does not seem intuitive. Since I am now underweighting the only observation that is not perfectly correlated with the others, shouldn't R-squared improve, or am I missing something?

Thanks for any insight.
