Date: Wed, 27 Jun 2007 17:27:55 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: SAS douple loop question
In-Reply-To: <1182871037.534725.36010@w5g2000hsg.googlegroups.com>
Content-Type: text/plain; format=flowed
dayday.sun@GMAIL.COM wrote back:
>
>Thanks for your suggestion. like what you said, my boss asked me to do
>this.
Why don't you talk this over with your boss?
Tell him/her that you asked for advice on efficient processing, and you got
your posterior reamed out by grouchy statisticians who told you that this is
totally unacceptable as a model-building process. Then ask him/her if using
unsupported statistical approaches might get him/her chopped up by journal
editors, reviewers, auditors, professors, government agencies, ....
>i used the following codes to find the gene with largest AUC:
>
>%macro logistic;
>%do i=1 %to 5;
>proc logistic data=tsun;
>model patient(event='c')=a&(i);
>output out=out p=p;
>ods output Association=auc;
>run;
>%end;
>%mend;
>%logistic
>
>Now, he asked me to find the pair of gene with largest AUC. At the
>beginning, I wanted to revise the macro and add some loops but someone
>told me it is possible but not likely to use MACRO to realise this
>purpose. she suggested me to use by statement. Do you have any idea
>with by statement?
I see that Howard has shown you how to do that. But I don't recommend
using it.
Instead, think about this: your AUC is going to be highly susceptible to
any errors or outliers or other weirdness in the data. So you need to check
your regression diagnostics for your winning AUC, AND ALSO the losing AUC
values, in order to find the regressions which are actually doing a good job
of prediction *AND* are meeting the model assumptions.
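
A minimal sketch of that kind of check, for one candidate model. It assumes
the data set and variable names from the quoted macro (tsun, patient, a1);
the output data set name diag and the new variable names are hypothetical.

```sas
/* Sketch only. tsun, patient, and a1 come from the quoted macro;  */
/* the output data set name diag is hypothetical.                  */
proc logistic data=tsun;
   /* LACKFIT requests the Hosmer-Lemeshow goodness-of-fit test;   */
   /* INFLUENCE prints case-by-case regression diagnostics.        */
   model patient(event='c') = a1 / lackfit influence;
   /* Save residuals, leverage, and deletion statistics per case.  */
   output out=diag p=phat reschi=pearson resdev=deviance
          h=leverage difchisq=dchisq difdev=ddev;
run;
```

Cases with large residuals or large deletion statistics in diag are the ones
most likely to be driving the AUC up or down.
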
PROC LOGISTIC *already* has selection methods that would be better
than what you are doing. But none of these selection methods will stand
up to statistical peer review. Just look at what the experts on STAT-L
have to say about such methods. (Hint: it's in the STAT-L FAQ because
it's such a problem.)
Furthermore, models like this do not stand up well when you split the data
and use part of it for model building and the other part for model
validation. You'll find that your process inflates the coefficient of
determination, biases the parameter estimates high, biases the p-values
low, etc.
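
A rough sketch of such a split-sample check, assuming the names from the
quoted macro (tsun, patient, a1, a2). The data set names split and scored,
the 70/30 split, and the seed are hypothetical choices.

```sas
/* Sketch only. tsun, patient, a1, and a2 come from the quoted     */
/* macro; split, scored, the 70/30 rate, and the seed are          */
/* hypothetical choices.                                           */
proc surveyselect data=tsun out=split samprate=0.7 outall seed=20070627;
run;

/* OUTALL keeps every record and adds Selected (1 = sampled).      */
proc logistic data=split(where=(Selected=1));
   model patient(event='c') = a1 a2;
   /* Score the held-out records with the model fit above.         */
   score data=split(where=(Selected=0)) out=scored;
run;
```

If the model predicts the held-out records in scored much worse than the
records it was built on, that gap is the inflation described above.
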
If you really need model prediction tools like this, then look into PROC
GLMSELECT instead. But you'll always do better in terms of real model
prediction if you use expert knowledge instead.
HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330