Date: Wed, 5 Jul 2006 11:19:45 -0700
Reply-To: shiling99@YAHOO.COM
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: shiling99@YAHOO.COM
Organization: http://groups.google.com
Subject: Re: Increasing efficiency / reducing runtime
In-Reply-To: <1152047313.932143.96220@v61g2000cwv.googlegroups.com>
Content-Type: text/plain; charset="iso-8859-1"
I think that when you create a data set this large, it is the write to
disk that takes most of the wall-clock time, more than anything else.
In your process,
1) keep only the necessary variables (in both the input and output data sets)
2) create all the NEW variables you need in one step
3) define variable lengths no longer than necessary.
In your example, event30 and event45 could be defined as $1. This may/will
save both processing time and storage space.
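The three points above can be combined in a single DATA step. This is only a sketch: the data set names work.sim and work.sim_events and the variable name replicate are my assumptions, not taken from your program, and I have assumed that the 45-day branch was meant to censor at 45 rather than 30.

```sas
/* Sketch only. Hypothetical names: work.sim (input),                */
/* work.sim_events (output), replicate (replicate identifier).       */
data work.sim_events (keep=replicate time30 event30 time45 event45);
    /* 1) Read only the variables that are needed. */
    set work.sim (keep=replicate time event);

    /* 3) Declare the shortest usable lengths before first use. */
    length event30 event45 $1;

    /* 2) Create all new variables in this one pass over the data. */

    /* Cutoff is 30 days */
    if time <= 30 and event = 1 then do;
        time30  = time;
        event30 = '1';
    end;
    else do;
        time30  = 30;
        event30 = '0';
    end;

    /* Cutoff is 45 days (assumes censoring at 45 was intended) */
    if time <= 45 and event = 1 then do;
        time45  = time;
        event45 = '1';
    end;
    else do;
        time45  = 45;
        event45 = '0';
    end;
run;
```

If a later procedure needs the indicators as numeric variables (a censoring variable in PHREG, for example), "length event30 event45 3;" with numeric 0/1 values is the alternative; it stores each indicator in 3 bytes instead of the default 8.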
> I would like to know if you agree with the use of the if-then
> statements as illustrated above. I believe that to improve efficiency,
> the condition that follows the "if" should be the one that will be true
> most of the time. I have written the statements accordingly.
That may not hold in your case, but I am not sure. I constructed a
simple test (never true vs. always true) along the lines of your
example. You can run it on your PC and judge for yourself.
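/* Test data: 5,000,000 uniform random numbers between 0 and 1 */
data t1;
  do i = 1 to 5000000;
    x = ranuni(123);
    output;
  end;
run;

/* Condition is never true; both branches are null statements */
data _null_;
  set t1;
  if x < -1 then;
  else;
run;

/* Condition is always true; both branches are null statements */
data _null_;
  set t1;
  if x > -1 then;
  else;
run;

/* Condition is never true; the ELSE branch does the assignment */
data _null_;
  set t1;
  if x < -1 then y = 0;
  else y = 1;
run;

/* Condition is always true; the THEN branch does the assignment */
data _null_;
  set t1;
  if x > -1 then y = 1;
  else y = 0;
run;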
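To compare the steps, look at the "real time" and "cpu time" lines that SAS writes to the log after each one. A small sketch, assuming t1 from the first step already exists: the FULLSTIMER system option makes the log report more detail than the default timing note.

```sas
/* Ask for detailed timing notes in the log. */
options fullstimer;

/* Rerun any of the test steps; this is the "always true" case. */
data _null_;
  set t1;
  if x > -1 then y = 1;
  else y = 0;
run;

/* Restore the shorter default log notes. */
options nofullstimer;
```

Run each step a few times before comparing, because disk caching can make the first read of t1 slower than the later ones.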
For this part, I would follow the principle of 'clear logic first' and
forget about efficiency.
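As one illustration of what I mean by clear logic, the same rule can be written with a comparison expression, which SAS evaluates to 1 (true) or 0 (false), plus the IFN function. This is a sketch under assumptions: the data set names work.sim and work.sim_events are hypothetical, the indicators are numeric here, and I assume the 45-day branch censors at 45. I am not claiming it runs faster than IF-THEN/ELSE.

```sas
/* Sketch only. Hypothetical names: work.sim, work.sim_events. */
data work.sim_events;
    set work.sim;

    /* Cutoff is 30 days: 1 if the event occurred by day 30, else 0. */
    event30 = (time <= 30 and event = 1);
    /* Event time if the event counts, otherwise censor at the cutoff. */
    time30  = ifn(event30, time, 30);

    /* Cutoff is 45 days (assumes censoring at 45 was intended). */
    event45 = (time <= 45 and event = 1);
    time45  = ifn(event45, time, 45);
run;
```

Each cutoff now takes two lines, so adding another definition of "event" means adding two lines with a new number.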
HTH.
Daniel wrote:
> Hello All,
>
> I have a dataset with 500 million simulated observations grouped by
> replicate (1000 replicates of a random cohort with n=500,000). I have
> an indicator variable telling me whether an event occurred (1) or not
> (0). I also have a time variable that tells me when this event occurred
> or until when I followed the subject (i.e. survival data). I need to do
> some analyses (such as computing the rate ratio and odds ratio, as well
> as the hazard ratio using the PHREG procedure) where I will employ
> different definitions for "event". Two such definitions would be:
>
> * Cutoff is 30 days ;
> If time<=30 and event=1 then do;
> time30=time;
> event30=1;
> end;
>
> else do;
> event30=0;
> time30=30;
> end;
>
> * Cutoff is 45 days ;
> If time<=45 and event=1 then do;
> time45=time;
> event45=1;
> end;
>
> else do;
> event45=0;
> time45=45;
> end;
>
> and so on. I first create these conditions in my dataset. Given the
> large size of this dataset I expect this to take quite a while; hence
> any improvement in efficiency would be helpful.
>
> I would like to know if you agree with the use of the if-then
> statements as illustrated above. I believe that to improve efficiency,
> the condition that follows the "if" should be the one that will be true
> most of the time. I have written the statements accordingly.
>
> According to FAQ #4278 (http://support.sas.com/faq/042/FAQ04278.html),
> the WHERE condition might be more efficient than IF because SAS doesn't
> have to read all observations from the input dataset. However, in this
> case, I think this would imply creating a couple of datasets, applying
> the above conditions and merging them back together in the end. I am
> not sure this would be more efficient, although I have not tried.
>
> I also noticed some conflicting opinions on this listserv on whether
> the bufno and bufsize options should be modified or left as they are.
> Unfortunately this issue is a bit beyond my comprehension but if this
> might be of some help in reducing runtime then I would be willing to
> look at possible references that relate to these topics, if you have
> any.
>
> I am running this program on a Pentium 4 3.20GHz, with 2GB RAM, WinXP
> SP2, and SAS 9.1 TS1M2 on a disk with 147 GB of free space.
>
> Whether you have information from previous experience or if you believe
> I did not do my homework and I am missing something, any comment will
> be greatly appreciated.
>
> Thank you,
>
> Daniel