Paul,
I ran a test earlier that seemed to show a difference in times depending on
where the RETAIN was placed, using steps that eliminate input I/O. After
reading your posting, it occurred to me that I was using a chunk of memory
to store my first data set, so I deleted the first set between runs and got
the following results:
193 data one;
194 retain x ;
195 do i=1 to 1000000; x=1; output;end;
196 run;
NOTE: The data set WORK.ONE has 1000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.19 seconds
cpu time 0.19 seconds
197 Proc delete data = one;
198 run;
NOTE: Deleting WORK.ONE (memtype=DATA).
NOTE: PROCEDURE DELETE used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
199 Data two;
200 do i = 1 to 1000000; retain x ;x=1;output;end;
201 run;
NOTE: The data set WORK.TWO has 1000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.20 seconds
cpu time 0.20 seconds
202 Proc delete data = two;
203 run;
NOTE: Deleting WORK.TWO (memtype=DATA).
NOTE: PROCEDURE DELETE used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
204 Data three;
205 if _n_ = 1 then do;
206 retain x;
207 end;
208 do i=1 to 1000000; x=1; output;end;
209 run;
NOTE: The data set WORK.THREE has 1000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.23 seconds
cpu time 0.21 seconds
It looks like my earlier results were affected by memory usage.
Nat
Nat Wooding
Environmental Specialist III
Dominion, Environmental Biology
4111 Castlewood Rd
Richmond, VA 23234
Phone:804-271-5313, Fax: 804-271-2977
From: Paul Dorfman <sashole@BELLSOUTH.NET>
Sent by: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Why does retain work faster conditionally?
Date: 02/19/2008 10:52 AM
Please respond to: Paul Dorfman <sashole@BELLSOUTH.NET>
Art,
I suspect that this difference in the run times is dictated by external
factors rather than by the differences between the two DATA step versions. I
have eliminated the output data set HAVE to reduce I/O background noise and
repeated the test twice for consistency's sake (under Windows XP Pro on a
T61 ThinkPad, like so):
514 data a ;
515 retain lname 'Galt' fname 'John' ;
516 do _n_ = 1 to 1e7 ;
517 output ;
518 end ;
519 run ;
NOTE: The data set WORK.A has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 6.56 seconds
cpu time 2.57 seconds
520 data _null_ ;
521 retain fname;
522 set a;
523 run;
NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: DATA statement used (Total process time):
real time 1.51 seconds
cpu time 1.51 seconds
524 data _null_ ;
525 if _n_ eq 1 then do;
526 retain fname;
527 end;
528 set a;
529 run;
NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: DATA statement used (Total process time):
real time 1.54 seconds
cpu time 1.51 seconds
530 data _null_ ;
531 retain fname;
532 set a;
533 run;
NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: DATA statement used (Total process time):
real time 1.48 seconds
cpu time 1.48 seconds
534 data _null_ ;
535 if _n_ eq 1 then do;
536 retain fname;
537 end;
538 set a;
539 run;
NOTE: There were 10000000 observations read from the data set WORK.A.
NOTE: DATA statement used (Total process time):
real time 1.54 seconds
cpu time 1.54 seconds
However, even though the steps compared as I expected (i.e., executing a
conditional statement 10 million times costs more than nothing), I would not
draw a definite conclusion based on this comparison because the
background input noise still mars the measurement.
The analogy I usually use in this sort of situation is that it is
physically impossible to use a weigh station scale to weigh a fly by
subtracting the weight of the bare-ass elephant from the weight of the
elephant measured with the fly on its behind, for the difference will
inevitably be dwarfed by the measurement errors. To weigh the fly, one needs
to eliminate the elephant from the picture and weigh the fly (preferably
not airborne) itself using a precision scale.
In this case, eliminating the elephant would mean:
602 data _null_ ;
603 lname = 'Galt' ;
604 fname = 'John' ;
605 do _n_ = 1 to 5e9 ;
606 retain fname ;
607 end ;
608 run ;
NOTE: DATA statement used (Total process time):
real time 1:17.40
cpu time 1:17.35
609 data _null_ ;
610 lname = 'Galt' ;
611 fname = 'John' ;
612 do _n_ = 1 to 5e9 ;
613 if _n_ = 1 then do ;
614 retain fname ;
615 end ;
616 end ;
617 run ;
NOTE: DATA statement used (Total process time):
real time 1:21.50
cpu time 1:21.23
Note that SAS is so blazingly fast in the execution of the conditional
statement that I have been able to detect a measurable difference (and that
is after eliminating all I/O!) only by iterating the loops over a billion
times. Iterating them only 10 million times resulted in 0.15 seconds for
each step, with the difference falling below the accuracy of the measurement.
Of course, to my mind, all the measurements with RETAIN inside an IF-THEN
DO-END block are a funny exercise, not unlike an experiment I would stage to
prove to myself that it is impossible to build a perpetuum mobile, because I
know from the outset that at run time SAS simply does not see RETAIN (all
its actions have been completed at compile time beforehand). A good
hint that RETAIN was never intended to be run conditionally is that
the "instruction"
if _n_ = 1 then do retain fname ;
will not even compile -- a RETAIN statement must begin with the RETAIN
keyword right after the preceding semicolon. That is why it compiles within
the DO-END block, although at run time SAS sees no difference
whatsoever between
if _n_ = 1 then do ;
retain fname ;
end ;
and
if _n_ = 1 then do ;
end ;
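One way to see the compile-time nature of RETAIN directly is to put it in a branch that can never be taken. The following is only a minimal sketch: the data set names NUMS and DEMO and the variable TOTAL are made up for illustration and are not part of the tests above. If RETAIN does all its work at compile time, I would expect TOTAL to accumulate to 1, 3 and 6 across the three rows even though the DO-END block holding it never executes.

```sas
/* Hypothetical three-row input; NUMS and I are names invented for this sketch. */
data nums ;
   do i = 1 to 3 ;
      output ;
   end ;
run ;

data demo ;
   /* This condition is always false, so the block is never executed at run time... */
   if 0 then do ;
      retain total 0 ;   /* ...yet the compiler still flags TOTAL as retained, starting at 0 */
   end ;
   set nums ;
   total = total + i ;   /* would be missing on every row if TOTAL were reset each iteration */
run ;

proc print data = demo ;
run ;
```

Commenting out the RETAIN line and rerunning should leave TOTAL missing on every row, which is the whole difference the statement makes, and none of it happens at run time.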
Kind regards
------------
Paul Dorfman
Jax, FL
------------
-------------- Original message ----------------------
From: Arthur Tabachneck <art297@NETSCAPE.NET>
>
> One of our most respected list members wrote me off-line, asking why in
> the world I would have suggested wrapping a retain statement within a
> condition.
>
> That is, given the following data:
>
> data have;
> input lname$ fname$;
> do i=1 to 1000000;output;end;
> cards;
> lname1 fname1
> lname2 fname2
> ;
>
> why write:
>
> data want;
> if _n_ eq 1 then do;
> retain fname;
> end;
> set have;
> run;
>
> instead of:
> data want;
> retain fname;
> set a;
> run;
>
> I know why I provided the solution, because it had better performance, but
> I could sure use some feedback explaining why that would be so.
>
> I initially wrote it correctly and, upon seeing that it worked slower than
> Jiann's SQL solution, tried to see if I could bypass reading the data
> (i.e., when _n_ eq 0).
>
> After I soon realized that wouldn't be possible, I ran the step as
> presented.
>
> Someone please explain to me why:
>
> 60 data want;
> 61 if _n_ eq 1 then do;
> 62 retain fname;
> 63 end;
> 64 set a;
> 65 run;
>
> NOTE: There were 2000000 observations read from the data set WORK.A.
> NOTE: The data set WORK.WANT has 2000000 observations and 3 variables.
> NOTE: DATA statement used (Total process time):
> real time 1.12 seconds
> cpu time 1.12 seconds
>
> runs noticeably faster (1.12 vs. 1.43 seconds) than:
> 56 data want;
> 57 retain fname;
> 58 set a;
> 59 run;
>
> NOTE: There were 2000000 observations read from the data set WORK.A.
> NOTE: The data set WORK.WANT has 2000000 observations and 3 variables.
> NOTE: DATA statement used (Total process time):
> real time 1.43 seconds
> cpu time 1.43 seconds
>
> I ran the tests on a 4-processor Windows 2003 system with 12 gig of RAM
> and SAS 9.1.3. It was during a holiday, thus I was the only one using the
> computer and I re-ran the tests 3 times with the same results.
>
> Art
> --------
> On Mon, 18 Feb 2008 23:21:23 -0500, Arthur Tabachneck
> <art297@NETSCAPE.NET> wrote:
>
> >Miguel,
> >
> >As Jiann indicated, you can do what you want with proc sql. However, you
> >can also accomplish the same thing in a data step. For example,
> >
> >data have;
> > input lname$ fname$;
> > do i=1 to 1000000;output;end;
> > cards;
> > lname1 fname1
> > lname2 fname2
> > ;
> >
> >data want;
> > if _n_ eq 1 then do;
> > retain fname;
> > end;
> > set have;
> >run;
> >
> >HTH,
> >Art
> >---------
> >On Tue, 19 Feb 2008 02:55:04 +0000, Miguel de la Hoz <miguel_hoz@YAHOO.ES>
> >wrote:
> >
> >>I am starting my problem with the following disposal of my dataset:
> >
> ># variable
> >1 lname
> >2 fname
> >
> >I am trying to export it to excel but it is keeping that order. I would
> >like to be able to write
> >
> ># variable
> >1 fname
> >2 lname
> >
> >This is only an example my dataset contains around 20 fields.
> >
> >Thanks.
> >
> >MDH.