Date:   Mon, 16 Apr 2012 18:45:41 +0000
Reply-To:   toby dunn <tobydunn@HOTMAIL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   toby dunn <tobydunn@HOTMAIL.COM>
Subject:   Re: How to change many missing variables to 0 in a single data step
Comments:   To: friedegg2012@gmail.com
In-Reply-To:   <CANS8eEgvnVcdcE=pe3W=w+7+vR0Ea0hvsT+_zL_3Xg+jOP81Dg@mail.gmail.com>
Content-Type:   text/plain; charset="Windows-1252"

Let's see here:

> data _null_;
>   set foo;
>   array v _numeric_;
>   do over v;
>     if missing(v) then v=0;
>   end;
> run;
>
> NOTE: There were 500000 observations read from the data set WORK.FOO.
> NOTE: DATA statement used (Total process time):
>       real time           0.63 seconds
>       user cpu time       0.56 seconds
>       system cpu time     0.07 seconds

VS

> data _null_;
>   set foo;
>   array v &var_names;
>   do over v;
>     if missing(v) then v=0;
>   end;
> run;
>
> NOTE: There were 500000 observations read from the data set WORK.FOO.
> NOTE: DATA statement used (Total process time):
>       real time           0.63 seconds
>       user cpu time       0.54 seconds
>       system cpu time     0.09 seconds

So real time is the same, #2 wins by .02 on user CPU time, while #1 wins by .02 on system CPU time. Two things come to mind: first, I don't see a clear winner, and second, who gives a flying rat's ass which one is faster with numbers this freakin' close?

In short, too many people after all these years are still hung up on speed. In reality they should be worried about the readability and maintainability of the code. Why? Because the people coming behind you will spend more time reading, understanding, and maintaining your code than you did writing and testing the damn thing. Unless there is a significant difference in time, why do we waste our efforts on eking out .02 in CPU time? Given that, I am still in favor of #1 over #2 because it is easier to read and maintain.

This:

> proc stdize data=foo out=_null_ reponly missing=0; run;
>
> NOTE: No VAR statement is given. All numerical variables not named
>       elsewhere make up the first set of variables.
> NOTE: There were 500000 observations read from the data set WORK.FOO.
> NOTE: PROCEDURE STDIZE used (Total process time):
>       real time           0.74 seconds
>       user cpu time       0.66 seconds
>       system cpu time     0.09 seconds

Is the best so far even if it takes a hair longer to run.
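
For anyone who wants to try the PROC STDIZE approach on something small first, here is a minimal sketch. The data set names DEMO and DEMO_FILLED and the variables ID, X, Y, and Z are made up for illustration and are not from the benchmark above. It assumes SAS/STAT is licensed, since that is where PROC STDIZE lives. The VAR statement is optional, as the NOTE in the log above shows, but listing the variables keeps an identifier column out of the replacement.

```sas
/* Hypothetical toy data; not the FOO data set from the thread. */
data demo;
  input id x y z;   /* a lone period in list input reads as missing */
  datalines;
1 . 2 3
2 4 . .
3 7 8 9
;
run;

/* REPONLY replaces missing values without standardizing anything;
   MISSING=0 supplies the replacement value.
   VAR limits the replacement to X, Y, and Z, so ID is left alone. */
proc stdize data=demo out=demo_filled reponly missing=0;
  var x y z;
run;

/* Inspect the result. */
proc print data=demo_filled noobs;
run;
```

Unlike the timing runs, which write to _NULL_, this writes a real output data set so the filled values can be checked by eye.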

Toby Dunn

If you get thrown from a horse, you have to get up and get back on, unless you landed on a cactus; then you have to roll around and scream in pain. “Any idiot can face a crisis—it’s day to day living that wears you out” ~ Anton Chekhov

> Date: Mon, 16 Apr 2012 12:11:48 -0600
> From: friedegg2012@GMAIL.COM
> Subject: Re: How to change many missing variables to 0 in a single data step
> To: SAS-L@LISTSERV.UGA.EDU
>
> The generate if statements does appear to be the quickest implementation
> with the given problem ( ~50 columns x ~500k rows). Here is some code to
> generate and compare the given solutions. I also expanded the miss2zero
> macro a little work with non-standardized variable names through
> collection. It would fit nicely into a macro function sandwich (a la Mike
> Rhoads) to avoid the compile and generate steps into a single call.
>
> /* simulate non standardizes variable names */
> proc sql noprint;
>   select distinct compress(Subsidiary,,'ka')
>     into :bar_arr separated by ' '
>     from sashelp.shoes;
>   %let bar_dim=&sqlobs;
> quit;
> NOTE: PROCEDURE SQL used (Total process time):
>       real time           0.01 seconds
>       user cpu time       0.00 seconds
>       system cpu time     0.00 seconds
>
> /* generate 53x500,000 sample data with 40% random missing */
> data foo;
>   call streaminit(12345);
>   array bar[&bar_dim] &bar_arr;
>   do id=1 to 500000;
>     do _n_=1 to &bar_dim;
>       bar[_n_]=rand('uniform');
>       if rand('table',.6,.4) > 1 then call missing(bar[_n_]);
>     end;
>     output;
>   end;
> run;
>
> NOTE: The data set WORK.FOO has 500000 observations and 54 variables.
> NOTE: DATA statement used (Total process time):
> 2 The SAS System
> 10:43 Monday, April 16, 2012
>
>       real time           3.16 seconds
>       user cpu time       2.53 seconds
>       system cpu time     0.60 seconds
>
> /* test variable imputation methods
>    will use missing() instead of =. to account for all missing
>    values i.e. =.Z */
>
> *array method using _numeric_ variable list;
> data _null_;
>   set foo;
>   array v _numeric_;
>   do over v;
>     if missing(v) then v=0;
>   end;
> run;
>
> NOTE: There were 500000 observations read from the data set WORK.FOO.
> NOTE: DATA statement used (Total process time):
>       real time           0.63 seconds
>       user cpu time       0.56 seconds
>       system cpu time     0.07 seconds
>
> *array method using collected variable list;
> proc sql noprint;
>   select name
>     into :var_names separated by ' '
>     from sashelp.vcolumn
>     where libname='WORK' and memname='FOO';
> quit;
> NOTE: PROCEDURE SQL used (Total process time):
>       real time           0.00 seconds
>       user cpu time       0.00 seconds
>       system cpu time     0.00 seconds
>
> data _null_;
>   set foo;
>   array v &var_names;
>   do over v;
>     if missing(v) then v=0;
>   end;
> run;
>
> NOTE: There were 500000 observations read from the data set WORK.FOO.
> NOTE: DATA statement used (Total process time):
>       real time           0.63 seconds
>       user cpu time       0.54 seconds
>       system cpu time     0.09 seconds
>
> *proc stdize with reponly option, my personal favorite response to
>  this topic;
> proc stdize data=foo out=_null_ reponly missing=0; run;
>
> NOTE: No VAR statement is given. All numerical variables not named
>       elsewhere make up the first set of variables.
> NOTE: There were 500000 observations read from the data set WORK.FOO.
> NOTE: PROCEDURE STDIZE used (Total process time):
>       real time           0.74 seconds
>       user cpu time       0.66 seconds
>       system cpu time     0.09 seconds
>
> *macro with if statements for non-standardized varaible names;
> %macro impute_missing(action= ,libname= ,memname= ,type=num
>                       ,prefix=n ,impute_value=0);
>   %if &action = compile %then %do;
>     data _null_;
>       do i=1 by 1 until(done);
>         set sashelp.vcolumn end=done;
>         where libname="%upcase(&libname)" and
>               memname="%upcase(&memname)" and type="%lowcase(&type)";
>         call symputx(cats("g_m2z_&prefix",i),name,'g');
>       end;
>       call symputx("g_m2z_&prefix.0",i,'g');
>     run;
>   %end;
>   %if &action = generate %then %do;
>     %do i=1 %to &&g_m2z_&prefix.0;
>       if missing(&&&g_m2z_&prefix.&i) then &&&g_m2z_&prefix.&i=0;
>     %end;
>   %end;
> %mend;
>
> %impute_missing(action=compile ,libname=WORK ,memname=FOO
>                 ,type=num ,prefix=n);
>
> NOTE: The query as specified involves ordering by an item that doesn't
>       appear in its SELECT clause.
> NOTE: There were 54 observations read from the data set SASHELP.VCOLUMN.
>       WHERE (libname='WORK') and (memname='FOO') and (type='num');
> NOTE: DATA statement used (Total process time):
>       real time           0.08 seconds
>       user cpu time       0.06 seconds
>       system cpu time     0.02 seconds
>
> data _null_;
>   set foo;
>   %impute_missing(action=generate ,prefix=n ,impute_value=0);
> run;
>
> NOTE: There were 500000 observations read from the data set WORK.FOO.
> NOTE: DATA statement used (Total process time):
>       real time           0.28 seconds
>       user cpu time       0.18 seconds
>       system cpu time     0.09 seconds