Dear All:
For the purposes of regression, missing is indeed equal to 0.
Randy
On Mon, 16 Apr 2012 18:45:41 +0000, toby dunn <tobydunn@HOTMAIL.COM> wrote:
>Let's see here:
>
>
>> data _null_;
>> set foo;
>> array v _numeric_;
>> do over v;
>> if missing(v) then v=0;
>> end;
>> run;
>>
>> NOTE: There were 500000 observations read from the data set WORK.FOO.
>> NOTE: DATA statement used (Total process time):
>> real time 0.63 seconds
>> user cpu time 0.56 seconds
>> system cpu time 0.07 seconds
>
>
>
>
>VS
>
>
>> data _null_;
>> set foo;
>> array v &var_names;
>> do over v;
>> if missing(v) then v=0;
>> end;
>> run;
>>
>> NOTE: There were 500000 observations read from the data set WORK.FOO.
>> NOTE: DATA statement used (Total process time):
>> real time 0.63 seconds
>> user cpu time 0.54 seconds
>> system cpu time 0.09 seconds
>
>
>
>So real time is the same; #2 wins by .02 on user CPU time, while #1 wins by
>.02 on system CPU time.
>
>Two things come to mind: one, I don't see a clear winner, and two, who
>gives a flying rat's ass which one is faster with numbers this
>freakin' close?
>
>In short, too many people, after all these years, are still hung up on
>speed. In reality they should be worried about the readability and
>maintainability of the code.
>Why? Because the people coming behind you will spend more time reading and
>trying to understand and maintain your code than you did writing and testing
>the damn thing.
>Unless there is a significant difference in time, why do we waste our efforts
>on eking out .02 in CPU time?
>
>That is why I am still in favor of #1 over #2: it is easier to read and
>maintain.
>
>
>This:
>
>> proc stdize data=foo out=_null_ reponly missing=0; run;
>>
>> NOTE: No VAR statement is given. All numerical variables not named
>> elsewhere make up the first set of variables.
>> NOTE: There were 500000 observations read from the data set WORK.FOO.
>> NOTE: PROCEDURE STDIZE used (Total process time):
>> real time 0.74 seconds
>> user cpu time 0.66 seconds
>> system cpu time 0.09 seconds
>
>
>Is the best so far even if it takes a hair longer to run.
>
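
A minimal sketch of the PROC STDIZE approach when the result is kept rather than sent to _null_. The output name foo_filled is a hypothetical choice, not something from the thread, and the explicit VAR statement is optional; it only makes visible which columns are touched.

```sas
/* REPONLY: replace missing values only, do not standardize anything. */
/* MISSING=0: the replacement value for those missing values.         */
/* foo_filled is an assumed output name, used here for illustration.  */
proc stdize data=foo out=foo_filled reponly missing=0;
   var _numeric_;   /* same set of columns the NOTE in the log describes */
run;
```

With a VAR statement present, the "No VAR statement is given" note should no longer appear, and the list can be narrowed to specific columns if some numeric variables should keep their missing values.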
>
>
>Toby Dunn
>
>
>If you get thrown from a horse, you have to get up and get back on, unless
>you landed on a cactus; then you have to roll around and scream in pain.
>
>"Any idiot can face a crisis - it's day to day living that wears you out"
>~ Anton Chekhov
>
>
>
>> Date: Mon, 16 Apr 2012 12:11:48 -0600
>> From: friedegg2012@GMAIL.COM
>> Subject: Re: How to change many missing variables to 0 in a single data step
>> To: SAS-L@LISTSERV.UGA.EDU
>>
>> The generated IF statements do appear to be the quickest implementation
>> for the given problem (~50 columns x ~500k rows). Here is some code to
>> generate and compare the given solutions. I also expanded the miss2zero
>> macro a little to work with non-standardized variable names through
>> collection. It would fit nicely into a macro function sandwich (a la Mike
>> Rhoads) to combine the compile and generate steps into a single call.
>>
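
One way to get the "single call" idea mentioned above, sketched under assumptions. This is not the macro function sandwich itself; it is a simpler alternative that walks the data set's variables with %SYSFUNC at macro time and emits one IF statement per numeric variable. The macro name miss2zero_inline, its parameters, and the output name foo_filled are all hypothetical, and it assumes a data set WORK.FOO like the one built below, with plain variable names (no name literals).

```sas
/* Hypothetical helper: emits IF statements for every numeric variable in &ds. */
%macro miss2zero_inline(ds=, impute_value=0);
   %local dsid i nvars name rc;
   %let dsid = %sysfunc(open(&ds, i));            /* open for input to read metadata */
   %if &dsid = 0 %then %do;
      %put ERROR: could not open &ds;
      %return;
   %end;
   %let nvars = %sysfunc(attrn(&dsid, nvars));    /* number of variables in &ds */
   %do i = 1 %to &nvars;
      %if %sysfunc(vartype(&dsid, &i)) = N %then %do;
         %let name = %sysfunc(varname(&dsid, &i));
         if missing(&name) then &name = &impute_value;
      %end;
   %end;
   %let rc = %sysfunc(close(&dsid));              /* release the handle before the step runs */
%mend miss2zero_inline;

data foo_filled;                                  /* assumed output name */
   set foo;
   %miss2zero_inline(ds=work.foo)
run;
```

The generated DATA step body is the same straight-line list of IF statements that the compile/generate macro below produces, so it would be expected to perform similarly, though that was not timed in this thread.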
>> /* simulate non-standardized variable names */
>> proc sql noprint;
>> select distinct compress(Subsidiary,,'ka')
>> into :bar_arr separated by ' '
>> from sashelp.shoes;
>> %let bar_dim=&sqlobs;
>> quit;
>> NOTE: PROCEDURE SQL used (Total process time):
>> real time 0.01 seconds
>> user cpu time 0.00 seconds
>> system cpu time 0.00 seconds
>>
>>
>>
>> /* generate 53x500,000 sample data with 40% random missing */
>> data foo;
>> call streaminit(12345);
>> array bar[&bar_dim] &bar_arr;
>> do id=1 to 500000;
>> do _n_=1 to &bar_dim;
>> bar[_n_]=rand('uniform');
>> if rand('table',.6,.4) > 1 then call missing(bar[_n_]);
>> end;
>> output;
>> end;
>> run;
>>
>> NOTE: The data set WORK.FOO has 500000 observations and 54 variables.
>> NOTE: DATA statement used (Total process time):
>> real time 3.16 seconds
>> user cpu time 2.53 seconds
>> system cpu time 0.60 seconds
>>
>>
>>
>> /* test variable imputation methods
>> will use missing() instead of =. to account for all missing
>> values i.e. =.Z */
>>
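
A small illustration of why the tests below use missing() rather than a comparison with the ordinary missing value. The variable names are made up for the example.

```sas
data _null_;
   x = .Z;                  /* a special missing value */
   eq_dot  = (x = .);       /* compares against the ordinary missing value only */
   is_miss = missing(x);    /* true for ., ._ and .A through .Z */
   put eq_dot= is_miss=;    /* writes: eq_dot=0 is_miss=1 */
run;
```

A check written as `if v = . then v = 0;` would therefore leave .Z (and the other special missing values) in place.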
>> *array method using _numeric_ variable list;
>> data _null_;
>> set foo;
>> array v _numeric_;
>> do over v;
>> if missing(v) then v=0;
>> end;
>> run;
>>
>> NOTE: There were 500000 observations read from the data set WORK.FOO.
>> NOTE: DATA statement used (Total process time):
>> real time 0.63 seconds
>> user cpu time 0.56 seconds
>> system cpu time 0.07 seconds
>>
>>
>>
>> *array method using collected variable list;
>> proc sql noprint;
>> select name
>> into :var_names separated by ' '
>> from sashelp.vcolumn
>> where libname='WORK' and memname='FOO';
>> quit;
>> NOTE: PROCEDURE SQL used (Total process time):
>> real time 0.00 seconds
>> user cpu time 0.00 seconds
>> system cpu time 0.00 seconds
>>
>>
>>
>> data _null_;
>> set foo;
>> array v &var_names;
>> do over v;
>> if missing(v) then v=0;
>> end;
>> run;
>>
>> NOTE: There were 500000 observations read from the data set WORK.FOO.
>> NOTE: DATA statement used (Total process time):
>> real time 0.63 seconds
>> user cpu time 0.54 seconds
>> system cpu time 0.09 seconds
>>
>>
>>
>> *proc stdize with reponly option, my personal favorite response to
>> this topic;
>> proc stdize data=foo out=_null_ reponly missing=0; run;
>>
>> NOTE: No VAR statement is given. All numerical variables not named
>> elsewhere make up the first set of variables.
>> NOTE: There were 500000 observations read from the data set WORK.FOO.
>> NOTE: PROCEDURE STDIZE used (Total process time):
>> real time 0.74 seconds
>> user cpu time 0.66 seconds
>> system cpu time 0.09 seconds
>>
>>
>>
>> *macro with if statements for non-standardized variable names;
>> %macro impute_missing(action= ,libname= ,memname= ,type=num
>> ,prefix=n ,impute_value=0);
>> %if &action = compile %then %do;
>> data _null_;
>> do i=1 by 1 until(done);
>> set sashelp.vcolumn end=done;
>> where libname="%upcase(&libname)" and
>> memname="%upcase(&memname)" and type="%lowcase(&type)";
>> call symputx(cats("g_m2z_&prefix",i),name,'g');
>> end;
>> call symputx("g_m2z_&prefix.0",i,'g');
>> run;
>> %end;
>> %if &action = generate %then %do;
>> %do i=1 %to &&g_m2z_&prefix.0;
>> if missing(&&g_m2z_&prefix.&i) then &&g_m2z_&prefix.&i=&impute_value;
>> %end;
>> %end;
>> %mend;
>>
>> %impute_missing(action=compile ,libname=WORK ,memname=FOO
>> ,type=num ,prefix=n);
>>
>> NOTE: The query as specified involves ordering by an item that doesn't
>> appear in its SELECT clause.
>> NOTE: There were 54 observations read from the data set SASHELP.VCOLUMN.
>> WHERE (libname='WORK') and (memname='FOO') and (type='num');
>> NOTE: DATA statement used (Total process time):
>> real time 0.08 seconds
>> user cpu time 0.06 seconds
>> system cpu time 0.02 seconds
>>
>>
>> data _null_;
>> set foo;
>> %impute_missing(action=generate ,prefix=n ,impute_value=0);
>> run;
>>
>> NOTE: There were 500000 observations read from the data set WORK.FOO.
>> NOTE: DATA statement used (Total process time):
>> real time 0.28 seconds
>> user cpu time 0.18 seconds
>> system cpu time 0.09 seconds
>