| Date: | Fri, 29 Dec 2000 11:11:24 -0500 |
| Reply-To: | Howard Schreier <Howard_Schreier@ITA.DOC.GOV> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Howard Schreier <Howard_Schreier@ITA.DOC.GOV> |
| Subject: | Re: dropping missing data records |
|---|
Ian raises some good points. Here is his adaptation of my adaptation of ...
data nonmissing ( drop = i idpos ) ;
retain idpos ;
set a;
array chr(*) _character_;
if _n_=1 then do i = 1 to dim(chr);
if upcase(vname(chr(i)))="ID" then idpos = i;
end;
if n (of _numeric_) > 1 then return ;
do i = 1 to idpos - 1 , idpos + 1 to dim(chr) ;
if chr(i) ^= " " then return ;
end;
delete ;
run ;
One very subtle point is that because the reference to _NUMERIC_
appears after the DO loop with index variable I, _NUMERIC_ includes I in
addition to the numeric variables in the data set A. I presume that testing
for >1 rather than >0 is an allowance for this. But it would be pretty easy
in maintenance of working code to introduce other numeric variables without
making the necessary adjustments. That in part is why I used the roundabout
method of defining an array of the numeric variables in A.
Also, it looks like Ian's DO loop testing for missing values in character
variables will generate out-of-range subscripts if ID is either the first or
last character variable in A. The fix would be (not tested):
do i = 1 to dim(chr) ;
if chr(i) ^= " " and i^=idpos then return ;
end;
Another approach is to use a RETAIN statement to force ID into the first
sequential position. This would have the presumably acceptable side effect
of possibly changing the order of variables between the input and output
data sets. The code then becomes (not tested):
data nonmissing ;
retain id ;
set a;
array chr(*) _character_;
if n (of _numeric_) > 0 then return ;
if dim(chr)>1 then do i = 2 to dim(chr) ;
if chr(i) ^= " " then return ;
end;
delete ;
run ;
In my earlier code I defined a temporary array just to record the sequential
position of the variable ID. I agree with Ian that it's not necessary.
However, it would support solutions to a more general version of the problem
in which there might be multiple key and categorical variables to be
excluded from the filter.
On Thu, 28 Dec 2000 16:44:18 -0500, Ian Whitlock <WHITLOI1@WESTAT.COM>
wrote:
>Subject: Re: dropping missing data records
>Summary: Mistakes, symmetry, and summary.
>Respondent: Ian Whitlock <whitloi1@westat.com>
>
>Vaishali Joshi [vjoshi@QRC.COM] asked how to remove records where ID
>was the only non-missing value.
>
>Puddin' Man's [pudding_man@POSTMARK.NET] response to Paul Dorfman
>[paul_dorfman@HOTMAIL.COM] comment caught my eye.
>
>> > time. This becomes readily evident if we run the Puddin' Man's
solution
>> > alongside. For the sake of equal justice, I have trimmed Puddin's
>product
>> > down as well. Parsimony has its drawbacks, but sometimes it allows to
>> > achieve an eye-pleasing symmetry:
>> >
>> > data c;
>> > set a;
>> > array nn _numeric_; array cc _character_;
>> > do _i_=1 to dim(nn) until (nn(_i_) > .); end; if _i_ <= dim(nn);
>> > do _i_=1 to dim(cc) until (cc(_i_) > ''); end; if _i_ <= dim(cc);
>> > run ;
>>
>> "Beauty Is In The Eye Of The Beholder" ???
>>
>> Well, the data step code I presented was sufficiently inelegant
>> that I suppose _someone_ should have cleaned it up (I was in a
>> hurry). Come to think of it the SQL wasn't exactly "Pretty"
>> either ... :-)
>>
>> But I, for one, don't see much in the way of "eye-pleasing symmetry"
>> in Paul's code (above). At least with my ("undefendable") code,
>> you could glance at it and (due to the explicit transfer-of-
>> control) it's purpose would sorta jump up and smack you in
>> the face. I hadda stare at the above for a minute to determine
>> that it would do the same processing ...
>
>Oddly, the mistake here is one of symmetry. While the code is
>symmetric to the eye, it is not symmetric in action. If all numeric
>variables are missing the record will be deleted no matter how many
>character variables are not missing.
>
>I then turned to Puddin's original offering.
>
>> proc sql noprint;
>> select name into :num separated by " "
>> from dictionary.columns
>> where libname = "WORK" and memname="TEST"
>> and not(left(upcase(name))='ID')
>> and left(upcase(type))='NUM '
>> ;
>> select name into :char separated by " "
>> from dictionary.columns
>> where libname = "WORK" and memname="TEST"
>> and not(left(upcase(name))='ID')
>> and left(upcase(type))='CHAR'
>> ;
>>
>> quit ;
>> %put num=&num char=&char;
>>
>> data test;
>> set test ;
>> array num #
>> array char &char;
>> do over num;
>> if num ne . then goto outp;
>> end;
>> do over char ;
>> if char ne ' ' then goto outp;
>> end;
>> delete;
>> return;
>>
>> outp:
>> output;
>> run ;
>
>Here the mistake is that two records will be output when there are
>both nonmissing numeric and character variables. To simplify and
>correct
>
> data test;
> set test ;
> array num #
> array char &char;
> do over num;
> if num ne . then return ;
> end;
> do over char ;
> if char ne ' ' then return ;
> end;
> delete;
> run ;
>
>For a different kind of solution consider
>
> data dummy ; output ; stop ; set a ; run ;
>
> proc sql ;
> create table c as
> select * from a
> except all
> select * from dummy
> ;
> quit ;
>
>As Howard Schreier [Howard_Schreier@ITA.DOC.GOV] pointed out, none
>of these solutions consider the presence if the ID variable. He
>then offered a solution when ID is character using a reworked
>version of Paul's reworked version of Ed Heaton's
>[heatone@westat.com] code. There is no mistake, but the solution can
>be simplified and the efficiency improved.
>
>> data nonmissing (drop=nmc);
>> set a;
>> array num _numeric_;
>> array chr(*) _character_;
>> array notidpos(100) _temporary_ (100*1);
>> length nmc $1;
>> if _n_=1 then do i = 1 to dim(chr);
>> if upcase(vname(chr(i)))="ID" then notidpos(i) = 0;
>> end;
>> do i = 1 to dim(chr);
>> if notidpos(i) then nmc = compress(nmc || chr(i));
>> end;
>> if nmiss(of num(*)) < dim(num) or nmc > '' ;
>> run ;
>
>
>There is no need for two of the arrays. There is no need to test
>NOTIDPOS(I). I also dropped Ed's concatenation flag for the simpler
>test CHR(I) ^= " " so that the loop would stop as soon as possible.
>
> data nonmissing ( drop = i idpos ) ;
> retain idpos ;
> set a;
> array chr(*) _character_;
>
> if _n_=1 then do i = 1 to dim(chr);
> if upcase(vname(chr(i)))="ID" then idpos = i;
> end;
>
> if n (of _numeric_) > 1 then return ;
> do i = 1 to idpos - 1 , idpos + 1 to dim(chr) ;
> if chr(i) ^= " " then return ;
> end;
> delete ;
> run ;
>
>I like the simplicity of
>
> data nonmissing ( drop = i ) ;
>
> set a ( drop = id ) ;
> array chr(*) _character_;
> set a ( keep = id ) ;
>
> if n (of _numeric_) > 1 then return ;
> do i = 1 to dim(chr) ;
> if chr(i) ^= " " then return ;
> end;
> delete ;
> run ;
>
>But it requires two buffers to manage A and would probably loose on
>efficiency.
>
>Ian Whitlock
|