LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (December 2000, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Fri, 29 Dec 2000 13:43:05 -0500
Reply-To:   Ian Whitlock <WHITLOI1@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Ian Whitlock <WHITLOI1@WESTAT.COM>
Subject:   Re: dropping missing data records
Comments:   To: Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Content-Type:   text/plain; charset="iso-8859-1"

Howard,

I agree that my use of _NUMERIC_ could lead to maintenance problems if the step is modified to do more than intended. I also agree that your array approach is superior if one needs to generalize to removing a number of character variables from the filter. However, in this case I would probably turn to a two step appraoch like that of Puddin'. My main objection to your

array notidpos(100) _temporary_ (100*1);

was the specific size 100, since it too can lead to a maintenance problem.

However, there is nothing wrong with my

do i = 1 to idpos - 1 , idpos + 1 to dim(chr) ;

when IDPOS=1 or DIM(CHR), since iterative loop indices are tested at the top of the loop. To consider the extreme, let's make ID both the first and last character variable in the array.

448 data _null_ ; 449 array chr (1) id ; 450 idpos = 1 ; 451 put "before loop" ; 452 do i = 1 to idpos - 1 , idpos + 1 to dim(chr) ; 453 put "inloop" ; 454 end ; 455 put "after loop" ; 456 run ;

before loop after loop

This morning I reconsidered the question looping over the numeric variables versus using the function NMISS or N. If there are many numeric variables or it is likely that one of the early ones is not missing then the loopsing will be more efficient. Thus it seems that the looping method will be best much of the time.

I also relooked at DO OVER. I had stopped using this construction early in Version 5 when the Institute claimed it to be slow and possibly to be removed from future versions. In Version 8.0, I was surprised to find it quite comparable in speed with explicit indexing and possibly better. On the other hand, a quick search for the phrases, "do over" and "implicit array" in the on line documentation turned up empty.

Any comments?

Ian Whitlock <whitloi1@westat.com>

-----Original Message----- From: Howard Schreier [mailto:Howard_Schreier@ITA.DOC.GOV] Sent: Friday, December 29, 2000 11:11 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: dropping missing data records

Ian raises some good points. Here is his adaptation of my adaptation of ...

data nonmissing ( drop = i idpos ) ; retain idpos ; set a; array chr(*) _character_; if _n_=1 then do i = 1 to dim(chr); if upcase(vname(chr(i)))="ID" then idpos = i; end; if n (of _numeric_) > 1 then return ; do i = 1 to idpos - 1 , idpos + 1 to dim(chr) ; if chr(i) ^= " " then return ; end; delete ; run ;

One very subtle point is that because the reference to _NUMERIC_ appears after the DO loop with index variable I, _NUMERIC_ includes I in addition to the numeric variables in the data set A. I presume that testing for >1 rather than >0 is an allowance for this. But it would be pretty easy in maintenance of working code to introduce other numeric variables without making the necessary adjustments. That in part is why I used the roundabout method of defining an array of the numeric variables in A.

Also, it looks like Ian's DO loop testing for missing values in character variables will generate out-of-range subscripts if ID is either the first or last character variable in A. The fix would be (not tested):

do i = 1 to dim(chr) ; if chr(i) ^= " " and i^=idpos then return ; end;

Another approach is to use a RETAIN statement to force ID into the first sequential position. This would have the presumably acceptable side effect of possibly changing the order of variables between the input and output data sets. The code then becomes (not tested):

data nonmissing ; retain id ; set a; array chr(*) _character_; if n (of _numeric_) > 0 then return ; if dim(chr)>1 then do i = 2 to dim(chr) ; if chr(i) ^= " " then return ; end; delete ; run ;

In my earlier code I defined a temporary array just to record the sequential position of the variable ID. I agree with Ian that it's not necessary. However, it would support solutions to a more general version of the problem in which there might be multiple key and categorical variables to be excluded from the filter.

On Thu, 28 Dec 2000 16:44:18 -0500, Ian Whitlock <WHITLOI1@WESTAT.COM> wrote:

>Subject: Re: dropping missing data records >Summary: Mistakes, symmetry, and summary. >Respondent: Ian Whitlock <whitloi1@westat.com> > >Vaishali Joshi [vjoshi@QRC.COM] asked how to remove records where ID >was the only non-missing value. > >Puddin' Man's [pudding_man@POSTMARK.NET] response to Paul Dorfman >[paul_dorfman@HOTMAIL.COM] comment caught my eye. > >> > time. This becomes readily evident if we run the Puddin' Man's solution >> > alongside. For the sake of equal justice, I have trimmed Puddin's >product >> > down as well. Parsimony has its drawbacks, but sometimes it allows to >> > achieve an eye-pleasing symmetry: >> > >> > data c; >> > set a; >> > array nn _numeric_; array cc _character_; >> > do _i_=1 to dim(nn) until (nn(_i_) > .); end; if _i_ <= dim(nn); >> > do _i_=1 to dim(cc) until (cc(_i_) > ''); end; if _i_ <= dim(cc); >> > run ; >> >> "Beauty Is In The Eye Of The Beholder" ??? >> >> Well, the data step code I presented was sufficiently inelegant >> that I suppose _someone_ should have cleaned it up (I was in a >> hurry). Come to think of it the SQL wasn't exactly "Pretty" >> either ... :-) >> >> But I, for one, don't see much in the way of "eye-pleasing symmetry" >> in Paul's code (above). At least with my ("undefendable") code, >> you could glance at it and (due to the explicit transfer-of- >> control) it's purpose would sorta jump up and smack you in >> the face. I hadda stare at the above for a minute to determine >> that it would do the same processing ... > >Oddly, the mistake here is one of symmetry. While the code is >symmetric to the eye, it is not symmetric in action. If all numeric >variables are missing the record will be deleted no matter how many >character variables are not missing. > >I then turned to Puddin's original offering. > >> proc sql noprint; >> select name into :num separated by " " >> from dictionary.columns >> where libname = "WORK" and memname="TEST" >> and not(left(upcase(name))='ID') >> and left(upcase(type))='NUM ' >> ; >> select name into :char separated by " " >> from dictionary.columns >> where libname = "WORK" and memname="TEST" >> and not(left(upcase(name))='ID') >> and left(upcase(type))='CHAR' >> ; >> >> quit ; >> %put num=&num char=&char; >> >> data test; >> set test ; >> array num &num; >> array char &char; >> do over num; >> if num ne . then goto outp; >> end; >> do over char ; >> if char ne ' ' then goto outp; >> end; >> delete; >> return; >> >> outp: >> output; >> run ; > >Here the mistake is that two records will be output when there are >both nonmissing numeric and character variables. To simplify and >correct > > data test; > set test ; > array num &num; > array char &char; > do over num; > if num ne . then return ; > end; > do over char ; > if char ne ' ' then return ; > end; > delete; > run ; > >For a different kind of solution consider > > data dummy ; output ; stop ; set a ; run ; > > proc sql ; > create table c as > select * from a > except all > select * from dummy > ; > quit ; > >As Howard Schreier [Howard_Schreier@ITA.DOC.GOV] pointed out, none >of these solutions consider the presence if the ID variable. He >then offered a solution when ID is character using a reworked >version of Paul's reworked version of Ed Heaton's >[heatone@westat.com] code. There is no mistake, but the solution can >be simplified and the efficiency improved. > >> data nonmissing (drop=nmc); >> set a; >> array num _numeric_; >> array chr(*) _character_; >> array notidpos(100) _temporary_ (100*1); >> length nmc $1; >> if _n_=1 then do i = 1 to dim(chr); >> if upcase(vname(chr(i)))="ID" then notidpos(i) = 0; >> end; >> do i = 1 to dim(chr); >> if notidpos(i) then nmc = compress(nmc || chr(i)); >> end; >> if nmiss(of num(*)) < dim(num) or nmc > '' ; >> run ; > > >There is no need for two of the arrays. There is no need to test >NOTIDPOS(I). I also dropped Ed's concatenation flag for the simpler >test CHR(I) ^= " " so that the loop would stop as soon as possible. > > data nonmissing ( drop = i idpos ) ; > retain idpos ; > set a; > array chr(*) _character_; > > if _n_=1 then do i = 1 to dim(chr); > if upcase(vname(chr(i)))="ID" then idpos = i; > end; > > if n (of _numeric_) > 1 then return ; > do i = 1 to idpos - 1 , idpos + 1 to dim(chr) ; > if chr(i) ^= " " then return ; > end; > delete ; > run ; > >I like the simplicity of > > data nonmissing ( drop = i ) ; > > set a ( drop = id ) ; > array chr(*) _character_; > set a ( keep = id ) ; > > if n (of _numeric_) > 1 then return ; > do i = 1 to dim(chr) ; > if chr(i) ^= " " then return ; > end; > delete ; > run ; > >But it requires two buffers to manage A and would probably loose on >efficiency. > >Ian Whitlock


Back to: Top of message | Previous page | Main SAS-L page