LISTSERV at the University of Georgia
Date:         Wed, 20 Jun 2007 17:02:03 -0400
Reply-To:     "data _null_;" <datanull@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "data _null_;" <datanull@GMAIL.COM>
Subject:      Re: two-variable deduplication problem
Comments: To: Paul Dorfman <sashole@bellsouth.net>
In-Reply-To:  <200706202027.l5KHMfTV005578@mailgw.cc.uga.edu>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

For a hash object with no "data" component, would not the function of CHECK() equal the function of FIND()?

Both must find the key, after all. Perhaps there is some overhead associated with FIND() that is not associated with CHECK().
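One way to look at the question is to call both methods against the same key-only table and compare the return codes. The step below is only a minimal sketch and is not taken from the thread: the data set KEYS, the hash name H, and the variables RC_CHECK and RC_FIND are hypothetical names made up for the illustration. It can show whether the two calls agree on found versus not found; it says nothing about any difference in overhead.

```sas
/* Hypothetical two-row key file, made up for this illustration. */
data keys ;
  input id1 $1. ;
  cards ;
1
2
;
run ;

data _null_ ;
  /* Never executes; only lets the compiler put ID1 into the PDV */
  /* with the type and length it has in KEYS.                    */
  if 0 then set keys ;

  /* Key-only table: DEFINEKEY without DEFINEDATA, as in Paul's code. */
  dcl hash h (dataset: 'keys') ;
  h.definekey ('id1') ;
  h.definedone () ;

  /* Probe one key that was loaded ('1') and one that was not ('9'). */
  do id1 = '1', '9' ;
    rc_check = h.check () ;
    rc_find  = h.find () ;
    put id1= rc_check= rc_find= ;
  end ;
  stop ;
run ;
```

If the two methods behave alike for a key-only table, the log should show both return codes as zero for the key that is present and both as non-zero for the one that is absent.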

On 6/20/07, Paul Dorfman <sashole@bellsouth.net> wrote:
> Ian,
>
> Your code, of course, gets the job properly done, but if I may chime in
> here, just a couple of (hopefully, germane) notes.
>
> First, storing the data part in a hash makes it work harder than necessary
> when all we need to store is the keys.
>
> Second, the FIND() method is specifically designed to overwrite the data
> portion in PDV in the case of match, whilst here we only need to check the
> existence of the key value in the table, hence the more economical (and
> guaranteed to not touch the PDV) CHECK() method should suffice.
>
> Third, while the intent of declaring ID1 and ID2 in the LENGTH statement
> is noble (prepare PDV for hash), it is really superfluous here because the
> compiler does it anyway when it hits the SET W statement. In fact, relying
> on SET makes the program automatically correct irrespective of the types
> of ID1 and ID2. Not only is coding LENGTH here tautological, but it also
> has the hidden danger of fostering data type and/or length conflicts. To
> wit, if the ID1 were $4 and the dollar sign were accidentally omitted in
> the LENGTH statement, the step would crash, for then ID1 would have been
> declared both numeric and character.
>
> In this respect, my rule of defensive hash programming is simple: abstain
> from using the LENGTH statement (retain, attrib, format, informat, array,
> etc.) under any circumstances. It either must be entirely omitted (as in
> the case above) or practically always can be replaced with IF 0 THEN SET
> statement (if necessary, using KEEP/DROP to prevent PDV pollution) because
> in real life situations, hash keys and data always come from a file.
>
> Fourth - also from the defensive programming standpoint, I would avoid
> using method calls as Booleans for the simple reason that SAS only
> guarantees that if a method fails, it does not return a zero. I do not
> think it can ever return a missing value, but there is no documented
> guarantee. Also, psychologically, the phrase
>
>    if h.check()
>
> implies "if true", whilst the question being asked is "if not found". For
> the stated reasons, methinks that here checking for exact [in]equality is
> more proper, although I love Boolean expressions and elsewhere use them
> extensively.
>
> The code could be shrunk to, say:
>
> data q ;
>    dcl hash h1 () ;
>    h1.definekey ('id1') ;
>    h1.definedone () ;
>    dcl hash h2 () ;
>    h2.definekey ('id2') ;
>    h2.definedone () ;
>
>    do until (0) ;
>       set w ;
>       if h1.check() = 0 | h2.check() = 0 then continue ;
>       h1.add() ;
>       h2.add() ;
>       output ;
>    end ;
> run ;
>
> Kind regards
> ------------
> Paul Dorfman
> Jax, FL
> ------------
>
>
> On Wed, 20 Jun 2007 14:48:40 +0000, Ian Whitlock <iw1junk@COMCAST.NET>
> wrote:
>
> >Summary: Hash solution to problem
> >#iw-value=1
> >
> >I think the following is a fair statement of the problem:
> >
> >   You want to match ID's from two sources and have created a dataset
> >   equivalent to W ordered by priority of choice. Once A has been matched
> >   with 1, neither A nor 1 should enter into another match. How should the
> >   set of highest priority matches be selected?
> >
> >Had you stated the problem this way you might have gotten good code
> >faster. If the title had been "matching problem" or "breaking chains by
> >highest priority" this might also have attracted better interest.
> >
> >   data w ;
> >      length id1 id2 $1 ;
> >      input id1 $ Id2 $ ;
> >      seq + 1 ; ** if order important then add code to save it ;
> >   cards ;
> >   1 a
> >   1 b
> >   2 b
> >   2 a
> >   ;
> >
> >   data q ( drop = rc ) ;
> >      length id1 id2 $1 ;
> >      if _n_ = 1 then
> >      do ;
> >         declare hash h1() ;
> >         rc = h1.defineKey('id1');
> >         rc = h1.defineData('id1');
> >         rc = h1.defineDone();
> >         declare hash h2() ;
> >         rc = h2.defineKey('id2');
> >         rc = h2.defineData('id2');
> >         rc = h2.defineDone();
> >      end ;
> >
> >      set w ;
> >      if h1.find() and h2.find() then
> >      do ;
> >         h1.add() ;
> >         h2.add() ;
> >         output ;
> >      end ;
> >   run ;
> >
> >If you feel better about an array solution, then formats or
> >informats with the PUT and INPUT functions can be used to
> >calculate the index into the two arrays. The structure of
> >the code would still be the same.
> >
> >Ian Whitlock
>

