LISTSERV at the University of Georgia
Date:   Wed, 20 Jun 2007 19:00:53 -0700
Reply-To:   David L Cassell <davidlcassell@MSN.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   David L Cassell <davidlcassell@MSN.COM>
Subject:   Re: two-variable deduplication problem
In-Reply-To:   <1182348454.404396.274740@o61g2000hsh.googlegroups.com>
Content-Type:   text/plain; format=flowed

paul.vonhippel@CHASE.COM wrote back:
>To clarify: In the real data, ID2 consists of 8-digit numbers like
>27943969. There are about 6000 distinct values, but the range of
>values is more than 16 million, from 11485631 to 27943969.

I see that you have already received hashing solutions, which is what *I* would recommend. BTW, I had also picked h1 and h2 as my hash names before I saw Ian's and D0's answers. Great minds think alike, I guess. :-)

I just wanted to point out that your range of 11485631 to 27943969 is actually manageable as a temporary array, if you have enough RAM. That range covers about 16.5 million elements; at 8 bytes per numeric element (if you define the array as _TEMPORARY_), that comes to roughly 130 Megs of RAM, which is probably a lot less than what you have on the hardware where you run SAS.
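A minimal sketch of that temporary-array idea, under some loud assumptions: the dataset names HAVE and WANT are hypothetical, ID2 is numeric, never missing, and always falls inside the stated range, and "deduplication" here means keeping only the first row for each ID2. It handles ID2 alone, not the full two-variable problem from the thread.

```sas
/* Hypothetical dataset names: HAVE (input) and WANT (output). */
data want;
  /* One numeric slot per possible ID2 value, indexed directly by ID2.
     _TEMPORARY_ elements start out missing and keep their values
     from one iteration of the data step to the next. */
  array seen[11485631:27943969] _temporary_;
  set have;
  if seen[id2] then delete;   /* ID2 already encountered: drop this row */
  seen[id2] = 1;              /* first occurrence: flag it, keep the row */
run;
```

An ID2 value outside the declared bounds would stop the step with an array subscript out of range error, so the bounds have to cover every value actually present in the data.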

Given the sparseness of this array (about 6000 of its slots would ever be used), I wouldn't recommend it over the hash approach, though...

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
