|Date: ||Wed, 20 Jun 2007 19:00:53 -0700|
|Reply-To: ||David L Cassell <davidlcassell@MSN.COM>|
|Sender: ||"SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>|
|From: ||David L Cassell <davidlcassell@MSN.COM>|
|Subject: ||Re: two-variable deduplication problem|
|Content-Type: ||text/plain; format=flowed|
paul.vonhippel@CHASE.COM wrote back:
>To clarify: In the real data, ID2 consists of 8-digit numbers like
>27943969. There are about 6000 distinct values, but the range of
>values is more than 16 million, from 11485631 to 27943969.
I see that you have already received hashing solutions, which is
what *I* would recommend. BTW, I picked out hash names of
h1 and h2 also, before I saw Ian and D0's answers. Great minds
think alike, I guess. :-)
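For anyone reading along who has not used the DATA step hash object, here is a rough sketch of the general idea, not a copy of the solutions already posted. The dataset names HAVE and WANT, the variable name ID1, and the rule "keep a row only if neither of its IDs has been seen before" are all assumptions made for illustration; the real deduplication rule may well differ.

```sas
/* Sketch only: HAVE, WANT, ID1, and the keep-first rule are assumed. */
data want;
   if _n_ = 1 then do;
      /* h1 remembers ID1 values already kept, h2 does the same for ID2. */
      /* With no DEFINEDATA call, the key doubles as the data item.      */
      declare hash h1();
      h1.defineKey('id1');
      h1.defineDone();
      declare hash h2();
      h2.defineKey('id2');
      h2.defineDone();
   end;
   set have;
   /* CHECK() returns 0 when the key is already in the table. */
   if h1.check() ne 0 and h2.check() ne 0 then do;
      h1.add();
      h2.add();
      output;
   end;
run;
```

Memory use here grows with the number of distinct keys actually seen (about 6000 for ID2), not with the width of the ID range.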
I just wanted to point out that your range of 11485631 to 27943969
is actually manageable as a temporary array, if you have enough RAM.
16M elements, times 8 bytes (if you define the array as _TEMPORARY_)
is 128 Megs of RAM, which is probably a lot less than what you have on
the hardware where you run SAS.
Given the sparseness of this array, I wouldn't recommend it over the
hashing solutions.
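A minimal sketch of the temporary-array idea, for comparison. The dataset names HAVE and WANT and the "keep the first row per ID2" rule are assumptions for illustration only. The array bounds are the ID2 range quoted above, which works out to 16,458,339 elements, or roughly 132 million bytes at 8 bytes each.

```sas
/* Sketch only: HAVE, WANT, and the keep-first rule are assumed. */
data want;
   /* One numeric slot per possible ID2 value, indexed directly by ID2. */
   /* _TEMPORARY_ elements are retained across observations and start   */
   /* out missing, so a missing slot means "not seen yet".              */
   array seen {11485631:27943969} _temporary_;
   set have;
   if seen{id2} = . then do;
      seen{id2} = 1;   /* mark this ID2 as seen */
      output;          /* keep only its first occurrence */
   end;
run;
```

This is key-indexing: the lookup is a direct array subscript, so it is fast, but only about 6000 of the 16 million slots are ever used. An ID2 outside the declared bounds would raise a subscript-out-of-range error, so the bounds have to cover the real data.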
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330