| Date: | Wed, 20 Jun 2007 19:00:53 -0700 |
| Reply-To: | David L Cassell <davidlcassell@MSN.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | David L Cassell <davidlcassell@MSN.COM> |
| Subject: | Re: two-variable deduplication problem |
| In-Reply-To: | <1182348454.404396.274740@o61g2000hsh.googlegroups.com> |
| Content-Type: | text/plain; format=flowed |
|---|
paul.vonhippel@CHASE.COM wrote back:
>To clarify: In the real data, ID2 consists of 8-digit numbers like
>27943969. There are about 6000 distinct values, but the range of
>values is more than 16 million, from 11485631 to 27943969.
I see that you have already received hashing solutions, which is
what *I* would recommend. BTW, I picked out hash names of
h1 and h2 also, before I saw Ian and D0's answers. Great minds
think alike, I guess. :-)
I just wanted to point out that your range of 11485631 to 27943969
is actually manageable as a temporary array, if you have enough RAM.
That is about 16.5 million elements, times 8 bytes apiece (for a numeric
array defined as _TEMPORARY_), or roughly 128 Megs of RAM, which is
probably a lot less than what you have on the hardware where you run SAS.
Given the sparseness of this array (only about 6000 of those slots would
ever be used), I wouldn't recommend it over the hash, though...
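
For what it's worth, a minimal sketch of the temporary-array idea might look like the step below. The data set names HAVE and WANT are hypothetical, and the step only keeps the first record seen for each ID2 value, which may not be exactly the deduplication rule you need. The array bounds are the minimum and maximum from your note, which works out to 16,458,339 slots at 8 bytes each (about 132 million bytes).

```sas
/* Sketch only: HAVE and WANT are hypothetical data set names. */
data want;
   /* One numeric slot per possible ID2 value. The bounds are the   */
   /* smallest and largest ID2 values quoted above. Temporary array */
   /* elements start out missing and are retained across rows.      */
   array seen[11485631:27943969] _temporary_;
   set have;
   if seen[id2] = . then do;   /* first time this ID2 shows up */
      seen[id2] = 1;           /* mark it as seen              */
      output;                  /* keep only that first record  */
   end;
run;
```

Note that an ID2 value outside those bounds would stop the step with an array-subscript-out-of-range error, so the bounds have to cover the real data.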
HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330