Date: Tue, 9 Oct 2001 12:28:45 -0700
Reply-To: "Karsten M. Self" <kmself@IX.NETCOM.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Karsten M. Self" <kmself@IX.NETCOM.COM>
Subject: Re: grabbing the unique variable from a data set
In-Reply-To: <200110091057.f99AvaK196108@listserv.cc.uga.edu>; from
ghellrieg@T-ONLINE.DE on Tue, Oct 09, 2001 at 06:57:36AM -0400
on Tue, Oct 09, 2001 at 06:57:36AM -0400, Gerhard Hellriegel (ghellrieg@T-ONLINE.DE) wrote:
> On Tue, 9 Oct 2001 10:19:30 +0100, Peter Crawford <peter.crawford@DB.COM>
> wrote:
>
> > if data volumes are small enough to allow sorting the data, use the
> > option of
> > proc sort NODUPKEY; run;
> > proc sort data=bobs.dataset out=uniques NOdupKey ;
> > by that_variable;
> > run;
> >
> > good luck
> > Peter Crawford
> >
> Hi Peter,
> your answer is what I suggested too, but what I'm wondering
> about: what would you do if the dataset is huge? Is there another idea
> for doing that? Are you thinking of something like an index and testing
> that for duplicates?
I'm not sure that an index will return distinct values. It will aid in
retrieving specified values, at the cost of access overhead if the output
isn't restricted to, typically, less than 10-20% of total records.
You've basically got two options:
- Order the data (or use ordered data) and compare adjacent tuples
for equality.
- Utilize a mapping function which outputs a single record for each
distinct value, however many input records carry that value.
The former calls for sorting, though the clever reader may utilize
existing patterns of sorted or repetitive values with a NOTSORTED key to
reduce the incoming data load, adding KEEP or DROP directives to reduce
the size of the PDV, e.g.:
data unsort;
  set input( keep= <key vars>);
  by <key vars> notsorted; /* NOTSORTED follows the BY variable list */
  if first.<last key var>; /* first record of each run of identical keys */
run;
proc sort data=unsort nodupkey;
  by <key vars>;
run;
The latter would call for some form of hashing algorithm. Paul
Dorfman's written about this on the list and in SUGI papers in the past.
Rick Aster's _Professional SAS Programming Secrets_ shows an on-disk
hashing algorithm IIRC. As I've posted here recently, hash tables of
upwards of 40 million elements may be created in 4-5 GB of real memory,
certainly attainable on current generations of servers, and even some
desktops.
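For what it's worth, a rough sketch of the hashing route, assuming a SAS release that ships the DATA step hash object (SAS 9 and later; it is not the hand-rolled array hashing from the papers mentioned above). The names bobs.dataset, that_variable and uniques are simply borrowed from Peter's example, and the whole key set is assumed to fit in memory:

```sas
data _null_;
  /* Never executed at run time; it only puts that_variable into the PDV */
  if 0 then set bobs.dataset(keep=that_variable);

  /* Load the table into an in-memory hash keyed on that_variable.
     By default the first record for each key is stored and later
     duplicates are ignored, so the table ends up with distinct keys. */
  declare hash h(dataset: 'bobs.dataset');
  h.defineKey('that_variable');
  h.defineData('that_variable');
  h.defineDone();

  /* Write one record per distinct key value */
  h.output(dataset: 'work.uniques');
  stop;
run;
```

No sort is involved, so the output order is not guaranteed; if memory runs short, the sort-based approach is the fallback.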
Peace.
--
Karsten M. Self <kmself@ix.netcom.com> http://kmself.home.netcom.com/
What part of "Gestalt" don't you understand? Home of the brave
http://gestalt-system.sourceforge.net/ Land of the free
Free Dmitry! Boycott Adobe! Repeal the DMCA! http://www.freesklyarov.org
Geek for Hire http://kmself.home.netcom.com/resume.html