LISTSERV at the University of Georgia
Date:   Tue, 9 Oct 2001 12:28:45 -0700
Reply-To:   "Karsten M. Self" <kmself@IX.NETCOM.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Karsten M. Self" <kmself@IX.NETCOM.COM>
Subject:   Re: grabbing the unique variable from a data set
In-Reply-To:   <200110091057.f99AvaK196108@listserv.cc.uga.edu>; from ghellrieg@T-ONLINE.DE on Tue, Oct 09, 2001 at 06:57:36AM -0400
Content-Type:   multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature";

on Tue, Oct 09, 2001 at 06:57:36AM -0400, Gerhard Hellriegel (ghellrieg@T-ONLINE.DE) wrote:
> On Tue, 9 Oct 2001 10:19:30 +0100, Peter Crawford <peter.crawford@DB.COM>
> wrote:
>
> > if data volumes are small enough to allow sorting the data, use the
> > option of
> > proc sort NODUPKEY; run;
> > proc sort data=bobs.dataset out=uniques NOdupKey ;
> > by that_variable;
> > run;
> >
> > good luck
> > Peter Crawford
> >

> Hi Peter,
> your answer is that what I' suggested too, but what I'm wondering
> about: what would you do if the dataset is huge? Is there another idea
> for doing that? Do you think of something like an index and testing
> that for duplicates?

I'm not sure that an index will return distinct values. It will aid in retrieving specified values, at the cost of access overhead unless the output is restricted to, typically, less than 10-20% of total records.

You've basically got two options:

- Order the data (or use ordered data) and compare proximate tuples for similarity.

- Utilize a mapping function which outputs a single record for each distinct value among the input records.

The former calls for sorting, though the clever reader may exploit existing runs of sorted or repeated values with a NOTSORTED key to reduce the incoming data load, adding KEEP or DROP directives to reduce the size of the PDV, e.g.:

data unsort;
   set input( keep= <key vars> );
   by <key vars> notsorted;    /* NOTSORTED follows the variable list */
   if first.<last key var>;    /* keep the first record of each run of equal keys */
run;

proc sort data=unsort nodupkey;
   by <key vars>;
run;

The latter would call for some form of hashing algorithm. Paul Dorfman has written about this on the list and in SUGI papers in the past. Rick Aster's _Professional SAS Programming Secrets_ shows an on-disk hashing algorithm, IIRC. As I've posted here recently, hash tables of upwards of 40 million elements may be created in 4-5 GB of real memory, certainly attainable on current generations of servers, and even some desktops.
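
For illustration only, here is a minimal sketch of the hashing option. It is not the hand-coded array hashing from Paul Dorfman's papers or Rick Aster's book; it assumes a later SAS release that provides the DATA step hash object (SAS 9 or newer), and it assumes the distinct keys fit in memory. The data set name `input` follows the example above, and the single key variable `key` is a hypothetical stand-in for `<key vars>`.

```sas
/* Sketch only: assumes the DATA step hash object (SAS 9 or later). */
/* INPUT and KEY are placeholder names, not from the original post. */
data uniques;
   if _n_ = 1 then do;
      declare hash seen();        /* in-memory table of keys already seen */
      seen.defineKey('key');
      seen.defineData('key');
      seen.defineDone();
   end;
   set input( keep= key );
   if seen.check() ne 0 then do;  /* nonzero return: key not yet in the table */
      seen.add();                 /* remember it */
      output;                     /* write the first record for this key */
   end;
run;
```

This makes a single pass with no sort, and the first occurrence of each key comes out in input order. Memory use grows with the number of distinct keys, not with the number of input records.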

Peace.

-- 
Karsten M. Self <kmself@ix.netcom.com>       http://kmself.home.netcom.com/
 What part of "Gestalt" don't you understand?             Home of the brave
  http://gestalt-system.sourceforge.net/                   Land of the free
   Free Dmitry! Boycott Adobe! Repeal the DMCA! http://www.freesklyarov.org
Geek for Hire                     http://kmself.home.netcom.com/resume.html

