LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (June 2000, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 2 Jun 2000 17:46:08 EDT
Reply-To:     Bernard Tremblay <imaginasys@HOTMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Bernard Tremblay <imaginasys@HOTMAIL.COM>
Subject:      Re: De-Duplication
Comments: To: sashole@mediaone.net
Content-Type: text/plain; charset=iso-8859-1; format=flowed

Well done Paul,

I'd have an SQL solution here:

proc sql undo_policy=none noerrorstop; create table f2 like myfile; create unique index rnd on f2; insert into f2 select * from myfile; drop index rnd from f2; quit;

Et voila! The results are in file f2.

I'm sure it is more CPU expensive than the hashing stuff. But, once you are a good carpenter, all problems tend to like like nails ... and you know: hammer are so good with nails!

regards,

\\\|/// \\ - - // ( @ @ ) +-----oOOo-(_)-oOOo--+-----------------------------------+ | Bernard Tremblay | | | CSST | Tel: (418) 528-9313 | | | Fax: (418) 528-1493 | | | Int: Bernard.Tremblay@csst.qc.ca | +----------------------------+---------------------------+ | Imaginasys enr | Res: (418) 658-1411 | | | Int: bertrem@quebectel.com | | | Hot: imaginasys@hotmail.com | +--------------Oooo--+-----------------------------------+ oooO ( ) ( ) ) / \ ( (_/ \_) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

>From: Paul Dorfman <paul_dorfman@HOTMAIL.COM> >Reply-To: sashole@mediaone.net >To: SAS-L@LISTSERV.UGA.EDU >Subject: Re: De-Duplication >Date: Fri, 2 Jun 2000 21:06:16 GMT > >Bolvan, > >It is amazing how deply rooted SAS habits are. Once I perceived the stem >'dup', 'sort' and 'distinct' poped up in my head. Of course you can use >SORT >with NODUPKEY, having, prior to doing that, created a unique record ID for >RND that will help resurrect the original order. The 'standard' step >sequence usually looks like > >data v / view=v; > set rnd; > seq ++ 1; >run; >proc sort data=v out=nodup nodupkey equals; > by rnd; >run; >proc sort data=nodup out=nodup(drop=seq); > by seq; >run; > >However, for a second, let us try to forget the banal methods we have >learned using SAS for years and ask: Does this 'standard' SAS logic make >any >sense? And the answer is: Absolutely not! From the standpoint of the task >at >hand, we need not rearrange the records at all, for in the output, they >must >be in the same sequence as in the input. Yet we are performing the most >expensive operation in data processing - sorting - twice, just to be able >to >use NODUPKEY as its by-product. In comparison, the common sense (aka >programming) logic would by as plain as a brick: > >1. Read the 'next' record. >2. Look up a table for the key RND associated with the record. >3. If the key is already there, the record is duplicate. Go to 1. >4. Write the record out and store the key in the table. Go to 1. > >In the DATA step, it does not get any simpler, either. Let us have some >test >data first: > >2 data rnd (keep=rnd); >3 do _n_=1 to 1e6; >4 rnd = int(ranuni(1)*1e6) + 1e10; >5 rec = int(ranuni(1)*1e6); >6 output; >7 end; >8 run; >NOTE: The data set WORK.RND has 1000000 observations and 1 variables. >NOTE: The DATA statement used 3.02 CPU seconds and 6199K. > >And now: > >%let hs = 2000003; >9 data nodup; >10 array r (0:&hs) _temporary_; >11 set rnd; >12 do _n_=mod(rnd,&hs) by -1 until (r(_n_) = .); >13 if _n_ < 0 then _n_ = &hs; >14 if r(_n_) = rnd then delete; >15 end; >16 r(_n_) = rnd; >17 run; >NOTE: The data set WORK.NODUP has 632092 observations and 1 variables. >NOTE: The DATA statement used 3.12 CPU seconds and 21830K. > >But what is that weird 2000003 doing up above? It is just the first prime >number greater than 1e6*2. And why the factor 2? Because I am using a hash >table with linear probing, and it will not perform well if is less than >about 50% sparse. However, by adding a single line of code, it can be >transformed into a table with double hashing, thus cutting down on memory >consumption, for the double hashing table performs acceptably well even if >it is 80-90% full: > >%let hs = 1250003; >data nodup; > array r (0:&hs) _temporary_; > set rnd; > c = 1 + mod(rnd,%eval(&hs-2)); > do _n_=mod(rnd,&hs) by -c until (r(_n_) = .); > if _n_ < 0 then _n_ ++ &hs; > if r(_n_) = rnd then delete; > end; > r(_n_) = rnd; >run; >NOTE: The data set WORK.NODUP has 632092 observations and 2 variables. >NOTE: The DATA statement used 3.74 CPU seconds and 16204K. > >How does the 'standard' method posted in the beginning fare in comparison? >To make the long story short, the three steps run in 9.39 CPU seconds, >overall. But I did not write this opus to brag about the 5 CPU seconds >difference. Rather, I wanted to emphasize, and, in the spirit of Friday >brought by Bernard, amuse fellow SAS-Lers with the funny fact that quite >often, we resort to stream-of-the-consciosness solutions whilst a >straightforward stream-of-the-sense solution lies on the surface. > >Kind regards, >=================== >Paul M. Dorfman >Jacksonville, Fl >=================== > > > > > > > >>From: bolvandubina@NETSCAPE.NET >>Reply-To: bolvandubina@NETSCAPE.NET >>To: SAS-L@LISTSERV.UGA.EDU >>Subject: De-Duplication >>Date: Fri, 2 Jun 2000 15:32:06 EDT >> >>Hello: >> >>Could I please ask for a help with a problem. I have a SAS table RND with >>2 >>numeric columns RND and REC and 1 million observations. What I need to do >>is remove duplicates by RND. I know I could use SORT with NODUPKEY. But >>the >>problem is that the order of REC must remain the same as in the original >>dataset, and SORT changes it. In other words, suppose I have in RND: >> >>RND REC >>3 1 >>2 1 >>3 2 >>4 1 >>1 1 >>3 3 >>4 2 >>1 2 >>4 3 >>2 2 >> >>The correct output will be: >> >>RND REC >>3 1 >>2 1 >>4 1 >>1 1 >> >>In actuality RND values are integer like above but may have upwards of 10 >>digits and REC column may have many repeating values itself. Thanks in >>advance for any input. >> >>Bolvan >> >>---------- >>Get your own FREE, personal Netscape Webmail account today at >>http://home.netscape.com/webmail/ > >________________________________________________________________________ >Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com

________________________________________________________________________ Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com


Back to: Top of message | Previous page | Main SAS-L page