Date: Fri, 2 Jun 2000 17:46:08 EDT
Reply-To: Bernard Tremblay <imaginasys@HOTMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Bernard Tremblay <imaginasys@HOTMAIL.COM>
Subject: Re: De-Duplication
Content-Type: text/plain; charset=iso-8859-1; format=flowed
Well done Paul,
I have an SQL solution here:
proc sql undo_policy=none noerrorstop;
create table f2 like myfile;
create unique index rnd on f2(rnd);
insert into f2 select * from myfile;
drop index rnd from f2;
quit;
Et voila! The results are in file f2.
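For readers who want to see the trick in isolation: the unique index silently rejects each duplicate-key row while the rest insert in arrival order. A minimal SQLite analogue in Python (table, columns, and sample values are invented for illustration; INSERT OR IGNORE plays the role of UNDO_POLICY=NONE with NOERRORSTOP):

```python
import sqlite3

# In-memory database; MYFILE stands in for the SAS input table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE myfile (rnd INTEGER, rec INTEGER)")
con.executemany("INSERT INTO myfile VALUES (?, ?)",
                [(5, 10), (3, 20), (5, 30), (7, 40), (3, 50)])

# Copy into a table guarded by a unique index on RND. Rows that
# violate the index are skipped one by one; the survivors keep
# their original order.
con.execute("CREATE TABLE f2 (rnd INTEGER, rec INTEGER)")
con.execute("CREATE UNIQUE INDEX rnd_idx ON f2 (rnd)")
con.execute("INSERT OR IGNORE INTO f2 SELECT * FROM myfile ORDER BY rowid")

print(con.execute("SELECT * FROM f2 ORDER BY rowid").fetchall())
# -> [(5, 10), (3, 20), (7, 40)]
```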
I'm sure it is more CPU expensive than the hashing stuff.
But, once you are a good carpenter, all problems tend to look like nails ...
and you know: hammers are so good with nails!
\\ - - //
( @ @ )
| Bernard Tremblay | |
| CSST | Tel: (418) 528-9313 |
| | Fax: (418) 528-1493 |
| | Int: Bernard.Tremblay@csst.qc.ca |
| Imaginasys enr | Res: (418) 658-1411 |
| | Int: firstname.lastname@example.org |
| | Hot: email@example.com |
oooO ( )
( ) ) /
\ ( (_/
>From: Paul Dorfman <paul_dorfman@HOTMAIL.COM>
>Subject: Re: De-Duplication
>Date: Fri, 2 Jun 2000 21:06:16 GMT
>It is amazing how deeply rooted SAS habits are. Once I perceived the stem
>'dup', 'sort' and 'distinct' popped up in my head. Of course you can use
>SORT with NODUPKEY on RND, having, prior to doing that, created a unique
>record ID that will help resurrect the original order. The 'standard' step
>sequence usually looks like
>data v / view=v;
>  set rnd;
>  seq + 1;
>run;
>proc sort data=v out=nodup nodupkey equals;
>  by rnd;
>run;
>proc sort data=nodup out=nodup(drop=seq);
>  by seq;
>run;
>However, for a second, let us try to forget the banal methods we have
>learned using SAS for years and ask: Does this 'standard' SAS logic make
>sense? And the answer is: Absolutely not! From the standpoint of the task at
>hand, we need not rearrange the records at all, for in the output, they must
>be in the same sequence as in the input. Yet we are performing the most
>expensive operation in data processing - sorting - twice, just to be able to
>use NODUPKEY as its by-product. In comparison, the common-sense programming
>logic would be as plain as a brick:
>1. Read the 'next' record.
>2. Look up a table for the key RND associated with the record.
>3. If the key is already there, the record is a duplicate. Go to 1.
>4. Write the record out and store the key in the table. Go to 1.
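Stripped of any SAS specifics, the four steps above are the classic first-occurrence filter over a lookup table; a minimal Python sketch of the same loop (the sample key/record pairs are invented for illustration):

```python
def dedup_by_key(records, key):
    """Keep the first record for each key value, preserving input order."""
    seen = set()              # the lookup table of steps 2-4
    out = []
    for rec in records:       # step 1: read the 'next' record
        k = key(rec)
        if k in seen:         # step 3: key already there -> duplicate
            continue
        seen.add(k)           # step 4: store the key ...
        out.append(rec)       # ... and write the record out
    return out

rows = [(5, 10), (3, 20), (5, 30), (7, 40), (3, 50)]   # (rnd, rec) pairs
print(dedup_by_key(rows, key=lambda r: r[0]))
# -> [(5, 10), (3, 20), (7, 40)]
```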
>In the DATA step, it does not get any simpler, either. Let us have some test
>data:
>data rnd (keep=rnd);
>  do _n_=1 to 1e6;
>    rnd = int(ranuni(1)*1e6) + 1e10;
>    rec = int(ranuni(1)*1e6);
>    output;
>  end;
>run;
>NOTE: The data set WORK.RND has 1000000 observations and 1 variables.
>NOTE: The DATA statement used 3.02 CPU seconds and 6199K.
>%let hs = 2000003;
>data nodup;
>  array r (0:&hs) _temporary_;
>  set rnd;
>  do _n_=mod(rnd,&hs) by -1 until (r(_n_) = .);
>    if _n_ < 0 then _n_ = &hs;
>    if r(_n_) = rnd then delete;
>  end;
>  r(_n_) = rnd;
>run;
>NOTE: The data set WORK.NODUP has 632092 observations and 1 variables.
>NOTE: The DATA statement used 3.12 CPU seconds and 21830K.
>But what is that weird 2000003 doing up above? It is just the first prime
>number greater than 1e6*2. And why the factor 2? Because I am using a hash
>table with linear probing, and it will not perform well if it is less than
>about 50% sparse. However, by adding a single line of code, it can be
>transformed into a table with double hashing, thus cutting down on memory
>consumption, for the double hashing table performs acceptably well even if
>it is 80-90% full:
>%let hs = 1250003;
>data nodup;
>  array r (0:&hs) _temporary_;
>  set rnd;
>  c = 1 + mod(rnd,%eval(&hs-2));
>  do _n_=mod(rnd,&hs) by -c until (r(_n_) = .);
>    if _n_ < 0 then _n_ = _n_ + &hs;
>    if r(_n_) = rnd then delete;
>  end;
>  r(_n_) = rnd;
>run;
>NOTE: The data set WORK.NODUP has 632092 observations and 2 variables.
>NOTE: The DATA statement used 3.74 CPU seconds and 16204K.
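To spell out what the probe loop in the step above is doing, here is a Python sketch of the same open-addressed table with double hashing (table size shrunk to a small prime and keys invented for illustration; the size must be prime so that every probe step C is coprime to it and the probe sequence can visit every slot):

```python
HS = 13                      # table size: a small prime, stand-in for 1250003

def dedup_double_hash(keys):
    """First-occurrence de-dup via open addressing with double hashing."""
    table = [None] * (HS + 1)            # mirrors ARRAY R(0:&HS) _TEMPORARY_
    out = []
    for rnd in keys:
        c = 1 + rnd % (HS - 2)           # second hash: the probe step
        h = rnd % HS                     # first hash: the home slot
        while table[h] is not None:      # probe until a free slot or a match
            if table[h] == rnd:          # key already stored: a duplicate
                break
            h -= c
            if h < 0:                    # wrap around, as _N_ + &HS does
                h += HS
        else:                            # hit an empty slot: a new key
            table[h] = rnd
            out.append(rnd)
    return out

print(dedup_double_hash([105, 3, 105, 42, 3, 99]))
# -> [105, 3, 42, 99]
```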
>How does the 'standard' method posted in the beginning fare in comparison?
>To make the long story short, the three steps run in 9.39 CPU seconds,
>overall. But I did not write this opus to brag about the 5 CPU seconds
>difference. Rather, I wanted to emphasize, and, in the spirit of Friday
>brought by Bernard, amuse fellow SAS-Lers with the funny fact that quite
>often, we resort to stream-of-the-consciousness solutions whilst a
>straightforward stream-of-the-sense solution lies on the surface.
>Paul M. Dorfman
>>Date: Fri, 2 Jun 2000 15:32:06 EDT
>>Could I please ask for help with a problem. I have a SAS table RND with
>>numeric columns RND and REC and 1 million observations. What I need to do
>>is remove duplicates by RND. I know I could use SORT with NODUPKEY. But the
>>problem is that the order of REC must remain the same as in the original
>>dataset, and SORT changes it. In other words, suppose I have in RND:
>>The correct output will be:
>>In actuality RND values are integer like above but may have upwards of 10
>>digits and REC column may have many repeating values itself. Thanks in
>>advance for any input.