Date: Sat, 6 Apr 2002 07:06:22 +0000
Reply-To: sashole@bellsouth.net
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul Dorfman <paul_dorfman@HOTMAIL.COM>
Subject: Re: a interesting question.
Content-Type: text/plain; format=flowed
I tried Paul's programs below and got results which were not consistent
with the desired output. However, it seems to me that Paul has framed
the general approach to an efficient algorithm, even if his code does
not return the desired solution.
Dale,
Paul has just goofed. I assume that you mean the key-indexing variant when
you say 'inconsistent'. You cannot say 'inconsistent' about an algorithm as
simple as the key-indexing without pointing at the cause. The cause is, I
erroneously posted the code that I myself found wrong.
Now, it is instructional to see what is wrong with
data z ;
array h (0 : 100) _temporary_ ;
do p = _n_ to 1 by -1 ;
set w point = p nobs = _n_ ;
if h (b) = . then output ;
h (a) = 1 ;
end ;
stop ;
run ;
Running it clearly shows that the output consists of 1 record, which clearly
indicates not that the code is inconsistent but that it is totally out of
whack - which is usually much easier to debug. The culprit is _N_, which I
initially tried to use in my zeal of avoiding to utilize any more new
variables than necessary. Here the parsimony shows its ugly face, and it is
clear why. The first reference to _N_, when _N_ still equals 1, occurs
before it is populated with the number of observations from the descriptor.
Thus the initial value of P is 1, and hence the loop iterates just once. End
of story. Replacing _N_ with anything else, for example, NOBS, fixes the
problem - which I subsequently did, but failed to replace in the post. And I
did not find anything inconsistent with the hashing algorithm, either. Have
you? In any case, a second pass through the input is not called for.
Kind regards,
===================
Paul M. Dorfman
Jacksonville, FL
===================
We can still set up a small table of
values of variable A, only we need to keep track of the record which
contains the last occurence of a value in column A. We can then loop
through the data to see if a value in column B is found in A at a
record number further into the data. If no such record is found,
then output the current record.
Ya Huang did get an implementation which correctly returned the
desired results, however his approach is highly inefficient, as Paul
notes. I present a modified algorithm which returns the same results
as Ya gets. I have placed my approach and Ya's approach in a macro
where we can modify the number of observations which we generate.
Invocations of this macro with various different database sizes
demonstrates the efficiency of the modified table approach compared
with Ya's method.
%macro test(n=);
data w ;
do n = 1 to &n ;
a = int(ranuni(0)*5000);
b = int(ranuni(0)*5000);
output ;
end ;
drop n;
run ;
* Ya Huang algorithm ;
data z1 (keep=n a b);
array aa(&n) _temporary_;
array bb(&n) _temporary_;
array keep(&n) _temporary_;
do i=1 by 1 until(end);
set w end=end;
aa(i)=a;
bb(i)=b;
end;
do j=1 to i;
keep(j)=0;
do k=j+1 to i;
if bb(j)=aa(k) then do; keep(j)=1; leave; end;
end;
end;
do l=1 to i-1;
if keep(l)=0 then do;
a=aa(l);
b=bb(l);
n=l;
output;
end;
end;
run;
* Modified Dorfman algorithm;
data z2;
array h (0:5000) _temporary_;
do p=1 to __n;
set w nobs=__n;
h(a)=p;
end;
do p=1 to __n-1;
set w;
if p>=h(b) then output;
end;
stop;
drop i p;
run;
* Demonstrate equivalency of Ya's results and modified Dorfman results;
proc compare base=z2 compare=z1(drop=n);
run;
%mend test;
options mprint;
%test(n=1000)
%test(n=10000)
%test(n=100000)
Dale
--- "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM> wrote:
> Ya,
>
> Your second approach is of course much more efficient than first one.
> I
> would only question the necessity of creating three arrays and making
> three
> passes through the data (1 disk, 2 memory). If we read the file in
> reverse,
> the problem boils down to the following single-pass scheme:
>
> 0. Create an O(1) lookup table.
> 1. Read the next record.
> 2. Search for the key in the table.
> 3. If not found then output.
> 4. Insert the key in the table.
> 5. Go to 1.
>
> O(1) means using a direct-address table. In your test data, the keys
> are
> 100-range integers, which instantly points to "key-indexing" in the
> brain-resident lookup table containing lookup methods:
>
> %let n = 15000 ;
>
> data w ;
> do n = 1 to &n ;
> a = int(ranuni(0)*100);
> b = int(ranuni(0)*100);
> output ;
> end ;
> run ;
>
> data z ;
> array h (0 : 100) _temporary_ ;
> do p = _n_ to 1 by -1 ;
> set w point = p nobs = _n_ ;
> if h (b) = . then output ;
> h (a) = 1 ;
> end ;
> stop ;
> run ;
>
> Note that the code looks somewhat more concise. Now let us see how
> much
> faster it executes (real time under Windows NT, SAS 8.1):
>
> ---------------------------------------
> N | 3 arrays | K-X |
> ----------+--------------+------------|
> 15000 | 0.39 | 0.15 |
> 150000 | 3.81 | 0.16 |
> 1500000 | 32.25 | 0.20 |
> ---------------------------------------
>
> The method is good for any number of records, as long as the key
> range does
> not exceed a reasonable number, say, 7 digits (10 million elements in
> array
> H). Now, what if it does not? Say,
>
> data w ;
> do n = 1 to &n ;
> a = 1e13 + int(ranuni(1)*100);
> b = 1e13 + int(ranuni(1)*100);
> output ;
> end ;
> run ;
>
> Then we have to replace the key-indexed array by the old good friend
> the
> hash table. The null step below is only needed to determine an
> optimum hash
> modulo divisor. In the rest, it is almost as easy to administer as
> the
> key-indexing:
>
> data _null_;
> do p = ceil(&n / 0.9) by 1 until (j = u + 1);
> u = ceil (sqrt(p));
> do j = 2 to u until (mod(p,j) = 0);
> end;
> end;
> call symput('h', put(p, best.-l)) ;
> run;
>
> data z ;
> array h (0 : &h) _temporary_ ;
> do p = nobs to 1 by -1 ;
> set w point = p nobs = nobs ;
> do _n_ = mod(b,&h) by 1 until ( h (_n_) = b or h (_n_) = .) ;
> if _n_ > &h then j = 0 ;
> end ;
> if h (_n_) = b then continue ;
> h (_n_) = a ;
> output ;
> end ;
> stop ;
> run ;
>
> The reason I can use the simplest collision resolution policy, i.e.
> the
> linear probing, and get away with the load factor of 0.9 ( while
> normally
> linear probing requires 0.5-0.6 for an acceptable performance) is
> that by
> the very nature of this case, the table stays half-empty until about
> 2/3
> through the file, and searching the crowded table is thus limited to
> the
> keys at the end of the file. Of course, with the hash, the limiting
> factor
> is not the range of the keys but their number. The hash is just a tad
> slower
> than the key-indexed table:
>
> ----------------------------------------------
> N | 3 arrays | K-X | Hash |
> ----------+--------------+------------+------|
> 15000 | 0.39 | 0.15 | 0.28 |
> 150000 | 3.81 | 0.16 | 0.29 |
> 1500000 | 32.25 | 0.20 | 0.43 |
> ----------------------------------------------
>
> I have not tried SQL, but you get the picture :-).
>
> Kind regards,
> ==================
> Paul M. Dorfman
> Jacksonville, FL
> ==================
=====
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@fhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
__________________________________________________
Do You Yahoo!?
Yahoo! Tax Center - online filing with TurboTax
http://taxes.yahoo.com/
_________________________________________________________________
MSN Photos is the easiest way to share and print your photos:
http://photos.msn.com/support/worldwide.aspx