Date: Fri, 5 Apr 2002 17:27:05 -0800
From: Dale McLerran
List: "SAS(r) Discussion"
Subject: Re: a interesting question.
To: "Dorfman, Paul" <10B0CB7B15058-01@mail.bcbsfl.com>
Content-Type: text/plain; charset=us-ascii

I tried Paul's programs below and got results that were not consistent with the desired output. However, it seems to me that Paul has framed the general approach to an efficient algorithm, even if his code does not return the desired solution. We can still set up a small table of values of variable A; we only need to keep track of the record containing the last occurrence of each value in column A. We can then loop through the data and check whether each record's value of B is found in A at a record number further into the data. If no such record is found, then output the current record.
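For readers who want the logic outside SAS, the two-pass approach just described can be sketched in Python. This is only an illustrative translation (the function name `dedupe` and the list-of-tuples representation are my own); it mirrors the z2 data step: pass 1 records the last record number at which each A value occurs, pass 2 outputs a record unless its B value reappears in A later.

```python
def dedupe(records):
    """Keep record p = (a, b) only if b does not occur in column A
    at any record after p.  As in the SAS steps in this thread,
    the last record is never output."""
    # Pass 1: last record number (1-based) at which each A value occurs.
    last_a = {}
    for p, (a, _b) in enumerate(records, start=1):
        last_a[a] = p
    # Pass 2: output record p unless its B value occurs in A at a
    # later record.  A B value absent from A maps to 0, i.e. "keep".
    out = []
    for p, (a, b) in enumerate(records[:-1], start=1):
        if last_a.get(b, 0) <= p:
            out.append((a, b))
    return out
```

The `last_a.get(b, 0) <= p` test plays the role of the SAS comparison `p >= h(b)`, where a missing array element compares low and therefore keeps the record.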

Ya Huang did produce an implementation which correctly returns the desired results; however, his approach is highly inefficient, as Paul notes. I present a modified algorithm which returns the same results as Ya's. I have placed my approach and Ya's approach in a macro where we can vary the number of observations generated. Invoking this macro with several different data set sizes demonstrates the efficiency of the modified table approach compared with Ya's method.

%macro test(n=);

data w;
  do n = 1 to &n;
    a = int(ranuni(0)*5000);
    b = int(ranuni(0)*5000);
    output;
  end;
  drop n;
run;

* Ya Huang algorithm;
data z1 (keep=n a b);
  array aa(&n) _temporary_;
  array bb(&n) _temporary_;
  array keep(&n) _temporary_;
  do i = 1 by 1 until(end);
    set w end=end;
    aa(i) = a;
    bb(i) = b;
  end;
  do j = 1 to i;
    keep(j) = 0;
    do k = j+1 to i;
      if bb(j) = aa(k) then do;
        keep(j) = 1;
        leave;
      end;
    end;
  end;
  do l = 1 to i-1;
    if keep(l) = 0 then do;
      a = aa(l);
      b = bb(l);
      n = l;
      output;
    end;
  end;
run;
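To make the cost of this version explicit, here is an illustrative Python translation (function name `dedupe_nested` is mine). The nested loops scan every later record for each B value, which is what makes the method quadratic in the number of observations:

```python
def dedupe_nested(records):
    """Direct translation of the nested-loop logic: for each record j,
    scan every later record k for b_j among the A values -- O(n^2)
    comparisons in the worst case."""
    n = len(records)
    keep = [True] * n
    for j in range(n):
        for k in range(j + 1, n):
            if records[j][1] == records[k][0]:
                keep[j] = False   # b_j reappears in A later; drop j
                break
    # As in the SAS step (do l = 1 to i-1), the last record is never output.
    return [records[j] for j in range(n - 1) if keep[j]]
```

For a fixed input, this should select exactly the same records as the last-occurrence-table method; only the work done differs.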

* Modified Dorfman algorithm;
data z2;
  array h (0:5000) _temporary_;
  /* First pass: record the last observation number at which
     each value of A occurs */
  do p = 1 to __n;
    set w nobs=__n;
    h(a) = p;
  end;
  /* Second pass: output a record only if its B value does not
     appear in A at a later record (a missing h(b) compares low,
     so such records are output) */
  do p = 1 to __n-1;
    set w;
    if p >= h(b) then output;
  end;
  stop;
  drop p;
run;

* Demonstrate equivalence of Ya's results and modified Dorfman results;
proc compare base=z2 compare=z1(drop=n);
run;

%mend test;

options mprint;
%test(n=1000)
%test(n=10000)
%test(n=100000)

Dale

--- "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM> wrote:
> Ya,
>
> Your second approach is of course much more efficient than the first
> one. I would only question the necessity of creating three arrays and
> making three passes through the data (1 disk, 2 memory). If we read
> the file in reverse, the problem boils down to the following
> single-pass scheme:
>
> 0. Create an O(1) lookup table.
> 1. Read the next record.
> 2. Search for the key in the table.
> 3. If not found then output.
> 4. Insert the key in the table.
> 5. Go to 1.
>
> O(1) means using a direct-address table. In your test data, the keys
> are 100-range integers, which instantly points to "key-indexing" in
> the brain-resident lookup table containing lookup methods:
>
> %let n = 15000 ;
>
> data w ;
>   do n = 1 to &n ;
>     a = int(ranuni(0)*100);
>     b = int(ranuni(0)*100);
>     output ;
>   end ;
> run ;
>
> data z ;
>   array h (0 : 100) _temporary_ ;
>   do p = _n_ to 1 by -1 ;
>     set w point = p nobs = _n_ ;
>     if h (b) = . then output ;
>     h (a) = 1 ;
>   end ;
>   stop ;
> run ;
>
> Note that the code looks somewhat more concise. Now let us see how
> much faster it executes (real time under Windows NT, SAS 8.1):
>
> ---------------------------------------
>  N        | 3 arrays     | K-X        |
> ----------+--------------+------------|
>    15000  |     0.39     |    0.15    |
>   150000  |     3.81     |    0.16    |
>  1500000  |    32.25     |    0.20    |
> ---------------------------------------
>
> The method is good for any number of records, as long as the key
> range does not exceed a reasonable number, say, 7 digits (10 million
> elements in array H). Now, what if it does? Say,
>
> data w ;
>   do n = 1 to &n ;
>     a = 1e13 + int(ranuni(1)*100);
>     b = 1e13 + int(ranuni(1)*100);
>     output ;
>   end ;
> run ;
>
> Then we have to replace the key-indexed array by the good old friend
> the hash table. The null step below is only needed to determine an
> optimum hash modulo divisor. In the rest, it is almost as easy to
> administer as the key-indexing:
>
> data _null_;
>   do p = ceil(&n / 0.9) by 1 until (j = u + 1);
>     u = ceil (sqrt(p));
>     do j = 2 to u until (mod(p,j) = 0);
>     end;
>   end;
>   call symput('h', put(p, best. -l)) ;
> run;
>
> data z ;
>   array h (0 : &h) _temporary_ ;
>   do p = nobs to 1 by -1 ;
>     set w point = p nobs = nobs ;
>     do _n_ = mod(b,&h) by 1 until ( h (_n_) = b or h (_n_) = .) ;
>       if _n_ > &h then _n_ = 0 ;
>     end ;
>     if h (_n_) = b then continue ;
>     h (_n_) = a ;
>     output ;
>   end ;
>   stop ;
> run ;
>
> The reason I can use the simplest collision resolution policy, i.e.
> linear probing, and get away with a load factor of 0.9 (while
> normally linear probing requires 0.5-0.6 for acceptable performance)
> is that by the very nature of this case, the table stays half-empty
> until about 2/3 through the file, and searching the crowded table is
> thus limited to the keys at the end of the file. Of course, with the
> hash, the limiting factor is not the range of the keys but their
> number. The hash is just a tad slower than the key-indexed table:
>
> ----------------------------------------------
>  N        | 3 arrays     | K-X        | Hash |
> ----------+--------------+------------+------|
>    15000  |     0.39     |    0.15    | 0.28 |
>   150000  |     3.81     |    0.16    | 0.29 |
>  1500000  |    32.25     |    0.20    | 0.43 |
> ----------------------------------------------
>
> I have not tried SQL, but you get the picture :-).
>
> Kind regards,
> ==================
> Paul M. Dorfman
> Jacksonville, FL
> ==================
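[Editor's note: the single-pass reverse-read scheme Paul outlines (steps 0-5) can be sketched in Python as follows. This is an illustrative sketch of the scheme itself, not a translation of the quoted SAS step, which as noted above does not return the desired results; all names here (`next_prime`, `dedupe_reverse`) are my own. A linear-probing open-addressing table stands in for the _temporary_ array, and `next_prime` plays the role of the data _null_ step that finds the modulo divisor. Note that under this scheme the last record is always output, since the table is empty when it is read.]

```python
def next_prime(n):
    """Smallest prime >= n (the job of the data _null_ step)."""
    p = max(n, 2)
    while any(p % j == 0 for j in range(2, int(p ** 0.5) + 1)):
        p += 1
    return p

def dedupe_reverse(records, load=0.9):
    """Single-pass scheme, reading the file in reverse: search the
    table for key b; if absent, output the record; then insert key a.
    Collisions are resolved by linear probing; None marks an empty slot."""
    m = next_prime(max(int(len(records) / load) + 1, 3))
    table = [None] * m
    out = []
    for a, b in reversed(records):
        # Step 2: probe for b starting at its home slot, wrapping around.
        i = b % m
        while table[i] is not None and table[i] != b:
            i = (i + 1) % m
        # Step 3: b not among any later A value -> output.
        if table[i] is None:
            out.append((a, b))
        # Step 4: insert a by its own probe sequence.
        i = a % m
        while table[i] is not None and table[i] != a:
            i = (i + 1) % m
        table[i] = a
    out.reverse()               # restore original record order
    return out
```

Comparing this with the two-pass versions above makes the discrepancy Dale observed easy to see: the reverse-read scheme keeps the final record unconditionally, while the other versions never output it.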

=====
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@fhcrc.org
Ph:  (206) 667-2926
Fax: (206) 667-5977
---------------------------------------

