LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (April 2002, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Sat, 6 Apr 2002 07:06:22 +0000
Reply-To:     sashole@bellsouth.net
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Paul Dorfman <paul_dorfman@HOTMAIL.COM>
Subject:      Re: a interesting question.
Comments: To: stringplayer_2@YAHOO.COM
Content-Type: text/plain; format=flowed

I tried Paul's programs below and got results which were not consistent with the desired output. However, it seems to me that Paul has framed the general approach to an efficient algorithm, even if his code does not return the desired solution.

Dale,

Paul has just goofed. I assume that you mean the key-indexing variant when you say 'inconsistent'. You cannot say 'inconsistent' about an algorithm as simple as the key-indexing without pointing at the cause. The cause is, I erroneously posted the code that I myself found wrong. Now, it is instructional to see what is wrong with

data z ; array h (0 : 100) _temporary_ ; do p = _n_ to 1 by -1 ; set w point = p nobs = _n_ ; if h (b) = . then output ; h (a) = 1 ; end ; stop ; run ;

Running it clearly shows that the output consists of 1 record, which clearly indicates not that the code is inconsistent but that it is totally out of whack - which is usually much easier to debug. The culprit is _N_, which I initially tried to use in my zeal of avoiding to utilize any more new variables than necessary. Here the parsimony shows its ugly face, and it is clear why. The first reference to _N_, when _N_ still equals 1, occurs before it is populated with the number of observations from the descriptor. Thus the initial value of P is 1, and hence the loop iterates just once. End of story. Replacing _N_ with anything else, for example, NOBS, fixes the problem - which I subsequently did, but failed to replace in the post. And I did not find anything inconsistent with the hashing algorithm, either. Have you? In any case, a second pass through the input is not called for.

Kind regards, =================== Paul M. Dorfman Jacksonville, FL ===================

We can still set up a small table of values of variable A, only we need to keep track of the record which contains the last occurence of a value in column A. We can then loop through the data to see if a value in column B is found in A at a record number further into the data. If no such record is found, then output the current record.

Ya Huang did get an implementation which correctly returned the desired results, however his approach is highly inefficient, as Paul notes. I present a modified algorithm which returns the same results as Ya gets. I have placed my approach and Ya's approach in a macro where we can modify the number of observations which we generate. Invocations of this macro with various different database sizes demonstrates the efficiency of the modified table approach compared with Ya's method.

%macro test(n=);

data w ; do n = 1 to &n ; a = int(ranuni(0)*5000); b = int(ranuni(0)*5000); output ; end ; drop n; run ;

* Ya Huang algorithm ; data z1 (keep=n a b); array aa(&n) _temporary_; array bb(&n) _temporary_; array keep(&n) _temporary_; do i=1 by 1 until(end); set w end=end; aa(i)=a; bb(i)=b; end; do j=1 to i; keep(j)=0; do k=j+1 to i; if bb(j)=aa(k) then do; keep(j)=1; leave; end; end; end; do l=1 to i-1; if keep(l)=0 then do; a=aa(l); b=bb(l); n=l; output; end; end; run;

* Modified Dorfman algorithm; data z2; array h (0:5000) _temporary_; do p=1 to __n; set w nobs=__n; h(a)=p; end; do p=1 to __n-1; set w; if p>=h(b) then output; end; stop; drop i p; run;

* Demonstrate equivalency of Ya's results and modified Dorfman results; proc compare base=z2 compare=z1(drop=n); run; %mend test;

options mprint; %test(n=1000) %test(n=10000) %test(n=100000)

Dale

--- "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM> wrote: > Ya, > > Your second approach is of course much more efficient than first one. > I > would only question the necessity of creating three arrays and making > three > passes through the data (1 disk, 2 memory). If we read the file in > reverse, > the problem boils down to the following single-pass scheme: > > 0. Create an O(1) lookup table. > 1. Read the next record. > 2. Search for the key in the table. > 3. If not found then output. > 4. Insert the key in the table. > 5. Go to 1. > > O(1) means using a direct-address table. In your test data, the keys > are > 100-range integers, which instantly points to "key-indexing" in the > brain-resident lookup table containing lookup methods: > > %let n = 15000 ; > > data w ; > do n = 1 to &n ; > a = int(ranuni(0)*100); > b = int(ranuni(0)*100); > output ; > end ; > run ; > > data z ; > array h (0 : 100) _temporary_ ; > do p = _n_ to 1 by -1 ; > set w point = p nobs = _n_ ; > if h (b) = . then output ; > h (a) = 1 ; > end ; > stop ; > run ; > > Note that the code looks somewhat more concise. Now let us see how > much > faster it executes (real time under Windows NT, SAS 8.1): > > --------------------------------------- > N | 3 arrays | K-X | > ----------+--------------+------------| > 15000 | 0.39 | 0.15 | > 150000 | 3.81 | 0.16 | > 1500000 | 32.25 | 0.20 | > --------------------------------------- > > The method is good for any number of records, as long as the key > range does > not exceed a reasonable number, say, 7 digits (10 million elements in > array > H). Now, what if it does not? Say, > > data w ; > do n = 1 to &n ; > a = 1e13 + int(ranuni(1)*100); > b = 1e13 + int(ranuni(1)*100); > output ; > end ; > run ; > > Then we have to replace the key-indexed array by the old good friend > the > hash table. The null step below is only needed to determine an > optimum hash > modulo divisor. In the rest, it is almost as easy to administer as > the > key-indexing: > > data _null_; > do p = ceil(&n / 0.9) by 1 until (j = u + 1); > u = ceil (sqrt(p)); > do j = 2 to u until (mod(p,j) = 0); > end; > end; > call symput('h', put(p, best.-l)) ; > run; > > data z ; > array h (0 : &h) _temporary_ ; > do p = nobs to 1 by -1 ; > set w point = p nobs = nobs ; > do _n_ = mod(b,&h) by 1 until ( h (_n_) = b or h (_n_) = .) ; > if _n_ > &h then j = 0 ; > end ; > if h (_n_) = b then continue ; > h (_n_) = a ; > output ; > end ; > stop ; > run ; > > The reason I can use the simplest collision resolution policy, i.e. > the > linear probing, and get away with the load factor of 0.9 ( while > normally > linear probing requires 0.5-0.6 for an acceptable performance) is > that by > the very nature of this case, the table stays half-empty until about > 2/3 > through the file, and searching the crowded table is thus limited to > the > keys at the end of the file. Of course, with the hash, the limiting > factor > is not the range of the keys but their number. The hash is just a tad > slower > than the key-indexed table: > > ---------------------------------------------- > N | 3 arrays | K-X | Hash | > ----------+--------------+------------+------| > 15000 | 0.39 | 0.15 | 0.28 | > 150000 | 3.81 | 0.16 | 0.29 | > 1500000 | 32.25 | 0.20 | 0.43 | > ---------------------------------------------- > > I have not tried SQL, but you get the picture :-). > > Kind regards, > ================== > Paul M. Dorfman > Jacksonville, FL > ==================

===== --------------------------------------- Dale McLerran Fred Hutchinson Cancer Research Center mailto: dmclerra@fhcrc.org Ph: (206) 667-2926 Fax: (206) 667-5977 ---------------------------------------

__________________________________________________ Do You Yahoo!? Yahoo! Tax Center - online filing with TurboTax http://taxes.yahoo.com/

_________________________________________________________________ MSN Photos is the easiest way to share and print your photos: http://photos.msn.com/support/worldwide.aspx


Back to: Top of message | Previous page | Main SAS-L page