Date: Tue, 12 Jan 1999 16:18:42 -0500
Reply-To: "Paul M. Dorfman" <sashole@EARTHLINK.NET>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "Paul M. Dorfman" <sashole@EARTHLINK.NET>
Organization: None
Subject: Re: Sorting a big array
Content-Type: text/plain; charset=koi8-r
Alexander Martchenko <amartchenko@NETSCAPE.NET>, in part, wrote:
>I have a large (near 500,000) _temporary_ array that gets populated with
>10-digit positive integer account numbers somewhere in the middle of a DATA
>step. I'd like to have it sorted to do something else in the same step.
>Something like
>
>DATA ...;
> ARRAY R(500000) _TEMPORARY_;
> ...some logic...
> ...populate array R based on that logic...
> !!! here I need to sort R ascending...
> ...do something with sorted R...
> ...it's no longer sorted at this point...
> !!! here I need to sort R descending...
> ...do something with sorted R...
>RUN;
>
>I know I could write R and the rest of the stuff to disk, use PROC SORT, read
>everything back, process, do all that again for descending order, etc. It will
>work but feels clumsy both logically and from the standpoint of i/o. I don't
>recall any SAS routine to sort an array but remember some sas-l discussions on
>the topic awile ago. So before I start that i/o exercise, I wonder if anyone
>has any ideas how to sort an array such as R very efficiently (i.e. on par
>with PROC SORT) _without_ leaving the DATA step.
>I'll very much appreciate any suggestions, thoughts, code samples, etc.
Alex,
It can be done in a number of ways, but in your particular situation, when
you are dealing with natural keys, I would opt for
least-significant-digit-first (LSD-first) radix sort rather than quicksort.
Before trying to beat PROC SORT, let us first establish its benchmark by
simulating your "i/o exercise":
139 DATA RDISK (KEEP=RDISK);
140 ARRAY R(500000) _TEMPORARY_;
141 DO I=1 TO DIM(R);
142 R(I) = CEIL(RANUNI(1)*1E10);
143 END;
144 T = TIME();
145 DO I=1 TO DIM(R);
146 RDISK = R(I);
147 OUTPUT;
148 END;
149 T = ROUND(TIME() - T,.001);
150 PUT 'ARRAY UNLOAD TIME ' T 'SECONDS.';
151 RUN;
ARRAY UNLOAD TIME 0.87 SECONDS.
NOTE: The data set WORK.RDISK has 500000 observations and 1 variables.
NOTE: The DATA statement used 3.08 seconds.
152
153 PROC SORT; BY RDISK; RUN;
NOTE: The data set WORK.RDISK has 500000 observations and 1 variables.
NOTE: The PROCEDURE SORT used 10.22 seconds.
154
155 DATA _NULL_;
156 ARRAY R(500000) _TEMPORARY_;
157 DO I=1 TO DIM(R);
158 SET RDISK;
159 R(I) = RDISK;
160 END;
161 RUN;
NOTE: The DATA statement used 1.26 seconds.
So, it takes 12.35 seconds, in all, on NT/233/64M. Now, here is the basic
idea of LSD-first radix sort (it is kind of counterintuitive): Form a
partial key by extracting the _rightmost_ 5 digits of each 10-digit item
and use distribution counting to sort the array moving whole array elements
accordingly to another array S(500000) _TEMPORARY_. Then do the same using
_leftmost_ 5 digits as a key moving items back to R. Why use distribution
counting to do partial sorts? Because it is 1) the fastest known sorting
method requiring no key comparisons; 2) it is stable, i.e., it retains the
relative order of items being sorted. The latter property is critical here
- otherwise LSD-first sort will not work. And why use 5-digit partial keys?
Because this way, the entire sort will be completed in two passes only, and
one (or, in the implementation below, two) array holding 100,000 items will
be required for digit counts - which should not pose any memory problems.
Code:
162
163 DATA _NULL_;
164 ARRAY R (500000) _TEMPORARY_; *to be sorted;
165 ARRAY S (500000) _TEMPORARY_; *to hold LSD-sorted items
after 1st pass;
166 ARRAY CR(0:99999) _TEMPORARY_; *to hold 1st pass
counts/running sums;
167 ARRAY CS(0:99999) _TEMPORARY_; *to hold 2nd pass counts
while filling S;
168
169 DO I=1 TO DIM(R);
170 R(I) = CEIL(RANUNI(1)*1E10); *populate R with random
10-digit keys;
171 END;
172
173 T = TIME(); *clock on;
174
175 DO I=1 TO DIM(R); *Populate CR with counts of
last 5 digit keys.;
176 CR (R(I)-1E5*INT(R(I)*1E-5)) ++ 1; *At end of loop CR[N] is
number of keys;
177 END; *equal to N.;
178 DO I=1 TO HBOUND(CR); *Accumulate counts. At end of
loop CR[N] is;
179 CR(I) ++ CR(I-1); *number of last-5-digits keys
less than or;
180 END; *equal to N.;
181 DO J=DIM(R) TO 1 BY -1; *Populate S with items sorted
by last 5 digits;
182 KR = R(J) - 1E5*INT(R(J)*1E-5); *and at the same time
populate CS with counts;
183 I = CR(KR); *based on first-5-digit
keys.;
184 S(I) = R(J);
185 CS (INT(S(I)*1E-5)) ++ 1;
186 CR(KR) = I - 1;
187 END;
188 DO I=1 TO HBOUND(CR); *Accumulate counts. At end of
this loop CS[N];
189 CS(I) ++ CS(I-1); *is number of fist-5-digit
keys less than;
190 END; *or equal to N;
191 DO J=DIM(S) TO 1 BY -1; *Based on running sums stored
in CS move items;
192 KS = INT(S(J)*1E-5); *back to original array R. At
end of this loop;
193 I = CS(KS); *R is completely sorted.;
194 R(I) = S(J);
195 CS(KS) = I - 1;
196 END;
197
198 T = ROUND(TIME() - T,.001); *Compute time LSD-first radix
sort has taken;
199 PUT 'RADIX SORT TIME ' T 'SECONDS.'; *and display it.;
200
201 Z = 1; *Check if sorting is
complete.;
202 DO I=1 TO DIM(R)-1 WHILE(Z);
203 IF R(I) > R(I+1) THEN Z = 0;
204 END;
205 IF Z THEN PUT 'SORTED!';
206 ELSE PUT 'NOT SORTED!';
207 RUN;
RADIX SORT TIME 5.98 SECONDS.
SORTED!
NOTE: The DATA statement used 8.57 seconds.
So, LSD-first radix sort (5.98 seconds) runs more than 2 times faster than
the "i/o exercise" (12.35 seconds) beating the PROC SORT phase alone by
more than 40 percent. Note that the advantage of LSD-first radix sort over
PROC SORT is the greater, the more items have to be sorted: With R holding
1000,000 elements PROC SORT runs in 30.53 seconds, whilst radix sort gets
there in 11.81 seconds. The reason is, of course, that PROC SORT runs in
O(N*logN) time at best, while LSD-first radix sort runs in O(n) time. As we
know, for sorts this speed cannot be beaten even theoretically. However, at
low N (below 50,000) the overhead of extracting radices starts to show up,
at which point both sorts run at approximately the same pace.
Alex, I am intentionally omitting the code for sorting in descending order.
It is not difficult to figure out, but the effort of altering the code
above to accommodate the reverse order requires a firm understanding of how
the algorithm works. However, if you do not think it is fun, please drop me
a note.
Kind regards,
Paul M. Dorfman
IMA Plus
Jacksonville, FL
|