```Date: Tue, 12 Jan 1999 16:18:42 -0500 Reply-To: "Paul M. Dorfman" Sender: "SAS(r) Discussion" From: "Paul M. Dorfman" Organization: None Subject: Re: Sorting a big array Content-Type: text/plain; charset=koi8-r Alexander Martchenko , in part, wrote: >I have a large (near 500,000) _temporary_ array that gets populated with >10-digit positive integer account numbers somewhere in the middle of a DATA >step. I'd like to have it sorted to do something else in the same step. >Something like > >DATA ...; > ARRAY R(500000) _TEMPORARY_; > ...some logic... > ...populate array R based on that logic... > !!! here I need to sort R ascending... > ...do something with sorted R... > ...it's no longer sorted at this point... > !!! here I need to sort R descending... > ...do something with sorted R... >RUN; > >I know I could write R and the rest of the stuff to disk, use PROC SORT, read >everything back, process, do all that again for descending order, etc. It will >work but feels clumsy both logically and from the standpoint of i/o. I don't >recall any SAS routine to sort an array but remember some sas-l discussions on >the topic awile ago. So before I start that i/o exercise, I wonder if anyone >has any ideas how to sort an array such as R very efficiently (i.e. on par >with PROC SORT) _without_ leaving the DATA step. >I'll very much appreciate any suggestions, thoughts, code samples, etc. Alex, It can be done in a number of ways, but in your particular situation, when you are dealing with natural keys, I would opt for least-significant-digit-first (LSD-first) radix sort rather than quicksort. Before trying to beat PROC SORT, let us first establish its benchmark by simulating your "i/o exercise": 139 DATA RDISK (KEEP=RDISK); 140 ARRAY R(500000) _TEMPORARY_; 141 DO I=1 TO DIM(R); 142 R(I) = CEIL(RANUNI(1)*1E10); 143 END; 144 T = TIME(); 145 DO I=1 TO DIM(R); 146 RDISK = R(I); 147 OUTPUT; 148 END; 149 T = ROUND(TIME() - T,.001); 150 PUT 'ARRAY UNLOAD TIME ' T 'SECONDS.'; 151 RUN; ARRAY UNLOAD TIME 0.87 SECONDS. NOTE: The data set WORK.RDISK has 500000 observations and 1 variables. NOTE: The DATA statement used 3.08 seconds. 152 153 PROC SORT; BY RDISK; RUN; NOTE: The data set WORK.RDISK has 500000 observations and 1 variables. NOTE: The PROCEDURE SORT used 10.22 seconds. 154 155 DATA _NULL_; 156 ARRAY R(500000) _TEMPORARY_; 157 DO I=1 TO DIM(R); 158 SET RDISK; 159 R(I) = RDISK; 160 END; 161 RUN; NOTE: The DATA statement used 1.26 seconds. So, it takes 12.35 seconds, in all, on NT/233/64M. Now, here is the basic idea of LSD-first radix sort (it is kind of counterintuitive): Form a partial key by extracting the _rightmost_ 5 digits of each 10-digit item and use distribution counting to sort the array moving whole array elements accordingly to another array S(500000) _TEMPORARY_. Then do the same using _leftmost_ 5 digits as a key moving items back to R. Why use distribution counting to do partial sorts? Because it is 1) the fastest known sorting method requiring no key comparisons; 2) it is stable, i.e., it retains the relative order of items being sorted. The latter property is critical here - otherwise LSD-first sort will not work. And why use 5-digit partial keys? Because this way, the entire sort will be completed in two passes only, and one (or, in the implementation below, two) array holding 100,000 items will be required for digit counts - which should not pose any memory problems. Code: 162 163 DATA _NULL_; 164 ARRAY R (500000) _TEMPORARY_; *to be sorted; 165 ARRAY S (500000) _TEMPORARY_; *to hold LSD-sorted items after 1st pass; 166 ARRAY CR(0:99999) _TEMPORARY_; *to hold 1st pass counts/running sums; 167 ARRAY CS(0:99999) _TEMPORARY_; *to hold 2nd pass counts while filling S; 168 169 DO I=1 TO DIM(R); 170 R(I) = CEIL(RANUNI(1)*1E10); *populate R with random 10-digit keys; 171 END; 172 173 T = TIME(); *clock on; 174 175 DO I=1 TO DIM(R); *Populate CR with counts of last 5 digit keys.; 176 CR (R(I)-1E5*INT(R(I)*1E-5)) ++ 1; *At end of loop CR[N] is number of keys; 177 END; *equal to N.; 178 DO I=1 TO HBOUND(CR); *Accumulate counts. At end of loop CR[N] is; 179 CR(I) ++ CR(I-1); *number of last-5-digits keys less than or; 180 END; *equal to N.; 181 DO J=DIM(R) TO 1 BY -1; *Populate S with items sorted by last 5 digits; 182 KR = R(J) - 1E5*INT(R(J)*1E-5); *and at the same time populate CS with counts; 183 I = CR(KR); *based on first-5-digit keys.; 184 S(I) = R(J); 185 CS (INT(S(I)*1E-5)) ++ 1; 186 CR(KR) = I - 1; 187 END; 188 DO I=1 TO HBOUND(CR); *Accumulate counts. At end of this loop CS[N]; 189 CS(I) ++ CS(I-1); *is number of fist-5-digit keys less than; 190 END; *or equal to N; 191 DO J=DIM(S) TO 1 BY -1; *Based on running sums stored in CS move items; 192 KS = INT(S(J)*1E-5); *back to original array R. At end of this loop; 193 I = CS(KS); *R is completely sorted.; 194 R(I) = S(J); 195 CS(KS) = I - 1; 196 END; 197 198 T = ROUND(TIME() - T,.001); *Compute time LSD-first radix sort has taken; 199 PUT 'RADIX SORT TIME ' T 'SECONDS.'; *and display it.; 200 201 Z = 1; *Check if sorting is complete.; 202 DO I=1 TO DIM(R)-1 WHILE(Z); 203 IF R(I) > R(I+1) THEN Z = 0; 204 END; 205 IF Z THEN PUT 'SORTED!'; 206 ELSE PUT 'NOT SORTED!'; 207 RUN; RADIX SORT TIME 5.98 SECONDS. SORTED! NOTE: The DATA statement used 8.57 seconds. So, LSD-first radix sort (5.98 seconds) runs more than 2 times faster than the "i/o exercise" (12.35 seconds) beating the PROC SORT phase alone by more than 40 percent. Note that the advantage of LSD-first radix sort over PROC SORT is the greater, the more items have to be sorted: With R holding 1000,000 elements PROC SORT runs in 30.53 seconds, whilst radix sort gets there in 11.81 seconds. The reason is, of course, that PROC SORT runs in O(N*logN) time at best, while LSD-first radix sort runs in O(n) time. As we know, for sorts this speed cannot be beaten even theoretically. However, at low N (below 50,000) the overhead of extracting radices starts to show up, at which point both sorts run at approximately the same pace. Alex, I am intentionally omitting the code for sorting in descending order. It is not difficult to figure out, but the effort of altering the code above to accommodate the reverse order requires a firm understanding of how the algorithm works. However, if you do not think it is fun, please drop me a note. Kind regards, Paul M. Dorfman IMA Plus Jacksonville, FL ```

Back to: Top of message | Previous page | Main SAS-L page