|
Robert,
In principle, your own solution does not look badly at all. Let me reproduce
it if I understand it right. First, some test data (let us limit the sample
to mere 5000 observations), then the solution:
*** test data;
data a;
array v(1:200);
do i=1 to 5000;
do j=1 to dim(v); v(j) = ranuni(1); end;
output;
end;
run;
*** transpose-sort-transpose;
data nw (keep=n w);
array v(1:200);
set a;
n ++ 1;
do j=1 to dim(v);
w = v(j); output;
end;
run;
proc sort; by n w; run;
data s (keep=v1-v50 v151-v200);
array v (1:200);
do j=1 by 1 until (last.n);
set nw; by n;
v(j) = w;
end;
run;
This runs in (0.82+3.38+1.94)=6.14 CPU seconds under OS/390. Now let us turn
to the solution, indeed elegant, proposed by Ludwig Boltzmann (and, with a
significant distinction, by Mike Zdeb). He offers to execute the ORDINAL
function 100 times in each observation: 50 times to get the lowest 50
scores, and 50 times - to get the highest ones:
data b (keep=v1-v50 v151-v200);
set a;
array v(1:200);
array o(1:200) _temporary_;
do j=1 to 50,151 to 200; o(j) = ordinal(j,of v(*)); end;
do j=1 to 50,151 to 200; v(j) = o(j); end;
run;
Cool, ain't it? Unfortunately, it just proves once again that not all is
gold that glitters: Using ORDINAL is a typical example where a small number
of elegant DATA step instructions translates into a heck of a job for the
computer. And it should be no surprise, for the instruction ORDINAL(j,of
v(*)), however short, makes the machine scan the entire array. As a result,
the code above takes a whopping 158.42 CPU seconds to execute. Getting back
to Mike's suggestion, if the *entire* array were to be sorted using ORDINAL
(that is, the function were called 200 times per record), it would take
twice the time.
Bob Burnham proposes to use the bubble sort. It is indeed quite simple to
code, and performs noticeably better, for even though it has to make N scans
through the array, it does not scan the *entire* array each time. I mimicked
Bob's sort as
data s (keep=v1-v50 v151-v200);
array v (1:200);
set a;
do i=hbound(v) to lbound(v) by -1;
do j=1 to i;
if v(i) => v(j) then continue;
t = v(i); v(i) = v(j); v(j) = t;
end;
end;
run;
It takes 48.49 CPU seconds to do the job - much better than either ORDINAL
solution, but still a far cry from the "standard" transpose-sort-transpose.
Interestingly, bubble sort, due to its very nature, has more potential than
was realized above. Why is it call "bubble", anyway? Because each time the
outer loop iterates, the next largest value "bubbles" to the top of the
array. But if so, why should we execute the loop all 200 times when 50 times
in one direction and 50 times in the other direction will suffice - we only
care about 50 smallest and 50 largest values, so why sort the entire thing?
Therefore, Bob essentially falls in the same trap as Mike. The amended
bubble sort approach could look like this:
data s (keep=v1-v50 v151-v200);
array v (1:200);
set a;
do i=hbound(v) to hbound(v)-49 by -1;
do j=1 to i;
if v(i) <= v(j) then continue;
t = v(i); v(i) = v(j); v(j) = t;
end;
end;
do i=hbound(v) to hbound(v)-49 by -1;
do j=1 to i;
if v(i) => v(j) then continue;
t = v(i); v(i) = v(j); v(j) = t;
end;
end;
run;
This code executes in 38.56 CPU seconds, thus saving a good chunk off the
full-blown 48.49; however, it is still nowhere near the "standard" 6.14! So,
it not possible to beat it, is it? Of course it is, but then, just as Bob
notes, a principally different sorting method should be used. Perhaps any
sort running in O(N*log(N)) time will be competitive. I will leave the sas-l
public puzzled as to the nature of the SORT link routine below, but an
attentive programmer should be able to recognize the algorithm readily.
Those having succeeded tell those who have failed. Note that this is an
example, in a sense *opposite* to the ORDINAL approach - it contains a lot
of DATA step commands, however judging from the run-time, they must
translate in a relatively few machine instructions:
data s (keep=v1-v50 v151-v200);
array v (1:200);
set a;
link sort;
return;
sort:
%macro x (i,j); do; t=v(&i); v(&i)=v(&j); v(&j)=t; end; %mend x;
array z(0:1,0:50) _temporary_;
l = lbound(v); r = hbound(v);
do s=1 by 0 while (s);
i=(l + r)*.5;
if v(l) > v(i) then %x (l,i);
if v(l) > v(r) then %x (l,r);
if v(i) > v(r) then %x (i,r);
p=v(i); v(i)=v(l); v(l)=p; i=l; j=r+1;
do until (i => j);
do i=i+1 by +1 until (v(i) => p); end;
do j=j-1 by -1 until (v(j) <= p); end;
if i < j then %x (i,j);
end;
v(l)=v(j); v(j)=p;
if r-j => j-l > 9 then do s=s+1;
z(0,s)=j+1; z(1,s)=r; r=j-1;
end;
else if j-l => r-j > 9 then do s=s+1;
z(1,s)=j-1; z(0,s)=l; l=j+1;
end;
else if j-l > 9 => r-j then r=j-1;
else if r-j > 9 => j-l then l=j+1;
else do;
l=z(0,s); r=z(1,s); s=s-1;
end;
end;
do j=lbound(v)+1 to hbound(v);
if v(j-1) > v(j) then do;
p=v(j);
do i=j-1 to lbound(v) by -1;
if p => v(i) then leave;
v(i+1)=v(i);
end;
v(i+1)=p;
end;
end;
run;
Although this code performs a full-blown array sort, the entire thing takes
4.04 CPU seconds - about 50% improvement over the "standard". Surely it is
not enough to justify coding the sort routine above from scratch (it is not
difficult but takes *a lot* more time and programming expertise than putting
the "standard" SAS code together). On the other hand, 1)one can get the code
from sas-l (you just did) 2) with a much wider array and more records, the
performance advantage would be much more significant.
Kind regards,
====================
Paul M. Dorfman
Jacksonville, Fl
====================
>From: "robert lokhamp" <rlokhamp@hotmail.com>
>What do you think would be a more efficient way to do the following. I've
>got a SAS dataset with 200 (numeric) scores, V1-v200. I need to move 50
>lowest scores to variables V1-v50, and 50 highest scores to V151-V200, and
>drop the rest. The dataset has approximately 100,000 observations. So far,
>I've been transposing the dataset 'vertically', then sorting it, then
>transposing it back. At this point the dataset is like original but V1-V200
>are sorted in each obs. Then I would just drop the mid-vars. Gets the job
>done but is pretty slow and feels kinda tacky. Any other ways (more
>elegant, faster, etc)?
_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com
|