Date: Fri, 9 Jul 2010 19:13:04 -0400
Reply-To: Paul Dorfman <sashole@BELLSOUTH.NET>
Sender: "SAS(r) Discussion" <SASL@LISTSERV.UGA.EDU>
From: Paul Dorfman <sashole@BELLSOUTH.NET>
Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
function (was: Pass date values))
Kevin,
Frankly, your assertion that "Actually it's quite easy to verify sortedness
during the binary search" has got me greatly puzzled. I admit that I cannot
conceive how this is possible, let alone how it could be quite easy. What I
do find
quite easy is to arrange search keys in such a way that the binary search
proceeds without a glitch, hits the "right" keys, gives no token that some
other keys are unsorted, and of course results in a wrong answer. Here is a
simple example:
data _null_ ;
  array a [7] _temporary_ (1 3 4 5 6 7 9) ;
  key = 4 ;
  l = lbound (a) ;
  h = hbound (a) ;
  f = . ;
  do until (l > h or f) ;
    m = floor ((h + l) * .5) ;
    putlog l= m= h= ;
    if key < a[m] then h = m - 1 ;
    else if key > a[m] then l = m + 1 ;
    else f = 1 ;
  end ;
  putlog f= m= ;
run ;
This is a typical binary search: if the key is found, F=1 and M holds its
index; if not, F remains missing. The first PUTLOG tracks the endpoints and
the midpoint as the algorithm iterates. As expected, the step prints:
l=1 m=4 h=7
l=1 m=2 h=3
l=3 m=3 h=3
f=1 m=3
and as expected, F=1 and a[3]=4. Now let us change the array to
array a [7] _temporary_ (1 3 6 5 4 7 9) ;
i.e. swap 4 and 6, and run the step again. Now it prints:
l=1 m=4 h=7
l=1 m=2 h=3
l=3 m=3 h=3
f=. m=3
F=. indicates that key=4 is not found, which is false - it is in the array.
Otherwise, there is not the slightest indication that anything went wrong, as
the algorithm, converging, probes exactly the same positions. Of course, it
is clear why this occurs: the entire upper part of the array starting with
m=4 is rejected during the very first iteration, so the rest of the
algorithm operates only on the perfectly sorted keys from m=1 to m=3.
By induction, it is always possible to concoct a disordered arrangement of
keys in such a way that the binary search algorithm will hit only the keys
which in relation to each other are perfectly sorted, while the keys in
between are not. Hence the scheme will proceed happily down the list
yielding a wrong result, while the keys it hits in the process will give no
clue how other keys in the array might be arranged.
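The construction is mechanical, so it can be sketched in a few lines of code
(Python here purely for illustration, since the point is
language-independent; the function names are mine): record the probe path of
a search in a sorted array, then disorder every key OFF that path. The
search on the disordered array hits exactly the same keys and returns
exactly the same answer.

```python
def bsearch(a, key):
    """Plain binary search; returns (index or None, list of probed indices)."""
    lo, hi, path = 0, len(a) - 1, []
    while lo <= hi:
        mid = (lo + hi) // 2
        path.append(mid)
        if key < a[mid]:
            hi = mid - 1
        elif key > a[mid]:
            lo = mid + 1
        else:
            return mid, path
    return None, path

def disorder_off_path(a, key):
    """Record the probe path of a search for `key` in sorted `a`, then
    reverse every key NOT on that path. The probed keys are untouched, so
    the search sees nothing amiss."""
    _, path = bsearch(a, key)
    on_path = set(path)
    off = [i for i in range(len(a)) if i not in on_path]
    vals = [a[i] for i in reversed(off)]
    b = list(a)
    for i, v in zip(off, vals):
        b[i] = v
    return b

a = list(range(100))               # perfectly sorted
b = disorder_off_path(a, 37)       # grossly disordered off the probe path
print(bsearch(a, 37) == bsearch(b, 37))   # -> True: same probes, same answer
```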
Now, the only means "to verify sortedness during the binary search" are the
keys the binary search inspects while it converges. If all such keys are
ordered relative to each other, how can we make any conclusion about the
"sortedness" of the rest of the keys?
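To make the limitation concrete, here is a sketch (in Python, for
illustration only; the bracket check is my guess at what verifying
sortedness "during" the search could mean for unique keys): every probed key
is tested against the bracket established by the earlier probes, every probe
passes - and yet a key present in the array is pronounced not found.

```python
import math

def verified_binary_search(a, key):
    """Binary search that checks every probed key against the bracket
    implied by the earlier probes - the only keys the algorithm ever sees.
    Assumes unique keys (strict inequalities)."""
    lo, hi = 0, len(a) - 1
    lo_val, hi_val = -math.inf, math.inf
    while lo <= hi:
        mid = (lo + hi) // 2
        v = a[mid]
        if not (lo_val < v < hi_val):   # probe contradicts earlier probes
            raise ValueError("input is not sorted")
        if key < v:
            hi, hi_val = mid - 1, v
        elif key > v:
            lo, lo_val = mid + 1, v
        else:
            return mid
    return None

# All probes (5, 2, 3) pass the bracket check, so no violation is raised,
# yet the array is unsorted and the present key 4 is pronounced "not found":
print(verified_binary_search([1, 2, 3, 5, 4, 7, 9], 4))   # -> None
```

(This check does happen to catch the 4-and-6 swap in the SAS example above;
the point is that passing it proves nothing about the keys the search never
inspects.)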
You also write: "This approach does not *guarantee* that the data is
actually 100% sorted, but is adequate in my opinion." I cannot see how one
can be sure of any result from applying the binary search to data that might
be only, say, 99% sorted. The above example shows that swapping just 2 keys
in an otherwise completely sorted array is enough to knock things out of
whack.
I would agree if you said that in totally random data, the probability
that the keys the binary search hits happen to be in relative order is
negligibly small for a large data collection. For example, if we have 1
million (unique) keys, the binary search inspects about log2(1e6), i.e. 20,
keys, and a sorted arrangement of 20 random keys is only one among 20!
possible arrangements. So, in this case the binary search can be first used
to discover that the data are NOT in order and act accordingly. But in that
case, why bother with the binary search? Just pick a random sample and see
if its keys are in order.
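The random-sample alternative is equally easy to sketch (Python again, for
illustration; the function and its defaults are mine). Note it can only ever
prove the data are NOT sorted, never that they are:

```python
import random

def probably_sorted(keys, sample_size=20, seed=None):
    """Negative test: draw a random sample at ascending positions and check
    whether its keys are in order. A False answer proves the data unsorted;
    a True answer proves nothing about the unsampled keys."""
    rng = random.Random(seed)
    k = min(sample_size, len(keys))
    idx = sorted(rng.sample(range(len(keys)), k))
    sample = [keys[i] for i in idx]
    return all(x <= y for x, y in zip(sample, sample[1:]))
```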
A big problem with such a "negative" test is that since we know nothing about
data distribution a priori, the odds of getting unsorted data if we fail to
check the entire set for "sortedness" are unacceptably high, especially in
the real world where "almost" sorted data are encountered much more often
than well randomized data.
Personally, I would never apply any technique based on the assumption that
the data are sorted without 100 per cent guarantee that the assumed order is
there. Either I would have a complete idea of and full confidence in the
data preparation process (proc sort/sql/summary/freq, sorted hash, system
sort utility or order by in an RDBMS before the data got to SAS), or I would
do a complete serial pass to ensure the order.
I realize that whoever uses SORTEDBY as an explicit option bears
responsibility, but it is also reasonable to expect that if SAS relies on
data being specifically shaped before any of its internal algorithms could
be used, it would at least give a warning upon finding out that the data do
not conform to the shape.
In fact, it is a longstanding SAS tradition. Any SAS process using BY KEY
will stop with an error once it is discovered that KEY is not in order.
Would we expect, by the same token, that if SAS used the binary search every
time SORTEDBY was set, it would check whether the algorithm can be applied
correctly? For one, I would. And I would certainly find it impossible to
fancy that, upon seeing SORTEDBY set, SAS would use the binary search
without any regard to whether the algorithm's prerequisites are met and
return a wrong result without so much as a warning.
In the case of BY, it is possible to verify that the keys are in order during
the process itself, because any algorithm using BY (say, MERGE) is
sequential. However, as we have seen above, the binary search, running
against possibly unsorted data, by its own nature provides no means of
detecting *in the process* that some keys are out of order - while ALL keys
being completely in order is the fundamental assumption, without which the
correctness of the binary search cannot be guaranteed. Therefore, something
else would have to be used to verify that the input data are in order, and
that can be done only via at least one sequential pass - which completes the
vicious circle.
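That sequential pass is itself trivial to write - the trouble is solely its
O(N) cost, which forfeits precisely the O(log2(N)) advantage the binary
search was after. A Python sketch of the airtight check:

```python
def is_sorted(keys):
    """The only airtight check: one full sequential pass over all N keys,
    comparing each adjacent pair."""
    return all(x <= y for x, y in zip(keys, keys[1:]))

print(is_sorted([1, 3, 4, 5, 6, 7, 9]))   # -> True
print(is_sorted([1, 3, 6, 5, 4, 7, 9]))   # -> False
```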
It does not mean I am against the "substantially beneficial performance
enhancement" - au contraire! I am all for it. But before SAS implements it,
they should first provide some internal means to either (a) auto-distinguish
a SORTEDBY set by a SAS procedure from one set via the data set option, and
use the binary search only in the former case, or (b) create an IKWIAD
option ("I know what I am doing") to let the user accept full
responsibility - and perhaps to generate a warning when the option is set to
YES.
Kind regards

Paul Dorfman
Jax, FL

On Fri, 9 Jul 2010 13:01:59 -0500, Kevin Myers <KevinMyers@AUSTIN.RR.COM>
wrote:
>Paul -
>
>Actually it's quite easy to verify sortedness during the binary search,
>which also prevents any potential infinite looping. This approach does not
>*guarantee* that the data is actually 100% sorted, but is adequate in my
>opinion. There are other areas where SAS employs a similar level of "trust"
>in the SORTEDBY metadata (as exhibited by your SORT example), though I
>believe there may be some options for controlling this behavior.
>
>Finally, the SORTEDBY metadata is always correct if SAS software put it
>there automatically. The only case where it's wrong is if the user forces
>the SORTEDBY attribute onto a dataset that isn't actually sorted (as in your
>example). And really, if they did that, then that's their problem, and that
>shouldn't hold the rest of us back from a substantially beneficial
>performance enhancement.
>
>s/KAM
>
>
>----- Original Message -----
>From: "Paul Dorfman" <sashole@BELLSOUTH.NET>
>To: <SASL@LISTSERV.UGA.EDU>; "Kevin A. Myers" <KevinMyers@AUSTIN.RR.COM>
>Sent: Friday, July 09, 2010 12:36
>Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
>function (was: Pass date values))
>
>
>> On Thu, 8 Jul 2010 12:23:10 -0500, Kevin Myers <KevinMyers@AUSTIN.RR.COM>
>> wrote:
>>
>>>In the case of a sorted data set, I would *hope* that SAS would search for
>>>the first matching record by using a *binary*, rather than sequential,
>>>search. If that isn't the case, then someone needs to suggest using a
>>>binary search to locate the first matching observation as a performance
>>>enhancement.
>>
>> Kevin,
>>
>> One problem with this is that the only indication that a data set *might*
>> be sorted is SORTEDBY=, but its being set in no way guarantees that the
>> file is actually physically ordered by the key(s) listed:
>>
>> data test (sortedby = key) ;
>> do key = 9, 1, 3, 2, 8, 5 ;
>> output ;
>> end ;
>> run ;
>>
>> Look at the table TEST's properties, and the metadata plainly tells you
>> the data set is sorted by KEY. (It reminds me of the old Bob Virgile
>> puzzle, where a file similar to the above was explicitly sorted, and the
>> ensuing step whose BY statement relied on the sort order would fail
>> because SORTEDBY= fooled SORT, used without the FORCE option, into doing
>> nothing:
>>
>> 8 proc sort data = test ;
>> 9 by key ;
>> 10 run ;
>> NOTE: Input data set is already sorted, no sorting done.)
>>
>> So if SAS were to infer from SORTEDBY= that the file is indeed sorted and
>> therefore "binary search" it, the search algorithm will either (a) enter
>> an infinite loop, in which case you have at least some indication of the
>> trouble, albeit at an ugly expense, or (b) finish perfectly well - in
>> which case it will produce an erroneous result, giving no token whatsoever
>> that something wrong has happened, which is much more dangerous.
>>
>> Of course, the underlying software can easily check beforehand whether
>> the file is in fact physically in order, but this would be tantamount to
>> a whole pass through the file, thereby defeating the reason for using the
>> binary search in the first place.
>>
>> The only way out of the SORTEDBY= trap would be for SAS to create a
>> system option banning SORTEDBY from being set by anything but a
>> successfully executed SORT or ORDER BY. Then, if SORTEDBY= were set, the
>> binary search against the keys listed under it could be used securely.
>>
>> Kind regards
>> 
>> Paul Dorfman
>> Jax, FL
>> 
>>
