Date: Fri, 9 Jul 2010 19:13:04 -0400
From: Paul Dorfman
To: "Kevin A. Myers"
List: SAS-L ("SAS(r) Discussion")
Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX function (was: Pass date values))

Kevin,

Frankly, your assertion that "Actually it's quite easy to verify sorted-ness during the binary search" has me greatly puzzled. I admit that I cannot conceive how this is possible, let alone how it could be quite easy. What I do find quite easy is to arrange search keys in such a way that the binary search proceeds without a glitch, hits the "right" keys, gives no indication that some other keys are unsorted, and, of course, returns a wrong answer. Here is a simple example:

data _null_ ;
  array a [7] _temporary_ (1 3 4 5 6 7 9) ;
  key = 4 ;
  l = lbound (a) ;
  h = hbound (a) ;
  f = . ;
  do until (l > h | f) ;
    m = floor ((h + l) * .5) ;
    putlog l= m= h= ;
    if key < a[m] then h = m - 1 ;
    else if key > a[m] then l = m + 1 ;
    else f = 1 ;
  end ;
  putlog f= m= ;
run ;

This is a typical binary search: if the key is found, F=1 and M holds its index; if not, F remains missing (F=.). The first PUTLOG tracks the end- and mid-points as the algorithm iterates. As expected, the step prints:

l=1 m=4 h=7
l=1 m=2 h=3
l=3 m=3 h=3
f=1 m=3

and as expected, F=1 and a[3]=4. Now let us change the array to

array a [7] _temporary_ (1 3 6 5 4 7 9) ;

i.e. swap 4 and 6, and run the step again. Now it prints:

l=1 m=4 h=7
l=1 m=2 h=3
l=3 m=3 h=3
f=. m=3

F=. indicates that key=4 is not found, which is false - it is in the array. Otherwise, there is not the slightest indication that anything went wrong: converging, the algorithm probes exactly the same positions. Of course, it is clear why this occurs: the entire upper part of the array, starting with m=4, is rejected during the very first iteration, so the rest of the algorithm operates only on the perfectly sorted keys from m=1 to m=3.
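For anyone who wants to replay the experiment outside SAS, here is a minimal sketch of the same search in Python (my translation, not SAS code; the 1-based indexing mirrors the DATA step array):

```python
def bsearch(a, key):
    """1-based binary search mirroring the DATA step: index if found, None if not."""
    l, h = 1, len(a)
    while l <= h:
        m = (l + h) // 2
        if key < a[m - 1]:
            h = m - 1
        elif key > a[m - 1]:
            l = m + 1
        else:
            return m
    return None

print(bsearch([1, 3, 4, 5, 6, 7, 9], 4))   # 3: found, as in the first run
print(bsearch([1, 3, 6, 5, 4, 7, 9], 4))   # None: "not found", though 4 is present
```

The second call rejects the upper half at the first probe, exactly as the DATA step does, and silently misses the key.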

By induction, it is always possible to concoct a disordered arrangement of keys in such a way that the binary search algorithm will hit only the keys which in relation to each other are perfectly sorted, while the keys in between are not. Hence the scheme will proceed happily down the list yielding a wrong result, while the keys it hits in the process will give no clue how other keys in the array might be arranged.
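The induction can be made concrete: take a sorted array, record which positions a search for some key probes, and scramble everything else. A sketch in Python (the construction and all names are mine):

```python
def binary_search(a, key):
    """Plain 0-based binary search: returns (index, probe_path), index None if absent."""
    lo, hi = 0, len(a) - 1
    probes = []
    while lo <= hi:
        m = (lo + hi) // 2
        probes.append(m)
        if key < a[m]:
            hi = m - 1
        elif key > a[m]:
            lo = m + 1
        else:
            return m, probes
    return None, probes

# Start from a sorted array and record which positions the search probes.
a = list(range(1, 16))                  # 1..15, sorted
idx, probes = binary_search(a, 6)       # probes positions 7, 3, 5 only

# Scramble every position the search never inspects.  The probe path depends
# only on the probed values, so the search repeats the exact same steps and
# "succeeds" - on an array that is no longer sorted at all.
b = a[:]
untouched = [i for i in range(len(a)) if i not in probes]
for i, v in zip(untouched, reversed([a[i] for i in untouched])):
    b[i] = v

idx2, probes2 = binary_search(b, 6)
assert (idx2, probes2) == (idx, probes) and b != sorted(b)
```

Since the probed positions keep their values, no conceivable inspection of the probes alone can distinguish the scrambled array from the sorted one.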

Now, the only means "to verify sorted-ness during the binary search" are the keys the binary search inspects while it converges. If all such keys are ordered relative to each other, how can we make any conclusion about the "sorted-ness" of the rest of the keys?

You also write: "This approach does not *guarantee* that the data is actually 100% sorted, but is adequate in my opinion." I cannot see how one can be sure of any result of applying the binary search to data that might be only, say, 99% sorted. The above example shows that swapping only two keys in an otherwise completely sorted array is enough to knock things out of whack.

I would agree if you said that in totally random data, the probability that the keys the binary search hits happen to be in relative order is negligibly small for a large data collection. For example, if we have 1 million (unique) keys, the search inspects about ceil(log2(1e6)) = 20 keys, and a sorted arrangement of those 20 keys is only one among 20! possible arrangements. So, in this case the binary search could first be used to discover that the data are NOT in order and act accordingly. But then why bother with the binary search? Just pick a random sample and see whether its keys are in order.
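The arithmetic, and the sampling alternative, can be sketched as follows (Python; names are mine; note that a million-key search inspects about twenty keys, since 2^20 is roughly 10^6):

```python
import math
import random

n = 1_000_000
k = math.ceil(math.log2(n))        # keys a binary search probes: about 20
print(k)                           # 20

# Chance that k distinct keys drawn from a random arrangement happen to be
# in increasing order: one arrangement out of k!.
p = 1 / math.factorial(k)
print(f"{p:.3e}")                  # 4.110e-19

# The "just pick a random sample" alternative: inspect k random positions
# in index order and see whether their values are nondecreasing.
def sample_looks_sorted(a, k, rng=random):
    pos = sorted(rng.sample(range(len(a)), min(k, len(a))))
    return all(a[i] <= a[j] for i, j in zip(pos, pos[1:]))
```

Either way, a passing check is only probabilistic evidence, which is precisely the objection raised below.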

A big problem with such a "negative" test is that, since we know nothing about the data distribution a priori, the odds of acting on unsorted data if we fail to check the entire set for "sorted-ness" are unacceptably high, especially in the real world, where "almost" sorted data are encountered much more often than well-randomized data.

Personally, I would never apply any technique based on the assumption that the data are sorted without a 100 per cent guarantee that the assumed order is there. Either I would have complete knowledge of, and full confidence in, the data preparation process (PROC SORT/SQL/SUMMARY/FREQ, a sorted hash, a system sort utility, or ORDER BY in an RDBMS before the data got to SAS), or I would do a complete serial pass to ensure the order.

I realize that whoever uses SORTEDBY= as an explicit option bears the responsibility, but it is also reasonable to expect that if SAS relies on data being specifically shaped before any of its internal algorithms can be used, it would at least issue a warning upon finding out that the data do not conform to that shape.

In fact, it is a long-standing SAS tradition: any SAS process using BY KEY will stop with an error once it discovers that KEY is not in order. Would we expect, by the same token, that if SAS used the binary search every time SORTEDBY= was set, it would check whether the algorithm can be applied correctly? For one, I would. And I would certainly find it impossible to fancy that, upon seeing SORTEDBY= set, SAS would use the binary search without any regard to whether the algorithm's prerequisites are met and return a wrong result without so much as a warning.
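The BY-processing behavior is easy to mimic: a sequential scan can stop at the very first key that steps backward. A toy analogue in Python (not SAS's actual implementation; the function name is mine):

```python
def check_by_order(keys):
    """Fail fast the moment a BY key goes backward - possible only because
    the scan is sequential."""
    for i in range(1, len(keys)):
        if keys[i] < keys[i - 1]:
            raise ValueError(
                f"BY value out of order at observation {i + 1}: "
                f"{keys[i]} after {keys[i - 1]}")

check_by_order([1, 3, 4, 5])         # passes silently
try:
    check_by_order([1, 3, 6, 5, 4])  # stops at the first misordered key
except ValueError as e:
    print(e)
```

The point is that this detection is a free by-product of visiting every key in order, which a binary search never does.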

In the case of BY, it is possible to verify that the keys are in order during the process itself, because any algorithm using BY (say, MERGE) is sequential. However, as we have seen above, the binary search, running against possibly unsorted data, by its very nature provides no means of detecting *in the process* that some keys are out of order; yet ALL keys being completely in order is the fundamental assumption without which the correctness of the binary search cannot be guaranteed. Therefore, something else would have to verify that the input data are in order, and that can be done only via at least one sequential pass - which completes the vicious circle.
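That one sequential pass amounts to nothing more than this (a Python sketch, assuming an in-memory array):

```python
def is_sorted(a):
    """The only airtight check: one complete sequential pass over the keys."""
    return all(x <= y for x, y in zip(a, a[1:]))

assert is_sorted([1, 3, 4, 5, 6, 7, 9])
assert not is_sorted([1, 3, 6, 5, 4, 7, 9])   # the swapped array from above
```

It is O(n), which is exactly the cost the binary search was supposed to avoid - hence the vicious circle.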

It does not mean I am against the "substantially beneficial performance enhancement" - au contraire! I am all for it, but before SAS implements it, they should first provide some internal means either to (a) auto-distinguish a SORTEDBY flag set by a SAS procedure from one set via the data set option, and use the binary search only in the former case, or (b) create an IKWIAD option ("I know what I am doing") to let the user accept full responsibility - and perhaps generate a warning when the option is set to YES.
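Option (b) might look roughly like the following sketch (Python; IKWIAD spelled out as a keyword argument; entirely my illustration, not a SAS feature):

```python
def lookup(a, key, i_know_what_i_am_doing=False):
    """Binary-search only when the caller explicitly accepts responsibility
    (a toy analogue of the proposed IKWIAD option); otherwise fall back to
    a safe sequential scan."""
    if i_know_what_i_am_doing:
        lo, hi = 0, len(a) - 1
        while lo <= hi:
            m = (lo + hi) // 2
            if key < a[m]:
                hi = m - 1
            elif key > a[m]:
                lo = m + 1
            else:
                return m
        return None
    # Sequential scan: slower, but correct regardless of order.
    return next((i for i, x in enumerate(a) if x == key), None)
```

On the unsorted array from the example, the default sequential path still finds the key, while the "trusting" path silently misses it, which is the whole argument for making the trust explicit.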

Kind regards
------------
Paul Dorfman
Jax, FL
------------

On Fri, 9 Jul 2010 13:01:59 -0500, Kevin Myers <KevinMyers@AUSTIN.RR.COM> wrote:

>Paul -
>
>Actually it's quite easy to verify sorted-ness during the binary search,
>which also prevents any potential infinite looping. This approach does not
>*guarantee* that the data is actually 100% sorted, but is adequate in my
>opinion. There are other areas where SAS employs a similar level of "trust"
>in the SORTEDBY metadata (as exhibited by your SORT example), though I
>believe there may be some options for controlling this behavior.
>
>Finally, the SORTEDBY metadata is always correct if SAS software put it
>there automatically. The only case where it's wrong is if the user forces
>the SORTEDBY attribute onto a dataset that isn't actually sorted (as in
>your example). And really, if they did that, then that's their problem,
>and that shouldn't hold the rest of us back from a substantially
>beneficial performance enhancement.
>
>s/KAM
>
>----- Original Message -----
>From: "Paul Dorfman" <sashole@BELLSOUTH.NET>
>To: <SAS-L@LISTSERV.UGA.EDU>; "Kevin A. Myers" <KevinMyers@AUSTIN.RR.COM>
>Sent: Friday, July 09, 2010 12:36
>Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
>function (was: Pass date values))
>
>> On Thu, 8 Jul 2010 12:23:10 -0500, Kevin Myers <KevinMyers@AUSTIN.RR.COM>
>> wrote:
>>
>>>In the case of a sorted data set, I would *hope* that SAS would search
>>>for the first matching record by using a *binary*, rather than
>>>sequential, search. If that isn't the case, then someone needs to
>>>suggest using a binary search to locate the first matching observation
>>>as a performance enhancement.
>>
>> Kevin,
>>
>> One problem with this is that the only indication that a data set *might*
>> be sorted is SORTEDBY=, but its being set in no way guarantees that the
>> file is actually physically ordered by the key(s) listed:
>>
>> data test (sortedby = key) ;
>>   do key = 9, 1, 3, 2, 8, 5 ;
>>     output ;
>>   end ;
>> run ;
>>
>> Look at the table TEST's properties, and the metadata plainly tells you
>> the data set is sorted by KEY. (It reminds me of the old Bob Virgile's
>> puzzle, where a file similar to the above was explicitly sorted and the
>> ensuing step whose BY statement relied on the sort order would fail
>> because the SORTEDBY= fooled SORT used without the FORCE option into
>> doing nothing:
>>
>> 8   proc sort data = test ;
>> 9     by key ;
>> 10  run ;
>> NOTE: Input data set is already sorted, no sorting done.)
>>
>> So if SAS were to infer from SORTEDBY= that the file is indeed sorted
>> and therefore "binary search" it, the search algorithm will either (a)
>> enter an infinite loop, in which case you have at least some indication
>> of the trouble, albeit at an ugly expense, or (b) finish perfectly well -
>> in which case it will produce an erroneous result giving no token
>> whatsoever that something wrong has happened, which is much more
>> dangerous.
>>
>> Of course the underlying software can easily check beforehand if the
>> file is in fact physically in order, but this will be tantamount to a
>> whole pass through the file, thereby defeating the reason for using the
>> binary search in the first place.
>>
>> The only way out of the SORTEDBY= trap would be for SAS to create a
>> system option banning SORTEDBY from being set by anything but a
>> successfully executed SORT or ORDER BY. Then, if SORTEDBY= were set, the
>> binary search against the keys listed under it could be used securely.
>>
>> Kind regards
>> ------------
>> Paul Dorfman
>> Jax, FL
>> ------------
