Date: Fri, 9 Jul 2010 18:38:58 -0500
Reply-To: Kevin Myers
Sender: "SAS(r) Discussion"
From: Kevin Myers
Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX function (was: Pass date values))
Comments: To: Paul Dorfman
Content-Type: text/plain; format=flowed; charset="iso-8859-1"; reply-type=original

Paul -

Yes, your example is exactly right, but that doesn't mean the proposed
enhancement would be any less safe than the current behavior. What I meant
by "actually it's quite easy..." is that on each iteration of the binary
search loop, you can verify that key values in earlier observations visited
by the algorithm are less than or equal to key values in later observations.
But again, this is true *ONLY* for observations that are actually *read* by
the binary search algorithm. If the algorithm doesn't happen to hit upon any
out-of-order values, and such values do in fact exist, then you're exactly
right: the proposed algorithm could return inaccurate results, as you have
illustrated. All I was trying to say is that the binary search could detect
out-of-order values among the values that it actually processes.

However, please note that this is *NOT* any worse than the existing
behavior!!! Consider what would happen with the current behavior if the
following (unsorted) list of key values is evaluated based on WHERE x=4
with SORTEDBY=x metadata in place:

x
-
1
2
3
5
4
6

In this scenario, the *existing* where clause optimization would also *FAIL*
to find the observation with x=4, and would *NOT* report any sort order
errors. So further optimization using a binary search to locate the first
matching value for the WHERE clause would NOT be any worse in terms of
potential for failure than the existing algorithm. Both the existing
algorithm and the proposed algorithm can only report errors on the
observations that they actually visit, not on others which are not
processed.
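The kind of on-the-fly check being described can be sketched as follows.
This is Python used purely for illustration, not SAS internals; the function
name, the bound-tracking scheme, and the return convention are a model of
the idea, nothing more. Each probed key must fall between the bounds that
earlier probes have already established; a violation proves the SORTEDBY
claim false, but a clean run proves nothing about keys that were never read.

```python
def binary_search_checked(keys, target):
    """Binary search that verifies ordering only among the keys it probes.

    Returns (index, clean): index of target (or None if not found) and
    whether every probed key stayed within the bounds established by
    earlier probes. clean=False means the 'sorted' claim is provably
    wrong; clean=True says nothing about keys that were never read.
    """
    lo, hi = 0, len(keys) - 1
    low_bound, high_bound = float("-inf"), float("inf")
    clean = True
    while lo <= hi:
        mid = (lo + hi) // 2
        k = keys[mid]
        if not (low_bound <= k <= high_bound):
            clean = False            # probed key contradicts the sort order
        if target < k:
            high_bound = min(high_bound, k)   # everything left must be <= k
            hi = mid - 1
        elif target > k:
            low_bound = max(low_bound, k)     # everything right must be >= k
            lo = mid + 1
        else:
            return mid, clean
    return None, clean
```

On the swapped array from the quoted example, (1 3 6 5 4 7 9) searched for
4, this particular arrangement happens to be caught, because the probe
sequence reads 5, then 3, then 6, and 6 exceeds the bound of 5 set by the
first probe. But an arrangement whose out-of-order keys are never probed,
say (1 3 5 6 2 7 9) searched for 2, still returns "not found" with a clean
bill of health, which is exactly the limitation conceded above.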
But then, the whole point of optimization is to process as few observations
as possible.

s/KAM

----- Original Message -----
From: "Paul Dorfman"
To: ; "Kevin A. Myers"
Sent: Friday, July 09, 2010 18:13
Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
function (was: Pass date values))

> Kevin,
>
> Frankly, your assertion that "Actually it's quite easy to verify
> sorted-ness during the binary search" has got me greatly puzzled. I admit
> that I cannot conceive how this is possible, let alone being quite easy.
> What I do find quite easy is to arrange search keys in such a way that the
> binary search proceeds without a glitch, hits the "right" keys, gives no
> token that some other keys are unsorted, and of course results in a wrong
> answer. Here is a simple example:
>
> data _null_ ;
>   array a [7] _temporary_ (1 3 4 5 6 7 9) ;
>   key = 4 ;
>   l = lbound (a) ;
>   h = hbound (a) ;
>   f = . ;
>   do until (l > h | f) ;
>     m = floor ((h + l) * .5) ;
>     putlog l= m= h= ;
>     if key < a[m] then h = m - 1 ;
>     else if key > a[m] then l = m + 1 ;
>     else f = 1 ;
>   end ;
>   putlog f= m= ;
> run ;
>
> This is a typical binary search locating the index M of the key if it is
> found, and leaving F=. if not. The first putlog tracks the end- and
> mid-points as the algorithm iterates. As expected, the step prints:
>
> l=1 m=4 h=7
> l=1 m=2 h=3
> l=3 m=3 h=3
> f=1 m=3
>
> and as expected, F=1 and a[3]=4. Now let us change the array to
>
> array a [7] _temporary_ (1 3 6 5 4 7 9) ;
>
> i.e. swap 4 and 6, and run the step again. Now it prints:
>
> l=1 m=4 h=7
> l=1 m=2 h=3
> l=3 m=3 h=3
> f=. m=3
>
> F=. indicates that key=4 is not found, which is false - it is in the
> array. Otherwise, there is not the slightest indication that anything went
> wrong, as the algorithm, converging, hits exactly the same keys.
> Of course, it is clear why this occurs: the entire upper part of the
> array, starting with m=4, is rejected during the very first iteration, and
> the rest of the algorithm operates only on the perfectly sorted keys from
> m=1 to m=3.
>
> By induction, it is always possible to concoct a disordered arrangement of
> keys such that the binary search algorithm hits only keys which, in
> relation to each other, are perfectly sorted, while the keys in between
> are not. Hence the scheme will proceed happily down the list yielding a
> wrong result, while the keys it hits in the process will give no clue how
> the other keys in the array might be arranged.
>
> Now, the only means "to verify sorted-ness during the binary search" are
> the keys the binary search inspects while it converges. If all such keys
> are ordered relative to each other, how can we make any conclusion about
> the "sorted-ness" of the rest of the keys?
>
> You also write: "This approach does not *guarantee* that the data is
> actually 100% sorted, but is adequate in my opinion." I cannot see how one
> can be sure of any result applying the binary search to data that might be
> only, say, 99% sorted. The above example shows that swapping only 2 keys
> in an otherwise completely sorted array is enough to knock things out of
> whack.
>
> I would agree if you said that in totally random data, the probability
> that the keys the binary search hits happen to be in relative order is
> negligibly small for a large data collection. For example, if we have 1
> million (unique) keys, the search visits only about ceil(log2(1e6))=20 of
> them, and a sorted arrangement of those 20 keys is only one among the 20!
> possible arrangements. So, in this case the binary search could first be
> used to discover that the data are NOT in order and act accordingly. But
> in that case, why bother with the binary search? Just pick a random sample
> and see if its keys are in order.
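The data-step experiment above can be transcribed outside SAS to show that
the two runs probe exactly the same indices. Here is a Python transcription
of the same 1-based binary search (the function name and the returned probe
list are conveniences added for the comparison, not part of the original
step):

```python
def binary_search_trace(a, key):
    """1-based binary search mirroring the data step above.

    Returns (found index or None, list of probed 1-based indices),
    so the probe sequences of the two runs can be compared directly.
    """
    lo, hi = 1, len(a)
    probes = []
    while lo <= hi:
        mid = (lo + hi) // 2          # same midpoint as floor((h+l)*.5)
        probes.append(mid)
        if key < a[mid - 1]:
            hi = mid - 1
        elif key > a[mid - 1]:
            lo = mid + 1
        else:
            return mid, probes
    return None, probes

print(binary_search_trace([1, 3, 4, 5, 6, 7, 9], 4))  # (3, [4, 2, 3])
print(binary_search_trace([1, 3, 6, 5, 4, 7, 9], 4))  # (None, [4, 2, 3])
```

Both runs visit indices 4, 2, 3, exactly as the putlog traces show; only
the verdict differs, and nothing in the probe path of the plain algorithm
betrays that the second array is unsorted.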
> A big problem with such a "negative" test is that, since we know nothing
> about the data distribution a priori, the odds of getting unsorted data if
> we fail to check the entire set for "sorted-ness" are unacceptably high,
> especially in the real world, where "almost" sorted data are encountered
> much more often than well-randomized data.
>
> Personally, I would never apply any technique based on the assumption that
> the data are sorted without a 100 per cent guarantee that the assumed
> order is there. Either I would have a complete idea of, and full
> confidence in, the data preparation process (proc sort/sql/summary/freq, a
> sorted hash, a system sort utility, or ORDER BY in an RDBMS before the
> data got to SAS), or I would do a complete serial pass to ensure the
> order.
>
> I realize that whoever uses SORTEDBY as an explicit option bears
> responsibility, but it is also reasonable to expect that if SAS relies on
> data being specifically shaped before any of its internal algorithms can
> be used, it would at least give a warning upon finding out that the data
> do not conform to the shape.
>
> In fact, it is a long-standing SAS tradition. Any SAS process using BY KEY
> will stop with an error once it is discovered that KEY is not in order.
> Would we expect, by the same token, that if SAS used the binary search
> every time SORTEDBY was set, it would check whether the algorithm can be
> applied correctly? For one, I would. And I would certainly find it
> impossible to fancy that, upon seeing SORTEDBY set, SAS would use the
> binary search without any regard to whether the algorithm's prerequisites
> are met and return a wrong result without so much as a warning.
>
> In the case of BY, it is possible to verify that the keys are in order
> during the process itself, because any algorithm using BY (say, MERGE) is
> sequential.
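The sequential check referred to here amounts to a single O(n) pass over the
keys. A minimal Python sketch, assuming the keys fit in memory (the function
name is invented for illustration):

```python
def is_sorted(keys):
    """One serial pass, analogous to how BY-group processing detects
    'data not sorted': every key must be >= its predecessor."""
    return all(a <= b for a, b in zip(keys, keys[1:]))
```

Unlike the binary search, this check catches any out-of-order pair,
including arrangements the binary search would sail past, but it must read
every observation, which is precisely the cost the binary search is meant
to avoid.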
> However, as we have seen above, the binary search, running against
> possibly unsorted data, by its own nature provides no means of detecting
> *in the process* that some keys are out of order - while ALL keys being
> completely in order is the fundamental assumption without which the
> correctness of the binary search cannot be guaranteed. Therefore,
> something else would have to be used to verify that the input data are in
> order, and that can be done only via at least one sequential pass - which
> completes the vicious circle.
>
> It does not mean I am against the "substantially beneficial performance
> enhancement" - au contraire! I am all for it, but before SAS implements
> it, they should first provide some internal means to either (a)
> auto-distinguish a SORTEDBY set by a SAS procedure from one set via the
> data set option, and use the binary search only in the former case, or (b)
> create an IKWIAD option ("I know what I am doing") to allow the user to
> accept full responsibility - and perhaps to generate a warning when the
> option is set to YES.
>
> Kind regards
> ------------
> Paul Dorfman
> Jax, FL
> ------------
>
> On Fri, 9 Jul 2010 13:01:59 -0500, Kevin Myers wrote:
>
>>Paul -
>>
>>Actually it's quite easy to verify sorted-ness during the binary search,
>>which also prevents any potential infinite looping. This approach does not
>>*guarantee* that the data is actually 100% sorted, but is adequate in my
>>opinion. There are other areas where SAS employs a similar level of
>>"trust" in the SORTEDBY metadata (as exhibited by your SORT example),
>>though I believe there may be some options for controlling this behavior.
>>
>>Finally, the SORTEDBY metadata is always correct if SAS software put it
>>there automatically. The only case where it's wrong is if the user forces
>>the SORTEDBY attribute onto a dataset that isn't actually sorted (as in
>>your example).
>>And really, if they did that, then that's their problem, and that
>>shouldn't hold the rest of us back from a substantially beneficial
>>performance enhancement.
>>
>>s/KAM
>>
>>----- Original Message -----
>>From: "Paul Dorfman"
>>To: ; "Kevin A. Myers"
>>Sent: Friday, July 09, 2010 12:36
>>Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
>>function (was: Pass date values))
>>
>>> On Thu, 8 Jul 2010 12:23:10 -0500, Kevin Myers wrote:
>>>
>>>>In the case of a sorted data set, I would *hope* that SAS would search
>>>>for the first matching record by using a *binary*, rather than
>>>>sequential, search. If that isn't the case, then someone needs to
>>>>suggest using a binary search to locate the first matching observation
>>>>as a performance enhancement.
>>>
>>> Kevin,
>>>
>>> One problem with this is that the only indication that a data set
>>> *might* be sorted is SORTEDBY=, but its being set in no way guarantees
>>> that the file is actually physically ordered by the key(s) listed:
>>>
>>> data test (sortedby = key) ;
>>>   do key = 9, 1, 3, 2, 8, 5 ;
>>>     output ;
>>>   end ;
>>> run ;
>>>
>>> Look at the table TEST's properties, and the metadata plainly tells you
>>> the data set is sorted by KEY. (It reminds me of the old Bob Virgile
>>> puzzle, where a file similar to the above was explicitly sorted, and the
>>> ensuing step whose BY statement relied on the sort order failed because
>>> SORTEDBY= fooled SORT, used without the FORCE option, into doing
>>> nothing:
>>>
>>> 8   proc sort data = test ;
>>> 9     by key ;
>>> 10  run ;
>>> NOTE: Input data set is already sorted, no sorting done.)
>>>
>>> So if SAS were to infer from SORTEDBY= that the file is indeed sorted,
>>> and therefore "binary search" it, the search algorithm would either (a)
>>> enter an infinite loop, in which case you have at least some indication
>>> of the trouble, albeit at an ugly expense, or (b) finish perfectly well
>>> - in which case it would produce an erroneous result, giving no token
>>> whatsoever that something wrong has happened, which is much more
>>> dangerous.
>>>
>>> Of course, the underlying software can easily check beforehand whether
>>> the file is in fact physically in order, but this would be tantamount to
>>> a whole pass through the file, thereby defeating the reason for using
>>> the binary search in the first place.
>>>
>>> The only way out of the SORTEDBY= trap would be for SAS to create a
>>> system option banning SORTEDBY from being set by anything but a
>>> successfully executed SORT or ORDER BY. Then, if SORTEDBY= were set, the
>>> binary search against the keys listed under it could be used securely.
>>>
>>> Kind regards
>>> ------------
>>> Paul Dorfman
>>> Jax, FL
>>> ------------
