Date: Mon, 7 Jun 2004 15:55:19 -0400
Reply-To: "Lustig, Roger" <roger.lustig@CITIGROUP.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Lustig, Roger" <roger.lustig@CITIGROUP.COM>
Subject: Re: WHERE: how and why?
Content-Type: text/plain; charset="iso-8859-1"
Christian:
Thanks for your input! As you may have seen, Paul Dorfman and I have been playing around with this a little.
Your idea is interesting: that exactly one filter (the most restrictive) should go in the WHERE, and the rest in IFs. That's not too far from my own experience.
Now, there are other sorts of filter, notably those with ORs in them. At the level of a chain of ORs, the items should be ordered by *ascending* restrictiveness. They also have to be in one statement. Fortunately, WHERE statements seem to evaluate only as far as they need to (unlike IFs).
Tests I ran today suggest that even a very complicated WHERE expression can be more efficient than the WHERE/IF approach. I looked at 10 million records, each one 88 bytes wide. Most of the WHERE statement looked for various combinations of a numeric code and one or more substrings of a 40-character string. The top-level logical operator was an AND, so the expression could be separated into two sub-expressions. By optimizing the expression for exclusiveness at each level, I was able to read in the selected records in 77 seconds with a WHERE. WHERE/IF (the WHERE portion accepting a numeric that was IN a list of 12) took 113 seconds. Strikingly, a two-IFs solution took only 108.
Using a very un-optimized order for the logical expression, I was able to waste 183 seconds using a WHERE, 195 using the WHERE/IF. Here, two IFs did best: 171 seconds. Combining them into a single IF took 174.
All of which is to say that the factors to be considered are:
--optimization of WHERE expression
--separability of expression into blocks separated by AND
--greater cost of evaluating in the buffer (WHERE) instead of the PDV (IF)
--cost of loading data into the PDV (i.e., width of record)
--cost of evaluating the whole expression (IF) instead of just the necessary part (WHERE)
More when I've played around some more.
Roger
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU]On Behalf Of
Christian Lensbjerg
Sent: Monday, June 07, 2004 2:47 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: WHERE: how and why?
Hi
Without posting the proper documentation, every experiment I've conducted,
have shown that the best (or at least as good as the alternatives)
performing way to subset, when you
a) use the data step
&
b) don't use indexes
(I mostly subset in queries on day-to-day updated data, hence the cost of
indexing exceeds the gain)
is this:
data output;
set db.input(keep=relevant variables var1-var4);
where var1='value';
if var2='value' then if var3='value' then if var4='value' then output;
*and so on;
run;
Needless to say, the criterion should be sorted in descending order, by
observation-subsetability.
If You're going to sort the data afterwards, and don't intend to mess
further with the data during the datastep - I often find it worth doing it
all in one proc sort, with one long composed where-statement.
Best regards
Christian Lensbjerg