|
Greetings to all and sundry (and Al and Sundhri)!
I'm trying to subset data as efficiently as possible. Any comments,
experience, etc. will be welcome.
Suppose I have a complicated subsetting expression for a DATA step, say:
WHERE type in (3,4,5)
AND
(
bankname in:('CHASE','PROVID','RIGGS')
or
<a whole bunch of other criteria>
);
Furthermore, suppose that the first criterion (the type code) will get
rid of, say, 80% of the records, and some proportion of the remaining
20% will be filtered out by the other criteria. Those other criteria
are likely to require a good deal of processing time per observation.
Is there a way to make further evaluation contingent on passing previous
filters?
Some possibilities:
--This already happens, because WHERE-expression evaluation includes
short-circuiting.
--Multiple WHERE clauses are applied in order, not all at once (I doubt
this):
WHERE type in (3,4,5);
WHERE SAME and <all the other stuff>;
--A view:
DATA mytype/VIEW=mytype;
set master.enormous;
WHERE type in (3,4,5);
run;
DATA mysubset;
set mytype;
WHERE <all the other stuff>;
I've used the view approach, and certainly haven't been slowed down by
it, but other resource constraints prohibit my re-running those jobs for
benchmarking.
[On a related topic: SAS documentation, both V8 (CD) and V9.1 (online),
is curiously mum about WHERE ALSO, which is a synonym for WHERE SAME
AND. It shows up in a few places, such as PROC REPORT and VIEWTABLE,
but it's valid in DATA and PROC steps, and can come in very handy if
one's a) writing a macro and b) lazy. WHERE SAME AND is touted as being
useful for PROCs with internal RUN boundaries, such as REG and GPLOT.]
Any ideas?
Roger
|