Date: Tue, 8 Nov 2005 22:31:58 -0500
Reply-To: Richard Ristow <wrristow@MINDSPRING.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Richard Ristow <wrristow@MINDSPRING.COM>
Subject: Re: An Alternative to LAG function
In-Reply-To: <rfednUK_X9eRyPjeRVn-qQ@rogers.com>
Content-Type: text/plain; charset="us-ascii"; format=flowed
At 06:19 PM 10/30/2005, Arthur Tabachneck wrote:
>Toby (our favorite AI bot) responded off-line with the following:
>
>>Warren Sarle got ahold of Paul Dorfman and me on this subject and
>>stated
>>roughly that since SI is redoing the underlying code of the data step
>>from
>>the ground up they would be more than happy to include new
>>functionality and
>>improve th eold (such as the lag function). however, he stated that
>>there
>>needed to be a whole lot of input from the SAS Guru's to help them
>>figure
>>out these functions should work, for example how should the lag
>>function
>>work when you have multiple data sets being merged together?
I received Warren's inquiry, and responded to it, as well; see below.
This is from a different point of view: addressing the meaning of
"previous" given the complex input logic allowed in the DATA step. I
explicitly compare with SPSS's logic.
>All we need is for somebody to explain EXACTLY what it ought to do
>when there are multiple input SAS data sets, multiple input relational
>database tables (Oracle, DB2, etc.), multiple output SAS data sets,
>and multiple output relational database tables.
>
>Seriously. Can anybody help us?
OK, here's a naive answer to a subtle question:
LAG(X) should return the value of X from the immediately preceding
record.
LAG2(X), or LAG(X,2) should return the value of X from the second
preceding record, i.e. the record immediately preceding the immediately
preceding record. (etc.)
Though naive, it focuses the question: find reasonable, unambiguous
meanings for "record", "preceding", and "immediately". Here's an answer
which I'm pretty sure are unambiguous, and which I'll argue is
reasonable. It does depend on a processing model which the DATA step
and the SPSS transformation program share:
The code of a SAS DATA step, or an SPSS transformation program, is the
interior of a loop. (I've helped experienced programmers who flummexed
because they weren't aware of that implicit loop.) Then, let "previous"
be previous pass through the implicit loop. Operationalized,
LAG(X) is the value that X had at the end of the previous loop pass.
or, more precisely in SAS terms,
LAG(X) is the value that X had just before the previous clearing of the
PDV.
Now, comments:
First, for however, much it matters, this LAG, like SPSS's, can only
take a variable as argument (SAS's present one takes an arbitrary
expression, I believe.)
Second, a classic, perhaps 'basic', use of the DATA step begins
DATA ...;
SET ...;
(or 'MERGE' or 'INPUT' in place of 'SET').
Here, the implicit loop has a very simple meaning: there's an 'engine'
that produces a sequential file, one record per pass, and the DATA step
implicit loop processes one record per pass. I'd say that in this case,
the LAG I propose means exactly what you'd expect; and is, among other
things, equivalent to SPSS's LAG. There's no confusion if the LAG is
inside a conditional or other construct, either. And in the code
DATA FOO;
SET BAR;
BY BLORT;
...
FIRST.BLORT=1
if and only if
BLORT NE LAG(BLORT);
which I think is one test of "reasonable" for LAG. (This is SAS
comparison, where "value NE <missing>" is 'true', rather than SPSS
comparison, where "value NE <missing>" is 'missing'.)
In SPSS, this settles the matter, since a transformation program can
only be used this way: it's begun by an 'engine' that produces a record
at a time (GET FILE, ADD FILES, MATCH FILES, DATA LIST, or an INPUT
PROGRAM).
In SAS, 'SET', 'MERGE', 'INPUT' are executables, and that raises a lot
of complications. I'm sure that's what you were thinking of, when
asking,
>All we need is [to know] what it ought to do when there are multiple
>input SAS data sets, multiple input relational database tables
>(Oracle, DB2, etc.), multiple output SAS data sets, and multiple
>output relational database tables.
I don't think "multiple output" raises so many questions. I've done a
lot with OUTPUT statements (they're one of my favorite SAS features),
both to write many files or records, and to coalesce many records into
one. But LAG, as I'm describing it, is an input construct only. At
least in the DATA step, with 'basic' input as I've described it, I
don't think multiple outputs would raise any difficulty. Nor would
multiple inputs, as now with MERGE or multiple data-set SETs, as long
as they resolved into an 'engine' for one record per call, and that
'engine' is called once per DATA step pass. That's why there's no
problem in SPSS: what I'm describing is the only input logic allowed.
Goodness knows, you can do a lot of other things in SAS already. You
can, for example, write your own "DATA step" loop within a single DATA
step pass:
DO UNTIL END_IT = 1;
SET ...
<whatever you like>;
OUTPUT;
END;
STOP;
And my "LAG" is useless; it returns '.' for any call.
With the "POINT" option, or anything similar, it's likely my LAG won't
be what you want. But with "POINT", you're explicitly throwing away the
notion of a "previous record". It could work well, though, as long as
you do use the implicit loop. Here's chain following, rather nicely:
NEXT_KEY is a variable in the input.
DATA .....
IF LAG(NEXT_KEY) = '.' GET_IT = 1.
SET ... POINT=GET_IT.
(Interesting to try tree traversal this way. It probably wouldn't work
well; tree traversal works best with a stack, or an array that
simulates one.)