| Date: | Fri, 31 Aug 2001 16:41:32 -0400 |
| Reply-To: | Ian Whitlock <WHITLOI1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Ian Whitlock <WHITLOI1@WESTAT.COM> |
| Subject: | Re: Automatically RETAINed ?? |
|---|
Subject: Automatically RETAINed ??
Summary: When can variable values change in a DATA step?
Respondent: IanWhitlock@westat.com
Peetie Wheatstraw [peetie_wheatstraw@HOTMAIL.COM] asked a question
about retaining variables from a SAS data set. There then followed a
lively discussion. I did not join in earlier only because I was busy.
Moreover, I think the question was answered accurately by my colleague
Quentin McMullen, but I would like to rephrase the situation.
(Incidentally, Quentin gave an excellent talk on handling missing
values at the Westat SAS User's group meeting this morning, based on a
poster presentation he will give at NESUG next month.)
Variables from SAS data sets are RETAINed. Now what does retain mean?
SAS usually sets variables to missing at the top of each iteration of
the implied DATA step loop. A variable is retained if this usual
activity is not done. (Unfortunately much of the confusion shown in
this discussion can be traced, I think, to poor documentation about
RETAIN. The version 8 on-line documentation statement for RETAIN
>>>>
Causes a variable that is created by an INPUT or assignment statement
to retain its value from one iteration of the DATA step to the next
<<<<
is about as misleading as possible without being totally wrong. Note
that the difficulty is that RETAIN cannot be explained on its own
terms; it is only in understanding what standard DATA step processing
is about that one can appreciate the full meaning of RETAIN.)
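A minimal sketch of that standard processing, using hypothetical
variable names (PLAIN, KEPT) that are not part of the original thread:
both variables are assigned the same value, but only KEPT is named in a
RETAIN statement. PLAIN should print as missing at the top of every
iteration, while KEPT should show the value from the previous
iteration.

```sas
data _null_ ;
   retain kept ;                        * KEPT is not reset at the top ;
   put "At top: " _n_= plain= kept= ;   * look before anything is read ;
   input x ;
   plain = x ;                          * reset to missing next iteration ;
   kept  = x ;                          * survives into the next iteration ;
   cards ;
1
2
3
;
```

The PUT is placed before the INPUT so that the log shows the state left
by the implied loop itself, not by anything the step does afterwards.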
So how can values change for a variable from a SAS data set?
1) variables from a SAS data set are initialized once at the
   beginning of execution to missing.
2) if the variable comes from a SAS data set its value will
   change each time that data set is read.
3) if a user makes an assignment (or takes some other explicit action
   to change the value) the value will change.
4) if the variable is in a data set participating in by-processing
   the variable will be set to missing at the beginning of each
   by-group.
5) if the variable comes from a SAS data set participating in a SET
   statement the value will be set to missing every time the buffer
   is switched for that SET statement.
I think all of these points have been discussed and illustrated, but
not necessarily in one place. I therefore offer the following code,
based in part on Peetie's original example but extended in light of
the above conditions.
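* Two small data sets sharing the key ID ;
data a;
   input id xa ;
   cards ;
1 1
3 1
3 2
;

data b;
   input id xb ;
   cards ;
1 2
1 3
2 1
;

* A one-observation data set, read only once below ;
data c ;
   xc = 66 ;
run ;

data _null_ ;
   length id xa xb xc 8 ;               * fix variable order for PUT _ALL_ ;
   put "At top: " _all_ ;               * values before anything is read ;
   if _n_ = 1 then set c ;              * XC is read once, then retained ;
   set a ( in = a ) b ( in = b ) ;      * interleave A and B ;
   by id;
   put "After SET 1:" _all_;
   if _n_ = 2 then xa = 9 ;             * explicit assignment, rule 3 ;
run;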
Here is part of the log.
At top: id=. xa=. xb=. xc=. a=0 b=0 FIRST.id=1 LAST.id=1 _ERROR_=0 _N_=1
After SET 1:id=1 xa=1 xb=. xc=66 a=1 b=0 FIRST.id=1 LAST.id=0 _ERROR_=0
_N_=1
At top: id=1 xa=1 xb=. xc=66 a=1 b=0 FIRST.id=1 LAST.id=0 _ERROR_=0
_N_=2
After SET 1:id=1 xa=. xb=2 xc=66 a=0 b=1 FIRST.id=0 LAST.id=0 _ERROR_=0
_N_=2
At top: id=1 xa=9 xb=2 xc=66 a=0 b=1 FIRST.id=0 LAST.id=0 _ERROR_=0
_N_=3
After SET 1:id=1 xa=9 xb=3 xc=66 a=0 b=1 FIRST.id=0 LAST.id=1 _ERROR_=0
_N_=3
At top: id=1 xa=9 xb=3 xc=66 a=0 b=1 FIRST.id=0 LAST.id=1 _ERROR_=0
_N_=4
After SET 1:id=2 xa=. xb=1 xc=66 a=0 b=1 FIRST.id=1 LAST.id=1 _ERROR_=0
_N_=4
At top: id=2 xa=. xb=1 xc=66 a=0 b=1 FIRST.id=1 LAST.id=1 _ERROR_=0
_N_=5
After SET 1:id=3 xa=1 xb=. xc=66 a=1 b=0 FIRST.id=1 LAST.id=0 _ERROR_=0
_N_=5
At top: id=3 xa=1 xb=. xc=66 a=1 b=0 FIRST.id=1 LAST.id=0 _ERROR_=0
_N_=6
After SET 1:id=3 xa=2 xb=. xc=66 a=1 b=0 FIRST.id=0 LAST.id=1 _ERROR_=0
_N_=6
At top: id=3 xa=2 xb=. xc=66 a=1 b=0 FIRST.id=0 LAST.id=1 _ERROR_=0
_N_=7
I leave it to you to verify that every value is explained by one of
the rules mentioned. It would also be good to run the program without
the BY statement, since BY processing causes some of the confusion.
This does not prove the reasons are correct or that they are the only
reasons. However, it should help convince one of these facts. If
anyone can show a violation of any of these principles I would like to
see the example.
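A sketch of that variant, for anyone who wants to try it: the same
program with only the BY statement removed, so the SET statement
concatenates A and B instead of interleaving them. No log is shown here
because it was not part of the original post. Under rules 4 and 5 one
would expect XA and XB to go to missing only when the SET statement
switches from A to B, but that is something to confirm by running it.

```sas
data a;
   input id xa ;
   cards ;
1 1
3 1
3 2
;

data b;
   input id xb ;
   cards ;
1 2
1 3
2 1
;

data c ;
   xc = 66 ;
run ;

data _null_ ;
   length id xa xb xc 8 ;
   put "At top: " _all_ ;
   if _n_ = 1 then set c ;
   set a ( in = a ) b ( in = b ) ;   * no BY, so A is read first, then B ;
   put "After SET 1:" _all_;
   if _n_ = 2 then xa = 9 ;
run;
```

Comparing this log line by line with the one above should show which
changes belong to BY processing and which belong to the SET statement
alone.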
Note that XC stays 66 after the first reading in spite of conditions 4
and 5 because they do not apply. Note that XA had the value 9 assigned
during _N_ = 2 and stayed that way throughout _N_ = 3 because neither
condition 4 nor 5 applied.
IanWhitlock@westat.com