Date: Fri, 17 May 2002 17:07:35 -0400
Reply-To: "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Subject: Re: determining variable type
Content-Type: text/plain; charset=iso-8859-1
> From: Quentin McMullen [mailto:QuentinMcMullen@WESTAT.COM]
> > From: Ace [mailto:b.rogers@VIRGIN.NET]
>
> I disagree here. My understanding is that the PDV is created
> during compile time. And in order to create the PDV, SAS must know each
> variable's name, type, length, etc. So when you code x=1 during compile
> time SAS decides x is numeric because you are assigning it a numeric value.
> When you code y="1" SAS decides during compile time it is character of
> length 1 because you are assigning a character value of length 1. So during
> compile time whenever the first reference to a variable is made
> (regardless of whether that is an assignment, a comparison, or on a set
> statement), SAS decides its name, type, and length.
Quentin,
All of this is generally correct, but there are a number of subtler points.
First, the oddity you have observed is an oddity indeed. A simple test
shows:
339 data _null_ ;
340 if upcase(v) = '1' then ;
341 if v = 'A' then ;
342 run;
NOTE: Numeric values have been converted to character values at the places
given by:
(Line):(Column).
340:14
NOTE: Character values have been converted to numeric values at the places
given by:
(Line):(Column).
341:19
NOTE: Variable v is uninitialized.
NOTE: Invalid numeric data, 'A' , at line 341 column 19.
v=. _ERROR_=1 _N_=1
So, as you say, contrary to expectations, V is created as a numeric variable.
Whether it is a design flaw or a sober design decision, this is the way SAS
works at the moment: a variable first referenced as a function argument is
stored in the symbol table as numeric regardless of the function type. At
line 340, the numeric V is therefore implicitly converted to character to
feed upcase(), as the first NOTE shows, and the comparison with '1' goes
through. At line 341, V is compared directly with a character literal, so
SAS tries to convert the literal to numeric; since 'A' is not a digit, the
conversion fails. Removing upcase() of course removes the problem.
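If upcase() has to stay, one way around the surprise is to declare V as character before its first reference, so that the function argument no longer decides the type. This is only a minimal sketch of the same test step with a LENGTH statement added; it is not taken from the log above:

```sas
data _null_ ;
   length v $ 1 ;              /* V enters the symbol table as character */
   if upcase(v) = '1' then ;   /* character-to-character comparison */
   if v = 'A' then ;           /* no conversion of 'A' is attempted */
run ;
```

The "uninitialized" note should still appear, since V is never assigned a value, but the two conversion notes and the invalid data message should not.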
Secondly, the compiler reacts differently to type conflicts depending on
their "severity":
1) Expression conflict. An expression attempts to change the variable type.
The compiler ignores it. An attempt is made to resolve the conflict at
run time by an implicit conversion. If that is impossible, as above, it
results in a run-time error.
2) Declarative conflict. A declaration attempts to change the variable type
already set. In this context, by declaration I mean LENGTH, ATTRIB with
length=, RETAIN. Compile fails, the exact message depending on the
statement. With LENGTH or ATTRIB, we have
610 data _null_ ;
611 if v = 1 then ;
612 length v $1. ;
ERROR: Data type conflict for variable v.
613 run;
Note that if we swap the lines 611 and 612, it becomes an expression
conflict, and all we get is a run-time conversion. For RETAIN, SAS has
another compiler message in store:
628 data _null_ ;
629 if v = 1 then ;
630 retain v '1' ;
ERROR: '1' and v are incompatible for retain.
631 run;
3) Array conflict. An attempt is made to incorporate variables of different
types in one array. Compile fails.
668 data _null_ ;
669 v = 1 ;
670 q = '2' ;
671 array a(*) $2. v q ;
ERROR: All variables in array list must be the same type, i.e., all numeric
or character.
672 run;
Note that although ARRAY is a declarative statement, I have segregated it
into a separate category because (if its elements are not type-mixed), it is
designed to accommodate the type of variable(s) it is instructed to
incorporate. Even if the array type is explicitly specified, it quietly
disregards it and defers to the type of the variables already declared:
663 data _null_ ;
664 v = 1 ;
665 q = 2 ;
666 array a(*) $2. v q ;
667 run;
4) Descriptor conflict. The type of a variable that the compiler reads from
the descriptor of an input data set named in the step has already been set
to the opposite type. The behaviour exactly mirrors that of the declarative
conflict, yet the message is different again:
703 data a ;
704 retain v 1 ;
705 run ;
706 data _null_ ;
707 if v = '1' ;
708 stop ;
709 set a ;
ERROR: Variable v has been defined as both character and numeric.
710 run;
If we could regard file-reading statements referring to an input data set as
declarations, this category, by the type of behaviour, could be coalesced
with the declarative conflicts. But I decided to put it separately because
(a) the message is different and (b) the data set name can be implied (i.e.
not declared), the consequences being the same. That is, if in the step
above I had coded just SET instead of SET A, A would have been implied by
default without any declaration - and with the same result.
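Just as with LENGTH in the declarative case, moving SET ahead of the comparison should turn this into a mere expression conflict, because the descriptor is then read before the first reference to V. A sketch under that assumption, rebuilding data set A as in the step above; the PUT statement is added purely for illustration:

```sas
data a ;
   retain v 1 ;
run ;

data _null_ ;
   set a ;         /* descriptor is read first at compile time: V is numeric */
   if v = '1' ;    /* '1' is implicitly converted to numeric at run time */
   put v= ;        /* added for illustration only */
   stop ;
run ;
```

Instead of the compile error, the log should show only the character-to-numeric conversion note for the literal.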
5) Format conflict. An attempt is made to associate a format (directly or
through ATTRIB) of a certain type with a variable whose type has already
been set to the opposite. This one behaves much like an expression
conflict. The compiler ignores it. At run time, an attempt is made to find a
same-named format of the opposite type. If it does not exist, the step
abends:
888 data _null_ ;
889 v = '1' ;
890 stop ;
891 format v z1. ;
---
48
ERROR 48-59: The format $Z was not found or could not be loaded.
892 run;
If it does exist, the behaviour bifurcates. If the variable is numeric, we
get a warning, and the format is converted to its numeric counterpart (and
thus stored, if an output data set is named or implied), and the latter is
used for printing:
914 data b ;
WARNING: Variable v has already been defined as numeric.
915 v = 1 ;
916 format v $1. ;
917 put v = ;
918 run;
v=1
NOTE: The data set WORK.B has 1 observations and 1 variables.
If the variable is character to begin with, no warning is issued, and the
format is stored as declared (i.e. as numeric), yet the value is printed
using its character counterpart:
924 data b ;
925 v = 'A' ;
926 format v 1. ;
927 put v = ;
928 run;
v=A
NOTE: The data set WORK.B has 1 observations and 1 variables.
> > In general, however, I'd always recommend that manual read-loops are
> > handled with a single read before the loop and a do while(not
> > end_of_file) type logic. This was how I was taught structured
> > programming 20-odd years back and has served me well since.
> >
> > To implement this in SAS requires a linked block, as two
> > separate SET statements would open the input dataset twice, which is not
> > what's required here. So something like the following would do the trick:
> >
> > data b;
> > link readit;
> > do while (^eof & upcase(var1)^='A');
> > link readit;
> > end;
> > return;
> > readit:
> > set a end=eof ;
> > return;
> > run;
> >
> I guess if you are using do-while with a set statement inside
> this is a way to avoid an infinite loop. But with the do-until structure
> you don't have to do this, since SAS will stop once it tries to read past
> the end of the file. In some sense, this is more "natural" SAS processing
> (e.g. you avoid "SAS stopped due to looping" notes, etc.).
Exactly, but this 'prime read' mantra has a long beard. It was born in times
when certain tongues, foremost COBOL, were syntax-deficient and only
allowed a repetitive structure of the DO WHILE type, i.e. with the test at
the top. Ironically (or rather moronically), the COBOL loop
perform until <condition>
<...body of the loop ...>
end-perform
actually tests the <not condition> at the top of the loop, i.e. it is similar
to DO WHILE. Furthermore, in COBOL (and PL/I), end-of-file is set when a
read is attempted against an empty buffer. So, if a read instruction is
placed at the top of the loop's body (where it logically belongs), and the
file processing follows, the stuff processed after the last read will be
garbage from the empty buffer. Thus it became customary to code
read file at end set eof to true
perform until eof
<... do stuff ...>
read file at end set eof to true
end-perform
Recalling that above, UNTIL EOF actually means WHILE NOT EOF, it is clear
that when the file is empty, the first read sets end-of-file, and the loop
does not iterate even once. Otherwise, once the read following the last
record has set end-of-file, control is passed to the top of the loop, and it
quits. In COBOL II, IBM could not fix the until-but-really-while idiocy
because of tons of legacy code written this way, so they came up with the
TEST AFTER clause, as opposed to TEST BEFORE, which remained the default.
Now, the "prime read" has become unnecessary, because one could code
perform with test after until eof
read file at end set eof to true
not at end
< ... do stuff ...>
end-perform
Garbage processing is thus bypassed using "not at end". Since the "prime
read" had practically become a shop-standard religion, it is easy to imagine
that when some, usually younger, folks started coding with test after, it
sparked a real shop-standard war, exacerbated by the notorious intolerance
of the general COBOL population towards other people's coding styles. The
decline of COBOL's market share, especially in decision support systems, made
frictions of this sort much less intense, but even now the old animosity is
from time to time rekindled in the COBOL discussion group as a rather
menacing, 100-post hate-mail thread.
Now, should we not be glad that the folks from SAS were smart enough to kill
this garbage (literally and figuratively) from the onset? End-of-file is set
as soon as SAS detects that the buffer is empty internally, without any need
to probe it by a reading attempt, and prevents any further reading from an
empty buffer. It means that end-of-file is set either if the file is empty
to begin with, or as soon as the last record has left the buffer and been
moved to memory. This way, we have to worry about neither the "prime read",
nor "placing the read at the bottom", nor the "not at end" nonsense. If it is
necessary to code an explicit file-reading loop, we simply write:
data b ;
do until (eof) ;
set a end = eof ;
< ... do stuff ... > ;
end ;
run ;
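To make the pattern concrete, here is a small sketch with the placeholder filled in. The input data set, the variable X and the accumulator TOTAL are made up for illustration; only the loop structure comes from the step above:

```sas
data a ;                  /* hypothetical input: x = 1, 2, ..., 5 */
   do x = 1 to 5 ;
      output ;
   end ;
run ;

data b (keep = total) ;
   do until (eof) ;
      set a end = eof ;   /* EOF turns on as the last record is read */
      total + x ;         /* "do stuff": accumulate a running sum */
   end ;
   put total= ;           /* reached once, after the last record */
run ;
```

Every record, including the last one, passes through the body of the loop, so the log should show total=15 once. When control returns to the top of the implied loop, SET tries to read past the end of A and the step stops quietly, with no looping note.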
Because of the difference in the end of file timing, the "prime read" is
productive in COBOL, but counterproductive in SAS. Coding
data b ;
do while (not eof);
set a end = eof ;
< ... do stuff ... > ;
output ;
end ;
run ;
only results in the 'stopped due to looping' note, because when control is
passed to the top of the loop, it is passed to the top of the implied loop as
well, whereas the buffer is already empty. Trying to rectify it, as Bruce
suggests, by coding
data b ;
link read ;
do while (not eof);
link read ;
< ... do stuff ... > ;
end ;
return ;
read: set a end = eof ;
return ;
run ;
will obviously result in the failure to <...do stuff...> with the first
input record, since the loop reads the second record before doing anything
with the one fetched by the prime read. Of course, we can try fixing it, too,
by placing another <...do stuff...> after the prime read, but what is the
advantage of rummaging around like this to begin with? I do not really
understand what it has to do with structured programming; my guess would be
that the "prime read" concept was a large and prominent part of a course on
"structured programming". If it was PL/I, it really is a structured language,
but the "prime read" has nothing to do with its structure, stemming just from
the end-of-file timing. In COBOL, the reason for the "prime read" is the
same, except that COBOL is a typical unstructured language - which is why, by
the way, so much time has been spent teaching "structured COBOL programming"
courses :-).
Kind regards,
=====================
Paul M. Dorfman
Jacksonville, FL
=====================