Date:         Fri, 17 May 2002 17:07:35 -0400
Reply-To:     "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Subject:      Re: determining variable type
Comments: To: Quentin McMullen <QuentinMcMullen@WESTAT.COM>

> From: Quentin McMullen [mailto:QuentinMcMullen@WESTAT.COM]
>
> > From: Ace [mailto:b.rogers@VIRGIN.NET]
>
> I disagree here. My understanding is that the PDV is created during
> compile time. And in order to create the PDV, SAS must know each
> variable's name, type, length, etc. So when you code x=1 during compile
> time SAS decides x is numeric because you are assigning it a numeric
> value. When you code y="1" SAS decides during compile time it is
> character of length 1 because you are assigning a character value of
> length 1. So during compile time whenever the first reference to a
> variable is made (regardless of whether that is an assignment, a
> comparison, or on a set statement), SAS decides its name, type, and
> length.

Quentin,

All of this is generally correct, but there are a number of subtler points.

First, the oddity you have observed is an oddity indeed. A simple test shows:

339  data _null_ ;
340     if upcase(v) = '1' then ;
341     if v = 'A' then ;
342  run;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
      340:14
NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column).
      341:19
NOTE: Variable v is uninitialized.
NOTE: Invalid numeric data, 'A' , at line 341 column 19.
v=. _ERROR_=1 _N_=1

So, as you say, contrary to expectations, V is created as a numeric variable. Whether it is a design flaw or a sober design decision, this is the way SAS works at the moment: a variable first referenced as a function argument is stored in the symbol table as numeric regardless of the function type. At line 340, the numeric V is therefore implicitly converted to character for upcase(), as the first note shows, and the comparison with '1' goes through. At line 341, V is compared with 'A' directly, so SAS tries to convert the literal to numeric, and since 'A' is not a digit, the conversion fails. Removing upcase() of course removes the problem.
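Another way to sidestep the surprise, sketched here on the assumption that V is meant to be character, is to declare it before the first reference. The length of 1 is chosen only for illustration:

```sas
data _null_ ;
   length v $1 ;               /* assumption: V is intended to be character */
   if upcase(v) = '1' then ;   /* V is already character, so no conversion is needed */
   if v = 'A' then ;           /* character-to-character comparison */
run ;
```

With the declaration in place, the two conversion notes and the invalid-data note should no longer appear; the "uninitialized" note remains, since V is still never assigned a value.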

Secondly, the compiler reacts differently to type conflicts depending on their "severity":

1) Expression conflict. An expression attempts to change the variable type. The compiler ignores it. An attempt is made to resolve the conflict at run time by an implicit conversion. If that is impossible, as above, it results in a run-time error.

2) Declarative conflict. A declaration attempts to change the variable type already set. In this context, by declaration I mean LENGTH, ATTRIB with length=, RETAIN. Compile fails, the exact message depending on the statement. With LENGTH or ATTRIB, we have

610  data _null_ ;
611     if v = 1 then ;
612     length v $1. ;
ERROR: Data type conflict for variable v.
613  run;

Note that if we swap the lines 611 and 612, it becomes an expression conflict, and all we get is a run-time conversion. For RETAIN, SAS has another compiler message in store:

628  data _null_ ;
629     if v = 1 then ;
630     retain v '1' ;
ERROR: '1' and v are incompatible for retain.
631  run;
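Returning to the LENGTH case for a moment, the swapped order mentioned above can be sketched as follows. This is only an illustration of the point already made, not a new log excerpt:

```sas
data _null_ ;
   length v $1 ;       /* the declaration comes first, so V is character */
   if v = 1 then ;     /* expression conflict: left to an implicit run-time conversion */
run ;
```

Here the step should compile, with the type mismatch handled by a run-time conversion instead of a compile-time error.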

3) Array conflict. An attempt is made to incorporate variables of different types in one array. Compile fails.

668  data _null_ ;
669     v = 1 ;
670     q = '2' ;
671     array a(*) $2. v q ;
ERROR: All variables in array list must be the same type, i.e., all numeric or character.
672  run;

Note that although ARRAY is a declarative statement, I have segregated it into a separate category because (if its elements are not type-mixed), it is designed to accommodate the type of variable(s) it is instructed to incorporate. Even if the array type is explicitly specified, it quietly disregards it and defers to the type of the variables already declared:

663  data _null_ ;
664     v = 1 ;
665     q = 2 ;
666     array a(*) $2. v q ;
667  run;
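A quick way to check which type the array ended up with is sketched below. It assumes the VTYPE function, which is not used in the original step, and the variable T is introduced purely for illustration:

```sas
data _null_ ;
   v = 1 ;
   q = 2 ;
   array a(*) $2. v q ;   /* array declared as character over numeric variables */
   t = vtype(a(1)) ;      /* VTYPE returns N for numeric, C for character */
   put t = ;
run ;
```

If the array has quietly deferred to the type of V and Q as described, T should come out as N.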

4) Descriptor conflict. The compiler reads a variable's type from the descriptor of an input data set named in the step, but the variable has already been set to the opposite type. The behaviour exactly mirrors that of the declarative conflict, yet the message is different again:

703  data a ;
704     retain v 1 ;
705  run ;
706  data _null_ ;
707     if v = '1' ;
708     stop ;
709     set a ;
ERROR: Variable v has been defined as both character and numeric.
710  run;

If we could regard file reading statements referring to input data set as declarations, this category, by the type of behaviour, could be coalesced with the declarative conflicts. But I decided to put it separately because (a) the message is different (b) the data set name can be implied (i.e. not declared), the consequences being the same. That is, if in the step above, I had coded just SET instead of SET A, A would have been implied by default without any declaration - and with the same result.

5) Format conflict. An attempt is made to associate a format (directly or through ATTRIB) of a certain type with a variable whose type has already been set to the opposite. This one behaves much like an expression conflict: the compiler ignores it. At run time, an attempt is made to find a same-named format of the opposite type. If it does not exist, the step abends:

888  data _null_ ;
889     v = '1' ;
890     stop ;
891     format v z1. ;
                 ---
                 48
ERROR 48-59: The format $Z was not found or could not be loaded.
892  run;

If it does exist, the behaviour bifurcates. If the variable is numeric, we get a warning, and the format is converted to its numeric counterpart (and thus stored, if an output data set is named or implied), and the latter is used for printing:

914  data b ;
WARNING: Variable v has already been defined as numeric.
915     v = 1 ;
916     format v $1. ;
917     put v = ;
918  run;
v=1
NOTE: The data set WORK.B has 1 observations and 1 variables.

If the variable is character to begin with, no warning is issued, and the format is stored as declared (i.e. as numeric), yet the value is printed using its character counterpart:

924  data b ;
925     v = 'A' ;
926     format v 1. ;
927     put v = ;
928  run;
v=A
NOTE: The data set WORK.B has 1 observations and 1 variables.

> > In general, however, I'd always recommend that manual read-loops are
> > handled with a single read before the loop and a do while(not
> > end_of_file) type logic. This was how I was taught structured
> > programming 20-odd years back and has served me well since.
> >
> > To implement this in SAS requires a linked block, as two separate SET
> > statements would open the input dataset twice, which is not what's
> > required here. So something like the following would do the trick:
> >
> > data b;
> >   link readit;
> >   do while (^eof & upcase(var1)^='A');
> >     link readit;
> >   end;
> >   return;
> > readit:
> >   set a end=eof ;
> >   return;
> > run;
>
> I guess if you are using do-while with a set statement inside this is a
> way to avoid an infinite loop. But with the do-until structure you don't
> have to do this, since SAS will stop once it tries to read past the end
> of the file. In some sense, this is more "natural" SAS processing
> (e.g. you avoid "SAS stopped due to looping" notes, etc.).

Exactly, but this 'prime read' mantra has a long beard. It was born in times when certain tongues, foremost COBOL, were syntax-deficient and only allowed a repetitive structure of the DO WHILE type, i.e. with the test at the top. Ironically (or rather moronically), the COBOL loop

perform until <condition>
   <... body of the loop ...>
end-perform

actually tests <not condition> at the top of the loop, i.e. it is similar to DO WHILE. Furthermore, in COBOL (and PL/I), end-of-file is set when a read is attempted against an empty buffer. So, if a read instruction is placed at the top of the loop's body (where it logically belongs), and the file processing follows, the stuff processed after the last read will be garbage from the empty buffer. Thus it became customary to code

read file at end set eof to true
perform until eof
   <... do stuff ...>
   read file at end set eof to true
end-perform

Recalling that, above, UNTIL EOF actually means WHILE NOT EOF, it is clear that when the file is empty, the first read sets end-of-file, and the loop does not iterate even once. Otherwise, when the read attempted after the last record sets end-of-file, control is passed to the top of the loop, and it quits. In COBOL II, IBM could not fix the until-but-really-while idiocy because of tons of legacy code written this way, so they came up with the TEST AFTER clause, as opposed to TEST BEFORE, which remained the default. Now the "prime read" has become unnecessary, because one could code

perform with test after until eof
   read file
      at end set eof to true
      not at end
         <... do stuff ...>
end-perform

Garbage processing is thus bypassed using "not at end". Since the "prime read" had practically become a shop-standard religion, it is easy to imagine that when some, usually younger, folks started coding with test after, it sparked a real shop-standard war, exacerbated by the notorious intolerance of the general COBOL population towards other people's coding styles. The decline of COBOL's market share, especially in decision support systems, made frictions of this sort much less intense, but even now the old animosity is from time to time rekindled in the COBOL discussion group as a rather menacing, 100-post hate-mail thread.

Now should we not be glad that the folks from SAS were smart enough to kill this garbage (literally and figuratively) from the outset: end-of-file is set as soon as SAS detects internally that the buffer is empty, without any need to probe it by a reading attempt, and this prevents any further reading from an empty buffer. It means that end-of-file is set either if the file is empty to begin with, or as soon as the last record has left the buffer and been moved to memory. This way, we have to worry neither about the "prime read", nor about "placing the read on the bottom" or the "not at end" nonsense. If it is necessary to code an explicit file-reading loop, we simply write:

data b ;
   do until (eof) ;
      set a end = eof ;
      < ... do stuff ... > ;
   end ;
run ;
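To make the loop above concrete, here is a minimal runnable sketch. The three-observation data set A, its variable I, and the counter N are all made up for illustration; the counter simply stands in for "do stuff":

```sas
/* hypothetical input: three observations */
data a ;
   do i = 1 to 3 ;
      output ;
   end ;
run ;

data b ;
   do until (eof) ;
      set a end = eof ;   /* EOF is set as the last observation is read */
      n + 1 ;             /* placeholder for "do stuff": count the records */
      output ;
   end ;
   put 'records processed: ' n ;
run ;
```

Every observation, including the last one, goes through the body of the loop. When control returns to the top of the step, the next SET executes against an exhausted data set and the step stops on its own, with no prime read and no looping note.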

Because of the difference in the end of file timing, the "prime read" is productive in COBOL, but counterproductive in SAS. Coding

data b ;
   do while (not eof) ;
      set a end = eof ;
      < ... do stuff ... > ;
      output ;
   end ;
run ;

only results in the 'stopped due to looping' note, because when control is passed to the top of the loop, it is passed to the top of the implied loop as well, whereas the buffer is already empty. Trying to rectify it, as Bruce suggests, by coding

data b ;
   link read ;
   do while (not eof) ;
      link read ;
      < ... do stuff ... > ;
   end ;
   return ;
read:
   set a end = eof ;
   return ;
run ;

will obviously result in the failure to <...do stuff...> with the last input record. Of course, we can try fixing it, too, by placing another <...do stuff...> after the prime read, but what is the advantage of rummaging around like this to begin with? Not really understanding what it has to do with structured programming, my guess would be that the "prime read" concept was a large and prominent part of a course in "structured programming". If it is PL/I, it really is a structured language, but the "prime read" has nothing to do with its structure; it just stems from the end-of-file timing. In COBOL, the reason for the "prime read" is the same, except that COBOL is a typical unstructured language, which is why, by the way, so much time has been spent teaching "structured COBOL programming" courses :-).

Kind regards,
=====================
Paul M. Dorfman
Jacksonville, FL
=====================


