Date: Wed, 4 Feb 2009 15:39:01 -0500
Reply-To: Paul St Louis <pstloui@DOT.STATE.TX.US>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul St Louis <pstloui@DOT.STATE.TX.US>
Subject: Re: OT: Chance to Make SAS-L History: Did You Know That...
A missing value can/will affect the accuracy of your computations?
Yesterday I posted 'Avoiding Division by zero Err Msg Generates "I"'. Mary
<mlhoward@avalon.net> responded with a link to a very good article by Robin
High....
http://www.uoregon.edu/~robinh/missing_data.txt
A must read for anyone who thinks they fully grasp the implications of
missing data. Although I already understood that the best way to handle
missing data is with the MISSING function (whether numeric, character, or
date), I thought I would list a few excerpts from Robin's paper. One of
Robin's suggestions is to use...
IF (MISSING(var) EQ 1)
or
IF (MISSING(var))
Otherwise, some computations with missing data will produce inaccurate
results.
DATA _null_;
x1=.;
x2=3;
x3=6;
x_sum1 = x1 + x2 + x3;
x_sum2 = SUM(x1, x2, x3);
PUT x_sum1 x_sum2;
RUN;
Log:
. 9
NOTE: Missing values were generated as a result of performing an operation
on missing values.
Each place is given by: (Number of times) at (Line):(Column).
1 at 363:13
x_sum1 comes out missing because the + operator propagates missing values,
but x_sum2 is correct because the SUM function ignores missing arguments.
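Here is a minimal sketch of how the MISSING function mentioned above could be
used to check values before computing with them. The variable names (name,
n_miss, x_sum) are my own and are not taken from Robin's paper; NMISS and SUM
are standard SAS functions.

```sas
DATA _null_;
   LENGTH name $ 8;
   x1 = .;        /* numeric missing */
   x2 = 3;
   x3 = 6;
   name = ' ';    /* character missing (blank) */

   /* MISSING() returns 1 for numeric and character missing values alike */
   IF (MISSING(x1))   THEN PUT 'x1 is missing';
   IF (MISSING(name)) THEN PUT 'name is missing';

   /* NMISS counts the missing numeric arguments; SUM skips them */
   n_miss = NMISS(x1, x2, x3);
   x_sum  = SUM(x1, x2, x3);
   PUT n_miss= x_sum=;    /* n_miss=1 x_sum=9 */
RUN;
```

Checking n_miss before trusting x_sum lets you decide whether a total built
from incomplete data should be kept or set aside.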
IF and WHERE statements are also affected. When IF or WHERE conditions are
evaluated, SAS treats missing values as if they were negative numbers with
extremely large magnitudes.
A missing data value in SAS is actually a special, reserved floating point
number. The official 28 missing data codes are defined as:
* A period followed by an underscore: ._
* A single period: .
* A period followed by an alphabetic letter: .a .b .c ... .x .y .z
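In comparisons these codes sort from smallest to largest as ._ then . then
.a through .z, and all of them sort below any real number. A small sketch to
confirm that ordering (the flag names c1-c4 are just illustrative):

```sas
DATA _null_;
   /* Each comparison yields 1 (true) or 0 (false) */
   c1 = (._ LT .);         /* underscore sorts below the plain period */
   c2 = (.  LT .a);        /* plain period sorts below the letters    */
   c3 = (.a LT .z);        /* letters sort alphabetically             */
   c4 = (.z LT -1E300);    /* even .z is below a very negative number */
   PUT c1= c2= c3= c4=;    /* c1=1 c2=1 c3=1 c4=1 */
RUN;
```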
Comparing a numerical data value with an open-ended IF statement
is risky, for example:
IF ( x_var LT <any real number>)
This type of IF statement will be "true" whenever x_var contains a missing
value. Missing value comparisons are also relevant with the greater than
(GT) test, e.g., IF (y_var GT x_var) will be:
* true if y_var is not missing but x_var is missing
* false if y_var is missing and x_var is not
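The comparison behavior described above can be checked with a short DATA step.
This is only a sketch; the flag variables (lt_zero, y_gt_x, x_gt_y, safe_lt)
and the test values are my own assumptions.

```sas
DATA _null_;
   x_var = .;     /* missing */
   y_var = 10;    /* not missing */

   lt_zero = (x_var LT 0);        /* 1: missing counts as less than 0 */
   y_gt_x  = (y_var GT x_var);    /* 1: y not missing, x missing      */
   x_gt_y  = (x_var GT y_var);    /* 0: the missing value is on the left */

   /* Guarding with MISSING() keeps missing values out of the LT test */
   safe_lt = ((NOT MISSING(x_var)) AND (x_var LT 0));    /* 0 */

   PUT lt_zero= y_gt_x= x_gt_y= safe_lt=;
RUN;
```

The guarded form is the one to use when "less than" should mean only real
values below the cutoff.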
Even though they are not treated as numerical data in calculations, missing
data codes behave as if they had unique, ordered numerical values. Since .z
is defined to be the 'largest' missing data value, a more comprehensive IF
statement that will work for all missing data values is:
IF (x_var LE .z)
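As a quick sketch of why LE .z is more comprehensive than testing against a
single period, the step below counts missing values both ways. The array and
counter names are hypothetical, chosen only for this illustration.

```sas
DATA _null_;
   ARRAY v{4} v1-v4;
   v1 = .;  v2 = .a;  v3 = .z;  v4 = 0;

   n_le_z   = 0;    /* counts every missing code      */
   n_eq_dot = 0;    /* counts only the plain period   */
   DO i = 1 TO DIM(v);
      IF (v{i} LE .z) THEN n_le_z   = n_le_z + 1;
      IF (v{i} EQ .)  THEN n_eq_dot = n_eq_dot + 1;
   END;

   PUT n_le_z= n_eq_dot=;    /* n_le_z=3 n_eq_dot=1 */
RUN;
```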
A very good paper to read with many more examples....