| Date: | Mon, 16 Mar 1998 13:44:05 -0500 |
| Reply-To: | WHITLOI1 <WHITLOI1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU> |
| From: | WHITLOI1 <WHITLOI1@WESTAT.COM> |
| Subject: | Re: Accessing a Variable's Data Type in a Program |
|
| Content-Type: | text/plain; charset=US-ASCII |
Subject: Accessing a Variable's Data Type in a Program
Summary: A solution using an SQL dictionary file is carefully
considered.
Respondent: Ian Whitlock <whitloi1@westat.com>
Robert C. Saunders <saunders@UANSV5.VANDERBILT.EDU> asked for help
automating a solution to:
> I have 10 SAS datasets that I want to concatenate. However, in some
> files a variable is stored as character and in another it's stored as
> numeric. Concatenate won't procede unless the common variables are all
> of the same type. I need to convert the variables from numeric to
> character or character to numeric as appropriate for the concatenated
> data set.
His complete text is appended at the bottom.
First let me point out that the existence of the problem indicates a
severe breakdown in project management control; hence I personally
would not want to use an automated solution to this problem. There
are probably too many other little details that should be carefully
considered. With that said the problem is interesting and worth
considering for the technique involved.
Robert did not make entirely clear how one is to determine which
variables are converted in which direction. I will assume a data set
STD exists with 0 or more obs which has the correct type and length
for all variables; hece any differences from the standard indicate the
need for a conversion.
For test data let's use
data std ;
stop ;
x = 0 ; y = '00' ; z = 'abc' ;
run ;
data w1 ;
x = 1 ; y = '11' ; z = 'abc' ;
run ;
data w2 ;
x = '2.0000' ; y = 22 ; z = 'xyz' ;
run ;
data w3 ;
x = 3 ; y = 33 ; z = 'xyz' ;
run ;
data w4 ;
x = '4.4' ; y = '44' ; z = 'xyz' ;
run ;
data w5 ;
x = 5 ; y = '55' ; z = 'abc' ;
run ;
Now let's look at what we have to accomplish. The final data step
should look something like:
/* combine the data sets */
data all ( drop = __t: ) ;
set std ( obs = 0 )
w1 ( in = mem1 rename = ())
w2 ( in = mem2 rename = (X = __t2 Y = __t3))
w3 ( in = mem4 rename = (Y = __t4))
w4 ( in = mem5 rename = (X = __t5))
w5 ( in = mem6 rename = ())
;
if mem2 then do ; /* w2 */
X = input ( __t2 ,6. ) ;
Y = put ( __t3 ,2. ) ;
end;
if mem4 then do ; /* w3 */
Y = put ( __t4 ,2. ) ;
end;
if mem5 then do ; /* w4 */
X = input ( __t5 ,3. ) ;
end;
run ;
What about the empty RENAME lists? Fortunately version 6.12 is
friendly to empty lists. However with each new version of SAS the SI
programmers tend to forget this user friendly principle so one can
expect glitches from time to time. Of course the empty RENAME clauses
could be eliminated in the same way that the empty IF-DO blocks were
eliminated, but I am stubborn and tend to think it their problem
instead of mine.
As presented above we see that there are two separate problems
1) How to get the correct rename options on the data sets in the
SET statement.
2) How to coordinate the renames with the conversions after the SET
statement.
We could get a list of names from each data set where the corresponding
types differ. For example, for data set W3:
proc sql ;
select b.memname , a.name , a.type ,
case
when a.type = 'char' then a.length
else b.length
end as length
from dictionary.columns as a ,
dictionary.columns as b
where a.libname = b.libname = "WORK"
and a.memname = "STD"
and b.memname = "W3"
and a.name = b.name
and a.type ^= b.type
;
quit ;
Note that we always need the length of the character variable in the
conversion, either because we want to INPUT the variable to a numeric
one or because we want to PUT it to a character one.
Armed with such a data set it is easy to generate a rename for each
record and also to write the conversion corresponding to each record.
Since there will be ten such SELECT clauses all unioned together let's
make it a macro. There is one technical detail. What about data sets
that don't need any conversions? Well let's make a list of all the
data sets and then do a left join to insure that there is a record for
every data set. If NAME is missing then this will indicate that none
of the special code is needed.
data mems ;
input memname $ ;
cards ;
w1
w2
w3
w4
w5
;
%macro getspecs ( mem = ) ;
b.memname , a.name , a.type ,
case
when a.type = 'char' then a.length
else b.length
end as length
from dictionary.columns as a ,
dictionary.columns as b
where a.libname = b.libname = "WORK"
and a.memname = "STD"
and b.memname = "%upcase(&mem)"
and a.name = b.name
and a.type ^= b.type
%mend getspecs ;
proc sql ;
create table specs as
select %getspecs ( mem = w1 )
union corr
select %getspecs ( mem = w2 )
union corr
select %getspecs ( mem = w3 )
union corr
select %getspecs ( mem = w4 )
union corr
select %getspecs ( mem = w5 )
;
create table specs as
select m.memname , s.name , s.type , s.length
from mems as m left join specs as s
on s.memname = upcase ( m.memname )
;
******select * from specs ;
quit ;
Now to generate the code lets use two separate DATA steps - one for
the RENAMES and one for the conversion code. We will write to a
catalog to take care of the clean up problem.
filename code catalog 'work.temp' ;
/* write set statement */
data _null_ ;
file code ( setstmt.source ) ;
if eof then put ';' ;
if _n_ = 1 then put 'set std ( obs = 0 ) ' ;
set specs end = eof ;
by memname notsorted ;
__x + 1 ;
if first.memname then
put memname '( in = mem' __x 5.-l ' rename = (' ;
if name ^= ' ' then put name '= __t' __x 5.-l ;
if last.memname then put '))' ;
run ;
/* write conversion code */
data _null_ ;
file code ( convert.source ) ;
length fmt $ 5 ;
set specs end = eof ;
by memname notsorted ;
__x + 1 ;
if first.memname and name ^= ' ' then
put 'if mem' __x 5.-l ' then do ; /* ' memname '*/' ;
fmt = trim ( left ( put ( length , 3. ) ) ) || '.' ;
if type = 'char' and name ^= ' ' then
put name '= put ( __t' __x 5.-l ',' fmt ') ;' ;
else
if name ^= ' ' then
put name '= input ( __t' __x 5.-l ',' fmt ') ;' ;
if last.memname and name ^= ' ' then put 'end; ' ;
run ;
Finally, it is time to put it altogether with
/* combine the data sets */
data all ( drop = __t: ) ;
%inc code ( setstmt.source ) / source2 ;
%inc code ( convert.source ) / source2 ;
run ;
I have intentionally left the code in a crude form to be modified to
meet a particular circumstance. Of course one could make a managing
macro to take as parameters the standard set and the list of sets to
concatenate. The only changes needed to the above code would be a
%DO-loop to generate the SQL code and a %DO-loop to generate the data
set MEMS. The rest can be used as is.
What about the other obvious approach - use SCL functions in DATA
steps to retrieve tne needed information? In this case the SQL code
disappears and the the two writing DATA steps can be merged into one
glorious mess of loops. Is it really any simpler? I would be
interested in seeing someone carry out the plan in detail for
comparison. Could it be faster? Possibly, but remember the savings
is the onetime cost of generating code so I do not see it as
significant.
Ian Whitlock
______________________________ Reply Separator _________________________________
Subject: Accessing a Variable's Data Type in a Program
Author: "Robert C. Saunders" <saunders@UANSV5.VANDERBILT.EDU> at
Internet-E-Mail
Date: 3/15/98 9:42 AM
Hi,
I have a question about accessing a variable's data type as part of a program.
Of course, I've spent enough time thinking about how to automate this, that I
should just do the very slow process, but it's a problem that has come up enough
times before that my time figure this out has been an investment. Healthy
rationalization.
PREFACE:
I use SAS612, Windows NT40.
PROBLEM:
I have 10 SAS datasets that I want to concatenate. However, in some files a
variable is stored as character and in another it's stored as numeric.
Concatenate won't procede unless the common variables are all of the same type.
I need to convert the variables from numeric to character or character to
numeric as appropriate for the concatenated data set.
WHAT I WOULD LIKE TO DO:
I want the program to open the data set, check the variable's data type. There
are two types of switches (c to n and n to c). If the variable is c and I want
it to be c, the program should skip to the next variable; if the variable is c
and I want it to be n, then I want the program to do an INPUT() step to convert
the data. I would do this for the checking and conversion of numeric variables.
UNDESIRABLE SOLUTION:
Go through each data set's contents, open each data set, use INPUT() to convert
the variables, close the data set, then concatenate them. This is slow and I am
reluctant to believe this isn't automatable.
Concatenate the datasets with a RENAME= on each input data set and process it in
"wide" form, tolerating the type mismatches.
WHAT I'M CONSIDERING:
Use PROC CONTENTS DATA=FILE1 OUT=CFILE1 on each data set;
Write a data step using FILE1 that tells it to look up each variable's TYPE in
its corresponding OUT=dataset;
Repeat for data files 2 through 10.
I think I will need to make a data set that contains the "desired" data type for
each variable for the program to lookup each variable's correct state before
making the conversion.
ASSISTANCE:
If anyone has alternate suggestions or knows of some PROC/MACRO that already
exists to handle this, I would greatly appreciate your sharing it with me.
Thank you.
Robert
################################
Robert C. Saunders, M.P.P.
Research Associate
CMHP/VIPPS/Vanderbilt University
1207 18th Avenue South
Nashville, Tennessee 37212
T: 615.322.8284
F: 615.322.8081
|