Date: Thu, 11 Mar 2010 09:54:07 -0800
Reply-To: "Richard A. DeVenezia" <rdevenezia@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Richard A. DeVenezia" <rdevenezia@GMAIL.COM>
Subject: Re: Using character variales as continuous variables
Content-Type: text/plain; charset=windows-1252
On Mar 10, 6:53 pm, Lance Smith <medicaltr...@gmail.com> wrote:
> Dear all,
> I have a database of 50 SNP variables. Each SNP variable has 3 levels
> let's say AA, AG, GG. The levels vary with different SNPs, so another
> one may be CC CT and TT and still another may be AA AC and CC.
> I also have levels of four markers that are on a continuous scale.
> I need to do univariate linear regression to predict the level of
> biomarkers using each SNP separately.
> Thus I need to do 50*4 = 200 univariate linear regressions.
> The SNPs need to be recoded to 0,1,2 for the regression as we want to
> treat them as a continuous variable with the heterozygotes (AG or CT
> or AC) coded as 1.
> Is there a way to efficiently do the recoding to 0,1,2 in SAS without
> having to recode all the 50 SNPs separately? Or is there a way to tell
> SAS to treat them as continuous variables even though they are coded
> as character variables?
> Thank you
Yes, there is a way.

Q: How many rows are in the database? You might want to transpose the
entire kaboodle in order to be able to use BY or CLASS statements.
If the allowed levels of each SNP variable are specified in a separate
table, you can use that table to create a view to map the textual
level value to a numeric value.
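
A minimal sketch of the lookup-table idea, assuming the data has already
been transposed to one row per sample and SNP. The table and column names
(snp_levels, study_long, level_num, study_long_num) are hypothetical and
are not taken from the poster's database:

```sas
/* Hypothetical lookup table: one row per SNP and allowed level */
data snp_levels;
  length snp $32 level_value $2 level_num 8;
  input snp $ level_value $ level_num;
datalines;
snp1 AA 0
snp1 AG 1
snp1 GG 2
snp2 CC 0
snp2 CT 1
snp2 TT 2
;
run;

/* Hypothetical study data, one row per sample and SNP */
data study_long;
  length sampleid 8 snp $32 level_value $2;
  input sampleid snp $ level_value $;
datalines;
1 snp1 AG
1 snp2 TT
2 snp1 GG
2 snp2 CC
;
run;

/* View that maps each textual level to its numeric value */
proc sql;
  create view study_long_num as
  select s.sampleid, s.snp, s.level_value, l.level_num
  from study_long as s
       left join snp_levels as l
       on  s.snp = l.snp
       and s.level_value = l.level_value;
quit;
```

The left join keeps every study row, so a level that is missing from the
lookup table shows up with a missing level_num instead of silently
disappearing.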
If the allowed levels are not known a priori, a pass through the
collected data _can_ extract the observed level values and map based
on that. However, if some SNP variables have fewer than 3 different
level values, the regression might be misleading or require closer
inspection.

There is an unfortunate side effect of mapping to 0,1,2: you can't
use a single format to reverse map a 0,1,2 to its original level value
(because each SNP variable has a different set of levels).
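
One possible workaround, sketched here under assumptions: key the reverse
lookup on the SNP name combined with the numeric code, instead of on the
code alone. The recode table snp_recode and the format name SNP_LEVEL_CHR
are hypothetical:

```sas
/* Hypothetical recode table: one row per SNP name and numeric code */
data snp_recode;
  length snp $32 code 8 level_value $2;
  input snp $ code level_value $;
datalines;
SNP1 0 AA
SNP1 1 AG
SNP1 2 GG
SNP2 0 CC
SNP2 1 CT
SNP2 2 TT
;
run;

/* Character format keyed on NAME_CODE, for example SNP1_2 */
data level_format_data;
  set snp_recode;
  length start $40 label $2;
  start   = catx('_', snp, code);
  label   = level_value;
  fmtname = 'SNP_LEVEL_CHR';
  type    = 'C';
  keep start label fmtname type;
run;

proc format cntlin = level_format_data;
run;

/* Reverse map code 1 of SNP2 back to its original level */
data _null_;
  original = put(catx('_', 'SNP2', 1), $snp_level_chr.);
  put original=;   /* writes original=CT to the log */
run;
```

The composite key is what makes a single format workable; a format applied
to the bare 0,1,2 value still cannot tell the SNPs apart.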
This sample code will pass over a study's collected data to determine
the level values and compute an appropriate recode value. The recode
data is used to create a custom informat that is applied to each SNP
variable to create an SNPX variable. The regressions would use
the SNPX variables.

Note: A hash table approach could also perform the same type of
recoding.
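
A rough sketch of what a hash-based recode might look like, on a
hypothetical miniature study (3 SNPs instead of 50). All data set and
variable names here (mini_study, mini_levels, mini_codes, mini_study_x)
are made up for illustration:

```sas
/* Hypothetical miniature study */
data mini_study;
  input sampleid biomarker (snp1-snp3) ($);
datalines;
1 4 AA CT AC
2 7 AG TT AA
3 5 GG CC CC
4 9 AG CT AC
;
run;

/* One row per sample and SNP */
proc transpose data=mini_study out=mini_levels (rename=(col1=level_value));
  by sampleid;
  var snp1-snp3;
run;

/* Distinct observed levels per SNP, in sorted order */
proc sort data=mini_levels (keep=_name_ level_value) out=mini_codes nodupkey;
  by _name_ level_value;
run;

/* Number the levels 0,1,2 within each SNP */
data mini_codes;
  set mini_codes;
  by _name_;
  if first._name_ then code = 0; else code + 1;
  snp_name = upcase(_name_);
run;

/* Load the codes into a hash and look up each SNP value */
data mini_study_x;
  if 0 then set mini_codes;   /* defines host variables with matching attributes */
  if _n_ = 1 then do;
    declare hash map (dataset: 'mini_codes');
    map.defineKey ('snp_name', 'level_value');
    map.defineData ('code');
    map.defineDone ();
  end;
  set mini_study;
  array snp  snp1-snp3;
  array snpx snpx1-snpx3;
  do _i = 1 to dim(snp);
    snp_name    = upcase(vname(snp(_i)));
    level_value = snp(_i);
    if map.find() = 0 then snpx(_i) = code;
    else snpx(_i) = .;
  end;
  drop _i _name_ snp_name level_value code;
run;
```

Compared with the informat, the hash keeps the SNP name and the level as
separate keys, so no concatenated lookup string is needed.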
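/*
 * fake snp level values are as such
 *   snp1  : AA, AB, BB
 *   snp2  : BB, BC, CC
 *   ...
 *   snp26 : aa, ab, bb
 */

data fake_study;
  length sampleid biomarker 4;
  array snp $2 snp1-snp50 ;
  do sampleid = 1 to 100;
    biomarker = ceil(10*ranuni(1234));
    do _n_ = 1 to dim(snp);
      x = floor(3*ranuni(1234));          /* 0, 1 or 2 */
      if _n_ < 26 then
        code = rank('A') + _n_ - 1 ;
      else
        code = rank('a') + _n_ - 26;
      /* floor() makes the integer truncation explicit */
      snp(_n_) = byte(code + floor(x/2)) || byte(code + floor((x+1)/2));
    end;
    output;
  end;
  drop code x;
run;

/* one row per sample and SNP */
proc transpose data=fake_study
               out=level_values (keep=_name_ col1 rename=(col1=level_value));
  by sampleid;
  var snp1-snp50;
run;

/* distinct observed level values per SNP */
proc sort data=level_values nodupkey;
  by _name_ level_value;
run;

/* number the levels 0,1,2 within each SNP and build the informat data */
data level_informat_data;
  set level_values;
  by _name_;
  if first._name_ then label=0; else label+1;
  start = catx ('_', upcase(_name_), upcase(level_value));
  fmtname = 'SNP_LEVEL_NUM';
  type = 'I';
  keep start label fmtname type;
run;

proc format cntlin = level_informat_data;
run;

data fake_study_snpX / view = fake_study_snpX;
  set fake_study;
  array snp snp1-snp50;
  array snpx snpx1-snpx50; format snpx: 1.;
  do _n_ = 1 to dim(snp);
    /* same NAME_LEVEL key as START in the informat data */
    name_cat_level = catx ('_', upcase(vname(snp(_n_))), upcase(snp(_n_)));
    snpx(_n_) = input (name_cat_level, SNP_LEVEL_NUM.);
  end;
run;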
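
Once the SNPs are recoded, the transpose-and-BY idea mentioned earlier
could be sketched roughly as follows. The miniature data set mini_snpx
(2 biomarkers, 3 recoded SNPs) and the names mini_long and mini_est are
hypothetical; a real run would use all 4 biomarkers and 50 SNPX variables:

```sas
/* Hypothetical recoded data: 2 biomarkers and 3 recoded SNPs */
data mini_snpx;
  input sampleid bm1 bm2 snpx1-snpx3;
datalines;
1 4.1 10.2 0 1 1
2 7.3 11.0 1 2 0
3 5.0  9.7 2 0 2
4 9.2 12.4 1 1 1
5 6.1 10.9 0 2 0
6 8.8 11.8 2 1 2
;
run;

/* One row per sample and SNP, biomarkers carried along */
proc transpose data=mini_snpx
               out=mini_long (rename=(_name_=snp col1=snpx));
  by sampleid bm1 bm2;
  var snpx1-snpx3;
run;

proc sort data=mini_long;
  by snp;
run;

/* One simple linear regression per SNP and per biomarker */
proc reg data=mini_long outest=mini_est noprint;
  by snp;
  model bm1 bm2 = snpx;
run;
quit;
```

Listing several dependent variables in one MODEL statement fits each of
them separately, so mini_est should hold one row of estimates per SNP and
biomarker. One caveat: PROC REG then uses only observations that are
nonmissing for every variable in the MODEL statement, so separate MODEL
statements may be preferable when the biomarkers have different missing
patterns.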
Richard A. DeVenezia