|Date: ||Wed, 11 Jul 2007 09:42:41 -0400|
|Reply-To: ||Steve Denham <steven.c.denham@MONSANTO.COM>|
|Sender: ||"SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>|
|From: ||Steve Denham <steven.c.denham@MONSANTO.COM>|
|Subject: ||Re: Need help in t-test in SAS|
On Wed, 11 Jul 2007 08:08:59 -0400, Peter Flom
>>I am sorry I think I didn't make myself clear. The problem is as
>>I have two different groups (Control & Cancer), each group has
>>observations on 6000 genes the data looks like;
>>geneid ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10
>>here geneid has the gene numbers 1,2,3..............6000. and ch1 to
>>ch10 have some observations. Now for my problem
>>Control group (group1) consists of the observations on ch1 to ch5 and
>>the Cancer group (group2) consists of ch6 to ch10.
>>Then suppose for gene1 I have five observations from group1 and five
>>from group2. Now I have to perform a two sample t-test
>>for these two groups to determine whether the expression obtained from
>>gene 1 is significantly different between the two groups.
>>In the same way I have to repeat the t-test for 6000 genes. Is there
>>any way in SAS to run the t-test for 6000 several times
>>and store the p-values in a dataset.
>OK, let me see if I've got it straight:
>For gene 1 you have a t-test (with 5 people in each group)
>For gene 2 you have a t-test (with 5 people in each group)
>For gene 6000 you have a t-test (with 5 people in each group)
>But then, I don't understand "run the t test for 6000 several times and
>store the p values"
>what do you want to run several times?
>In any case, if you run 6000 tests, which it seems like you want to do,
>then you are asking for trouble. EVEN if all the assumptions are met,
>then, if the two populations (cancer and normal) are EXACTLY equal, then
>you will get about 300 significant results at p = .05. That can't be
>good! How will you know what you've got?
>I think this is a foolish thing to do. I suggest that there are likely
>much better ways, that involve using the structure of the data in more
>complex ways (I can't imagine the 6000 genes are all independent of each
>other).
>But, if you insist on doing this, then look into ODS SELECT.
This looks like someone is searching for a QTL (quantitative trait locus).
There is a wealth of literature out there somewhere, and, seriously, there
is almost no correction for multiplicity. However, the folks I have worked
with who do this DO NOT calculate p values. Plots of the log of the F test
(or t test) value are used. If the X axis indicates location on a
chromosome (in some sort of pre-agreed upon units) and the Y indicates log
F (which is an effect size measure), you can then look for markers in the
same area for selection.
So, my suggestion is to just calculate the absolute values of these effect
size numbers--no probabilities, and sort. Big numbers are likely
associated, small numbers are not. Once likely genes are identified,
confirmatory approaches using acceptable sample sizes can be undertaken.
Here is some code (untested) that might do that:
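* First rearrange the data lengthwise (one record per gene per channel).
  Assumes ch1-ch5 are Control and ch6-ch10 are Cancer, as in the question;
data foo;set original;
  length group $7;
  array ch{10} ch1-ch10;
  do i = 1 to 10;
    if i <= 5 then group='Control'; else group='Cancer';
    value = ch{i}; output;
  end;
  keep geneid group value;
proc sort data=foo;
  by group geneid; /*group indicates whether the record is for cancer or control*/
proc means noprint data=foo;
  by group geneid;
  var value;
  output out=foomean mean=chmean var=chvar;
data foocanc;set foomean;
  if group='Cancer';
  cancmean=chmean; cancvar=chvar; /*variable names here are assumed*/
  drop chmean chvar group _type_ _freq_;
data foocont;set foomean;
  if group='Control';
  contmean=chmean; contvar=chvar;
  drop chmean chvar group _type_ _freq_;
proc sort data=foocanc; by geneid;
proc sort data=foocont; by geneid;
data fooboth;merge foocanc foocont;
  by geneid;
  /*pooled two-sample t statistic with n=5 in each group*/
  t = (cancmean - contmean) / sqrt((cancvar + contvar) / 5); abst = abs(t);
proc sort data=fooboth; by descending abst;
proc print data=fooboth; run;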
Now, there are probably slicker ways to do this, but the DATA step approach
kind of gets your hands into the data a little more. Probably much the
same thing could be done by PROC TTEST, with a by geneid statement, and ODS
control to get the fooboth dataset.
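
A minimal sketch of that PROC TTEST route follows. It is untested, and it assumes the long-format dataset foo has one record per gene per channel, with a character variable group ('Control' or 'Cancer') and a numeric measurement variable that is hypothetically named value here. TTests is the ODS table in which PROC TTEST reports its t statistics; the name foottest is just an illustrative work dataset.

```sas
/* Assumption: foo holds geneid, group, and a measurement variable
   named value (a hypothetical name, not from the original post) */
proc sort data=foo;
  by geneid;
run;

ods select none;              /* suppress the printed output for 6000 genes */
ods output TTests=foottest;   /* capture the t-test table as a dataset */
proc ttest data=foo;
  by geneid;
  class group;
  var value;
run;
ods select all;               /* turn printed output back on */

/* Keep the pooled (equal-variance) row for each gene and rank by |t| */
data fooboth;
  set foottest;
  where Method = 'Pooled';
  abst = abs(tValue);
run;

proc sort data=fooboth;
  by descending abst;
run;
```

The Probt column of the captured table holds the p-values if they are wanted after all, though the multiplicity concern raised above still applies to them.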