Date: Wed, 11 Jul 2007 09:42:41 -0400
Reply-To: Steve Denham
Sender: "SAS(r) Discussion"
From: Steve Denham
Subject: Re: Need help in t-test in SAS

On Wed, 11 Jul 2007 08:08:59 -0400, Peter Flom <peterflomconsulting@MINDSPRING.COM> wrote:

>Santughosh001@GMAIL.COM wrote
>>
>>I am sorry I think I didn't make myself clear. The problem is as
>>follows:
>>
>>I have two different groups (Control & Cancer), each group has
>>observations on 6000 genes the data looks like;
>>
>>geneid ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10
>>
>>here geneid has the gene numbers 1,2,3..............6000. and ch1 to
>>ch10 have some observations. Now for my problem
>>Control group (group1) consists of the observations on ch1 to ch5 and
>>the Cancer group (group2) consists of ch6 to ch10.
>>
>>Then suppose for gene1 I have five observations from group1 and five
>>from group2. Now I have to perform a two sample t-test
>>for this two group to make sure that the expression obtained from
>>gene 1 is significantly different for this two groups.
>>
>>In the same way I have to repeat the t-test for 6000 genes. Is there
>>any way in SAS to run the t-test for 6000 several times
>>and store the p-values in a dataset.
>>
>
>OK, let me see if I've got it straight:
>For gene 1 you have a t-test (with 5 people in each group)
>For gene 2 you have a t-test (with 5 people in each group)
>....
>For gene 6000 you have a t-test (with 5 people in each group)
>
>But then, I don't understand "run the t test for 6000 several times and store the p values"
>
>what do you want to run several times?
>
>In any case, if you run 6000 tests, which it seems like you want to do, then you are asking for trouble. EVEN if all the assumptions are met, then, if the two populations (cancer and normal) are EXACTLY equal, then you will get about 300 significant results at p = .05. That can't be good! How will you know what you've got?
>
>I think this is a foolish thing to do. I suggest that there are likely much better ways, that involve using the structure of the data in more complex ways (I can't imagine the 6000 genes are all independent of each other!).
>
>But, if you insist on doing this, then look into ODS SELECT.
>
>Peter

This looks like someone is searching for a QTL (quantitative trait locus). There is a wealth of literature out there somewhere, and, seriously, there is almost no correction for multiplicity. However, the folks I have worked with who do this DO NOT calculate p values. Plots of the log of the F test (or t test) value are used. If the X axis indicates location on a chromosome (in some sort of pre-agreed upon units) and the Y indicates log F (which is an effect size measure), you can then look for markers in the same area for selection.

So, my suggestion is to just calculate the absolute values of these effect size numbers (no probabilities) and sort. Big numbers are likely associated, small numbers are not. Once likely genes are identified, confirmatory approaches using acceptable sample sizes can be undertaken.

Here is some code (untested) that might do that:

First, rearrange the data lengthwise:

data foo;
  set original;
  indicator=1; group='Control'; ch=ch1; output;
  /* <etc.> */
  indicator=6; group='Cancer '; ch=ch6; output;
  /* <etc.> */
run;

proc sort data=foo;
  by group geneid;  /* group indicates whether the record is for cancer or control */
run;
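The <etc.> lines above could also be generated with an array rather than typed out. This is only a sketch (untested, like the rest), and it assumes the ten columns really are named ch1 to ch10 with the first five being Control; the array name chs is made up for illustration:

```sas
data foo;
  set original;
  array chs{10} ch1-ch10;      /* assumed column names ch1..ch10 */
  length group $7;             /* long enough for 'Control' */
  do indicator = 1 to 10;
    if indicator <= 5 then group = 'Control';
    else group = 'Cancer';     /* padded with a blank to length 7 */
    ch = chs{indicator};
    output;                    /* one record per gene per column */
  end;
  drop ch1-ch10;
run;

proc sort data=foo;
  by group geneid;
run;
```

This should give ten records per gene, with the same indicator, group and ch variables as the longhand version.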

proc means noprint data=foo;
  by group geneid;
  var ch;
  output out=foomean mean=chmean var=chvar;
run;

data foocanc;
  set foomean;
  if group='Cancer ';
  cancmean=chmean;
  cancvar=chvar;
  drop chmean chvar;
run;

data foocont;
  set foomean;
  if group='Control';
  contmean=chmean;
  contvar=chvar;
  drop chmean chvar;
run;

proc sort data=foocanc;
  by geneid;
run;

proc sort data=foocont;
  by geneid;
run;

data fooboth;
  merge foocanc foocont;
  by geneid;
  absdiff=abs(contmean-cancmean);
  denom=sqrt((contvar+cancvar)/2);  /* pooled standard deviation (equal group sizes) */
  effect=absdiff/denom;
run;

proc sort data=fooboth;
  by effect;
run;

proc print data=fooboth;
run;
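As a side note, with equal group sizes the effect computed above is just a rescaled pooled t statistic: |t| = effect * sqrt(n/2), where n is the number of observations per group. So sorting by effect orders the genes exactly as sorting by |t| would. A minimal sketch, assuming every gene has exactly five non-missing values in each group (the dataset name foot and the variables n and abst are made up for illustration):

```sas
data foot;
  set fooboth;
  n = 5;                        /* assumed observations per group */
  abst = effect * sqrt(n / 2);  /* |t| for the pooled two-sample t test */
run;

proc sort data=foot;
  by descending abst;           /* largest statistics first */
run;
```

If some genes have missing values the group sizes differ and this shortcut no longer holds.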

Now, there are probably slicker ways to do this, but the DATA step approach kind of gets your hands into the data a little more. Probably much the same thing could be done by PROC TTEST, with a by geneid statement, and ODS control to get the fooboth dataset.
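For what it is worth, the PROC TTEST route might look something like the sketch below (also untested). It assumes the long dataset foo from above, and that the ODS table of t tests is named TTests with variables Method, tValue, DF and Probt; check the names with ODS TRACE ON in your release. The dataset names ttest_out and pooled are made up for illustration.

```sas
proc sort data=foo;
  by geneid;
run;

ods exclude all;                 /* suppress 6000 pages of printed output */
ods output TTests=ttest_out;     /* capture the t test table as a dataset */
proc ttest data=foo;
  by geneid;
  class group;
  var ch;
run;
ods exclude none;

data pooled;
  set ttest_out;
  where Method='Pooled';         /* keep the equal-variance test only */
  abst = abs(tValue);
  keep geneid tValue DF Probt abst;
run;

proc sort data=pooled;
  by descending abst;
run;
```

The Probt column holds the p-values the original poster asked for, but Peter's warning about 6000 tests applies to them in full.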

Steve Denham
Mathematical Biologist
Monsanto Co.
