LISTSERV at the University of Georgia
Date:   Tue, 29 Jul 2008 14:19:24 -0400
Reply-To:   msz03@albany.edu
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Mike Zdeb <msz03@ALBANY.EDU>
Subject:   Re: data mining
Content-Type:   text/plain;charset=iso-8859-1

hi ... I figured it'd be easy to suppress results that are not 'reasonable' given knowledge of the subject area

I think that the hard part is done here, i.e. finding the strings and repeats in one pass through the data, and like I said, it does agree with the posting except for the extra line

--
Mike Zdeb
U@Albany School of Public Health
One University Place
Rensselaer, New York 12144-3456
P/518-402-6479 F/630-604-1475

> Note that you've got a start position of 16; my solution is assuming that this data is haplotypes,
> and there's two alleles to every marker, so I'm not including strings that go across markers, this
> particular SNP marker would be "tc" in columns 15-16.
>
> So including that extra results depends on whether you want to do what's appropriate for the field
> or not- I don't think it is appropriate to report a haplotype string that splits a SNP in
> genetics, and thus I don't use the strings that start in even columns. The user doesn't say that
> this is genetics data, but given the letters used, it is likely.
>
> -Mary
> ----- Original Message -----
> From: Mike Zdeb
> To: SAS-L@LISTSERV.UGA.EDU
> Sent: Tuesday, July 29, 2008 12:27 PM
> Subject: Re: data mining
>
>
> hi ... I was able to get the results you posted (plus I faked another sequence so I had two
> observations) ...
>
> seq     motif  rept  stpos  endpos  len
> seq1-1  ag        2      1       4   45
> seq1-2  cg        2     12      15   45
> seq1-3  ct        8     16      31   45
> seq1-4  ga        2     37      40   45
> seq1-6  tcga      2     31      38   45
>
> with this ... but, I also got a SEQ1-5 that was not on your list ...
>
> seq1-5  ctct      4     16      31   45
>
> (***** we all await the 5-lines of code SQL method *****)
>
>
> data sequence;
>   infile datalines missover;
>   input seq : $4. h : $100.;
> datalines;
> seq1 agagattcgatcgcgctctctctctctctctcgatcgagatcgat
> seq2 agagtctctcga
> ;
> run;
>
> data x;
>   set sequence;
>   ll = length(h);
>   s = 0;
>   * start at position 1 in sequence, look for motifs length 2 to 5;
>   do j=2 to 5;
>     do i=1 to length(h)-4;
>       motif = substr(h,i,j);
>       start = i;
>       rpt = 1;
>       do while (trim(motif) eq trim(substr(h,i+j,j)));
>         rpt + 1;
>         i + j;
>       end;
>       if rpt ge 2 then do;
>         end = start + (j*rpt) - 1;
>         s + 1;
>         seqq = catx('-',seq,s);
>         output;
>       end;
>     end;
>   end;
>   keep seqq motif rpt start end ll;
> run;
>
> proc print data=x;
>   var seqq motif rpt start end ll;
> run;
>
>
> --
> Mike Zdeb
> U@Albany School of Public Health
> One University Place
> Rensselaer, New York 12144-3456
> P/518-402-6479 F/630-604-1475
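
The rule Mary describes in the quoted text (do not report strings that start in even columns) could be applied as a filter on the dataset X built by the quoted code. This is only a sketch, not code from either poster. It assumes every marker occupies two columns and the first marker starts in column 1, so a string on a marker boundary has an odd START; the dataset name X_ODD is made up for illustration.

```sas
* Sketch only: X is the dataset built by the quoted code and
  X_ODD is a hypothetical name for the filtered copy;
data x_odd;
  set x;
  * keep motifs that start in an odd column, which is a marker
    boundary if each marker fills two columns from column 1 on;
  where mod(start, 2) = 1;
run;

proc print data=x_odd;
  var seqq motif rpt start end ll;
run;
```

Checked against the listing quoted above, this filter would drop the rows that start in columns 12 and 16 (including the extra SEQ1-5 line) and keep the rows that start in columns 1, 31 and 37. It only tests the start column; whether odd-length motifs should also be excluded is a separate subject-area decision that the thread does not settle.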

