Date: Tue, 29 Jul 2008 13:27:12 -0400
Reply-To: msz03@albany.edu
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Mike Zdeb <msz03@ALBANY.EDU>
Subject: Re: data mining
hi ... I was able to get the results you posted (plus I faked another sequence so I had two
observations) ...
seq     motif  rept  stpos  endpos  len
seq1-1  ag        2      1       4   45
seq1-2  cg        2     12      15   45
seq1-3  ct        8     16      31   45
seq1-4  ga        2     37      40   45
seq1-6  tcga      2     31      38   45
using the data step further down ... but I also got a SEQ1-5 that was not on your list ...
seq1-5  ctct      4     16      31   45
(***** we all await the 5-lines of code SQL method *****)
data sequence;
   infile datalines missover;
   input seq : $4. h : $100.;
datalines;
seq1 agagattcgatcgcgctctctctctctctctcgatcgagatcgat
seq2 agagtctctcga
;
run;
data x;
   set sequence;
   ll = length(h);
   * s numbers the repeats found within each sequence;
   s = 0;
   * look for motifs of length 2 to 5;
   do j = 2 to 5;
      * start at position 1 and stop where a second copy of the motif no longer fits;
      do i = 1 to ll - 2*j + 1;
         motif = substr(h,i,j);
         start = i;
         rpt = 1;
         * count adjacent copies, advancing i to the start of the last copy;
         do while (trim(motif) eq trim(substr(h,i+j,j)));
            rpt + 1;
            i + j;
         end;
         if rpt ge 2 then do;
            end = start + (j*rpt) - 1;
            s + 1;
            seqq = catx('-',seq,s);
            output;
         end;
      end;
   end;
   keep seqq motif rpt start end ll;
run;
proc print data=x;
   var seqq motif rpt start end ll;
run;
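
a possible explanation for the extra SEQ1-5 row: CTCT x 4 covers the same positions (16 to 31) as CT x 8, so it may just be the CT run picked up again at motif length 4. The step below is only a rough sketch of one way to flag such rows, assuming data set X from the step above. The names X2, PRIMITIVE, K and N are made up for illustration, and whether those rows should actually be dropped is a guess about what the original list intended.

```sas
* sketch only ... flag motifs that are themselves repeats of a shorter unit;
* X2, PRIMITIVE, K and N are hypothetical names, not part of the step above;
data x2;
   set x;
   primitive = 1;
   * try each shorter unit length k that divides the motif length evenly;
   do k = 1 to length(motif) - 1 while (primitive);
      if mod(length(motif), k) = 0 then do;
         n = length(motif) / k;
         * repeat(unit, n-1) returns n copies of the unit;
         if motif eq repeat(substr(motif,1,k), n-1) then primitive = 0;
      end;
   end;
   drop k n;
run;

proc print data=x2;
   where primitive;
   var seqq motif rpt start end ll;
run;
```

with this check, CTCT is flagged as CT doubled while AG, CG, CT, GA and TCGA are kept. Since the search starts at motif length 2, a run of a single base (AAAA) would come through as AA x 2 and would also be flagged.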
--
Mike Zdeb
U@Albany School of Public Health
One University Place
Rensselaer, New York 12144-3456
P/518-402-6479 F/630-604-1475