| Date: | Tue, 29 Jul 2008 14:19:24 -0400 |
| Reply-To: | msz03@albany.edu |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Mike Zdeb <msz03@ALBANY.EDU> |
| Subject: | Re: data mining |
| Content-Type: | text/plain;charset=iso-8859-1 |
hi ... I figured it'd be easy to suppress results that are not 'reasonable' given knowledge of
the subject area

I think the hard part is done here, i.e. finding the strings and repeats in one pass through
the data, and like I said, it does agree with the posting except for the extra line (SEQ1-5)
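
as a rough sketch of that suppression step, assuming the rule quoted below (drop repeats that
start in an even column, since they would split a two-allele marker), a subsetting IF on data
set X would do it ... the data set name X_MARKER is made up for illustration, and whether this
rule fits depends on the data really being haplotypes

```sas
* keep only repeats that begin on a marker boundary (odd column);
* assumes markers occupy columns 1-2, 3-4, ... as described in the quoted note;
data x_marker;
   set x;
   if mod(start, 2) = 1;
run;

proc print data=x_marker;
   var seqq motif rpt start end ll;
run;
```

the full result set stays in X, so the filter can be changed or dropped without rerunning the search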
--
Mike Zdeb
U@Albany School of Public Health
One University Place
Rensselaer, New York 12144-3456
P/518-402-6479 F/630-604-1475
> Note that you've got a start position of 16. My solution assumes that these data are haplotypes
> with two alleles to every marker, so I'm not including strings that go across markers; this
> particular SNP marker would be "tc" in columns 15-16.
>
> So including that extra result depends on whether you want to do what's appropriate for the field
> or not. I don't think it is appropriate in genetics to report a haplotype string that splits a
> SNP, and thus I don't use the strings that start in even columns. The user doesn't say that
> this is genetics data, but given the letters used, it is likely.
>
> -Mary
> ----- Original Message -----
> From: Mike Zdeb
> To: SAS-L@LISTSERV.UGA.EDU
> Sent: Tuesday, July 29, 2008 12:27 PM
> Subject: Re: data mining
>
>
> hi ... I was able to get the results you posted (plus I faked another sequence so I had two
> observations) ...
>
> seq     motif  rept  stpos  endpos  len
> seq1-1  ag        2      1       4   45
> seq1-2  cg        2     12      15   45
> seq1-3  ct        8     16      31   45
> seq1-4  ga        2     37      40   45
> seq1-6  tcga      2     31      38   45
>
> with this ... but, I also got a SEQ1-5 that was not on your list ...
>
> seq1-5  ctct      4     16      31   45
>
> (***** we all await the 5-lines of code SQL method *****)
>
>
> data sequence;
>    infile datalines missover;
>    input seq : $4. h : $100.;
> datalines;
> seq1 agagattcgatcgcgctctctctctctctctcgatcgagatcgat
> seq2 agagtctctcga
> ;
> run;
>
> data x;
>    set sequence;
>    ll = length(h);
>    s = 0;
>    * start at position 1 in sequence, look for motifs length 2 to 5;
>    do j = 2 to 5;
>       do i = 1 to length(h) - 4;
>          motif = substr(h, i, j);
>          start = i;
>          rpt = 1;
>          * count consecutive copies of the motif, stepping i forward by j for each copy found;
>          do while (trim(motif) eq trim(substr(h, i+j, j)));
>             rpt + 1;
>             i + j;
>          end;
>          * report only motifs that occur at least twice in a row;
>          if rpt ge 2 then do;
>             end = start + (j*rpt) - 1;
>             s + 1;
>             seqq = catx('-', seq, s);
>             output;
>          end;
>       end;
>    end;
>    keep seqq motif rpt start end ll;
> run;
>
> proc print data=x;
>    var seqq motif rpt start end ll;
> run;
>