|
Stephen Arthur asks:
>
> Hi,
>
> I have an ordered list of 73 (this is really
> arbitrary) items with values of 'A', 'T', 'C', 'G'.
>
> I want to obtain a continuous sequence of these
> letters that is 50 'bases' long. The only criteria
> being that the 'C' and 'G' letters make up at least
> 50% of an accepted sequence. If starting from
> position one the sequence fails the 50% criteria test,
> I want to be able to read the next 50 bases from
> position 2, etc...
>
Steve,
This might help - based on your code:
/* Tested code Win 95 6.12 TS020 */
data dna73(keep = base i roller)
start_at(keep=startpos);
retain i 0;
infile cards;
input base $ @@;
if base in ('C','G') then value = 1;
else value = 0;
i + 1;
retain roller 0;
dropoff = lag50(value);
roller = sum(roller,value,-dropoff);
if 1 le i le 73 then output dna73;
if roller ge 25 then do;
startpos = i - 49;
output start_at;
end;
cards;
T G A G A T C A C T T C C C T T G C
A C A G T T T G G A A G G G A G A G
C A C T T T A T T A C A G A C C T T
G G A A G C A A G A G G A T T G
C A T
;
run;
data dna50;
retain seqnum 0;
drop startpos;
set start_at;
seqnum + 1;
do pointer = startpos to startpos+49;
set dna73 point=pointer;
output;
end;
run;
The variable called 'roller' is tracking how many C and G bases have
come up in the last fifty observations, and if it hits 25 we write out
where the sequence started. Then we can create DNA50 which contains a
listing of all the valid sequences (there aren't any, but if you drop
the criterion to 24, you get four). A bit of macro code could be
wrapped round this to determine how many bases you scan and what the
success criterion is.
Dave
.
|