Date: Wed, 25 Sep 2002 14:29:26 -0400
Reply-To: Ian Whitlock <WHITLOI1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ian Whitlock <WHITLOI1@WESTAT.COM>
Subject: Re: resolution Re: Individual count for each Record
Content-Type: text/plain; charset="iso-8859-1"
Subject: resolution Re: Individual count for each Record
Summary: Analysis of counting letters
Respondent: IanWhitlock@westat.com
Fred [xkrim3@HOTMAIL.COM] asked a question on counting letters in a
string and gave a nice summary indicating his willingness to learn.
There is a lot to learn in thinking about the general problem.
From the point of view of correctness, what happens with the data
line:
0 AAAA BBBB
The program gives 9 as the count of B. Perhaps there can never be
blanks in the data, perhaps not. What about lower case letters,
punctuation etc.? Even if these things should not be present, I would
think it best to guarantee that they are not or count and report them.
I like the second solution.
On the algorithm side, I asked myself why count all 36 possible
letters when there are only two letters? That seems like an waste. Is
there any reason for limiting the interest to the 36 characters?
(Lesson 1 - sometimes generalization leads to a more efficient answer,
and can catch unexpected values in the data.) To generalize we let
the data tell us the letters of interest rather than vice versa.
Now what about the output file. I found it pretty hard to search
through all those 0 counts to get the letters, and moreover, a waste
of space. Could there be a better way to store the counts?
I will keep the Chakravarthy, Venky [Venky.Chakravarthy@PFIZER.COM]
suggestion to count lengths, but let's restrict it to just the letters
in the text field. (If strings were long enough it might not matter.)
Now what about those pesky blanks? Get rid of them by ending the
text, with some other symbol, thus making all blanks always internal
and solving the LENGTH problem at the same time. For output, long and
narrow is more flexible and more efficient since only the relevant
counts are present. Put one count with the letter on each output
record. (Lesson 2 - don't store information in variable names, store
it in data values.)
The basic algorithm is quite simple. Peal off the first letter, count
it and then remove all occurrences of that letter. Repeat until done.
Here is the code.
data w ( keep = snum text letter count ) ;
input snum $1. +1 text $char20. ;
length letter $ 1 c1text c2text $ 22 ;
c1text = trim(text) || "ff"x ;
do while ( length ( c1text ) > 2) ;
letter = c1text ;
c2text = compress ( c1text , letter ) ;
count = length ( trim (c1text) )
- length ( trim ( c2text ) ) ;
output ;
c1text = c2text ;
end ;
cards;
0 AAAA BBBB
1 aa(+~^$$%qqqq
2 124YYZ4400994
3 34X88Z23Z3331
4 237X772Y32Z36
5 5ZZ3Y46080827
;
I am still making several assumptions. Trailing blanks will not be
counted, "FF"x is not a character in the text. This can easily be
proved with the INDEX function. If necessary one could count and
eliminate this first and then apply the basic code. Similarly one
could handle blanks any where by eliminating then first. Thus the
assumptions can be eliminated if necessary at the expense of
complexity.
Now what if the report really has to be wide? Then either store the
counts in an array or use PROC TRANSPOSE. Here is some code assuming
that just the counts for capital letters and numbers are wanted.
proc transpose data = w out = t ;
where index ( "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" , letter ) ;
by snum text ;
var count ;
id letter ;
run ;
data t ( drop = _name_ ) ;
retain snum text ;
retain _0 - _9
a b c d e f g h i j k l m n o p q r s t u v w x y z . ;
set t ;
run ;
In writing this, I cannot help but think of "Speed daemon" Paul Dorfman.
Perhaps some of this work was in the air surrounding him at SESUG. Of
course I suggest the code for its generality and principles rather than
speed. Thanks also to Sigurd Hermansen, who was also at SESUG, for teaching
the advantages of long and narrow.
Ian Whitlock
-----Original Message-----
From: fred [mailto:xkrim3@HOTMAIL.COM]
Sent: Wednesday, September 25, 2002 10:57 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: resolution Re: Individual count for each Record
many thanks to Venky, Jim & Richard who shared your expertise in this
thread. you've been most helpful. you can see from my questions that my
coding skills are less than mediocre. trying to accelerate the learning
process by asking plenty of dumb questions. glad that the guys in this
newsgroup are so willing to teach.
i am using a variant of Venky's code to solve my problem:
data set1 ;
input snum $ text $ ;
cards;
1 11XXY11222335
2 124YYZ4400994
3 34X88Z23Z3331
4 237X772Y32Z36
5 5ZZ3Y46080827
;
run ;
data set2 ( drop = i j k) ;
set set1 ;
array n_(36) n0-n9 a1-a26 ;
do i = 48 to 57 , 65 to 90 ; * all ASCII;
j = i - 47 ;
k = i - 54 ;
if i le 57 then
n_(j) = length(trim(text)) - length(compress(text,put(byte(i),1.))) ;
else
n_(k) = length(trim(text)) - length(compress(text,put(byte(i),1.))) ;
end ;
run ;
proc print noobs ; run ;
"fred" <xkrim3@hotmail.com> wrote in message
news:ampua8$vmvh$1@engel.scvmaxonline.com.sg...
> Hi all,
> I am trying to count the number of 1's 2's 3's etc in variable 'text' for
> each of the five records. Variables s1 s2 s3.... s9 reports the number of
> 1's 2's 3's....9's respectively for each record. Program is appended.
> However, output shows that the values of s1 to s9 are cummulative. The
> desired result is individual counts of 1's 2's etc for each record.
>
> Please advise. Is there a more elegant way to write this, e.g. using
> arrays?
>
> Thanks.
>
> data set1;
> input snum $ text $;
> cards;
> 1 111122233
> 2 124444599
> 3 348823333
> 4 237772323
> 5 534662882
> ;
> run;
> data set2 (drop=i);
> set set1;
> by snum;
> retain s1 0 s2 0 s3 0 s4 0 s5 0 s6 0 s7 0 s8 s9 0;
> if first.snum then do i=1 to 6;
> if substr(text,i,1)='1' then s1=s1+1;
> if substr(text,i,1)='2' then s2=s2+1;
> if substr(text,i,1)='3' then s3=s3+1;
> if substr(text,i,1)='4' then s4=s4+1;
> if substr(text,i,1)='5' then s5=s5+1;
> if substr(text,i,1)='6' then s6=s6+1;
> if substr(text,i,1)='7' then s7=s7+1;
> if substr(text,i,1)='8' then s8=s8+1;
> if substr(text,i,1)='9' then s9=s9+1;
> end;
> if last.snum then output;
> run;
>
>