LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (September 2002, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 25 Sep 2002 14:29:26 -0400
Reply-To:     Ian Whitlock <WHITLOI1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Ian Whitlock <WHITLOI1@WESTAT.COM>
Subject:      Re: resolution Re: Individual count for each Record
Comments: To: fred <xkrim3@hotmail.com>
Content-Type: text/plain; charset="iso-8859-1"

Subject: resolution Re: Individual count for each Record Summary: Analysis of counting letters Respondent: IanWhitlock@westat.com

Fred [xkrim3@HOTMAIL.COM] asked a question on counting letters in a string and gave a nice summary indicating his willingness to learn.

There is a lot to learn in thinking about the general problem.

From the point of view of correctness, what happens with the data line:

0 AAAA BBBB

The program gives 9 as the count of B. Perhaps there can never be blanks in the data, perhaps not. What about lower case letters, punctuation etc.? Even if these things should not be present, I would think it best to guarantee that they are not or count and report them. I like the second solution.

On the algorithm side, I asked myself why count all 36 possible letters when there are only two letters? That seems like an waste. Is there any reason for limiting the interest to the 36 characters? (Lesson 1 - sometimes generalization leads to a more efficient answer, and can catch unexpected values in the data.) To generalize we let the data tell us the letters of interest rather than vice versa.

Now what about the output file. I found it pretty hard to search through all those 0 counts to get the letters, and moreover, a waste of space. Could there be a better way to store the counts?

I will keep the Chakravarthy, Venky [Venky.Chakravarthy@PFIZER.COM] suggestion to count lengths, but let's restrict it to just the letters in the text field. (If strings were long enough it might not matter.) Now what about those pesky blanks? Get rid of them by ending the text, with some other symbol, thus making all blanks always internal and solving the LENGTH problem at the same time. For output, long and narrow is more flexible and more efficient since only the relevant counts are present. Put one count with the letter on each output record. (Lesson 2 - don't store information in variable names, store it in data values.)

The basic algorithm is quite simple. Peal off the first letter, count it and then remove all occurrences of that letter. Repeat until done. Here is the code.

data w ( keep = snum text letter count ) ; input snum $1. +1 text $char20. ; length letter $ 1 c1text c2text $ 22 ; c1text = trim(text) || "ff"x ; do while ( length ( c1text ) > 2) ; letter = c1text ; c2text = compress ( c1text , letter ) ; count = length ( trim (c1text) ) - length ( trim ( c2text ) ) ; output ; c1text = c2text ; end ; cards; 0 AAAA BBBB 1 aa(+~^$$%qqqq 2 124YYZ4400994 3 34X88Z23Z3331 4 237X772Y32Z36 5 5ZZ3Y46080827 ;

I am still making several assumptions. Trailing blanks will not be counted, "FF"x is not a character in the text. This can easily be proved with the INDEX function. If necessary one could count and eliminate this first and then apply the basic code. Similarly one could handle blanks any where by eliminating then first. Thus the assumptions can be eliminated if necessary at the expense of complexity.

Now what if the report really has to be wide? Then either store the counts in an array or use PROC TRANSPOSE. Here is some code assuming that just the counts for capital letters and numbers are wanted.

proc transpose data = w out = t ; where index ( "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" , letter ) ; by snum text ; var count ; id letter ; run ;

data t ( drop = _name_ ) ; retain snum text ; retain _0 - _9 a b c d e f g h i j k l m n o p q r s t u v w x y z . ; set t ; run ;

In writing this, I cannot help but think of "Speed daemon" Paul Dorfman. Perhaps some of this work was in the air surrounding him at SESUG. Of course I suggest the code for its generality and principles rather than speed. Thanks also to Sigurd Hermansen, who was also at SESUG, for teaching the advantages of long and narrow.

Ian Whitlock -----Original Message----- From: fred [mailto:xkrim3@HOTMAIL.COM] Sent: Wednesday, September 25, 2002 10:57 AM To: SAS-L@LISTSERV.UGA.EDU Subject: resolution Re: Individual count for each Record

many thanks to Venky, Jim & Richard who shared your expertise in this thread. you've been most helpful. you can see from my questions that my coding skills are less than mediocre. trying to accelerate the learning process by asking plenty of dumb questions. glad that the guys in this newsgroup are so willing to teach.

i am using a variant of Venky's code to solve my problem:

data set1 ;

input snum $ text $ ;

cards;

1 11XXY11222335

2 124YYZ4400994

3 34X88Z23Z3331

4 237X772Y32Z36

5 5ZZ3Y46080827

;

run ;

data set2 ( drop = i j k) ;

set set1 ;

array n_(36) n0-n9 a1-a26 ;

do i = 48 to 57 , 65 to 90 ; * all ASCII;

j = i - 47 ;

k = i - 54 ;

if i le 57 then

n_(j) = length(trim(text)) - length(compress(text,put(byte(i),1.))) ;

else

n_(k) = length(trim(text)) - length(compress(text,put(byte(i),1.))) ;

end ;

run ;

proc print noobs ; run ;

"fred" <xkrim3@hotmail.com> wrote in message news:ampua8$vmvh$1@engel.scvmaxonline.com.sg... > Hi all, > I am trying to count the number of 1's 2's 3's etc in variable 'text' for > each of the five records. Variables s1 s2 s3.... s9 reports the number of > 1's 2's 3's....9's respectively for each record. Program is appended. > However, output shows that the values of s1 to s9 are cummulative. The > desired result is individual counts of 1's 2's etc for each record. > > Please advise. Is there a more elegant way to write this, e.g. using > arrays? > > Thanks. > > data set1; > input snum $ text $; > cards; > 1 111122233 > 2 124444599 > 3 348823333 > 4 237772323 > 5 534662882 > ; > run; > data set2 (drop=i); > set set1; > by snum; > retain s1 0 s2 0 s3 0 s4 0 s5 0 s6 0 s7 0 s8 s9 0; > if first.snum then do i=1 to 6; > if substr(text,i,1)='1' then s1=s1+1; > if substr(text,i,1)='2' then s2=s2+1; > if substr(text,i,1)='3' then s3=s3+1; > if substr(text,i,1)='4' then s4=s4+1; > if substr(text,i,1)='5' then s5=s5+1; > if substr(text,i,1)='6' then s6=s6+1; > if substr(text,i,1)='7' then s7=s7+1; > if substr(text,i,1)='8' then s8=s8+1; > if substr(text,i,1)='9' then s9=s9+1; > end; > if last.snum then output; > run; > >


Back to: Top of message | Previous page | Main SAS-L page