Date: Tue, 13 Jan 2009 13:01:06 +0100
Reply-To: Allen Ziegenfus <aziegenfus@ANAXIMA.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Allen Ziegenfus <aziegenfus@ANAXIMA.COM>
Subject: Re: Frequency count of words
In-Reply-To: <4e829fd30901130249i5dd2ebd2w911b733e44a26dfb@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"
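Hi,
Perhaps something like the following, although I am sure this code could be
optimized. You will also need to decide how to define word boundaries and
whether the count should be case sensitive. As written, the code uses SCAN's
default delimiters and compares case-sensitively, so "Breaking" and
"breaking" are counted as two different words.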
data _null_;
    set lyrics end=dataeof;

    length word $100 count word_index 8;

    /* Build the hash table once, on the first observation. */
    if _n_ = 1 then do;
        declare hash wordcount();
        wordcount.definekey("word");
        wordcount.definedata("word", "count");
        wordcount.definedone();
    end;

    /* Walk through the line one word at a time (default SCAN delimiters). */
    word_index = 1;
    word = scan(dline, word_index);
    do while (word ne "");
        count = 0;                 /* stays 0 if the word is new */
        rc = wordcount.find();     /* on a hit, loads the stored count */
        count = count + 1;
        if rc = 0 then wordcount.replace();
        else wordcount.add();
        word_index = word_index + 1;
        word = scan(dline, word_index);
    end;

    /* After the last line, write the table out as the requested dataset. */
    if dataeof then wordcount.output(dataset: 'work.word_counts');
run;
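If you would rather settle the word-boundary and case questions explicitly,
here is a rough sketch of one alternative: split each line into one
lower-cased word per row, then let PROC FREQ do the counting. The delimiter
list and the intermediate dataset name work.words are only my assumptions,
so adjust them to taste, and it assumes your SAS release provides the COUNTW
function.

```sas
/* Sketch only: one row per lower-cased word.                          */
/* The delimiter list ' ,.;!?' is an assumption, not a requirement.    */
data work.words (keep=word);
    set lyrics;
    length word $100;
    do word_index = 1 to countw(dline, ' ,.;!?');
        word = lowcase(scan(dline, word_index, ' ,.;!?'));
        output;
    end;
run;

/* Count the occurrences of each word; COUNT is created by PROC FREQ. */
proc freq data=work.words noprint;
    tables word / out=work.word_counts (keep=word count);
run;
```

Because every word is lower-cased before counting, "Breaking" and "breaking"
end up in the same row. The trade-off is an extra pass over an intermediate
dataset, which the hash approach avoids.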
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Anindya Mozumdar
Sent: Tuesday, 13 January 2009 11:50
To: SAS-L@LISTSERV.UGA.EDU
Subject: Frequency count of words
All,
Supposing I have a dataset which is created this way -
data lyrics;
    infile datalines dsd dlm = "|" missover firstobs = 1;
    input dline :$20000.;
    datalines;
There I was completely wasting, out of work and down
All inside its so frustrating as I drift from town to town
Feel as though nobody cares if I live or die
So I might as well begin to put some action in my life
Breaking the law, breaking the law
Breaking the law, breaking the law
Breaking the law, breaking the law
Breaking the law, breaking the law
;
run;
What I want is a dataset called word_counts containing two variables,
word and count, where count is the number of times each word occurs
across all lines of the above dataset. For example, given the dataset
lyrics, word_counts should contain

word          count
completely    1
breaking      8
frustrating   1
....
Can any of you suggest a solution for this problem? Thanks in advance.
Regards,
Anindya