| Date: | Tue, 13 Jan 2009 12:41:27 -0600 |
| Reply-To: | Joe Matise <snoopy369@GMAIL.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Joe Matise <snoopy369@GMAIL.COM> |
| Subject: | Re: Frequency count of words |
| In-Reply-To: | <200901131816.n0DBkx4d022839@malibu.cc.uga.edu> |
| Content-Type: | text/plain; charset=ISO-8859-1 |
Clearly I am making the assumption that my data is good, proper English
without any contractions. I think that is a fair assumption to make...
;)
Yeah, obviously stripping every non-alpha character is a bad idea; just use
scan and forget compress. Too tired to think today, apparently. Probably
should go home from work... wonder if my boss would buy that...
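
For what it's worth, a minimal sketch of the scan-only version. The
delimiter list is only an assumption about what counts as a word break
(the apostrophe is deliberately left out so contractions survive), and
PROC FREQ is swapped in for the PROC SUMMARY used elsewhere in this thread:

data words (keep=word);
   set lyrics;
   length word $50;
   i = 1;
   /* blank and common punctuation split words; the apostrophe does not */
   word = lowcase(scan(dline, i, ' !.?,;:"'));
   do while (word ne ' ');
      output;
      i + 1;
      word = lowcase(scan(dline, i, ' !.?,;:"'));
   end;
run;

/* one row per distinct word, with its frequency in COUNT */
proc freq data=words noprint;
   tables word / out=word_counts (keep=word count);
run;
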
-Joe
On Tue, Jan 13, 2009 at 12:16 PM, Howard Schreier <hs AT dc-sug DOT org> <
schreier.junk.mail@gmail.com> wrote:
> On Tue, 13 Jan 2009 10:41:55 -0600, Joe Matise <snoopy369@GMAIL.COM>
> wrote:
>
> >Scan does indeed use all four of those characters as delimiters, good
> >point. I'd still use compress for the option of dropping other nonalpha
> >characters (scan doesn't seem to have the option of only counting letters
> >as non-word-breaks) unless the data was guaranteed really clean (and it
> >never is...).
> >
> >-Joe
>
> Keep in mind that some words incorporate non-alpha characters (e.g.,
> "isn't"). Languages other than English may give rise to other issues.
>
> >
> >On Tue, Jan 13, 2009 at 10:39 AM, Allen Ziegenfus
> ><aziegenfus@anaxima.com> wrote:
> >
> >> Scan appears to do this too. I don't see any difference between:
> >>
> >> word = lowcase(compress(scan(dline,i," "),"!.?,' '"));
> >> word2 = lowcase(scan(dline,i,"!.?,' '"));
> >>
> >> -----Original Message-----
> >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >> Joe Matise
> >> Sent: Tuesday, January 13, 2009 17:19
> >> To: SAS-L@LISTSERV.UGA.EDU
> >> Subject: Re: Frequency count of words
> >>
> >> Adding compress eliminates the punctuation. You would probably be
> >> better off, in fact, doing:
> >>
> >> Compress(scan(dline,i),,"ak");
> >>
> >> which if I get my compress syntax right would only keep alphabet
> >> characters (or one of the other options if you want to keep numerics
> >> or whatnot).
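> >>
> >> A minimal sketch of those modifiers, assuming K (keep), A (alphabetic)
> >> and D (digits) behave as documented; the sample string and variable
> >> names are made up for illustration:
> >>
> >> data _null_;
> >>    raw = "isn't 42nd-street!";
> >>    alpha_only = compress(raw, , 'ka');   /* keep letters only */
> >>    alnum      = compress(raw, , 'kad');  /* keep letters and digits */
> >>    put alpha_only= alnum=;
> >> run;
> >>
> >> Note that the apostrophe is dropped as well, so "isn't" comes out as
> >> "isnt".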
> >>
> >>
> >> -Joe
> >>
> >> On Tue, Jan 13, 2009 at 6:38 AM, Allen Ziegenfus
> >> <aziegenfus@anaxima.com> wrote:
> >>
> >> > I just thought of that too! I wonder which is faster.
> >> >
> >> > What does adding compress do in your example?
> >> >
> >> > You could also use a datastep view so that you don't actually store
> >> > the words dataset.
> >> >
> >> > data work.tmp (keep=word) / view=work.tmp;
> >> >    set lyrics;
> >> >    word_index = 1;
> >> >    word = scan(dline, 1);
> >> >    /* emit one row per word until scan returns a blank */
> >> >    do while (word ne "");
> >> >       output;
> >> >       word_index = word_index + 1;
> >> >       word = scan(dline, word_index);
> >> >    end;
> >> > run;
> >> >
> >> > /* the per-word counts end up in _freq_ */
> >> > proc summary data=tmp;
> >> >    class word;
> >> >    output out = word_count;
> >> > run;
> >> >
> >> > -----Original Message-----
> >> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >> > Gerhard Hellriegel
> >> > Sent: Tuesday, January 13, 2009 13:26
> >> > To: SAS-L@LISTSERV.UGA.EDU
> >> > Subject: Re: Frequency count of words
> >> >
> >> > This is easier for me to understand (no hash objects):
> >> >
> >> > data words;
> >> >    set lyrics;
> >> >    length word $50;
> >> >    i=1;
> >> >    /* split on blanks, then strip punctuation from each word */
> >> >    word = lowcase(compress(scan(dline,i," "),"!.?,' '"));
> >> >    do while (word ne " ");
> >> >       i+1;
> >> >       output;
> >> >       word = lowcase(compress(scan(dline,i," "),"!.?,' '"));
> >> >    end;
> >> >    keep word;
> >> > run;
> >> >
> >> > /* _type_=0 is the grand total; the counts are in _freq_ */
> >> > proc summary;
> >> >    class word;
> >> >    output out=x;
> >> > run;
> >> > data count;
> >> >    set x;
> >> >    if _type_=0 then word = "SUM of ALL words";
> >> >    drop _type_;
> >> > run;
> >> >
> >> > Gerhard
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Tue, 13 Jan 2009 13:01:06 +0100, Allen Ziegenfus
> >> > <aziegenfus@ANAXIMA.COM> wrote:
> >> >
> >> > >Hi,
> >> > >
> >> > >Perhaps something like the following, although I am sure this code
> >> > >could be optimized. You might want to think about how you want to
> >> > >interpret word boundaries or whether it should be case sensitive or
> >> > >not.
> >> > >
> >> > >data _null_;
> >> > > set lyrics end=dataeof;
> >> > >
> >> > > if _n_ = 1 then do;
> >> > >    length word $100 count 8.;
> >> > >
> >> > >    /* hash keyed on word, carrying the running count */
> >> > >    declare hash wordcount();
> >> > >    wordcount.definekey("word");
> >> > >    wordcount.definedata("word");
> >> > >    wordcount.definedata("count");
> >> > >    wordcount.definedone();
> >> > > end;
> >> > >
> >> > > length word_index 8.;
> >> > > word_index = 1;
> >> > > word = scan(dline, word_index);
> >> > > do while (word ne "");
> >> > >    count = 0;
> >> > >    /* find() loads the stored count when the word is already there */
> >> > >    rc = wordcount.find();
> >> > >    count = count + 1;
> >> > >    if rc = 0 then wordcount.replace();
> >> > >    else wordcount.add();
> >> > >    word_index = word_index + 1;
> >> > >    word = scan(dline, word_index);
> >> > > end;
> >> > >
> >> > > if dataeof then wordcount.output(dataset: 'work.word_count');
> >> > >run;
> >> > >
> >> > >-----Original Message-----
> >> > >From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >> > >Anindya Mozumdar
> >> > >Sent: Tuesday, January 13, 2009 11:50
> >> > >To: SAS-L@LISTSERV.UGA.EDU
> >> > >Subject: Frequency count of words
> >> > >
> >> > >All,
> >> > > Supposing I have a dataset which is created this way -
> >> > >
> >> > >data lyrics;
> >> > > infile datalines dsd dlm = "|" missover firstobs = 1;
> >> > > input dline :$20000.;
> >> > >datalines;
> >> > >There I was completely wasting, out of work and down
> >> > >All inside its so frustrating as I drift from town to town
> >> > >Feel as though nobody cares if I live or die
> >> > >So I might as well begin to put some action in my life
> >> > >Breaking the law, breaking the law
> >> > >Breaking the law, breaking the law
> >> > >Breaking the law, breaking the law
> >> > >Breaking the law, breaking the law
> >> > >;
> >> > >run;
> >> > >
> >> > >What I want is a dataset called word_counts, containing two variables
> >> > >word and count which will be the number of times each word occurs in
> >> > >any line in the above dataset. For example, given the dataset lyrics,
> >> > >word_counts should contain
> >> > >
> >> > >word count
> >> > >completely 1
> >> > >breaking 8
> >> > >frustrating 1
> >> > >....
> >> > >
> >> > >Can any of you suggest a solution for this problem? Thanks in
> >> > >advance.
> >> > >
> >> > >Regards,
> >> > >Anindya
> >> >
> >>
> >>
>