LISTSERV at the University of Georgia
Date:   Tue, 13 Jan 2009 12:41:27 -0600
Reply-To:   Joe Matise <snoopy369@GMAIL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Joe Matise <snoopy369@GMAIL.COM>
Subject:   Re: Frequency count of words
In-Reply-To:   <200901131816.n0DBkx4d022839@malibu.cc.uga.edu>
Content-Type:   text/plain; charset=ISO-8859-1

Clearly I am assuming that my data is good, proper English without any contractions. I think that is a fair assumption to make...

;)

Yeah, obviously stripping every non-alpha character with compress is a bad idea; just use scan and forget compress. Too tired to think today, apparently. Probably should go home from work ... wonder if my boss would buy that...

-Joe

On Tue, Jan 13, 2009 at 12:16 PM, Howard Schreier <hs AT dc-sug DOT org> <schreier.junk.mail@gmail.com> wrote:

> On Tue, 13 Jan 2009 10:41:55 -0600, Joe Matise <snoopy369@GMAIL.COM> wrote:
>
> >Scan does indeed use all four of those characters as delimiters, good
> >point. I'd still use compress for the option of dropping other nonalpha
> >characters (scan doesn't seem to have the option of only counting letters as
> >non-word-breaks) unless the data was guaranteed really clean (and it never
> >is...).
> >
> >-Joe
>
> Keep in mind that some words incorporate non-alpha characters (e.g., "isn't").
> Languages other than English may give rise to other issues.
>
> >
> >On Tue, Jan 13, 2009 at 10:39 AM, Allen Ziegenfus <aziegenfus@anaxima.com> wrote:
> >
> >> Scan appears to do this too. I don't see any difference between:
> >>
> >> word = lowcase(compress(scan(dline,i," "),"!.?,' '"));
> >> word2 = lowcase(scan(dline,i,"!.?,' '"));
> >>
> >> -----Original Message-----
> >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Joe Matise
> >> Sent: Tuesday, 13 January 2009 17:19
> >> To: SAS-L@LISTSERV.UGA.EDU
> >> Subject: Re: Frequency count of words
> >>
> >> Adding compress eliminates the punctuation. You would probably be better
> >> off, in fact, doing:
> >>
> >> Compress(scan(dline,i),,"ak");
> >>
> >> which if I get my compress syntax right would only keep alphabet characters
> >> (or one of the other options if you want to keep numerics or whatnot).
> >>
> >>
> >> -Joe
> >>
> >> On Tue, Jan 13, 2009 at 6:38 AM, Allen Ziegenfus <aziegenfus@anaxima.com> wrote:
> >>
> >> > I just thought of that too! I wonder which is faster.
> >> >
> >> > What does adding compress do in your example?
> >> >
> >> > You could also use a datastep view so that you don't actually store the
> >> > words dataset.
> >> >
> >> > data work.tmp (keep=word) / view=work.tmp;
> >> >   set lyrics;
> >> >   word_index = 1;
> >> >   word = scan(dline, 1);
> >> >   do while (word ne "");
> >> >     output;
> >> >     word_index = word_index + 1;
> >> >     word = scan(dline, word_index);
> >> >   end;
> >> > run;
> >> >
> >> > proc summary data=tmp;
> >> >   class word;
> >> >   output out = word_count;
> >> > run;
> >> >
> >> > -----Original Message-----
> >> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Gerhard Hellriegel
> >> > Sent: Tuesday, 13 January 2009 13:26
> >> > To: SAS-L@LISTSERV.UGA.EDU
> >> > Subject: Re: Frequency count of words
> >> >
> >> > easier to understand for me (without hash-lists):
> >> >
> >> > data words;
> >> >   set lyrics;
> >> >   length word $50;
> >> >   i=1;
> >> >   word = lowcase(compress(scan(dline,i," "),"!.?,' '"));
> >> >   do while (word ne " ");
> >> >     i+1;
> >> >     output;
> >> >     word = lowcase(compress(scan(dline,i," "),"!.?,' '"));
> >> >   end;
> >> >   keep word;
> >> > run;
> >> >
> >> > proc summary;
> >> >   class word;
> >> >   output out=x;
> >> > run;
> >> > data count;
> >> >   set x;
> >> >   if _type_=0 then word = "SUM of ALL words";
> >> >   drop _type_;
> >> > run;
> >> >
> >> > Gerhard
> >> >
> >> > On Tue, 13 Jan 2009 13:01:06 +0100, Allen Ziegenfus <aziegenfus@ANAXIMA.COM> wrote:
> >> >
> >> > >Hi,
> >> > >
> >> > >Perhaps something like the following, although I am sure this code could be
> >> > >optimized. You might want to think about how you want to interpret word
> >> > >boundaries or whether it should be case sensitive or not.
> >> > >
> >> > >data _null_;
> >> > >  set lyrics end=dataeof;
> >> > >
> >> > >  if _n_ = 1 then do;
> >> > >    length word $100 count 8.;
> >> > >
> >> > >    declare hash wordcount();
> >> > >    wordcount.definekey("word");
> >> > >    wordcount.definedata("word");
> >> > >    wordcount.definedata("count");
> >> > >    wordcount.definedone();
> >> > >  end;
> >> > >
> >> > >  length word_index 8.;
> >> > >  word_index = 1;
> >> > >  word = scan(dline, word_index);
> >> > >  do while (word ne "");
> >> > >    count = 0;
> >> > >    rc = wordcount.find();
> >> > >    count = count + 1;
> >> > >    if rc = 0 then wordcount.replace();
> >> > >    else wordcount.add();
> >> > >    word_index = word_index + 1;
> >> > >    word = scan(dline, word_index);
> >> > >  end;
> >> > >
> >> > >  if dataeof then wordcount.output(dataset: 'work.word_count');
> >> > >run;
> >> > >
> >> > >-----Original Message-----
> >> > >From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Anindya Mozumdar
> >> > >Sent: Tuesday, 13 January 2009 11:50
> >> > >To: SAS-L@LISTSERV.UGA.EDU
> >> > >Subject: Frequency count of words
> >> > >
> >> > >All,
> >> > >  Supposing I have a dataset which is created this way -
> >> > >
> >> > >data lyrics;
> >> > >  infile datalines dsd dlm = "|" missover firstobs = 1;
> >> > >  input dline :$20000.;
> >> > >datalines;
> >> > >There I was completely wasting, out of work and down
> >> > >All inside its so frustrating as I drift from town to town
> >> > >Feel as though nobody cares if I live or die
> >> > >So I might as well begin to put some action in my life
> >> > >Breaking the law, breaking the law
> >> > >Breaking the law, breaking the law
> >> > >Breaking the law, breaking the law
> >> > >Breaking the law, breaking the law
> >> > >;
> >> > >run;
> >> > >
> >> > >What I want is a dataset called word_counts, containing two variables
> >> > >word and count which will be the number of times each word occurs in
> >> > >any line in the above dataset. For example, given the dataset lyrics,
> >> > >word_counts should contain
> >> > >
> >> > >word count
> >> > >completely 1
> >> > >breaking 8
> >> > >frustrating 1
> >> > >....
> >> > >
> >> > >Can any of you suggest a solution for this problem? Thanks in advance.
> >> > >
> >> > >Regards,
> >> > >Anindya
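
For reference, the point the thread converges on (split with scan only, and keep apostrophes so contractions such as "isn't" survive) might be sketched as below. This is not code posted by anyone in the thread. It assumes the LYRICS dataset and DLINE variable from the original question; the delimiter list, the $50 length, and the use of PROC FREQ instead of PROC SUMMARY are illustrative choices.

```sas
/* Sketch only: assumes the LYRICS dataset and DLINE variable from the
   original question. The delimiter list is an assumption; adjust it
   to the punctuation that actually occurs in the data. */
data words (keep=word);
   set lyrics;
   length word $50;
   i = 1;
   /* Blank, period, comma, ! and ? separate words. The apostrophe is
      deliberately left out of the list so "isn't" stays one word. */
   word = lowcase(scan(dline, i, ' .,!?'));
   do while (word ne ' ');
      output;
      i + 1;
      word = lowcase(scan(dline, i, ' .,!?'));
   end;
run;

/* One row per distinct word; COUNT is the number of occurrences. */
proc freq data=words noprint;
   tables word / out=word_counts (keep=word count);
run;
```

Because the words are lowercased before counting, "Breaking" and "breaking" fall into the same row, which matches the expected output in the original question.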

