LISTSERV at the University of Georgia
Date:         Mon, 24 Jan 2000 15:15:32 -0800
Reply-To:     David Cassell <cassell@MERCURY.COR.EPA.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         David Cassell <cassell@MERCURY.COR.EPA.GOV>
Organization: OAO Corp.
Subject:      Re: Count Words in Web Texts using SAS
Content-Type: text/plain; charset=us-ascii

David Ward wrote:
[attribution to OP lost]
> >The idea is to track down documents on the web that meet certain topic
> >characteristics (have keywords from list A and don't have keywords
> >from list B), then for each document count the number of times each
> >and every keyword from list C appears. Examples of
> >keyword in list C are "month" and "week."
>
> This code should get you started down the right path. Personally (I know I
> could get flak for this) I'd use Perl - the code would be really simple and
> this functionality is one of Perl's staples.

Look everyone! I wasn't the one who said it! [However, I agree.]

> filename web url 'http://search.yahoo.com:80/bin/search?p=sas' debug;
> data _null_;
>   length url $200;
>   infile web dsd dlm='>';

Using '>' as a delimiter on arbitrary HTML is fraught with danger, and will break on a lot of well-formed HTML which also includes comments and/or scripts. To really get this right, you have to have a proper parser, or at least a good lexer. Of course, if this is on HTML pages you control, you can prevent such a disaster.
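To make the failure mode concrete, here is a minimal sketch in Python (not SAS, and not anyone's posted code). It assumes a made-up page fragment, `PAGE`, and two hypothetical helpers: `naive_hrefs`, which only loosely mimics splitting on '>' and grabbing what follows HREF=, and `parsed_hrefs`, which uses the standard library's `html.parser` as a stand-in for "a proper parser".

```python
from html.parser import HTMLParser

# Hypothetical page fragment: a comment and a script both contain '>'
# and both contain something that looks like a link.
PAGE = """<html><body>
<!-- old link: <a href="http://example.com/old">gone</a> -->
<script>if (a > b) { document.write('<a href="http://example.com/js">x</a>'); }</script>
<a href="http://example.com/real">Real link</a>
</body></html>"""


def naive_hrefs(page):
    """Loosely mimic '>'-delimited reading: cut on '>' and take what follows href=."""
    found = []
    for chunk in page.split(">"):
        pos = chunk.lower().find("href=")
        if pos != -1:
            found.append(chunk[pos + len("href="):].strip("\"'"))
    return found


class HrefCollector(HTMLParser):
    """Collect href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parsed_hrefs(page):
    collector = HrefCollector()
    collector.feed(page)
    collector.close()
    return collector.hrefs


print(naive_hrefs(PAGE))   # includes the commented-out and scripted links
print(parsed_hrefs(PAGE))  # only the link that is actually in the page body
```

The naive version reports the commented-out link and the one buried in the script as if they were live, while the parser skips both because it knows it is inside a comment or a script element.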

> input @'HREF=' url $;
> if url^='' then do;
>   url=dequote(url);
>   * ADD CODE TO PLACE THE :80 PORT STRING IN THE RIGHT PLACE *;
>   nurls+1;
>   call symput('url'||compress(nurls),trim(url));
>   call symput('nurls',compress(nurls));
> end;
> run;
>
> You would, of course, need to add code to search the resulting URLs, which
> would probably introduce macro code but shouldn't be too difficult. You
> could loop through the resultant macro array of URLs, i.e.
> %do i = 1 %to &nurls;
>   filename u url "&&url&i";
>   data step ...;
>
> %end;

But my concern is that the original poster's query may not be well-formed. Do *all* appearances of the keywords count? Even in the alt or longdesc attribute of an IMG tag? Or within larger words? Uppercase and lowercase and mixed-case too? Wrapping over lines? Hyphenated? Perhaps we need more of the desired spec.
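As a rough illustration of how much the answers to those questions matter, here is a small Python sketch. The function `count_keyword`, its `whole_word` and `ignore_case` switches, and the `SAMPLE` text are all invented for this example; they are assumptions about what a spec might say, not the original poster's requirements.

```python
import re


def count_keyword(text, keyword, whole_word=True, ignore_case=True):
    """Count occurrences of keyword under one particular reading of the spec."""
    pattern = re.escape(keyword)
    if whole_word:
        # \b means "month" will not match inside "monthly" or "bimonthly".
        pattern = r"\b" + pattern + r"\b"
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(pattern, text, flags))


# Made-up text with mixed case, longer words, and a word hyphenated over a line.
SAMPLE = (
    "Month by month, the monthly report covers each week.\n"
    "A week-\n"
    "end edition appears bimonthly."
)

for whole_word in (True, False):
    for ignore_case in (True, False):
        n = count_keyword(SAMPLE, "month", whole_word, ignore_case)
        print(f"whole_word={whole_word!s:5} ignore_case={ignore_case!s:5} -> {n}")
```

The same keyword on the same text gives four different counts depending on the two switches. Note also that the hyphenated "week-" at the end of a line is counted as a hit for "week" under the whole-word rule, which may or may not be what is wanted.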

If only text not inside tags is to be considered, an alternative may be to pipe the page through lynx using the -dump option, and then feed *that* text into the datastep. You wouldn't have to worry about parsing out tags, since lynx would do it for you. Or, if you don't have the lynx browser, you could pipe the page through a short P__l program.
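For anyone without lynx, the "only text not inside tags" idea can be sketched in a few lines of Python using the standard library's `html.parser`. This is a stand-in for the approach, not a reproduction of what lynx -dump prints; the class `VisibleText`, the helper `visible_text`, and the sample `PAGE` are made up for the example.

```python
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Keep character data that is outside tags, skipping script and style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_text(page):
    parser = VisibleText()
    parser.feed(page)
    parser.close()
    # Join pieces with a space and collapse whitespace. This is crude: a word
    # split by inline markup (e.g. <b>m</b>onth) would come out as two words.
    return " ".join(" ".join(parser.parts).split())


# Made-up page: "month" also appears in an alt attribute and in a script.
PAGE = (
    '<html><head><style>p { color: red }</style></head>'
    '<body><img src="cal.gif" alt="month view">'
    '<p>Sales this <b>month</b> beat last month.</p>'
    '<script>var month = 1;</script></body></html>'
)

text = visible_text(PAGE)
print(text)
print(len(re.findall(r"\bmonth\b", text, re.IGNORECASE)))
```

The resulting plain text is what you would then read into the DATA step (or count directly, as in the last line), with the alt attribute and the script variable no longer in the way.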

If anyone wants to ask me about doing this in Perl, ask me in private email. Perl coding is off-topic here in the general case.

David
--
David Cassell, OAO Corp.                  cassell@mail.cor.epa.gov
Senior Computing Specialist
mathematical statistician

