Date: Mon, 24 Jan 2000 15:15:32 -0800
Reply-To: David Cassell <cassell@MERCURY.COR.EPA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David Cassell <cassell@MERCURY.COR.EPA.GOV>
Organization: OAO Corp.
Subject: Re: Count Words in Web Texts using SAS
David Ward wrote:
[attribution to OP lost]
> >The idea is to track down documents on the web that meet certain topic
> >characteristics (have keywords from list A and don't have keywords
> >from list B), then for each document count the number of times each
> >and every keyword from list C appears. Examples of
> >keyword in list C are "month" and "week."
>
> This code should get you started down the right path. Personally (I know I
> could get flak for this) I'd use Perl - the code would be really simple and
> this functionality is one of Perl's staples.
Look everyone! I wasn't the one who said it! [However, I agree.]
> filename web url 'http://search.yahoo.com:80/bin/search?p=sas' debug;
> data _null_;
> length url $200;
> infile web dsd dlm='>';
Using '>' as a delimiter on arbitrary HTML is fraught with danger, and
will break on a lot of well-formed HTML which also includes comments
and/or scripts. To really get this right, you have to have a proper
parser, or at least a good lexer. Of course, if this is on HTML pages
you control, you can prevent such a disaster.
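To make that failure mode concrete, here is a minimal sketch in Python (not SAS) on a made-up page. The names PAGE, naive_hrefs, LinkCollector and parsed_hrefs are invented for illustration, and the naive function only approximates what splitting on '>' does; it is not a line-for-line model of the INFILE statement above. The comparison uses the standard-library html.parser, which is a tokenizer rather than a validating parser.

```python
from html.parser import HTMLParser

# Made-up page: the '>' characters inside the comment and the script
# are not tag terminators, and the links there are not live links.
PAGE = """<html><body>
<!-- old link: <a href="http://example.com/old">gone</a> -->
<script>if (a > b) { document.write('<a href="http://example.com/js">x</a>'); }</script>
<a href="http://example.com/real">SAS page</a>
</body></html>"""


def naive_hrefs(page):
    """Split on '>' and take whatever follows href= in each chunk."""
    found = []
    for chunk in page.split(">"):
        pos = chunk.lower().find("href=")
        if pos != -1:
            found.append(chunk[pos + len("href="):].strip("\"'"))
    return found


class LinkCollector(HTMLParser):
    """Collect href values from genuine <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def parsed_hrefs(page):
    collector = LinkCollector()
    collector.feed(page)
    collector.close()
    return collector.links


print(naive_hrefs(PAGE))   # picks up the commented-out and scripted links too
print(parsed_hrefs(PAGE))  # only the one live link
```

On this sample the delimiter approach reports three URLs, two of which a browser would never show as links, while the tokenizer reports one.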
> input @'HREF=' url $;
> if url^='' then do;
> url=dequote(url);
> * ADD CODE TO PLACE THE :80 PORT STRING IN THE RIGHT PLACE *;
> nurls+1;
> call symput('url'||compress(nurls),trim(url));
> call symput('nurls',compress(nurls));
> end;
> run;
>
> You would, of course, need to add code to search the resulting URLs, which
> would probably introduce macro code but shouldn't be too difficult. You
> could loop through the resultant macro array of URLs, i.e.
> %do i = 1 %to &nurls;
> filename u url "&&url&i";
> data step ...;
>
> %end;
But my concern is that the original poster's query may not be
well-formed.
Do *all* appearances of the keywords count? Even in the alt or longdesc
attribute of an IMG tag? Or within larger words? Uppercase and lowercase
and mixed-case too? Wrapping over lines? Hyphenated? Perhaps we need
more of the desired spec.
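To show how much the answers matter, here is a small Python sketch (again not SAS) that counts the same keyword under different rules. The sample TEXT and the function names are made up for illustration, and the rules shown are only some of the possible readings of the spec.

```python
import re

# Made-up text with mixed case, a larger word containing the keyword,
# and a word hyphenated across a line break.
TEXT = (
    "This Month the bi-monthly report covers six months.\n"
    "Last week's totals and a WEEK of mid-\n"
    "week updates follow."
)


def count_substring(text, word):
    """Case-insensitive count of every occurrence, even inside larger words."""
    return text.lower().count(word.lower())


def count_whole_word(text, word, ignore_case=True):
    """Count occurrences bounded by non-word characters on both sides."""
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(r"\b" + re.escape(word) + r"\b", text, flags))


def rejoin_hyphenated(text):
    """Undo end-of-line hyphenation, so 'mid-' + newline + 'week' becomes 'midweek'."""
    return re.sub(r"-\n", "", text)


print(count_substring(TEXT, "month"))                       # 3
print(count_whole_word(TEXT, "month"))                      # 1
print(count_whole_word(TEXT, "month", ignore_case=False))   # 0
print(count_whole_word(TEXT, "week"))                       # 3
print(count_whole_word(rejoin_hyphenated(TEXT), "week"))    # 2
```

The same three lines of text give 3, 1 or 0 for "month" depending on the rule chosen, which is why the spec needs pinning down before anyone writes the counting step.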
If only text not inside tags is to be considered, an alternative may
be to pipe the page through lynx using the -dump option, and then feed
*that* text into the datastep. You wouldn't have to worry about parsing
out tags, since lynx would do it for you. Or, if you don't have the
lynx browser, you could pipe the page through a short P__l program.
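For the "short program" route, here is a rough sketch of the tag-stripping step, written in Python purely as a stand-in. It assumes the page has already been fetched into a string; SAMPLE, TextOnly, visible_text and keyword_counts are made-up names. It will not reproduce lynx -dump output exactly (for one thing, it keeps the TITLE text and does no layout), and it treats a "word" as a run of ASCII letters, which is itself a spec decision.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep text that appears outside tags, skipping script and style bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def visible_text(page):
    parser = TextOnly()
    parser.feed(page)
    parser.close()
    return " ".join(parser.parts)


def keyword_counts(page, keywords):
    """Whole-word, case-insensitive counts over the text outside tags."""
    words = re.findall(r"[a-z]+", visible_text(page).lower())
    tally = Counter(words)
    return {k: tally[k.lower()] for k in keywords}


# Made-up page: keywords also appear in an alt attribute, a style rule
# and a script, none of which should be counted.
SAMPLE = """<html><head><title>Report</title>
<style>p.week { color: red }</style></head>
<body><img src="m.gif" alt="month chart">
<p>Each week we post totals. This month had one short week.</p>
<script>var month = 1;</script>
</body></html>"""

print(keyword_counts(SAMPLE, ["month", "week"]))  # {'month': 1, 'week': 2}
```

The cleaned text (or the counts) could then be written out one record per line for the datastep to read.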
If anyone wants to ask me about doing this in Perl, ask me in private
email. Perl coding is off-topic here in the general case.
David
--
David Cassell, OAO Corp. cassell@mail.cor.epa.gov
Senior Computing Specialist
mathematical statistician