| Date: | Thu, 25 Jan 2001 09:13:11 -0500 |
| Reply-To: | Bob Burnham <bburnham@DARTMOUTH.EDU> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Bob Burnham <bburnham@DARTMOUTH.EDU> |
| Organization: | Dartmouth College, Hanover, NH, USA |
| Subject: | Re: nasty text processing puzzle: SAS or Perl? |
|---|
[This may be off-topic to some people, since I'm going to talk about the
Perl aspects of the question. Oh well, just add me to your killfile :>)]
Howdy,
I always like finding interesting jobs where SAS and Perl can complement
each other, and this looks like a good one. I'm not sure that you need to
break down the comments into a list of all of the unique words -- especially
since you probably want to just flag them so you can read them in context
anyway.
TextPipe sounds like an interesting product, but fortunately you can get a
list of unique words from a block of text using Perl in only a couple of
lines. For example:
    while(<COMMENTS>) {              # read from the comment file
        chomp;
        foreach(split) {             # split the line on white space
            $word{$_} = 1;           # set a hash key
        }
    }
    @unique = sort(keys(%word));     # get list of unique words
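If you want to try the same idea without a file, here is a minimal, self-contained sketch that works on an in-memory list of lines. The sub name unique_words and the sample comments are made up for illustration. Lowercasing and trimming punctuation at the edges of each word are my assumptions about what would be useful here, not something the original problem asks for.

```perl
use strict;
use warnings;

# Hypothetical helper: return the sorted unique words found in a list of lines.
sub unique_words {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        for my $word (split ' ', $line) {
            $word = lc $word;            # assumption: ignore case
            $word =~ s/^\W+|\W+$//g;     # assumption: trim punctuation at the edges
            $seen{$word} = 1 if length $word;
        }
    }
    my @words = sort keys %seen;
    return @words;
}

# Made-up sample data
my @sample = ("The service was slow.", "Slow service, the worst!");
print join(' ', unique_words(@sample)), "\n";   # prints: service slow the was worst
```

Without the two normalization steps, "Slow", "slow." and "slow" would all count as different words, which may or may not matter for your data.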
Another way of looking at your main task is to build a regular expression
that would match any of the nasty words that you want to check out. To do
that, just join all of the words together separated by a pipe character and
use that as your search pattern. Then you can simply whip through the file
and print out any line with an offending word. For example:
    #!/usr/local/bin/perl
    # open a list of 'bad words' (one per line)
    open(BADWORDS, "badwords.txt") ||
        die "Error opening nasty word file: $!\n";
    # read all of the nastiness into an array
    @nasty_mean_words = <BADWORDS>;
    close(BADWORDS);
    # get rid of those pesky CR-LFs. . .
    chomp(@nasty_mean_words);
    # concatenate all of the words into a search string
    $badwords = join '|', @nasty_mean_words;
    # open our list of comments to search for horrors
    open(COMMENTS, "comments.txt") ||
        die "Error opening comments file: $!\n";
    @comments = <COMMENTS>;
    close(COMMENTS);
    # strip the newlines so the printf below doesn't double-space the output
    chomp(@comments);
    # create a loop to look at each comment
    for($i = 0; $i < scalar(@comments); $i++) {
        # if we find something nasty, tell the world about it
        if($comments[$i] =~ /$badwords/) {
            printf("Oh no! A bad word on line #%d: %s\n",
                   $i+1, $comments[$i]);
        }
    }
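One caveat with a plain join: the pattern matches substrings and ignores nothing about case, and any regex metacharacters in the word list get interpreted as regex syntax. The following is a sketch of one way to tighten that up, not the only way. The sub name build_pattern and the sample words and comments are invented for illustration; it assumes you want whole-word, case-insensitive matches on literal words.

```perl
use strict;
use warnings;

# Hypothetical helper: compile a word list into one case-insensitive pattern.
sub build_pattern {
    my @words = @_;
    # quotemeta escapes characters such as '.', '*' or '(' that would
    # otherwise be read as regex syntax
    my $alternation = join '|', map { quotemeta($_) } @words;
    # \b ... \b limits matches to whole words, so "darn" won't hit "darning"
    return qr/\b(?:$alternation)\b/i;
}

# Made-up sample data
my $pattern = build_pattern('darn', 'heck');
my @sample  = ("What the HECK is this?", "She was darning socks.", "Darn it!");

for my $i (0 .. $#sample) {
    print "line ", $i + 1, ": $sample[$i]\n" if $sample[$i] =~ $pattern;
}
```

With this sample data the first and third comments are reported and the second is skipped. Compiling the pattern once with qr// also saves Perl from rebuilding it for every comment.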
Just my two cents. . .
Good luck and best regards,
Bob