LISTSERV at the University of Georgia
Date:   Thu, 25 Jan 2001 09:13:11 -0500
Reply-To:   Bob Burnham <bburnham@DARTMOUTH.EDU>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Bob Burnham <bburnham@DARTMOUTH.EDU>
Organization:   Dartmouth College, Hanover, NH, USA
Subject:   Re: nasty text processing puzzle: SAS or Perl?

[This may be off-topic to some people, since I'm going to talk about the Perl aspects of the question. Oh well, just add me to your killfile :>)]

Howdy,

I always like finding interesting jobs where SAS and Perl can complement each other, and this looks like a good one. I'm not sure that you need to break down the comments into a list of all of the unique words -- especially since you probably just want to flag the offending comments so you can read them in context anyway.

TextPipe sounds like an interesting product, but fortunately you can get a list of unique words from a block of text using Perl in only a couple of lines. For example:

while(<COMMENTS>) {              # read from the comment file
    chomp;
    foreach(split) {             # split the line on whitespace
        $word{$_} = 1;           # set a hash key
    }
}
@unique = sort(keys(%word));     # get list of unique words
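A slightly more defensive variation on the snippet above, as a sketch only: splitting on whitespace treats "Hello," and "hello" as different words, so this version lowercases everything and keeps only runs of word characters. The `unique_words` helper is a made-up name for illustration, and it assumes that splitting contractions such as "don't" into two pieces is acceptable.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the sorted list
# of unique words in the given lines, lowercased and stripped of punctuation.
sub unique_words {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        # \w+ matches runs of letters, digits and underscores, so
        # "Hello," and "hello" both end up as the key "hello"
        $seen{ lc $1 } = 1 while $line =~ /(\w+)/g;
    }
    return sort keys %seen;
}

my @unique = unique_words("The cat sat.\n", "the CAT ran!\n");
print "@unique\n";    # prints: cat ran sat the
```

Whether to normalize case and punctuation depends on the data; for a quick look at what is in the comments, the shorter version may be all you need.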

Another way of looking at your main task is to build a regular expression that would match any of the nasty words that you want to check out. To do that, just join all of the words together, separated by a pipe character (regex alternation), and use the result as your search pattern. Then you can simply whip through the file and print out any line with an offending word. For example:

#!/usr/local/bin/perl

# open a list of 'bad words'
open(BADWORDS, "badwords.txt") || die "Error opening nasty word file.";

# read all of the nastiness into an array
@nasty_mean_words = <BADWORDS>;

close(BADWORDS);

# get rid of those pesky CR-LFs. . .
chomp(@nasty_mean_words);

# concatenate all of the words into a search string
$badwords = join '|', @nasty_mean_words;

# open our list of comments to search for horrors
open(COMMENTS, "comments.txt") || die "Error opening comments file.\n";

@comments = <COMMENTS>;

close(COMMENTS);

# create a loop to look at each comment
for($i = 0; $i < scalar(@comments); $i++) {
    # if we find something nasty, tell the world about it
    if($comments[$i] =~ /$badwords/) {
        printf("Oh no! A bad word on line #%d: %s\n", $i+1, $comments[$i]);
    }
}
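A few caveats about the joined pattern are worth keeping in mind. A blank line in the word list becomes an empty alternative that matches every comment. A word containing regex metacharacters (such as "." or "*") is interpreted as a pattern. And matching is by substring, so a short word also fires inside longer, innocent ones. The following is a minimal sketch of one way to tighten this up; `build_pattern` and `flag_lines` are hypothetical names, the sample words and comments are made up, and the `\b` anchors assume every listed word starts and ends with a letter or digit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): builds one compiled,
# case-insensitive pattern that matches any listed word as a whole word.
sub build_pattern {
    my @words = grep { length } @_;      # skip blank entries
    die "No words to search for\n" unless @words;
    # quotemeta escapes regex metacharacters so each word matches literally
    my $alternation = join '|', map { quotemeta($_) } @words;
    return qr/\b(?:$alternation)\b/i;
}

# Returns [line number, text] pairs for every line that matches the pattern.
sub flag_lines {
    my ($pattern, @lines) = @_;
    my @hits;
    for my $i (0 .. $#lines) {
        push @hits, [ $i + 1, $lines[$i] ] if $lines[$i] =~ $pattern;
    }
    return @hits;
}

my $pattern  = build_pattern('darn', 'heck');
my @comments = ('Nice product.', 'What the HECK is this?', 'Heckler in row 3.');

# prints: Line 2: What the HECK is this?
printf "Line %d: %s\n", @$_ for flag_lines($pattern, @comments);
```

Compiling the pattern once with qr// also avoids re-interpolating the word list on every comparison, which may matter for a long list and a large comment file.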

Just my two cents. . .

Good luck and best regards,

Bob

