Date: Thu, 13 Oct 2005 11:13:29 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Text cleaning
Content-Type: text/plain; format=flowed
emoorthy@BLUESINGAPORE.COM wrote back:
>Thanks a lot for your time...Yes..it fails for the word CO in
>which I don't wanna trim!! :((
>You have missed the %'s while creating the criteria dataset! Actually I have
>a standard file which has the criterias to be eliminated..with %'s
>included..I cannot change that file as it has 3500+ entries..! Since it
>has a %, the oracle guys clean the company names using a function and a
>cursor and a LIKE clause.. I've to do the same using SAS..
>Surprised that Gurus are busy!! A small hint can help me a lot..thanks in
>advance! Is there any possibility to use a LIKE with a IF??
>I'm goin Mad now! :(
So... you're saying that someone else has gotten the 'cleaning' file into a
form that's convenient to use in SQL, and they do a full Cartesian join and
check every record of the cleaning file against every record of your data
file? Ick. Oh, pardon me, they're using a cursor to make things slower,
and doing a complete search through the 'cleaning' file against every record
of the data file.
Well, you could go ahead and do either of those. The first could be a full
join using PROC SQL. The second could be a DATA step using the POINT=
option on the SET statement.
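For reference, the first option (a full join in PROC SQL with LIKE) might be sketched roughly as below. The data set names COMPANIES and CLEANING, the variables NAME and PATTERN, and the sample rows are all made up for illustration; only the shape of the join matters. LIKE is available in PROC SQL and in WHERE clauses, but not in a DATA step IF statement.

```sas
/* Hypothetical company names to be checked. */
data companies;
   infile datalines truncover;
   input name $40.;
   datalines;
ACME PTY LTD
BLUE CO-OP BANK
;
run;

/* Hypothetical cleaning file, percent signs left in place. */
/* $CHAR keeps any leading blank in front of the text.      */
data cleaning;
   infile datalines truncover;
   input pattern $char40.;
   datalines;
% PTY LTD%
%PTY LTD %
;
run;

/* Cartesian join: every name is tested against every pattern. */
proc sql;
   create table flagged as
   select distinct d.name
   from companies as d, cleaning as c
   where upcase(d.name) like trim(c.pattern);
quit;
```

Every name is compared with every pattern, so the work grows as (data rows) x (cleaning rows), which is the objection raised above.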
I wouldn't do either. It seems to me that the optimal solution is to:
 fix your 'cleaning' file so that replication like:
% PTY.LTD %
% PTY LTD%
%PTY LTD %
% PTY LTD %
is replaced with a regular expression, like:
This would also cut your 3500+ entry cleaning file down to something more
manageable. But it would require some work.
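As an illustration only, the four entries listed above could plausibly be folded into one Perl regular expression and used through the PRX functions. The pattern below is an assumption, not necessarily the one intended in this message, and the data set COMPANIES with its variable NAME is hypothetical.

```sas
/* Hypothetical sample data. */
data companies;
   infile datalines truncover;
   input name $40.;
   datalines;
ACME PTY LTD
ACME PTY.LTD HOLDINGS
EMPTY LTDA
;
run;

data cleaned;
   set companies;
   length clean $40;
   retain re;
   /* Compile once. The pattern is an assumed stand-in: PTY, then a */
   /* period or a blank, then LTD, as whole words, ignoring case.   */
   if _n_ = 1 then re = prxparse('/\bPTY[. ]LTD\b/i');
   hit = (prxmatch(re, name) > 0);          /* 1 if the text is present */
   /* Remove every occurrence, then squeeze repeated blanks. */
   clean = compbl(prxchange('s/\bPTY[. ]LTD\b//i', -1, name));
   drop re;
run;
```

The \b word boundaries stand in for the blanks on either side of the text in the LIKE patterns, so that PTY inside a longer word such as EMPTY is not treated as a match.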
 Then you could read each line of the cleaning file into a data step and
turn it into a new regex. Once you have done that, you could read the data
file and check it against each regex.
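One way this step might look, as a sketch under assumptions: the reworked cleaning file is taken to be a data set REGEXES with one Perl substitution per row in a variable PATTERN, the data file is COMPANIES with a variable NAME, and the array size of 5000 is just a guess at an upper bound. None of these names come from the original thread.

```sas
/* Hypothetical reworked cleaning file: one substitution per row. */
data regexes;
   infile datalines truncover;
   input pattern $200.;
   datalines;
s/\bPTY[. ]LTD\b//i
s/\bINC\b//i
;
run;

/* Hypothetical data file. */
data companies;
   infile datalines truncover;
   input name $40.;
   datalines;
ACME PTY LTD
WIDGETS INC
;
run;

data cleaned(drop=pattern k n_pat);
   array re[5000] _temporary_;        /* compiled regex ids (assumed bound) */
   /* First iteration only: read the whole regex file and compile it. */
   if _n_ = 1 then do until (eof);
      set regexes end=eof;
      n_pat + 1;                      /* sum statement, retained */
      re[n_pat] = prxparse(strip(pattern));
   end;
   /* Then apply every compiled regex to each record of the data file. */
   set companies;
   do k = 1 to n_pat;
      name = compbl(prxchange(re[k], -1, name));
   end;
run;
```

Each pattern is compiled once and reused for every record, so only the matching itself is repeated per row.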
Alternatively, you could leave the cleaning file as is and end up with a
large number of checks to make on each record. Since each of the lines in your
cleaning file appears to be simple text instead of using the more sophisticated
features of LIKE, you could just drop the percent signs and do a
straightforward INDEX() to check for each string.
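A rough sketch of the INDEX() route, assuming every entry is plain text wrapped in percent signs with none in the middle. The data sets CLEANING (variable RAW) and COMPANIES (variable NAME), the array bound of 5000, and the blank-padding of each name are illustrative choices, not details from the original thread.

```sas
/* Hypothetical cleaning file, left as is. $CHAR keeps leading blanks. */
data cleaning;
   infile datalines truncover;
   input raw $char40.;
   datalines;
% PTY.LTD %
% PTY LTD%
;
run;

/* Hypothetical data file. */
data companies;
   infile datalines truncover;
   input name $40.;
   datalines;
ACME PTY LTD
BLUE CO-OP BANK
;
run;

data flagged(keep=name hit);
   array txt[5000] $40 _temporary_;   /* search text without the % signs */
   array len[5000] _temporary_;       /* its true length, blanks included */
   if _n_ = 1 then do until (eof);
      set cleaning end=eof;
      n + 1;
      /* Each entry ends in %, so LENGTH(raw) counts any blank before it. */
      len[n] = length(raw) - countc(raw, '%');
      txt[n] = compress(raw, '%');
   end;
   set companies;
   /* Pad with blanks so text at the start or end of a name can match. */
   padded = ' ' || trim(upcase(name)) || ' ';
   hit = 0;
   do k = 1 to n while (not hit);
      if len[k] > 0 then
         hit = (index(padded, substr(txt[k], 1, len[k])) > 0);
   end;
run;
```

The stored length matters because a blank just before the closing percent sign is part of the search text, and a plain TRIM() would throw it away.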
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330