Date: Mon, 15 Jan 2007 00:04:49 -0800
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Extracting word(s) occurring in text before a certain keyword
In-Reply-To: <200701141719.l0EBk3Lr007245@mailgw.cc.uga.edu>
Content-Type: text/plain; format=flowed
art297@NETSCAPE.NET sagely replied:
>---------------
>On Sun, 14 Jan 2007 05:29:53 -0800, Hakan Ener <hakanener99@YAHOO.COM>
>wrote:
>
> > Hello,
> >
> > I could not find a general solution to what I'm
> >trying to do when analyzing a character variable that
> >contains unstructured text.
> >
> > Each observation contains a paragraph of text
> >(multiple sentences separated by period), where names
> >of certain companies are mentioned, such as "Microsoft
> >Inc." or "Advanced Micro Devices Corp." within
> >sentences. I want to extract the company name that
> >precedes "Inc." or "Corp." in this text. Considering
> >that company names may contain any number of words
> >(each of which have a capital first letter), and that
> >an observation may contain any number of company names
> >one after the other, is there a suggestion to handle
> >this coding such that the result will be a horizontal
> >array of full company names mentioned in the source
> >field?
> >
> >Thank you,
> >
> >Hakan Ener
> >France
> >
>
>Hakan,
>
>I doubt that your data are clean enough for the following approach to work
>as is but, even if not, you might be able to modify it to meet your needs.
>It contains a number of trim and left function calls that likely are not
>needed, but don't appear to hurt.
>
>The data I used, while it will likely wrap in the post, is simply your
>post, on one line, repeated 3 times.
>
>Art
>--------
>data have;
> infile cards truncover;
> format thetext $1000.;
> input thetext $1000.;
> cards;
>Each observation contains a paragraph of text (multiple sentences
>separated by period), where names of certain companies are mentioned, such
>as Microsoft Inc. or Advanced Micro Devices Corp. within sentences. I want
>to extract the company name that precedes "Inc." or "Corp." in this text.
>Considering that company names may contain any number of words (each of
>which have a capital first letter), and that an observation may contain
>any number of company names one after the other, is there a suggestion to
>handle this coding such that the result will be a horizontal array of full
>company names mentioned in the source field?
>Each observation contains a paragraph of text (multiple sentences
>separated by period), where names of certain companies are mentioned, such
>as Microsoft Inc. or Advanced Micro Devices Corp. within sentences. I want
>to extract the company name that precedes "Inc." or "Corp." in this text.
>Considering that company names may contain any number of words (each of
>which have a capital first letter), and that an observation may contain
>any number of company names one after the other, is there a suggestion to
>handle this coding such that the result will be a horizontal array of full
>company names mentioned in the source field?
>Each observation contains a paragraph of text (multiple sentences
>separated by period), where names of certain companies are mentioned, such
>as Microsoft Inc. or Advanced Micro Devices Corp. within sentences. I want
>to extract the company name that precedes "Inc." or "Corp." in this text.
>Considering that company names may contain any number of words (each of
>which have a capital first letter), and that an observation may contain
>any number of company names one after the other, is there a suggestion to
>handle this coding such that the result will be a horizontal array of full
>company names mentioned in the source field?
>;
>run;
>data want (keep=Company);
>  set have;
>  format x $150.;
>  format Company $150.;
>  format HoldCompany $150.;
>  do i=1 to 1000;
>    x=trim(left(scan(thetext,-i,' ')));
>    if x eq '' then i=1000;
>    else do;
>      if x in ('Inc.','Corp.') then do;
>        stopit=0;
>        j=0;
>        HoldCompany=x;
>        do until (stopit ne 0);
>          x=trim(left(scan(thetext,-(i+j+1),' ')));
>          if x in ('Inc.','Corp.')
>            or not(anyupper(substr(x,1,1))) then do;
>            i=i+j;
>            output;
>            Company='';
>            HoldCompany='';
>            stopit=1;
>          end;
>          else do;
>            Company=catx(' ',x,HoldCompany);
>            HoldCompany=Company;
>            j+1;
>          end;
>        end;
>      end;
>    end;
>  end;
>run;
>
I would take this approach and adapt it a bit.
First, get the data read in, so that the strings are in your variable
THETEXT. Then use regular expressions. It's late, so I'm not
writing an entire app, but the regex I would start with would be
something like this:
re = prxparse('/([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*\s+(?:Inc\.|Corp\.))/');
This may get you started...
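
A minimal sketch of how such a regex might be applied, not a finished
program. It assumes the HAVE data set and THETEXT variable from Art's
code above, escapes the periods so they match literally, and uses
CALL PRXNEXT to step through every match in each observation. The names
WANT, COMPANY, START, STOP, POS and LEN are only illustrative choices.

```sas
data want (keep=Company);
   set have;
   length Company $150;
   retain re;

   /* Compile the pattern once, on the first observation only. */
   if _n_ = 1 then
      re = prxparse('/([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*\s+(?:Inc\.|Corp\.))/');

   /* Search window: the whole (trimmed) string. */
   start = 1;
   stop  = length(thetext);

   /* PRXNEXT returns the position and length of the next match
      and advances START past it; POS is 0 when no match is left. */
   call prxnext(re, start, stop, thetext, pos, len);
   do while (pos > 0);
      Company = substr(thetext, pos, len);
      output;   /* one row per company name found */
      call prxnext(re, start, stop, thetext, pos, len);
   end;
run;
```

Like Art's code, this writes one row per name rather than a horizontal
array, and Company keeps the trailing "Inc." or "Corp.". A capitalized
word that happens to sit directly before a company name (the first word
of a sentence, say) would be swept into the match, so real data will
likely need a more careful pattern.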
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330