LISTSERV at the University of Georgia
Date:         Mon, 15 Jan 2007 00:04:49 -0800
Reply-To:     David L Cassell <davidlcassell@MSN.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         David L Cassell <davidlcassell@MSN.COM>
Subject:      Re: Extracting word(s) occurring in text before a certain keyword
In-Reply-To:  <200701141719.l0EBk3Lr007245@mailgw.cc.uga.edu>
Content-Type: text/plain; format=flowed

art297@NETSCAPE.NET sagely replied:
>---------------
>On Sun, 14 Jan 2007 05:29:53 -0800, Hakan Ener <hakanener99@YAHOO.COM>
>wrote:
>
> > Hello,
> >
> > I could not find a general solution to what I'm
> >trying to do when analyzing a character variable that
> >contains unstructured text.
> >
> > Each observation contains a paragraph of text
> >(multiple sentences separated by period), where names
> >of certain companies are mentioned, such as "Microsoft
> >Inc." or "Advanced Micro Devices Corp." within
> >sentences. I want to extract the company name that
> >precedes "Inc." or "Corp." in this text. Considering
> >that company names may contain any number of words
> >(each of which have a capital first letter), and that
> >an observation may contain any number of company names
> >one after the other, is there a suggestion to handle
> >this coding such that the result will be a horizontal
> >array of full company names mentioned in the source
> >field?
> >
> >Thank you,
> >
> >Hakan Ener
> >France
> >

>
>Hakan,
>
>I doubt if your data is sufficiently clean to allow the following approach
>but, if not, you might be able to modify it to meet your needs. It
>contains a number of trim and left statements that likely are not needed,
>but don't appear to hurt.
>
>The data I used, while it will likely wrap in the post, is simply your
>post, on one line, repeated 3 times.
>
>Art
>--------
>data have;
>  infile cards truncover;
>  format thetext $1000.;
>  input thetext $1000.;
>  cards;
>Each observation contains a paragraph of text (multiple sentences
>separated by period), where names of certain companies are mentioned, such
>as Microsoft Inc. or Advanced Micro Devices Corp. within sentences. I want
>to extract the company name that precedes "Inc." or "Corp." in this text.
>Considering that company names may contain any number of words (each of
>which have a capital first letter), and that an observation may contain
>any number of company names one after the other, is there a suggestion to
>handle this coding such that the result will be a horizontal array of full
>company names mentioned in the source field?
>Each observation contains a paragraph of text (multiple sentences
>separated by period), where names of certain companies are mentioned, such
>as Microsoft Inc. or Advanced Micro Devices Corp. within sentences. I want
>to extract the company name that precedes "Inc." or "Corp." in this text.
>Considering that company names may contain any number of words (each of
>which have a capital first letter), and that an observation may contain
>any number of company names one after the other, is there a suggestion to
>handle this coding such that the result will be a horizontal array of full
>company names mentioned in the source field?
>Each observation contains a paragraph of text (multiple sentences
>separated by period), where names of certain companies are mentioned, such
>as Microsoft Inc. or Advanced Micro Devices Corp. within sentences. I want
>to extract the company name that precedes "Inc." or "Corp." in this text.
>Considering that company names may contain any number of words (each of
>which have a capital first letter), and that an observation may contain
>any number of company names one after the other, is there a suggestion to
>handle this coding such that the result will be a horizontal array of full
>company names mentioned in the source field?
>;
>run;
>data want (keep=Company);
>  set have;
>  format x $150.;
>  format Company $150.;
>  format HoldCompany $150.;
>  do i=1 to 1000;
>    x=trim(left(scan(thetext,-i,' ')));
>    if x eq '' then i=1000;
>    else do;
>      if x in ('Inc.','Corp.') then do;
>        stopit=0;
>        j=0;
>        HoldCompany=x;
>        do until (stopit ne 0);
>          x=trim(left(scan(thetext,-(i+j+1),' ')));
>          if x in ('Inc.','Corp.')
>           or not(anyupper(substr(x,1,1))) then do;
>            i=i+j;
>            output;
>            Company='';
>            HoldCompany='';
>            stopit=1;
>          end;
>          else do;
>            Company=catx(' ',x,HoldCompany);
>            HoldCompany=Company;
>            j+1;
>          end;
>        end;
>      end;
>    end;
>  end;
>run;
>

I would take this approach and adapt it a bit.

First, get the data read in, so that the strings are in your variable THETEXT. Then use regular expressions. It's late, so I'm not writing an entire app, but the regex I would start with would be something like this:

re = prxparse('/([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*\s+(?:Inc\.|Corp\.))/');
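
To show how a pattern like this might be put to work, here is a rough sketch that loops over each THETEXT value with CALL PRXNEXT and drops every match into a horizontal array. It is only a sketch under some assumptions: the input data set is HAVE with the variable THETEXT, as in Art's code; the array name Company and its size of 10 are arbitrary choices for illustration; and the periods in "Inc." and "Corp." are escaped here so they match literally.

```sas
data want (keep=Company1-Company10);
   set have;
   /* Assumption: at most 10 company names per observation */
   array Company{10} $150;
   retain re;
   /* Compile the pattern once; periods are escaped to match literally */
   if _n_ = 1 then
      re = prxparse('/([A-Z][a-z]+(?:\s*[A-Z][a-z]+)*\s+(?:Inc\.|Corp\.))/');
   start = 1;
   stop  = length(thetext);
   n = 0;
   /* PRXNEXT returns the position and length of the next match and */
   /* advances START past it; POS is 0 when no further match exists */
   call prxnext(re, start, stop, thetext, pos, len);
   do while (pos > 0 and n < dim(Company));
      n = n + 1;
      Company{n} = substr(thetext, pos, len);
      call prxnext(re, start, stop, thetext, pos, len);
   end;
run;
```

Each extracted value keeps its "Inc." or "Corp." suffix, because the pattern matches the name and the suffix together. Any capitalized word that happens to sit directly in front of a company name gets swept into the match too, so real data will probably need a tighter pattern.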

This may get you started...

David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
