LISTSERV at the University of Georgia
Date:   Fri, 30 Jul 2010 16:32:18 +0000
Reply-To:   toby dunn <tobydunn@HOTMAIL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   toby dunn <tobydunn@HOTMAIL.COM>
Subject:   Re: Perl Regular Expression question
Comments:   To: i89rt5@gmail.com
In-Reply-To:   <201007292346.o6TGsHXJ021504@willow.cc.uga.edu>
Content-Type:   text/plain; charset="iso-8859-1"

Jerry,

I'm not totally sure this gets you 100% of the way there, but it provides the basic framework you can use to take your initial file and create the end result you want.

/* Read the long-form data, one condition per row */
Data Have ;
  Length Value $ 200 ;
  Infile Cards DLM = "|" ;
  Input ID $ Source $ Qualifier $ Value $ Logic $ ;
Cards ;
1|GPI |LIKE|66100052, 66100053|or
1|GPI |LIKE|66100055|or
1|GPI |LIKE|66100065, 66100066|and
1|ICD9|LIKE|V852, V853|or
2|ICD3|IN |278, 279|or
2|ICD3|IN |288, 289|or
2|ICD5|IN |27802, 27803|or
2|GPI |LIKE|66100055, 66100052|or
3|ICD9|LE |1398|or
3|ICD9|LIKE|48[0-7]|or
3|ICD9|LIKE|46|or
3|ICD9|LIKE|7955|or
3|ICD9|LIKE|7907|or
3|ICD9|LIKE|68[126]|or
3|ICD9|LIKE|599|or
4|ICD4|IN |4771,6931,6938,6939,6925|or
4|ICD5|IN |V1501,V1502,V1503,V1504,V1505|or
5|GPI |IN |66250050100320, 21300050100310|or
5|CPT |IN |J8610, J9260, J9250|or
6|ICD9|IN |12345,23456|or
6|ICD3|IN |123,456,789|or
7|GPI |LIKE|99406010|or
7|GPI |LIKE|21101020|or
7|CPT |IN |J7500|or
7|CPT |IN |J7501|or
7|CPT |IN |J9093|or
7|CPT |IN |J9097|or
7|CPT |IN |J8530|or
8|ICD9|IN |73316,823|or
8|ICD9|LIKE|8230|or
8|ICD9|LIKE|8232|or
8|ICD9|LIKE|8238|or
;
Run ;

/* Assign the output order: ICD3, ICD5, ICD9, GPI, CPT.         */
/* Any other source (ICD4, for example) falls through to 6.     */
Data Have ;
  Set Have ;
  If UpCase( Source ) = 'ICD3' Then SortOrder = 1 ;
  Else If UpCase( Source ) = 'ICD5' Then SortOrder = 2 ;
  Else If UpCase( Source ) = 'ICD9' Then SortOrder = 3 ;
  Else If UpCase( Source ) = 'GPI' Then SortOrder = 4 ;
  Else If UpCase( Source ) = 'CPT' Then SortOrder = 5 ;
  Else SortOrder = 6 ;
Run ;

Proc Sort Data = Have ;
  By ID SortOrder Qualifier ;
Run ;

/* DoW loop: collect all rows of an ID into one observation */
Data Need ;
  Length Description $ 200 ;

  Do I = 1 By 1 Until( Last.ID ) ;
    Set Have ;
    By ID SortOrder Qualifier ;

    /* Start a new Source/Qualifier group, or append to the current one */
    If First.Qualifier Then Do ;
      Description = CatX( ' ' , Description , Source , Qualifier , Value ) ;
    End ;
    Else Do ;
      Description = CatX( ' , ' , Description , Value ) ;
    End ;

    /* Add the connector after each group except the last one of the ID */
    If ( Last.Qualifier And Not( Last.ID ) ) Then Do ;
      Description = CatX( ' ' , Description , Logic ) ;
    End ;
  End ;
Run ;

Proc Print Data = Need ;
Run ;
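
The steps above leave three of the requests from earlier in the thread open: "LIKE" should read "start with", ICD3/ICD4/ICD5 should read "ICD9 (first n digits)", and the connector should be wrapped in brackets. One way that might be layered on is sketched below. It assumes the Have data set already carries SortOrder as built above; the data set names Relabeled and Need2 and the variables SourceLabel, QualLabel and LogicLabel are made up for illustration and have not been run against the real data.

```sas
/* Sketch only: Relabeled, Need2, SourceLabel, QualLabel and LogicLabel */
/* are hypothetical names, not part of the original code.               */
Data Relabeled ;
  Length SourceLabel $ 30 QualLabel $ 12 LogicLabel $ 5 ;
  Set Have ;

  /* ICD1 - ICD8 become "ICD9 (first n digits)", everything else is kept */
  If PrxMatch( '/^ICD[1-8]\s*$/i' , Source ) Then
    SourceLabel = CatX( ' ' , 'ICD9 (first' , SubStr( Source , 4 , 1 ) , 'digits)' ) ;
  Else SourceLabel = Source ;

  /* LIKE becomes "start with", other qualifiers are kept */
  If UpCase( Qualifier ) = 'LIKE' Then QualLabel = 'start with' ;
  Else QualLabel = Qualifier ;

  /* Wrap the connector in brackets */
  LogicLabel = CatS( '[' , LowCase( Logic ) , ']' ) ;
Run ;

Proc Sort Data = Relabeled ;
  By ID SortOrder Qualifier ;
Run ;

/* Same DoW pattern as above, using the relabeled variables */
Data Need2 ( Keep = ID Description ) ;
  Length Description $ 200 ;

  Do Until( Last.ID ) ;
    Set Relabeled ;
    By ID SortOrder Qualifier ;

    If First.Qualifier Then
      Description = CatX( ' ' , Description , SourceLabel , QualLabel , Value ) ;
    Else
      Description = CatX( ', ' , Description , Value ) ;

    If Last.Qualifier And Not Last.ID Then
      Description = CatX( ' ' , Description , LogicLabel ) ;
  End ;
Run ;
```

One caveat with this sketch, and with the step it builds on: the connector written after a group is whatever Logic value sits on that group's last row after sorting. When the sort moves groups around (ID 1, where ICD9 now comes before GPI), that may not be the connector that originally joined the two conditions, so the "and"/"or" placement would need to be checked by hand or handled separately.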

Toby Dunn

"I'm a hell bent 100% Texan til I die"

"Don't touch my Willie, I don't know you that well"

> Date: Thu, 29 Jul 2010 19:46:14 -0400
> From: i89rt5@GMAIL.COM
> Subject: Re: Perl Regular Expression question
> To: SAS-L@LISTSERV.UGA.EDU
>
> Toby,
>
> Ideally, I would like to have the output to be in a particular order, say,
> ICD9, GPI, CPT.
>
> Thank you.
>
> On Thu, 29 Jul 2010 22:36:46 +0000, toby dunn <tobydunn@HOTMAIL.COM> wrote:
>
> >Jerry,
> >
> >So it was originally in long form, oh so much better to work with, heck
> >yeah. While I don't at the moment have a chance to work up a solution, if
> >you use the long form you don't need Perl RegEx, just some good
> >old-fashioned SAS programming. I do have one question: is there a
> >particular order you want the output in (i.e. GPI, ICD3, ICD9, CPT)?
> >
> >Simply use the long form; if output order makes a difference, order it how
> >you want, then use a DoW to collect the information and output one
> >observation per ID. It's really super simple.
> >
> >Toby Dunn
> >
> >> Date: Thu, 29 Jul 2010 18:23:24 -0400
> >> From: i89rt5@GMAIL.COM
> >> Subject: Re: Perl Regular Expression question
> >> To: SAS-L@LISTSERV.UGA.EDU
> >>
> >> Matt, Chang, and Toby
> >>
> >> Thank you all very much for your help; I really appreciate your time and
> >> code.
> >>
> >> First, as you have already noticed, in my original post I accidentally
> >> dropped the string "GPI" between "66100055 or" and "start" in the first
> >> observation.
> >>
> >> Not only that, in reality the text I'm trying to manipulate is messier
> >> and more complicated. Below is a more realistic sample representing my
> >> data (but I'm sure I'll find more irregularities in my data).
> >>
> >> data test;
> >> input id & $1. description & $200.;
> >> datalines;
> >> 1 GPI Like 66100052, 66100053 or GPI LIKE 66100055 or GPI LIKE 66100065, 66100066 and ICD9 LIKE V852, V853
> >> 2 ICD3 IN 278, 279 or ICD3 IN 288, 289 or ICD5 IN 27802, 27803 or GPI LIKE 66100055, 66100052
> >> 3 ICD9 LE 1398 or ICD9 LIKE 48[0-7] or ICD9 LIKE 46 or ICD9 LIKE 7955 or ICD9 LIKE 7907 or ICD9 LIKE 68[126] or ICD9 LIKE 599
> >> 4 ICD4 IN 4771,6931,6938,6939,6925 or ICD5 IN V1501,V1502,V1503,V1504,V1505
> >> 5 GPI IN 66250050100320, 21300050100310 or CPT IN J8610, J9260, J9250
> >> 6 ICD9 IN 12345,23456 or ICD3 IN 123,456,789
> >> 7 GPI LIKE 99406010 or GPI LIKE 21101020 or CPT IN J7500 or CPT IN J7501 or CPT IN J9093 or CPT IN J9097 or CPT IN J8530
> >> 8 ICD9 IN 73316,823 or ICD9 LIKE 8230 or ICD9 LIKE 8232 or ICD9 LIKE 8238
> >> ;
> >> run;
> >>
> >> My desired output data (with 2 variables: id and description) should look
> >> like this:
> >>
> >> 1 GPI start with 66100052, 66100053, 66100055, 66100065, 66100066 [and] ICD9 start with V852, V853
> >> 2 ICD9 (first 3 digits) in 278, 279, 288, 289 [or] ICD9 (first 5 digits) in 27802, 27803 [or] GPI start with 66100055, 66100052
> >> 3 ICD9 LE 1398 [or] ICD9 IN 48[0-7], 46, 7955, 7907, 68[126], 599
> >> 4 ICD9 (first 4 digit) in 4771,6931,6938,6939,6925 [or] ICD9 (first 5 digit) IN V1501,V1502,V1503,V1504,V1505
> >> 5 GPI IN 66250050100320, 21300050100310 or CPT IN J8610, J9260, J9250
> >> 6 ICD9 IN 12345,23456 or ICD9(first 3 digits) IN 123,456,789
> >> 7 GPI start with 99406010, 21101020 or CPT IN J7500, J7501, J9093, J9097, J8530
> >> 8 ICD9 IN 73316,823 or ICD9 start with 8230, 8232, 8238
> >>
> >> Note:
> >> The conjunction between conditions is not static; it could be either
> >> "or" or "and", and it needs to be wrapped with [].
> >>
> >> Also, all "like" should be replaced with "start with".
> >> And ICD? should be converted to ICD9 (first ? digit) if ? is less than 9.
> >>
> >> *******
> >> To Matt,
> >> *******
> >> Yes, the data certainly can be output to a flat file. But I'm in a Windows
> >> environment and don't have any Perl tool installed. I'd be more than happy
> >> to output it to a flat file for you to try.
> >>
> >> *******
> >> To Toby,
> >> *******
> >> I wish I knew how to do look-ahead and look-behind in Perl!
> >>
> >> *******
> >> To Chang,
> >> *******
> >> The input file originally was in long format, see below.
> >>
> >> data original;
> >> input id & $1. code_source & $4. qualifier & $5. code_value & $200. logic & $3.;
> >> datalines;
> >> 1 GPI LIKE 66100052, 66100053 or
> >> 1 GPI LIKE 66100055 or
> >> 1 GPI LIKE 66100065, 66100066 and
> >> 1 ICD9 LIKE V852, V853 or
> >> 2 ICD3 IN 278, 279 or
> >> 2 ICD3 IN 288, 289 or
> >> 2 ICD5 IN 27802, 27803 or
> >> 2 GPI LIKE 66100055, 66100052 or
> >> 3 ICD9 LE 1398 or
> >> 3 ICD9 LIKE 48[0-7] or
> >> 3 ICD9 LIKE 46 or
> >> 3 ICD9 LIKE 7955 or
> >> 3 ICD9 LIKE 7907 or
> >> 3 ICD9 LIKE 68[126] or
> >> 3 ICD9 LIKE 599 or
> >> 4 ICD4 IN 4771,6931,6938,6939,6925 or
> >> 4 ICD5 IN V1501,V1502,V1503,V1504,V1505 or
> >> 5 GPI IN 66250050100320, 21300050100310 or
> >> 5 CPT IN J8610, J9260, J9250 or
> >> 6 ICD9 IN 12345,23456 or
> >> 6 ICD3 IN 123,456,789 or
> >> 7 GPI LIKE 99406010 or
> >> 7 GPI LIKE 21101020 or
> >> 7 CPT IN J7500 or
> >> 7 CPT IN J7501 or
> >> 7 CPT IN J9093 or
> >> 7 CPT IN J9097 or
> >> 7 CPT IN J8530 or
> >> 8 ICD9 IN 73316,823 or
> >> 8 ICD9 LIKE 8230 or
> >> 8 ICD9 LIKE 8232 or
> >> 8 ICD9 LIKE 8238 or
> >> ;
> >> run;
> >>
> >> I used Mike's approach (see below) and concatenated 4 columns (code_source,
> >> qualifier, code_value, logic), across rows if needed, for each id.
> >>
> >> Note: for each id, the last logic value is skipped when I did the
> >> concatenation, which results in the data "test" shown at the top.
> >>
> >> data test;
> >> length description $1000;
> >> do until (last.id);
> >> set original;
> >> by id;
> >> if not last.id then description=catx(' ', description, code_source, qualifier, code_value, logic);
> >> end;
> >> description = catx(' ', description, code_source, qualifier, code_value);
> >> keep id description;
> >> run;
> >>
> >> On Thu, 29 Jul 2010 13:58:09 -0500, Matthew Pettis
> >> <matt.pettis@THOMSONREUTERS.COM> wrote:
> >>
> >> >To add to this, SAS *can* do text regex mangling, but if this data is in
> >> >a flat file, or could be put into one, it would likely be easier to code
> >> >a solution purely in Perl to get what you want done. The SAS PRX*
> >> >functions and how they work with data steps can make for a more
> >> >complicated solution and can require more boilerplate code. If you
> >> >provide a few more examples of the text you want parsed (and answer the
> >> >question about the missing 'GPI' in the first obs), it might make our
> >> >answers more complete and might help you determine whether SAS is the
> >> >best tool to do this text extraction...
> >> >
> >> >Matt
> >> >
> >> >-----Original Message-----
> >> >From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >> >toby dunn
> >> >Sent: Thursday, July 29, 2010 11:20 AM
> >> >To: SAS-L@LISTSERV.UGA.EDU
> >> >Subject: Re: Perl Regular Expression question
> >> >
> >> >Jerry,
> >> >
> >> >Not sure it's really any better or worse than the other solutions you
> >> >received; it does, however, hammer it out in one pass of the data set,
> >> >for what it is worth. Like Chang, I too wondered about the missing GPI in
> >> >your first observation. If it is truly missing, then the pattern for
> >> >GPI will need to be modified to use a look-behind.
> >> >
> >> >Data Need ( Keep = NewDescription ) ;
> >> >Length Temp1 Temp2 NewDescription $200. ;
> >> >Set Have ;
> >> >
> >> >Start = 1 ;
> >> >Stop = Length( Description ) ;
> >> >Position = 0 ;
> >> >
> >> >Pattern = PrxParse( '/(?:(\b\d{8}\b)|(\b[VE]*\d+\b))/') ;
> >> >
> >> >Call PRXNext( Pattern , Start , Stop , Description , Position , Length ) ;
> >> >
> >> >Do While ( Position > 0 ) ;
> >> > Temp1 = CatX( ' , ' , Temp1 , PRXPosn( Pattern , 1 , Description ) ) ;
> >> > Temp2 = CatX( ' , ' , Temp2 , PRXPosn( Pattern , 2 , Description ) ) ;
> >> >
> >> > Call PRXNext( Pattern , Start , Stop , Description , Position , Length ) ;
> >> >End ;
> >> >
> >> >Temp1 = IfC( Not Missing( Temp1 ) , 'GPI Starts With ' || Temp1 , '' ) ;
> >> >Temp2 = IfC( Not Missing( Temp2 ) , 'ICD9 Starts With ' || Temp2 , '' ) ;
> >> >NewDescription = CatX( ' And ' , Temp1 , Temp2 ) ;
> >> >
> >> >Run ;
> >> >
> >> >> Date: Wed, 28 Jul 2010 17:15:59 -0400
> >> >> From: i89rt5@GMAIL.COM
> >> >> Subject: Perl Regular Expression question
> >> >> To: SAS-L@LISTSERV.UGA.EDU
> >> >>
> >> >> Hi,
> >> >>
> >> >> Suppose I have the input data below:
> >> >>
> >> >> data in;
> >> >> input description $ 1-100;
> >> >> datalines;
> >> >> GPI start with 66100052 or GPI start with 66100055 or start with 66100065 and ICD9 start with V852
> >> >> ICD9 start with 27800 or ICD9 start with 27801 or ICD9 start with 27802 or ICD9 start with V852
> >> >> ;
> >> >> run;
> >> >>
> >> >> How could a Perl regular expression be used to make the output data
> >> >> (still 1 var: description, and still 2 observations) look like this?
> >> >>
> >> >> GPI start with 66100052, 66100055, 66100065 and ICD9 start with V852
> >> >> ICD9 start with 27800, 27801, 27802, V852
> >> >>
> >> >> So, part of my specific question would be: how do I use Perl Rx to
> >> >> determine whether "GPI start with" or "ICD9 start with" occurs more
> >> >> than once, and then extract the numbers and put them together?
> >> >>
> >> >> Thank you.

