| Date: | Fri, 30 Jul 2010 16:32:18 +0000 |
| Reply-To: | toby dunn <tobydunn@HOTMAIL.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | toby dunn <tobydunn@HOTMAIL.COM> |
| Subject: | Re: Perl Regular Expression question |
| In-Reply-To: | <201007292346.o6TGsHXJ021504@willow.cc.uga.edu> |
| Content-Type: | text/plain; charset="iso-8859-1" |
Jerry,
I'm not totally sure this gets you 100% of the way there, but it should provide the basic framework you can use to take your initial file and create the end result you want.
/* Read the long-form rules: one pipe-delimited record per condition */
Data Have ;
  Length Value $ 200 ;
  Infile Cards DLM = "|" ;
  Input ID $ Source $ Qualifier $ Value $ Logic $ ;
Cards ;
1|GPI |LIKE|66100052, 66100053|or
1|GPI |LIKE|66100055|or
1|GPI |LIKE|66100065, 66100066|and
1|ICD9|LIKE|V852, V853|or
2|ICD3|IN |278, 279|or
2|ICD3|IN |288, 289|or
2|ICD5|IN |27802, 27803|or
2|GPI |LIKE|66100055, 66100052|or
3|ICD9|LE |1398|or
3|ICD9|LIKE|48[0-7]|or
3|ICD9|LIKE|46|or
3|ICD9|LIKE|7955|or
3|ICD9|LIKE|7907|or
3|ICD9|LIKE|68[126]|or
3|ICD9|LIKE|599|or
4|ICD4|IN |4771,6931,6938,6939,6925|or
4|ICD5|IN |V1501,V1502,V1503,V1504,V1505|or
5|GPI |IN |66250050100320, 21300050100310|or
5|CPT |IN |J8610, J9260, J9250|or
6|ICD9|IN |12345,23456|or
6|ICD3|IN |123,456,789|or
7|GPI |LIKE|99406010|or
7|GPI |LIKE|21101020|or
7|CPT |IN |J7500|or
7|CPT |IN |J7501|or
7|CPT |IN |J9093|or
7|CPT |IN |J9097|or
7|CPT |IN |J8530|or
8|ICD9|IN |73316,823|or
8|ICD9|LIKE|8230|or
8|ICD9|LIKE|8232|or
8|ICD9|LIKE|8238|or
;
Run ;
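/* Assign a sort key so each ID's conditions come out in the desired source order */
Data Have ;
  Set Have ;
  If UpCase( Source ) = 'ICD3' Then SortOrder = 1 ;
  /* ICD4 occurs in the sample data (ID 4); without its own key it would sort after ICD5 */
  Else If UpCase( Source ) = 'ICD4' Then SortOrder = 2 ;
  Else If UpCase( Source ) = 'ICD5' Then SortOrder = 3 ;
  Else If UpCase( Source ) = 'ICD9' Then SortOrder = 4 ;
  Else If UpCase( Source ) = 'GPI' Then SortOrder = 5 ;
  Else If UpCase( Source ) = 'CPT' Then SortOrder = 6 ;
  Else SortOrder = 7 ;
Run ;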
Proc Sort Data = Have ;
  By ID SortOrder Qualifier ;
Run ;
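/* DoW loop: read all rows for one ID, build one Description, output once per ID */
Data Need ;
  Length Description $ 200 ;
  Do I = 1 By 1 Until( Last.ID ) ;
    Set Have ;
    By ID SortOrder Qualifier ;
    /* First row of a Source/Qualifier group: start a new clause */
    If First.Qualifier Then Do ;
      Description = CatX( ' ' , Description , Source , Qualifier , Value ) ;
    End ;
    /* Later rows of the same group: append only the values */
    Else Do ;
      Description = CatX( ' , ' , Description , Value ) ;
    End ;
    /* Between groups, add the conjunction carried by the group's last row */
    If ( Last.Qualifier And Not( Last.ID ) ) Then Do ;
      Description = CatX( ' ' , Description , Logic ) ;
    End ;
  End ;
Run ;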
Proc Print Data = Need ;
Run ;
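The notes further down the thread (replace LIKE with "start with", turn ICD3/ICD4/ICD5 into "ICD9 (first n digits)", and wrap the conjunction in brackets) could be handled with a small recode step before the DoW loop. This is only a minimal sketch, assuming the sorted Have data set from above; Have2, SourceText, QualText and LogicText are made-up names for illustration and are not part of the original data.

```sas
/* Sketch only: display versions of Source, Qualifier and Logic.          */
/* Have2, SourceText, QualText and LogicText are hypothetical names.      */
Data Have2 ;
  Length SourceText $ 30 QualText $ 12 LogicText $ 5 ;
  Set Have ;

  /* ICD1-ICD8 become "ICD9 (first n digits)"; ICD9, GPI, CPT pass through */
  If PrxMatch( '/^ICD[1-8]\s*$/i' , Source ) Then
    SourceText = 'ICD9 (first ' || Substr( Source , 4 , 1 ) || ' digits)' ;
  Else SourceText = Source ;

  /* LIKE is reported as "start with"; other qualifiers are left alone */
  If UpCase( Qualifier ) = 'LIKE' Then QualText = 'start with' ;
  Else QualText = Qualifier ;

  /* Wrap the conjunction in square brackets, e.g. [or] or [and] */
  LogicText = Cats( '[' , Logic , ']' ) ;
Run ;
```

To try it, point the Set statement in the Need step at Have2 and pass SourceText, QualText and LogicText to CatX in place of Source, Qualifier and Logic; the By statement can stay as it is, since Have2 keeps the row order of Have.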
Toby Dunn
"I'm a hell bent 100% Texan til I die"
"Don't touch my Willie, I don't know you that well"
> Date: Thu, 29 Jul 2010 19:46:14 -0400
> From: i89rt5@GMAIL.COM
> Subject: Re: Perl Regular Expression question
> To: SAS-L@LISTSERV.UGA.EDU
>
> Toby,
>
> Ideally, I would like to have the output to be in a particular order, say,
> ICD9, GPI, CPT.
>
> Thank you.
>
> On Thu, 29 Jul 2010 22:36:46 +0000, toby dunn <tobydunn@HOTMAIL.COM> wrote:
>
> >Jerry,
> >
> >So it was originally in long form. Oh, so much better to work with, heck
> >yeah. While I don't have a chance at the moment to work up a solution, if
> >you use the long form you don't need Perl RegEx, just some good
> >old-fashioned SAS programming. I do have one question: is there a
> >particular order you want the output in (i.e., GPI, ICD3, ICD9, CPT)?
> >
> >Simply use the long form. If output order makes a difference, order it how
> >you want, then use a DoW loop to collect the information and output one
> >observation per ID. It's really super simple.
> >
> >Toby Dunn
> >
> >
> >"I'm a hell bent 100% Texan til I die"
> >
> >"Don't touch my Willie, I don't know you that well"
> >
> >
> >
> >
> >> Date: Thu, 29 Jul 2010 18:23:24 -0400
> >> From: i89rt5@GMAIL.COM
> >> Subject: Re: Perl Regular Expression question
> >> To: SAS-L@LISTSERV.UGA.EDU
> >>
> >> Matt, Chang, and Toby
> >>
> >> Thank you all very much for your help. I really appreciate your time and
> >> code.
> >>
> >> First, as you have already noticed, in my original post I accidentally
> >> dropped the string "GPI" between "66100055 or" and "start" in the first
> >> observation.
> >>
> >> Not only that, in reality the text I'm trying to manipulate is messier
> >> and more complicated. Below is a more realistic sample representing my data
> >> (but I'm sure I'll find more irregularities in my data).
> >>
> >> data test;
> >>
> >> input id & $1. description & $200.;
> >> datalines;
> >> 1 GPI Like 66100052, 66100053 or GPI LIKE 66100055 or GPI LIKE 66100065,
> >> 66100066 and ICD9 LIKE V852, V853
> >> 2 ICD3 IN 278, 279 or ICD3 IN 288, 289 or ICD5 IN 27802, 27803 or GPI LIKE
> >> 66100055, 66100052
> >> 3 ICD9 LE 1398 or ICD9 LIKE 48[0-7] or ICD9 LIKE 46 or ICD9 LIKE 7955 or
> >> ICD9 LIKE 7907 or ICD9 LIKE 68[126] or ICD9 LIKE 599
> >> 4 ICD4 IN 4771,6931,6938,6939,6925 or ICD5 IN V1501,V1502,V1503,V1504,V1505
> >> 5 GPI IN 66250050100320, 21300050100310 or CPT IN J8610, J9260, J9250
> >> 6 ICD9 IN 12345,23456 or ICD3 IN 123,456,789
> >> 7 GPI LIKE 99406010 or GPI LIKE 21101020 or CPT IN J7500 or CPT IN J7501 or
> >> CPT IN J9093 or CPT IN J9097 or CPT IN J8530
> >> 8 ICD9 IN 73316,823 or ICD9 LIKE 8230 or ICD9 LIKE 8232 or ICD9 LIKE 8238
> >> ;
> >> run;
> >>
> >>
> >> My desired output data (with 2 variables: id and description) should look
> >> like this
> >>
> >> 1 GPI start with 66100052, 66100053, 66100055, 66100065, 66100066 [and] ICD9
> >> start with V852, V853
> >> 2 ICD9 (first 3 digits) in 278, 279, 288, 289 [or] ICD9 (first 5 digits) in
> >> 27802, 27803 [or] GPI start with 66100055, 66100052
> >> 3 ICD9 LE 1398 [or] ICD9 IN 48[0-7], 46, 7955, 7907, 68[126], 599
> >> 4 ICD9 (first 4 digit) in 4771,6931,6938,6939,6925 [or] ICD9 (first 5 digit)
> >> IN V1501,V1502,V1503,V1504,V1505
> >> 5 GPI IN 66250050100320, 21300050100310 or CPT IN J8610, J9260, J9250
> >> 6 ICD9 IN 12345,23456 or ICD9(first 3 digits) IN 123,456,789
> >> 7 GPI start with 99406010, 21101020 or CPT IN J7500, J7501, J9093, J9097,
> >> J8530
> >> 8 ICD9 IN 73316,823 or ICD9 start with 8230, 8232, 8238
> >>
> >> ;
> >>
> >> Note:
> >> The conjunction between all conditions is not static, it could be either
> >> "or" or "and". And it needs to be wrapped with []
> >>
> >> Also, all "like" should be replaced with "start with".
> >> And, ICD? should be converted to ICD9 (first ? digit) if ? is less than 9.
> >>
> >>
> >> *******
> >> To Matt,
> >> *******
> >> Yes, the data certainly can be output to a flat file. But I'm in Windows
> >> environment, and don't have any PERL tool installed. I'd be more than happy
> >> to output it to a flat file for you to try.
> >>
> >> *******
> >> To Toby,
> >> *******
> >> I wish I knew how to do look-ahead and look-behind in PERL!
> >>
> >>
> >> *******
> >> To Chang,
> >> *******
> >> The input file originally was in long format, see below.
> >>
> >> data original;
> >> input id & $1. code_source & $4. qualifier & $5. code_value & $200. logic
> >> & $3.;
> >> datalines;
> >> 1 GPI LIKE 66100052, 66100053 or
> >> 1 GPI LIKE 66100055 or
> >> 1 GPI LIKE 66100065, 66100066 and
> >> 1 ICD9 LIKE V852, V853 or
> >> 2 ICD3 IN 278, 279 or
> >> 2 ICD3 IN 288, 289 or
> >> 2 ICD5 IN 27802, 27803 or
> >> 2 GPI LIKE 66100055, 66100052 or
> >> 3 ICD9 LE 1398 or
> >> 3 ICD9 LIKE 48[0-7] or
> >> 3 ICD9 LIKE 46 or
> >> 3 ICD9 LIKE 7955 or
> >> 3 ICD9 LIKE 7907 or
> >> 3 ICD9 LIKE 68[126] or
> >> 3 ICD9 LIKE 599 or
> >> 4 ICD4 IN 4771,6931,6938,6939,6925 or
> >> 4 ICD5 IN V1501,V1502,V1503,V1504,V1505 or
> >> 5 GPI IN 66250050100320, 21300050100310 or
> >> 5 CPT IN J8610, J9260, J9250 or
> >> 6 ICD9 IN 12345,23456 or
> >> 6 ICD3 IN 123,456,789 or
> >> 7 GPI LIKE 99406010 or
> >> 7 GPI LIKE 21101020 or
> >> 7 CPT IN J7500 or
> >> 7 CPT IN J7501 or
> >> 7 CPT IN J9093 or
> >> 7 CPT IN J9097 or
> >> 7 CPT IN J8530 or
> >> 8 ICD9 IN 73316,823 or
> >> 8 ICD9 LIKE 8230 or
> >> 8 ICD9 LIKE 8232 or
> >> 8 ICD9 LIKE 8238 or
> >> ;
> >> run;
> >>
> >> I used Mike's approach (see below) and concatenated 4 columns (code_source,
> >> qualifier, code_value, logic), across rows if needed, for each id.
> >>
> >> Note: for each id, the last logic value is skipped when I did the
> >> concatenation, which results in the data "test", shown at the top
> >>
> >> data test;
> >> length description $1000;
> >> do until (last.id);
> >> set original;
> >> by id;
> >> if not last.id then description=catx(' ', description, code_source,
> >> qualifier, code_value, logic);
> >> end;
> >> description = catx(' ', description, code_source, qualifier, code_value);
> >> keep id description;
> >> run;
> >>
> >>
> >>
> >>
> >> On Thu, 29 Jul 2010 13:58:09 -0500, Matthew Pettis
> >> <matt.pettis@THOMSONREUTERS.COM> wrote:
> >>
> >> >To add to this, SAS *can* do text regex mangling, but if this data is in
> >> >a flat file, or could be put into one, it would likely be easier to code
> >> >a solution purely in Perl to get what you want done. The SAS PRX*
> >> >functions and how they work with datasteps can make for a more
> >> >complicated solution and can require the making of more boilerplate
> >> >code. If you provide a few more examples of text you want parsed (and
> >> >answer the missing 'GPI' from the first obs question), it might make our
> >> >answers more complete and might help you determine if SAS is the best
> >> >tool to do this text extraction...
> >> >
> >> >Matt
> >> >
> >> >-----Original Message-----
> >> >From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >> >toby dunn
> >> >Sent: Thursday, July 29, 2010 11:20 AM
> >> >To: SAS-L@LISTSERV.UGA.EDU
> >> >Subject: Re: Perl Regular Expression question
> >> >
> >> >Jerry ,
> >> >
> >> >
> >> >
> >> >Not sure it's really any better or worse than the other solutions you
> >> >received, but for what it is worth it does hammer it out in one pass of
> >> >the data set. Like Chang, I too wondered about the missing GPI in
> >> >your first observation. If it is truly missing, then the pattern for
> >> >GPI will need to be modified to use a look-behind.
> >> >
> >> >
> >> >
> >> >Data Need ( Keep = NewDescription ) ;
> >> >Length Temp1 Temp2 NewDescription $200. ;
> >> >Set Have ;
> >> >
> >> >
> >> >
> >> >Start = 1 ;
> >> >Stop = Length( Description ) ;
> >> >Position = 0 ;
> >> >
> >> >
> >> >
> >> >Pattern = PrxParse( '/(?:(\b\d{8}\b)|(\b[VE]*\d+\b))/') ;
> >> >
> >> >
> >> >
> >> >Call PRXNext( Pattern , Start , Stop , Description , Position , Length )
> >> >;
> >> >
> >> >
> >> >
> >> >Do While ( Position > 0 ) ;
> >> > Temp1 = CatX( ' , ' , Temp1 , PRXPosn( Pattern , 1 , Description ) ) ;
> >> > Temp2 = CatX( ' , ' , Temp2 , PRXPosn( Pattern , 2 , Description ) ) ;
> >> >
> >> > Call PRXNext( Pattern , Start , Stop , Description , Position , Length
> >> >) ;
> >> >End ;
> >> >
> >> >
> >> >
> >> >Temp1 = IfC( Not Missing( Temp1 ) , 'GPI Starts With ' || Temp1 , '' )
> >> >;
> >> >Temp2 = IfC( Not Missing( Temp2 ) , 'ICD9 Starts With ' || Temp2 , '' )
> >> >;
> >> >NewDescription = CatX( ' And ' , Temp1 , Temp2 ) ;
> >> >
> >> >
> >> >Run ;
> >> >
> >> >> Date: Wed, 28 Jul 2010 17:15:59 -0400
> >> >> From: i89rt5@GMAIL.COM
> >> >> Subject: Perl Regular Expression question
> >> >> To: SAS-L@LISTSERV.UGA.EDU
> >> >>
> >> >> Hi,
> >> >>
> >> >> Suppose I have an input data below
> >> >>
> >> >> data in;
> >> >> input description $ 1-100;
> >> >> datalines;
> >> >> GPI start with 66100052 or GPI start with 66100055 or start with 66100065 and ICD9 start with V852
> >> >> ICD9 start with 27800 or ICD9 start with 27801 or ICD9 start with 27802 or ICD9 start with V852
> >> >> ;
> >> >> run;
> >> >>
> >> >> How could Perl Regular Expression be used to make the output data
> >> >> (still 1 var: description, and still 2 observations) look like this
> >> >>
> >> >> GPI start with 66100052, 66100055, 66100065 and ICD9 start with V852
> >> >> ICD9 start with 27800, 27801, 27802, V852
> >> >>
> >> >> So, part of my specific question would be: how to use Perl Rx to
> >> >> determine "GPI start with" or "ICD9 start with" occurs more than once,
> >> >> and then extract the numbers and put them together?
> >> >>
> >> >> Thank you.
> >