Jerry,
So it was originally in long form. Oh, that is so much better to work with, heck yeah. While I don't have a chance at the moment to work up a solution, if you use the long form you don't need Perl RegEx, just some good old-fashioned SAS programming. I do have one question: is there a particular order you want the output in (i.e., GPI, ICD3, ICD9, CPT)?
Simply use the long form. If output order makes a difference, sort it how you want, then use a DoW loop to collect the information and output one observation per ID. It's really super simple.
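A rough, untested sketch of that idea follows. It assumes the long data set ORIGINAL (with ID, CODE_SOURCE, QUALIFIER, CODE_VALUE, and LOGIC, as in your post) is already in the row order you want, and that CODE_SOURCE is $4. The names NEED, VALUE_LIST, SOURCE_TXT, OP_TXT, and PREV_LOGIC are made up for illustration. It follows the rules in your notes (bracketed conjunctions, LIKE to "start with", ICD? to "ICD9 (first ? digits)"); your sample output is not fully consistent on those points, so adjust to taste.

```sas
/* Sketch only: collapse consecutive rows that share CODE_SOURCE and   */
/* QUALIFIER into one clause, then string the clauses together per ID. */
data need (keep=id description);
  length description value_list $1000 source_txt $40 op_txt $12 prev_logic $3;

  /* DoW loop: one pass through this loop per ID, one output row per ID */
  do until (last.id);
    set original;
    /* NOTSORTED: groups are runs of consecutive rows, not sorted values */
    by id code_source qualifier notsorted;

    /* Start a new value list whenever the source/qualifier run changes */
    if first.qualifier then value_list = ' ';
    value_list = catx(', ', value_list, code_value);

    if last.qualifier then do;
      /* ICD3, ICD4, ICD5, ... become "ICD9 (first n digits)" */
      if code_source =: 'ICD' and '1' <= substr(code_source, 4, 1) <= '8' then
        source_txt = catx(' ', 'ICD9 (first', substr(code_source, 4, 1), 'digits)');
      else source_txt = code_source;

      /* LIKE becomes "start with"; other qualifiers are kept as they are */
      if upcase(qualifier) = 'LIKE' then op_txt = 'start with';
      else op_txt = qualifier;

      /* Join clauses with the bracketed LOGIC value that closed the previous run */
      if missing(description) then
        description = catx(' ', source_txt, op_txt, value_list);
      else
        description = catx(' ', description, cats('[', prev_logic, ']'),
                           source_txt, op_txt, value_list);

      prev_logic = logic;
    end;
  end;
run;
```

Because the variables built inside the loop are reset only when the data step iterates, they carry across rows within an ID and start fresh for the next one. The LOGIC value on the last row of each ID is never used, which matches how you built TEST.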
Toby Dunn
"I'm a hell bent 100% Texan til I die"
"Don't touch my Willie, I don't know you that well"
> Date: Thu, 29 Jul 2010 18:23:24 -0400
> From: i89rt5@GMAIL.COM
> Subject: Re: Perl Regular Expression question
> To: SAS-L@LISTSERV.UGA.EDU
>
> Matt, Chang, and Toby
>
> Thank you all very much for your help. I really appreciate your time and code.
>
> First, as you have already noticed, in my original post I accidentally
> dropped the string "GPI" between "66100055 or" and "start" in the first
> observation.
>
> Not only that, in reality, the text I'm trying to manipulate is more messy
> and complicated. Below is the more realistic sample representing my data
> (but I'm sure I'll find more irregularity in my data)
>
> data test;
>
> input id & $1. description & $200.;
> datalines;
> 1 GPI Like 66100052, 66100053 or GPI LIKE 66100055 or GPI LIKE 66100065,
> 66100066 and ICD9 LIKE V852, V853
> 2 ICD3 IN 278, 279 or ICD3 IN 288, 289 or ICD5 IN 27802, 27803 or GPI LIKE
> 66100055, 66100052
> 3 ICD9 LE 1398 or ICD9 LIKE 48[0-7] or ICD9 LIKE 46 or ICD9 LIKE 7955 or
> ICD9 LIKE 7907 or ICD9 LIKE 68[126] or ICD9 LIKE 599
> 4 ICD4 IN 4771,6931,6938,6939,6925 or ICD5 IN V1501,V1502,V1503,V1504,V1505
> 5 GPI IN 66250050100320, 21300050100310 or CPT IN J8610, J9260, J9250
> 6 ICD9 IN 12345,23456 or ICD3 IN 123,456,789
> 7 GPI LIKE 99406010 or GPI LIKE 21101020 or CPT IN J7500 or CPT IN J7501 or
> CPT IN J9093 or CPT IN J9097 or CPT IN J8530
> 8 ICD9 IN 73316,823 or ICD9 LIKE 8230 or ICD9 LIKE 8232 or ICD9 LIKE 8238
> ;
> run;
>
>
> My desired output data (with 2 variables: id and description) should look
> like this
>
> 1 GPI start with 66100052, 66100053, 66100055, 66100065, 66100066 [and] ICD9
> start with V852, V853
> 2 ICD9 (first 3 digits) in 278, 279, 288, 289 [or] ICD9 (first 5 digits) in
> 27802, 27803 [or] GPI start with 66100055, 66100052
> 3 ICD9 LE 1398 [or] ICD9 IN 48[0-7], 46, 7955, 7907, 68[126], 599
> 4 ICD9 (first 4 digit) in 4771,6931,6938,6939,6925 [or] ICD9 (first 5 digit)
> IN V1501,V1502,V1503,V1504,V1505
> 5 GPI IN 66250050100320, 21300050100310 or CPT IN J8610, J9260, J9250
> 6 ICD9 IN 12345,23456 or ICD9(first 3 digits) IN 123,456,789
> 7 GPI start with 99406010, 21101020 or CPT IN J7500, J7501, J9093, J9097, J8530
> 8 ICD9 IN 73316,823 or ICD9 start with 8230, 8232, 8238
>
> ;
>
> Note:
> The conjunction between all conditions is not static, it could be either
> "or" or "and". And it needs to be wrapped with []
>
> Also, all "like" should be replaced with "start with".
> And, ICD? should be converted to ICD9 (first ? digit) if ? is less than 9.
>
>
> *******
> To Matt,
> *******
> Yes, the data certainly can be output to a flat file. But I'm in Windows
> environment, and don't have any PERL tool installed. I'd be more than happy
> to output it to a flat file for you to try.
>
> *******
> To Toby,
> *******
> I wish I knew how to do look-ahead and look-behind in PERL!
>
>
> *******
> To Chang,
> *******
> The input file originally was in long format, see below.
>
> data original;
> input id & $1. code_source & $4. qualifier & $5. code_value & $200. logic & $3.;
> datalines;
> 1 GPI LIKE 66100052, 66100053 or
> 1 GPI LIKE 66100055 or
> 1 GPI LIKE 66100065, 66100066 and
> 1 ICD9 LIKE V852, V853 or
> 2 ICD3 IN 278, 279 or
> 2 ICD3 IN 288, 289 or
> 2 ICD5 IN 27802, 27803 or
> 2 GPI LIKE 66100055, 66100052 or
> 3 ICD9 LE 1398 or
> 3 ICD9 LIKE 48[0-7] or
> 3 ICD9 LIKE 46 or
> 3 ICD9 LIKE 7955 or
> 3 ICD9 LIKE 7907 or
> 3 ICD9 LIKE 68[126] or
> 3 ICD9 LIKE 599 or
> 4 ICD4 IN 4771,6931,6938,6939,6925 or
> 4 ICD5 IN V1501,V1502,V1503,V1504,V1505 or
> 5 GPI IN 66250050100320, 21300050100310 or
> 5 CPT IN J8610, J9260, J9250 or
> 6 ICD9 IN 12345,23456 or
> 6 ICD3 IN 123,456,789 or
> 7 GPI LIKE 99406010 or
> 7 GPI LIKE 21101020 or
> 7 CPT IN J7500 or
> 7 CPT IN J7501 or
> 7 CPT IN J9093 or
> 7 CPT IN J9097 or
> 7 CPT IN J8530 or
> 8 ICD9 IN 73316,823 or
> 8 ICD9 LIKE 8230 or
> 8 ICD9 LIKE 8232 or
> 8 ICD9 LIKE 8238 or
> ;
> run;
>
> I used Mike's approach (see below) and concatenated 4 columns (code_source,
> qualifier, code_value, logic), across rows if needed, for each id.
>
> Note: for each id, the last logic value is skipped when I did the
> concatenation, which results in the data "test", shown at the top
>
> data test;
> length description $1000;
> do until (last.id);
> set original;
> by id;
> if not last.id then description=catx(' ', description, code_source,
> qualifier, code_value, logic);
> end;
> description = catx(' ', description, code_source, qualifier, code_value);
> keep id description;
> run;
>
>
>
>
> On Thu, 29 Jul 2010 13:58:09 -0500, Matthew Pettis
> <matt.pettis@THOMSONREUTERS.COM> wrote:
>
> >To add to this, SAS *can* do text regex mangling, but if this data is in
> >a flat file, or could be put into one, it would likely be easier to code
> >a solution purely in Perl to get what you want done. The SAS PRX*
> >functions and how they work with datasteps can make for a more
> >complicated solution and can require the making of more boilerplate
> >code. If you provide a few more examples of text you want parsed (and
> >answer the missing 'GPI' from the first obs question), it might make our
> >answers more complete and might help you determine if SAS is the best
> >tool to do this text extraction...
> >
> >Matt
> >
> >-----Original Message-----
> >From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >toby dunn
> >Sent: Thursday, July 29, 2010 11:20 AM
> >To: SAS-L@LISTSERV.UGA.EDU
> >Subject: Re: Perl Regular Expression question
> >
> >Jerry ,
> >
> >
> >
> >Not sure it's really any better or worse than the other solutions you
> >received; it does, however, hammer it out in one pass of the data set, for
> >what it is worth. Like Chang, I too wondered about the missing GPI in
> >your first observation. If it is truly missing, then the pattern for
> >GPI will need to be modified to use a look-behind.
> >
> >
> >
> >Data Need ( Keep = NewDescription ) ;
> >Length Temp1 Temp2 NewDescription $200. ;
> >Set Have ;
> >
> >
> >
> >Start = 1 ;
> >Stop = Length( Description ) ;
> >Position = 0 ;
> >
> >
> >
> >Pattern = PrxParse( '/(?:(\b\d{8}\b)|(\b[VE]*\d+\b))/') ;
> >
> >
> >
> >Call PRXNext( Pattern , Start , Stop , Description , Position , Length )
> >;
> >
> >
> >
> >Do While ( Position > 0 ) ;
> > Temp1 = CatX( ' , ' , Temp1 , PRXPosn( Pattern , 1 , Description ) ) ;
> > Temp2 = CatX( ' , ' , Temp2 , PRXPosn( Pattern , 2 , Description ) ) ;
> >
> > Call PRXNext( Pattern , Start , Stop , Description , Position , Length
> >) ;
> >End ;
> >
> >
> >
> >Temp1 = IfC( Not Missing( Temp1 ) , 'GPI Starts With ' || Temp1 , '' )
> >;
> >Temp2 = IfC( Not Missing( Temp2 ) , 'ICD9 Starts With ' || Temp2 , '' )
> >;
> >NewDescription = CatX( ' And ' , Temp1 , Temp2 ) ;
> >
>
> >Run ;
> >
> >> Date: Wed, 28 Jul 2010 17:15:59 -0400
> >> From: i89rt5@GMAIL.COM
> >> Subject: Perl Regular Expression question
> >> To: SAS-L@LISTSERV.UGA.EDU
> >>
> >> Hi,
> >>
> >> Suppose I have an input data below
> >>
> >> data in;
> >> input description $ 1-100;
> >> datalines;
> >> GPI start with 66100052 or GPI start with 66100055 or start with
> >66100065
> >> and ICD9 start with V852
> >> ICD9 start with 27800 or ICD9 start with 27801 or ICD9 start with
> >27802 or
> >> ICD9 start with V852
> >> ;
> >> run;
> >>
> >> How could Perl Regular Expression be used to make the output data
> >(still 1
> >> var: description, and still 2 observations) look like this
> >>
> >> GPI start with 66100052, 66100055, 66100065 and ICD9 start with V852
> >> ICD9 start with 27800, 27801, 27802, V852
> >>
> >> So, part of my specific question would be: how to use Perl Rx to
> >determine
> >> "GPI start with" or "ICD9 start with" occurs more than once, and then
> >> extract the numbers and put them together?
> >>
> >> Thank you.