Date: Wed, 29 Apr 2009 20:42:33 -0700
Reply-To: Savian <savian.net@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Savian <savian.net@GMAIL.COM>
Organization: http://groups.google.com
Subject: Re: Text Parsing exercise - regex?
Content-Type: text/plain; charset=ISO-8859-1
On Apr 29, 8:39 pm, mysas...@GMAIL.COM (Michael Murff) wrote:
> Chang,
>
> Thanks for taking the time to conjure up that wicked cool code; it does
> indeed work as promised :-> I have big batches of this sas code stored in
> discrete files across UNIX directories which I am scanning, looping through,
> plucking metadata from the filenames, and putting into pipes via datastep,
> and then executing your esoteric subroutine - seems to work rather well.
> Fortunately, the code is machine actuated so formatting variation has rather
> predictable boundaries.
>
> btw - Next time I need to build an editor, or compiler for that matter, I
> know who to call!
>
> I continue to be amazed at how helpful the L can be.
>
> Thanks to you all (Ya, Alan as well),
> Mike
>
>
>
> On Wed, Apr 29, 2009 at 6:49 PM, Savian <savian....@gmail.com> wrote:
> > On Apr 28, 5:37 pm, mysas...@GMAIL.COM (Michael Murff) wrote:
> > > Hello,
>
> > > Would anyone be willing to provide a primer in text parsing to produce
> > the
> > > table below from the two code snippets shown.
>
> > > Much obliged.
>
> > > Mike
>
> > > /* metadata for myVar_site_num - type A */
>
> > > if ( -1e38 < myVar_site <= 999 ) then myVar_s3 = -1.607070;
>
> > > else if ( 999 < myVar_site <= 1000 ) then myVar_s3 = 0.711604;
>
> > > else if ( 1000 < myVar_site <= 1244 ) then myVar_s3 = -0.722618;
>
> > > else if ( 1244 < myVar_site <= 1301 ) then myVar_s3 = 1.306531;
>
> > > else if ( 1301 < myVar_site <= 1334 ) then myVar_s3 = 2.390180;
>
> > > else if ( 1334 < myVar_site <= 1443 ) then myVar_s3 = 1.152770;
>
> > > else if ( 1443 < myVar_site <= 1491 ) then myVar_s3 = 0.261655;
>
> > > else if ( 1491 < myVar_site <= 1605 ) then myVar_s3 = 0.753317;
>
> > > else if ( 1605 < myVar_site <= 1847 ) then myVar_s3 = -0.038277;
>
> > > else if ( 1847 < myVar_site <= 2263 ) then myVar_s3 = -0.465657;
>
> > > else if ( 2263 < myVar_site <= 2999 ) then myVar_s3 = -1.149752;
>
> > > else if ( 2999 < myVar_site <= 3830 ) then myVar_s3 = -0.118839;
>
> > > else if ( 3830 < myVar_site <= 5124 ) then myVar_s3 = -0.610279;
>
> > > else if ( 5124 < myVar_site <= 7194 ) then myVar_s3 = -0.009158;
>
> > > else if ( 7194 < myVar_site <= 12730 ) then myVar_s3 = -0.444817;
>
> > > else if ( myVar_site > 12730 ) then myVar_s3 = -0.637903;
>
> > > else myVar_s3 = 777;
>
> > > label myVar_s3 = 'AAAA';
>
> > > /* WOE recoding for myVar_site2 - type B */
>
> > > if myVar_site2 in ( ' 71', ' 65' ) then wmyVar_site2_s3 =
> > > -2.326094;
>
> > > else if myVar_site2 = ' 41' then wmyVar_site2_s3 = -1.797038;
>
> > > else if myVar_site2 in ( ' 91', ' 30' ) then wmyVar_site2_s3 =
> > > -1.603211;
>
> > > else if myVar_site2 in ( ' 5', ' 46' ) then wmyVar_site2_s3 =
> > > -1.409411;
>
> > > else if myVar_site2 in ( ' 20', ' 75', ' 43' ) then
> > > wmyVar_site2_s3 = -1.074107;
>
> > > else if myVar_site2 in ( ' 10', ' 12' ) then wmyVar_site2_s3 =
> > > -0.960373;
>
> > > else if myVar_site2 in ( ' 42', ' 13', ' 77', ' 74' )
> > > then wmyVar_site2_s3 = -0.707259;
>
> > > else if myVar_site2 in ( ' 11', ' 78', ' 40' ) then
> > > wmyVar_site2_s3 = -0.078952;
>
> > > else if myVar_site2 = ' 99' then wmyVar_site2_s3 = 0.937188;
>
> > > else wmyVar_site2_s3 = 555;
>
> > > label wmyVar_site2_s3 = 'ZZZZ';
>
> > > *Proposed DB format to capture both code types:*
>
> > > vartype  varname_site  startval_num  endval_num  binkey_cat                varname_rd       val        missval  labelval
> > > A        myVar_site                                                        myVar_s3         -1.607070  777      AAA
> > > A        myVar_site    999           1000                                  myVar_s3         0.711604   777      AAA
> > > A        myVar_site    1000          1244                                  myVar_s3         -0.722618  777      AAA
> > > A        myVar_site    1244          1301                                  myVar_s3         1.306531   777      AAA
> > > A        myVar_site    1301          1334                                  myVar_s3         2.390180   777      AAA
> > > A        myVar_site    1334          1443                                  myVar_s3         1.152770   777      AAA
> > > A        myVar_site    1443          1491                                  myVar_s3         0.261655   777      AAA
> > > A        myVar_site    1491          1605                                  myVar_s3         0.753317   777      AAA
> > > A        myVar_site    1605          1847                                  myVar_s3         -0.038277  777      AAA
> > > A        myVar_site    1847          2263                                  myVar_s3         -0.465657  777      AAA
> > > A        myVar_site    2263          2999                                  myVar_s3         -1.149752  777      AAA
> > > A        myVar_site    2999          3830                                  myVar_s3         -0.118839  777      AAA
> > > A        myVar_site    3830          5124                                  myVar_s3         -0.610279  777      AAA
> > > A        myVar_site    5124          7194                                  myVar_s3         -0.009158  777      AAA
> > > A        myVar_site    7194          12730                                 myVar_s3         -0.444817  777      AAA
> > > A        myVar_site    12730                                               myVar_s3         -0.637903  777      AAA
> > > B        myVar_site2                             ' 71', ' 65'              wmyVar_site2_s3  -2.326094  555      ZZZ
> > > B        myVar_site2                             ' 41'                     wmyVar_site2_s3  -1.797038  555      ZZZ
> > > B        myVar_site2                             ' 91', ' 30'              wmyVar_site2_s3  -1.603211  555      ZZZ
> > > B        myVar_site2                             ' 5', ' 46'               wmyVar_site2_s3  -1.409411  555      ZZZ
> > > B        myVar_site2                             ' 20', ' 75',' 43'        wmyVar_site2_s3  -1.074107  555      ZZZ
> > > B        myVar_site2                             ' 10', ' 12'              wmyVar_site2_s3  -0.960373  555      ZZZ
> > > B        myVar_site2                             ' 42', ' 13',' 77',' 74'  wmyVar_site2_s3  -0.707259  555      ZZZ
> > > B        myVar_site2                             ' 11', ' 78',' 40'        wmyVar_site2_s3  -0.078952  555      ZZZ
> > > B        myVar_site2                             ' 99'                     wmyVar_site2_s3  0.937188   555      ZZZ
>
> > This isn't all of the regex capture but it is a large chunk. This is
> > only the regex, not the SAS code. Look under regex and there are lots
> > of good sites:
>
> > type\s(\w)\s\*/|if\s\(([\d\D]+?)<([\D\d]+?)<=([\d\D]+?)\)
>
> > Whenever you see something in parentheses, that means a capturing
> > group. Hence, you have 4 major groups here:
>
> > type\s(\w)\s\*/ The (\w) is the group
> > | Means 'or'
> > if\s\(([\d\D]+?)< The capturing group here is the ([\d\D]+?).
> > Be careful with the [\d\D]+ construct since it captures everything.
> > The ? means don't be greedy.
> > <=([\d\D]+?)\) Last group
>
> > Download RegexBuddy (godsend) and use it to test as you go. Then
> > translate into SAS.
>
> > Alan
> > http://www.savian.net

Mike,

As you weigh the different approaches, keep in mind that regex is
worth learning at some point. It doesn't have to be for the problem du
jour, but start exploring regular expressions and you will discover
how good they are at parsing: far better than the string functions. I
could probably solve your parsing problem with a single (albeit dense)
line of code, but that is a guess.
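
To make that less abstract, here is a rough sketch of the kind of
pattern I have in mind, covering only the bounded type A lines. It is
written for Python's re module purely as a test bench (it is not SAS
and has not been run against your real files), and the group names lo,
var, hi, target, and val are made up for the example:

```python
import re

# Hypothetical pattern for the bounded "type A" lines only.
# Group names are invented for this sketch.
RANGE_RE = re.compile(
    r"if\s*\(\s*(?P<lo>[-+\w.]+)\s*<\s*(?P<var>\w+)\s*<=\s*(?P<hi>[-+\w.]+)\s*\)"
    r"\s*then\s+(?P<target>\w+)\s*=\s*(?P<val>[-+\w.]+)\s*;"
)

# A few lines copied from the original post.
sample = """\
if ( -1e38 < myVar_site <= 999 ) then myVar_s3 = -1.607070;
else if ( 999 < myVar_site <= 1000 ) then myVar_s3 = 0.711604;
else if ( myVar_site > 12730 ) then myVar_s3 = -0.637903;
else myVar_s3 = 777;
"""

# One dict per matched line, keyed by group name.
rows = [m.groupdict() for m in RANGE_RE.finditer(sample)]
for row in rows:
    print(row)
```

The open-ended last branch ( myVar_site > 12730 ) and the plain else
line are deliberately not matched; they would need their own
alternatives in the pattern. The named groups are a Python convenience,
so a SAS translation would likely use numbered groups read back with
PRXPARSE and PRXPOSN instead.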

When I worked on SAS's log analytic engine, a Perl programmer at SAS
named John White introduced me to them through a Perl regex he had
written. At the time, SAS did not have Perl regex support, which meant
calling an external routine. Even so, the power I saw was stunning. I
finally took the time to learn them when I had to parse the SAS
language and SAS logs in minute detail. I now rely on them all of the
time.

If you haven't dived in, pick a problem and give it a go. Regular
expressions seem odd at first but they quickly become familiar.
RegexBuddy makes them very easy to build and test.

My .01,
Alan
http://www.savian.net