Date: Thu, 11 Apr 2002 17:41:14 -0400
Reply-To: "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Subject: Re: Matching strings of unequal length
Content-Type: text/plain; charset=iso-8859-1
Ron,
Here's a simple idea: store the short file names, each with a period appended
on the right, in a hash table. Then read the larger file and search the table
for matches using the colon modifier. Some simple sample code follows. I chose
1003 as the table size because it is much greater than 350 and has no small
factors (it is not strictly prime, since 1003 = 17 x 59; 1009 would be a prime
alternative). Matches are marked with 1, non-matches with a missing value.
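data small ;
   input fn $char44. ;
cards ;
LAPK.RDSUM.OSHPD.WWH
LAPK.DCMMRDRV.CNTLCARD.NPH
LAPK.DCMMRDRV.CNTLCARD.SMH
LAPK.DCMMRDRV.CNTLCARD.WWH
LAPK.RDMSTR.RDTXLRC.SORTIP
;
run ;

data large ;
   input fn $char44. ;
cards ;
LAPK.RDSUM.OSHPD.WWH.G0002V00
LAPK.DCMMRDRV.CNTLCARD.NPH.G0008V00
LAPK.DCMMRDRV.CNTLCARD.SMH.SM917E.DATA
LAPK.DCMMRDRV.CNTLCARD.SMX.SM917E.DATA
LAPK.DCMMRDRV.CNTLCARD.WWH.T1018B.G0001V00
LAPK.RDMSTR.RDTXLRC.SORTIP.G0661V00
LAPK.RDMSTR.RDTXLRC.SORTIZ.G0661V00
;
run ;

%let h = 1003 ;   /* table size, well above the 350 keys */

data match (keep = fn match) ;
   /* open-addressed hash table of keys, slots 0 through &h */
   array h (0:&h) $44. _temporary_ ;

   /* first iteration only: load every short name with '.' appended */
   if _n_ = 1 then do until (s) ;
      set small end = s ;
      k = trim(fn) || '.' ;
      /* hash on the first 6 bytes, then probe linearly until the key */
      /* is found or has been stored in a free slot                   */
      do j = mod(input(k,pib6.),&h) by 1 until (h(j) = k) ;
         if j > &h then j = 0 ;          /* wrap around the table     */
         if h(j) =: '' then h(j) = k ;   /* free slot: store the key  */
      end ;
   end ;

   set large ;
   /* a long name shares its first 6 bytes with its key, so it hashes */
   /* to the same start slot; probe until an empty slot is reached    */
   do j = mod(input(fn,pib6.),&h) by 1 until (h(j) =: '') ;
      if j > &h then j = 0 ;
      /* compare the key with the prefix of fn of the same length */
      if h(j) =: substr(fn,1,length(h(j))) then do ;
         match = 1 ;
         leave ;
      end ;
   end ;
run ;

proc print data = match ;
run ;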
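To see why the comparison goes through SUBSTR rather than comparing the two variables directly, here is a small illustrative step (the names KEY, PLAIN and PREFIX are made up for the demonstration). With the colon modifier, SAS truncates the longer operand to the length of the shorter one, and for two $44 variables both lengths are 44, trailing blanks included, so nothing is truncated unless one side is cut down explicitly:

```sas
data _null_ ;
   length key fn $44 ;
   key = 'LAPK.RDSUM.OSHPD.WWH.' ;                     /* stored key, period appended */
   fn  = 'LAPK.RDSUM.OSHPD.WWH.G0002V00' ;             /* long name from the big file */
   plain  = (key =: fn) ;                              /* both operands are 44 bytes  */
   prefix = (key =: substr (fn, 1, length (key))) ;    /* right operand cut to 21     */
   put plain= prefix= ;                                /* log shows: plain=0 prefix=1 */
run ;
```

The appended period is what keeps LAPK.DCMMRDRV.CNTLCARD.SMH from matching a name whose qualifier merely begins with SMH.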
I believe in V8.2 (which I do not have handy at the moment), one could use
the PROC SQL equivalent of the EQ: operator, EQT, to compare the names in a
join, like
small.FN EQT substr(large.FN, 1, length(small.FN))
Try it. If it works, it is somewhat simpler than coding a hash.
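For anyone who wants to try the join, here is a rough sketch of how it might look in PROC SQL, assuming the EQT operator is available in your release. The table name MATCH_SQL, the aliases S and L, and the column SHORT_FN are placeholders, not anything from the original files. Unlike the hash step, this condition does not require the next character to be a period, so a short name that is only a prefix of a longer qualifier would also match, and a long name that matches several short names would come back more than once.

```sas
proc sql ;
   create table match_sql as
   select l.fn ,
          s.fn as short_fn ,                                     /* matching short name, blank if none */
          case when s.fn is missing then . else 1 end as match   /* 1 = match, missing = no match      */
   from   large as l
          left join
          small as s
   on     s.fn eqt substr (l.fn, 1, length (s.fn))               /* the comparison suggested above     */
   ;
quit ;
```

With about 10,000 long names and 350 short ones, the join has to evaluate the condition for every pair, which is the price of the simpler code.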
Kind regards,
=====================
Paul M. Dorfman
Jacksonville, FL
=====================
> -----Original Message-----
> From: Carriere, Ron [mailto:rcarriere@MEDNET.UCLA.EDU]
> Sent: Thursday, April 11, 2002 3:26 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Matching strings of unequal length
>
>
> I have two files. The first is a table that looks like:
>
> LAPK.RDSUM.OSHPD.WWH DAILY
> LAPK.DCMMRDRV.CNTLCARD.NPH MONTHLY
> LAPK.DCMMRDRV.CNTLCARD.SMH MONTHLY
> LAPK.DCMMRDRV.CNTLCARD.WWH MONTHLY
> LAPK.RDMSTR.RDTXLRC.SORTIP WEEKLY
>
> The second file shows file names
>
> LAPK.RDSUM.OSHPD.WWH.G0002V00
> LAPK.DCMMRDRV.CNTLCARD.NPH.G0008V00
> LAPK.DCMMRDRV.CNTLCARD.SMH.SM917E.DATA
> LAPK.DCMMRDRV.CNTLCARD.WWH.T1018B.G0001V00
> LAPK.RDMSTR.RDTXLRC.SORTIP.G0661V00
>
> I would like to match up the file names in the second file with the table
> ignoring the extraneous data in file names, i.e. the generation identifiers
> and low level qualifiers (G0002v00/SM917E.DATA). The second file has
> approximately 10,000 entries the first 350. So in the example above the
> first file name matches up with the first entry and so on with the last
> file name matching up with the last table entry. If I could be certain
> that the only extraneous information in the second file were the generation
> numbers, then I could search for and strip them off and simply merge the
> two files. But this will not work for the third example in the second file
> and many other examples as well. Suggestions???
>
> Ron Carriere
> UCLA Medical Center
>
>