LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (September 1998, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 22 Sep 1998 14:51:19 -0400
Reply-To:     MICHAEL.RAITHEL@RAITHM49.CUSTOMS.SPRINT.COM
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:         "Michael A. Raithel" <MICHAEL.RAITHEL@RAITHM49.CUSTOMS.SPRINT.COM>
Subject:      Re: How to do a ?closest match'' merge?

Date: Fri, 18 Sep 1998 11:32:05 -0400 From: WHITLOI1 <WHITLOI1@WESTAT.COM> Subject: Re: How to do a ``closest match'' merge?

Subject: How to do a ``closest match'' merge? Summary: Use SQL, but take care for efficiency. Respondent: Ian Whitlock <whitloi1@westat.com>

Stuart Luppescu <abasl70@PRAXIS.SPC.UCHICAGO.EDU.SPC.UCHICAGO.EDU> asks a classic question about match on near values.

> I would like to merge two data sets matching on a continuous > numeric variable. It is very unlikely that there are any exact > matches, so I want to merge cases by ``closest match.'' I thought > of rounding the matching variables, merging on the rounded > variables, then selecting the case within each rounded value with > the smallest difference in unrounded variables, but this is > extremely messy. Does anyone know of a more elegant solution?

Here is example code where the data set W is matched against itself. Oddly, the algorithm described by Stuart is pretty close to the one used. In part messiness is a question of familiarity and the tools used.

The problem soon blows up if the files are large and one has no idea about the size of "close". In the case below the numbers are uniformly distributed, hence the measure of closeness is dependent on the size of W. The smaller one can make the upper bound of closeness, the more efficient the code will be. On the other hand the smaller the bound on closeness, the more likely it is to miss a match.

The secret of introducing the OFFSET data set to make the process more efficient was originally developed by Howard Schreier in answer to a similar question on SAS-L in 1993-1994. The method has appeared in SUGI (national and/or regional) papers by Paul Kent and by myself.

data w ; /* test data */ do i = 1 to 5000 ; key = ranuni ( 86194093 ) ; rndkey = round ( key , .0005 ) ; /* based on size of close */ output ; end ; run ;

data offset ; do offset = -.0005 , 0 , .0005 ; /* match round term above */ output ; end ; run ;

proc sql ; create table match as select first.I as I1 , second.I as I2 , first.key as key1 , second.key as key2 , abs ( first.key - second.key ) as diff from w as first , w as second , offset where first.rndkey = offset + second.rndkey and calculated diff > 0 group by first.key having calculated diff = min ( calculated diff ) order by I1 ; title "No match less than .0005" ;

select i , key from w except select i1 , key1 from match ; quit ;

The fact that the WHERE condition eliminates most pairs from consideration, makes the process efficient because it can be implemented with a simple sort and Cartesian merge (in SQL terms it is an equi-join). The HAVING clause singles out the desired match from a small set of possibilities caught in the WHERE net.

There are notes about optimization and remerging, but they are inherent in the problem and can be ignored as long as the time of execution is tolerable.

The second step reports the match failures, where the measure of closeness is set to small. If it is impossible to find a measure of closeness small enough for efficiency and large enough to get most matches, then one should split the input files into more homogeneous subsets and match the subsets.

Note that the method will produce more than one match when the minimum distance occurs more than once.

Ian Whitlock <whitloi1@westat.com>


Back to: Top of message | Previous page | Main SAS-L page