Date: Tue, 22 Sep 1998 14:51:19 -0400
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "Michael A. Raithel" <MICHAEL.RAITHEL@RAITHM49.CUSTOMS.SPRINT.COM>
Subject: Re: How to do a ``closest match'' merge?
Date: Fri, 18 Sep 1998 11:32:05 -0400
From: WHITLOI1 <WHITLOI1@WESTAT.COM>
Subject: Re: How to do a ``closest match'' merge?
Subject: How to do a ``closest match'' merge?
Summary: Use SQL, but take care for efficiency.
Respondent: Ian Whitlock <firstname.lastname@example.org>
Stuart Luppescu <abasl70@PRAXIS.SPC.UCHICAGO.EDU>
asks a classic question about match on near values.
> I would like to merge two data sets matching on a continuous
> numeric variable. It is very unlikely that there are any exact
> matches, so I want to merge cases by ``closest match.'' I thought
> of rounding the matching variables, merging on the rounded
> variables, then selecting the case within each rounded value with
> the smallest difference in unrounded variables, but this is
> extremely messy. Does anyone know of a more elegant solution?
Here is example code where the data set W is matched against
itself. Oddly, the algorithm described by Stuart is pretty close to
the one used here. In part, messiness is a question of familiarity
and the tools available.

The problem soon blows up if the files are large and one has no
idea about the size of "close". In the case below the numbers are
uniformly distributed, hence the measure of closeness depends on
the size of W. The smaller one can make the upper bound on
closeness, the more efficient the code will be. On the other hand,
the smaller the bound on closeness, the more likely it is to miss a
match.
The secret of introducing the OFFSET data set to make the process
efficient was originally developed by Howard Schreier in answer to
similar question on SAS-L in 1993-1994. The method has appeared in
SUGI (national and/or regional) papers by Paul Kent and by myself.
data w ;                                /* test data */
   do i = 1 to 5000 ;
      key = ranuni ( 86194093 ) ;
      rndkey = round ( key , .0005 ) ; /* based on size of W */
      output ;
   end ;
run ;

data offset ;
   do offset = -.0005 , 0 , .0005 ;    /* match round term above */
      output ;
   end ;
run ;

proc sql ;
   create table match as
   select first.I as I1 , second.I as I2 ,
          first.key as key1 , second.key as key2 ,
          abs ( first.key - second.key ) as diff
   from w as first , w as second , offset
   where first.rndkey = offset + second.rndkey
     and calculated diff > 0
   group by first.key
   having calculated diff = min ( calculated diff )
   order by I1 ;

   title "No match less than .0005" ;
   select i , key from w
   except
   select i1 , key1 from match ;
quit ;
The fact that the WHERE condition eliminates most pairs from
consideration makes the process efficient, because it can be
implemented with a simple sort and Cartesian merge (in SQL terms it
is an equi-join). The HAVING clause singles out the desired match
from the small set of possibilities caught in the WHERE net.
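For readers who want the logic outside SQL, here is a rough sketch
of the same rounded-key-plus-offset idea in Python (my illustration,
not part of the original post; the function name and test keys are
made up): records are bucketed by their rounded key, and each record
is compared only against its own bucket and the two neighboring
buckets, which play the role of the OFFSET data set.

```python
# Sketch of the rounded-key + offset technique (hypothetical helper,
# not the original SAS code).  Only pairs landing in the same or an
# adjacent bucket are ever compared, so most pairs are never seen.
from collections import defaultdict

def closest_matches(keys, width=0.0005):
    buckets = defaultdict(list)          # rounded key -> record indices
    for i, k in enumerate(keys):
        buckets[round(k / width)].append(i)

    best = {}                            # index -> (diff, matched index)
    for i, k in enumerate(keys):
        b = round(k / width)
        for off in (-1, 0, 1):           # the three "offsets"
            for j in buckets.get(b + off, ()):
                if j == i:               # like "calculated diff > 0"
                    continue
                d = abs(k - keys[j])
                if i not in best or d < best[i][0]:
                    best[i] = (d, j)
    return best                          # indices absent here found no
                                         # candidate in nearby buckets

matches = closest_matches([0.10, 0.1002, 0.50, 0.5004, 0.9])
# record 4 (key 0.9) has no candidate in a nearby bucket, so it is
# reported as a match failure by its absence
```

The same tradeoff applies here as in the SQL: a smaller width means
fewer comparisons but more records left unmatched.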
The SQL step writes log notes about optimization and remerging, but
these are inherent in the problem and can be ignored as long as the
execution time is tolerable.
The second query reports the match failures that arise when the
measure of closeness is set too small. If it is impossible to find a measure
of closeness small enough for efficiency and large enough to get
most matches, then one should split the input files into more
homogeneous subsets and match the subsets.
Note that the method will produce more than one match when the
minimum distance occurs more than once.
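A tiny illustration of that multiple-match case (again a Python
sketch of my own, not from the post; brute force rather than the
offset join, just to show the tie behavior of the HAVING clause):

```python
# Tie illustration: like HAVING calculated diff = min(calculated diff),
# keep every candidate at the minimum distance, so a record that is
# equidistant from two others yields two match rows.
def all_closest(keys):
    pairs = []
    for i, k in enumerate(keys):
        diffs = [(abs(k - q), j) for j, q in enumerate(keys) if j != i]
        m = min(d for d, _ in diffs)
        pairs.extend((i, j) for d, j in diffs if d == m)
    return pairs

pairs = all_closest([1.0, 2.0, 3.0])
# key 2.0 is equidistant from 1.0 and 3.0, so index 1 appears twice:
# [(0, 1), (1, 0), (1, 2), (2, 1)]
```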
Ian Whitlock <email@example.com>