Date: Tue, 22 Sep 1998 14:51:19 -0400
Reply-To: MICHAEL.RAITHEL@RAITHM49.CUSTOMS.SPRINT.COM
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "Michael A. Raithel" <MICHAEL.RAITHEL@RAITHM49.CUSTOMS.SPRINT.COM>
Subject: Re: How to do a ?closest match'' merge?
Date: Fri, 18 Sep 1998 11:32:05 -0400
From: WHITLOI1 <WHITLOI1@WESTAT.COM>
Subject: Re: How to do a ``closest match'' merge?
Subject: How to do a ``closest match'' merge?
Summary: Use SQL, but take care for efficiency.
Respondent: Ian Whitlock <whitloi1@westat.com>
Stuart Luppescu <abasl70@PRAXIS.SPC.UCHICAGO.EDU.SPC.UCHICAGO.EDU>
asks a classic question about match on near values.
> I would like to merge two data sets matching on a continuous
> numeric variable. It is very unlikely that there are any exact
> matches, so I want to merge cases by ``closest match.'' I
thought
> of rounding the matching variables, merging on the rounded
> variables, then selecting the case within each rounded value
with
> the smallest difference in unrounded variables, but this is
> extremely messy. Does anyone know of a more elegant solution?
Here is example code where the data set W is matched against
itself.
Oddly, the algorithm described by Stuart is pretty close to the one
used. In part messiness is a question of familiarity and the tools
used.
The problem soon blows up if the files are large and one has no
idea
about the size of "close". In the case below the numbers are
uniformly distributed, hence the measure of closeness is dependent
on
the size of W. The smaller one can make the upper bound of
closeness,
the more efficient the code will be. On the other hand the smaller
the bound on closeness, the more likely it is to miss a match.
The secret of introducing the OFFSET data set to make the process
more
efficient was originally developed by Howard Schreier in answer to
a
similar question on SAS-L in 1993-1994. The method has appeared in
SUGI (national and/or regional) papers by Paul Kent and by myself.
data w ; /* test data */
do i = 1 to 5000 ;
key = ranuni ( 86194093 ) ;
rndkey = round ( key , .0005 ) ; /* based on size of
close
*/
output ;
end ;
run ;
data offset ;
do offset = -.0005 , 0 , .0005 ; /* match round term
above
*/
output ;
end ;
run ;
proc sql ;
create table match as
select first.I as I1 , second.I as I2 ,
first.key as key1 , second.key as key2 ,
abs ( first.key - second.key ) as diff
from w as first , w as second , offset
where first.rndkey =
offset + second.rndkey
and calculated diff > 0
group by first.key
having calculated diff = min ( calculated diff )
order by I1
;
title "No match less than .0005" ;
select i , key from w
except
select i1 , key1 from match
;
quit ;
The fact that the WHERE condition eliminates most pairs from
consideration, makes the process efficient because it can be
implemented with a simple sort and Cartesian merge (in SQL terms it
is
an equi-join). The HAVING clause singles out the desired match
from
a
small set of possibilities caught in the WHERE net.
There are notes about optimization and remerging, but they are
inherent in the problem and can be ignored as long as the time of
execution is tolerable.
The second step reports the match failures, where the measure of
closeness is set to small. If it is impossible to find a measure
of closeness small enough for efficiency and large enough to get
most matches, then one should split the input files into more
homogeneous subsets and match the subsets.
Note that the method will produce more than one match when the
minimum distance occurs more than once.
Ian Whitlock <whitloi1@westat.com>