Date: Fri, 18 Jul 1997 11:22:53 +1000
Reply-To: Tim CHURCHES <TCHUR@DOH.HEALTH.NSW.GOV.AU>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Tim CHURCHES <TCHUR@DOH.HEALTH.NSW.GOV.AU>
Subject: Re: LINKPro System -Reply -Reply
Content-Type: text/plain
This is a bit off-topic but probabilistic record linkage (which includes
tasks such as customer list de-duplication) is probably of interest to
many SAS users.
Richard Hockey notes:
>>> Richard Hockey <richardh@quokka.epidem.uwa.edu.au>
We have done a lot of successful linkage work using the original
Links macros which we have extended to encompass all aspects of
probabilistic record linkage. My impression is that LInksPro now
includes a lot of this extra stuff.
The advantage of the Links macros is that they run under any OS
running SAS and they are extremely flexible/configurable. Automatch
(mentioned below) on the other hand is not. It is a virtual blackbox
system.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I have to take issue with the statement that AutoMatch is "a virtual black
box". In fact, almost every aspect of the linkage process and its
parameters is configurable in AutoMatch, and every aspect of its
operation is well documented both in the manual and in the scientific
literature (see Medline record below). AutoMatch also runs on just
about every platform that SAS runs on, from PC to mainframe. The
only downside to AutoMatch (apart from its reasonable but not totally
trivial cost) is the need to export all your data to ASCII files.
There are two other record linkage/de-duplication products I know of:
SSA-Names and ScrubMaster. Both of these products are definitely
"black boxes" and tend to be offered as "turn-key" solutions. AutoMatch
(and no doubt LinkPRO and/or the Links macros) requires iterative
development of linkage strategies to get optimal results, although in most
circumstances you can get pretty good results with minimal fiddling.
Tim Churches
NSW Health Department
Sydney, Australia
Email: tchur@doh.health.nsw.gov.au
Medline Record:
TITLE
Probabilistic linkage of large public health data files [see comments]
AUTHOR(S)
Jaro-MA
SOURCE (BIBLIOGRAPHIC CITATION)
Stat-Med.1995 Mar 15-Apr 15; 14(5-7): 491-8.
INTERNATIONAL STANDARD SERIAL NUMBER
0277-6715
LANGUAGE OF ARTICLE
ENGLISH
ABSTRACT
Probabilistic linkage technology makes it feasible and efficient to link
large public health databases in a statistically justifiable manner. The
problem addressed by the methodology is that of matching two files of
individual data under conditions of uncertainty. Each field is subject to
error which is measured by the probability that the field agrees given a
record pair matches (called the m probability) and probabilities of
chance agreement of its value states (called the u probability). Fellegi
and Sunter pioneered record linkage theory. Advances in methodology
include use of an EM algorithm for parameter estimation, optimization of
matches by means of a linear sum assignment program, and more recently, a
probability model that addresses both m and u probabilities for all value
states of a field. This provides a means for obtaining greater precision
from non-uniformly distributed fields, without the theoretical
complications arising from frequency-based matching alone. The model
includes an iterative parameter estimation procedure that is more robust
than pre-match estimation techniques. The methodology was originally
developed and tested by the author at the U.S. Census Bureau for census
undercount estimation. The more recent advances and a new generalized
software system were tested and validated by linking highway crashes to
Emergency Medical Service (EMS) reports and to hospital admission records
for the National Highway Traffic Safety Administration (NHTSA).
MEDLINE ACCESSION NUMBER
95312707
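
For anyone who has not met the m and u probabilities mentioned in the
abstract above, here is a rough sketch of how they are usually turned into
a match weight for a pair of records in the Fellegi-Sunter approach. It is
only an illustration and not how AutoMatch or the Links macros are actually
coded: the field names, the m and u values and the function names are all
made up for the example, and fields are compared by simple exact agreement.

```python
import math


def field_weight(m, u, agrees):
    """Log2 likelihood-ratio weight for a single field.

    m = P(field agrees | records are a true match)
    u = P(field agrees | records are not a match)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1.0 - m) / (1.0 - u))


def pair_weight(record_a, record_b, params):
    """Sum the field weights over every field listed in params."""
    total = 0.0
    for field, (m, u) in params.items():
        agrees = record_a[field] == record_b[field]
        total += field_weight(m, u, agrees)
    return total


# Hypothetical (m, u) values, chosen only for illustration.
PARAMS = {
    "surname": (0.95, 0.01),
    "sex": (0.98, 0.50),
    "birth_year": (0.90, 0.02),
}

# Made-up records.
a = {"surname": "JONES", "sex": "F", "birth_year": 1961}
b = {"surname": "JONES", "sex": "F", "birth_year": 1962}
c = {"surname": "SMITH", "sex": "M", "birth_year": 1975}

print("a vs b:", round(pair_weight(a, b, PARAMS), 2))
print("a vs c:", round(pair_weight(a, c, PARAMS), 2))
```

Pairs with a high total weight are treated as likely matches and pairs
with a low (negative) weight as likely non-matches. Where to put the
cut-offs between "match", "clerical review" and "non-match" is part of the
iterative tuning mentioned earlier, and is not shown here.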