LISTSERV at the University of Georgia
Date:         Fri, 18 Jul 1997 11:22:53 +1000
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
Subject:      Re: LINKPro System -Reply -Reply
Comments: To:
Content-Type: text/plain

This is a bit off-topic but probabilistic record linkage (which includes tasks such as customer list de-duplication) is probably of interest to many SAS users.

Richard Hockey notes:

>>> Richard Hockey <>
We have done a lot of successful linkage work using the original Links macros which we have extended to encompass all aspects of probabilistic record linkage. My impression is that LInksPro now includes a lot of this extra stuff. The advantage of the Links macros are that they run under any OS running SAS and they are extremely flexible/configurable. Automatch (mentioned below) on the other hand is not. It is a virtual blackbox system.
<<<

I have to take issue with the statement that AutoMatch is "a virtual black box". In fact, almost every aspect of the linkage process and its parameters is configurable in AutoMatch, and every aspect of its operation is well documented both in the manual and in the scientific literature (see Medline record below). AutoMatch also runs on just about every platform which SAS runs on, from PC to mainframe. The only downside to AutoMatch (apart from its reasonable but not totally trivial cost) is the need to export all your data to ASCII files.

There are two other record linkage/de-duplication products I know of: SSA-Names and ScrubMaster. Both of these products are definitely "black boxes" and tend to be offered as "turn-key" solutions. AutoMatch (and no doubt LinkPRO and/or the Links macros) requires iterative development of linkage strategies to get optimal results, although in most circumstances you can get pretty good results with minimal fiddling.

Tim Churches
NSW Health Department
Sydney, Australia
Email:

Medline Record:

TITLE: Probabilistic linkage of large public health data files [see comments]
AUTHOR(S): Jaro-MA
SOURCE (BIBLIOGRAPHIC CITATION): Stat-Med. 1995 Mar 15-Apr 15; 14(5-7): 491-8.
INTERNATIONAL STANDARD SERIAL NUMBER: 0277-6715
LANGUAGE OF ARTICLE: ENGLISH
ABSTRACT: Probabilistic linkage technology makes it feasible and efficient to link large public health databases in a statistically justifiable manner. The problem addressed by the methodology is that of matching two files of individual data under conditions of uncertainty. Each field is subject to error which is measured by the probability that the field agrees given a record pair matches (called the m probability) and probabilities of chance agreement of its value states (called the u probability). Fellegi and Sunter pioneered record linkage theory. Advances in methodology include use of an EM algorithm for parameter estimation, optimization of matches by means of a linear sum assignment program, and more recently, a probability model that addresses both m and u probabilities for all value states of a field. This provides a means for obtaining greater precision from non-uniformly distributed fields, without the theoretical complications arising from frequency-based matching alone. The model includes an iterative parameter estimation procedure that is more robust than pre-match estimation techniques. The methodology was originally developed and tested by the author at the U.S. Census Bureau for census undercount estimation. The more recent advances and a new generalized software system were tested and validated by linking highway crashes to Emergency Medical Service (EMS) reports and to hospital admission records for the National Highway Traffic Safety Administration (NHTSA).
MEDLINE ACCESSION NUMBER: 95312707
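
For anyone unfamiliar with the m and u probabilities mentioned in the abstract, here is a rough Python sketch of how they are usually turned into match weights in the Fellegi-Sunter approach: a field that agrees contributes log2(m/u), a field that disagrees contributes log2((1-m)/(1-u)), and the contributions are summed over the fields. This is only an illustration, not how AutoMatch or the Links macros are actually implemented; the field names, the m and u values, and the function names are all made up for the example.

```python
import math

# Hypothetical (m, u) pairs per field; the values are illustrative only.
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match), i.e. chance agreement
FIELD_PROBS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.05),
    "sex": (0.99, 0.50),
}


def field_weight(m, u, agrees):
    """Log2 likelihood-ratio weight for one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_weight(rec_a, rec_b, probs=FIELD_PROBS):
    """Sum the field weights over all compared fields (exact comparison)."""
    return sum(
        field_weight(m, u, rec_a[field] == rec_b[field])
        for field, (m, u) in probs.items()
    )


if __name__ == "__main__":
    a = {"surname": "SMITH", "birth_year": 1960, "sex": "F"}
    b = {"surname": "SMITH", "birth_year": 1961, "sex": "F"}
    # A higher total weight means the pair is more likely a true match.
    print(round(match_weight(a, b), 2))
```

Note that a field such as sex, which agrees by chance half the time, adds very little weight when it agrees but a large penalty when it disagrees. In practice the pairs are then sorted by total weight and cut-offs are chosen for "match", "non-match" and "needs clerical review"; choosing those cut-offs, and estimating m and u in the first place, is where the iterative fiddling comes in.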
