LISTSERV at the University of Georgia
Date:         Mon, 21 Jul 1997 18:04:39 +1000
Reply-To:     Tim CHURCHES <TCHUR@DOH.HEALTH.NSW.GOV.AU>
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:         Tim CHURCHES <TCHUR@DOH.HEALTH.NSW.GOV.AU>
Subject:      Proposed end of thread: LINKPro System
Comments: To: richardh@quokka.epidem.uwa.edu.au
Comments: cc: mjaro@matchware.com
Content-Type: text/plain

I suggest that this thread be terminated as it is now a bit off topic - but not before a short reply...

I agree that the ability to add deterministic rules is useful in some circumstances. We do post-processing via SAS macros to achieve this with AutoMatch. I gather that the next version of AutoMatch will have this feature built in. The match strategy and specifications in AutoMatch are completely configurable. The undup (deduplicate) function now behaves properly within blocks and the GEOMATCH option will give you what you want across blocks. Composite variables for differential weights are easy to set up and don't require any duplication of data in the input files. As for art versus science, probabilistic record linkage is like any other branch of statistical science: you need some theoretical underpinnings which you then apply with considered license...

Anyway, your LINKS macros sound interesting and I am keen to keep an open mind. Why not make them available on an FTP site somewhere so we can all try them out?

Tim Churches
Epidemiology & Health Surveillance Branch
NSW Health Department
Sydney, Australia
Email: tchur@doh.health.nsw.gov.au

>>> Richard Hockey <richardh@quokka.epidem.uwa.edu.au> 21/July/1997 03:49pm >>>
Hi

I was probably being a bit provocative when I said Automatch was a "blackbox", but compared to Links it is probably a reasonable description. The beauty of Links is that you can write your own rules by either modifying the macros or embedding SAS code within your Links code. I think an essential element of any record linkage system is the facility to add your own rules. (GIRLS has this, although you had to write them in PL/1.) With Automatch you are stuck with the rules they give you (and it never seems to have quite the one you need).

Another problem we found with Automatch was its Undup function. Unlike conventional record linkage systems, it doesn't compare every possibility within a pocket, and the method it uses is basically flawed (meaning some potential links are lost). No facility is provided to do conventional internal linkage without a lot of post-processing to remove the redundant links and resolve crosslinks. There is also no way to perform differential weighting without creating composite variables, e.g. different weights for marital status for males and females (they are very different).

I think the problem with Automatch is that it treats record linkage as a science when it's really an artform. Some of the wonderful theoretical statistical concepts that have been incorporated into it may sound good, but in fact they add very little utility to the end result.

I think I've gone on enough.

Cheers
Richard

> Date: Fri, 18 Jul 1997 11:22:53 +1000
> From: Tim CHURCHES <TCHUR@doh.health.nsw.gov.au>
> Subject: Re: LINKPro System -Reply -Reply
> To: TCHUR@doh.health.nsw.gov.au, richardh@quokka.epidem.uwa.edu.au,
>     SAS-L@UGA.CC.UGA.EDU

> This is a bit off-topic but probabilistic record linkage (which includes tasks such as customer list de-duplication) is probably of interest to many SAS users.
>
> Richard Hockey notes:
> >>> Richard Hockey <richardh@quokka.epidem.uwa.edu.au>
> We have done a lot of successful linkage work using the original Links macros which we have extended to encompass all aspects of probabilistic record linkage. My impression is that LInksPro now includes a lot of this extra stuff.
> The advantage of the Links macros is that they run under any OS running SAS and they are extremely flexible/configurable. Automatch (mentioned below) on the other hand is not. It is a virtual blackbox system.
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
>
> I have to take issue with the statement that AutoMatch is "a virtual black box". In fact, almost every aspect of the linkage process and parameters are configurable in AutoMatch and every aspect of its operation is well documented both in the manual and in the scientific literature (see Medline record below). AutoMatch also runs on just about every platform which SAS runs on, from PC to mainframe. The only downside to AutoMatch (apart from its reasonable but not totally trivial cost) is the need to export all your data to ASCII files.
>
> There are two other record linkage/de-duplication products I know of: SSA-Names and ScrubMaster. Both of these products are definitely "black boxes" and tend to be offered as "turn-key" solutions. AutoMatch (and no doubt LinkPRO and/or the Links macros) require iterative development of linkage strategies to get optimal results, although in most circumstances you can get pretty good results with minimal fiddling.
>
> Tim Churches
> NSW Health Department
> Sydney, Australia
> Email: tchur@doh.health.nsw.gov.au
>
> Medline Record:
> TITLE
> Probabilistic linkage of large public health data files [see comments]
> AUTHOR(S)
> Jaro-MA
> SOURCE (BIBLIOGRAPHIC CITATION)
> Stat-Med.1995 Mar 15-Apr 15; 14(5-7): 491-8.
> INTERNATIONAL STANDARD SERIAL NUMBER
> 0277-6715
> LANGUAGE OF ARTICLE
> ENGLISH
> ABSTRACT
> Probabilistic linkage technology makes it feasible and efficient to link large public health databases in a statistically justifiable manner. The problem addressed by the methodology is that of matching two files of individual data under conditions of uncertainty. Each field is subject to error which is measured by the probability that the field agrees given a record pair matches (called the m probability) and probabilities of chance agreement of its value states (called the u probability). Fellegi and Sunter pioneered record linkage theory. Advances in methodology include use of an EM algorithm for parameter estimation, optimization of matches by means of a linear sum assignment program, and more recently, a probability model that addresses both m and u probabilities for all value states of a field. This provides a means for obtaining greater precision from non-uniformly distributed fields, without the theoretical complications arising from frequency-based matching alone. The model includes an iterative parameter estimation procedure that is more robust than pre-match estimation techniques. The methodology was originally developed and tested by the author at the U.S. Census Bureau for census undercount estimation. The more recent advances and a new generalized software system were tested and validated by linking highway crashes to Emergency Medical Service (EMS) reports and to hospital admission records for the National Highway Traffic Safety Administration (NHTSA).
> MEDLINE ACCESSION NUMBER
> 95312707
>

| Richard Hockey                   email:richardh@epidem.uwa.edu.au  |
| Department of Public Health      phone:+61 8 9380-1292     _--_|\  |
| University of Western Australia  fax:  +61 8 9380-1188    /      \ |
| NEDLANDS WA 6907                                          *_.--._/ |
| Australia                                                       v  |
"It is better to have tried and failed than to have failed to try, but the result's the same."
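
For readers unfamiliar with the m and u probabilities mentioned in the quoted abstract, here is a minimal sketch of how they are commonly turned into field weights in the Fellegi-Sunter approach: an agreeing field contributes log2(m/u) and a disagreeing field contributes log2((1-m)/(1-u)), and the weights are summed over a record pair. This is not code from AutoMatch, LinkPro, or the Links macros. The function names, field names, and all m/u values below are hypothetical and chosen only for illustration.

```python
import math


def field_weights(m, u):
    """Return (agreement, disagreement) log2 weights for one field.

    m: probability the field agrees given the record pair is a true match
    u: probability the field agrees by chance given the pair is not a match
    """
    if not (0 < m < 1 and 0 < u < 1):
        raise ValueError("m and u must lie strictly between 0 and 1")
    return math.log2(m / u), math.log2((1 - m) / (1 - u))


def pair_score(rec_a, rec_b, params):
    """Sum the field weights over a record pair.

    Uses simple exact comparison per field; a higher total suggests
    the two records are more likely to refer to the same entity.
    """
    total = 0.0
    for field, (m, u) in params.items():
        agree_w, disagree_w = field_weights(m, u)
        total += agree_w if rec_a[field] == rec_b[field] else disagree_w
    return total


# Hypothetical m/u values, for illustration only (not estimated from data).
PARAMS = {
    "surname": (0.95, 0.01),
    "birth_year": (0.98, 0.02),
    "sex": (0.99, 0.50),
}

rec_1 = {"surname": "SMITH", "birth_year": 1950, "sex": "F"}
rec_2 = {"surname": "SMITH", "birth_year": 1950, "sex": "F"}
rec_3 = {"surname": "JONES", "birth_year": 1962, "sex": "F"}

likely_match = pair_score(rec_1, rec_2, PARAMS)
likely_nonmatch = pair_score(rec_1, rec_3, PARAMS)
```

Note that a field such as sex, which agrees by chance about half the time, contributes only a small agreement weight, while a rarely coinciding field such as surname contributes a large one. In practice the m and u values would be estimated from the data (the abstract mentions an EM algorithm for this) rather than set by hand.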

