Date: Mon, 21 Jul 1997 18:04:39 +1000
Reply-To: Tim CHURCHES <TCHUR@DOH.HEALTH.NSW.GOV.AU>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Tim CHURCHES <TCHUR@DOH.HEALTH.NSW.GOV.AU>
Subject: Proposed end of thread: LINKPro System
Content-Type: text/plain
I suggest that this thread be terminated as it is now a bit off topic - but not
before a short reply...
I agree that the ability to add deterministic rules is useful in some
circumstances. We do post-processing via SAS macros to achieve this with
AutoMatch. I gather that the next version of AutoMatch will have this feature
built in. The match strategy and specifications in AutoMatch are completely
configurable. The undup (deduplicate) function now behaves properly within
blocks and the GEOMATCH option will give you what you want across blocks.
Composite variables for differential weights are easy to set up and don't
require any duplication of data in the input files. As for art versus science,
probabilistic record linkage is like any other branch of statistical science:
you need some theoretical underpinnings which you then apply with considered
license...
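To make the composite-variable idea concrete, here is a minimal sketch in Python (not AutoMatch syntax, and not how AutoMatch does it internally) for the marital status by sex case raised in the message quoted below. The field names, the fixed m value of 0.9, and the use of a value's relative frequency as its u probability are all assumptions made purely for illustration.

```python
import math
from collections import Counter


def composite(record):
    """Build a composite comparison value from sex and marital status."""
    return f"{record['sex']}|{record['marital']}"


def agreement_weights(records, m=0.9):
    """Return a value-specific agreement weight, log2(m / u), per composite value.

    Assumption: u for a value is approximated by its relative frequency in
    the file, and m is a single hypothetical constant.
    """
    counts = Counter(composite(r) for r in records)
    total = sum(counts.values())
    return {value: math.log2(m / (n / total)) for value, n in counts.items()}


# Tiny made-up file: agreement on "married" is rarer among females here,
# so it earns a larger weight than the same agreement among males.
sample = [
    {"sex": "M", "marital": "married"},
    {"sex": "M", "marital": "married"},
    {"sex": "F", "marital": "married"},
    {"sex": "F", "marital": "widowed"},
]
weights = agreement_weights(sample)
```

Because the composite value is computed on the fly from fields already present, nothing has to be duplicated in the input files.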
Anyway, your LINKS macros sound interesting and I am keen to keep an open mind.
Why not make them available on an FTP site somewhere so we can all try them
out?
Tim Churches
Epidemiology & Health Surveillance Branch
NSW Health Department
Sydney, Australia
Email: tchur@doh.health.nsw.gov.au
>>> Richard Hockey <richardh@quokka.epidem.uwa.edu.au> 21/July/1997 03:49pm >>>
Hi
I was probably being a bit provocative when I said Automatch was a
"blackbox" but compared to Links it is probably a reasonable
description. The beauty of Links is that you can write your own
rules by either modifying the macros or embedding SAS code within
your Links code. I think an essential element of any record linkage
system is the facility to add your own rules. (GIRLS has this,
although you had to write them in PL/1.) With Automatch you are stuck
with the rules they give you (and it never seems to have quite the
one you need). Another problem we found with Automatch was its Undup
function. Unlike conventional record linkage systems it doesn't
compare every possibility within a pocket and the method it uses is
basically flawed (meaning some potential links are lost). No
facility is provided to do conventional internal linkage without a
lot of post processing to remove the redundant links and resolve
crosslinks. There is also no way to perform differential weighting
without creating composite variables, e.g. different weights for marital
status for males and females (they are very different). I think the
problem with Automatch is that it treats record linkage as a science
when it's really an artform. Some of the wonderful theoretical
statistical concepts that have been incorporated into it may sound
good but in fact they add very little utility to the end result.
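For readers unfamiliar with the term, "comparing every possibility within a pocket" in an internal linkage can be sketched as follows. This is a generic Python illustration only, not the Links macros and not AutoMatch's Undup; the record layout and the `block_key` function are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Yield every unordered pair of record ids that share a pocket (block)."""
    pockets = defaultdict(list)
    for rec in records:
        pockets[block_key(rec)].append(rec["id"])
    for ids in pockets.values():
        # n records in a pocket give n * (n - 1) / 2 candidate pairs
        yield from combinations(ids, 2)


# Hypothetical file blocked on surname: three records share one pocket.
people = [
    {"id": 1, "surname": "SMITH"},
    {"id": 2, "surname": "SMITH"},
    {"id": 3, "surname": "SMITH"},
    {"id": 4, "surname": "JONES"},
]
pairs = list(candidate_pairs(people, lambda r: r["surname"]))
```

Each candidate pair would then be scored, and redundant links and crosslinks resolved afterwards.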
I think I've gone on enough
Cheers
Richard
> Date: Fri, 18 Jul 1997 11:22:53 +1000
> From: Tim CHURCHES <TCHUR@doh.health.nsw.gov.au>
> Subject: Re: LINKPro System -Reply -Reply
> To: TCHUR@doh.health.nsw.gov.au, richardh@quokka.epidem.uwa.edu.au,
> SAS-L@UGA.CC.UGA.EDU
> This is a bit off-topic but probabilistic record linkage (which includes
> tasks such as customer list de-duplication) is probably of interest to
> many SAS users.
>
> Richard Hockey notes:
> >>> Richard Hockey <richardh@quokka.epidem.uwa.edu.au>
> We have done a lot of successful linkage work using the original
> Links macros which we have extended to encompass all aspects of
> probabilistic record linkage. My impression is that LINKPro now
> includes a lot of this extra stuff.
> The advantages of the Links macros are that they run under any OS
> running SAS and they are extremely flexible/configurable. Automatch
> (mentioned below) on the other hand is not. It is a virtual blackbox
> system.
> <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
> I have to take issue with the statement that AutoMatch is "a virtual black
> box". In fact, almost every aspect of the linkage process and
> parameters are configurable in AutoMatch and every aspect of its
> operation is well documented both in the manual and in the scientific
> literature (see Medline record below). AutoMatch also runs on just
> about every platform which SAS runs on from PC to mainframe. The
> only downside to AutoMatch (apart from its reasonable but not totally
> trivial cost) is the need to export all your data to ASCII files.
>
> There are two other record linkage/de-duplication products I know of:
> SSA-Names and ScrubMaster. Both of these products are definitely
> "black boxes" and tend to be offered as "turn-key" solutions. AutoMatch
> (and no doubt LinkPRO and/or the Links macros) require iterative
> development of linkage strategies to get optimal results, although in most
> circumstances you can get pretty good results with minimal fiddling.
>
> Tim Churches
> NSW Health Department
> Sydney, Australia
> Email: tchur@doh.health.nsw.gov.au
>
> Medline Record:
> TITLE
> Probabilistic linkage of large public health data files [see comments]
> AUTHOR(S)
> Jaro-MA
> SOURCE (BIBLIOGRAPHIC CITATION)
> Stat-Med.1995 Mar 15-Apr 15; 14(5-7): 491-8.
> INTERNATIONAL STANDARD SERIAL NUMBER
> 0277-6715
> LANGUAGE OF ARTICLE
> ENGLISH
> ABSTRACT
> Probabilistic linkage technology makes it feasible and efficient to link
> large public health databases in a statistically justifiable manner. The
> problem addressed by the methodology is that of matching two files of
> individual data under conditions of uncertainty. Each field is subject to
> error which is measured by the probability that the field agrees given a
> record pair matches (called the m probability) and probabilities of chance
> agreement of its value states (called the u probability). Fellegi and
> Sunter pioneered record linkage theory. Advances in methodology include
> use of an EM algorithm for parameter estimation, optimization of matches
> by means of a linear sum assignment program, and more recently, a
> probability model that addresses both m and u probabilities for all value
> states of a field. This provides a means for obtaining greater precision
> from non-uniformly distributed fields, without the theoretical
> complications arising from frequency-based matching alone. The model
> includes an iterative parameter estimation procedure that is more robust
> than pre-match estimation techniques. The methodology was originally
> developed and tested by the author at the U.S. Census Bureau for census
> undercount estimation. The more recent advances and a new generalized
> software system were tested and validated by linking highway crashes to
> Emergency Medical Service (EMS) reports and to hospital admission records
> for the National Highway Traffic Safety Administration (NHTSA).
> MEDLINE ACCESSION NUMBER
> 95312707
>
>
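As a rough illustration of how the m and u probabilities described in the abstract above are usually turned into match weights in the Fellegi and Sunter framework, here is a short Python sketch. The field names and the m and u values are made up for the example; real values would be estimated from the data (for instance with the EM procedure the abstract mentions).

```python
import math


def field_weights(m, u):
    """Return (agreement, disagreement) weights for one field.

    m: probability the field agrees given the pair is a true match.
    u: probability the field agrees by chance in a non-matching pair.
    """
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return agree, disagree


def pair_score(agreements, params):
    """Sum the field weights for one record pair.

    agreements: {field: True/False}, params: {field: (m, u)}.
    """
    total = 0.0
    for field, agrees in agreements.items():
        agree_w, disagree_w = field_weights(*params[field])
        total += agree_w if agrees else disagree_w
    return total


# Hypothetical parameters for two fields
params = {"surname": (0.9, 0.1), "sex": (0.95, 0.5)}
score = pair_score({"surname": True, "sex": False}, params)
```

Pairs with a high total score are treated as likely matches and pairs with a low score as likely non-matches; where the cut-offs sit is a judgement made for each linkage.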
| Richard Hockey email:richardh@epidem.uwa.edu.au |
| Department of Public Health phone:+61 8 9380-1292 _--_|\ |
| University of Western Australia fax: +61 8 9380-1188 / \ |
| NEDLANDS WA 6907 *_.--._/ |
| Australia v |
"It is better to have tried and failed than to have failed to try,
but the result's the same."
|