Date: Sat, 30 Aug 2008 17:38:51 -0400
Reply-To: Arthur Tabachneck <art297@NETSCAPE.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Arthur Tabachneck <art297@NETSCAPE.NET>
Subject: Re: Sounds Like...
Mike,
Thanks for the additional option, however I was planning to respond anyhow
with one additional tidbit.
I tried both of the proposed proc sql offerings but, on a 4-processor
server, I had to increase the number of records to enable the comparison
as otherwise both methods completed without any reported time having
elapsed.
Thus, trying:
data names;
input name : $15. @@;
soundex=soundex(name);
do i=1 to 100000;
output;
end;
datalines;
SMITH SMYTHE O'HENRY OHENRY ZDEB ZDYB SOO SU CLINTON CLINTEN
WASHINGTON WOSHINGTEN TERRI TERRY JACKSON JAKSON SCHMIDT SCHMIT KALE CALE
GREG GREGORY
;
proc sql;
create table matches as
select *
from a,b
where name1 =* name2;
quit;
proc sql;
create table matches as
select *
from a,b
where soundex(name1) = soundex(name2);
quit;
was a BIG mistake, but with an interesting finding. The first method took
about an hour and a half, but did successfully complete the process.
The second method wiped out my system's resources.
Dropping the number of replicatons down to 1000 allowed both methods to
successfully complete, with the latter method running over five times
faster than the former.
Art
---------
On Sat, 30 Aug 2008 15:34:49 -0400, Mike Zdeb <msz03@ALBANY.EDU> wrote:
>hi ... in case anyone would like to try a SOUNDEX alternative, NYSIIS
coding is another phonetic
>based coding scheme
>
>there is no NYSIIS function in SAS, but a NYSIIS macro is available
>
>http://www.albany.edu/~msz03/basug/nysiis.zip
>
>I originally found the macro on the SAMHSA web site
>
>http://csat.samhsa.gov/
>
>it seems to have disappeared from that site, but it's referenced in both
of the following
>
>Linking Client Records from Substance Abuse, Mental Health and Medicaid
State Agencies
>http://csat.samhsa.gov/IDBSE/idb/modules/linking/material/linking.pdf
>
>Record Linking 102
>http://csat.samhsa.gov/IDBSE/idb/modules/linking/presentations/RecordLinki
ng102_alt.ppt
>
>here's a SOUNDEX/NYSIIS comparison (remember the original posting
compared GREG and GREGORY ...
>they are still different with the NYSIIS CODE)
>
>data names;
>input name : $15. @@;
>soundex=soundex(name);
>%nysiis(name,nysiis_code);
>datalines;
>SMITH SMYTHE O'HENRY OHENRY ZDEB ZDYB SOO SU CLINTON CLINTEN
>WASHINGTON WOSHINGTEN TERRI TERRY JACKSON JAKSON SCHMIDT SCHMIT KALE CALE
>GREG GREGORY
>;
>run;
>
> nysiis_
>name soundex code
>SMITH S53 SNATH
>SMYTHE S53 SNAT
>O'HENRY O56 O'ANR
>OHENRY O56 OANR
>ZDEB Z31 ZDAB
>ZDYB Z31 ZDAB
>SOO S S
>SU S S
>CLINTON C4535 CLANTAN
>CLINTEN C4535 CLANTAN
>WASHINGTON W25235 WASANGTAN
>WOSHINGTEN W25235 WASANGTAN
>TERRI T6 TAR
>TERRY T6 TAR
>JACKSON J25 JACSAN
>JAKSON J25 JACSAN
>SCHMIDT S53 SNAD
>SCHMIT S53 SNAT
>KALE K4 CAL
>CALE C4 CAL
>GREG G62 GRAG
>GREGORY G626 GRAGAR
>
>
>if you use either SOUNDEX or NYSIIS coding, it's nice to know the coding
rules
>
>http://en.wikipedia.org/wiki/Soundex
>http://en.wikipedia.org/wiki/NYSIIS
>
>(Wikipedia does actually have them correct)
>
>ps re Paul Choate's posting ... great to have both "DDS"
and "Extraction" in the same
>signature, "DDS Data Extraction" ... so, is there also a "DDS Data Root
Canal"
>
>--
>Mike Zdeb
>U@Albany School of Public Health
>One University Place
>Rensselaer, New York 12144-3456
>P/518-402-6479 F/630-604-1475
>
>> My guess would be on the second the SQL processor computes the soundex
>> scores and then matches, hence a optimized join, while the first matches
>> and then computes the soundex scores, hence a Cartesian join.
>>
>> As expected the second runs much more quickly. Interesting point Mike.
>>
>> Paul Choate
>> DDS Data Extraction
>> (916) 654-2160
>>
>> -----Original Message-----
>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
>> Mike Zdeb
>> Sent: Friday, August 29, 2008 11:33 AM
>> To: SAS-L@LISTSERV.UGA.EDU
>> Subject: Re: Sounds Like...
>>
>> hi ... so if the SOUNDS LIKE operator compares the SOUNDEX of variables,
>> I don't understand why
>> the first use of PROC SQL below produces the NOTE
>>
>> NOTE: The execution of this query involves performing one or more
>> Cartesian product joins that can
>> not be optimized.
>>
>> while the second on does not
>>
>> can any of you SQL folk explain? thanks
>>
>>
>> data a;
>> input name1 : $10. @@;
>> datalines;
>> SMITH SMYTHE O'HENRY OHENRY ZDEB ZDYB SOO SU CLINTON CLINTEN
>> ;
>> run;
>>
>> data b;
>> set a (rename=(name1=name2));
>> run;
>>
>> proc sql;
>> create table matches as
>> select *
>> from a,b
>> where name1 =* name2;
>> quit;
>>
>> proc sql;
>> create table matches as
>> select *
>> from a,b
>> where soundex(name1) = soundex(name2);
>> quit;
>>
>>
>> --
>> Mike Zdeb
>> U@Albany School of Public Health
>> One University Place
>> Rensselaer, New York 12144-3456
>> P/518-402-6479 F/630-604-1475
>>
>>
|