LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (October 2002, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 31 Oct 2002 14:39:12 -0500
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: SQL Question - Doing equijoins and getting this log message
Comments: To: "Gerstle, John" <yzg9@CDC.GOV>
Content-Type: text/plain; charset="iso-8859-1"

Summary: Avoid solutions that require Cartesian products for matching more than a few thousand records. Contact Paul Dorfman if you need a reasonable solution to a large-scale matching problem.

A thread that touched indirectly on this topic wove its way through the list not too long back. The message means, in basic terms, that the SQL compiler cannot determine an index or sort order of the keys (LineID) that it can use to limit the rows being compared to resolve the where condition. Sometimes that means that none of the typical optimization strategies will work, and sometimes it means that the SAS SQL compiler cannot implement whatever strategies might work.

The inequality operator (<) in the WHERE clause is complicating the SQL plan for the query. I believe that this query represents an extreme case of a theta-join. I consider it an extreme case because the query should select about half of the Cartesian product of scenarios reflexed on itself! That means that if scenarios has 5,000 rows, you would expect the query to yield 12.5 million rows. The numbers increase geometrically with the number of rows. SAS SQL may as well in this case form the Cartesian product and throw out the ones that do not meet the WHERE condition.

Inexact matching of natural key values may require comparisons of every possible pairing of two sets of records. If the decision to select a match depends on the yield of a binary function with one argument being taken from each record, how else could a program determine which pairs of records to select? Binary comparison methods for deduplicating large sets of records tend to bog down quickly as the number of records increases.

The solution? No perfect solution exists for this class of problem, but reasonable approximations involving prescreening on indexed unary function values cost little in terms of sensitivity and allow control of specificity of matching. If anyone has to find a way to conduct fuzzy matching or deduplication of a large number of records, send an e-mail to Paul Dorfman. He has developed a very fast and accurate implementation of unary function value indexing, and he understands the in's and out's of fuzzy linkage.

Sig

-----Original Message----- From: Gerstle, John [mailto:yzg9@CDC.GOV] Sent: Thursday, October 31, 2002 11:54 AM To: SAS-L@LISTSERV.UGA.EDU Subject: SQL Question - Doing equijoins and getting this log message

Hallo, I'm looking for some more information and an understanding of this message in the log after running an sql program, matching each row of a table with itself using a WHERE statement that prevents a row from matching with itself or others before it in the order of the original rows (i.e. line1 > line2). It's a modified program from the fuzzy matching programs put up on Charles Patridge's website.

proc sql; create table matches as select t1.lineID as lineID1,t2.lineID as lineID2, t1.idno as idno1, t2.idno as idno2, from scenarios as t1, scenarios as t2 where t1.lineID < t2.LineID ; quit;

I've searched the SAS-L archives and SAS Tech Support, and found some interesting tidbits, but not any info on this specific log message.

Here is the message:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

Is this a worry? Can this be corrected, if so?

Thanks for any input and/or opinions.

John Gerstle Biostatistician CDC Information Technical Services (CITS) Contractor Support to NCHSTP Division of HIV/AIDS Prevention HIV/AIDS Incidence and Case Study Branch (HICSB) Phone: 404-639-3980 Fax: 404-639-2980 Cell: 770-639-5060 Email: jgerstle@cdc.gov <mailto:jgerstle@cdc.gov>


Back to: Top of message | Previous page | Main SAS-L page