Date: Thu, 31 Oct 2002 15:36:13 -0500 Ian Whitlock "SAS(r) Discussion" Ian Whitlock Re: SQL Question - Doing equijoins and getting this log message To: "Gerstle, John" text/plain; charset="iso-8859-1"

John,

Sometimes there is no way to execute an SQL join without comparing every record in the first set with every record in the second. There is nothing wrong with this, but it may take a long time to execute. The message in translation says, "None of my short cut tricks are going to work on this problem the way you phrased it, I am using brute force to please you." If you are satisfied with the performance, who cares? If you are not, then the message tells you to modify the code so that it could apply a short cut, or get another problem. Sometimes problems can only be solved with brute force and you need to get a bigger and faster computer, or give up. (Say that for the best formulation of the problem, the estimated CPU time is 1 century to finish the task, then give up or at least wait a few years.)

An equi-join says that the WHERE condition can be structured as a request that two entities be equal with possibly other conditions joined to the equal condition by an AND. For example,

where x = y equi-join where x = y and z > 5 equi-join where x = y and ( any kind of complexity you wish ) equi-join where x = y or z > 5 not an equi-join

In the last case the restriction to Z > 5 might be so strong that you could use a UNION in which the first part used X = Y equi-join and the second part uses brute force but the remaining amount of data is so small that it doesn't matter. Or possibly you might have extra information, say when Z > 5 then X = Y + 9. Again you could improve the performance with a UNION of two equi-joins.

The significance of an equi-join is that at worst you can sort and merge. This means there is a way to know when to stop comparing records, hence it is usually more efficient that brute force where all pairs must be compared. This does not necessarily mean faster execution, but it often does and the computer doesn't see the necessity for resorting to brute force. For example, you have

where x < y

So you create a new variable Z always 0 and then use the condition

where x < y and z = 0

Now you have an equi-join, but the new WHERE is not one bit better than the previous one, because the equi-part adds no new information to the problem.

Howard Schreier is the master of rephrasing problems to introduce equi-joins. My all time favorite occurred around 1993 when someone on SAS-L posed the question of finding records with near values of X, say

where abs ( a.x - b.x ) <= 1

(I think the value was 2.3, but let's work with 1 anyway.) Howard saw that the above implies one of the following is always true

round ( a.x , 1 ) = round ( b.x , 1 ) round ( a.x , 1 ) = round ( b.x , 1 ) + 1 round ( a.x , 1 ) = round ( b.x , 1 ) - 1

So he introduced a third data set, C, with 3 records having variable R in (0, 1 , -1). The condition then became

where abs ( a.x - b.x ) < 1 and round ( a.x , 1 ) = round ( b.x , 1 ) + c.r

This speeded up the performance significantly.

IanWhitlock@westat.com

-----Original Message----- From: Gerstle, John [mailto:yzg9@CDC.GOV] Sent: Thursday, October 31, 2002 11:54 AM To: SAS-L@LISTSERV.UGA.EDU Subject: SQL Question - Doing equijoins and getting this log message

Hallo, I'm looking for some more information and an understanding of this message in the log after running an sql program, matching each row of a table with itself using a WHERE statement that prevents a row from matching with itself or others before it in the order of the original rows (i.e. line1 > line2). It's a modified program from the fuzzy matching programs put up on Charles Patridge's website.

proc sql; create table matches as select t1.lineID as lineID1,t2.lineID as lineID2, t1.idno as idno1, t2.idno as idno2, from scenarios as t1, scenarios as t2 where t1.lineID < t2.LineID ; quit;

I've searched the SAS-L archives and SAS Tech Support, and found some interesting tidbits, but not any info on this specific log message.

Here is the message:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

Is this a worry? Can this be corrected, if so?

Thanks for any input and/or opinions.

John Gerstle Biostatistician CDC Information Technical Services (CITS) Contractor Support to NCHSTP Division of HIV/AIDS Prevention HIV/AIDS Incidence and Case Study Branch (HICSB) Phone: 404-639-3980 Fax: 404-639-2980 Cell: 770-639-5060 Email: jgerstle@cdc.gov <mailto:jgerstle@cdc.gov>

Back to: Top of message | Previous page | Main SAS-L page