|
This is actually kind of an interesting performance tuning problem. I've
done enough to convince myself that my earlier theory holds.
First, create a test table with the essential characteristics of Amit's:
%let nrows=1e5;
data demo;
do id = 1 to &nrows;
if ranuni(123)>0.5 then address = round(ranuni(123)*&nrows);
else address = . ;
if ranuni(123)>0.5 then phone = round(ranuni(123)*&nrows);
else phone = . ;
output;
end;
run;
Try it first with a smaller number for &NROWS, like 100. Make the number
larger when it's time to benchmark performance.
Create simple indexes on ADDRESS and PHONE:
proc datasets;
modify demo;
index create address phone;
quit;
Now run a query which, like Amit's, includes equi-join conditions linked
with an OR:
proc sql;
create table matched as
select a.*, b.id as match
from demo as a, demo as b
where (a.address=b.address and a.address is not null
or
a.phone =b.phone and a.phone is not null)
and a.id > b.id
order by id, match
;
quit;
The telltale note ("The execution of this query involves performing one or
more Cartesian product joins that can not be optimized.") appears. So we
know the indexes were not exploited. With 100K rows in the source table,
this took >28 minutes on my machine. The CPU time was only 0.4 seconds
less than the elapsed time.
Now rework the query into two, each with a single equi-join condition, and
UNION the results:
proc sql;
create table matched_fast as
select a.*, b.id as match
from demo as a, demo as b
where a.address=b.address and a.address is not null
and a.id > b.id
union
select a.*, b.id as match
from demo as a, demo as b
where a.phone =b.phone and a.phone is not null
and a.id > b.id
order by id, match
;
quit;
Time: 1.7 seconds, or 99.9% less than the original (even after including
the half second it took to create the indexes).
This all demonstrates that the optimizer has its limitations. Sometimes a
little hand tuning (often adding to the verbosity of the code) can bring
significant performance benefits.
A PROC COMPARE confirms results are identical:
proc compare data=matched compare=matched_fast; run;
So Amit should be able to use this as a model and get his results in a
reasonable time. Of course that does nothing to assure that the underlying
problem will be solved.
On Sat, 3 Sep 2005 22:00:10 -0400, Howard Schreier <hs AT dc-sug DOT org>
<nospam@HOWLES.COM> wrote:
>You are undoubtedly seeing a NOTE indicating that the query cannot be
>optimized. That means PROC SQL is forming the full cartesian product. With
>0.9e6 rows in the input, that means 0.81e12, or 810 billion, rows.
>So "lots of time" would be an understatement.
>
>In theory, indices on the three columns used in the OR-linked equality
>conditions (contact_name, company_address, company_address) might allow
>some optimization. However, I suspect that it would not happen that
>easily. You might be able to force it by reworking the query into three
>simple joins, each applying one of those equality conditions, then
>UNIONing the results, then applying the other conditions.
>
>Other have expressed some doubt about whether your query would be
>effective in achieving its apparent purpose, so you probably should work
>through those issues before attempting any performance tuning on the
>existing code.
>
>>
>>(a.company_phone)
>
>Fri, 2 Sep 2005 06:09:22 -0400, Amit Das <AD as@INDUCTIS.COM> wrote:
>
>>Hi,
>>
>>
>>
>>I am running the following SQL Query, but running this query on 0.9
>>Million is taking lots of time. I am trying to use indexing on data set
>>(data set name non_d1), but wondering at the same time about indexing
>>on which variable will fetch me the optimum result.
>>
>>
>>
>>proc sql;
>>
>>create table step_9_d_1 as
>>
>>select a.*,b.company_id as original_record,b.cox_COMPANYASSETS as
>>asset2,
>>
>>b.cox_LASTMODIFIED as lmdate2,b.company_name as company_name2 from
>>
>>non_d1 as a, non_d1 as b
>>
>>where (a.company_id ne b.company_id) and
>>
>>(
>>
>>(a.contact_name=b.contact_name and a.contact_name ne "") or
>>
>>(a.company_address=b.company_address and a.company_address ne "") or
>>
>>(a.company_phone=b.company_phone and a.company_phone ne "")
>>
>>)
>>
>>and (index(a.company_name,trim(b.company_name))^=0 or
>>index(b.company_name,trim(a.company_name))^=0)
>>
>>and (a.cox_Companyassets <= b.cox_Companyassets);
>>
>>quit;
>>
>>
>>
>>company_id is unique.
>>
>>Company_name is also unique.
>>
>>But in two companies with name "GB Wrox" and "Wrox" if either of address
>>or phone or contact name is same then the one with lower asset needs to
>>written into dataset named: step_9_d_1.
>>
>>
>>
>>Please suggest.
>>
>>
>>
>>And as always... thanks a ton for helping this rookie out!
>>
>>
>>
>>
>>
>>Amit
|