|
Further investigation shows that creating the indexes was irrelevant
because they were not used. Instead, PROC SQL used hashing to perform the
joins. So all of the gain is attributable to rewriting the code to make it
optimizable.
That is the case for a 100K-row source table. It is possible that at a
larger scale there might not be enough memory available for the hash. In
that situation indexes might be exploited.
On Mon, 5 Sep 2005 22:31:10 -0400, Howard Schreier <hs AT dc-sug DOT org>
<nospam@HOWLES.COM> wrote:
>This is actually kind of an interesting performance tuning problem. I've
>done enough to convince myself that my earlier theory holds.
>
>First, create a test table with the essential characteristics of Amit's:
>
> %let nrows=1e5;
>
> data demo;
> do id = 1 to &nrows;
> if ranuni(123)>0.5 then address = round(ranuni(123)*&nrows);
> else address = . ;
> if ranuni(123)>0.5 then phone = round(ranuni(123)*&nrows);
> else phone = . ;
> output;
> end;
> run;
>
>Try it first with a smaller number for &NROWS, like 100. Make the number
>larger when it's time to benchmark performance.
>
>Create simple indexes on ADDRESS and PHONE:
>
> proc datasets;
> modify demo;
> index create address phone;
> quit;
>
>Now run a query which, like Amit's, includes equi-join conditions linked
>with an OR:
>
> proc sql;
> create table matched as
> select a.*, b.id as match
> from demo as a, demo as b
> where (a.address=b.address and a.address is not null
> or
> a.phone =b.phone and a.phone is not null)
> and a.id > b.id
> order by id, match
> ;
> quit;
>
>The telltale note ("The execution of this query involves performing one or
>more Cartesian product joins that can not be optimized.") appears. So we
>know the indexes were not exploited. With 100K rows in the source table,
>this took >28 minutes on my machine. The CPU time was only 0.4 seconds
>less than the elapsed time.
>
>Now rework the query into two, each with a single equi-join condition, and
>UNION the results:
>
> proc sql;
> create table matched_fast as
> select a.*, b.id as match
> from demo as a, demo as b
> where a.address=b.address and a.address is not null
> and a.id > b.id
> union
> select a.*, b.id as match
> from demo as a, demo as b
> where a.phone =b.phone and a.phone is not null
> and a.id > b.id
> order by id, match
> ;
> quit;
>
>Time: 1.7 seconds, or 99.9% less than the original (even after including
>the half second it took to create the indexes).
>
>This all demonstrates that the optimizer has its limitations. Sometimes a
>little hand tuning (often adding to the verbosity of the code) can bring
>significant performance benefits.
>
>A PROC COMPARE confirms results are identical:
>
> proc compare data=matched compare=matched_fast; run;
>
>So Amit should be able to use this as a model and get his results in a
>reasonable time. Of course that does nothing to assure that the underlying
>problem will be solved.
>
[snip]
|