LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (April 2007, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Sun, 1 Apr 2007 17:58:52 -0400
Reply-To:   "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject:   Re: sql optimization

Summary: The technique used is already much faster than a "pure" SQL technique. An index can make it even faster. An other possibility is to use the undocumented MONOTONIC function. SASFILE is probably not necessary.

More below.

On Thu, 29 Mar 2007 17:58:05 -0800, toby989@HOTPOP.COM wrote:

>Is there a better (faster) way to select the first 10 employee records from each >of the 100 hospitals? > >Thanks Toby > > >proc sql; > >create table dd as select hospid as hid, empid, ppi from psdata.ps2004ms where >empid in (select empid from psdata.ps2004ms(obs=10) where hospid=hid) group by >hospid; > >quit;

Here are my experiments.

Test data:

1 data ps2004ms; 2 do hospid = 1 to 100; 3 do _n_ = 1 to 200; 4 empid + 1 + _n_/97531; 5 retain ppi 99; 6 output; 7 end; 8 end; 9 run;

NOTE: The data set WORK.PS2004MS has 20000 observations and 3 variables.

Toby's approach, using the SAS OBS= data set option (not part of standard SQL):

10 11 proc sql; 12 create table dd as 13 select hospid as hid, empid, ppi 14 from ps2004ms 15 where empid in (select empid 16 from ps2004ms(obs=10) 17 where hospid=hid); NOTE: Table WORK.DD created, with 1000 rows and 3 columns.

18 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.73 seconds cpu time 0.34 seconds

A more pure SQL approach; very slow:

19 20 proc sql; 21 create table dd2 as 22 select distinct ps2004ms.* 23 from ps2004ms join ps2004ms as x 24 on ps2004ms.hospid = x.hospid 25 and ps2004ms.empid >= x.empid 26 group by ps2004ms.hospid, ps2004ms.empid 27 having count(*) <= 10; NOTE: The query requires remerging summary statistics back with the original data. NOTE: Table WORK.DD2 created, with 1000 rows and 3 columns.

28 quit; NOTE: PROCEDURE SQL used (Total process time): real time 22.29 seconds cpu time 7.49 seconds

Another approach, using MONOTONIC; appears to be a bit faster than Toby's way:

29 30 proc sql; 31 create table dd3 as 32 select ps2004ms.* 33 from ps2004ms 34 group by hospid 35 having monotonic() < min(monotonic() ) + 10; NOTE: The query requires remerging summary statistics back with the original data. NOTE: Table WORK.DD3 created, with 1000 rows and 3 columns.

36 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.58 seconds cpu time 0.11 seconds

A bigger test table:

37 38 data ps2004ms; 39 do hospid = 1 to 300; 40 do _n_ = 1 to 600; 41 empid + 1 + _n_/97531; 42 retain ppi 99; 43 output; 44 end; 45 end; 46 run;

NOTE: The data set WORK.PS2004MS has 180000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.09 seconds cpu time 0.10 seconds

We will also now look for 30 rows from each hospital instead of 10. Toby's way:

47 48 proc sql; 49 create table dd as 50 select hospid as hid, empid, ppi 51 from ps2004ms 52 where empid in (select empid 53 from ps2004ms(obs=30) 54 where hospid=hid); NOTE: Table WORK.DD created, with 9000 rows and 3 columns.

55 quit; NOTE: PROCEDURE SQL used (Total process time): real time 5.57 seconds cpu time 5.49 seconds

Notice how close the two times are. Logically, the subquery has to open and read the table 180 thousand times (the optimizer is probably not smart enough to bypass the subquery once it has 30 rows from a HOSPID group). Yet the real and cpu times are close, indicating little delay for disk reads. I conclude that a memory-resident copy of the table is made. Thus there is nothing to be gained by employing SASFILE.

Build an index:

56 57 proc datasets; Directory

Libref WORK Engine V9 Physical Name C:\DOCUME~1\HLS\LOCALS~1\Temp\SAS Temporary Files\_TD5408 File Name C:\DOCUME~1\HLS\LOCALS~1\Temp\SAS Temporary Files\_TD5408

Member File # Name Type Size Last Modified

1 DD DATA 222208 01Apr07:17:21:23 2 DD2 DATA 29696 01Apr07:17:21:16 3 DD3 DATA 29696 01Apr07:17:21:17 4 PS2004MS DATA 4391936 01Apr07:17:21:17 58 modify ps2004ms; 59 index create hospid; NOTE: Simple index hospid has been defined. 60 run;

NOTE: MODIFY was successful for WORK.PS2004MS.DATA. 61 quit;

NOTE: PROCEDURE DATASETS used (Total process time): real time 0.72 seconds cpu time 0.34 seconds

Toby's code again:

62 63 proc sql; 64 create table dd as 65 select hospid as hid, empid, ppi 66 from ps2004ms 67 where empid in (select empid 68 from ps2004ms(obs=30) 69 where hospid=hid); NOTE: Table WORK.DD created, with 9000 rows and 3 columns.

70 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.72 seconds cpu time 0.71 seconds

The index speeds things up dramatically!


Back to: Top of message | Previous page | Main SAS-L page