|
Summary: The technique used is already much faster than a "pure" SQL
technique. An index can make it even faster. An other possibility is to use
the undocumented MONOTONIC function. SASFILE is probably not necessary.
More below.
On Thu, 29 Mar 2007 17:58:05 -0800, toby989@HOTPOP.COM wrote:
>Is there a better (faster) way to select the first 10 employee records from
each
>of the 100 hospitals?
>
>Thanks Toby
>
>
>proc sql;
>
>create table dd as select hospid as hid, empid, ppi from psdata.ps2004ms where
>empid in (select empid from psdata.ps2004ms(obs=10) where hospid=hid) group by
>hospid;
>
>quit;
Here are my experiments.
Test data:
1 data ps2004ms;
2 do hospid = 1 to 100;
3 do _n_ = 1 to 200;
4 empid + 1 + _n_/97531;
5 retain ppi 99;
6 output;
7 end;
8 end;
9 run;
NOTE: The data set WORK.PS2004MS has 20000 observations and 3 variables.
Toby's approach, using the SAS OBS= data set option (not part of standard SQL):
10
11 proc sql;
12 create table dd as
13 select hospid as hid, empid, ppi
14 from ps2004ms
15 where empid in (select empid
16 from ps2004ms(obs=10)
17 where hospid=hid);
NOTE: Table WORK.DD created, with 1000 rows and 3 columns.
18 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.73 seconds
cpu time 0.34 seconds
A more pure SQL approach; very slow:
19
20 proc sql;
21 create table dd2 as
22 select distinct ps2004ms.*
23 from ps2004ms join ps2004ms as x
24 on ps2004ms.hospid = x.hospid
25 and ps2004ms.empid >= x.empid
26 group by ps2004ms.hospid, ps2004ms.empid
27 having count(*) <= 10;
NOTE: The query requires remerging summary statistics back with the original
data.
NOTE: Table WORK.DD2 created, with 1000 rows and 3 columns.
28 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 22.29 seconds
cpu time 7.49 seconds
Another approach, using MONOTONIC; appears to be a bit faster than Toby's way:
29
30 proc sql;
31 create table dd3 as
32 select ps2004ms.*
33 from ps2004ms
34 group by hospid
35 having monotonic() < min(monotonic() ) + 10;
NOTE: The query requires remerging summary statistics back with the original
data.
NOTE: Table WORK.DD3 created, with 1000 rows and 3 columns.
36 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.58 seconds
cpu time 0.11 seconds
A bigger test table:
37
38 data ps2004ms;
39 do hospid = 1 to 300;
40 do _n_ = 1 to 600;
41 empid + 1 + _n_/97531;
42 retain ppi 99;
43 output;
44 end;
45 end;
46 run;
NOTE: The data set WORK.PS2004MS has 180000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.09 seconds
cpu time 0.10 seconds
We will also now look for 30 rows from each hospital instead of 10. Toby's way:
47
48 proc sql;
49 create table dd as
50 select hospid as hid, empid, ppi
51 from ps2004ms
52 where empid in (select empid
53 from ps2004ms(obs=30)
54 where hospid=hid);
NOTE: Table WORK.DD created, with 9000 rows and 3 columns.
55 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 5.57 seconds
cpu time 5.49 seconds
Notice how close the two times are. Logically, the subquery has to open and
read the table 180 thousand times (the optimizer is probably not smart
enough to bypass the subquery once it has 30 rows from a HOSPID group). Yet
the real and cpu times are close, indicating little delay for disk reads. I
conclude that a memory-resident copy of the table is made. Thus there is
nothing to be gained by employing SASFILE.
Build an index:
56
57 proc datasets;
Directory
Libref WORK
Engine V9
Physical Name C:\DOCUME~1\HLS\LOCALS~1\Temp\SAS Temporary
Files\_TD5408
File Name C:\DOCUME~1\HLS\LOCALS~1\Temp\SAS Temporary
Files\_TD5408
Member File
# Name Type Size Last Modified
1 DD DATA 222208 01Apr07:17:21:23
2 DD2 DATA 29696 01Apr07:17:21:16
3 DD3 DATA 29696 01Apr07:17:21:17
4 PS2004MS DATA 4391936 01Apr07:17:21:17
58 modify ps2004ms;
59 index create hospid;
NOTE: Simple index hospid has been defined.
60 run;
NOTE: MODIFY was successful for WORK.PS2004MS.DATA.
61 quit;
NOTE: PROCEDURE DATASETS used (Total process time):
real time 0.72 seconds
cpu time 0.34 seconds
Toby's code again:
62
63 proc sql;
64 create table dd as
65 select hospid as hid, empid, ppi
66 from ps2004ms
67 where empid in (select empid
68 from ps2004ms(obs=30)
69 where hospid=hid);
NOTE: Table WORK.DD created, with 9000 rows and 3 columns.
70 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.72 seconds
cpu time 0.71 seconds
The index speeds things up dramatically!
|