| Date: | Fri, 30 Jan 2004 13:18:56 -0500 |
| Reply-To: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Subject: | Re: Proc Sql vs. Datastep |
|
| Content-Type: | text/plain |
|---|
Howard:
I suspect that SQL does find a SORTEDBY value that confirms the sort order
of the data, or the data are being read from sorted external files into a
SAS dataset. This example shows that the distinct counts occur within
groups:
data testSum;
input Mid totdaysofwk Mweek totspendofwk ;
cards;
1 2 3 4
1 6 7 8
2 5 4 3
;
run;
proc sql;
select Mid, sum(totdaysofwk)/count(distinct Mweek) as meandays,
sum(totspendofwk) as totspend
from testSum
group by Mid;
quit;
Since the DISTINCT counts occur within groups, the order of data rows should
not matter that much unless the groups span very large numbers of rows.
Sig
-----Original Message-----
From: Howard Schreier [mailto:Howard_Schreier@ITA.DOC.GOV]
Sent: Thursday, January 29, 2004 10:16 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Proc Sql vs. Datastep
Two theories about this particular case:
(1) The input is sorted, but that fact is not recorded in the dataset header
so SQL does not know about it and re-sorts.
(2) SQL does not exploit the sort sequence (or employ any other single-pass
strategy) in accumulating the DISTINCT counts.
On Thu, 29 Jan 2004 14:15:00 -0500, Sigurd Hermansen <HERMANS1@WESTAT.COM>
wrote:
>NetComm888:
>A couple of comparisons of fast SAS Data step programs relative to slow
>SAS SQL programs have appeared on the 'L recently. This third such
>comparison has at last provoked my usual response.
>
>If you are trying to mimic the SQL GROUP BY, I don't believe that your
>Data step will work properly unless you have data in a specific sort
>order. SAS SQL automatically sorts a dataset as required unless the
>compiler has an index or a SORTEDBY value that indicates required sort
>order. While a relatively simple SAS SQL query will take a bit longer
>to execute, sorting time probably accounts for much of the difference
>that you are observing.
>
>SAS SQL does not optimize the LEFT JOIN, transitive INNER JOIN's, and
>some JOIN's that, as Paul Dorfman has demonstrated repeatedly, could be
>accelerated using hash indexes. Those exceptions aside, the time,
>effort, and risk of specialized Data step development, in my
>experience, generally off-sets the slightly longer waiting times and
>slightly greater CPU usage required to execute SQL queries. Sig
>
>-----Original Message-----
>From: NetComm888 [mailto:netcomm888@YAHOO.COM]
>Sent: Thursday, January 29, 2004 1:23 PM
>To: SAS-L@LISTSERV.UGA.EDU
>Subject: Proc Sql vs. Datastep
>
>
>I did the following comparison on a very large dataset, which is sorted
>by MId and MWeek. The goal is to summarize the spend and visits by
>MId. Proc Sql is much neater in terms of coding, very easy to
>understand and follow. But the problem is sql runs much slower than the
>datastep. The datastep achieves the same goal by using the retain
>statement. The coding is kinda ugly, but runs super faster. Could
>anybody explain why to me? I didn't expect Proc Sql could be this slow,
>compared with the datastep.
>
>
>proc sql;
> create table test.Midstats_sql as
> select Mid, sum(totdaysofwk)/count(distinct Mweek) as meandays,
> sum(totspendofwk) as totspend
> from test.weekstats_o
> group by Mid;
>quit;
>
>
>
>**************** The datastep is used to achieve the same goal as the
>sql above ****************; data test.Midstats_dat;
> set test.weekstats_o end=eof;
> if _n_=1 then do;
> retainMid=Mid;
> totdays=0;
> retainMwk=Mweek;
> totwks=1;
> totspend=0;
>end;
>
>retain retainMid totdays retainMwk totwks totspend;
>
>if Mid=retainMid then do;
> totspend=sum(totspend,totspendofwk);
> totdays=sum(totdays,totdaysofwk);
> if Mweek ^=retainMwk then do;
> totwks=totwks+1;
> retainMwk=Mweek;
> end;
>end;
>
>else if Mid^=retainMid then do;
> output test.Midstats_dat;
> retainMid=Mid;
> totdays=totdaysofwk;
> totspend=totspendofwk;
> retainMwk=Mweek;
> totwks=1;
>
>end;
>
>if eof then do;
> output test.Midstats_dat;;
>end;
>
>run;
|