| Date: | Mon, 23 Jun 2003 14:32:09 -0400 |
| Reply-To: | sashole@bellsouth.net |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | "Paul M. Dorfman" <sashole@BELLSOUTH.NET> |
| Organization: | Sashole of Florida |
| Subject: | Re: Proc Means and Sorting |
| In-Reply-To: | <sef6e7c7.006@firsthealth.com> |
| Content-Type: | text/plain; charset="us-ascii" |
> -----Original Message-----
> From: Jack Hamilton [mailto:JackHamilton@firsthealth.com]
>
> PROC MEANS is one of the procedures which can use
> multithreading in version 9. The thread sorting BY-group X
> might finish before the thread sorting BY-group A, so your
> paragraph 1) below doesn't apply.
Jack,
Hmmm... If a BY is being used, then MEANS accepts the data either
already sorted or grouped (with the NOTSORTED option), so there is no
need to sort them. Now it can use the THREADS option to act according to
the SAS log
NOTE: Multiple concurrent threads will be used to summarize data.
But I find it rather incredible if SAS fails to interleave the threads
being summarized into the original order. As has already been written, there
is no in-your-face evidence of this in the docs, but let us look at it
this way. Suppose we have an unordered input and use the CLASS
statement. In this case, MEANS actually does sort the input [implicitly,
by populating its internal AVL tree(s), whence the nodes are returned in
key order]. And we know it is guaranteed that in such a case, the
aggregated output will be physically ordered by the CLASS variables, and
of course it will happen regardless of whether THREADS or NOTHREADS is
used, otherwise the procedure would deliver inconsistent results.
Further, we know that BY and CLASS will produce the same key-order
output if the input is sorted beforehand. I surely would expect it to be
the case irrespective of whether I used THREADS to improve the
performance or not!
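The argument above can be sketched as a small check. This is only an illustration under stated assumptions: the dataset names (unordered, sorted, class_thr, class_nothr, by_thr) and the variables key and x are hypothetical, and with an input this small SAS may well decide not to use more than one thread even when THREADS is specified.

```sas
/* Hypothetical unordered input: 50,000 obs, up to 100 distinct keys */
data unordered;
  do i = 1 to 50000;
    key = ceil(100 * ranuni(1));   /* key arrives in random order */
    x   = ranuni(1);
    output;
  end;
  drop i;
run;

/* Sorted copy for the BY case */
proc sort data=unordered out=sorted;
  by key;
run;

/* CLASS on unordered input, with and without threading */
proc means data=unordered noprint nway threads;
  class key;
  var x;
  output out=class_thr sum=sum_x;
run;

proc means data=unordered noprint nway nothreads;
  class key;
  var x;
  output out=class_nothr sum=sum_x;
run;

/* BY on sorted input, threaded */
proc means data=sorted noprint threads;
  by key;
  var x;
  output out=by_thr sum=sum_x;
run;

/* A DATA step BY statement stops with an error if the keys
   are not in ascending order, so these steps act as order checks */
data _null_; set class_thr;   by key; run;
data _null_; set class_nothr; by key; run;
data _null_; set by_thr;      by key; run;

/* Compare only the key sequence of the CLASS and BY outputs */
proc compare base=class_nothr compare=by_thr;
  var key;
run;
```

The comparison is restricted to the key variable on purpose: the point at issue is the order of the output records, not the values of the sums.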
On the practical side, I have just run MEANS against a sizeable test
input (~ 10m obs divided into ~10k groups by a distinct key to make the
use of the multiple threads - in my case two - more pronounced) a number
of times testing for all the cases mentioned above, with ordered,
unordered, and grouped input, with CLASS and BY (including NOTSORTED),
with THREADS and NOTHREADS. Sparing the list from wading through my logs,
let me just distill the results into the statement that the output has
always come in the expected order, that is:
1) If the input is sorted and BY is used, or CLASS is used (regardless
of the input order), the aggregated output is always in the BY (CLASS)
variables order.
2) If the input is grouped and BY is used with NOTSORTED, the input key
order is strictly maintained.
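
Case 2) can be checked along the following lines. This is a minimal sketch and not the test actually run: the dataset names (grouped, ns_out, keys_in) and the variables key and x are made up for illustration, and the input is far smaller than the ~10m observations described above, so threading may not really kick in.

```sas
/* Hypothetical grouped-but-unsorted input: key blocks arrive as 3, 1, 2 */
data grouped;
  do key = 3, 1, 2;
    do i = 1 to 1000;
      x = ranuni(1);
      output;
    end;
  end;
  drop i;
run;

/* Summarize each block in arrival order */
proc means data=grouped noprint threads;
  by key notsorted;
  var x;
  output out=ns_out sum=sum_x;
run;

/* Reference: first occurrence of each key, in input order */
data keys_in;
  set grouped;
  by key notsorted;
  if first.key;
  keep key;
run;

/* The key sequence of the output should match the input sequence */
proc compare base=keys_in compare=ns_out;
  var key;
run;
```

If the input key order is maintained, PROC COMPARE should report no unequal values for key.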
Kind regards,
-------------------
Paul M. Dorfman
Jacksonville, FL
-------------------
>
> I'd also be quite surprised if PROC MEANS stops working in
> the expected manner, but I don't see a guarantee in the
> documentation that it won't.
> Maybe I'm just overlooking something obvious, as this seems
> to be one of the fundamental characteristics of SAS processing.
>
>
>
>
> --
> JackHamilton@FirstHealth.com
> Manager, Technical Development
> Metrics Department, First Health
> West Sacramento, California USA
>
> >>> paul_dorfman@HOTMAIL.COM 06/23/2003 10:26 AM >>>
> Matt,
>
> I do not think that what you require is stated in the
> documentation as a separate, explicit paragraph, but I also
> think that it is not necessary. The documentation, in part,
> does say that
>
> "Comparison of the BY and CLASS Statements
> Using the BY statement is similar to using the CLASS
> statement and the NWAY option in that PROC MEANS summarizes
> each BY group as an independent subset of the input data...
> However, unlike the CLASS statement, the BY statement
> requires that you previously sort BY variables."
>
> From which it follows that:
>
> 1) With BY, input is processed one BY-group at a time. I
> cannot think of any conceivable reason why any two BY-groups
> should be processed out of input order. (Somewhat
> counter-analogically to Proc SORT NOEQUALS, where not
> maintaining the relative order of the records within the same
> BY-group may be used to improve performance).
>
> 2) The doc's statement "BY statement requires that you
> previously sort BY variables" is only accurate if the
> NOTSORTED option is not used. Otherwise, the only actual
> requirement is that the BY-variables be *grouped*, in which
> case it goes without saying that the input order of the
> BY-variables will be maintained in the output.
>
> As I've never observed any deviations from this [expected]
> behavior, I would be quite surprised to see any evidence to
> the contrary.
>
> Kind regards,
> ---------------------------
> Paul M. Dorfman
> Jacksonville, FL
> ---------------------------
>
>
>
>
>
> >From: m n <iced_phoenix@YAHOO.COM>
> >Reply-To: m n <iced_phoenix@YAHOO.COM>
> >
> >Dear c.s.sas,
> >
> >Does SAS documentation (V8) make any guarantee that an output dataset
> >from proc means will maintain the same sort order as the original
> >dataset? In other words, if I give proc means a dataset sorted by
> >x1, x2, x3 and set the by group to x1, x2, x3, am I guaranteed that
> >the output set will remain sorted (though summarized)?
> >
> >Code Example:
> >
> > proc sort data=test;
> > by x1 x2 x3;
> > run;
> >
> > proc means data=test sum;
> > by x1 x2 x3;
> > var x4;
> > output out=test2 sum=;
> > run;
> >
> > /* Must I sort test2 by x1, x2, x3 here to guarantee a sorted dataset? */
> >
> >I would greatly appreciate a quote from SAS documentation that answers
> >this question. Thank you all for your help.
> >
> >Matt
>