|
Fellow SAS-L-ers,
The sort/do-not-sort before a PROC SUMMARY discussion has been very
interesting. From an MVS perspective, there are good reasons to support
not sorting the data before summarization. However, there are also very
good reasons to support sorting the data before a summarization. Thus,
both "sides" of this discussion have, in my opinion, VERY valid points.
In regards to NOT SORTING the data first:
----------------------------------------
The complexity of a summarization is a function of the number of CLASS
variables and of the number of discrete values of those variables. As
the number of "Class variables" * "discrete values" combinations rises,
more computer memory is needed to hold the intermediate results
(subgroups) built during the summarization. Because of the matrix that
the SAS System must build to keep track of the combinations, it is easy
to think of PROC SUMMARY as a "Vertical Summarization" within computer
memory. (See SUGI 22 paper reference at end of posting). That is, it
builds a vertical table in memory of the running "Class variable" *
"discrete values" combinations and statistics.
For Instance, the following PROC SUMMARY:
PROC SUMMARY DATA=CDSALES;
CLASS STATE CITY CDID;
VAR SALESAMT;
OUTPUT OUT=SUMSALES SUM=;
RUN;
... builds eight major TYPEs (subgroups) in memory:
TYPE DESCRIPTION
---- -----------
0 total salesamt of all observations
1 total salesamt for each discrete cdid
2 total salesamt for each discrete city
3 total salesamt for each discrete city/cdid combination
4 total salesamt for each discrete state
5 total salesamt for each discrete state/cdid combination
6 total salesamt for each discrete state/city combination
7 total salesamt for each discrete state/city/cdid combn.
The major TYPE's (subgroups) will contain many "sub-subgroups", which
carry the requested statistics for the particular "discrete value" of
the CLASS field. For example TYPE 4, based on the discrete values of
variable STATE might conceptually look like this:
TYPE CLASS VARIABLE VALUE SALESAMT VALUE
---- -------------------- --------------
4.1 Alabama 123
4.2 Alaska 456
4.3 Arizona 789
4.4 Arkansas 101
... ... ...
The number of TYPE (subgroup) combinations is equal to 2**X, where X =
the number of CLASS variables. The number of "sub-subgroup"
combinations is totally data dependent.
A very large number of "sub-subgroup" combinations may cause your job
to abend with a memory abend or an out-of-work space abend. If you
have worked with SAS in the MVS environment for a while, you are
probably familiar with this scenario:-( I seem to experience the above
problem when I have HUGE SAS data sets to summarize with about five or
more CLASS variables that have a lot of, um, ah, "variability" in their
values.
In regards to Sorting the data first:
------------------------------------
When you DO sort first and use the BY statement, instead of the CLASS
statement the BY statement alerts the Summary Procedure that the data is
already sorted in order of the BY variables. PROC SUMMARY can ("relax" and)
not have to execute the extra code needed to build and keep track of
subgroups, as it summarizes. In this case, the Summary Procedure only needs
to load each BY group into memory as it is encountered--separately--and do the
individual summarization on that BY group. This is a long way away from the
large Vertical matrix needed, in memory, with the CLASS variable. And, so it
saves on computer memory during the Summarization.
However, the gain in the Summary Procedure comes at the cost of the EXCP's,
CPU time, and Memory used during the Sort Procedure. This may be
unnecessary overhead for most summarizations that have only a few CLASS
variables and little variability in the values of those CLASS variables.
Most organizations that implement Computer Chargeback systems charge for CPU
time and EXCP's--conspicuously omitting memory, which is VERY important. In
such organizations, sorting before summarizing would increase the cost of
the SAS program. This would make the sort-summarize scenario a lot less
attractive for everyday use.
So, what to do?:
---------------
I normally SORT first ONLY when I must summarize a VERY large SAS data set
with many (5 or more) of CLASS variables that I suspect will have a lot of
value variability. Often, this is not a conscious decision, but one borne
on the wings of a PROC SUMMARY that has ABENDed because of a memory
problem. In most cases, the BY statement gets me... by.
For my everyday Summaries, I just use the CLASS statement and let PROC
SUMMARY use the memory of my large mainframe computer(s). My current
environment is memory rich, and I enjoy the (relatively) reduced CPU time
and EXCP count of this approach.
When I have impossibly large SAS data sets to sort, I use a technique that I
developed, called "Horizontal Summarization". This is documented in the
SUGI 22 paper: "Summarizing Impossibly Large SAS Data Sets for the Data
Warehouse Server Using Horizontal Summarization". It can be found in the
SUGI 22 proceedings, now available either in printed form or on CD-ROM.
I hope that this suggestion proves helpful now, and in the future!
Of course, all of these opinions and insights are my own, and do not reflect
those of my organization or my associates.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Michael A. Raithel
E-mail: maraithel@mcimail.com
Author: Tuning SAS Applications in the MVS Environment
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
...There's a feeling I get when I look to the West, and my spirit is
crying for leaving...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
|