LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (April 1997, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 16 Apr 1997 15:59:44 -0400
Reply-To:   MICHAEL.RAITHEL@RAITHM49.CUSTOMS.SPRINT.COM
Sender:   "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:   "Michael A. Raithel" <MICHAEL.RAITHEL@RAITHM49.CUSTOMS.SPRINT.COM>
Subject:   Re: (MVS) Sort Before Summarization

Fellow SAS-L-ers,

The sort/do-not-sort before a PROC SUMMARY discussion has been very interesting. From an MVS perspective, there are good reasons to support not sorting the data before summarization. However, there are also very good reasons to support sorting the data before a summarization. Thus, both "sides" of this discussion have, in my opinion, VERY valid points.

In regards to NOT SORTING the data first: ----------------------------------------

The complexity of a summarization is a function of the number of CLASS variables and of the number of discrete values of those variables. As the number of "Class variables" * "discrete values" combinations rises, more computer memory is needed to hold the intermediate results (subgroups) built during the summarization. Because of the matrix that the SAS System must build to keep track of the combinations, it is easy to think of PROC SUMMARY as a "Vertical Summarization" within computer memory. (See SUGI 22 paper reference at end of posting). That is, it builds a vertical table in memory of the running "Class variable" * "discrete values" combinations and statistics.

For Instance, the following PROC SUMMARY:

PROC SUMMARY DATA=CDSALES; CLASS STATE CITY CDID; VAR SALESAMT; OUTPUT OUT=SUMSALES SUM=; RUN;

... builds eight major TYPEs (subgroups) in memory:

TYPE DESCRIPTION ---- ----------- 0 total salesamt of all observations 1 total salesamt for each discrete cdid 2 total salesamt for each discrete city 3 total salesamt for each discrete city/cdid combination 4 total salesamt for each discrete state 5 total salesamt for each discrete state/cdid combination 6 total salesamt for each discrete state/city combination 7 total salesamt for each discrete state/city/cdid combn.

The major TYPE's (subgroups) will contain many "sub-subgroups", which carry the requested statistics for the particular "discrete value" of the CLASS field. For example TYPE 4, based on the discrete values of variable STATE might conceptually look like this:

TYPE CLASS VARIABLE VALUE SALESAMT VALUE ---- -------------------- -------------- 4.1 Alabama 123 4.2 Alaska 456 4.3 Arizona 789 4.4 Arkansas 101 ... ... ...

The number of TYPE (subgroup) combinations is equal to 2**X, where X = the number of CLASS variables. The number of "sub-subgroup" combinations is totally data dependent.

A very large number of "sub-subgroup" combinations may cause your job to abend with a memory abend or an out-of-work space abend. If you have worked with SAS in the MVS environment for a while, you are probably familiar with this scenario:-( I seem to experience the above problem when I have HUGE SAS data sets to summarize with about five or more CLASS variables that have a lot of, um, ah, "variability" in their values.

In regards to Sorting the data first: ------------------------------------

When you DO sort first and use the BY statement, instead of the CLASS statement the BY statement alerts the Summary Procedure that the data is already sorted in order of the BY variables. PROC SUMMARY can ("relax" and) not have to execute the extra code needed to build and keep track of subgroups, as it summarizes. In this case, the Summary Procedure only needs to load each BY group into memory as it is encountered--separately--and do the individual summarization on that BY group. This is a long way away from the large Vertical matrix needed, in memory, with the CLASS variable. And, so it saves on computer memory during the Summarization.

However, the gain in the Summary Procedure comes at the cost of the EXCP's, CPU time, and Memory used during the Sort Procedure. This may be unnecessary overhead for most summarizations that have only a few CLASS variables and little variability in the values of those CLASS variables.

Most organizations that implement Computer Chargeback systems charge for CPU time and EXCP's--conspicuously omitting memory, which is VERY important. In such organizations, sorting before summarizing would increase the cost of the SAS program. This would make the sort-summarize scenario a lot less attractive for everyday use.

So, what to do?: ---------------

I normally SORT first ONLY when I must summarize a VERY large SAS data set with many (5 or more) of CLASS variables that I suspect will have a lot of value variability. Often, this is not a conscious decision, but one borne on the wings of a PROC SUMMARY that has ABENDed because of a memory problem. In most cases, the BY statement gets me... by.

For my everyday Summaries, I just use the CLASS statement and let PROC SUMMARY use the memory of my large mainframe computer(s). My current environment is memory rich, and I enjoy the (relatively) reduced CPU time and EXCP count of this approach.

When I have impossibly large SAS data sets to sort, I use a technique that I developed, called "Horizontal Summarization". This is documented in the SUGI 22 paper: "Summarizing Impossibly Large SAS Data Sets for the Data Warehouse Server Using Horizontal Summarization". It can be found in the SUGI 22 proceedings, now available either in printed form or on CD-ROM.

I hope that this suggestion proves helpful now, and in the future!

Of course, all of these opinions and insights are my own, and do not reflect those of my organization or my associates.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Michael A. Raithel E-mail: maraithel@mcimail.com Author: Tuning SAS Applications in the MVS Environment ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ ...There's a feeling I get when I look to the West, and my spirit is crying for leaving... ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Back to: Top of message | Previous page | Main SAS-L page