Date: Fri, 24 Aug 2001 16:47:24 -0400
Reply-To: rpresley <rpresley@GMCF.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: rpresley <rpresley@GMCF.ORG>
Subject: monitor maximum SAS WORK library space during a job
I want to be able to determine the maximum disk space utilized by SAS for
the WORK library during any given job.
In our environment, SAS v8.1 on Unix, the most common limiting resource is
disk space. This was especially true when our analysts tended to create many
intermediate data sets on the way to a solution. I have been preaching the
virtues of SQL and Data Step Views to avoid using precious disk space to
write intermediate data sets. Of course we pay for this in CPU time. But
when we run out of disk space we stop. When we use more CPU time it just
takes more time.
We often cascade or chain together VIEWS similar to the use of nested
queries in SQL.
For example, suppose we have three physical data sets: ONE, TWO, and THREE.
We may create a view, V_A, by merging ONE and TWO. We may then create
another view by subsetting that view: V_A_SUB. We may create a view by
collapsing across groups in THREE and name this view V_B. We may then
subset V_B to form the view V_B_SUB. Finally we may merge V_A_SUB and
V_B_SUB. To perform this last merge we would, conceptually, need to
"create" the data sets V_A_SUB and V_B_SUB, and to create those we would
need to "create" their predecessors. From the logs and the observed
performance, however, it is apparent that SAS does not have to create a
temporary physical data set for the entire contents of a view and all of its
predecessors in order to process subsequent views. If the view is simply
a subset of a single data set, then there is probably no need to create
any temporary physical data set. This seems analogous to using a WHERE=
data set option. If a view is a combination of
two or more large data sets it may be necessary to write a temporary
physical data set in order to do further processing on that view. But in
most instances I have observed that the temporary physical files created in
the work directory are not as large as one would need if the physical data
sets corresponding to each view were constructed at the same time. I
suspect there is some clever optimization going on in the background.
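
To make the example concrete, the chain of views described above might be
written roughly as follows. This is only a sketch: the key variable ID, the
variables X and Y, the derived Y_TOTAL, and both subsetting conditions are
hypothetical names and values chosen for illustration, and ONE and TWO are
assumed to be sorted by ID already.

```sas
/* Sketch only: ID, X, Y, Y_TOTAL and the cutoffs are hypothetical. */
/* ONE and TWO are assumed to be sorted by ID.                      */

/* V_A: merge ONE and TWO, stored as a DATA step view */
data v_a / view=v_a;
  merge one two;
  by id;
run;

/* V_A_SUB: subset of V_A */
data v_a_sub / view=v_a_sub;
  set v_a;
  where x > 0;
run;

proc sql;
  /* V_B: collapse THREE across groups */
  create view v_b as
    select id, sum(y) as y_total
    from three
    group by id;

  /* V_B_SUB: subset of V_B, ordered by ID for the merge below */
  create view v_b_sub as
    select *
    from v_b
    where y_total > 100
    order by id;
quit;

/* Final step: the only place a physical data set is requested */
data final;
  merge v_a_sub (in=in_a) v_b_sub (in=in_b);
  by id;
  if in_a and in_b;
run;
```

Nothing is read from ONE, TWO, or THREE until the last step runs; the four
view definitions store only the instructions for building their rows.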
Conceptually an SQL equijoin would require the creation of the entire
Cartesian cross product of the two data sets. In reality such a large
physical data set is not created. No doubt some of the same optimization
strategies that apply to SQL joins will also apply to the cascading or
chaining of multiple VIEWS that were created by SQL or data steps.
I can easily calculate how much disk space a data view would occupy if it
were a physical file. If I could determine how much physical disk space was
being used by SAS in the work library at any given moment then I could start
to develop an understanding of how our ordering and constructing of cascaded
/ chained VIEWS affects the amount of WORK space required. I could also see
just how much we are reducing our need for disk space by using views.
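
A rough sketch of the kind of measurement meant here, under several
assumptions: a Unix host where `du` is available, a SAS session that allows
pipes (the XCMD option), and a PATHNAME function that accepts the WORK libref
in this release. The macro name WORKSIZE and the fileref WK_DU are
hypothetical. It samples the WORK directory only at the points where it is
called, between steps, so it would not capture a peak reached in the middle
of a step.

```sas
/* Sketch only: WORKSIZE and WK_DU are hypothetical names.         */
/* Assumes Unix, du on the path, and pipes permitted (XCMD).       */
%macro worksize(label);
  /* du -sk prints: <size in KB> <directory> */
  filename wk_du pipe "du -sk %sysfunc(pathname(work))";

  data _null_;
    infile wk_du truncover;
    input kb path :$200.;
    put "WORK usage at &label: " kb comma12. " KB";
  run;

  filename wk_du clear;
%mend worksize;

/* Example call placed between two steps of a job */
%worksize(before final merge)
```

Calling the macro before and after each major step and comparing the logged
values would give a lower bound on the maximum WORK space used by the job.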
Your answer to this particular question will be appreciated. If you can
also point me to a resource where I can learn more about what determines the
need for temporary files I will be very pleased.
Rodney J. Presley, PhD
Georgia Medical Care Foundation
57 Executive Park South, NE
Atlanta, GA 30329-2224
404-982-0411 ext. 7574