In an old thread about SAS data set compression, Seth Grimes posted the
following reply to my posting:
<<My original posting can be found beneath the Sig line, below>>
>In dealing with datasets with very wide but sparse records -- that is,
>about 2950 variables in each observation with 60-70% zero -- if I don't
>compress the SAS dataset is about 6 times larger than a flat file that uses
>variable-length, delimited fields. Compressing the SAS dataset results in
>a file that's a small percentage larger than the flat file. I figure that
>using variable-length fields in the SAS program would carry too much
>overhead to be worthwhile.
>
Seth, you make a good point about the benefits of compressing SAS data sets!
Peter Crawford made a point along the same lines when he suggested that SAS data
set compression will become more important, and come in handy, for the longer
text variables in Versions 7 and 8 of the SAS System. There is no doubt that SAS
data set compression is a good tool for reducing the footprint of SAS data sets.
My only gripe is that currently, in SAS Version 6.09E, the CPU time overhead of
compressing/de-compressing SAS data sets during processing is heavy. If the
trade-off of DASD space vs. CPU time is acceptable in your organization for the
huge SAS data set, then compression is good for you. If not, then you have a
lot of 'splaining to do to your Computer Performance staff. Either way, as long
as programmers know the Yin and Yang of the choices--Bigger SAS data sets, less
CPU time during processing; Smaller SAS data sets, more CPU time during
processing--they will make the choice that is right for their applications and
their organizations!
Seth, best of luck as you give your SAS observations the Sardine treatment and
squash them into compressed SAS data sets!
I hope that this answer proves helpful now, and in the future!
Of course, all of these opinions and insights are my own, and do not reflect
those of my organization or my associates.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Michael A. Raithel
"The man who wrote the book on performance."
E-mail: raithem@westat.com
Author: Tuning SAS Applications in the MVS Environment
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
When you cease to make a contribution you begin to die. -- Eleanor Roosevelt
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
<<My original posting as presented by Seth in his posting>>
>
> Tim Berryhill posted the following comment to Matt Santoni's recent thread:
>
> >Interesting compression statistics.
>
> >> ----------
> >> From: mvs1000[SMTP:mvs1000@YAHOO.COM]
> >> Reply To: mvs1000
> >> Sent: Thursday, October 14, 1999 8:39 AM
> >> To: SAS-L@LISTSERV.UGA.EDU
> >> Subject: SAS arrays - again
> >>
> ><SNIP>
> >> NOTE: The data set WORK.TEMP1 has 6386 observations and 5 variables.
> >> NOTE: Compressing data set WORK.TEMP1 increased size by 13.79 percent.
> >> Compressed is 33 pages; un-compressed would require 29 pages.
> >> NOTE: The DATA statement used 27.26 seconds.
> ><SNIP>
> >> NOTE: The data set WORK.TEMP2 has 6386 observations and 4 variables.
> >> NOTE: Compressing data set WORK.TEMP2 increased size by 30.43 percent.
> >> Compressed is 30 pages; un-compressed would require 23 pages.
> >> NOTE: The DATA statement used 57.01 seconds.
> >>
>
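The percentages in those NOTEs follow directly from the page counts shown in
the log. A quick check, sketched in Python (the function name is my own for
illustration, not anything SAS provides):

```python
def percent_size_change(compressed_pages, uncompressed_pages):
    """Percent change in size relative to the uncompressed page count."""
    return (compressed_pages - uncompressed_pages) / uncompressed_pages * 100.0

# Page counts copied from the log NOTEs for WORK.TEMP1 and WORK.TEMP2.
temp1 = percent_size_change(33, 29)
temp2 = percent_size_change(30, 23)

print(f"WORK.TEMP1 grew by {temp1:.2f} percent")  # 13.79
print(f"WORK.TEMP2 grew by {temp2:.2f} percent")  # 30.43
```
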
> Tim, your wry but poignant comment underscores one of the pitfalls of SAS data
> set compression that not all SAS programmers may be aware of. Namely, if you
> apply SAS compression to a SAS data set and the data is not ideally suited to
> compressing, you can actually end up with a data set that is larger than the
> original. When you compound this particular occurrence with the increase in
> CPU time expended to access the observations in the compressed SAS data set,
> you have a real, bona fide, big-time LOSE/LOSE situation.
>
> So, how can you end up with a "compressed" SAS data set that is larger than
> the original? Well, it is quite easy, really. On the all-powerful operating
> system known as OS/390, or as MVS, the SAS System puts a 12-byte header,
> containing compression control information, on each observation in the
> compressed data set. The SAS System compresses data within the observations
> according to this chart:
>
> Type of Character   Length of Original Redundant   Compressed
>                     Character String               Length
> -----------------   ----------------------------   --------------
> Binary Zeros        3 to 66                        2
> Blanks              3 to 129                       2
> Missing Values      N/A                            Not Compressed
> All Others          3 to 63                        3
>
> If you have observations where none of the data compresses out, you have an
> increase in size of 12 bytes per observation, so your overall SAS data set
> size increases. Not good; not good at all! For compression to do more than
> break even, you need to compress out at least 13 bytes per observation, just
> to be one byte ahead of the 12-byte overhead compression imposes.
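
The break-even arithmetic can be sketched in a few lines of Python. This is
only a rough model built from the chart and the 12-byte header quoted above,
not a description of how the SAS System actually implements compression. The
names HEADER_BYTES, CHART, bytes_saved, and net_change are hypothetical, and
treating runs outside the chart's length ranges as uncompressed is a
simplifying assumption of mine.

```python
HEADER_BYTES = 12  # per-observation compression header quoted for OS/390

# kind -> (shortest run, longest run, compressed length), from the chart.
CHART = {
    "zeros": (3, 66, 2),
    "blanks": (3, 129, 2),
    "other": (3, 63, 3),
}

def bytes_saved(runs):
    """Bytes squeezed out of one observation.

    `runs` is a list of (kind, length) pairs, one per run of a repeated
    character. Runs outside the chart's range count as uncompressed
    (an assumption made to keep the sketch simple).
    """
    saved = 0
    for kind, length in runs:
        low, high, compressed = CHART[kind]
        if low <= length <= high:
            saved += length - compressed
    return saved

def net_change(runs):
    """Bytes added (+) or removed (-) per observation after compression."""
    return HEADER_BYTES - bytes_saved(runs)

short_obs = [("blanks", 8)]                 # saves 6 bytes
wide_obs = [("zeros", 40), ("blanks", 20)]  # saves 38 + 18 = 56 bytes

print(net_change(short_obs))  # 6: the observation got bigger
print(net_change(wide_obs))   # -44: the observation got smaller
```

Under this model, a short observation with one modest run of blanks ends up
larger after "compression", while a wide observation with long runs of zeros
and blanks comes out well ahead.
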
>
> The four inter-related elements that I look at in deciding upon likely SAS
> data set compression candidates are:
>
> 1. A large percentage of the observations in a SAS data set must compress.
> 2. A large portion of each individual observation must compress.
> 3. Observations must contain a significant amount of adjacent redundancy.
> 4. Observations must be reduced in size by more than the Compression
>    Control header (12 bytes).
>
> Beyond the elements above, a general rule of thumb that I use is that SAS
> data sets with short, or very short, observations are usually poor candidates
> for compression.
>
> Overall, I have never been a big fan of SAS data set compression on the big
> iron. True, it can reduce the overall size of a SAS data set and thus reduce
> my DASD storage charges. True, it can reduce the EXCP count (I/Os) of all
> programs that access the compressed SAS data set and thus reduce my EXCP
> charges. But, even _MORE_TRUE_, it greatly increases the CPU time of all
> programs that access the compressed SAS data set, greatly increasing my CPU
> time charges. Since most organizations that I have worked with that have IS
> Charge Back software favor charging more for CPU time, the two YINs (reduced
> data set size and reduced EXCP count) are outweighed by the big YANG (greatly
> increased CPU time). Of course, off of the big iron, this may be a non-issue.
>
> Best of luck to those of you who are trying to put your SAS data sets on a
> storage diet via SAS data set compression. I hope that it doesn't come back
> to byte you!
>
> I hope that this answer proves helpful now, and in the future!
>
> Of course, all of these opinions and insights are my own, and do not reflect
> those of my organization or my associates.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Michael A. Raithel
> "The man who wrote the book on performance."
> E-mail: raithem@westat.com
> Author: Tuning SAS Applications in the MVS Environment
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> ..you can't start a fire; you can't start a fire without a spark... -- Bruce
> Springsteen
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Syst
--
Seth Grimes Alta Plana database & Web / design & development
grimes@altaplana.com http://altaplana.com 301-873-8225