Date: Thu, 5 Aug 2004 10:47:00 -0700
Reply-To: "Choate, Paul@DDS" <pchoate@DDS.CA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Choate, Paul@DDS" <pchoate@DDS.CA.GOV>
Subject: Re: Size of ICK file when sorting
Yes indeed:
data char (compress=char)
     bin  (compress=binary);
   array bigchar bigchar1-bigchar50;
   do i = 1 to 1000;
      do over bigchar;
         bigchar = ranuni(1);
      end;
      output;
   end;
run;
NOTE: The data set WORK.CHAR has 1000 observations and 51 variables.
NOTE: Compressing data set WORK.CHAR increased size by 7.69 percent.
Compressed is 28 pages; un-compressed would require 26 pages.
NOTE: The data set WORK.BIN has 1000 observations and 51 variables.
NOTE: Compressing data set WORK.BIN increased size by 7.69 percent.
Compressed is 28 pages; un-compressed would require 26 pages.
NOTE: DATA statement used (Total process time):
real time 0.09 seconds
cpu time 0.04 seconds
Where does compress=binary create a benefit?
Paul Choate
DDS Data Extraction
(916) 654-2160
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Jack Hamilton
Sent: Thursday, August 05, 2004 10:30 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Size of ICK file when sorting
"Choate, Paul@DDS" <pchoate@DDS.CA.GOV> wrote:
>The documentation only recommends binary
>compression on long records with lots of binary data:
Personally, I would not claim that SAS documentation is always clear
and complete and unambiguous and never needs empirical verification.
=====
data char (compress=char)
     bin  (compress=binary);

   length bigchar $50.;

   bigchar = repeat('00'x, 49);

   do i = 1 to 1000;
      output;
   end;

run;
NOTE: The data set WORK.CHAR has 1000 observations and 2 variables.
NOTE: Compressing data set WORK.CHAR decreased size by 60.00 percent.
Compressed is 6 pages; un-compressed would require 15 pages.
NOTE: The data set WORK.BIN has 1000 observations and 2 variables.
NOTE: Compressing data set WORK.BIN decreased size by 53.33 percent.
Compressed is 7 pages; un-compressed would require 15 pages.
=====
Binary compression in this case is fairly effective, even though the
data set doesn't meet the requirements in the documentation.
"Effective" is, of course, a subjective term.
=====
data char (compress=char)
     bin  (compress=binary);

   znum = 0;

   array z z1-z100;
   do over z;
      z = znum;
   end;

   do i = 1 to 1000;
      output;
   end;

   drop i;

run;
NOTE: The data set WORK.CHAR has 1000 observations and 101 variables.
NOTE: Compressing data set WORK.CHAR decreased size by 94.12 percent.
Compressed is 3 pages; un-compressed would require 51 pages.
NOTE: The data set WORK.BIN has 1000 observations and 101 variables.
NOTE: Compressing data set WORK.BIN decreased size by 94.12 percent.
Compressed is 3 pages; un-compressed would require 51 pages.
=====
Here, the criteria are clearly met, yet binary compression performs no
better than character compression.
--
JackHamilton@FirstHealth.com
Manager, Technical Development
Metrics Department, First Health
West Sacramento, California USA
>>> "Choate, Paul@DDS" <pchoate@DDS.CA.GOV> 08/05/2004 9:56 AM >>>
The documentation only recommends binary compression on long records
with lots of binary data:
<sasdoc9>
This method is highly effective for compressing medium to large
(several hundred bytes or larger) blocks of binary data (numeric
variables). Because the compression function operates on a single
record at a time, the record length needs to be several hundred bytes
or larger for effective compression.
</sasdoc9>
I don't know how well it works if character data is interspersed in
the binary data.
Paul Choate
DDS Data Extraction
(916) 654-2160
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Jack Hamilton
Sent: Thursday, August 05, 2004 9:51 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Size of ICK file when sorting
There's no fixed rule for when data sets should be compressed. Some
data sets compress well, and others actually get larger when
compressed. You just have to try it. The closest you can come to a
rule is "If you have character variables with many repeating
characters (including blanks), then use compression".
I haven't used COMPRESS=BINARY enough to come up with a rule of thumb
for its use.
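
A minimal sketch of the "just try it" approach for the repeating-blanks
case: write the same rows once uncompressed and once with COMPRESS=CHAR,
then compare. The data set names, the variable NAME, and its length of
200 are made up for illustration; substitute your own data.

```sas
/* Hypothetical test data: a long character variable that is mostly
   trailing blanks, the case the rule of thumb above describes. */
data work.nocomp  (compress=no)
     work.charcmp (compress=char);
   length name $200;
   do i = 1 to 1000;
      name = cats('ID', i);   /* short value; the rest of the 200 bytes is blank */
      output;
   end;
   drop i;
run;

/* Compare the page counts and the Compressed attribute of the two copies. */
proc contents data=work.nocomp;
run;
proc contents data=work.charcmp;
run;
```

The DATA step log should also carry a "Compressing data set ..." NOTE
for WORK.CHARCMP, like the ones quoted elsewhere in this thread, saying
whether the size went down or up.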
--
JackHamilton@FirstHealth.com
Manager, Technical Development
Metrics Department, First Health
West Sacramento, California USA
>>> "Chuck Enright" <chuck_sas@cfedata.com> 08/04/2004 7:04 PM >>>
Jack,
If my primary goal is to minimize the disk space used, with processing
time a secondary goal, should I avoid using the system option and use
the dataset option only for permanent datasets?
Quoting Jack Hamilton <JackHamilton@FIRSTHEALTH.COM>:
> What are your compression options? Is it possible that the input data
> set is compressed and the output data set is not?
>
>
> --
> JackHamilton@FirstHealth.com
> Manager, Technical Development
> Metrics Department, First Health
> West Sacramento, California USA
>
> >>> <sophe88@YAHOO.COM> 08/03/2004 11:08 AM >>>
> I am trying to sort a 3 GB file under SAS 9.1 for Windows.
>
> proc sort data=mylib.abc(drop=var1 var2) nodupkeys out=mylib.out1;
> by id; run;
>
> When the sort starts, I notice a file with the extension .lck in the
> library location, growing in size as the sort goes on.
>
> But the .lck file (which is supposed to replace mylib.out1) actually
> exceeded mylib.abc in size, and the sort showed no sign of stopping.
>
> I know var1 and var2 are both character variables with length=100, and
> about 20% of the records are duplicates by ID. What is wrong here?
>
> PD
>
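
One way to check the compressed-input, uncompressed-output possibility
raised in the reply above is sketched here. It reuses the names from
the original post; requesting COMPRESS=YES on the output is an
assumption about what the poster would want, not something the thread
confirms.

```sas
/* Is the input compressed? Look at the Compressed attribute in the output. */
proc contents data=mylib.abc;
run;

/* Request a compressed output data set explicitly, so the result does not
   depend on the COMPRESS= system option in effect. */
proc sort data=mylib.abc(drop=var1 var2) nodupkeys
          out=mylib.out1(compress=yes);
   by id;
run;
```

If PROC CONTENTS reports the input as compressed and the sort output is
not, an output file larger than the input would not be surprising, even
with two variables dropped and duplicates removed.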