Date: Wed, 2 Feb 2005 09:25:23 -0500
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: SAS Merge
Amrita:
On a Unix platform we found that compressing source system files with GNU
gzip and piping the compressed data through zcat into a SAS view that
filters and projects (a data step view with INPUT statements, then an SQL
SELECT statement with a WHERE clause) worked much faster than reading the
uncompressed system files. This strategy works well when the process subsets
rows and columns on input.
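As a rough illustration of the same idea outside SAS, here is a minimal
Python sketch that decompresses a gzip file as a stream and keeps only the
wanted rows and columns as it reads. It is an analogue under stated
assumptions, not our production code: the function name read_subset, the
sample column names, and the sample data are all hypothetical.

```python
import csv
import gzip
import os
import tempfile


def read_subset(path, keep_columns, predicate):
    """Stream a gzip-compressed CSV, yielding only wanted rows and columns."""
    # gzip.open in text mode decompresses lazily, so the full file is never
    # expanded on disk or held in memory.
    with gzip.open(path, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if predicate(row):  # row filter (the role of the WHERE clause)
                # column projection (the role of the SELECT list)
                yield {name: row[name] for name in keep_columns}


# Build a tiny compressed sample file (made-up data) to exercise the function.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(sample_path, mode="wt", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["id", "state", "amount", "notes"])
    writer.writerow(["1", "MD", "10", ""])
    writer.writerow(["2", "VA", "25", "late"])
    writer.writerow(["3", "MD", "40", ""])

rows = list(read_subset(sample_path, ["id", "amount"],
                        lambda r: r["state"] == "MD"))
print(rows)
```

Because the filter and projection run while the data streams in, the job
trades some CPU time for decompression against much less disk I/O.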
A similar strategy partitions source data into related subsets to eliminate
unrelated columns and the repeated storage of the same values. Another
useful method eliminates the space that empty text variables occupy in
'flatfile' databases.
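One way to read the partitioning idea, sketched in Python rather than SAS:
split each flat record into a lookup table that stores the repeated values
once per key, and a slimmer detail table that drops empty text values. The
function name partition, the field names, and the sample records are
hypothetical, and the sketch assumes the repeated fields really are constant
within a key (the first values seen are the ones kept).

```python
def partition(records, key, repeated_fields):
    """Split flat records into a lookup table and a slimmer detail table."""
    lookup = {}
    detail = []
    for rec in records:
        # Store the repeated values once per key (first occurrence wins).
        lookup.setdefault(rec[key], {f: rec[f] for f in repeated_fields})
        # Keep the key and the fields that vary, dropping empty text values.
        detail.append({k: v for k, v in rec.items()
                       if k not in repeated_fields and v != ""})
    return lookup, detail


# Made-up flat-file records: name and city repeat for every order.
flat = [
    {"cust": "A", "name": "Acme", "city": "Rockville", "order": "1", "note": ""},
    {"cust": "A", "name": "Acme", "city": "Rockville", "order": "2", "note": "rush"},
    {"cust": "B", "name": "Bolt", "city": "Athens", "order": "3", "note": ""},
]

lookup, detail = partition(flat, "cust", ["name", "city"])
print(lookup)
print(detail)
```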
Using these strategies we have many fewer production bottlenecks. Even
though 'fuzzy linkage' of very large volumes of data tends to make explosive
demands on memory and disk space, only rarely do we have to test the limits
of our servers.
Sig
-----Original Message-----
From: SAS(r) Discussion
To: SAS-L@LISTSERV.UGA.EDU
Sent: 2/2/2005 12:00 AM
Subject: Re: SAS Merge
Hi,
The SAS dataset only has a few variables for counts while the flat file has
the chunk of the data. The selects and models are run using SAS for the most
part. We also use some products by Group1 Software. We currently have a
20 million record file with live data which has distributions similar to the
final 130 million record file. We can use that for estimations...thanks for
the suggestion.
Amrita
In a message dated 2/1/2005 10:42:10 P.M. Eastern Standard Time,
nospam@HOWLES.COM writes:
The last step creates both an external flat file and a SAS data set. Are
you going to keep both (*two* 700-GB footprints)? If not, why create both?
Another way of getting at this: are the "selects and models" to be "run
during the week" done with SAS, or something else, or a mix? What if any
non-SAS products are involved here?
In any case, have you tried generating a 700-GB test file with fake data
but somewhat realistic distributions to gauge the performance of your
weekday jobs? If not, you may have some unpleasant surprises later.