| Date: | Wed, 14 Sep 2005 16:33:26 -0400 |
| Reply-To: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Subject: | Re: SAS in a production environment |
|
| Content-Type: | text/plain; charset="US-ASCII" |
> ....extra work to write the overhead code that rolls up the daily
files when you need them. But this is fun SAS
programming!
De gustibus no est disputandum (To each his own fun)!
I have done a lot of work under both of these data storage schema, and
have developed a strong preference for stacking tables with relatively
few variables and identical structures. I don't know what 'merged into
one large historical' means, but creating longer and longer series of
column variables will really grind production work down to slow motion.
Many data analysis projects spend much of their limited resources
managing variables labels. On the other extreme, replacing sequence
identifiers in data with file names shifts information from a database
system to a file system. That typically increases access time and cost,
and, perhaps worse, introduces many possibilities for errors as file
systems change and applications move from platform to platform. Key
values move with data tables, and do not depend on directory path names,
OS file names, or physical media.
SAS scans data and selects subsets so quickly that selecting from long,
narrow data structures works about as fast as indexing. Restructuring
rows to columns also works fast enough to justify storing data in a
manner that supports rapid selection and subsetting.
I think that it is safe to say now that relational databases have
survived the fitness competition during the evolution of database
technologies. Why step back a few rungs on the evolutionary ladder?
Sig
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Scott Czepiel
Sent: Wednesday, September 14, 2005 12:15 AM
To: SAS-L@listserv.uga.edu
Subject: Re: SAS in a production environment
Based on my direct experience with pretty much the exact same scenario,
and
having tried both options, my endorsement is for option #1: lots of
small
files as opposed to one large file.
The primary disadvantage to scenario 2 is that it doesn't scale. In the
short run perhaps the performance hit is not very noticeable, but in the
long run, you'll wake up one day to find that your monolithic dataset
takes
half an hour to set. Especially if you are adding data daily, it won't
be
long before your programs slow to a crawl. And who's nightmare will it
be if
your disk throws a bad sector somewhere in the middle of that one file?
If
you have 900 daily files and one goes bad, it's no seat: you lose one
day's
worth of data and can quickly reconstruct (do keep your raw flat files
around somewhere and back them up in 4 different places, preferably in 4
different time zones, on 4 different media, and residing physically in
at
least 2 different hemispheres). But if you are relying on the integrity
of
one single file to store everything, disaster is just one errant mouse
click
away.
Of course, option 1 means some extra work to write the overhead code
that
rolls up the daily files when you need them. But this is fun SAS
programming! Create an "index" dataset containing datestamps and a
string
var holding the dataset name. It won't take long to write a macro that
will
build a set statement that grabs all the daily datasets within a given
range
that is passed as an argument to the macro. Enjoy!
--
sc
work: www.otxresearch.com <http://www.otxresearch.com>
home: www.czep.net <http://www.czep.net>
On 9/13/05, SAS User <sasuser@hotpop.com> wrote:
>
> I would like to solicit the opinions of other SAS users.
>
> We are looking to bring some nightly production processing into a SAS
> environment. We have some predictive models that run off 13 months of
> historical data and they run fairly fast and we are happy with the
> performance.
>
> We are trying to decide on the best strategy to store this input data.
> I realize this is a general question to a specific process and if I
> need to elaborate more, I will do so.
>
> The short question:
>
> Is it better to store this historical data in multiple SAS datasets or
> one large dataset?
>
> We can approach the storage in 2 ways;
>
> Scenario 1:
> Data comes nightly in a flat file format and is converted to a daily
> SAS dataset and fed into our models. We time/date stamp the input
> files and write our SAS code to run against the most current daily SAS
> dataset and pull data from the other 100's of historical daily
> datasets to feed our models and produce the desired output.
>
> Or
>
> Scenario 2:
> Data comes nightly in a flat file format and is converted to a daily
> SAS dataset and merged into one large historical which is then fed
> into our models to produce the desired output. The Daily SAS file is
> then deleted.
>
> Does is matter? Are there reasons to select one storage method over
> another?
>
> We are running the models and storing the datasets on an 8 processes
> or UNIX box.
>
> Thank you for your opinion.
>
|