LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (September 2005, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 14 Sep 2005 16:33:26 -0400
Reply-To:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:   Re: SAS in a production environment
Comments:   To: sczepiel@gmail.com
Content-Type:   text/plain; charset="US-ASCII"

> ....extra work to write the overhead code that rolls up the daily files when you need them. But this is fun SAS programming!

De gustibus no est disputandum (To each his own fun)!

I have done a lot of work under both of these data storage schema, and have developed a strong preference for stacking tables with relatively few variables and identical structures. I don't know what 'merged into one large historical' means, but creating longer and longer series of column variables will really grind production work down to slow motion. Many data analysis projects spend much of their limited resources managing variables labels. On the other extreme, replacing sequence identifiers in data with file names shifts information from a database system to a file system. That typically increases access time and cost, and, perhaps worse, introduces many possibilities for errors as file systems change and applications move from platform to platform. Key values move with data tables, and do not depend on directory path names, OS file names, or physical media.

SAS scans data and selects subsets so quickly that selecting from long, narrow data structures works about as fast as indexing. Restructuring rows to columns also works fast enough to justify storing data in a manner that supports rapid selection and subsetting.

I think that it is safe to say now that relational databases have survived the fitness competition during the evolution of database technologies. Why step back a few rungs on the evolutionary ladder? Sig

-----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Scott Czepiel Sent: Wednesday, September 14, 2005 12:15 AM To: SAS-L@listserv.uga.edu Subject: Re: SAS in a production environment

Based on my direct experience with pretty much the exact same scenario, and having tried both options, my endorsement is for option #1: lots of small files as opposed to one large file.

The primary disadvantage to scenario 2 is that it doesn't scale. In the short run perhaps the performance hit is not very noticeable, but in the

long run, you'll wake up one day to find that your monolithic dataset takes half an hour to set. Especially if you are adding data daily, it won't be long before your programs slow to a crawl. And who's nightmare will it be if your disk throws a bad sector somewhere in the middle of that one file? If you have 900 daily files and one goes bad, it's no seat: you lose one day's worth of data and can quickly reconstruct (do keep your raw flat files around somewhere and back them up in 4 different places, preferably in 4

different time zones, on 4 different media, and residing physically in at least 2 different hemispheres). But if you are relying on the integrity of one single file to store everything, disaster is just one errant mouse click away.

Of course, option 1 means some extra work to write the overhead code that rolls up the daily files when you need them. But this is fun SAS programming! Create an "index" dataset containing datestamps and a string var holding the dataset name. It won't take long to write a macro that will build a set statement that grabs all the daily datasets within a given range that is passed as an argument to the macro. Enjoy!

-- sc work: www.otxresearch.com <http://www.otxresearch.com> home: www.czep.net <http://www.czep.net>

On 9/13/05, SAS User <sasuser@hotpop.com> wrote: > > I would like to solicit the opinions of other SAS users. > > We are looking to bring some nightly production processing into a SAS > environment. We have some predictive models that run off 13 months of > historical data and they run fairly fast and we are happy with the > performance. > > We are trying to decide on the best strategy to store this input data.

> I realize this is a general question to a specific process and if I > need to elaborate more, I will do so. > > The short question: > > Is it better to store this historical data in multiple SAS datasets or

> one large dataset? > > We can approach the storage in 2 ways; > > Scenario 1: > Data comes nightly in a flat file format and is converted to a daily > SAS dataset and fed into our models. We time/date stamp the input > files and write our SAS code to run against the most current daily SAS

> dataset and pull data from the other 100's of historical daily > datasets to feed our models and produce the desired output. > > Or > > Scenario 2: > Data comes nightly in a flat file format and is converted to a daily > SAS dataset and merged into one large historical which is then fed > into our models to produce the desired output. The Daily SAS file is > then deleted. > > Does is matter? Are there reasons to select one storage method over > another? > > We are running the models and storing the datasets on an 8 processes > or UNIX box. > > Thank you for your opinion. >


Back to: Top of message | Previous page | Main SAS-L page