Date: Fri, 6 Jan 2006 13:13:19 -0500
Reply-To: Mike Rhoads <RHOADSM1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Mike Rhoads <RHOADSM1@WESTAT.COM>
Subject: Re: Simultaneous append and prune a large data set
Content-Type: text/plain; charset="us-ascii"
Clint,
This may not be particularly practical, but at least theoretically it
would be an option to avoid making a copy.
It relies on the capability of updating SAS data sets in place (MODIFY
statement etc.). Logically it would involve "pairing up" records to be
deleted with records to be added, and using MODIFY to replace each one
of the former with one of the latter. If more adds than deletes, put
the extra ones at the end. If more deletes than adds, then use REMOVE
to get rid of the extra old records.
This could of course lead to slow growth of the file over time as you
get more "wasted" space, so it would probably be necessary to do a
recopy on some periodic basis to reclaim that space. It would also
eliminate any real order to the records, if that is of interest. You
could of course maintain an index to get around that issue.
It would also require quite a bit of programming and testing, and I
don't know how performance would be. Like I said, it's an idea and not
necessarily a recommendation.
Good luck!
Mike Rhoads
Westat
RhoadsM1@Westat.com
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Rickards, Clinton (GE Consumer Finance)
Sent: Friday, January 06, 2006 11:17 AM
To: SAS-L (E-mail)
Subject: Simultaneous append and prune a large data set
All,
I think I know the answer to this but I'm hoping for someone can give me
some alternatives.
I have a job I inherited which maintains a large data set (220+ GB). The
goal is to keep a rolling 180 days of data. The job extracts the current
day's data into a temp data set and PROC APPENDs that data to the full
data set. A data step then rewrites the full data set, using a where
clause to prune data more than 180 days old. Here is the essence of the
program:
%let keepdate = "06jul2005"d; <<<< recalculated daily...
proc append base=lib.full data=work.daily;
run;
data lib.full;
set lib.full (where=(tran_date < &keepdate));
run;
The problem is that in the data step, SAS is making a second copy of
lib.full, consuming all available disk space. We are working with our
sys admins to clear space and to allocate more disk space.
My basic understanding has always been that there is no way to avoid
making a copy of the full data set, either into WORK or into another
library. Our work space averages 270-300GB free, big enough but close
given other users on the system. I am considering modifying the program
to look like this:
%let keepdate = "06jul2005"d; <<<< recalculated daily...
libname otherlib "in our saswork directory";
data otherlib.days179;
set lib.full (where=(tran_date < &keepdate));
run;
proc datasets lib=lib;
delete full;
run;
data lib.full;
set otherlib.days179
work.daily ;
run;
Any other alternatives or thoughts?
TIA,
Clint