Date: Tue, 4 Jan 2005 15:10:05 -0500
Reply-To: "James, Steve" <SPJ1@CDC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "James, Steve" <SPJ1@CDC.GOV>
Subject: Re: Big Table / Little Table Update
Content-Type: text/plain; charset="us-ascii"
Mike,
The master data set has about 42 million records in it. I don't know
how representative the numbers from this run are since it's just test
code and the transaction table is actually created from the master (see
log below).
150 records per patient sounds high to me. As I mentioned before I
don't know the data that well. I'll have to find out more. I'm trying
to run the program again and make some changes to bufsize and ibufsize
that Michael Raithel suggested. It'll be interesting to see if it makes
a difference.
Steve
8 data Master (index=(key)) Trans ;
9 set j.vaDetailsPre04;
10 if datepart(storedate) ge mdy(12,31,2004) then output Trans
;
11 else output Master ;
12 run;
NOTE: There were 42179057 observations read from the data set
J.VADETAILSPRE04.
NOTE: The data set WORK.MASTER has 42156526 observations and 29
variables.
NOTE: The data set WORK.TRANS has 22531 observations and 29 variables.
NOTE: DATA statement used (Total process time):
real time 1:06:05.36
cpu time 17:24.85
-----Original Message-----
From: Mike Rhoads [mailto:RHOADSM1@WESTAT.com]
Sent: Monday, January 03, 2005 5:39 PM
To: James, Steve; SAS-L@LISTSERV.UGA.EDU
Subject: RE: Big Table / Little Table Update
Steve,
Glad you're making (some) progress! A few thoughts / suggestions /
questions follow.
While you're still testing things out, I might use a separate PROC SQL
step
for each PROC SQL statement. The extra overhead should be minimal, and
that
would give you a better idea of how each statement is performing
time-wise.
You can remove the extra PROC SQLs when / if you put the code into
production.
If I'm understanding the log, you have 22531 transactions for 5323
patients.
Your final DATA step reports that 780,472 records were updated, which
implies that your master file averages around 150 records per patient --
does that sound correct?
Do you know about how many patients, and records, are in your master?
It
would be interesting to know approximately what percentage get updated
each
day.
Mike Rhoads
Westat
RhoadsM1@Westat.com
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
James,
Steve
Sent: Monday, January 03, 2005 5:00 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Big Table / Little Table Update
Thanks for the helpful tips last week about my problem concerning
updating a big table with duplicate key values from a much smaller
transaction table. Several people gave me some good advice as well as
code. I'm testing a solution from Mike Rhoads (see log below) and I was
able to get the CPU time down to under 4 minutes (from 42 minutes).
However the clock time remains quite long. For the run today it was
over 2 hours just for the update. When I ran it last week (after most
people had left for the holiday), it ran in about 45 minutes of clock
time with about the same CPU time as before. I can't understand why
there's such a discrepancy: 4 minutes vs 2 hours? Is it I/O?
How can I convince my co-worker that while his solution takes 42 minutes
of CPU time but roughly the same amount of real time that he should
switch methods to the one taking less CPU time?
We're running on a Sunfire 6800 w/ 8 processors running SAS 9.1 under
Solaris 64 bit.
Steve James
Centers for Disease Control and Prevention
sjames@cdc.gov
21
/*----------------------------------------------------------------------
---
22 Append Trans data to Master.
23
------------------------------------------------------------------------
-*/
24 proc sql;
25 insert into Master select * from Trans
26 ;
NOTE: 22531 rows were inserted into WORK.MASTER.
27
28
/*----------------------------------------------------------------------
---
29 Create a table containing the Keys that are in Master that
also are in
30 Trans. Since you have already appended Trans to Master,
there should be
31 no IDs in Trans that are not in Master.
32
------------------------------------------------------------------------
-*/
33 create table KeysInTrans as
34 select key, max(storedate) as date
35 from Trans
36 group by key
37 ;
NOTE: Table WORK.KEYSINTRANS created, with 5323 rows and 2 columns.
38
39
/*----------------------------------------------------------------------
--
40 Get the most recent visit date for each Key.
41
------------------------------------------------------------------------
--*/
42 CREATE TABLE MostRecentVisit AS
43 SELECT T.Key, MAX(StoreDate) AS Date
44 FROM KeysInTrans AS T, Master as M
45 where T.Key = M.Key
46 GROUP BY T.Key ;
NOTE: Table WORK.MOSTRECENTVISIT created, with 5323 rows and 2 columns.
47
48 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 1:39:35.63
cpu time 4:29.72
49
50
/*----------------------------------------------------------------------
---
51 Now read in record in Trans and update the corresponding
records in Master.
52
------------------------------------------------------------------------
-*/
3 The SAS
System 11:10 Monday, January 3, 2005
53
54 data Master ;
55 set MostRecentVisit end=lastob ;
56 do until (_iorc_ = %sysrc(_dsenom));
57 modify Master key=key ;
58 select (_iorc_);
59 when (%sysrc(_sok)) do;
60 if storedate eq date
61 then cdc_avrflag=1 ;
62 else cdc_avrflag=0 ;
63 replace Master ;
64 records_updated + 1 ;
65 end; /* end of where sysrc=_sok */
66 when (%sysrc(_dsenom)) do;
67 _error_ = 0;
68 end ;
69 otherwise;
70 end; /* end of select (_iorc_) statement */
71 end; /* end of do until () statement */
72 if lastob then put 'the number of records updated were: '
records_updated comma9. ;
73 run;
the number of records updated were: 780,472
NOTE: There were 5323 observations read from the data set
WORK.MOSTRECENTVISIT.
NOTE: The data set WORK.MASTER has been updated. There were 780472
observations rewritten, 0 observations added and 0 observations
deleted.
NOTE: DATA statement used (Total process time):
real time 2:21:58.26
cpu time 3:52.51