Date: Fri, 7 Jan 2005 10:18:06 -0500
Reply-To: "James, Steve" <SPJ1@CDC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "James, Steve" <SPJ1@CDC.GOV>
Subject: Re: Big Table / Little Table Update
Content-Type: text/plain; charset="us-ascii"
Sigurd,
Thanks for your response. I've been waiting to post my progress to
SAS-L but have been having some problems getting jobs to finish. I'd
run a job and it would run all night and not finish. The systems people
have noticed other problems as well and they're going to reboot at
lunchtime. Hopefully that will speed things up.
I tried using some code using the MOIDIFY command that Mike Rhoads gave
me but while it was much faster as far as cpu time goes, the clock time
was like 50-100 times longer. Michael Raithel suggested I change the
bufsize, bufno and ibufsize values, and throughput increased
significantly. I don't remember specifics but it increased by at least
5 to 10 times. Initially I tried bufsize=32767, ibufsize=32767, and
bufno=10 (defaults were 0,0, and 1 respectively). That's when I got the
5 to 10-fold improvement. Subsequent trials was where I started
experiencing the slowdown and jobs never finished for me to determine
what effect the changes had (unless, of course, I was responsible for
the slowdown). Initially I increased the bufsize to the maximum the
system would take (2 billion and something if I recall), but the log
said that it wouldn't take something that large for some reason. I
dropped it down to a smaller number and the jobs never finished.
My co-worker initially tried using hash tables and the way he did it was
to recreate the table with a data step. That would take about 42
minutes of cpu time and over an hour of real time. That's what I was
hoping to improve upon. I don't know exactly what he was doing with the
hash tables; his code was quite involved and I didn't want to wade
through it and figure it out. But I'll pass your ideas on to him and let
him figure out if it will improve his code.
Thanks again to all the SAS-L folks who've been such a big help to me on
this project.
Steve
-----Original Message-----
From: Sigurd Hermansen [mailto:HERMANS1@WESTAT.com]
Sent: Thursday, January 06, 2005 7:51 PM
To: James, Steve; SAS-L@LISTSERV.UGA.EDU
Subject: RE: Big Table / Little Table Update
Steve:
I've been on the sidelines working on pressing tasks. Ordinarily I would
have jumped in an suggested a SQL solution....
Not this time.... I would go with a hash index, or even a bit array if
you
have enough free memory to hold the ID's of the patients in each update:
- put the ID's for an update in a hash index / bit array as prescribed
in
the Dorfman Letters
to SAS-L (using a temporary array in a Data step);
- UPDATE the flag in the master file IF the ID in a row gets a hit on
the
index/array.
- append the update dataset to the master dataset.
It should take less than a minute to build the index. Scanning a 42M row
dataset should take no more than a few minutes elapsed time on a
reasonably
quick machine. The index/array search should take a negligible amount of
time. The whole process should scale up linearly.
I don't particularly like the design of your database. I prefer to see
separate tables for patient data and for events related to patients. As
the
database stands, the flag will make it easy to select the last update to
the
database for a patient.
Sig
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
James,
Steve
Sent: Tuesday, January 04, 2005 3:10 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Big Table / Little Table Update
Mike,
The master data set has about 42 million records in it. I don't know
how
representative the numbers from this run are since it's just test code
and
the transaction table is actually created from the master (see log
below).
150 records per patient sounds high to me. As I mentioned before I
don't
know the data that well. I'll have to find out more. I'm trying to run
the
program again and make some changes to bufsize and ibufsize that Michael
Raithel suggested. It'll be interesting to see if it makes a
difference.
Steve
8 data Master (index=(key)) Trans ;
9 set j.vaDetailsPre04;
10 if datepart(storedate) ge mdy(12,31,2004) then output Trans
;
11 else output Master ;
12 run;
NOTE: There were 42179057 observations read from the data set
J.VADETAILSPRE04.
NOTE: The data set WORK.MASTER has 42156526 observations and 29
variables.
NOTE: The data set WORK.TRANS has 22531 observations and 29 variables.
NOTE: DATA statement used (Total process time):
real time 1:06:05.36
cpu time 17:24.85
-----Original Message-----
From: Mike Rhoads [mailto:RHOADSM1@WESTAT.com]
Sent: Monday, January 03, 2005 5:39 PM
To: James, Steve; SAS-L@LISTSERV.UGA.EDU
Subject: RE: Big Table / Little Table Update
Steve,
Glad you're making (some) progress! A few thoughts / suggestions /
questions follow.
While you're still testing things out, I might use a separate PROC SQL
step
for each PROC SQL statement. The extra overhead should be minimal, and
that
would give you a better idea of how each statement is performing
time-wise.
You can remove the extra PROC SQLs when / if you put the code into
production.
If I'm understanding the log, you have 22531 transactions for 5323
patients.
Your final DATA step reports that 780,472 records were updated, which
implies that your master file averages around 150 records per patient --
does that sound correct?
Do you know about how many patients, and records, are in your master? It
would be interesting to know approximately what percentage get updated
each
day.
Mike Rhoads
Westat
RhoadsM1@Westat.com
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
James,
Steve
Sent: Monday, January 03, 2005 5:00 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Big Table / Little Table Update
Thanks for the helpful tips last week about my problem concerning
updating a
big table with duplicate key values from a much smaller transaction
table.
Several people gave me some good advice as well as code. I'm testing a
solution from Mike Rhoads (see log below) and I was able to get the CPU
time
down to under 4 minutes (from 42 minutes). However the clock time
remains
quite long. For the run today it was over 2 hours just for the update.
When I ran it last week (after most people had left for the holiday), it
ran
in about 45 minutes of clock time with about the same CPU time as
before. I
can't understand why there's such a discrepancy: 4 minutes vs 2 hours?
Is
it I/O?
How can I convince my co-worker that while his solution takes 42 minutes
of
CPU time but roughly the same amount of real time that he should switch
methods to the one taking less CPU time?
We're running on a Sunfire 6800 w/ 8 processors running SAS 9.1 under
Solaris 64 bit.
Steve James
Centers for Disease Control and Prevention
sjames@cdc.gov
21
/*----------------------------------------------------------------------
---
22 Append Trans data to Master.
23
------------------------------------------------------------------------
-*/
24 proc sql;
25 insert into Master select * from Trans
26 ;
NOTE: 22531 rows were inserted into WORK.MASTER.
27
28
/*----------------------------------------------------------------------
---
29 Create a table containing the Keys that are in Master that
also are in
30 Trans. Since you have already appended Trans to Master,
there should be
31 no IDs in Trans that are not in Master.
32
------------------------------------------------------------------------
-*/
33 create table KeysInTrans as
34 select key, max(storedate) as date
35 from Trans
36 group by key
37 ;
NOTE: Table WORK.KEYSINTRANS created, with 5323 rows and 2 columns.
38
39
/*----------------------------------------------------------------------
--
40 Get the most recent visit date for each Key.
41
------------------------------------------------------------------------
--*/
42 CREATE TABLE MostRecentVisit AS
43 SELECT T.Key, MAX(StoreDate) AS Date
44 FROM KeysInTrans AS T, Master as M
45 where T.Key = M.Key
46 GROUP BY T.Key ;
NOTE: Table WORK.MOSTRECENTVISIT created, with 5323 rows and 2 columns.
47
48 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 1:39:35.63
cpu time 4:29.72
49
50
/*----------------------------------------------------------------------
---
51 Now read in record in Trans and update the corresponding
records in Master.
52
------------------------------------------------------------------------
-*/
3 The SAS
System 11:10 Monday, January 3, 2005
53
54 data Master ;
55 set MostRecentVisit end=lastob ;
56 do until (_iorc_ = %sysrc(_dsenom));
57 modify Master key=key ;
58 select (_iorc_);
59 when (%sysrc(_sok)) do;
60 if storedate eq date
61 then cdc_avrflag=1 ;
62 else cdc_avrflag=0 ;
63 replace Master ;
64 records_updated + 1 ;
65 end; /* end of where sysrc=_sok */
66 when (%sysrc(_dsenom)) do;
67 _error_ = 0;
68 end ;
69 otherwise;
70 end; /* end of select (_iorc_) statement */
71 end; /* end of do until () statement */
72 if lastob then put 'the number of records updated were: '
records_updated comma9. ;
73 run;
the number of records updated were: 780,472
NOTE: There were 5323 observations read from the data set
WORK.MOSTRECENTVISIT.
NOTE: The data set WORK.MASTER has been updated. There were 780472
observations rewritten, 0 observations added and 0 observations
deleted.
NOTE: DATA statement used (Total process time):
real time 2:21:58.26
cpu time 3:52.51