Date: Thu, 30 Jun 2005 14:59:22 -0700
Reply-To: "Pardee, Roy" <pardee.r@GHC.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Pardee, Roy" <pardee.r@GHC.ORG>
Subject: Re: Deleting an enumerated list of records from a huge table--how
to do it efficiently?
Content-Type: text/plain; charset="us-ascii"
Nice! Thanks very much. I need to stop being such a prisoner of sql...
Thanks!
-Roy
-----Original Message-----
From: Dorfman, Paul [mailto:paul.dorfman@FCSO.COM]
Sent: Thursday, June 30, 2005 2:44 PM
To: SAS-L@LISTSERV.UGA.EDU; Pardee, Roy
Subject: Re: Deleting an enumerated list of records from a huge
table--how to do it efficiently?
Roy,
I do not have a 3G file, but here is an experiment on a smaller scale:
62 data huge (index = (pk = (vistnum proc_code))) ;
63 do vistnum = 0 to 1e7 - 1 ;
64 proc_code = put (vistnum, z8.) ;
65 output ;
66 end ;
67 run ;
NOTE: The data set USER.HUGE has 10000000 observations and 2 variables.
INFO: Multiple concurrent threads will be used to create the index.
NOTE: Composite index pk has been defined.
NOTE: DATA statement used (Total process time):
real time 29.87 seconds
cpu time 40.79 seconds
68
69 data del ;
70 do vistnum = 0 to 1e7 - 1 by 1e3 ;
71 proc_code = put (vistnum, z8.) ;
72 output ;
73 end ;
74 run ;
NOTE: The data set USER.DEL has 10000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 0.02 seconds
cpu time 0.02 seconds
75
76 data huge ;
77 set del ;
78 modify huge key = pk ;
79 remove ;
80 run ;
NOTE: There were 10000 observations read from the data set USER.DEL.
NOTE: The data set USER.HUGE has been updated. There were 0
observations rewritten, 0 observations added and 10000 observations
deleted.
NOTE: DATA statement used (Total process time):
real time 1.79 seconds
cpu time 1.55 seconds
I bet it should not take much longer than that to do the same with HUGE
20 times larger...
Kind regards,
-----------------
Paul M. Dorfman
Jacksonville, FL
-----------------
On Thu, 30 Jun 2005 14:04:07 -0700, Pardee, Roy <pardee.r@GHC.ORG>
wrote:
>I've got a 3.5GB dataset of procedure codes with a primary key
>constraint on the fields (VistNum, ProcCode). VistNum is numeric,
>ProcCode is a char(8).
>
>I've got a way smaller dset of (VistNum, ProcCode) values that I want
>to delete from my huge dataset.
>
>The one thing that comes to mind is:
>
> delete from huge
> where put(VistNum, best.) || ProcCode in (select put(VistNum, best.)
>|| ProcCode from tiny) ;
>
>But that will take forever b/c sas will have to do a full table scan.
>
>Now if my data were in say, mssql, I could say:
>
> delete huge
> from huge INNER JOIN tiny
> on huge.VistNum = tiny.VistNum AND
> huge.ProcCode = tiny.ProcCode ;
>
>And it would be fast b/c the join could make use of the pk index. But
>that's nonstandard sql & it appears that sas does not support it.
>
>Is there a more graceful & efficient way?
>
>Thanks!
>
>-Roy
>
>Roy Pardee
>Research Analyst/Programmer
>Center For Health Studies (Cancer Research Network)
>Group Health Cooperative
>(206) 287-2078
>
>This message and any attached files might contain confidential
>information protected by federal and state law. The information is
>intended only for the use of the individual(s) or entities originally
>named as addressees. The improper disclosure of such information may be
>subject to civil or criminal penalties. If this message reached you in
>error, please contact the sender and destroy this message. Disclosing,
>copying, forwarding, or distributing the information by unauthorized
>individuals or entities is strictly prohibited by law.