Date: Wed, 31 Jan 2007 21:56:53 -0500
Reply-To: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject: Re: Number of Observations Read By MODIFY Statement
On Wed, 31 Jan 2007 02:20:18 -0500, Peter Crawford
<peter.crawford@BLUEYONDER.CO.UK> wrote:
>in the final points made, Pudding Man says
>>Likely goes without saying: for this type of app., I 'spect UPDATE
>>or MERGE would be much better choices (disk space permitting).
>>
>> Prost,
>
>well, Prost to you too, Pudding but
>if you need modify it's probably because you need update-in-place
>(with the audit trails, perhaps)
>So sometimes it's neccessary to be "slower"
Moreover, the performance of MODIFY with BY can be very *good* if an
appropriate index on the master data set is in place.
>
>regards
>Peter Crawford
>
>
>On Tue, 30 Jan 2007 18:52:44 -0600, Pudding Man <pudding.man@GMAIL.COM>
>wrote:
>
>>On 1/30/07, Howard Schreier <hs AT dc-sug DOT org> <nospam@howles.com>
>wrote:
>>> On Sun, 28 Jan 2007 23:46:36 GMT, Peetie Wheatstraw
>>> <peetie.wheatstraw@GMAIL.COM> wrote:
>>
>> ...
>>
>>> >On Sun, 28 Jan 2007 08:55:00 -0500, nospam@HOWLES.COM ("Howard
>Schreier <hs
>>> AT dc-sug DOT
>>> >org>") wrote:
>>> >
>>> >>Start with a table of 10 million rows
>>> >>
>>> >> data big;
>>> >> do key = 1e7 to 1 by -1;
>>> >> output;
>>> >> end;
>>> >> run;
>>> >>
>>> >>and a table of 3 rows
>>> >>
>>> >> data little;
>>> >> do key = 616, 61616, 6161616;
>>> >> output;
>>> >> end;
>>> >> run;
>>> >>
>>> >>Now make a DATA step delete the 3 matching rows from the big table
>>> >>
>>> >> data big;
>>> >> modify big little;
>>> >> by key;
>>> >> remove;
>>> >> run;
>>> >>
>>> >>The log reports
>>> >>
>>> >>NOTE: There were 1 observations read from the data set WORK.BIG.
>>> >>
>>> >>NOTE: The data set WORK.BIG has been updated. There were 0
>observations
>>> >>rewritten, 0 observations added and 3 observations deleted.
>>> >>
>>> >>NOTE: There were 3 observations read from the data set WORK.LITTLE.
>>> >>
>>> >>NOTE: DATA statement used (Total process time):
>>> >> real time 31.26 seconds
>>> >> cpu time 31.12 seconds
>>> >>
>>> >>Obviously it had to read a lot of observations from BIG. So why does
>it say
>>> >>it read only one?
>>>
>>> Yes. The form of MODIFY which accepts two data sets and a BY statement
>is
>>> extremely inefficient in the absence of a suitable index on the master
>data set.
>>
>>So it would appear. This is "Matching Access" per the doc.
>>
>>> I ran this variation on my test:
>>>
>>> data big;
>>> do key = 1e6 to 1 by -1;
>>> output;
>>> end;
>>> run;
>>>
>>> data little;
>>> do key = 1 to 10;
>>> output;
>>> end;
>>> run;
>>>
>>> data big;
>>> modify big little;
>>> by key;
>>> remove;
>>> run;
>>>
>>> My understanding is that for each observation in LITTLE, SAS had to
>read BIG
>>> sequentially, from the beginning, until it finds the match.
>>>
>>>So, it read
>>>
>>> 1M + 999,999 + 999,998 . . . + 999,991
>>>
>>> or almost 10 million observations. The step took 13.5 seconds. Then I
>>> changed the 10 to 20 and ran the whole thing again. So it had to read
>>>
>>> 1M + 999,999 + 999,998 . . . + 999,981
>>>
>>> or almost 20 million. Indeed, the time went up to 27 seconds.
>>>
>>> Of course in both cases it reported
>>>
>>> NOTE: There were 1 observations read from the data set WORK.BIG.
>>
>>My conjecture is that it:
>>
>>a.) Completely buffers BIG (memory allowing)
>>b.) Reads a key from LITTLE
>>c.) Searches the buffer until the key if found
>>d.) Go to a.) until LITTLE is fully read
>>
>>"Task Mngr." wouldn't confirm full buffering, but the timings
>>were the same with/without SASFILE BIG.
>>
>>If LITTLE stores the *first* rather than the *last* few recs, it
>whizzbangs
>>(W2k/sas 913):
>>
>>NOTE: The data set WORK.BIG1 has 1000000 observations and 1 variables.
>> ...
>>223 data little;
>>224
>>225 do key = 999991 to 1000000;
>>226 output;
>>227 end;
>>228 run;
>>
>>NOTE: The data set WORK.LITTLE has 10 observations and 1 variables.
>>NOTE: DATA statement used (Total process time):
>> real time 0.00 seconds
>> user cpu time 0.00 seconds
>> system cpu time 0.01 seconds
>> Memory 144k
>>
>>
>>229
>>230
>>231 data big1;
>>232 modify big1 little end = end1;
>>233 by key;
>>234 _ct + 1;
>>235 if end1 then put _ct=;
>>236 remove;
>>237 run;
>>
>>_ct=10
>>NOTE: There were 1 observations read from the data set WORK.BIG1.
>>NOTE: The data set WORK.BIG1 has been updated. There were 0
>>observations rewritten, 0
>> observations added and 10 observations deleted.
>>NOTE: There were 10 observations read from the data set WORK.LITTLE.
>>NOTE: DATA statement used (Total process time):
>> real time 0.00 seconds
>> user cpu time 0.01 seconds
>> system cpu time 0.00 seconds
>> Memory 276k
>>
>>> It seems to me that SAS shjould do one of two things: (1) maintain an
>actual
>>> count and report that (I can't see why that's difficult)
>>
>>Concur.
>>
>>or (2) suppress the
>>> note altogether.
>>
>>Wouldn't be compatible with SET, MERGE, UPDATE and
>>MODIFY("Sequential Access"). One can't even discern the final obs
>>count from the step note.
>>
>>The doc for "Matching Access" could also use a good overhaul.
>>
>>Compared to a sequential MERGE, Howards MODIFY code took about 5
>>times more cpu cycles.
>>
>>Likely goes without saying: for this type of app., I 'spect UPDATE
>>or MERGE would be much better choices (disk space permitting).
>>
>> Prost,
>> Puddin'
>>
>>*****************************************************************
>>*** Puddin' Man PuddingDotMan at GmailDotCom **
>>*****************************************************************;
>>
>>"Law without Equity is no law at all: 'tis a form of jungle rule,
>> and is always supported primarily by force of arms."
|