Date: Thu, 26 Jun 2008 04:35:07 -0400
Reply-To: Gerhard Hellriegel <gerhard.hellriegel@T-ONLINE.DE>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Gerhard Hellriegel <gerhard.hellriegel@T-ONLINE.DE>
Subject: Re: POINT vs FIRSTOBS/OBS - are my conjectures reasonable?
On Wed, 25 Jun 2008 10:51:16 -0700, Nordlund, Dan (DSHS/RDA)
<NordlDJ@DSHS.WA.GOV> wrote:
>> -----Original Message-----
>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
>> Behalf Of Keintz, H. Mark
>> Sent: Wednesday, June 25, 2008 10:07 AM
>> To: SAS-L@LISTSERV.UGA.EDU
>> Subject: POINT vs FIRSTOBS/OBS - are my conjectures reasonable?
>>
>> {Best if read using a monopitch font}
>>
>> Dear colleagues in the Academy of SAS-L:
>>
>> Part of a paper I am preparing for NESUG will refer to the
>> performance advantages of program A below (using FIRSTOBS/OBS
>> to control a range of retrieved records over program B
>> (using the POINT= feature). Both programs retrieve the
>> same 30 million cases from a dataset of 100 million.
>>
>> A is usually faster for large datasets, on the order of
>> 15% on my production platform (I have controlled for
>> disk caching) and in the specific case below it approaches
>> 50%, as per this table:
>>
>> Wall time CPU seconds
>> Version A 11.6 9.4
>> Version B 18.5 17.8
>>
>>
>> I attribute most of the performance superiority to the
>> reasons below. My questions are these:
>>
>> 1. Do you think my conjectures are correct?
>> 2. Is there any logic that would suggest which
>> is most significant?
>>
>> Conjectures on reasons for superiority of A over B:
>>
>> 1. Compile vs Execution setting of range limits.
>> Version A sets record range limits at compile
>> time, while version B sets and evaluates the
>> limits at execution time. In effect, in
>> version A SAS "knows" that a consecutive block
>> of records will be read. Yes, I realize that
>> using POINT= on consecutive records will take
>> advantage of data set buffers, but SAS doesn't
>> know that it will be retrieving consecutive
>> records, so it can't build assign a pointer to
>> the next part of the buffer automatically.
>>
>> 2. Overhead in using the POINT= feature, such as:
>> - getting to the specified record
>>
>> - converting the floating point value of _N_
>> in the loop iterator to an integer value
>> that I presume must be used by the data
>> engine for direct access. I evaluated this
>> effect in Program C below, which converts
>> the floating point _N_ to a positive integer,
>> which I suspect is part of the overhead in
>> version B. Program C takes about 3.3 seconds
>> (out of the 17.8 taken by B).
>>
>>
>> **version A - retrieve using FIRSTOBS/OBS **;
>> data _null_;
>> set tmp.test (firstobs=010000001 obs=020000000)
>> tmp.test (firstobs=050000001 obs=060000000)
>> tmp.test (firstobs=080000001 obs=090000000)
>> open=defer;
>> run;
>>
>>
>> **version B - retrieve using POINT= **;
>> data _null_;
>> do _n_=010000001 to 020000000
>> ,050000001 to 060000000
>> ,080000001 to 090000000;
>> set tmp.test point=_n_;
>> end;
>> stop;
>> run;
>>
>>
>> ** Program C, convert a floating point to integer**;
>> data _null_;
>> do _n_=010000001 to 020000000
>> ,050000001 to 060000000
>> ,080000001 to 090000000;
>> integer=put(_n_,pib8.);
>> end;
>> stop;
>> run;
>>
>Mark,
>
>I don't have any comment on the timing issue, but I do have a comment
about comparing the use of point to the use of firstobs and obs,
especially for newcomers to SAS but maybe even some oldtimers.
>
>Someone new to SAS might infer from your examples that
>
> set tmp.test (firstobs=010000001 obs=020000000);
>
>will cause SAS to start reading at physical record number 10000001 and
stop at physical record number 20000000, and in your example it will.
They might then decide that they want to get all the records in that range
for which SOME_FIELD=SOME_VALUE and they write the following
>
>data want;
> set tmp.test(firstobs=010000001 obs=020000000
> where=(SOME_FIELD=SOME_VALUE));
>run;
>
>They might be surprised to find that first record that SAS returns is the
10,000,001-th record on which SOME_FIELD=SOME_VALUE (if it exists) and
continues to look for the next 10 million records where
SOME_FIELD=SOME_VALUE.
>
>So this is a suggetion to mention in your presentation that firstobs and
obs are not absolute pointer/record numbers, but are relative to any
filtering done in a where clause.
>
>Hope this is helpful,
>
>Dan
>
>Daniel J. Nordlund
>Washington State Department of Social and Health Services
>Planning, Performance, and Accountability
>Research and Data Analysis Division
>Olympia, WA 98504-5204
I think it is clear, that A is faster. The blocked read makes sense. I
think, that blockwise read is not possible with POINT=. It could be, if a
optimizer detects that you read consecutive. Obviously it doesn't. I think
for the first record of a block, it is all the same: SAS must calculate
the byte-adress of the record. For all the following records in a block
with the firstobs= option, that is not necessary. For POINT= it is.
You have 3 open's in A instead of only one in B. Dataset open's take long
time. That might eat a bit of the speed of A in comparison to B.
Did you try it also without open=DEFER? That uses more memory, but might
be a bit faster(?)
Gerhard
|