Date: Thu, 26 Jun 2008 15:20:23 -0400
Reply-To: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Subject: Re: POINT vs FIRSTOBS/OBS - are my conjectures reasonable?
In-Reply-To: <200806260835.m5PJBnEE007066@malibu.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"
Gerhard makes this observation and asks a question:
> I think it is clear, that A is faster. The blocked read makes sense. I
> think, that blockwise read is not possible with POINT=. It could be, if
> a
> optimizer detects that you read consecutive. Obviously it doesn't. I
> think
> for the first record of a block, it is all the same: SAS must calculate
> the byte-adress of the record. For all the following records in a block
> with the firstobs= option, that is not necessary. For POINT= it is.
> You have 3 open's in A instead of only one in B. Dataset open's take
> long
> time. That might eat a bit of the speed of A in comparison to B.
> Did you try it also without open=DEFER? That uses more memory, but
> might
> be a bit faster(?)
>
> Gerhard
Gehard, to answer your question, yes I did rerun the program
without OPEN=DEFER. After repeated tests, it appears that
dropping OPEN=DEFER actually adds to clock time (but only 1%)
and has no effect on cpu. It's not that surprising, since
there are just as many datasets opens, with or without
OPEN=DEFER. Also, with OPEN=DEFER, I'm allowing SAS to
make some simplifying assumptions, not the least of which
is that each incoming dataset must (after implementation
of KEEP= and DROP= parameters) provide exactly the same set
of variables to the data step.
Regards,
mark
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Gerhard Hellriegel
> Sent: Thursday, June 26, 2008 4:35 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: POINT vs FIRSTOBS/OBS - are my conjectures reasonable?
>
> On Wed, 25 Jun 2008 10:51:16 -0700, Nordlund, Dan (DSHS/RDA)
> <NordlDJ@DSHS.WA.GOV> wrote:
>
> >> -----Original Message-----
> >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> >> Behalf Of Keintz, H. Mark
> >> Sent: Wednesday, June 25, 2008 10:07 AM
> >> To: SAS-L@LISTSERV.UGA.EDU
> >> Subject: POINT vs FIRSTOBS/OBS - are my conjectures reasonable?
> >>
> >> {Best if read using a monopitch font}
> >>
> >> Dear colleagues in the Academy of SAS-L:
> >>
> >> Part of a paper I am preparing for NESUG will refer to the
> >> performance advantages of program A below (using FIRSTOBS/OBS
> >> to control a range of retrieved records over program B
> >> (using the POINT= feature). Both programs retrieve the
> >> same 30 million cases from a dataset of 100 million.
> >>
> >> A is usually faster for large datasets, on the order of
> >> 15% on my production platform (I have controlled for
> >> disk caching) and in the specific case below it approaches
> >> 50%, as per this table:
> >>
> >> Wall time CPU seconds
> >> Version A 11.6 9.4
> >> Version B 18.5 17.8
> >>
> >>
> >> I attribute most of the performance superiority to the
> >> reasons below. My questions are these:
> >>
> >> 1. Do you think my conjectures are correct?
> >> 2. Is there any logic that would suggest which
> >> is most significant?
> >>
> >> Conjectures on reasons for superiority of A over B:
> >>
> >> 1. Compile vs Execution setting of range limits.
> >> Version A sets record range limits at compile
> >> time, while version B sets and evaluates the
> >> limits at execution time. In effect, in
> >> version A SAS "knows" that a consecutive block
> >> of records will be read. Yes, I realize that
> >> using POINT= on consecutive records will take
> >> advantage of data set buffers, but SAS doesn't
> >> know that it will be retrieving consecutive
> >> records, so it can't build assign a pointer to
> >> the next part of the buffer automatically.
> >>
> >> 2. Overhead in using the POINT= feature, such as:
> >> - getting to the specified record
> >>
> >> - converting the floating point value of _N_
> >> in the loop iterator to an integer value
> >> that I presume must be used by the data
> >> engine for direct access. I evaluated this
> >> effect in Program C below, which converts
> >> the floating point _N_ to a positive integer,
> >> which I suspect is part of the overhead in
> >> version B. Program C takes about 3.3 seconds
> >> (out of the 17.8 taken by B).
> >>
> >>
> >> **version A - retrieve using FIRSTOBS/OBS **;
> >> data _null_;
> >> set tmp.test (firstobs=010000001 obs=020000000)
> >> tmp.test (firstobs=050000001 obs=060000000)
> >> tmp.test (firstobs=080000001 obs=090000000)
> >> open=defer;
> >> run;
> >>
> >>
> >> **version B - retrieve using POINT= **;
> >> data _null_;
> >> do _n_=010000001 to 020000000
> >> ,050000001 to 060000000
> >> ,080000001 to 090000000;
> >> set tmp.test point=_n_;
> >> end;
> >> stop;
> >> run;
> >>
> >>
> >> ** Program C, convert a floating point to integer**;
> >> data _null_;
> >> do _n_=010000001 to 020000000
> >> ,050000001 to 060000000
> >> ,080000001 to 090000000;
> >> integer=put(_n_,pib8.);
> >> end;
> >> stop;
> >> run;
> >>
> >Mark,
> >
> >I don't have any comment on the timing issue, but I do have a comment
> about comparing the use of point to the use of firstobs and obs,
> especially for newcomers to SAS but maybe even some oldtimers.
> >
> >Someone new to SAS might infer from your examples that
> >
> > set tmp.test (firstobs=010000001 obs=020000000);
> >
> >will cause SAS to start reading at physical record number 10000001 and
> stop at physical record number 20000000, and in your example it will.
> They might then decide that they want to get all the records in that
> range
> for which SOME_FIELD=SOME_VALUE and they write the following
> >
> >data want;
> > set tmp.test(firstobs=010000001 obs=020000000
> > where=(SOME_FIELD=SOME_VALUE));
> >run;
> >
> >They might be surprised to find that first record that SAS returns is
> the
> 10,000,001-th record on which SOME_FIELD=SOME_VALUE (if it exists) and
> continues to look for the next 10 million records where
> SOME_FIELD=SOME_VALUE.
> >
> >So this is a suggetion to mention in your presentation that firstobs
> and
> obs are not absolute pointer/record numbers, but are relative to any
> filtering done in a where clause.
> >
> >Hope this is helpful,
> >
> >Dan
> >
> >Daniel J. Nordlund
> >Washington State Department of Social and Health Services
> >Planning, Performance, and Accountability
> >Research and Data Analysis Division
> >Olympia, WA 98504-5204
>
>
> I think it is clear, that A is faster. The blocked read makes sense. I
> think, that blockwise read is not possible with POINT=. It could be, if
> a
> optimizer detects that you read consecutive. Obviously it doesn't. I
> think
> for the first record of a block, it is all the same: SAS must calculate
> the byte-adress of the record. For all the following records in a block
> with the firstobs= option, that is not necessary. For POINT= it is.
> You have 3 open's in A instead of only one in B. Dataset open's take
> long
> time. That might eat a bit of the speed of A in comparison to B.
> Did you try it also without open=DEFER? That uses more memory, but
> might
> be a bit faster(?)
>
> Gerhard
|