| Date: | Mon, 31 Jul 2000 08:54:37 -0700 |
| Reply-To: | "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV> |
| Subject: | Re: index question in V8 |
|
| Content-Type: | text/plain; charset="iso-8859-1" |
|---|
RE: index question in V8 -- further ponderings...
Having the desired keys in sort order does indeed provide a
faster route for SAS to take. Obviously, stepping the record
pointer through a file is going to be more efficient than the less
than half dozen tests through the leaf nodes of what-ever-method-
-or-variation-of-Btree-indexing-implementation that SAS has
implemented as the leaves are transversed to find the first of the
desired key and then only an indirect addressing increment to all
of the record pointers for the remaining occurences of that key.
So there really shouldn't be too much overhead difference between
the two for just a simplistic review of one key value only. Some
ISAMs are written this way, and whether SAS's implementation
is similar-or-not is not my question-or-comment at this time.
The real question and puzzlement lies on a little different tack.
If we narrow down the focus to exclude sequential stepping through
records and just consider use of indexing. In addition, if we were
to consider only balanced leaf-nodes. If we also considered the
following simple scenario. Efficiency is not really the question
at this time if we take all portions involved here to the most
optimum. (i.e. the btree-leafnode layouts, number of lookups,
number of fetches, everything, etc. all even)
The data-content-value should be irrelevant, so we will call two
key values KEYVALUE1 and KEYVALUE2.
If we have a dataset with 100,000 records in it, let us say that
there are five occurences of KEYVALUE1 evenly distributed
throughout the file at records 1,25001,50001,75001,99991 and
there are five occurences of KEYVALUE2 evenly distributed
throughout the file at records 2,25002,50002,75002,99992.
To give a name to the rest-of-the-world, we'll call the remaining
999,990 keys KEYVALUE3.
Without sort key or indexing, to find the five records would
require reading through all 100,000 records. IF, one of our basic
isam leaf node contains the key value and the list of record numbers
(for the upcoming isolated question, please also disregard what type
of list or linked-list may have been implemented in the node)
KEYVALUE1 - 1,25001,50001,75001,99991
With our example there are only three key values. And if all three
were in the first node there would only be one node increment to get
to it and up to three value tests and then the beginning of the list
of record numbers should be available. Even if the three values are
spread into 3 different nodes, and let's consider that regardless if
they are at the same branch level or not, and disregard if there is
one to three more tests to locate each one. The concept here, yes,
shows that one read to the index file, and then indexing with 1 to 9
memory tests to lookup and then fetching the 5 memory record pointers
and the physical reading of only 5 records is a gross example of why
indexing will be faster than physically reading through all 100,000
records.
The real question is that if five records for KEYVALUE1 are evenly
distributed and basically in the same positions throughout the file
(and the index file) as the five records for KEYVALUE2, how would
the SAS index usage choose one via index and the other not? If
the logic routine is the same for both and the criteria for both is
virtually the same, then why or how would it choose differently?
With the small benchmarking times in the testing shown below we know
that there is variance. (i.e. the machine times (real/cpu) differ
0%-35% between passes where indexing was selected by the machine and
indexing was forced by the user) But, consider the general magnitude
between how the software choice to choose indexing faired vs. forced.
The percentages are just a general magnitude to give the worse/better
evaluation some flavor.
The SAS software decision to choose *not* to use indexing for the below
originally submitted examples scored WORSE 11 times vs. BETTER 2 times
for the comparisons of REAL (clock) time. The average magnitude
difference between 12% and 14% is nothing to consider since the variance
is already bigger than that for multiple runs. We can also assume that
users won't really care which way it went if the clock time is about the
same. (still, 11 to 2 makes you ponder) also note that a few percentages
of 3, 7, 9 brought the average down to 12%, many of the key values
tested were worse by 17,21, 24 percent. For those key values, the user
would be disappointed that indexing wasn't used. So the question remains.
Even if we discount the slop in our workstations task swapping and the
REAL (clock) times, the CPU times further support the question.
CPU times scored WORSE 9 times vs. BETTER 3 times with magnitudes much
higher, some 43,50,55,57,60,76 percent worse. The average for worse was
41% and 22% for better.
I also realize that variance of rerunning these timing tests can actually
fluctuate comparisons back and forth, but out of the 9 or 11 WORSE choices
it usually swapped only 1 or 2, still leaving upwards of half the choices
in a 26 value set being a significant pondering!
To avoid attachments and email reformattings and wrapping, the result
information has had the spaces changed to underscores. For easier viewing,
format the section with a monospace font and replace the underscores back
with spaces. Some folks may have to adjust any wrapping back.
The original examples below used 300,000 records. The discussion above
used 100,000 for trying to simplify the discussion points.
REAL TIMES:
LTR_NOFORCE_FORCE__GT-or-LT___INDEX-SOFTWARE-SELECTED___SOFTWARE-DECISION___
WORSE-BY___BETTER-BY___---BY
____________________________________________________________________________
____________________________
_A___1.62___1.79___force_GT_____________not__________________better_________
______________10%___________
_B___2.09___1.96___FORCE_LT__________selected_________________---___________
_________________________6%_
_C___1.60___1.31___FORCE_LT__________selected_________________---___________
________________________18%_
_D___2.10___1.59___FORCE_LT__________selected_________________---___________
________________________24%_
_E___1.87___2.18___force_GT_____________not__________________better_________
______________17%___________
_F___2.04___2.01___FORCE_LT__________selected_________________---___________
_________________________1%_
_G___1.35___1.17___FORCE_LT__________selected_________________---___________
________________________13%_
_H___1.14___0.87___FORCE_LT_____________not__________________worse__________
___24%______________________
_I___0.96___1.03___force_GT__________selected________________worse__________
____7%______________________
_J___1.04___0.92___FORCE_LT_____________not__________________worse__________
___12%______________________
_K___1.26___1.14___FORCE_LT__________selected________________worse__________
___10%______________________
_L___1.26___1.14___FORCE_LT_____________not__________________worse__________
___10%______________________
_M___1.15___1.23___force_GT__________selected________________worse__________
____7%______________________
_N___1.34___1.03___FORCE_LT__________selected_________________---___________
________________________23%_
_O___1.15___1.12___FORCE_LT_____________not__________________worse__________
____3%______________________
_P___1.15___1.11___FORCE_LT__________selected_________________---___________
_________________________3%_
_Q___1.07___0.93___FORCE_LT_____________not__________________worse__________
___13%______________________
_R___0.96___0.96___FORCE_LT__________selected_________________---___________
_________________________0%_
_S___0.92___1.03___force_GT__________selected________________worse__________
___12%______________________
_T___0.92___0.73___FORCE_LT_____________not__________________worse__________
___21%______________________
_U___1.31___1.15___FORCE_LT__________selected_________________---___________
________________________12%_
_V___1.29___1.07___FORCE_LT_____________not__________________worse__________
___17%______________________
_W___1.15___1.04___FORCE_LT__________selected_________________---___________
________________________10%_
_X___1.17___1.15___FORCE_LT__________selected_________________---___________
_________________________2%_
_Y___1.32___1.21___FORCE_LT__________selected_________________---___________
_________________________8%_
_Z___0.03___0.03___FORCE_LT__________selected_________________---___________
_________________________0%_
_____REAL_times_______________________________________worse=11,__better=2___
__134_________27_______122__
____________________________________________________________________________
avg12%_____avg14%___________
CPU TIMES:
LTR_NOFORCE_FORCE__GT-or-LT___INDEX-SOFTWARE-SELECTED___SOFTWARE-DECISION___
WORSE-BY___BETTER-BY___---BY
____________________________________________________________________________
____________________________
_A____0.51__0.40___FORCE_LT_____________not__________________worse__________
___22%______________________
_B____0.42__0.37___FORCE_LT__________selected__________________---__________
________________________12%_
_C____0.53__0.45___FORCE_LT__________selected__________________---__________
________________________15%_
_D____0.45__0.37___FORCE_LT__________selected__________________---__________
________________________18%_
_E____0.45__0.43___FORCE_LT_____________not__________________worse__________
____4%______________________
_F____0.34__0.32___FORCE_LT__________selected__________________---__________
_________________________6%_
_G____0.29__0.24___FORCE_LT__________selected__________________---__________
___17%______________________
_H____0.35__0.20___FORCE_LT_____________not__________________worse__________
___43%______________________
_I____0.18__0.21___force_GT__________selected________________better_________
______________17%___________
_J____0.35__0.26___FORCE_LT_____________not__________________worse__________
___26%______________________
_K____0.29__0.26___FORCE_LT__________selected__________________---__________
________________________10%_
_L____0.48__0.24___FORCE_LT_____________not__________________worse__________
___50%______________________
_M____0.23__0.32___force_GT__________selected________________better_________
______________39%___________
_N____0.26__0.18___FORCE_LT__________selected__________________---__________
________________________31%_
_O____0.46__0.20___FORCE_LT_____________not__________________worse__________
___57%______________________
_P____0.32__0.21___FORCE_LT__________selected__________________---__________
________________________34%_
_Q____0.43__0.17___FORCE_LT_____________not__________________worse__________
___60%______________________
_R____0.26__0.24___FORCE_LT__________selected__________________---__________
_________________________8%_
_S____0.17__0.12___FORCE_LT__________selected__________________---__________
________________________29%_
_T____0.40__0.18___FORCE_LT_____________not__________________worse__________
___55%______________________
_U____0.18__0.20___force_GT__________selected________________better_________
______________11%___________
_V____0.38__0.09___FORCE_LT_____________not__________________worse__________
___76%______________________
_W____0.24__0.21___FORCE_LT__________selected__________________---__________
________________________13%_
_X____0.23__0.21___FORCE_LT__________selected__________________---__________
_________________________9%_
_Y____0.21__0.18___FORCE_LT__________selected__________________---__________
________________________14%_
_Z____0.01__0.01___FORCE_LT__________selected__________________---__________
_________________________0%_
______CPU_times_________________________________________worse=9,__better=3__
__410_________67_______199__
____________________________________________________________________________
avg41%_____avg22%___________
* make sample data ;
data table1(index=(ltr));
do i = 1 to 300000;
x = uniform(0);
m = mod(x*100,26);
if m le 13 then m = int(m/2);
else m = int(m*2);
mm = mod(m,26);
if 1 le mm le 26 then ltr =
substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",mm,1);
output;
end;
run;
* look at ltr % distribution ;
proc freq data=table1;
table ltr / list missing;
run;
options msglevel=i;
%macro testind;
%let letters = ABCDEFGHIJKLMNOPQRSTUVWXYZ;
%do i = 1 %to 26;
options notes;
%put;
%put LETTER: %substr(&letters,&i,1);
data table2;
set table1;
where ltr eq "%substr(&letters,&i,1)";
run;
options notes;
data table3;
set table1 (idxwhere=yes);
where ltr eq "%substr(&letters,&i,1)";
run;
%end;
%mend;
%testind;
Still curious,
Mark Terjeson
Washington State Department of Social and Health Services
Division of Research and Data Analysis (RDA)
(360) 902-0741
(360) 902-0705 fax
mailto:terjemw@dshs.wa.gov
> >From: "Lund, Pete" <Peter.Lund@CFC.WA.GOV>
> >Reply-To: "Lund, Pete" <Peter.Lund@CFC.WA.GOV>
> >To: SAS-L@LISTSERV.UGA.EDU
> >Subject: index question in V8
> >Date: Thu, 27 Jul 2000 15:24:51 -0700
> >
> >I've been creating some indexed datasets today and noticed that there
> >didn't
> >seem to be any performance improvement in resulting steps. Turning on
> >MSGLEVEL=I I was able to see that there were a number of cases in which
> it
> >seemed logical that an index should be used and it wasn't. The following
> >two data steps returned within 2 of the same number of records - one used
> >the index, the other did not. (Note: this is V8 only, in 6.12 the index
> is
> >used both times).
> >
> >data test1;
> > set auths200004;
> > where service eq '4510'; *<---- index used - returned about 3,800
> out
> >of 280,000 records;
> >run;
> >
> >data test2;
> > set auths200004;
> > where service eq '4501'; *<---- index not used - also returned
> about
> >3,800 out of 280,000 records;
> >run;
> >
> >Mark Terjeson and I have been playing around with this and would welcome
> >any
> >insight. The following little program creates a dataset with 300,000
> >observations and a simple index on one of the variables. There is then a
> >macro that will create a dataset subsetting on each value of the index
> >variable. A note is written to the log to show whether or not the index
> >was
> >used. There is a pattern to when the index is used, but seems to be
> >unrelated to the number of records in the index group.
> >
> >IMPORTANT NOTE: in V6 the index is used every time.
> >
> >Here's the code - if you have time, run it and see if you can shed any
> >light
> >on what's happening:
> >
> >
> > * make sample data ;
> >data table1(index=(ltr));
> > do i = 1 to 300000;
> > x = uniform(0);
> > m = mod(x*100,26);
> > if m le 13 then m = int(m/2);
> > else m = int(m*2);
> > mm = mod(m,26);
> > if 1 le mm le 26 then ltr =
> >substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",mm,1);
> > output;
> > end;
> >run;
> >
> > * look at ltr % distribution ;
> >proc freq data=table1;
> > table ltr / list missing;
> >run;
> >
> >options msglevel=i;
> >
> >%macro testind;
> > %let letters = ABCDEFGHIJKLMNOPQRSTUVWXYZ;
> > %do i = 1 %to 26;
> > options nonotes;
> > %put;
> > %put LETTER: %substr(&letters,&i,1);
> > data table2;
> > set table1;
> > where ltr eq "%substr(&letters,&i,1)";
> > run;
> >
> > options notes;
> > %end;
> >%mend;
> >%testind;
> >
> >----------------------------------------------------------------------
> >Pete Lund
> >WA State Caseload Forecast Council
> >515 15th Ave SE
> >Olympia, WA 98504-0962
> >(360) 902-0086 voice
> >(360) 902-0084 fax
> >(360) 971-0962 pager
> >peter.lund@cfc.wa.gov
> >----------------------------------------------------------------------
>
> ________________________________________________________________________
> Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com
|