LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (July 2000, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Mon, 31 Jul 2000 08:54:37 -0700
Reply-To:   "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Terjeson, Mark" <TERJEMW@DSHS.WA.GOV>
Subject:   Re: index question in V8
Comments:   To: "Lund, Pete" <Peter.Lund@cfc.wa.gov>
Content-Type:   text/plain; charset="iso-8859-1"

RE: index question in V8 -- further ponderings...

Having the desired keys in sort order does indeed provide a faster route for SAS to take. Obviously, stepping the record pointer through a file is going to be more efficient than the less than half dozen tests through the leaf nodes of what-ever-method- -or-variation-of-Btree-indexing-implementation that SAS has implemented as the leaves are transversed to find the first of the desired key and then only an indirect addressing increment to all of the record pointers for the remaining occurences of that key. So there really shouldn't be too much overhead difference between the two for just a simplistic review of one key value only. Some ISAMs are written this way, and whether SAS's implementation is similar-or-not is not my question-or-comment at this time.

The real question and puzzlement lies on a little different tack.

If we narrow down the focus to exclude sequential stepping through records and just consider use of indexing. In addition, if we were to consider only balanced leaf-nodes. If we also considered the following simple scenario. Efficiency is not really the question at this time if we take all portions involved here to the most optimum. (i.e. the btree-leafnode layouts, number of lookups, number of fetches, everything, etc. all even)

The data-content-value should be irrelevant, so we will call two key values KEYVALUE1 and KEYVALUE2.

If we have a dataset with 100,000 records in it, let us say that there are five occurences of KEYVALUE1 evenly distributed throughout the file at records 1,25001,50001,75001,99991 and there are five occurences of KEYVALUE2 evenly distributed throughout the file at records 2,25002,50002,75002,99992.

To give a name to the rest-of-the-world, we'll call the remaining 999,990 keys KEYVALUE3.

Without sort key or indexing, to find the five records would require reading through all 100,000 records. IF, one of our basic isam leaf node contains the key value and the list of record numbers (for the upcoming isolated question, please also disregard what type of list or linked-list may have been implemented in the node) KEYVALUE1 - 1,25001,50001,75001,99991

With our example there are only three key values. And if all three were in the first node there would only be one node increment to get to it and up to three value tests and then the beginning of the list of record numbers should be available. Even if the three values are spread into 3 different nodes, and let's consider that regardless if they are at the same branch level or not, and disregard if there is one to three more tests to locate each one. The concept here, yes, shows that one read to the index file, and then indexing with 1 to 9 memory tests to lookup and then fetching the 5 memory record pointers and the physical reading of only 5 records is a gross example of why indexing will be faster than physically reading through all 100,000 records.

The real question is that if five records for KEYVALUE1 are evenly distributed and basically in the same positions throughout the file (and the index file) as the five records for KEYVALUE2, how would the SAS index usage choose one via index and the other not? If the logic routine is the same for both and the criteria for both is virtually the same, then why or how would it choose differently?

With the small benchmarking times in the testing shown below we know that there is variance. (i.e. the machine times (real/cpu) differ 0%-35% between passes where indexing was selected by the machine and indexing was forced by the user) But, consider the general magnitude between how the software choice to choose indexing faired vs. forced. The percentages are just a general magnitude to give the worse/better evaluation some flavor.

The SAS software decision to choose *not* to use indexing for the below originally submitted examples scored WORSE 11 times vs. BETTER 2 times for the comparisons of REAL (clock) time. The average magnitude difference between 12% and 14% is nothing to consider since the variance is already bigger than that for multiple runs. We can also assume that users won't really care which way it went if the clock time is about the same. (still, 11 to 2 makes you ponder) also note that a few percentages of 3, 7, 9 brought the average down to 12%, many of the key values tested were worse by 17,21, 24 percent. For those key values, the user would be disappointed that indexing wasn't used. So the question remains.

Even if we discount the slop in our workstations task swapping and the REAL (clock) times, the CPU times further support the question.

CPU times scored WORSE 9 times vs. BETTER 3 times with magnitudes much higher, some 43,50,55,57,60,76 percent worse. The average for worse was 41% and 22% for better.

I also realize that variance of rerunning these timing tests can actually fluctuate comparisons back and forth, but out of the 9 or 11 WORSE choices it usually swapped only 1 or 2, still leaving upwards of half the choices in a 26 value set being a significant pondering!

To avoid attachments and email reformattings and wrapping, the result information has had the spaces changed to underscores. For easier viewing, format the section with a monospace font and replace the underscores back with spaces. Some folks may have to adjust any wrapping back.

The original examples below used 300,000 records. The discussion above used 100,000 for trying to simplify the discussion points.

REAL TIMES:

LTR_NOFORCE_FORCE__GT-or-LT___INDEX-SOFTWARE-SELECTED___SOFTWARE-DECISION___ WORSE-BY___BETTER-BY___---BY ____________________________________________________________________________ ____________________________ _A___1.62___1.79___force_GT_____________not__________________better_________ ______________10%___________ _B___2.09___1.96___FORCE_LT__________selected_________________---___________ _________________________6%_ _C___1.60___1.31___FORCE_LT__________selected_________________---___________ ________________________18%_ _D___2.10___1.59___FORCE_LT__________selected_________________---___________ ________________________24%_ _E___1.87___2.18___force_GT_____________not__________________better_________ ______________17%___________ _F___2.04___2.01___FORCE_LT__________selected_________________---___________ _________________________1%_ _G___1.35___1.17___FORCE_LT__________selected_________________---___________ ________________________13%_ _H___1.14___0.87___FORCE_LT_____________not__________________worse__________ ___24%______________________ _I___0.96___1.03___force_GT__________selected________________worse__________ ____7%______________________ _J___1.04___0.92___FORCE_LT_____________not__________________worse__________ ___12%______________________ _K___1.26___1.14___FORCE_LT__________selected________________worse__________ ___10%______________________ _L___1.26___1.14___FORCE_LT_____________not__________________worse__________ ___10%______________________ _M___1.15___1.23___force_GT__________selected________________worse__________ ____7%______________________ _N___1.34___1.03___FORCE_LT__________selected_________________---___________ ________________________23%_ _O___1.15___1.12___FORCE_LT_____________not__________________worse__________ ____3%______________________ _P___1.15___1.11___FORCE_LT__________selected_________________---___________ _________________________3%_ _Q___1.07___0.93___FORCE_LT_____________not__________________worse__________ ___13%______________________ _R___0.96___0.96___FORCE_LT__________selected_________________---___________ _________________________0%_ _S___0.92___1.03___force_GT__________selected________________worse__________ ___12%______________________ _T___0.92___0.73___FORCE_LT_____________not__________________worse__________ ___21%______________________ _U___1.31___1.15___FORCE_LT__________selected_________________---___________ ________________________12%_ _V___1.29___1.07___FORCE_LT_____________not__________________worse__________ ___17%______________________ _W___1.15___1.04___FORCE_LT__________selected_________________---___________ ________________________10%_ _X___1.17___1.15___FORCE_LT__________selected_________________---___________ _________________________2%_ _Y___1.32___1.21___FORCE_LT__________selected_________________---___________ _________________________8%_ _Z___0.03___0.03___FORCE_LT__________selected_________________---___________ _________________________0%_ _____REAL_times_______________________________________worse=11,__better=2___ __134_________27_______122__ ____________________________________________________________________________ avg12%_____avg14%___________

CPU TIMES:

LTR_NOFORCE_FORCE__GT-or-LT___INDEX-SOFTWARE-SELECTED___SOFTWARE-DECISION___ WORSE-BY___BETTER-BY___---BY ____________________________________________________________________________ ____________________________ _A____0.51__0.40___FORCE_LT_____________not__________________worse__________ ___22%______________________ _B____0.42__0.37___FORCE_LT__________selected__________________---__________ ________________________12%_ _C____0.53__0.45___FORCE_LT__________selected__________________---__________ ________________________15%_ _D____0.45__0.37___FORCE_LT__________selected__________________---__________ ________________________18%_ _E____0.45__0.43___FORCE_LT_____________not__________________worse__________ ____4%______________________ _F____0.34__0.32___FORCE_LT__________selected__________________---__________ _________________________6%_ _G____0.29__0.24___FORCE_LT__________selected__________________---__________ ___17%______________________ _H____0.35__0.20___FORCE_LT_____________not__________________worse__________ ___43%______________________ _I____0.18__0.21___force_GT__________selected________________better_________ ______________17%___________ _J____0.35__0.26___FORCE_LT_____________not__________________worse__________ ___26%______________________ _K____0.29__0.26___FORCE_LT__________selected__________________---__________ ________________________10%_ _L____0.48__0.24___FORCE_LT_____________not__________________worse__________ ___50%______________________ _M____0.23__0.32___force_GT__________selected________________better_________ ______________39%___________ _N____0.26__0.18___FORCE_LT__________selected__________________---__________ ________________________31%_ _O____0.46__0.20___FORCE_LT_____________not__________________worse__________ ___57%______________________ _P____0.32__0.21___FORCE_LT__________selected__________________---__________ ________________________34%_ _Q____0.43__0.17___FORCE_LT_____________not__________________worse__________ ___60%______________________ _R____0.26__0.24___FORCE_LT__________selected__________________---__________ _________________________8%_ _S____0.17__0.12___FORCE_LT__________selected__________________---__________ ________________________29%_ _T____0.40__0.18___FORCE_LT_____________not__________________worse__________ ___55%______________________ _U____0.18__0.20___force_GT__________selected________________better_________ ______________11%___________ _V____0.38__0.09___FORCE_LT_____________not__________________worse__________ ___76%______________________ _W____0.24__0.21___FORCE_LT__________selected__________________---__________ ________________________13%_ _X____0.23__0.21___FORCE_LT__________selected__________________---__________ _________________________9%_ _Y____0.21__0.18___FORCE_LT__________selected__________________---__________ ________________________14%_ _Z____0.01__0.01___FORCE_LT__________selected__________________---__________ _________________________0%_ ______CPU_times_________________________________________worse=9,__better=3__ __410_________67_______199__ ____________________________________________________________________________ avg41%_____avg22%___________

* make sample data ; data table1(index=(ltr)); do i = 1 to 300000; x = uniform(0); m = mod(x*100,26); if m le 13 then m = int(m/2); else m = int(m*2); mm = mod(m,26); if 1 le mm le 26 then ltr = substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",mm,1); output; end; run;

* look at ltr % distribution ; proc freq data=table1; table ltr / list missing; run;

options msglevel=i;

%macro testind; %let letters = ABCDEFGHIJKLMNOPQRSTUVWXYZ; %do i = 1 %to 26; options notes; %put; %put LETTER: %substr(&letters,&i,1); data table2; set table1; where ltr eq "%substr(&letters,&i,1)"; run;

options notes;

data table3; set table1 (idxwhere=yes); where ltr eq "%substr(&letters,&i,1)"; run;

%end; %mend; %testind;

Still curious, Mark Terjeson Washington State Department of Social and Health Services Division of Research and Data Analysis (RDA) (360) 902-0741 (360) 902-0705 fax mailto:terjemw@dshs.wa.gov

> >From: "Lund, Pete" <Peter.Lund@CFC.WA.GOV> > >Reply-To: "Lund, Pete" <Peter.Lund@CFC.WA.GOV> > >To: SAS-L@LISTSERV.UGA.EDU > >Subject: index question in V8 > >Date: Thu, 27 Jul 2000 15:24:51 -0700 > > > >I've been creating some indexed datasets today and noticed that there > >didn't > >seem to be any performance improvement in resulting steps. Turning on > >MSGLEVEL=I I was able to see that there were a number of cases in which > it > >seemed logical that an index should be used and it wasn't. The following > >two data steps returned within 2 of the same number of records - one used > >the index, the other did not. (Note: this is V8 only, in 6.12 the index > is > >used both times). > > > >data test1; > > set auths200004; > > where service eq '4510'; *<---- index used - returned about 3,800 > out > >of 280,000 records; > >run; > > > >data test2; > > set auths200004; > > where service eq '4501'; *<---- index not used - also returned > about > >3,800 out of 280,000 records; > >run; > > > >Mark Terjeson and I have been playing around with this and would welcome > >any > >insight. The following little program creates a dataset with 300,000 > >observations and a simple index on one of the variables. There is then a > >macro that will create a dataset subsetting on each value of the index > >variable. A note is written to the log to show whether or not the index > >was > >used. There is a pattern to when the index is used, but seems to be > >unrelated to the number of records in the index group. > > > >IMPORTANT NOTE: in V6 the index is used every time. > > > >Here's the code - if you have time, run it and see if you can shed any > >light > >on what's happening: > > > > > > * make sample data ; > >data table1(index=(ltr)); > > do i = 1 to 300000; > > x = uniform(0); > > m = mod(x*100,26); > > if m le 13 then m = int(m/2); > > else m = int(m*2); > > mm = mod(m,26); > > if 1 le mm le 26 then ltr = > >substr("ABCDEFGHIJKLMNOPQRSTUVWXYZ",mm,1); > > output; > > end; > >run; > > > > * look at ltr % distribution ; > >proc freq data=table1; > > table ltr / list missing; > >run; > > > >options msglevel=i; > > > >%macro testind; > > %let letters = ABCDEFGHIJKLMNOPQRSTUVWXYZ; > > %do i = 1 %to 26; > > options nonotes; > > %put; > > %put LETTER: %substr(&letters,&i,1); > > data table2; > > set table1; > > where ltr eq "%substr(&letters,&i,1)"; > > run; > > > > options notes; > > %end; > >%mend; > >%testind; > > > >---------------------------------------------------------------------- > >Pete Lund > >WA State Caseload Forecast Council > >515 15th Ave SE > >Olympia, WA 98504-0962 > >(360) 902-0086 voice > >(360) 902-0084 fax > >(360) 971-0962 pager > >peter.lund@cfc.wa.gov > >---------------------------------------------------------------------- > > ________________________________________________________________________ > Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com


Back to: Top of message | Previous page | Main SAS-L page