LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 2000, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 8 Dec 2000 20:15:16 -0000
Reply-To:     sashole@bellsouth.net
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Paul Dorfman <paul_dorfman@HOTMAIL.COM>
Subject:      Re: Afternoon Performance
Comments: To: HERMANS1@WESTAT.COM
Content-Type: text/plain; format=flowed

Sig,

It reminds me of a similar problem someone on the list has had in the past circa 1999. I think the thread -- where you actually led the effort of collecting performance statistics on subquery vs join vs DATA step and analyzing them -- had something like 'Elegant SQL...' in its title.

This time, Rob Workman offered a key-indexing solution based on the same idea I used then. I do not think that finding a solution faster than the one offered by Rob is possible. Back in 1999 you came up with a very lucid explanation why: Because the key type matches the method perfectly. SAS date is by its very nature a *limited range integer*, and as everyone knows, the fastest algorithm for looking such things up is the key-indexed search. So, whenever a problem of searching/matching/joining, etc. by SAS date or time (but not datetime!) arises, key-indexing deserves the first and close look.

In this particular case, using the key-indexed search is all the more appropriate that the relative order of records must be maintained, and thus it appears natural to scan the records serially qualifying each record for output based on the presence of its key in the key-indexed table prepared beforehand during a preliminary pass.

To test, I prepared a test file with mere 1 million input records:

data a; do _n_=1 to 1e+6; date = ceil(rannor(1)*1e+4); time = ceil(ranuni(1)*86400); output; end; run;

Then I tested the following in batch under OS/390:

*key-indexing*; data noon; array x (-100000:+100000) _temporary_; do until (e1); set a (where=(time < 43200)) end=e1; x (date) = 1; end; do until (e2); set a end=e2; if not x (date) then output; end; run; NOTE: The data set WORK.NOON has 7860 observations and 2 variables. NOTE: The DATA statement used 2.67 CPU seconds and 9276K. *sql*; data testvw/view=testvw ; set a; rowN=_N_; run; proc sql; create table sql as select date, rown from testvw group by date having min(time) ge 43200 order by rown ; quit; NOTE: The View WORK.TESTVW.VIEW used 1.47 CPU seconds and 8086K. NOTE: The PROCEDURE SQL used 9.08 CPU seconds and 7412K.

So, the run-time ratio is about 1:4, consistent with our 1999 findings. Of course, for SQL, it would be quite difficult, if not impossible, to optimize to such an extent, for the optimizer has no idea about the actual key range until after the entire file has been scanned. On the other hand, this would be feasible if SAS SQL had options for specifying the key range and tentative input size (if the input is not a SAS data file), thus introducing a healthy dose of heuristics. Having this information available, the optimizer could perhaps be programmed to devise alternative plans in addition to ones currently availalble. Such plans could more efficiently address situations similar to above by resorting to key-indexing, bitmapping or hashing, depending on the key range and input size. At the moment, the only heuristical option I know of is BUFFERSIZE=, through which one can tell SQL implicitly that there is enough memory to hash the keys from one of the files being joined.

Kind regards, =================== Paul M. Dorfman Jacksonville, Fl ===================

>From: Sigurd Hermansen <HERMANS1@WESTAT.COM> >Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM> >To: SAS-L@LISTSERV.UGA.EDU >Subject: Re: Afternoon Performance >Date: Fri, 8 Dec 2000 13:36:39 -0500 > >Depending on the version of SAS you are using, SAS SQL would probably >improve a query to something close to the optimum. I suspect that you do >not have to do anything special if you are using SAS V8.1. > >The SQL solution might actually involve sorting, but certainly not >resorting >of the entire dataset. In SQL you can declare the result you need and the >compiler will select a query plan from a set of logical alternatives. In >this case, you declare that you want to SELECT a date value, a time value, >and a row counter (to preserve the original order of the date-time values >in >the original data table); FROM a data source; GROUP'ed BY date; HAVING >(groups limited to) min(time) ge (or gt) input('12:00:00',time8.) ORDER'ed >BY rowN. I have created a datastep view that supplies the row counter. > >I have copied a test program and included formats for the output. See >below: >data test; > input DATE TIME; >cards; > -3652 32809 > 12998 32795 > 10223 39481 > -3652 52551 > -877 46762 > 10223 46764 > 1898 68817 > 12998 58839 > 4673 46200 > 7448 71080 > -3652 60625 > 12998 80787 > 4673 63297 > 4673 77658 > 10223 56646 > -877 59547 > 7448 82521 > 12998 46852 > -877 33790 > -3652 81979 > -877 68913 >; >run; >data testvw/view=testvw ; > set test; > rowN=_N_; >run; >proc sql; >SELECT date format=mmddyy10. ,time format=time8. ,rowN >FROM testvw >GROUP BY date >HAVING min(time) ge input('12:00:00',time8.) >ORDER BY rowN >; >quit; > >I believe that the SQL solution handles the requirements that you have >about >as efficiently as possible. The SAS SQL "optimizer" will find a good >solution for your environment. You still have to solve the problem of >whether 12:00:00 belongs in the time interval before noon or after noon! >Check with Achilles and the Tortoise for an early discussion of the >dilemma. >Sig > >-----Original Message----- >From: ludwig boltzmann [mailto:lboltzmann@MAIL.NU] >Sent: Friday, December 08, 2000 12:11 PM >To: SAS-L@LISTSERV.UGA.EDU >Subject: Afternoon Performance > > >Hello Sas-l: > >I'm in need of an expert 'performance' advice. I have a _huge_ (of the >order >of 200-300m) SAS data set comprising certain 'events'. The events are >defined by DATE and TIME, both in the forms of SAS date and time. For a >sample, it might look like this: > > DATE TIME > -3652 32809 > 12998 32795 > 10223 39481 > -3652 52551 > -877 46762 > 10223 46764 > 1898 68817 > 12998 58839 > 4673 46200 > 7448 71080 > -3652 60625 > 12998 80787 > 4673 63297 > 4673 77658 > 10223 56646 > -877 59547 > 7448 82521 > 12998 46852 > -877 33790 > -3652 81979 > -877 68913 > >Apparently there are more than 1 event per day. I need to select only those >days when _all_ events have occurred after noon, i.e. 12:00:00. In other >words, if the initial data set was sorted by DATE TIME, formatted, and >printed it would look like > > DATE TIME >1950-01-01 9:06:49 > 14:35:51 > 16:50:25 > 22:46:19 >1957-08-07 9:23:10 > 12:59:22 > 16:32:27 > 19:08:33 >1965-03-13 19:06:57 >1972-10-17 12:50:00 > 17:34:57 > 21:34:18 >1980-05-23 19:44:40 > 22:55:21 >1987-12-28 10:58:01 > 12:59:24 > 15:44:06 >1995-08-03 9:06:35 > 13:00:52 > 16:20:39 > 22:26:27 > >Out of 7 days above only the following groups would qualify: > > DATE TIME >1980-05-23 19:44:40 > 22:55:21 >1972-10-17 12:50:00 > 17:34:57 > 21:34:18 >1965-03-13 19:06:57 > >If the input was sorted like that a solution would be more or less obvious >(and I would not be asking an advice). The problem is, it is _not_ sorted >by >DATE TIME (instead, it is ordered by some alternate key not related to this >problem), and because of the size of the data I want to avoid sorting >altogether. Moreover, the resuling subset needs to remain in the original >relative order. So with the sample like up above the actual output I need >would be: > >DATE TIME >1898 68817 >4673 46200 >7448 71080 >4673 63297 >4673 77658 >7448 82521 > >It doesn't really matter if it is SQL or anything else - the only goal here >is high performance. Thanks in advance for any ideas. >- Ludwig > > > >_____________________________________________________________ >Get Your Own .NU Web Address Now! ---> http://WWW.NUNAMES.NU

_____________________________________________________________________________________ Get more from the Web. FREE MSN Explorer download : http://explorer.msn.com


Back to: Top of message | Previous page | Main SAS-L page