LISTSERV at the University of Georgia
Date:   Sat, 6 Nov 2010 21:45:42 -0700
Reply-To:   oloolo <dynamicpanel@YAHOO.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   oloolo <dynamicpanel@YAHOO.COM>
Subject:   Re: Fwd: http://www.datamineit.com/DMI_publications.htm -- Much Faster Bootstraps Using SAS
Comments:   To: "J.D. Opdyke" <jdopdyke@gmail.com>, tobydunn@hotmail.com
Comments:   cc: mkeintz@wharton.upenn.edu
In-Reply-To:   <AANLkTimtdx2FfS7c5d=2bGwdzyMC2zvpGyneLjzXyDc6@mail.gmail.com>
Content-Type:   text/plain; charset=utf-8

To Opdyke,

Generally I don't spend my time on papers from unknown journals, but since the fight is on, I took 30 minutes out for this.

I have examined your SAS code and realized it is not code to generate bootstrap samples, as I thought it would be. Let's get straight what analysis is embedded in your SAS code. You want to obtain bootstrapped simple summary statistics from the original data set, and you custom-coded a SAS program using the sequential sampling-with-replacement algorithm (A4.8). I didn't spend time reading your proof and assume it is correct. There are always opportunities for particular custom-written code to outperform SAS's more general-purpose procedures in time and/or memory consumption by exploiting this difference. But you'd better title it appropriately as, say, "Faster bootstrapped selected descriptive statistics with one pass of the data."

Oh, wait, you didn't pass over the data once. You pass over the data several times, at least 1+(num_bsmps) times, in order to generate bootstrapped descriptive statistics: first load the data into a temporary array, then go over this array to generate samples with replacement. Therefore one alternative measurement that is independent of hardware is how many passes are made over the data.

I am familiar with sequential algorithms, a hot topic given the explosion of databases, but definitely a much broader one than your sequential sampling with replacement. I have actually implemented several, including sequential sampling.

While I admit that PROC SURVEYSELECT, since it generates an output dataset for various further bootstrap tasks, is much slower in real time, you should compare apples to apples by comparing how much real time both algorithms use in generating a bootstrap sample set. I made some changes to your OPDY code to accomplish this.
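The load-then-resample pattern described above (one read of the input into a temporary array, then num_bsmps passes over that array) can be sketched roughly as follows. This is only an illustration under stated assumptions: the input SASHELP.CLASS, the hard-coded row count of 19, the sample sizes, and the seed are all made up for the sketch and do not come from the paper or from the code discussed in this thread.

```sas
/* Sketch only: resample one numeric variable with replacement from an
   in-memory array. SASHELP.CLASS (19 rows), the sizes, and the seed
   are illustrative assumptions, not values from the thread. */
%let n_obs     = 19;   /* assumed row count of sashelp.class */
%let bsmp_size = 19;   /* draws per bootstrap replicate      */
%let num_bsmps = 5;    /* number of bootstrap replicates     */

data boot_sketch(keep=replicate height);
   array temp{&n_obs.} _temporary_;

   /* One pass over the input: copy the variable into the array */
   do i = 1 to &n_obs.;
      set sashelp.class(keep=height);
      temp{i} = height;
   end;

   /* num_bsmps passes over the array, none over the input file */
   call streaminit(12345);
   do replicate = 1 to &num_bsmps.;
      do n = 1 to &bsmp_size.;
         /* uniform index in 1..n_obs, i.e. sampling with replacement */
         height = temp{floor(rand('uniform') * &n_obs.) + 1};
         output;
      end;
   end;
   stop;   /* everything is done in the first data step iteration */
run;
```

With these settings the output data set holds num_bsmps * bsmp_size = 95 rows, one per draw, which is the kind of sample set that could then be timed against PROC SURVEYSELECT with METHOD=URS and OUTHITS.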
The modified code and the SAS log from my outdated, I/O-restricted workstation are attached at the end of this message, and I welcome you and all other recipients of this email to test it on your own hardware. Of course, any comments are welcome.

So your original challenge is now: can you write faster SAS code to generate bootstrap samples than PROC SURVEYSELECT while maintaining sound statistical properties for the sample?

Finally, since you threw us a challenge, let me give you one back in good will. Actually, I will give you several to choose from:

1. I want to bootstrap the standard error estimates of parameters from a simple OLS regression.
2. I want to obtain a bootstrapped estimate of the 95% CI on my AUC estimate of a binary classifier.
3. I want to obtain a bootstrapped estimate of the Spearman rank correlation for two variables in the data.

BTW, SAS uses Floyd's method by default for SRS, but doesn't say explicitly which method it uses for URS. You may want to confirm this with SAS. Since you mentioned that you've already contacted a senior SAS official, I believe you can obtain this piece of information easily.

****@@@@@ APPENDIX @@@@ ****;
/*
My test result. The customized code looks faster superficially
(28sec vs 37sec real time), but the statistical property is not as
good as the SAS output. If you spot an error in the X_Boot macro
below, please kindly let me know.

Besides, the sampling part can be further improved. Its current
version does not qualify as a real sequential one, since it reads in
all records in one stratum and goes over the records &num_bsmps
times. I know how to improve it, but I will wait until you solve the
above three challenges.
*/
************;

1787
1788  dm log 'clear';
1789  %let t0=%sysfunc(datetime(), datetime18.); %put &t0;
07NOV10:00:28:02
1790  options nonotes nomprint nomlogic;
1791  %X_Boot(outdsn=Boot_X_samples,
1792          bsmp_size=100000,
1793          num_bsmps=50,
1794          indata=MFBUS.price_data_6strata_100000,
1795          byvars=geography segment,
1796          bootvar=price
1797         );
1798  %let t1=%sysfunc(datetime(), datetime18.); %put &t1;
07NOV10:00:28:30
1799
1800  option notes;
1801  %let indata=MFBUS.price_data_6strata_100000;
1802  %let bsmp_size=100000;
1803  %let num_bsmps=50;
1804  %let byvars=geography segment;
1805  proc surveyselect data=&indata. method=urs sampsize=&bsmp_size. rep=&num_bsmps.
1806       seed=-1 outhits
1807       out=Boot_PSS_Samps(drop=expectedhits samplingweight NumberHits)
1807!      noprint;
1808     strata &byvars.;
1809  run;

NOTE: The data set WORK.BOOT_PSS_SAMPS has 30000000 observations and 4 variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
      real time           36.26 seconds
      cpu time            27.96 seconds

*------------ examine statistics ----------------------*

Using Bootstrap sample from X_Boot macro                               5
                                            21:56 Monday, July 6, 2009

The MEANS Procedure

                         N
geography    segment    Obs    Variable          Mean         Std Dev
------------------------------------------------------------------------
geog1        segment1    50    mean         249512.99     575.8174202
                               std          220214.20     529.3069682
             segment2    50    mean      -243.5332090         1865.46
                               std          575937.15         1784.06
geog2        segment1    50    mean         248809.31     556.8638277
                               std          219808.91     503.6015858
             segment2    50    mean          -1021.98         1640.79
                               std          577149.34         2114.87
geog3        segment1    50    mean         248585.33     568.2178631
                               std          219588.75     477.5457668
             segment2    50    mean      -735.8167639         1343.60
                               std          579573.54         1629.92
------------------------------------------------------------------------

Using Bootstrap sample from SURVEYSELECT                               4
                                            21:56 Monday, July 6, 2009

The MEANS Procedure

                         N
geography    segment    Obs    Variable          Mean         Std Dev
------------------------------------------------------------------------
geog1        segment1    50    mean         249558.27     608.1574937
                               std          220380.89     483.7052949
             segment2    50    mean       -28.2200958         1925.06
                               std          576301.67         2015.30
geog2        segment1    50    mean         249069.34     757.4270800
                               std          219917.88     591.7976633
             segment2    50    mean      -447.0822998         1644.27
                               std          576295.29         2050.00
geog3        segment1    50    mean         248557.63     689.3976551
                               std          219597.41     542.3234011
             segment2    50    mean          -1058.64         1831.72
                               std          579522.11         1841.37
------------------------------------------------------------------------

**********************************************************************;
%macro X_Boot(outdsn=data1, bsmp_size=, num_bsmps=, indata=, byvars=, bootvar=);

   *** the only assumption made within this macro is that the byvars are all character variables;
   *** obtain last byvar, count byvars, and assign each byvar into macro variables for easy access/processing;
   %let last_byvar = %scan(&byvars.,-1);
   %let num_byvars = %sysfunc(countw(&byvars.));
   %do i=1 %to &num_byvars.;
      %let byvar&i. = %scan(&byvars.,&i.);
   %end;

   *** macro obtains number of observations in a dataset;
   %macro nobs(dset);
      %if %sysfunc(exist(&dset)) %then %do;
         %let dsid=%sysfunc(open(&dset));
         %let nobs=%sysfunc(attrn(&dsid,nobs));
         %let dsid=%sysfunc(close(&dsid));
      %end;
      %else %let nobs=0;
      &nobs
   %mend nobs;

   *** initialize macro variables used later;
   %let bmean =;
   %let bstd =;
   %let b975 =;
   %let b025 =;

   *** obtain counts and cumulated counts for each strata;
   proc summary data=&indata. nway;
      class &byvars.;
      var &bootvar.;
      output out=byvar_nobs(keep=_FREQ_ &byvars.) n=junk;
   run;

   %let n_byvals = %nobs(byvar_nobs);

   data cum_temp(keep=_FREQ_ cum_prev_freq);
      set byvar_nobs(keep=_FREQ_);
      retain cum_prev_freq 0;
      prev_freq = lag(_FREQ_);
      if _n_=1 then prev_freq = 0;
      cum_prev_freq = sum(cum_prev_freq, prev_freq);
   run;

   *** put counts, cumulated counts, and byvar values into macro strings;
   proc sql noprint;
      select cum_prev_freq into :cum_prev_freqs separated by ' ' from cum_temp;
   quit;
   proc sql noprint;
      select _freq_ into :freqs separated by ' ' from cum_temp;
   quit;
   %do i=1 %to &num_byvars.;
      proc sql noprint;
         select &&byvar&i. into :byvals&i. separated by ' ' from byvar_nobs;
      quit;
   %end;

   *** get size of largest stratum;
   proc summary data=byvar_nobs(keep=_FREQ_) nway;
      var _FREQ_;
      output out=byvar_nobs(keep=max_freq) max=max_freq;
   run;
   data _null_;
      set byvar_nobs;
      call symputx('max_freq',max_freq);
   run;

   *** save results of each stratum in cumulated macro variables instead of outputting to a dataset on the data step to lessen intermediate memory requirements ***;
   /*+------------- this part can be further improved -----------------+*/
   data &outdsn;
      set &indata.(keep=&byvars. &bootvar.);
      by &byvars.;
      array bmeans{&num_bsmps.} bm1-bm&num_bsmps.;
      array temp{&max_freq.} _TEMPORARY_;
      retain byval_counter 0 cum_prev_freq 0;
      temp[_n_-cum_prev_freq]=&bootvar.;
      if last.&last_byvar. then do;
         byval_counter+1;
         freq = 1* scan("&freqs.", byval_counter,' ');
         num_bsmps = &num_bsmps.*1;
         bsmp_size = &bsmp_size.*1;
         do replicate=1 to num_bsmps;
            do n=1 to bsmp_size;
               &bootvar = temp[floor(ranuni(-1)*freq) + 1] ;
               output;
               keep replicate &byvars &bootvar;
            end;
         end;
         cum_prev_freq = 1*scan("&cum_prev_freqs.",byval_counter+1,' ');
      end;
   run;
%mend;

________________________________
From: J.D. Opdyke <jdopdyke@gmail.com>
To: tobydunn@hotmail.com; dynamicpanel@yahoo.com; sas-l@listserv.uga.edu
Cc: mkeintz@wharton.upenn.edu
Sent: Sat, November 6, 2010 5:37:58 PM
Subject: Fwd: http://www.datamineit.com/DMI_publications.htm -- Much Faster Bootstraps Using SAS

Toby,

Can you write faster bootstrap code or not? That is what this thread is about, and that is what my paper is about. To date, you have not done that. If you’re going to criticize, please keep your eye on the ball. When you have time to attempt to write bootstrap code, I'm sure all of us would love to see it. Mine is not only in the paper, but can be downloaded in a text file at http://www.datamineit.com/DMI_publications.htm -- I could not make it any easier to run -- just copy and paste the code into the SAS program editor, and click the little running man.

Writing trivial pieces of code to make up data, and trivial “nobs” code that makes no difference in terms of real runtime (or by any criteria, actually -- I’ll prove that below for those who’d like to see it) smacks of a little desperation, Toby. It's good to be a little determined to try to be better and to try to write better code, but until you have, like I said, silence is the better part of valor. Spitting out silly pretend data in loops is what it is: silly pretend data just useful to make the point using serious code, like my bootstrap code. I just included the pretend data code so people could test the bootstrap code with EXACTLY the datasets I used. Really, a little sad that you’ve fixated on code making up pretend data. I do not think you have perspective -- you're missing the boat, and that is fast bootstrap code.

My challenge remains, and it is genuine: let us, the SAS Users community, continue to improve methodologies and approaches and write better and better code, in this case, for implementing bootstraps (not making up pretend data).
That should be the goal, not ad hominem attacks on code as “scking” before you’ve even (admittedly) read the paper or even (admittedly) tested and implemented the code across a range of scenarios (everyone makes mistakes, but a class act would have apologized -- you have yet to). That is unprofessional and irresponsible and puerile.

That aside, I’m sure sometime, perhaps soon, my bootstrap code will be bested, at least on certain platforms under certain conditions (if for no other reason than I've put the approach out there). To encourage that, in fact, as a responsible SAS User, I had already sent my paper and code directly to a very senior SAS officer exactly for that purpose, thus proving that my paper was a genuine, and not ego-driven, challenge. Saying code "scks" when you haven't even tested it, or read its documentation thoroughly (by your own admission), is an ego-related mistake not worthy of the Listserv. I have no problem standing by those statements (can you stand by yours?).

That said, when considering challengers to my paper and code, it is important to remember that the great utility of my algorithm lies in the fact that all you need is Base SAS to run it: no other modules, and no expensive grid platforms. The size of the audience becomes a material criterion when assessing the utility of code, and if a method requires a $10 million specialized SAS product on a specialized and expensive platform, it is not going to be relevant to most of the millions of SAS Users out there, even if it's faster.

For those of you who want to read on as I continue to school Toby (said in the NICEST possible way! Some of my good friends are from his neck of the woods), please read on: otherwise, I think this thread has reached the end of its utility.

===============

Toby,

The little “nobs” macro I just used is just something I’ve used for years out of convenience. It is NOT the fastest way to get the “nobs” of a dataset into a macro variable -- I know, because I’ve tested it.
The fastest code that I’VE found (there may be faster ways -- if so, I haven’t seen them) is the below:

%let dsid = %sysfunc(open(temp));
%let testnobs = %sysfunc(ifc(&dsid.
                ,%nrstr(%sysfunc(attrn(&dsid.,nobs)); %let rc = %sysfunc(close(&dsid.));)
                ,%nrstr(0)
                )
                );
%put testnobs = &testnobs.;

That said, the only way this difference becomes material to a SAS user by ANY efficiency criterion is if they are obtaining “nobs” thousands of times in some type of looping macro (which is how I tested the different methods): otherwise, a real runtime difference of about a hundredth of a second, when you’re only getting “nobs” a few times, is trivial, and hardly worthy of taking up space on this Listserv. Those without perspective cannot tell the difference between trivial coding curiosities and code that drops the runtimes of SAS Procs from over a month to under a day; for clients paying serious $, the difference between the two is obvious.

So again, in the nicest way possible, I must say: put your $ where your verbosity is -- either write faster BOOTSTRAP code, or let's let this thread die, Toby.

Good luck.

============================================
J.D. Opdyke, Managing Director-Quantitative Strategies
DataMineIt
17 McKinley Road
Marblehead, MA 01945
phone: 617-943-6463, 781-639-6463
fax: 781-639-6463
email: JDOpdyke@DataMineIt.com
web: www.DataMineIt.com
============================================
#########################################################
Statement of Confidentiality
The information contained in this electronic message, and any attachments to this message, are intended for the exclusive use of the addressee(s), may contain confidential or privileged information, and are protected by law. If you are not the intended recipient, please notify J.D. Opdyke as soon as possible at (617) 943-6463 and JDOpdyke@DataMineIt.com and destroy all copies of this message and any attachments. Any disclosure, copying, or distribution of this message, or the taking of any action based on it, is strictly prohibited.
#########################################################

---------- Forwarded message ----------
From: toby dunn <tobydunn@hotmail.com>
Date: Sat, Nov 6, 2010 at 4:33 PM
Subject: RE: http://www.datamineit.com/DMI_publications.htm -- Much Faster Bootstraps Using SAS
To: jdopdyke@gmail.com, sas-l@listserv.uga.edu, dynamicpanel@yahoo.com
Cc: mkeintz@wharton.upenn.edu

John,

Nothing in either paper is overly impressive programmatically, nor anything my intermediate programmers couldn't handle. Personally, I'd send it back to my guys at work to have it reworked, as both macros are inefficient and could be made better. Not to say the concept is bad, just that the implementation is horrendous IMO.

As I said before, messing around with this is low, and by that I mean really low, on my priority list. I have five presentations to prep for Monday and Tuesday and a book to start writing. My plate is full of SAS papers and books, plus a long list of ideas for papers to write, so needless to say I'm not overly worried about adding more.

I made a few minor mods to the MakeData macro since I really didn't run it before posting it (my own fault). It actually runs faster than your original one, and you can specify the output data set name without modifying the code internally, which gives the greatest degree of flexibility in that area. Not to mention it's just easier to read and comprehend.
%Macro MakeData( Strata= , Segments= , Geography= , DSO= ) ;

  %Local SList GList K J ;

  %Do K = 1 %To &Segments ;
    %Let SList = &SList "SEGMENT&K" ;
  %End ;

  %Do J = 1 %To &Geography ;
    %Let GList = &GList "GEOG&J" ;
  %End ;

  Data &DSO ( Keep = Geography Segment Price ) ;
    Length Geography $ 5 Segment $ 8 ;
    Do I = 1 To &Strata ;
      Do Geography = %SysFunc( TranWrd( %Str(&GList) , %STR( ) , %STR( , ) ) ) ;
        Do Segment = %SysFunc( TranWrd( %Str(&SList) , %STR( ) , %STR( , ) ) ) ;
          Select ( Segment ) ;
            When ( 'SEGMENT1' ) Price = Rand( 'Uniform' ) * 10 * I ;
            When ( 'SEGMENT2' ) Price = Rand( 'Normal' ) * 10 * I ;
            When ( 'SEGMENT3' ) Price = Rand( 'LogNormal' ) * 10 * I ;
            Otherwise ;
          End ;
          Output ;
        End ;
      End ;
    End ;
  Run ;

%Mend MakeData ;

As for the NOBS macro I sent earlier, the design is better: it declares local macro variables explicitly, uses NLOBS rather than NOBS, uses fewer macro variables, and has the added benefit of a little error handling.

Toby Dunn
"I'm a hell bent 100% Texan til I die"
"Don't touch my Willie, I don't know you that well"

________________________________
Date: Sat, 6 Nov 2010 14:32:06 -0400
Subject: Fwd: http://www.datamineit.com/DMI_publications.htm -- Much Faster Bootstraps Using SAS
From: jdopdyke@gmail.com
To: SAS-L@listserv.uga.edu; dynamicpanel@yahoo.com
CC: mkeintz@wharton.upenn.edu; tobydunn@hotmail.com

Toby

Thanks for your rep lies

1) Sorry for the misspelling.

2) I agree that that is the only limitation of the Class statement vs. the By statement -- too many "by" variables can crash the code (because the entire matrix is held in memory, which is why it is sometimes noticeably faster). That’s not often an issue, and I didn’t mention it because I didn’t want to distract from the real point of the dialogue, which is how much faster and more efficient my bootstrap code is than any other SAS implementation.

3) To accommodate you, here's some SAS Macro code you can try to figure out. It's a peer-reviewed, original, recursive, combinatorial algorithm that’s been vetted extensively. Happy to answer any questions you may have.
http://www.springerlink.com/content/n43n44773r0h0r11/?p=4ea1146c0de5490e8933b05dd9d5de57&pi=3

4) EFFICIENCY: I state in the paper that "efficiency" can mean different things. I focus on the "efficiency" that most users find the most important in most settings: real runtime. If you had read the paper and not just skimmed it, you would have read that, as well as seen the CPU runtimes in Appendix B.

5) I have never seen the speed of SQL code, with or without the optimizer, come close to that of efficiently written hash code, regardless of platform or SAS version (hashing is only DIRECTLY available in more recent versions of SAS, although one could always use DLLs). By all means, shoot me the code, with data specs, to test and I’ll let you know what I find out. And as I said previously, and as I state in my paper, of course hashing is constrained by the amount of memory you have.

6) Yes, I know the reference to the song. The linguistic device, crudely applied here in my opinion, is called double entendre. Puerile in my view, not a tagline by which I'd want to be known, but to each their own. Rather, I was referring to your "scks" comment -- until you've read (not skimmed) a coding paper, understood it, and actually run the code under a wide range of settings, such an assessment is irresponsible and unprofessional. Don't "bet" something's inefficient -- test it and see if you can do better. If/when you cannot, silence is the better part of valor.

7) Cussing can be a little fun in a barroom when a little drunk, Toby, but this is a professional Listserver, where the only (collegial) competition should be over who can write the best SAS code.

“The thought did cross my mind to run and rework your code. I simply just currently don't have the time at the moment to do so.”

I’ll look forward to your substantive reply, with actual SAS code. Maybe you can present it at WUSS, Toby.

Sincerely,
J.D.

P.S. -- Let’s stick to the bootstrap code -- writing a couple of Put statements in a four-line macro is hardly what I’d call “reworking.” But I’m glad SAS Macro is becoming a little more familiar to you. Good luck with it.

---------- Forwarded message ----------
From: toby dunn <tobydunn@hotmail.com>
Date: Sat, Nov 6, 2010 at 2:21 PM
Subject: RE: http://www.datamineit.com/DMI_publications.htm -- Much Faster Bootstraps Using SAS
To: jdopdyke@gmail.com, mkeintz@wharton.upenn.edu, dynamicpanel@yahoo.com, sas-l@listserv.uga.edu

In an effort to better explain, and to set a better tone, as the SAS-L listserv is civil 99% of the time:
I only had time to rework your MakeData macro. I'm not sure the Listserv will preserve the formatting:

%Macro MakeData( Strata= , Segments= , Geography= , DSO= ) ;

  %Local SList GList K J ;

  %Do K = 1 %To &Segements ;
    %Let SList = SList "SEGMENT&K" ;
  %End ;

  %Do J = 1 %To &Geography ;
    %Let GList = GList "GEOG&J" ;
  %End ;

  Data &DSO ( Keep = Geography Segment Price ) ;
    Length Geography $ 5 Segment $ 8 ;
    Do I = 1 To &Strata ;
      Do Geography = %SysFunc( TranWrd( %Str(&SList) , %STR( ) , %STR( , ) ) ) ;
        Do Segment = %SysFunc( TranWrd( %Str(&GList) , %STR( ) , %STR( , ) ) ) ;
          Select ( Segment ) ;
            When ( 'SEGMENT1' ) Price = Rand( 'Uniform' ) * 10 * I ;
            When ( 'SEGMENT2' ) Price = Rand( 'Normal' ) * 10 * I ;
            When ( 'SEGMENT3' ) Price = Rand( 'LogNormal' ) * 10 * I ;
          End ;
          Output ;
        End ;
      End ;
    End ;
  Run ;

%Mend MakeData ;

I also noticed you define %Macros within a %Macro. Not a good idea, as it forces a recompilation of the interior macros with each execution of the wrapping macro. I'd recommend defining those before your %OPDY_Boot macro.

The %NOBS macro can be rewritten as follows:

%Macro Nobs( Data= ) ;

  %Local OPEN CLOSE ;

  %If %SysFunc( Exist( &Data ) ) %Then %Do ;
    %Let OPEN = %Sysfunc( Open( &Data , IS ) ) ;
    %SysFunc( Attrn( &OPEN , NLOBS ) )
    %Let Close = %SysFunc( Close ( &OPEN ) ) ;
  %End ;
  %Else %Do ;
    %Put ERROR: DataSet [&Data] Does Not Exist!!! ;
    %Put ERROR: Number Of Obs Will Be Set To 0!!! ;
    0
  %End ;

%Mend NOBS ;

Also, doing things like:

proc summary data=byvar_nobs(keep=_FREQ_) nway;
   var _FREQ_;
   output out=byvar_nobs(keep=max_freq) max=max_freq;
run;
data _null_;
   set byvar_nobs;
   call symputx('max_freq',max_freq);
run;

makes no sense; just use Proc SQL to do this.
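The Proc SQL alternative suggested above was not spelled out in the message. A minimal sketch of what it might look like follows; the data set name BYVAR_NOBS, the variable _FREQ_, and the macro variable max_freq are the names from the quoted code, while the three demo rows are made up purely so the step can be run on its own.

```sas
/* Sketch: compute the largest stratum size and store it in a macro
   variable in a single step. The demo rows are illustrative only. */
data byvar_nobs;
   input geography $ segment $ _FREQ_;
   datalines;
geog1 segment1 100
geog1 segment2 250
geog2 segment1 175
;
run;

proc sql noprint;
   select max(_FREQ_) into :max_freq
      from byvar_nobs;
quit;

/* Re-assigning strips the leading blanks that INTO : leaves on numbers */
%let max_freq = &max_freq.;
%put NOTE: max_freq = &max_freq.;
```

One side effect worth noting: unlike the PROC SUMMARY version quoted above, this leaves BYVAR_NOBS itself unchanged rather than overwriting it with a one-row data set.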
Toby Dunn
"I'm a hell bent 100% Texan til I die"
"Don't touch my Willie, I don't know you that well"

________________________________
Date: Sat, 6 Nov 2010 11:44:58 -0400
Subject: http://www.datamineit.com/DMI_publications.htm -- Much Faster Bootstraps Using SAS
From: jdopdyke@gmail.com
To: mkeintz@WHARTON.UPENN.EDU; dynamicpanel@YAHOO.COM; SAS-L@LISTSERV.UGA.EDU; tobydunn@HOTMAIL.COM

Mark,

1) If you could show me how to get access to post this reply on the LISTSERV, I would greatly appreciate it. Short of that, would you mind posting this reply for me? Tody made a number of errors that SAS Users reading his comments should be made aware of.

2) I have very much enjoyed reading your “Outperforming SAS® Indices for Sorted Datasets” paper -- thank you for the contribution.

3) I have attached my faster bootstraps paper to this email. Feel free to take the code for a spin. Any feedback you may have is welcome, and I’d be happy to answer any questions you may have about it.

I reply to Tody's comments below.

Sincerely,
J.D. Opdyke

Tody,

1) BY STATEMENT: A “By statement” is, in fact, used in subsequent analyses (see Proc Univariate on page 13). I did not use it in the Proc Summary following Proc SurveySelect because using a Class statement instead is a better default: a By statement will crash if the data happens not to be sorted according to the order of your By variables, and I’ve found Class to be slightly faster under certain circumstances. But the two are essentially equivalent, so your statement about not leveraging “by statements” is incorrect.

2) EFFICIENCY: “Efficiency” as defined in the paper is real runtime -- “the speed with which SAS users can obtain actual results” (page 2). By that criterion, as a factual matter, my code is more efficient than any other SAS code out there, Tody, including the only Proc (Proc SurveySelect) that allows SAS users to implement m-out-of-n bootstraps. You may be a little unfamiliar with advanced SAS Macro code (and maybe bootstraps), which perhaps is why you cannot follow it. I’ve been coding in SAS for over 20 years -- once you learn advanced SAS Macro code, you should be able to follow along, as the code is only a few pages long and pretty straightforward for advanced users (e.g. multiple ampersand resolution in loops, strings of macro variable names and macro variable values output by proc sql, etc.).

3) HASHING: You sound like you’re a little new to SAS, so I’ll simply explain (as in the paper) how/why hashing would be used in this setting. Currently, for almost all circumstances, there is no faster way to “merge” datasets in SAS than hashing (technically it's not a “merge,” but the end result is the same), and one way to conduct bootstraps in SAS is to output the bootstrap samples containing the record numbers to sample in one dataset, then merge that with the original dataset that contains a record counter. Hashing can be used to do that faster than any other method in SAS under almost any circumstances, even efficiently designed indices. The only drawback is that, because it's all done in memory, if your memory is too small, Tody, the hash code will abort. The hashing code to do that is in the two SAS papers I cite: very easy to find and download (and write, if you know SAS Macro).

4) SUBSTANTIVE FEEDBACK: If you have substantive feedback on the paper, I’d very much like to hear it, but nothing in what you wrote above is substantive, let alone correct. Rather than saying “I’ll bet it's inefficient,” just try the code out, Tody -- then, without the false bravado/insecure bluster, just comment on the speed of the code. When put in writing in the paper, it's actually quite embarrassing how the rest of the methods are so completely dominated by the OPDY algorithm.

5) PROFANITY: I’m no prude, but we don’t need to hear about your wllie scking -- resorting to profanity is a sign of fear of one’s own ignorance. Maintain your professionalism, and keep the internet, let alone the Listserv, clean, please. A little class, Tody, goes a long way.

6) A FINAL CHALLENGE: Also, I’ll put out a challenge to you: write faster bootstrap code and post it. If you cannot, I think it's time to eat a little crow.

Welcome to SAS!

Date: Mon, 1 Nov 2010 18:14:30 -0400
Reply-To: oloolo <dynamicpanel@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: oloolo <dynamicpanel@YAHOO.COM>
Subject: Re: Much Faster Bootstraps Using SAS?
Comments: To: Toby Dunn <tobydunn@HOTMAIL.COM>

I skimmed over the paper. From what I can tell, he doesn't know how to use PROC SURVEYSELECT to generate bootstrap samples or how to leverage a BY statement in subsequent analysis.

On Mon, 1 Nov 2010 21:28:32 +0000, toby dunn <tobydunn@HOTMAIL.COM> wrote:

I'm reading it, and from the looks of the code he could use a good programmer to help him out.
Aside from that, I can't find his hash code in his paper, and I'm still a little confused why he would use a hash for this. His macro code sucks to the point I bet it is inefficient. While he may very well be correct, I am skeptical that he is, much less that he can code efficiently enough to make such a statement.

Toby Dunn
"I'm a hell bent 100% Texan til I die"
"Don't touch my Willie, I don't know you that well"

Date: Mon, 1 Nov 2010 20:46:14 +0000
From: mkeintz@WHARTON.UPENN.EDU
Subject: Much Faster Bootstraps Using SAS?
To: SAS-L@LISTSERV.UGA.EDU

Is anybody on the L familiar with Opdyke's results reported below? I've just learned of this paper's existence and have not read it yet.

Regards,
Mark

================================================================================

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1692130

Much Faster Bootstraps Using SAS(r)
J.D. Opdyke
DataMineIt
InterStat, October 2010

Abstract: Seven bootstrap algorithms coded in SAS(r) are compared. The fastest ("OPDY"), which uses no modules beyond Base SAS(r), achieves speed increases almost two orders of magnitude faster (over 80x faster) than the relevant "built-in" SAS(r) procedure (Proc SurveySelect). It is even much faster than hashing, but unlike hashing it requires virtually no storage space, and its memory usage efficiency allows it to execute bootstraps on input datasets larger (sometimes by orders of magnitude) than the largest a hash table can use before aborting. This makes OPDY arguably the only truly scalable bootstrap algorithm in SAS(r).

