To Opdyke,
Generally I don't spend my time on papers from unknown journals, but since
the fight is on, I took 30 minutes out for this.
I have examined your SAS code and realized it is not code to generate a
bootstrap sample, as I thought it would be. Let's get straight what analysis is
embedded in your SAS code. You want to obtain bootstrapped simple summary
statistics from the original data set, and you custom-coded a SAS program
using the sequential sampling-with-replacement algorithm (A4.8). I didn't spend
time reading your proof and assume it is correct.
There are always opportunities for particular custom-written code to
outperform SAS's more general-purpose procedure in time and/or memory
consumption by exploiting this difference. But you'd better title it
appropriately as, say, "Faster bootstrapping of selected descriptive statistics
with one pass of the data." Oh, wait, you don't pass over the data once; you
pass over the data several times, at least 1+(num_bsmps), in order to generate
bootstrapped descriptive statistics: first you load the data into a temporary
array, then you go over this array to generate a sample with replacement.
Therefore one alternative measure that is independent of hardware is the number
of passes over the data.
I am familiar with sequential algorithms, which are a hot topic given the
explosion of databases, but they are definitely much broader than your
sequential sampling with replacement. I have actually implemented several,
including sequential sampling.
While I admit that PROC SURVEYSELECT is much slower in real time, since it
generates an output dataset for various further bootstrap tasks, you should
compare apples to apples by comparing how much real time both algorithms use in
generating a bootstrap sample set. I made some changes to your OPDY code to
accomplish this. The modified code and the SAS log from my outdated,
I/O-restricted workstation are attached at the end of this message, and I
welcome you and all other recipients of this email to test it in your own
hardware environment. Of course, any comments are welcome. So your original
challenge is now: Can you write faster SAS code to generate a bootstrap sample
than PROC SURVEYSELECT while maintaining sound statistical properties for the
sample?
Finally, since you threw us a challenge, let me give you one back in good
will. Actually, I will give you several to choose from:
1. I want to bootstrap the standard error estimates of the parameters from a
simple OLS regression.
2. I want to obtain a bootstrapped estimate of the 95% CI on my AUC estimate
of a binary classifier.
3. I want to obtain a bootstrapped estimate of the Spearman rank correlation
for two variables in the data.
BTW, SAS uses Floyd's method by default for SRS, but doesn't state explicitly
which method it uses for URS. You may want to confirm this with SAS. Since you
mentioned that you've already contacted a senior SAS official, I believe you
can obtain this piece of info easily.
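For the first of these challenges, the usual PROC SURVEYSELECT route can be
sketched as below. This is only a minimal illustration, not code from this
thread: the dataset HAVE, the variables X and Y, the 200 replicates, and the
seeds are all assumptions made for the example.

```sas
/* Hypothetical input data: HAVE, X and Y are illustrative names only. */
data have;
  call streaminit(2010);
  do i = 1 to 500;
    x = rand('normal');
    y = 1 + 2*x + rand('normal');
    output;
  end;
  drop i;
run;

/* Draw 200 bootstrap samples, each the size of HAVE, with replacement.
   OUTHITS writes one row per selection; the output carries a REPLICATE id. */
proc surveyselect data=have out=boot_samps
     method=urs samprate=1 reps=200 seed=12345 outhits noprint;
run;

/* Fit the regression once per replicate; OUTEST= collects the coefficients. */
proc reg data=boot_samps outest=boot_est noprint;
  by replicate;
  model y = x;
run;
quit;

/* The standard deviation of each coefficient across replicates is its
   bootstrap standard error estimate. */
proc means data=boot_est n mean std;
  var intercept x;
run;
```

The same pattern (resample once, then run the analysis BY REPLICATE) carries
over to the other two challenges by swapping the analysis step.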
****@@@@@ APPENDIX @@@@ ****;
/*
My test result. The customized code looks superficially faster (28 sec vs. 37
sec real time), but the statistical properties are not as good as those of the
SAS output. If you spot an error in the X_boot macro below, please kindly let me
know. Besides, the sampling part can be further improved: its current version
does not qualify as truly sequential, since it reads in all records in one
stratum and goes over the records &num_bsmps times. I know how to improve it,
but I will wait until you solve the above three challenges.
*/
************;
1787
1788 dm log 'clear';
1789 %let t0=%sysfunc(datetime(), datetime18.); %put &t0;
07NOV10:00:28:02
1790 options nonotes nomprint nomlogic;
1791 %X_Boot(outdsn=Boot_X_samples,
1792 bsmp_size=100000,
1793 num_bsmps=50,
1794 indata=MFBUS.price_data_6strata_100000,
1795 byvars=geography segment,
1796 bootvar=price
1797 );
1798 %let t1=%sysfunc(datetime(), datetime18.); %put &t1;
07NOV10:00:28:30
1799
1800 option notes;
1801 %let indata=MFBUS.price_data_6strata_100000;
1802 %let bsmp_size=100000;
1803 %let num_bsmps=50;
1804 %let byvars=geography segment;
1805 proc surveyselect data=&indata. method=urs sampsize=&bsmp_size.
rep=&num_bsmps.
1806 seed=-1 outhits
1807 out=Boot_PSS_Samps(drop=expectedhits
samplingweight NumberHits)
1807! noprint;
1808 strata &byvars.;
1809 run;
NOTE: The data set WORK.BOOT_PSS_SAMPS has 30000000 observations and 4
variables.
NOTE: PROCEDURE SURVEYSELECT used (Total process time):
real time 36.26 seconds
cpu time 27.96 seconds
*------------ examine statistics ----------------------*
Using Bootstrap sample from X_Boot macro

The MEANS Procedure

geography  segment   N Obs  Variable              Mean       Std Dev
------------------------------------------------------------------------
geog1      segment1     50  mean             249512.99   575.8174202
                            std              220214.20   529.3069682
           segment2     50  mean          -243.5332090       1865.46
                            std              575937.15       1784.06
geog2      segment1     50  mean             248809.31   556.8638277
                            std              219808.91   503.6015858
           segment2     50  mean              -1021.98       1640.79
                            std              577149.34       2114.87
geog3      segment1     50  mean             248585.33   568.2178631
                            std              219588.75   477.5457668
           segment2     50  mean          -735.8167639       1343.60
                            std              579573.54       1629.92
------------------------------------------------------------------------

Using Bootstrap sample from SURVEYSELECT

The MEANS Procedure

geography  segment   N Obs  Variable              Mean       Std Dev
------------------------------------------------------------------------
geog1      segment1     50  mean             249558.27   608.1574937
                            std              220380.89   483.7052949
           segment2     50  mean           -28.2200958       1925.06
                            std              576301.67       2015.30
geog2      segment1     50  mean             249069.34   757.4270800
                            std              219917.88   591.7976633
           segment2     50  mean          -447.0822998       1644.27
                            std              576295.29       2050.00
geog3      segment1     50  mean             248557.63   689.3976551
                            std              219597.41   542.3234011
           segment2     50  mean              -1058.64       1831.72
                            std              579522.11       1841.37
------------------------------------------------------------------------
**********************************************************************;
%macro X_Boot(outdsn=data1, bsmp_size=, num_bsmps=, indata=, byvars=, bootvar=);
*** the only assumption made within this macro is that the byvars are all
character variables;
*** obtain last byvar, count byvars, and assign each byvar into macro variables
for easy access/processing;
%let last_byvar = %scan(&byvars.,-1);
%let num_byvars = %sysfunc(countw(&byvars.));
%do i=1 %to &num_byvars.;
%let byvar&i. = %scan(&byvars.,&i.);
%end;
*** macro obtains number of observations in a dataset;
%macro nobs(dset);
%if %sysfunc(exist(&dset)) %then %do;
%let dsid=%sysfunc(open(&dset));
%let nobs=%sysfunc(attrn(&dsid,nobs));
%let dsid=%sysfunc(close(&dsid));
%end;
%else %let nobs=0;
&nobs
%mend nobs;
*** initialize macro variables used later;
%let bmean =;
%let bstd =;
%let b975 =;
%let b025 =;
*** obtain counts and cumulated counts for each strata;
proc summary data=&indata. nway;
class &byvars.;
var &bootvar.;
output out=byvar_nobs(keep=_FREQ_ &byvars.) n=junk;
run;
%let n_byvals = %nobs(byvar_nobs);
data cum_temp(keep=_FREQ_ cum_prev_freq);
set byvar_nobs(keep=_FREQ_);
retain cum_prev_freq 0;
prev_freq = lag(_FREQ_);
if _n_=1 then prev_freq = 0;
cum_prev_freq = sum(cum_prev_freq, prev_freq);
run;
*** put counts, cumulated counts, and byvar values into macro strings;
proc sql noprint;
select cum_prev_freq into :cum_prev_freqs separated by ' ' from cum_temp;
quit;
proc sql noprint;
select _freq_ into :freqs separated by ' ' from cum_temp;
quit;
%do i=1 %to &num_byvars.;
proc sql noprint;
select &&byvar&i. into :byvals&i. separated by ' ' from byvar_nobs;
quit;
%end;
*** get size of largest stratum;
proc summary data=byvar_nobs(keep=_FREQ_) nway;
var _FREQ_;
output out=byvar_nobs(keep=max_freq) max=max_freq;
run;
data _null_;
set byvar_nobs;
call symputx('max_freq',max_freq);
run;
*** save results of each stratum in cumulated macro variables instead of
outputting to a
dataset on the data step to lessen intermediate memory requirements
***;
/*+------------- this part can be further improved -----------------+*/
data &outdsn;
  set &indata.(keep=&byvars. &bootvar.);
  by &byvars.;
  array bmeans{&num_bsmps.} bm1-bm&num_bsmps.;
  array temp{&max_freq.} _TEMPORARY_;
  retain byval_counter 0 cum_prev_freq 0;
  keep replicate &byvars &bootvar;
  temp[_n_-cum_prev_freq]=&bootvar.;
  if last.&last_byvar. then do;
    byval_counter+1;
    freq = 1* scan("&freqs.", byval_counter,' ');
    num_bsmps = &num_bsmps.*1;
    bsmp_size = &bsmp_size.*1;
    do replicate=1 to num_bsmps;
      do n=1 to bsmp_size;
        &bootvar = temp[floor(ranuni(-1)*freq) + 1] ;
        output;
      end;
    end;
    cum_prev_freq = 1*scan("&cum_prev_freqs.",byval_counter+1,' ');
  end;
run;
%mend;
________________________________
From: J.D. Opdyke <jdopdyke@gmail.com>
To: tobydunn@hotmail.com; dynamicpanel@yahoo.com; sas-l@listserv.uga.edu
Cc: mkeintz@wharton.upenn.edu
Sent: Sat, November 6, 2010 5:37:58 PM
Subject:Fwd: http://www.datamineit.com/DMI_publications.htm -- Much Faster
Bootstraps Using SAS
Toby,
Can you write faster bootstrap code or not?
That is what this thread is about, and that is what my paper is about.
To date, you have not done that. If you’re going to criticize, please keep your
eye on the ball. When you have time to attempt to write bootstrap code, I'm
sure all of us would love to see it. Mine is not only in the paper, but can be
downloaded in a text file at http://www.datamineit.com/DMI_publications.htm -- I
could not make it any easier to run -- just copy and paste the code into SAS
program editor, and click the little running man.
Writing trivial pieces of code to make up data, and trivial “nobs” code that
make no difference in terms of real runtime (or by any criteria, actually – I’ll
prove that below for those who’d like to see it) smacks of a little desperation,
Toby. It's good to be a little determined to try to be better and to try to
write better code, but until you have, like I said, silence is the better part
of valor. Spitting out silly pretend data in loops is what it is: silly pretend
data just useful to make the point using serious code, like my bootstrap code.
I just included the pretend data code so people could test the bootstrap code
with EXACTLY the datasets I used. Really, a little sad that you’ve fixated on
code making up pretend data. I do not think you have perspective -- you're
missing the boat, and that is fast bootstrap code.
My challenge remains, and it is genuine: let us, the SAS Users community,
continue to improve methodologies and approaches and write better and better
code, in this case, for implementing bootstraps (not making up pretend data).
That should be the goal, not ad hominem attacks on code as “scking” before
you’ve even (admittedly) read the paper or even (admittedly) tested and
implemented the code across a range of scenarios (everyone makes mistakes, but a
class act would have apologized – you have yet to). That is unprofessional and
irresponsible and puerile.
That aside, I’m sure sometime, perhaps soon, my bootstrap code will be bested,
at least on certain platforms under certain conditions (if for no other reason
than I've put the approach out there). To encourage that, in fact, as a
responsible SAS User, I had already sent my paper and code directly to a very
senior SAS officer exactly for that purpose, thus proving that my paper was a
genuine, and not ego-driven, challenge. Saying code "scks" when you haven't
even tested it, or read its documentation thoroughly (by your own admission), is
an ego-related mistake not worthy of the Listserv. I have no problem standing
by those statements (can you stand by yours?).
That said, when considering challengers to my paper and code, it is important to
remember that the great utility of my algorithm lies in the fact that all you
need is Base SAS to run it: no other modules, and no expensive grid platforms.
The size of the audience becomes a material criterion when assessing the utility
of code, and if a method requires a $10million specialized SAS product on a
specialized and expensive platform, it is not going to be relevant to most of
the millions of SAS Users out there, even if it's faster.
For those of you who want to read on as I continue to school Toby (said in the
NICEST possible way! Some of my good friends are from his neck of the woods),
please read on: otherwise, I think this thread has reached the end of its
utility.
===============
Toby,
The little “nobs” macro I just used is just something I’ve used for years out of
convenience. It is NOT the fastest way to get the “nobs” of a dataset into a
macro variable – I know, because I’ve tested it. The fastest code that I’VE
found (there may be faster ways – if so, I haven’t seen them) is the below:
%let dsid = %sysfunc(open(temp));
%let testnobs = %sysfunc(ifc(&dsid.
,%nrstr(%sysfunc(attrn(&dsid.,nobs));
%let rc = %sysfunc(close(&dsid.));)
,%nrstr(0) )
);
%put testnobs = &testnobs.;
That said, the only way this difference becomes material to a SAS user by ANY
efficiency criterion is if they are obtaining “nobs” thousands of times in some
type of looping macro (which is how I tested the different methods): otherwise,
a real runtime difference of about a hundredth of a second, when you’re only
getting “nobs” a few times, is trivial, and hardly worthy of taking up space on
this Listserv.
Those without perspective cannot tell the difference between trivial coding
curiosities, and code that drops the runtimes of SAS Procs from over a month to
under a day; for clients paying serious $, the difference between the two is
obvious. So again, in the nicest way possible, I must say: put your $ where
your verbosity is -- either write faster BOOTSTRAP code, or let's let this
thread die, Toby. Good luck.
============================================
J.D. Opdyke, Managing Director-Quantitative Strategies
DataMineIt
17 McKinley Road
Marblehead, MA 01945
phone: 617-943-6463, 781-639-6463
fax: 781-639-6463
email: JDOpdyke@DataMineIt.com
web: www.DataMineIt.com
============================================
#########################################################
Statement of Confidentiality
The information contained in this electronic message, and any attachments to
this message, are intended for the exclusive use of the addressee(s), may
contain confidential or privileged information, and are protected by law. If
you are not the intended recipient, please notify J.D. Opdyke as soon as
possible at (617) 943-6463 and JDOpdyke@DataMineIt.com and destroy all copies of
this message and any attachments. Any disclosure, copying, or distribution of
this message, or the taking of any action based on it, is strictly prohibited.
#########################################################
---------- Forwarded message ----------
From: toby dunn <tobydunn@hotmail.com>
Date: Sat, Nov 6, 2010 at 4:33 PM
Subject: RE: http://www.datamineit.com/DMI_publications.htm -- Much Faster
Bootstraps Using SAS
To: jdopdyke@gmail.com, sas-l@listserv.uga.edu, dynamicpanel@yahoo.com
Cc: mkeintz@wharton.upenn.edu
John,
Nothing in either paper programmatically is overly impressive nor anything my
intermediate programmers couldn't handle. Personally, I'd send it back to my
guys at work to have it reworked as both macros are inefficient and could be
made better. Not to say the concept is bad, just that the implementation is
horrendous IMO. As I said before, messing around with this is low, and by that I
mean really low, on my priority list. I have five presentations to prep for
Monday and Tuesday and a book to start writing. My plate is full up on SAS
papers and books, with a long list of ideas for papers to write, so needless to
say I'm not overly worried about adding more.
I made a few minor mods to the MakeData macro, since I really didn't run it
before posting it (my own fault). It actually runs faster than your original
one, and you can specify the output dataset name without modifying the code
internally, which gives the greatest degree of flexibility in that area. Not to
mention it's just easier to read and comprehend.
%Macro MakeData( Strata= , Segments= , Geography= , DSO= ) ;
%Local SList GList K J ;
%Do K = 1 %To &Segments ;
%Let SList = &SList "SEGMENT&K" ;
%End ;
%Do J = 1 %To &Geography ;
%Let GList = &GList "GEOG&J" ;
%End ;
Data &DSO ( Keep = Geography Segment Price ) ;
Length Geography $ 5
Segment $ 8 ;
Do I = 1 To &Strata ;
Do Geography = %SysFunc( TranWrd( %Str(&GList) , %STR( ) , %STR( , ) ) ) ;
Do Segment = %SysFunc( TranWrd( %Str(&SList) , %STR( ) , %STR( , ) ) ) ;
Select ( Segment ) ;
When ( 'SEGMENT1' ) Price = Rand( 'Uniform' ) * 10 * I ;
When ( 'SEGMENT2' ) Price = Rand( 'Normal' ) * 10 * I ;
When ( 'SEGMENT3' ) Price = Rand( 'LogNormal' ) * 10 * I ;
Otherwise ;
End ;
Output ;
End ;
End ;
End ;
Run ;
%Mend MakeData ;
The NOBS macro I sent earlier, well, the design is better, as it declares local
macro variables explicitly, uses NLOBS rather than NOBS, uses fewer macro
variables, and has the added benefit of a little error handling.
Toby Dunn
"I'm a hell bent 100% Texan til I die"
"Don't touch my Willie, I don't know you that well"
________________________________
Date: Sat, 6 Nov 2010 14:32:06 -0400
Subject: Fwd: http://www.datamineit.com/DMI_publications.htm -- Much Faster
Bootstraps Using SAS
From: jdopdyke@gmail.com
To: SAS-L@listserv.uga.edu; dynamicpanel@yahoo.com
CC: mkeintz@wharton.upenn.edu; tobydunn@hotmail.com
Toby
Thanks for your replies.
1) Sorry for the misspelling
2) I agree that that is the only limitation of the Class statement vs. the by
statement -- too many "by" variables can crash the code (because the entire
matrix is held in memory, which is why it is sometimes noticeably faster).
That’s not often an issue, and I didn’t mention it because I didn’t want to
distract from the real point of the dialogue, which is how much faster and
efficient my faster bootstrap code is than any other SAS implementation.
3) to accommodate you, here's some SAS Macro code you can try to figure out.
It's a peer-reviewed, original, recursive, combinatorial algorithm that’s been
vetted extensively. Happy to answer any questions you may have.
http://www.springerlink.com/content/n43n44773r0h0r11/?p=4ea1146c0de5490e8933b05dd9d5de57&pi=3
4) EFFICIENCY: I state in the paper that "efficiency" can mean different
things. I focus on the "efficiency" that most users find the most important in
most settings: real runtime. If you had read the paper and not just skimmed it,
you would have read that, as well as seen the CPU runtimes in Appendix B.
5) I have never seen the speed (or lack thereof) of SQL code, with or without
optimizer, come close to that of efficiently written hash code, regardless of
platform or SAS version (hashing is only DIRECTLY available in SAS (one could
always use DLLs) in more recent versions). By all means, shoot me the code,
with data specs, to test and I’ll let you know what I find out. And as I said
previously, and as I state in my paper, of course hashing is constrained by the
amount of memory you have.
6) Yes, I know the reference to the song. The linguistic device, crudely
applied here in my opinion, is called double entendre. Puerile in my view, not a
tagline by which I'd want to be known, but to each their own. Rather, I was
referring to your "scks" comment -- until you've read, (not skimmed), a coding
paper, understood it, and actually run the code under a wide range of settings,
such an assessment is irresponsible and unprofessional. Don't "bet" something's
inefficient -- test it and see if you can do better. If/when you cannot,
silence is the better part of valor.
7) Cussing can be a little fun in a barroom when a little drunk, Toby, but this
is a professional Listserver, where the only (collegial) competition should be
whoever can write the best SAS code.
“The thought did cross my mind to run and rework your code. I simply just
currently don't have the time at the moment to do so.”
I’ll look forward to your substantive reply, with actual SAS code. Maybe you
can present it at WUSS Toby.
Sincerely,
J.D.
P.S. – Let’s stick to the bootstrap code – writing a couple of Put statements in
a four line macro is hardly what I’d call “reworking.” But I’m glad SAS Macro
is becoming a little more familiar to you. Good luck with it.
---------- Forwarded message ----------
From: toby dunn <tobydunn@hotmail.com>
Date: Sat, Nov 6, 2010 at 2:21 PM
Subject: RE: http://www.datamineit.com/DMI_publications.htm -- Much Faster
Bootstraps Using SAS
To: jdopdyke@gmail.com, mkeintz@wharton.upenn.edu, dynamicpanel@yahoo.com,
sas-l@listserv.uga.edu
In an effort to explain better and, well, to set a better tone, as the SAS-L
listserv is, well, civil 99% of the time:
I only had time to rework your MakeData macro. I'm not sure the listserv will
preserve the formatting:
%Macro MakeData( Strata= , Segments= , Geography= , DSO= ) ;
%Local SList GList K J ;
%Do K = 1 %To &Segments ;
%Let SList = &SList "SEGMENT&K" ;
%End ;
%Do J = 1 %To &Geography ;
%Let GList = &GList "GEOG&J" ;
%End ;
Data &DSO ( Keep = Geography Segment Price ) ;
Length Geography $ 5
Segment $ 8 ;
Do I = 1 To &Strata ;
Do Geography = %SysFunc( TranWrd( %Str(&GList) , %STR( ) , %STR( , ) ) ) ;
Do Segment = %SysFunc( TranWrd( %Str(&SList) , %STR( ) , %STR( , ) ) ) ;
Select ( Segment ) ;
When ( 'SEGMENT1' ) Price = Rand( 'Uniform' ) * 10 * I ;
When ( 'SEGMENT2' ) Price = Rand( 'Normal' ) * 10 * I ;
When ( 'SEGMENT3' ) Price = Rand( 'LogNormal' ) * 10 * I ;
End ;
Output ;
End ;
End ;
End ;
Run ;
%Mend MakeData ;
I also noticed you define %Macros within a %Macro. Not a good idea, as it
forces a recompilation of the interior macros with each execution of the
wrapping macro.
I'd recommend defining those before your %OPDY_Boot macro.
The %NOBS macro can be rewritten to the following:
%Macro Nobs( Data= ) ;
%Local OPEN CLOSE ;
%If %SysFunc( Exist( &Data ) ) %Then %Do ;
%Let OPEN = %Sysfunc( Open( &Data , IS ) ) ;
%SysFunc( Attrn( &OPEN , NLOBS ) )
%Let CLOSE = %SysFunc( Close ( &OPEN ) ) ;
%End ;
%Else %Do ;
%Put ERROR: DataSet [&Data] Does Not Exist!!! ;
%Put ERROR: Number Of Obs Will Be Set To 0!!! ;
0
%End ;
%Mend Nobs ;
Also, doing things like:
proc summary data=byvar_nobs(keep=_FREQ_) nway;
var _FREQ_;
output out=byvar_nobs(keep=max_freq) max=max_freq;
run;
data _null_;
set byvar_nobs;
call symputx('max_freq',max_freq);
run;
makes no sense; just use Proc SQL to do this.
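A minimal sketch of that suggestion follows. BYVAR_NOBS stands in for the PROC
SUMMARY output in the macro; it is built here from SASHELP.CLASS purely so the
example is self-contained, which is an assumption for illustration and not part
of the original code.

```sas
/* Stand-in for the per-stratum counts the macro builds (illustrative data). */
proc summary data=sashelp.class nway;
  class sex;
  var age;
  output out=byvar_nobs(keep=_FREQ_ sex) n=junk;
run;

/* One step replaces the PROC SUMMARY + DATA _NULL_ pair: the size of the
   largest stratum goes straight into a macro variable. */
proc sql noprint;
  select max(_FREQ_) into :max_freq
  from byvar_nobs;
quit;

/* Re-assigning with %LET strips the leading blanks that INTO: can leave. */
%let max_freq = &max_freq.;
%put max_freq = &max_freq.;
```

Unlike the two-step version, this leaves BYVAR_NOBS itself untouched, so the
per-stratum counts remain available afterwards.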
Toby Dunn
"I'm a hell bent 100% Texan til I die"
"Don't touch my Willie, I don't know you that well"
________________________________
Date: Sat, 6 Nov 2010 11:44:58 -0400
Subject: http://www.datamineit.com/DMI_publications.htm -- Much Faster
Bootstraps Using SAS
From: jdopdyke@gmail.com
To: mkeintz@WHARTON.UPENN.EDU; dynamicpanel@YAHOO.COM; SAS-L@LISTSERV.UGA.EDU;
tobydunn@HOTMAIL.COM
Mark,
1)
If you could show me how to get access to post this reply on the LISTSERV, I would
greatly appreciate it. Short of that, would you mind posting this reply for
me? Tody made a number of errors that SAS Users reading his comments should be
made aware of.
2)
I have very much enjoyed reading your “Outperforming SAS® Indices for Sorted
Datasets” paper – thank you for the contribution.
3)
I have attached my faster bootstraps paper to this email. Feel free to take the
code for a spin. Any feedback you may have is welcome, and I’d be happy to
answer any questions you may have about it. I reply to Tody's comments below.
Sincerely,
J.D. Opdyke
Tody,
1)
BY STATEMENT: A “By statement” is, in fact, used in subsequent analyses (see
Proc Univariate on page 13). I did not use it in the Proc Summary following
Proc SurveySelect because using a Class statement instead is a better default: a
By statement will crash if the data happens not to be sorted according to the
order of your By variables, and I’ve found Class to be slightly faster under
certain circumstances. But the two are essentially equivalent, so your
statement about not leveraging “by statements” is incorrect.
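The difference between the two statements can be sketched as follows. This is
an illustration only, using SASHELP.CLASS and made-up output names rather than
anything from the paper.

```sas
/* CLASS: works on unsorted input; NWAY keeps only the full cross-class rows. */
proc summary data=sashelp.class nway;
  class sex;
  var height;
  output out=stats_class mean=mean_height;
run;

/* BY: the input must already be grouped by the BY variables, so sort first.
   Without the sort, SAS stops with an error if the rows are out of order. */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

proc summary data=class_sorted;
  by sex;
  var height;
  output out=stats_by mean=mean_height;
run;
```

Both steps produce one row of statistics per group; the trade-off is the sort
required for BY versus the memory CLASS needs to hold all groups at once.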
2)
EFFICIENCY: “Efficiency” as defined in the paper is real runtimes -- “the speed
with which SAS users can obtain actual results” (page 2). By that criterion, as
a factual matter, my code is more efficient than any other SAS code out there
Tody, including the only Proc (Proc SurveySelect) that allows SAS users to
implement m-out-of-n bootstraps.
You may be a little unfamiliar with advanced SAS Macro code (and maybe
bootstraps), which perhaps is why you cannot follow it. I’ve been coding in SAS
for over 20 years -- once you learn advanced SAS Macro code, you should be able
to follow along as the code is only a few pages long and pretty straightforward
for advanced users (e.g. multiple ampersand resolution in loops, strings of
macro variable names and macro variable values output by proc sql, etc.).
3)
HASHING: You sound like you’re a little new to SAS, so I’ll simply explain (as
in the paper) how/why hashing would be used in this setting. Currently, for
almost all circumstances, there is no faster way to “merge” datasets in SAS than
hashing (technically it's not a “merge,” but the end result is the same), and one
way to conduct bootstraps in SAS is to output the bootstrap samples containing
the record numbers to sample in one dataset, then merge that with the original
dataset that contains a record counter. Hashing can be used to do that faster
than any other method in SAS under almost any circumstances, even efficiently
designed indices. The only drawback is that, because it's all done in memory, if
your memory is too small, Tody, the hash code will abort.
The hashing code to do that is in the two SAS papers I cite: very easy to find
and download (and write, if you know SAS Macro).
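The lookup described in this point can be sketched roughly as follows. This is
not the code from the cited papers: the dataset names ORIG, SAMPLES, and BOOT,
the key RECNO, and the sizes are hypothetical, chosen only to show the shape of
a hash-based "merge" of sampled record numbers back onto the original data.

```sas
/* Hypothetical original data with a record counter. */
data orig;
  call streaminit(1);
  do recno = 1 to 1000;
    price = rand('uniform') * 100;
    output;
  end;
run;

/* Hypothetical bootstrap samples: record numbers drawn with replacement. */
data samples;
  call streaminit(2);
  do replicate = 1 to 10;
    do i = 1 to 1000;
      recno = ceil(rand('uniform') * 1000);
      output;
    end;
  end;
  drop i;
run;

/* Load ORIG into an in-memory hash once, then look up each sampled record. */
data boot;
  if _n_ = 1 then do;
    if 0 then set orig;               /* defines PRICE in the PDV, reads nothing */
    declare hash h(dataset: 'orig');
    h.defineKey('recno');
    h.defineData('price');
    h.defineDone();
  end;
  set samples;
  if h.find() = 0;                    /* keep rows whose key was found */
run;
```

Because the whole of ORIG sits in the hash object, this approach is bounded by
available memory, which is the drawback noted above.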
4)
SUBSTANTIVE FEEDBACK: If you have substantive feedback on the paper, I’d very
much like to hear it, but nothing in what you wrote above is substantive, let
alone correct. Rather than saying “I’ll bet it's inefficient,” just try the code
out, Tody. Then, without the false bravado/insecure bluster, just comment on the
speed of the code. When put in writing in the paper, it's actually quite
embarrassing how the rest of the methods are so completely dominated by the OPDY
algorithm.
5)
PROFANITY: I’m no prude, but we don’t need to hear about your wllie scking –
resorting to profanity is a sign of fear of one’s own ignorance. Maintain your
professionalism, and keep the internet, let alone Listserv, clean, please. A
little class Tody, goes a long way.
6)
A FINAL CHALLENGE:
Also, I’ll put out a challenge to you: write faster bootstrap code and post
it. If you cannot, I think it's time to eat a little crow. Welcome to SAS!
Date: Mon, 1 Nov 2010 18:14:30 -0400
Reply-To: oloolo <dynamicpanel@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: oloolo <dynamicpanel@YAHOO.COM>
Subject: Re: Much Faster Bootstraps Using SAS?
Comments: To: Toby Dunn <tobydunn@HOTMAIL.COM>
I skimmed over the paper. From what I can tell, he doesn't know how to use PROC
SURVEYSELECT to generate bootstrap samples or how to leverage the BY statement
in subsequent analysis.
On Mon, 1 Nov 2010 21:28:32 +0000, toby dunn <tobydunn@HOTMAIL.COM> wrote:
I'm reading it, and from the looks of the code he could use a good programmer to
help him out.
Aside from that, I can't find his hash code in his paper, and I'm still a little
confused about why he would use a hash for this. His macro code sucks to the
point that I bet it is inefficient.
While he may very well be correct, I am skeptical that he is, much less that he
can code efficiently enough to make such a statement.
Toby Dunn
"I'm a hell bent 100% Texan til I die"
"Don't touch my Willie, I don't know you that well"
Date: Mon, 1 Nov 2010 20:46:14 +0000
From: mkeintz@WHARTON.UPENN.EDU
Subject: Much Faster Bootstraps Using SAS?
To: SAS-L@LISTSERV.UGA.EDU
Is anybody on the L familiar with Opdyke's results reported below?
I've just learned of this paper's existence and have not read it yet.
Regards,
Mark
===========================================================================
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1692130
Much Faster Bootstraps Using SAS(r)
J.D. Opdyke
DataMineIt
InterStat, October 2010
Abstract: Seven bootstrap algorithms coded in SAS(r) are compared. The fastest
("OPDY"), which uses no modules beyond Base SAS(r), achieves speed increases
almost two orders of magnitude faster (over 80x faster) than the relevant
"built-in" SAS(r) procedure (Proc SurveySelect). It is even much faster than
hashing, but unlike hashing it requires virtually no storage space, and its
memory usage efficiency allows it to execute bootstraps on input datasets larger
(sometimes by orders of magnitude) than the largest a hash table can use before
aborting. This makes OPDY arguably the only truly scalable bootstrap algorithm
in SAS(r).