This very crude test suggests that you might save about the same
amount of memory by using an associative array (hash object) to hold the
bootstrap results as you would by using macro variables (roughly 3.6 MB
for either, against about 4.4 MB for the plain DATA step in the log below).
With the hash you won't have to convert the macro variables back to a SAS
data set, you will be able to accommodate any BY variable data type, and
there will be no loss of precision for the various statistics.
3235 data _null_;
3236 array b[10000];
3237 array c[10000] $8;
3238 set sashelp.class;
3239 call symputX(cats('name', _n_),name);
3240 call symputX(cats('sex', _n_),sex);
3241 call symputX(cats('age', _n_),age);
3242 call symputX(cats('weight',_n_),weight);
3243 call symputX(cats('height',_n_),height);
3244 run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
user cpu time 0.03 seconds
system cpu time 0.00 seconds
Memory 3626k
OS Memory 10888k
Timestamp 11/8/2010 10:52:27 AM
3245 data _null_;
3246 array b[10000];
3247 array c[10000] $8;
3248 if _n_ = 1 then do;
3249 declare hash class();
3250 class.definekey('_N_');
3251 class.definedata('name','sex','age','weight','height');
3252 class.definedone();
3253 end;
3254 set sashelp.class end=eof;
3255 class.add();
3256 if eof then class.output(dataset:'class');
3257 run;
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
user cpu time 0.03 seconds
system cpu time 0.01 seconds
Memory 3620k
OS Memory 10868k
Timestamp 11/8/2010 10:52:27 AM
3258 data class(drop=b: c:);
3259 array b[10000];
3260 array c[10000] $8;
3261 set sashelp.class;
3262 run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
user cpu time 0.03 seconds
system cpu time 0.01 seconds
Memory 4390k
OS Memory 12432k
Timestamp 11/8/2010 10:52:27 AM
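For what it's worth, the same idea could be applied to the bootstrap
itself. The following is only a minimal sketch, not the OPDY macro: it
assumes a single BY variable (SEX), a single analysis variable (HEIGHT)
from SASHELP.CLASS, 500 replicates, and at most 100 observations per
stratum. The names CLASS_SORTED, BOOT, RESULTS, X, and M are made up for
illustration. Each stratum is read into a _TEMPORARY_ array, resampled
with replacement, and its summary statistics are added to a hash object
that is written out once at the end of the step.

```sas
/* Sketch only: bootstrap the mean of HEIGHT within each SEX stratum   */
/* and collect the per-stratum results in a hash object.               */
/* All names and sizes here are illustrative assumptions.              */
proc sort data=sashelp.class out=class_sorted;
   by sex;
run;

data _null_;
   array x[100] _temporary_;   /* values of the current stratum        */
   array m[500] _temporary_;   /* one bootstrap mean per replicate     */

   if _n_ = 1 then do;
      declare hash boot(ordered:'a');
      boot.definekey('sex');
      boot.definedata('sex', 'n', 'boot_mean', 'boot_std');
      boot.definedone();
      call streaminit(12345);  /* arbitrary seed for reproducibility   */
   end;

   /* Read one whole BY group into the temporary array */
   n = 0;
   do until (last.sex);
      set class_sorted end=eof;
      by sex;
      n + 1;
      x[n] = height;
   end;

   /* Resample the stratum with replacement, once per replicate */
   do r = 1 to dim(m);
      total = 0;
      do i = 1 to n;
         total + x[ceil(n * rand('uniform'))];
      end;
      m[r] = total / n;
   end;

   /* Summarize the replicates and store them under the BY value */
   boot_mean = mean(of m[*]);
   boot_std  = std(of m[*]);
   rc = boot.add();

   /* Write every stratum's results in one pass at the end */
   if eof then rc = boot.output(dataset:'results');
run;
```

RESULTS should end up with one row per value of SEX. Because the
statistics go straight from the hash to the data set, nothing is
round-tripped through macro variables.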
On Mon, Nov 8, 2010 at 10:20 AM, J.D. Opdyke <jdopdyke@gmail.com> wrote:
> Søren,
> Thank you very much for taking the time to comment. I address your
> comments in my previous reply to our colleague Mark Keintz:
>
> "And using data _null_ on the main data step is important at the margin when
> bootstrapping large datasets: using data _null_ and saving the results in
> cumulated macro variables rather than using a simple output statement saves
> memory – again, only noticeable when using large datasets. Taking the
> cumulated macro variable approach allows the code to run on even larger
> input datasets without crashing (I even comment that in the code: “*** save
> results of each stratum in cumulated macro variables instead of outputting
> to a dataset on the data step to lessen intermediate memory requirements
> ***;” -- I cannot fathom how users/readers would think that I wouldn’t have
> had a good reason for not using a simple output statement). "
>
> So while it's a subjective assessment, I do not believe the code is
> clumsy -- it is very purposeful (when reviewing advanced code, I always make
> the presumption that there's a good reason somebody, who's tested the code
> much more extensively than I probably ever will, did what they did; not
> always true, but often true), allowing the OPDY algorithm to run on even
> larger datasets. The goal of my code is speed and scalability, while code
> readability and understandability (especially by non-experts) are
> second-order considerations -- not trivial by any means, but not on the same level as
> speed (real runtimes) and scalability. And while that's obviously not the
> only way to measure efficiency, I explicitly state that multiple times in
> the paper.
> Thank you again for your comments -- I'd greatly appreciate any additional
> thoughts or feedback you may have.
> Sincerely,
>
> J.D. Opdyke
>
> ============================================
> J.D. Opdyke, Managing Director-Quantitative Strategies
> DataMineIt
> 17 McKinley Road
> Marblehead, MA 01945
> phone: 617-943-6463, 781-639-6463
> fax: 781-639-6463
> email: JDOpdyke@DataMineIt.com
> web: www.DataMineIt.com
> ============================================
>
> #########################################################
> Statement of Confidentiality
> The information contained in this electronic message, and any attachments to
> this message, are intended for the exclusive use of the addressee(s), may
> contain confidential or privileged information, and are protected by
> law. If you are not the intended recipient, please notify J.D. Opdyke as
> soon as possible at (617) 943-6463 and JDOpdyke@DataMineIt.com and destroy
> all copies of this message and any attachments. Any disclosure, copying, or
> distribution of this message, or the taking of any action based on it, is
> strictly prohibited.
> #########################################################
>
>
> Mark,
>
> Thank you so much for your substantive comments. Below are substantive
> replies to each of them:
>
>
>
> 1) SAS Users turn to _TEMPORARY_ arrays because they can GREATLY speed
> execution time and preserve a HUGE amount of memory when working with large
> arrays because names for each of the array cells are not created; that is
> especially important here since my algorithm is memory intensive (a very
> good paper documenting this is “Getting Started with the DATA Step Hash
> Object”, Secosky and Bloom, SAS Institute, March 29, 2007, if you haven’t
> already seen it).
>
>
>
> 2) The point of “all the CUMPREVFREQ stuff” was not, PER SE, to put
> first in first, second in second, etc., although it accomplishes that
> objective. Rather, the point was to reuse cells of the array when changing
> BY VARIABLE value groups so that the size of the array would be only
> &maxfreq and not N (where N is #obs in dataset). Depending on the structure
> of the dataset and its strata/by variables, this can dramatically increase
> the robustness of the algorithm, allowing it to run on even larger datasets
> than it would otherwise.
>
>
>
> 3) I implemented your code (see .log file below) and the only “macro stuff”
> that can be excluded is a single data step on a small dataset and a single
> proc sql on a small dataset, both of which take no time at all. You still
> have to obtain the _FREQ_’s from the Proc Summary because the strata could
> have (typically would have) different numbers of observations, and those
> _FREQ_’s are the only way to do the random number generation with the rand()
> function. And using data _null_ on the main data step is important at the
> margin when bootstrapping large datasets: using data _null_ and saving the
> results in cumulated macro variables rather than using a simple output
> statement saves memory – again, only noticeable when using large datasets.
> Taking the cumulated macro variable approach allows the code to run on even
> larger input datasets without crashing (I even comment that in the code:
> “*** save results of each stratum in cumulated macro variables instead of
> outputting to a dataset on the data step to lessen intermediate memory
> requirements ***;” -- I cannot fathom how users/readers would think that I
> wouldn’t have had a good reason for not using a simple output statement).
>
>
>
> 4) In the end, when OPDY_Boot and DOW are run on large datasets, the former
> is about twice as fast (see .log file below). On smaller datasets the
> runtimes are essentially the same because the approaches are so similar, but
> the larger the dataset, the more notable the speed advantage of OPDY, at
> least in absolute terms. And if real runtimes are the issue, as they
> usually are, then large datasets are the only time that fast, scalable
> bootstraps are crucial.
>
>
>
> So I think I can safely say that OPDY retains its crown, although DOW is a
> very decent second, Mark.
>
>
>
> I again greatly appreciate your input, and shall happily answer any
> additional questions you may have about OPDY and the paper.
>
>
>
> Very truly yours,
>
>
>
> J.D.
>
>
>
> ---------- Forwarded message ----------
> From: Data _null_; <iebupdte@gmail.com>
> Date: Mon, Nov 8, 2010 at 8:46 AM
> Subject: Re: Much Faster Bootstraps Using SAS?
> To: JDOpdyke@datamineit.com, SAS-L@listserv.uga.edu
>
>
> For the times shown in
>
> Appendix B
> Table B1: Real and CPU Runtimes (minutes) of the Algorithms for
> Various N, #strata, n, and m
>
> Are these the times for each macro to complete?
>
> On Mon, Nov 8, 2010 at 3:28 AM, Søren Lassen <s.lassen@post.tele.dk> wrote:
>> Looking through the code samples, I think the core idea is quite sound:
>> 1. Loop through each by-group and place all values of the variable
>> of interest in an array.
>> 2. When the array is filled, calculate the mean of a number of random
>> elements from the array, put the mean into another array.
>> 3. Repeat (2) as many times as wanted, until the second array is filled.
>> 4. Calculate "meta-statistics" (eg. the standard deviation, the mean,
>> and various fractiles of the mean values) from the second array, and
>> output that.
>>
>> The actual "OPDY" macro is unfortunately messed up with a lot of
>> unnecessary code to put the whole output dataset into macro variables
>> and then put it back into a dataset - as far as I can see, many lines
>> of code can be replaced with a simple output statement. The preliminary
>> code is also somewhat clumsy in my opinion.
>>
>> But I think the author may be right in postulating that his idea is
>> running circles around PROC SURVEYSELECT and other methods. The most
>> obvious alternatives are to create a sample dataset and
>> calculate statistics and then meta-statistics for that, or to sample the
>> original dataset by random access. Both methods will need a lot more
>> disk access, and therefore time, than the method described.
>>
>> Regards,
>> Søren
>>
>> On Mon, 1 Nov 2010 20:46:14 +0000, Keintz, H. Mark
>> <mkeintz@WHARTON.UPENN.EDU> wrote:
>>
>>>Is anybody on the L familiar with Opdyke's results reported below?
>>>
>>>I've just learned of this paper's existence and have not read it yet.
>>>
>>>Regards,
>>>Mark
>>>
>>>
>>
>> ================================================================================================
>>>http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1692130
>>>
>>>Much Faster Bootstraps Using SAS(r)
>>>
>>>J.D. Opdyke
>>>DataMineIt
>>>
>>>
>>>
>>>InterStat, October 2010
>>>
>>>Abstract:
>>>Seven bootstrap algorithms coded in SAS(r) are compared. The fastest
>>>("OPDY"), which uses no modules beyond Base SAS(r), achieves speed
>>>increases almost two orders of magnitude faster (over 80x faster) than the
>>>relevant "built-in" SAS(r) procedure (Proc SurveySelect). It is even much
>>>faster than hashing, but unlike hashing it requires virtually no storage
>>>space, and its memory usage efficiency allows it to execute bootstraps on
>>>input datasets larger (sometimes by orders of magnitude) than the largest a
>>>hash table can use before aborting. This makes OPDY arguably the only truly
>>>scalable bootstrap algorithm in SAS(r).
>>
>
>
>
>