Date: Fri, 15 Dec 2006 16:26:40 -0800
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Jackknifing for fun and profit!
Content-Type: text/plain; format=flowed
SAS-Lers everywhere:
As you may have noticed, the subject of producing a jackknife data
set for computing a jackknife estimate has come up. The basic idea
is that if you have N records, you analyze the data N times, and
the Ith analysis omits the Ith record. The N leave-one-out results
then give you a linearization of the behavior of the statistic of
interest.
This means you end up with a data set with N*(N-1) rows: N
replicates of N-1 records each. As N gets big, this gets ridiculously
unwieldy. At N=1000, you're making a data set that has nearly a
million rows (999,000, to be exact). So how you construct the data
set starts to matter.
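Here's the code I showed Marina:

   data outb;
      /* NUM is filled in from NOBS= when the step compiles */
      do replicate = 1 to num;
         do rec = 1 to num;
            set test nobs=num point=rec;
            /* replicate I keeps every record except the Ith */
            if replicate ^= rec then output;
         end;
      end;
      stop;   /* POINT= never hits end-of-file, so stop by hand */
   run;
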
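
A rough sketch of what one might do with OUTB once it exists, since that is the point of building it. This is an illustration only, not code from the thread: the TEST data set is made up, the names ID, X, JKSTATS, and THETA_I are hypothetical, and the sample mean stands in for the statistic of interest. The standard error uses the usual delete-one jackknife formula, sqrt((N-1)/N * sum of squared deviations of the replicate estimates).

```sas
/* Hypothetical TEST data set: 10 records, one analysis variable X */
data test;
  do id = 1 to 10;
    x = id**2;
    output;
  end;
run;

/* Build the jackknife data set with the POINT= loop shown above */
data outb;
  do replicate = 1 to num;
    do rec = 1 to num;
      set test nobs=num point=rec;
      if replicate ^= rec then output;
    end;
  end;
  stop;
run;

/* One estimate per replicate; OUTB is already in REPLICATE order */
proc means data=outb noprint;
  by replicate;
  var x;
  output out=jkstats mean=theta_i;
run;

/* Jackknife standard error: sqrt( (N-1)/N * CSS of the N estimates ) */
proc sql;
  select count(*)      as n_reps,
         mean(theta_i) as theta_bar,
         sqrt( (count(*) - 1) / count(*) * css(theta_i) ) as jk_se
    from jkstats;
quit;
```

The BY statement in PROC MEANS is where having OUTB sorted by REPLICATE pays off: no PROC SORT is needed in between.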
But there are other ways to generate the OUTB data file,
given the starting data set TEST.
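Here's a PROC SURVEYSELECT method. You knew I was going
to go there sooner or later, didn't you?

   /* N goes into the macro variable SIZE */
   proc sql noprint;
      select count(*) into :size from test;
   quit;

   /* a 100% sample, repeated N times: N full copies of TEST */
   proc surveyselect data=test out=outb1 method=srs samprate=1 rep=&SIZE. ;
   run;

   /* drop one record per replicate (a different one each time) */
   data outb / view=outb;
      set outb1;
      if replicate=mod(_n_,&SIZE.)+1 then delete;
   run;
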
This works because the proc spots that the sample will have
to pull every record, so it just outputs all the records. In order.
For each replicate.
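
Since the MOD() trick leans entirely on that ordering, a quick sanity check may be worth running before trusting it. The sketch below is a hedged illustration: the five-record TEST data set and the names ID, X, OUTB_CHK, and ROWS_KEPT are made up, and the filter is written to a table rather than a view so it can be counted directly.

```sas
/* Hypothetical TEST data set, made up for illustration */
data test;
  do id = 1 to 5;
    x = id * 10;
    output;
  end;
run;

proc sql noprint;
  select count(*) into :size from test;
quit;

proc surveyselect data=test out=outb1 method=srs samprate=1
                  rep=&SIZE. noprint;
run;

/* OUTB_CHK is a hypothetical name, so any existing OUTB is left alone */
data outb_chk;
  set outb1;
  if replicate = mod(_n_, &SIZE.) + 1 then delete;
run;

/* If the copies arrive in order, every replicate should keep N-1 rows */
proc sql;
  select replicate, count(*) as rows_kept
    from outb_chk
    group by replicate;
quit;
```

Under that ordering, replicate I drops record I-1, and replicate 1 drops record N. That is still a valid jackknife, just relabeled. If you would rather have replicate I drop record I, the condition replicate = mod(_n_-1, &SIZE.)+1 should do it under the same assumption.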
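And you can do this with PROC SQL too, of course. Here's one
way.

   /* number the records so the join can tell them apart */
   data test2 / view=test2;
      set test;
      rec=_n_;
   run;

   /* pair every record with every other record; A supplies REPLICATE */
   proc sql noprint;
      create table outb as
         select a.rec as replicate, b.*
         from test2 a, test2 b
         where a.rec^=b.rec;
   quit;
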
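
One caution on the SQL route: PROC SQL makes no promise about row order unless you ask for one, so the table above may or may not come out grouped by REPLICATE. A minimal sketch of a variant that asks for it follows. The TEST data set and the names ID, X, and OUTB_SORTED are hypothetical, and the ORDER BY adds a sort, so this is unlikely to be the fast answer.

```sas
/* Hypothetical TEST data set, made up for illustration */
data test;
  do id = 1 to 5;
    x = id * 10;
    output;
  end;
run;

/* Same numbering view as above */
data test2 / view=test2;
  set test;
  rec = _n_;
run;

/* ORDER BY guarantees REPLICATE order; DROP= removes the helper column */
proc sql noprint;
  create table outb_sorted(drop=rec) as
    select a.rec as replicate, b.*
      from test2 a, test2 b
      where a.rec ^= b.rec
      order by replicate;
quit;
```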
And you can try using the SASFILE statement to hold TEST in memory
and speed things up, although SAS tries to buffer the input data set
anyway, so there is not much advantage for small files.
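
For anyone who has not used it, here is a minimal sketch of how SASFILE might wrap the POINT= version. The TEST data set and its variables ID and X are made up for illustration, and whether it helps at all will depend on your system and the size of TEST.

```sas
/* Hypothetical TEST data set, made up for illustration */
data test;
  do id = 1 to 100;
    x = id * 10;
    output;
  end;
run;

/* Read all of WORK.TEST into memory and keep it open */
sasfile test load;

/* Every POINT= read is now served from the in-memory copy */
data outb;
  do replicate = 1 to num;
    do rec = 1 to num;
      set test nobs=num point=rec;
      if replicate ^= rec then output;
    end;
  end;
  stop;
run;

/* Release the memory when done */
sasfile test close;
```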
So here's the question. Can you come up with a faster way of
building the OUTB dataset so that it comes out already sorted
by the value of REPLICATE? By the nature of the process, it
does not have to be sorted within each value of REPLICATE,
unless you just want it that way. Feel free to make up your
own TEST data set as a starting point. This is supposed to be
a general solution, so if I offer a single TEST data set, that could
bias the results.
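
If you want a starting point for timing, something like the following sketch could serve. Everything in it is an assumption chosen for illustration: the macro variable N, the size of 1000, the seed, and the variables ID, X, and Y. It is not an official benchmark data set.

```sas
%let n = 1000;        /* hypothetical size; change to taste */

options fullstimer;   /* ask the log for detailed time and memory notes */

/* Made-up TEST data set with N records */
data test;
  do id = 1 to &n.;
    x = ranuni(20061215);
    y = ranuni(20061215);
    output;
  end;
run;

/* Candidate method goes here; the POINT= loop is shown as a baseline */
data outb;
  do replicate = 1 to num;
    do rec = 1 to num;
      set test nobs=num point=rec;
      if replicate ^= rec then output;
    end;
  end;
  stop;
run;

/* Row-count check: should equal N*(N-1) */
proc sql;
  select count(*) as n_rows from outb;
quit;
```

Compare the real time and CPU time notes in the log across methods, and try more than one value of N, since the ranking may change as the file grows.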
Just a little something since I didn't buy you a holiday gift,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330