```
Date: Sun, 6 Sep 2009 14:41:14 -0700
Reply-To: Dale McLerran
Sender: "SAS(r) Discussion"
From: Dale McLerran
Subject: Re: Fastest Steps for Simulating: Anderson-Darling Goodness of Fit test for Non-typical distn
In-Reply-To: <6eca73440909060143s1a03c7faq15a6825921ff4acf@mail.gmail.com>
Content-Type: text/plain; charset=iso-8859-1

By the way, you do remember that if you are computing the A-D
test statistic for an observation assumed to be from distribution
D(x), then you need to change the CDF to match the distribution
which is being tested.  You will recall that this was pointed
out in a previous post:

http://listserv.uga.edu/cgi-bin/wa?A2=ind0908D&L=sas-l&P=R17435

Dale

---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph:  (206) 667-2926
Fax: (206) 667-5977
---------------------------------------

--- On Sun, 9/6/09, OR Stats wrote:

> From: OR Stats
> Subject: Re: Fastest Steps for Simulating: Anderson-Darling Goodness of Fit test for Non-typical distn
> To: SAS-L@LISTSERV.UGA.EDU
> Date: Sunday, September 6, 2009, 1:43 AM
>
> More literally,
>
> NOTE: Argument 4 to function CDF at line 3 column 5 is invalid.
>
> is referring to the function provided
>
> log(cdf('normal',_x&N{i},mu,sd)
>
> The output tables that inserts -1 * sample size for the column of AD or
> answer to
>
> AD = -&N - S;
>
> is i n c o r r e c t, where S is probably set to zero b.c. 'argument 4'
> of CDF is invalid.
>
> On Sat, Sep 5, 2009 at 6:18 AM, OR Stats wrote:
>
> > It now creates the datasets.  But the S column is just all zeros and AD
> > column is all -samplesize (i.e., -50, -100, -200 etc.)
> >
> > the error log now is
> >
> > NOTE: Argument 4 to function CDF at line 3 column 5 is invalid.
> >
> > NOTE: Argument 4 to function CDF at line 3 column 52 is invalid.
> >
> > On Sat, Sep 5, 2009 at 12:39 AM, Dale McLerran wrote:
> >
> >> My mistake.
> >> There was a legacy reference to array X
> >> from when you had first asked how to compute the
> >> A-D test for a distribution which you wish to specify.
> >> We now have four different arrays of various lengths.
> >> The macro should reference the array of the length
> >> currently being simulated.  In order to reference the
> >> correct array, replace the code
> >>
> >>   S + ((2*i - 1)/&N) * (log(cdf('normal',x{i},mu,sd)) +
> >>                         log(1 - cdf('normal',x{&N+1-i},mu,sd)));
> >>
> >> with
> >>
> >>   S + ((2*i - 1)/&N) * (log(cdf('normal',_x&N{i},mu,sd)) +
> >>                         log(1 - cdf('normal',_x&N{&N+1-i},mu,sd)));
> >>
> >> Dale
> >>
> >> ---------------------------------------
> >> Dale McLerran
> >> Fred Hutchinson Cancer Research Center
> >> mailto: dmclerra@NO_SPAMfhcrc.org
> >> Ph:  (206) 667-2926
> >> Fax: (206) 667-5977
> >> ---------------------------------------
> >>
> >> --- On Fri, 9/4/09, OR Stats wrote:
> >>
> >> > From: OR Stats
> >> > Subject: Re: Fastest Steps for Simulating: Anderson-Darling Goodness of
> >> > Fit test for Non-typical distn
> >> > To: SAS-L@LISTSERV.UGA.EDU
> >> > Date: Friday, September 4, 2009, 7:34 PM
> >> >
> >> > cool, good.  The undeclared array is still giving problems
> >> >
> >> > ERROR: Undeclared array referenced: x.
> >> >
> >> > ERROR: Variable x has not been declared as an array.
> >> >
> >> > ERROR: Undeclared array referenced: x.
> >> >
> >> > ERROR: Variable x has not been declared as an array.
> >> >
> >> > 1218  %AD(N=100)
> >> >
> >> > On Fri, Sep 4, 2009 at 9:28 PM, Data _null_; wrote:
> >> >
> >> > > That is incorrect syntax for an iterative DO.  You need
> >> > >
> >> > > do s=5,5.2,5.4;
> >> > >
> >> > > On 9/4/09, OR Stats wrote:
> >> > > > Hmm...
> >> > > > still same error
> >> > > > 1124  do S=[5 5.2 5.4];   /* This line needs correct specification */
> >> > > >   -
> >> > > >   386
> >> > > >   -
> >> > > >   200
> >> > > >
> >> > > > ERROR 386-185: Expecting an arithmetic expression.
> >> > > >
> >> > > > ERROR 200-322: The symbol is not recognized and will be ignored.
> >> > > >
> >> > > > On Fri, Sep 4, 2009 at 9:15 PM, OR Stats wrote:
> >> > > >
> >> > > > > Ok.  Too much coding on a Friday!  Thx!!
> >> > > > >
> >> > > > > On Fri, Sep 4, 2009 at 9:13 PM, Data _null_; wrote:
> >> > > > >
> >> > > > > > From your original post....
> >> > > > > >
> >> > > > > > > 1 Million times using 50, 100, 200, and 300 rows of data at each
> >> > > > > > > iteration for three different values of s (s1, s2, s3)?
> >> > > > > >
> >> > > > > > On 9/4/09, OR Stats wrote:
> >> > > > > > > Not sure what S1 S2 and S3 are referring to?
> >> > > > > > >
> >> > > > > > > On Fri, Sep 4, 2009 at 8:56 PM, Data _null_; wrote:
> >> > > > > > > > Did you notice this comment...
> >> > > > > > > >
> >> > > > > > > >   /* This line needs correct specification */
> >> > > > > > > >
> >> > > > > > > > On 9/4/09, OR Stats wrote:
> >> > > > > > > > > I am getting the following error msg's
> >> > > > > > > > >
> >> > > > > > > > > do S={S1 S2 S3};   /* This line needs correct specification */
> >> > > > > > > > >   -
> >> > > > > > > > >   386
> >> > > > > > > > >   76
> >> > > > > > > > >   --
> >> > > > > > > > >   202
> >> > > > > > > > >
> >> > > > > > > > > ERROR 386-185: Expecting an arithmetic expression.
> >> > > > > > > > >
> >> > > > > > > > > ERROR 76-322: Syntax error, statement will be ignored.
> >> > > > > > > > >
> >> > > > > > > > > ERROR 202-322: The option or parameter is not recognized and will be
> >> > > > > > > > > ignored.
> >> > > > > > > > >
> >> > > > > > > > > ERROR: Undeclared array referenced: x.
> >> > > > > > > > >
> >> > > > > > > > > ERROR: Variable x has not been declared as an array.
> >> > > > > > > > >
> >> > > > > > > > > And what is S for as the 2nd statement of ranuni(p,S)?
> >> > > > > > > > >
> >> > > > > > > > > On Sun, Aug 30, 2009 at 11:19 PM, Dale McLerran wrote:
> >> > > > > > > > >
> >> > > > > > > > > > One million times?  Why?  I really think that is overkill.
> >> > > > > > > > > > I would try to cover more parameter combinations if it were
> >> > > > > > > > > > me.
> >> > > > > > > > > >
> >> > > > > > > > > > But you should be able to use a single data step to generate
> >> > > > > > > > > > A-D statistics for all of your parameter combinations.
> >> > > > > > > > > > The code below should be pretty efficient.
> >> > > > > > > > > >
> >> > > > > > > > > > %macro AD(N=);
> >> > > > > > > > > >   do i=1 to &N;
> >> > > > > > > > > >     /* The next line needs completion with the appropriate G */
> >> > > > > > > > > >     _x&N{i} = G(ranuni(6923479,S));
> >> > > > > > > > > >   end;
> >> > > > > > > > > >
> >> > > > > > > > > >   call sortn(of _X&N(*));
> >> > > > > > > > > >   mu = mean(of x1-x&N);
> >> > > > > > > > > >   var = var(of x1-x&N);
> >> > > > > > > > > >   sd = sqrt(var);
> >> > > > > > > > > >   S=0;
> >> > > > > > > > > >   do i=1 to &N;
> >> > > > > > > > > >     S + ((2*i - 1)/&N) * (log(cdf('normal',x{i},mu,sd)) +
> >> > > > > > > > > >                           log(1 - cdf('normal',x{&N+1-i},mu,sd)));
> >> > > > > > > > > >   end;
> >> > > > > > > > > >   AD = -&N - S;
> >> > > > > > > > > >   output AD_&N;
> >> > > > > > > > > > %mend;
> >> > > > > > > > > >
> >> > > > > > > > > > /* Generate 10000 samples of same size (N=9 in this case) following */
> >> > > > > > > > > > /* a normal distribution and compute AD statistic for each sample.  */
> >> > > > > > > > > > data AD_50
> >> > > > > > > > > >      AD_100
> >> > > > > > > > > >      AD_200
> >> > > > > > > > > >      AD_300;
> >> > > > > > > > > >   array _x50 {50} x1-x50;
> >> > > > > > > > > >   array _x100 {100} x1-x100;
> >> > > > > > > > > >   array _x200 {200} x1-x200;
> >> > > > > > > > > >   array _X300 {300} x1-x300;
> >> > > > > > > > > >   do S={S1 S2 S3};   /* This line needs correct specification */
> >> > > > > > > > > >     do rep=1 to 10000;
> >> > > > > > > > > >       %AD(N=50)
> >> > > > > > > > > >       %AD(N=100)
> >> > > > > > > > > >       %AD(N=200)
> >> > > > > > > > > >       %AD(N=300)
> >> > > > > > > > > >     end;
> >> > > > > > > > > >   end;
> >> > > > > > > > > >   keep S AD;
> >> > > > > > > > > > run;
> >> > > > > > > > > >
> >> > > > > > > > > > /* Determine probability of observed data */
> >> > > > > > > > > > /* using simulated data AD distribution.  */
> >> > > > > > > > > > proc sort data=AD_50;
> >> > > > > > > > > >   by S AD;
> >> > > > > > > > > > run;
> >> > > > > > > > > >
> >> > > > > > > > > > proc sort data=AD_100;
> >> > > > > > > > > >   by S AD;
> >> > > > > > > > > > run;
> >> > > > > > > > > >
> >> > > > > > > > > > proc sort data=AD_200;
> >> > > > > > > > > >   by S AD;
> >> > > > > > > > > > run;
> >> > > > > > > > > >
> >> > > > > > > > > > proc sort data=AD_300;
> >> > > > > > > > > >   by S AD;
> >> > > > > > > > > > run;
> >> > > > > > > > > >
> >> > > > > > > > > > The above is untested code and should be tested with a
> >> > > > > > > > > > small number of replicates before using it for a final
> >> > > > > > > > > > simulation.
> >> > > > > > > > > > Also, there will obviously need to be some
> >> > > > > > > > > > final step where you determine the quantiles of the AD
> >> > > > > > > > > > statistics.
> >> > > > > > > > > >
> >> > > > > > > > > > Dale
> >> > > > > > > > > >
> >> > > > > > > > > > ---------------------------------------
> >> > > > > > > > > > Dale McLerran
> >> > > > > > > > > > Fred Hutchinson Cancer Research Center
> >> > > > > > > > > > mailto: dmclerra@NO_SPAMfhcrc.org
> >> > > > > > > > > > Ph:  (206) 667-2926
> >> > > > > > > > > > Fax: (206) 667-5977
> >> > > > > > > > > > ---------------------------------------
> >> > > > > > > > > >
> >> > > > > > > > > > --- On Sun, 8/30/09, OR Stats wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > From: OR Stats
> >> > > > > > > > > > > Subject: Fastest Steps for Simulating: Anderson-Darling Goodness of Fit test for Non-typical distn
> >> > > > > > > > > > > To: SAS-L@LISTSERV.UGA.EDU
> >> > > > > > > > > > > Date: Sunday, August 30, 2009, 8:14 PM
> >> > > > > > > > > > >
> >> > > > > > > > > > > This is good.  I am ready now to run a large scale simulation.  What that
> >> > > > > > > > > > > means is that I want to compute the goodness of fit statistic for (M x
> >> > > > > > > > > > > S) groups and n times each group.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Group defined by (m,s); S = s1 s2 s3 and M = 50 100 200 300.  Basically,
> >> > > > > > > > > > > M is my different sample sizes for which I am testing their fit to
> >> > > > > > > > > > > function G(random#,s) (i.e., inverse distribution).
> >> > > > > > > > > > > I would like to run each group 1 million times.  For each s group, by
> >> > > > > > > > > > > generating random numbers just by 300 x 1 million times, I'll have enough
> >> > > > > > > > > > > simulated data y(s) to use for the largest and smaller sample sizes.
> >> > > > > > > > > > >
> >> > > > > > > > > > > My final column space would look like
> >> > > > > > > > > > >
> >> > > > > > > > > > > i   ranuni   y_s1=G(ranuni,s1)   y_s2=G(ranuni,s2)   y_s3=G(ranuni,s3)
> >> > > > > > > > > > > 1
> >> > > > > > > > > > > .
> >> > > > > > > > > > > .
> >> > > > > > > > > > > .
> >> > > > > > > > > > > m
> >> > > > > > > > > > >
> >> > > > > > > > > > > All rows in the above table would be used to calculate function f_s1,
> >> > > > > > > > > > > f_s2, f_s3 (i.e., AD).  This last step is repeated 1 Million times.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Can we do this in one to two DATA STEPS?  Which syntax would be fastest
> >> > > > > > > > > > > since we have to generate 300 Million random numbers, from which we would
> >> > > > > > > > > > > split the sample by 1 Million disjoint sets that we would then compute a
> >> > > > > > > > > > > statistic 1 Million times using 50, 100, 200, and 300 rows of data at
> >> > > > > > > > > > > each iteration for three different values of s (s1, s2, s3)?
> >> > > > > > > > > > >
> >> > > > > > > > > > > Thank Q!
```
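
For reference, the quantity that the macro's inner loop accumulates in the thread is the standard Anderson-Darling statistic. With the sample sorted in ascending order as $x_{(1)} \le \dots \le x_{(n)}$ and $F$ the CDF of the distribution being tested (the fitted normal CDF in the posted code; per Dale's note it must be swapped for the CDF of whatever distribution is actually under test), the statistic is:

$$
A^2 = -n - \sum_{i=1}^{n} \frac{2i-1}{n}\,\Bigl[\ln F\bigl(x_{(i)}\bigr) + \ln\bigl(1 - F\bigl(x_{(n+1-i)}\bigr)\bigr)\Bigr]
$$

In the thread's code, $n$ is `&N`, the sum is the accumulator `S`, and the final line `AD = -&N - S;` is the leading $-n$ minus the sum. If every `log(cdf(...))` term evaluates to missing, the sum statement leaves the accumulator at zero, which reproduces the reported symptom of `AD` equal to minus the sample size.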

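The fixes discussed in the thread can be collected into one sketch. This is untested and rests on several assumptions that are not in the original posts: the unknown generator `G` is replaced by a placeholder normal inverse CDF with standard deviation `S`, the accumulator is renamed to the hypothetical `A2sum` so it no longer shares a name with the `DO` loop variable `S` (in the posted macro, `S=0;` overwrites the loop parameter), `rand('uniform')` with `call streaminit` stands in for the two-argument `ranuni` call, and the replicate count, percentile points, and output data set name `AD_50_q` are illustrative only.

```sas
%macro AD(N=);
  /* Draw &N values. G is unspecified in the thread; a normal quantile  */
  /* with standard deviation S is an ASSUMED placeholder for it.        */
  do i = 1 to &N;
    _x&N{i} = quantile('normal', rand('uniform'), 0, S);
  end;

  /* Sort the current array and estimate the parameters from it */
  call sortn(of _x&N{*});
  mu = mean(of _x&N{*});
  sd = std(of _x&N{*});

  /* A2sum is a hypothetical name, kept separate from loop variable S */
  A2sum = 0;
  do i = 1 to &N;
    A2sum + ((2*i - 1)/&N) * (log(cdf('normal', _x&N{i}, mu, sd)) +
                              log(1 - cdf('normal', _x&N{&N+1-i}, mu, sd)));
  end;
  AD = -&N - A2sum;
  output AD_&N;
%mend AD;

data AD_50 AD_100 AD_200 AD_300;
  array _x50  {50}  x1-x50;
  array _x100 {100} x1-x100;
  array _x200 {200} x1-x200;
  array _x300 {300} x1-x300;
  call streaminit(6923479);
  do S = 5, 5.2, 5.4;        /* iterative DO syntax given by Data _null_ */
    do rep = 1 to 100;       /* small replicate count for a first test   */
      %AD(N=50)
      %AD(N=100)
      %AD(N=200)
      %AD(N=300)
    end;
  end;
  keep S AD;
run;

/* One possible final step: upper quantiles of AD for each S (N=50 shown) */
proc univariate data=AD_50 noprint;
  by S;
  var AD;
  output out=AD_50_q pctlpts=90 95 99 pctlpre=P;
run;
```

As Dale points out at the top of the thread, the `cdf('normal', ...)` calls only make sense while the placeholder generator is normal; once the real `G` is plugged in, the CDF inside the loop has to be changed to match the distribution being tested.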