Date: Fri, 4 Mar 2005 15:05:44 -0800
From: Dale McLerran
To: "SAS(r) Discussion"
Subject: NLMIXED: REPLICATE statement

--- "adel F." <adel_tangi@YAHOO.FR> wrote:
> Hi,
> I would like to ask the following questions.
> Suppose we have a depedent variable Y(0,1) and thow binary
> independent variables X1(0,1) and
> X2(0,1) and a subject with 10 values I organize my original data,
> called my ata, as a set of cells of combinations of Y(0,1),X1(0,1)
> ,X2(0,1) and Subject (10 values) ..
>
> A new data is created from mydata with 5 columns Subj (Subject)
> Y,X1,X2 and freq
> With freq is the frequency of the combinationY(0,1),X1(0,1) ,X2(0,1)
> and Subject (10 values)
> The final data contains 80 observations (80=2*2*2*10)
>
> I use the following code to produce the parameters and to have a
> table with 10 values for mu_j, 10 for the values of Subj
>
>
> proc nlmixed Gconv=1e-7 QPOINTS=5 data=final;
> parms b0 =2.1 b1=0.1 b2=0.5 s2u=0.08;
> bounds s2u >0;
>
> eta=b0+b1*X1+b2*X2+u;
> beta=u;
> expeta=exp(eta);
> p=expeta/(1+expeta);
> REPLICATE freq;
>
> model Y ~ binary(p);
> random u~normal(0,exp(2*Log(s2u))) subject=Subj;
> predict beta out=resid;
> run;
>
>
> The data final is sorted by Subj,before considering the NLMIXED.
> My questions , is this specification correct, and if the command
> predict beta out=resid;
> will give the 10 values of mu_j
>
> Thanks you very much for your comments and suggestions
>
> Adel
>

Adel addressed a similar question directly to me, noting that when the REPLICATE statement was employed, the number of subjects employed in the analysis was greater than the number of observations in the data set. The following dimension table was reported for Adel's data:

Dimensions

Observations Used          9751
Observations Not Used         0
Total Observations         9751
Subjects                  75978
Max Obs Per Subject          43
Parameters                   15
Quadrature Points             5

The number of observations in the data set was 9,751. Yet the NLMIXED procedure infers 75,978 subjects for those 9,751 observations.

Documentation for the REPLICATE statement says that "The REPLICATE statement provides a way to accommodate models in which different subjects have identical data." Note the reference to the number of SUBJECTS with identical data. Adel's FREQ variable is not the number of SUBJECTS with identical data, but rather the number of records WITHIN A SUBJECT which have identical data. The documentation goes on to state:

"This occurs most commonly when the dependent variable is binary. When you specify a REPLICATE variable, PROC NLMIXED assumes that its value indicates the number of subjects having data identical to those for the current value of the SUBJECT= variable (specified in the RANDOM statement). Only the last observation of the REPLICATE variable for each subject is used, and the replicate variable must have only positive integer values."

When the REPLICATE statement is employed along with a RANDOM statement, then the implication is that we have constructed for each subject a matrix

obs   Y      X1       X2       ...   Xp
1     {y1}   {x1_1}   {x2_1}   ...   {xp_1}
2     {y2}   {x1_2}   {x2_2}   ...   {xp_2}
...
k     {yk}   {x1_k}   {x2_k}   ...   {xp_k}

where values within braces {} are realized values of Y, X1, ..., Xp. Moreover, we assume that we have sorted the data matrix for the i-th subject by the response and all predictor variables. If the entire ordered data matrix is identical for any two subjects, then we have replicate subjects.

Subject replicates would be identified by a process something like the following:

1) Sort data by subject, response, and predictor variables with subject variable indicated first on the sort list.

2) Read all data into memory. Assuming that all data are numeric, then we can store all the data in a temporary array with dimensions N and p+2 where N is the total number of observations across all subjects and p is the number of predictor variables.

3) Construct four arrays of length I, where I is the number of subjects in the data set.
   Array 1 indexes the position from 1 to N of the first record per subject.
   Array 2 indexes whether a subject has identical values with an index subject (to be defined).
   Array 3 contains a list of index subjects.
   Array 4 contains the number of replicates for each index subject.

Index subjects are identified as the first ordered subject belonging to a unique replicate group. A replicate group consists of all subjects who have the same data matrix as an index subject. Initialize arrays 2, 3, and 4 to 0.

4) Starting with the first subject which has not already been identified as a replicate (that has array 2 value=0), loop over all other subjects not already identified as being a replicate (array 2 value=0), comparing the subject-specific data matrices of the first and i-th subjects. The first subject is recorded as the next index subject, has replicate status set to 1, and has number of replicates set to 1. If another subject is identified as having the same data matrix as our index subject, then set that subject's replicate status to 1 and increment the number of replicates for the current index subject.

The first array can be used to a) point to the memory address for the start of each subject's data matrix so that we can quickly return required data, and b) allow a very fast initial assessment of whether two subjects are candidate replicates. Two subjects can only be replicates if they have the same number of observations. Having a pointer to the location of the first observation for each subject would allow us to make an initial determination of whether the number of records for the two subjects is identical. We would only proceed to compare data values if the number of records is identical.

5) After all subjects have replicate status set to 1, then loop back through index subjects and write out their data matrix with frequency from array 4 attached.
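The steps above can be sketched outside SAS in a few lines. The following is only an illustrative Python sketch, not anything NLMIXED does internally: the function name `find_replicate_groups` and the example `records` are hypothetical, and it swaps the four arrays and pairwise comparisons for a dictionary keyed on each subject's sorted data matrix, which identifies the same replicate groups.

```python
def find_replicate_groups(records):
    """Group subjects whose sorted data matrices are identical.

    records: iterable of (subject, y, x1, ..., xp) tuples.
    Returns a list of (index_subject, n_replicates, data_matrix).
    """
    # Steps 1-2: sort by subject, then by response and predictors,
    # and collect each subject's rows into its own data matrix.
    by_subject = {}
    for subject, *values in sorted(records):
        by_subject.setdefault(subject, []).append(tuple(values))

    # Steps 3-4: the first subject seen with a given matrix becomes the
    # index subject; later subjects with the same matrix are replicates.
    groups = {}
    for subject in sorted(by_subject):
        matrix = tuple(by_subject[subject])
        if matrix not in groups:
            groups[matrix] = [subject, 0]
        groups[matrix][1] += 1

    # Step 5: write out each index subject's matrix with its frequency.
    return [(idx, n, list(matrix)) for matrix, (idx, n) in groups.items()]


# Hypothetical data: subjects 1 and 2 have the same rows in a different order.
records = [
    (1, 0, 1, 0), (1, 1, 1, 0),
    (2, 1, 1, 0), (2, 0, 1, 0),
    (3, 1, 0, 0),
]
replicate_groups = find_replicate_groups(records)
```

Here subject 1 becomes an index subject with two replicates (itself and subject 2), and subject 3 forms its own group.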

This seems to me a rather difficult process. It seems highly unlikely to me that anyone would actually pursue this replicate identification. To my mind, it is also unlikely to return much reduction of data volume. Most subjects will have some data which makes them unique.

More profitable for data reduction, and something which is really easy to implement, is just what Adel has described above. We have a subject who has multiple measurements. Many of the measurements on any given subject will have the same values for both response and predictors. We can collapse the data into the frequency with which a particular response/predictor combination is observed for the i-th subject. Now, the likelihood for each within-subject replicate is identical, so we can construct the total log-likelihood contribution across replicates as

(replicate frequency) * (log-likelihood of index replicate)

The NLMIXED procedure allows one to perform this computation, but only if you write your own likelihood model and specify the general(log-likelihood) model rather than using one of the already constructed likelihood models. The following code would work for the problem which Adel faces:

proc nlmixed Gconv=1e-7 QPOINTS=5 data=final;
  parms b0=2.1 b1=0.1 b2=0.5 s2u=0.08;
  bounds s2u > 0;

  eta = b0 + b1*X1 + b2*X2 + u;
  beta = u;
  expeta = exp(eta);
  p = expeta/(1+expeta);
  if Y=1 then loglike = log(p);
  else loglike = log(1-p);
  loglike = freq*loglike;

  model Y ~ general(loglike);
  random u ~ normal(0,exp(2*Log(s2u))) subject=Subj out=U_predict;
run;
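The arithmetic behind the freq*loglike line can be checked numerically. The following Python sketch is an illustration only: the data in `raw`, the value chosen for u, and the helper names `collapse` and `loglike` are all hypothetical. It collapses one subject's records into cells with a frequency and confirms that, for a fixed value of the random effect, the frequency-weighted sum equals the sum over the uncollapsed records.

```python
import math
from collections import Counter


def collapse(records):
    """Count identical (subj, y, x1, x2) rows; return (subj, y, x1, x2, freq)."""
    counts = Counter(records)
    return [key + (freq,) for key, freq in sorted(counts.items())]


def loglike(y, x1, x2, u, b0, b1, b2):
    """Bernoulli log-likelihood of one record, conditional on random effect u."""
    eta = b0 + b1 * x1 + b2 * x2 + u
    p = 1.0 / (1.0 + math.exp(-eta))  # same as exp(eta)/(1+exp(eta))
    return math.log(p) if y == 1 else math.log(1.0 - p)


# Hypothetical records for one subject: (subj, y, x1, x2)
raw = [(1, 1, 0, 1), (1, 1, 0, 1), (1, 1, 0, 1),
       (1, 0, 0, 1), (1, 1, 1, 0), (1, 1, 1, 0)]
cells = collapse(raw)

b = dict(b0=2.1, b1=0.1, b2=0.5)  # starting values from the PARMS statement
u = 0.3                           # an arbitrary value of the random effect

full = sum(loglike(y, x1, x2, u, **b) for _, y, x1, x2 in raw)
weighted = sum(f * loglike(y, x1, x2, u, **b) for _, y, x1, x2, f in cells)
```

Because the equality holds for every value of u, it also holds at each quadrature point, so the collapsed data give the same integrated likelihood as the full data.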

I would welcome any feedback from SI or anyone else who actually believes that the REPLICATE statement of the NLMIXED procedure is of real value. I would note that if there were no random effects involved, then the REPLICATE statement would function the same as my freq*(log-likelihood) computation demonstrated above. Only in that instance might I advise use of the REPLICATE statement.
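The difference between the two approaches can be written out. The notation here is not from the post and is introduced only for illustration: subject i has collapsed cells c = 1, ..., C_i with within-subject frequency f_ic, p_ic(u) is the success probability given the random effect u, phi(u) is the normal density of u, and r_i is the value of the REPLICATE variable read for subject i.

```latex
% Within-subject frequency (the general(loglike) code above):
\ell_i^{\mathrm{freq}}
  = \log \int \prod_{c=1}^{C_i}
      \left[ p_{ic}(u)^{y_{ic}} \bigl(1 - p_{ic}(u)\bigr)^{1 - y_{ic}} \right]^{f_{ic}}
      \phi(u)\, du

% REPLICATE with r_i identical subjects:
\ell_i^{\mathrm{rep}}
  = r_i \log \int \prod_{c=1}^{C_i}
      p_{ic}(u)^{y_{ic}} \bigl(1 - p_{ic}(u)\bigr)^{1 - y_{ic}}
      \phi(u)\, du

% No random effect (no integral, each record is its own subject):
\ell_c = f_c \left[ y_c \log p_c + (1 - y_c) \log (1 - p_c) \right]
```

In the first expression the frequency sits inside the integral, while REPLICATE multiplies the whole integrated log-likelihood. Only when there is no integral over u do the two coincide.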

But then one might inquire why use the procedure NLMIXED at all. Certainly, one would prefer to use PROC GENMOD to PROC NLMIXED for fitting a fixed effect model where the response is any of those in the exponential family which are coded for NLMIXED (binomial, gamma, normal, poisson, negative binomial). If one must solve a general likelihood model that they must code on their own, then it is just as easy when writing the likelihood model to multiply the log-likelihood contribution by the frequency of occurrence as I did for Adel as it is to use the REPLICATE statement.

So, why the REPLICATE statement which is coded for NLMIXED? It just makes very little sense to me. Given that someone, somewhere might be using the REPLICATE statement to specify the number of subjects having identical data matrices when fitting a random effects model, I know that the REPLICATE statement will not be changed. However, I would certainly encourage SI to add a FREQ statement which addresses the problem that Adel presents. That would be truly useful.

Sorry for the long missive. I really would hope though that someone at SI is taking note. I use and advocate the NLMIXED procedure quite often, but the REPLICATE statement is, in my book, just plain silly.

Dale

---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph:  (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
