--- "adel F." <adel_tangi@YAHOO.FR> wrote:
> Hi,
> I would like to ask the following questions.
> Suppose we have a dependent variable Y(0,1), two binary
> independent variables X1(0,1) and
> X2(0,1), and a subject variable with 10 values. I organize my original
> data, called mydata, as a set of cells of combinations of Y(0,1),
> X1(0,1), X2(0,1) and Subject (10 values).
>
> A new data set is created from mydata with 5 columns: Subj (Subject),
> Y, X1, X2 and freq,
> where freq is the frequency of the combination Y(0,1), X1(0,1), X2(0,1)
> and Subject (10 values).
> The final data set contains 80 observations (80=2*2*2*10).
>
> I use the following code to produce the parameters and to have a
> table with 10 values for mu_j, 10 for the values of Subj
>
>
> proc nlmixed Gconv=1e-7 QPOINTS=5 data=final;
> parms b0 =2.1 b1=0.1 b2=0.5 s2u=0.08;
> bounds s2u >0;
>
> eta=b0+b1*X1+b2*X2+u;
> beta=u;
> expeta=exp(eta);
> p=expeta/(1+expeta);
> REPLICATE freq;
>
> model Y ~ binary(p);
> random u~normal(0,exp(2*Log(s2u))) subject=Subj;
> predict beta out=resid;
> run;
>
>
> The data set final is sorted by Subj before running NLMIXED.
> My questions: is this specification correct, and will the command
> predict beta out=resid;
> give the 10 values of mu_j?
>
> Thank you very much for your comments and suggestions
>
> Adel
>
Adel addressed a similar question directly to me, noting that
when the REPLICATE statement was employed, the number of subjects
used in the analysis was greater than the number of observations
in the data set. The following dimension table was reported for
Adel's data:

               Dimensions

   Observations Used              9751
   Observations Not Used             0
   Total Observations             9751
   Subjects                      75978
   Max Obs Per Subject              43
   Parameters                       15
   Quadrature Points                 5

The number of observations in the data set was 9751. Yet the
NLMIXED procedure reports 75,978 subjects for those 9751
observations.
Documentation for the REPLICATE statement says that "The REPLICATE
statement provides a way to accommodate models in which different
subjects have identical data." Note the reference to the number
of SUBJECTS with identical data. Adel's FREQ variable is not the
number of SUBJECTS with identical data, but rather the number of
records WITHIN A SUBJECT which have identical data. The
documentation goes on to state:

   "This occurs most commonly when the dependent variable is
   binary. When you specify a REPLICATE variable, PROC NLMIXED
   assumes that its value indicates the number of subjects
   having data identical to those for the current value of the
   SUBJECT= variable (specified in the RANDOM statement). Only
   the last observation of the REPLICATE variable for each
   subject is used, and the replicate variable must have only
   positive integer values."

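Read literally, the quoted passage implies that the Subjects count is the sum, over the values of the SUBJECT= variable, of the last REPLICATE value seen for each one. A minimal sketch of how one might check that against the dimension table, assuming the data set is named final, is sorted by Subj, and carries the count in freq as in Adel's posted code. The variable implied_subjects is made up for illustration, and this has not been run against Adel's data.

```sas
/* Sum the LAST freq value within each Subj; per the documentation    */
/* quoted above, only that last value is used as the replicate count. */
data _null_;
   set final end=last_record;
   by Subj;                                    /* requires data sorted by Subj */
   if last.Subj then implied_subjects + freq;  /* sum statement: retained, starts at 0 */
   if last_record then
      put 'Subjects implied by REPLICATE freq: ' implied_subjects;
run;
```

If the total written to the log matches the Subjects line of the dimension table, then freq is being read as a count of whole subjects rather than as a count of records within a subject.
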
When the REPLICATE statement is employed along with a RANDOM
statement, then the implication is that we have constructed for
each subject a matrix

   obs     Y       X1       X2     ...     Xp
    1     {y1}   {x1_1}   {x2_1}   ...   {xp_1}
    2     {y2}   {x1_2}   {x2_2}   ...   {xp_2}
   ...
    k     {yk}   {x1_k}   {x2_k}   ...   {xp_k}

where values within braces {} are realized values of Y, X1, ..., Xp.
Moreover, we assume that we have sorted the data matrix for the
i-th subject by the response and all predictor variables. If the
entire ordered data matrix is identical for any two subjects, then
we have replicate subjects.
Subject replicates would be identified by a process something like
the following:

   1) Sort data by subject, response, and predictor variables,
      with the subject variable first on the sort list.

   2) Read all data into memory. Assuming that all data are
      numeric, we can store all the data in a temporary
      array with dimensions N and p+2, where N is the total
      number of observations across all subjects and p is the
      number of predictor variables.

   3) Construct four arrays of length I, where I is the number
      of subjects in the data set.

         Array 1 indexes the position from 1 to N of the first
                 record per subject
         Array 2 indicates whether a subject has identical
                 values with an index subject (to be defined)
         Array 3 contains a list of index subjects
         Array 4 contains the number of replicates for each
                 index subject

      Index subjects are identified as the first ordered subject
      belonging to a unique replicate group. A replicate group
      consists of all subjects who have the same data matrix as
      an index subject.

      Initialize arrays 2, 3, and 4 to 0.

   4) Starting with the first subject which has not already
      been identified as a replicate (that has array 2 value=0),
      loop over all other subjects not already identified as
      being a replicate (array 2 value=0), comparing the
      subject-specific data matrices of the first and i-th
      subjects. The first subject is recorded as the next index
      subject, has replicate status set to 1, and has number of
      replicates set to 1. If another subject is identified as
      having the same data matrix as our index subject, then
      set that subject's replicate status to 1 and increment
      the number of replicates for the current index subject.

      The first array can be used to a) point to the memory
      address for the start of each subject's data matrix so
      that we can quickly return required data, and b) allow
      a very fast initial assessment of whether two subjects
      are candidate replicates. Two subjects can only be
      replicates if they have the same number of observations.
      Having a pointer to the location of the first observation
      for each subject lets us check first whether the two
      subjects have the same number of records. We would only
      proceed to compare data values if the numbers of records
      are identical.

   5) After all subjects have replicate status set to 1,
      loop back through the index subjects and write out their
      data matrices with the frequency from array 4 attached.

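The steps above can be sketched in SAS. This is not the array-and-pointer scheme described in the list. It swaps in a simpler technique: build one text signature per subject from its sorted records, then count subjects per signature, which arrives at the same replicate groups. The input data set name mydata_long (one record per measurement), the work data set names, and the 2000-character signature length are all assumptions made for illustration.

```sas
/* Step 1: order records within subject by response and predictors. */
proc sort data=mydata_long out=sorted;
   by Subj Y X1 X2;
run;

/* Build one signature string per subject from its ordered records. */
/* Assumes every signature fits in 2000 characters; longer ones     */
/* would need a larger length or a different comparison.            */
data signature(keep=Subj sig);
   length sig $ 2000;
   retain sig;
   set sorted;
   by Subj;
   if first.Subj then sig = ' ';
   sig = catx('|', sig, catx(',', Y, X1, X2));
   if last.Subj then output;
run;

/* Subjects sharing a signature have identical data matrices. */
proc sort data=signature;
   by sig Subj;
run;

data replicate_groups(keep=index_subj n_replicates);
   set signature;
   by sig;
   if first.sig then do;
      index_subj   = Subj;   /* first ordered subject in the group */
      n_replicates = 0;
   end;
   n_replicates + 1;          /* count subjects in this group */
   if last.sig then output;
   retain index_subj;
run;
```

Because two subjects with different numbers of records can never produce the same signature, the record-count check in step 4 is covered implicitly. The output holds one row per index subject with its replicate count.
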
This seems to me a rather difficult process. It seems highly
unlikely to me that anyone would actually pursue this replicate
identification. To my mind, it is also unlikely to return much
reduction of data volume. Most subjects will have some data
which makes them unique.
More profitable for data reduction, and something which is
really easy to implement, is just what Adel has described
above. We have a subject who has multiple measurements. Many
of the measurements on any given subject will have the same
values for both response and predictors. We can collapse the
data into the frequency with which a particular response/predictor
combination is observed for the i-th subject. Now, the likelihood
for each within-subject replicate is identical, so we can
construct the total log-likelihood contribution across replicates
as

   (replicate frequency) * (log-likelihood of index replicate)

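Written out for Adel's logistic model, the collapsed-data contribution can be sketched as follows. The notation is introduced here for illustration and does not appear in the original exchange: c indexes the distinct (Y, X1, X2) cells observed for subject i, f_{ic} is the cell frequency, u_i is the subject's random intercept, and \sigma_u^2 is its variance.

```latex
\begin{align*}
p_{ic}(u_i) &= \frac{\exp(b_0 + b_1 x_{1,ic} + b_2 x_{2,ic} + u_i)}
                    {1 + \exp(b_0 + b_1 x_{1,ic} + b_2 x_{2,ic} + u_i)} \\
\ell_i(u_i) &= \sum_{c} f_{ic}\,\bigl[\, y_{ic}\log p_{ic}(u_i)
               + (1 - y_{ic})\log\bigl(1 - p_{ic}(u_i)\bigr) \bigr] \\
L_i &= \int \exp\bigl(\ell_i(u)\bigr)\,\phi(u;\,0,\,\sigma_u^2)\,du
\end{align*}
```

The frequencies sit inside the integral over the random effect. A weight that instead counts whole subjects with identical data, which is how the documentation describes the REPLICATE variable, would multiply log L_i from outside the integral. The two weightings therefore coincide only when there is no random effect to integrate over.
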
The NLMIXED procedure allows one to perform this computation,
but only if you write your own likelihood model and specify the
general(log-likelihood) model rather than using one of the
already constructed likelihood models. The following code would
work for the problem which Adel faces:
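
   proc nlmixed gconv=1e-7 qpoints=5 data=final;
      parms b0=2.1 b1=0.1 b2=0.5 s2u=0.08;
      bounds s2u > 0;
      /* linear predictor with subject-specific random intercept u */
      eta    = b0 + b1*X1 + b2*X2 + u;
      beta   = u;
      expeta = exp(eta);
      p      = expeta/(1 + expeta);
      /* Bernoulli log-likelihood for a single record */
      if Y=1 then loglike = log(p);
      else loglike = log(1-p);
      /* weight by the within-subject frequency of this record */
      loglike = freq*loglike;
      model Y ~ general(loglike);
      /* exp(2*log(s2u)) = s2u**2, so s2u is on the standard deviation scale */
      random u ~ normal(0, exp(2*log(s2u))) subject=Subj out=U_predict;
   run;
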
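The code above expects final to hold one record per observed combination of Subj, Y, X1, and X2, with its count in freq. A minimal sketch of one way to build such a data set, assuming the raw data have one record per measurement in a data set hypothetically named mydata_long:

```sas
/* One output record per observed Subj x Y x X1 x X2 combination.     */
/* _FREQ_ is the number of input records in the cell; rename to freq. */
/* Records with a missing value in any CLASS variable are dropped     */
/* by default.                                                        */
proc summary data=mydata_long nway;
   class Subj Y X1 X2;
   output out=final(drop=_type_ rename=(_freq_=freq));
run;
```

Combinations that never occur for a subject are simply absent from the output rather than present with freq=0. That makes no difference to the freq*loglike computation, since a zero-frequency cell would contribute nothing. The output is ordered by the CLASS variables, so records for each Subj are contiguous.
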
I would welcome any feedback from SI or anyone else who actually
believes that the REPLICATE statement of the NLMIXED procedure
is of real value. I would note that if there were no random
effects involved, then the REPLICATE statement would function
the same as my freq*(log-likelihood) computation demonstrated
above. Only in that instance might I advise use of the
REPLICATE statement.
But then one might inquire why use the procedure NLMIXED at all.
Certainly, one would prefer to use PROC GENMOD to PROC NLMIXED
for fitting a fixed effect model where the response is any of
those in the exponential family which are coded for NLMIXED
(binomial, gamma, normal, Poisson, negative binomial). If one
must fit a general likelihood model that one has to code
oneself, then it is just as easy, when writing the likelihood,
to multiply the log-likelihood contribution by the frequency
of occurrence, as I did for Adel, as it is to use the REPLICATE
statement.
So why is the REPLICATE statement coded for NLMIXED at all?
It just makes very little sense to me. Given that someone,
somewhere might be using the REPLICATE statement to specify
the number of subjects having identical data matrices when
fitting a random effects model, I know that the REPLICATE
statement will not be changed. However, I would certainly
encourage SI to add a FREQ statement which addresses the
problem that Adel presents. That would be truly useful.
Sorry for the long missive. I really would hope though that
someone at SI is taking note. I use and advocate the NLMIXED
procedure quite often, but the REPLICATE statement is, in my
book, just plain silly.
Dale
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------