Date: Sun, 16 Jan 2005 14:44:06 -0500
Reply-To: "Zack, Matthew M." <MMZ1@CDC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Zack, Matthew M." <MMZ1@CDC.GOV>
Subject: Re: surveyselect question
What if you randomly select patients and all their visits without PROC
SURVEYSELECT?
* Sort patient visits by patient ID;
proc sort;
  by pt;
run;
* Generate one uniform random number per patient;
* RETAIN carries it across all visits of that patient;
data two(drop=rnseed);
  retain rnseed 6093141 rn;
  set;
  by pt;
  if (first.pt eq 1) then rn=uniform(rnseed);
  output two;
run;
* Sort patient visits by patient ID and visit ID;
* (rn is constant within a patient, so it does not affect this order);
proc sort data=two;
  by pt visit rn;
run;
* Select about 50 total patient visits (50 +- 2, so 48 to 52), stopping;
* on the last visit of a patient so that each selected patient is complete;
* Then add five visits (possibly from different patients) where SBP=. or DBP=.;
* Patients must arrive in random order, so first sort a copy (two_rnd) by rn;
proc sort data=two out=two_rnd;
  by rn pt visit;
run;
data visit50(drop=rn lstvisit nmissbp);
  retain lstvisit nmissbp 0;
  set two_rnd;
  by rn pt visit;
  select;
    when (lstvisit eq 0) do;
      if ((abs(50-_n_) le 2) and (last.pt eq 1)) then lstvisit=1;
      output visit50;
    end;
    when (lstvisit eq 1) do;
      if ((sbp eq .) or (dbp eq .)) then do;
        nmissbp=nmissbp+1;
        if (nmissbp le 5) then output visit50;
        else lstvisit=2;
      end;
    end;
    otherwise stop;
  end;
run;
* Select about 20% of the patients (rn le 0.20), and so about 20% of the;
* input data set, keeping all visits of each selected patient;
* Add the first five visits from unselected patients where SBP=. or DBP=.;
* Do not STOP after the fifth one, or later selected patients are never read;
data visit20p(drop=rn nmissbp);
  retain nmissbp 0;
  set two;
  by pt visit;
  select;
    when (rn le 0.20) output visit20p;
    otherwise do;
      if ((sbp eq .) or (dbp eq .)) then do;
        nmissbp=nmissbp+1;
        if (nmissbp le 5) then output visit20p;
      end;
    end;
  end;
run;
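For comparison, here is a minimal, untested sketch of the 20% case that does use PROC SURVEYSELECT: draw a simple random sample from a list of distinct patient IDs, then merge the selected IDs back to pick up every visit. It assumes data set TWO from above is still sorted by PT. The names PTS, SELPTS, VISIT20P_SS, and PICKED are made up for illustration.

```sas
* One row per patient;
proc sort data=two(keep=pt) out=pts nodupkey;
  by pt;
run;

* Simple random sample of patients (hypothetical output data set SELPTS);
proc surveyselect data=pts out=selpts method=srs samprate=0.20 seed=6093141;
run;

* Keep every visit of each selected patient;
data visit20p_ss;
  merge two selpts(in=picked keep=pt);
  by pt;
  if picked;
run;
```

Replacing SAMPRATE=0.20 with SAMPSIZE=13 should give the "around 50 observations" case when each patient has 4 visits. The missing-BP visits would still have to be added in a separate step.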
Matthew Zack
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Scott
Sent: Sunday, January 16, 2005 1:12 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: surveyselect question
Hi,
I've read various posts about SURVEYSELECT and random samples in the
archives, but couldn't find the answer to my problem, thus this post...
Say I have a dataset:
PT VISIT SBP DBP, where
PT = patient
VISIT = visit number, say 1 - 4, which may be incomplete for a given
PT, i.e. could be 1; 1,2,4; 1,2,3; 1,3; etc.
SBP = systolic blood pressure
DBP = diastolic blood pressure (both BP's could have missing values)
I'd like to sample this dataset as follows:
1. Sample has "around" say 50 observations in total.
2. Sample has say 20% of observations from input data set.
In both of these samples, *** ALL observations for a given PT are
included ***, i.e. if PT 7 is one of the patients randomly selected,
then all visits for that PT are included in the random sample.
3. #1 and #2 above, augmented by say 5 random observations where either
SBP, DBP, or both have a missing value.
For #3, I don't care if I make two passes over the data, but one pass
would be nice.
IOW, in "pseudocode":
1. If each PT had 4 visit records, I would have either 12 PTs (48
observations) or 13 PTs (52 observations) in the sample dataset, since I
specified a sample size of around 50.
2. If each PT had 4 visit records, and the total input dataset is 1000
observations, I would have 200 observations in the sample dataset,
comprised of 50 PTs with 4 visits each.
3.(1) 12 or 13 random patients, plus 5 observations where SBP, DBP, or
both were missing.
3.(2) 50 random patients, plus 5 observations where SBP, DBP, or both
were missing.
I've played with SURVEYSELECT, but can't figure out how to get all
records for a given PT to be included in the output.
Note that this sampling is for QC tests of code algorithms, not for
further statistical analyses of the resulting sample.
Thanks,
Scott