Date: Mon, 17 Jan 2005 08:07:02 -0500
Reply-To: "Zack, Matthew M." <MMZ1@CDC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Zack, Matthew M." <MMZ1@CDC.GOV>
Subject: Re: surveyselect question
Content-Type: text/plain; charset="us-ascii"
I don't know because I'm not an expert in PROC SURVEYSELECT (another
SAS-L member, David Cassell, is such an expert).
PROC SURVEYSELECT will allow you to select a simple random sample
(option, METHOD=SRS)
of patients of a specified size (N=13) or a specified proportion
(SAMPRATE=0.05), but
it will do so only after you collapse across visits so that every
patient is represented
only once. Otherwise, PROC SURVEYSELECT may not select all visits of a
specific selected
patient. You would then have to merge the selected sample of patients
with the original sample
by patient ID to retrieve all the visits associated with each patient.
I don't think PROC SURVEYSELECT by itself can select observations with
missing values
of either SBP or DBP. You would have to use a prior DATA step or PROC
APPEND to select observations with these characteristics before PROC
SURVEYSELECT would randomly select
a sample of such observations. Then, you would have to concatenate
these sampled
observations after the previously randomly selected persons.
I don't know whether your macro, my prior program, or PROC SURVEYSELECT
would be better
(more efficient, easier to understand, etc.) in terms of your specific
application.
PROC SURVEYSELECT is designed to select samples for stratified and
clustered designs,
with and without replacement. It may not work so well with the
specifications for your application.
Matthew Zack
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Scott
Sent: Sunday, January 16, 2005 6:48 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: surveyselect question
Hi Matthew,
Actually, the below macro (typical K/N example) will do what I want,
except for the missing values part. My question is whether SURVEYSELECT
can (also) meet my needs, plus I want to get a better feel of its
functionality via this specific example. I read about half the
SURVEYSELECT doc; a lot of the terminology was over my head (I'm not a
statistician), so I was hoping SAS-L could jump start me in my use of
SURVEYSELECT.
Regards,
Scott
macro follows - replace %parmv calls with your own error processing, or
just compile a null %parmv macro:
%macro randlist
/*----------------------------------------------------------------------
Create a random list of data, and create an output data set, a macro
variable list, or both. Output can be limited by pct of total
observations, min # of observations, max # of observations, or a
combination of these parameters.
----------------------------------------------------------------------*/
(data= /* Input dataset (REQ). */
/* Input dataset options are allowed. */
,out= /* Output dataset (Optional). */
/* Output dataset options are allowed, such as KEEP=, */
/* but don't use a WHERE= output dataset option. */
/* If missing no output dataset will be created. */
,mvar=randlist /* Output macro variable list (Optional). */
/* Default value is randlist. If missing no macro */
/* variable list will be created. */
,var=pt /* Variable to subset on (REQ). */
/* Default value is pt. */
,pct=50 /* percentage of input dataset desired in output. */
/* (Optional). Default value is 50%. If missing, */
/* 100% of the input dataset will be included in the */
/* output. Valid values are 1 - 100. */
,min= /* Minimum number of items desired in output (Optional)*/
,max= /* Maximum number of items desired in output (Optional)*/
,seed=0 /* Seed value for uniform function (REQ). */
/* Default value is 0. */
/* Valid values are any non-negative integer. */
);
/*----------------------------------------------------------------------
Usage:
%randlist(data=dataprot.efficacy)
returns a random 50% of input dataset in macro variable &randlist.
data returned is a list of pts.
%randlist(data=dataprot.efficacy,var=foo)
returns a random 50% of input dataset in macro variable &randlist.
data returned is a list of foo.
%randlist(data=dataprot.efficacy,out=work.randlist)
returns a random 50% of input dataset in both macro variable
&randlist
and output dataset work.randlist.
data returned would be a random 50% of pts from the input dataset,
and a random ordered list of these same pts in &randlist.
%randlist(data=dataprot.efficacy,out=work.randlist,mvar=)
returns a random 50% of input dataset in output dataset work.randlist
only.
data returned would be a random 50% of pts from the input dataset.
%randlist(data=dataprot.efficacy,pct=20)
returns a random 20% of input dataset in macro variable &randlist.
%randlist(data=dataprot.efficacy,pct=20,min=10)
returns a random 20% of input dataset in macro variable &randlist,
or a minimum of 10 items returned if 20% is less than 10 items.
%randlist(data=dataprot.efficacy,pct=20,max=30)
returns a random 20% of input dataset in macro variable &randlist,
or a maximum of 20 items returned if 20% is greater than 30 items.
%randlist(data=dataprot.efficacy,pct=20,min=20,max=120)
returns a random 20% of input dataset in macro variable &randlist,
with a minimum of 20 items and a maximum number of 120 items
returned.
%randlist(data=dataprot.efficacy (where=(pt not like
"S%")),out=work.randlist (keep=pt visit))
returns a random 50% of input dataset, which is first subset by the
input dataset where clause.
returned data is in both the macro variable &randlist and the
output dataset work.randlist.
only the variables pt and visit are kept in the output dataset.
------------------------------------------------------------------------
Notes:
Input dataset must be specified and must exist.
&var must be a variable in the input dataset.
By default, &randlist is always created. Set &mvar to blank to suppress
this.
Variable is returned as a comma-delimited list only (no parentheses).
This makes it easier to further parse the returned list if required. If
building an IN where clause, you need to add the parentheses, eg.
where pt in (&randlist);
vs.
where pt in &randlist;
Character variables are returned as a single-quoted comma-delimited
list, eg. '10041017','10031023','10021037', etc.
and numeric variables are returned as a comma-delimited list, eg.
17,3,22,7,4,53, etc.
An output WHERE clause defeats the whole purpose of this macro. The &out
parm is upper-cased in the macro. An output WHERE clause will have
unpredictable results.
Note that, if you invoke this macro and create a macro variable &mvar in
a *non-default* macro variable name (i.e. not &randlist), then later
re-invoke the macro with &mvar set to missing, the *old* random item
list will still exist in that macro variable. Since the syntax to not
create a macro variable is a blank &mvar, I don't know which old macro
variable to reset. The best I can do is unconditionally delete the
*default* macro variable &randlist on each macro invocation. You can use
%symdel to delete an old macro variable before re-invoking the macro.
-----------------------------------------------------------------------
History:
12JAN05 Initial creation
----------------------------------------------------------------------*/
%* delete the old default &mvar setting ; %unquote(%nrstr(%symdel
randlist / nowarn)); /* workaround a SAS bug */
%* basic error checking ;
%local macro parmerr options;
%let macro = &sysmacroname;
%* do not want to upcase the input data parm or it will mess up any
input where clause ; %parmv(data, _req=1,_words=1,_case=N) /* _words=1
allows input ds options */ %parmv(out, _req=0,_words=1,_case=U) /*
_words=1 allows output ds options */ %parmv(mvar,
_req=0,_words=0,_case=U) %parmv(var, _req=1,_words=0,_case=U);
%parmv(pct, _req=0,_words=0,_case=U,_val=POSITIVE)
%parmv(min, _req=0,_words=0,_case=U,_val=POSITIVE)
%parmv(max, _req=0,_words=0,_case=U,_val=POSITIVE)
%parmv(seed, _req=1,_words=0,_case=U,_val=NONNEGATIVE)
%if (&parmerr) %then %goto quit;
%* first parse off any dataset options ;
%let i = %index(%bquote(&data),%str(%());
%if &i %then %do;
%let options = %substr(%bquote(&data),&i);
%let data = %substr(%bquote(&data),1,&i-1);
%end;
%let data = %upcase(&data);
%* more error checking ;
%* does the input dataset exist? ;
%if not %sysfunc(exist(&data)) %then
%parmv(_msg=Dataset &data does not exist);
%* is var in the input dataset? ;
%if not %varexist(&data,&var) %then
%parmv(_msg=Variable &var does not exist in dataset &data); %else
%let vartype = %varexist(&data,&var,TYPE);
%* are both output dataset and mvar variable blank? ;
%if (%bquote(&out) = and &mvar = ) %then
%parmv(_msg=Either an output dataset or macro variable must be
specified);
%* is percent between 1 and 100? ;
%if not (1 <= &pct and &pct <= 100) %then
%parmv(_msg=Percent must be an integer between 1 and 100);
%* is max greater than min? ;
%if (&min ^= and &max ^= )%then
%if (&max <= &min) %then
%parmv(_msg=Max (&max) must be greater than min (&min));
%if (&parmerr) %then %goto quit;
%* create a dataset of distinct &var, sorted in random order ; proc sql
noprint;
create table _temp_ as
select *, uniform(&seed) as sort
from
(select distinct &var
from &data &options
)
order by sort
;
quit;
%* calculate percentage, min, and max ;
%nobs(_temp_);
%if (&pct = ) %then %let pct = 100;
%if (&min = ) %then %let min = 0;
%if (&max = ) %then %let max = &nobs;
%* convert pct to an observation count ;
%let pct = %sysfunc(round(&nobs * (&pct / 100)));
%* now calculate the desired item count given pct, min, and max settings
; %let num = %sysfunc(min(%sysfunc(max(&min,&pct)),&max));
%* debugging statement, comment out in production ;
%put nobs=&nobs pct=&pct min=&min max=&max num=#
%* create random subset and/or item list ;
proc sql noprint;
%* do you want an output dataset? ;
%if (%bquote(&out) ^= ) %then %do;
create table %bquote(&out) as
select *
from &data &options
where &var in
(select &var
from _temp_ (obs=&num)
)
;
%end;
%* do you want an output macro variable? ;
%* if you want a sorted, but still random, list then uncomment the
"distinct" keyword ;
%* but an unsorted list gives more confidence that the list is truly
random ;
%if (&mvar ^= ) %then %do;
%global &mvar;
%if (&vartype = C) %then %do;
select /* distinct */ "'" || trim(&var) || "'" into :&mvar
separated by ','
from _temp_ (obs=&num)
;
%end;
%else %do;
select /* distinct */ &var into :&mvar separated by ','
from _temp_ (obs=&num)
;
%end;
%end;
quit;
%quit:
%mend;
"Zack, Matthew M." <MMZ1@CDC.GOV> wrote in message
news:70A9A1413B47DD4FB1F311A7C19B218C038B1DF8@m-nccd-1.nccd.cdc.gov...
> What if you randomly select patients and all their visits without PROC
> SURVEYSELECT?
>
> * Sort patient visits;
> * by patient ID;
>
> proc sort;
> by pt;
> run;
>
> * Generate a uniform random number for each patient;
>
> data two(drop=rnseed);
> retain rnseed 6093141 rn;
> set;
> by pt;
> if (first.pt eq 1)
> then rn=uniform(rnseed);
> output two;
> run;
>
> * Sort patient visits;
> * by patient ID, visit ID, and ascending uniform random number;
>
> proc sort data=two;
> by pt visit rn;
> run;
>
> * Select about 50 [+- 2 visits so that range=48 to 52] total patient
> visits;
> * Add five visits (possibly from different patients) after the 50
> above are selected;
> * where SBP=. or DBP=.;
>
> data visit50(drop=rn lstvisit nmissbp);
> retain lstvisit nmissbp 0;
> set two;
> by pt visit;
> select;
> when (lstvisit eq 0) do;
> if ((ABS(50-_n_) le 2) and
> (last.pt eq 1))
> then lstvisit=1;
> output visit50;
> end;
> when (lstvisit eq 1) do;
> if ((sbp eq .) or
> (dbp eq .)) then do;
> nmissbp=nmissbp+1;
> if (nmissbp le 5)
> then output visit50;
> else lstvisit=2;
> end;
> end;
> otherwise stop;
> end;
> run;
>
> * Select about 20% of the input data set;
> * Add five visits (possibly from different patients) after the above
> 20% are selected;
> * where SBP=. or DBP=.;
>
> data visit20p(drop=rn nmissbp);
> retain nmissbp 0;
> set two;
> by pt visit;
> select;
> when (rn le 0.20) output visit20p;
> otherwise do;
> if ((sbp eq .) or
> (dbp eq .)) then do;
> nmissbp=nmissbp+1;
> if (nmissbp le 5)
> then output visit20p;
> else stop;
> end;
> end;
> end;
> run;
>
> Matthew Zack
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Scott
> Sent: Sunday, January 16, 2005 1:12 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: surveyselect question
>
>
> Hi,
>
> I've read various posts about SURVEYSELECT and random samples in the
> archives, but couldn't find the answer to my problem, thus this
> post...
>
> Say I have a dataset:
>
> PT VISIT SBP DBP, where
>
> PT = patient
> VISIT = visit number, say 1 - 4, which may be incomplete for a given
> PT, i.e. could be 1; 1,2,4; 1,2,3; 1,3; etc.
> SBP = systolic blood pressure
> DBP = diastolic blood pressure (both BP's could have missing
values)
>
> I'd like to sample this dataset as follows:
>
> 1. Sample has "around" say 50 observations in total.
>
> 2. Sample has say 20% of observations from input data set.
>
> In both of these samples, *** ALL observations for a given PT are
> included ***, i.e. if PT 7 is one of the patients randomly selected,
> then all visits for that PT are included in the random sample.
>
> 3. #1 and #2 above, augmented by say 5 random observations where
> either SBP, DBP, or both have a missing value.
>
> For #3, I don't care if I make two passes over the data, but one pass
> would be nice.
>
> IOW, in "pseudocode":
>
> 1. If each PT had 4 visit records, I would have either 12 (48) or 13
> (52) observations in the sample dataset, since I specified a sample
> size of around 50.
>
> 2. If each PT had 4 visit records, and the total input dataset is
> 1000 observations, I would have 200 observations in the sample
> dataset, comprised of 50 PTs with 4 visits each.
>
> 3.(1) 12 or 13 random patients, plus 5 observations where SBP, DBP, or
> both were missing.
>
> 3.(2) 50 random patients, plus 5 observations where SBP, DBP, or both
> were missing.
>
> I've played with SURVEYSELECT, but can't figure out how to get all
> records for a given PT to be included in the output.
>
> Note that this sampling is for QC tests of code algorithms, not for
> further statistical analyses of the resulting sample.
>
> Thanks,
> Scott