LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (January 2005, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Mon, 17 Jan 2005 08:07:02 -0500
Reply-To:     "Zack, Matthew M." <MMZ1@CDC.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Zack, Matthew M." <MMZ1@CDC.GOV>
Subject:      Re: surveyselect question
Content-Type: text/plain; charset="us-ascii"

I don't know because I'm not an expert in PROC SURVEYSELECT (another SAS-L member, David Cassell, is such an expert).

PROC SURVEYSELECT will allow you to select a simple random sample (option, METHOD=SRS) of patients of a specified size (N=13) or a specified proportion (SAMPRATE=0.05), but it will do so only after you collapse across visits so that every patient is represented only once. Otherwise, PROC SURVEYSELECT may not select all visits of a specific selected patient. You would then have to merge the selected sample of patients with the original sample by patient ID to retrieve all the visits associated with each patient.

I don't think PROC SURVEYSELECT by itself can select observations with missing values of either SBP or DBP. You would have to use a prior DATA step or PROC APPEND to select observations with these characteristics before PROC SURVEYSELECT would randomly select a sample of such observations. Then, you would have to concatenate these sampled observations after the previously randomly selected persons.

I don't know whether your macro, my prior program, or PROC SURVEYSELECT would be better (more efficient, easier to understand, etc.) in terms of your specific application.

PROC SURVEYSELECT is designed to select samples for stratified and clustered designs, with and without replacement. It may not work so well with the specifications for your application.

Matthew Zack

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Scott Sent: Sunday, January 16, 2005 6:48 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: surveyselect question

Hi Matthew,

Actually, the below macro (typical K/N example) will do what I want, except for the missing values part. My question is whether SURVEYSELECT can (also) meet my needs, plus I want to get a better feel of its functionality via this specific example. I read about half the SURVEYSELECT doc; a lot of the terminology was over my head (I'm not a statistician), so I was hoping SAS-L could jump start me in my use of SURVEYSELECT.

Regards, Scott

macro follows - replace %parmv calls with your own error processing, or just compile a null %parmv macro:

%macro randlist /*---------------------------------------------------------------------- Create a random list of data, and create an output data set, a macro variable list, or both. Output can be limited by pct of total observations, min # of observations, max # of observations, or a combination of these parameters. ----------------------------------------------------------------------*/ (data= /* Input dataset (REQ). */ /* Input dataset options are allowed. */ ,out= /* Output dataset (Optional). */ /* Output dataset options are allowed, such as KEEP=, */ /* but don't use a WHERE= output dataset option. */ /* If missing no output dataset will be created. */ ,mvar=randlist /* Output macro variable list (Optional). */ /* Default value is randlist. If missing no macro */ /* variable list will be created. */ ,var=pt /* Variable to subset on (REQ). */ /* Default value is pt. */ ,pct=50 /* percentage of input dataset desired in output. */ /* (Optional). Default value is 50%. If missing, */ /* 100% of the input dataset will be included in the */ /* output. Valid values are 1 - 100. */ ,min= /* Minimum number of items desired in output (Optional)*/ ,max= /* Maximum number of items desired in output (Optional)*/ ,seed=0 /* Seed value for uniform function (REQ). */ /* Default value is 0. */ /* Valid values are any non-negative integer. */ );

/*---------------------------------------------------------------------- Usage:

%randlist(data=dataprot.efficacy) returns a random 50% of input dataset in macro variable &randlist. data returned is a list of pts.

%randlist(data=dataprot.efficacy,var=foo) returns a random 50% of input dataset in macro variable &randlist. data returned is a list of foo.

%randlist(data=dataprot.efficacy,out=work.randlist) returns a random 50% of input dataset in both macro variable &randlist and output dataset work.randlist. data returned would be a random 50% of pts from the input dataset, and a random ordered list of these same pts in &randlist.

%randlist(data=dataprot.efficacy,out=work.randlist,mvar=) returns a random 50% of input dataset in output dataset work.randlist only. data returned would be a random 50% of pts from the input dataset.

%randlist(data=dataprot.efficacy,pct=20) returns a random 20% of input dataset in macro variable &randlist.

%randlist(data=dataprot.efficacy,pct=20,min=10) returns a random 20% of input dataset in macro variable &randlist, or a minimum of 10 items returned if 20% is less than 10 items.

%randlist(data=dataprot.efficacy,pct=20,max=30) returns a random 20% of input dataset in macro variable &randlist, or a maximum of 20 items returned if 20% is greater than 30 items.

%randlist(data=dataprot.efficacy,pct=20,min=20,max=120) returns a random 20% of input dataset in macro variable &randlist, with a minimum of 20 items and a maximum number of 120 items returned.

%randlist(data=dataprot.efficacy (where=(pt not like "S%")),out=work.randlist (keep=pt visit)) returns a random 50% of input dataset, which is first subset by the input dataset where clause. returned data is in both the macro variable &randlist and the output dataset work.randlist. only the variables pt and visit are kept in the output dataset. ------------------------------------------------------------------------ Notes:

Input dataset must be specified and must exist.

&var must be a variable in the input dataset.

By default, &randlist is always created. Set &mvar to blank to suppress this.

Variable is returned as a comma-delimited list only (no parentheses). This makes it easier to further parse the returned list if required. If building an IN where clause, you need to add the parentheses, eg.

where pt in (&randlist);

vs.

where pt in &randlist;

Character variables are returned as a single-quoted comma-delimited list, eg. '10041017','10031023','10021037', etc.

and numeric variables are returned as a comma-delimited list, eg. 17,3,22,7,4,53, etc.

An output WHERE clause defeats the whole purpose of this macro. The &out parm is upper-cased in the macro. An output WHERE clause will have unpredictable results.

Note that, if you invoke this macro and create a macro variable &mvar in a *non-default* macro variable name (i.e. not &randlist), then later re-invoke the macro with &mvar set to missing, the *old* random item list will still exist in that macro variable. Since the syntax to not create a macro variable is a blank &mvar, I don't know which old macro variable to reset. The best I can do is unconditionally delete the *default* macro variable &randlist on each macro invocation. You can use %symdel to delete an old macro variable before re-invoking the macro. ----------------------------------------------------------------------- History:

12JAN05 Initial creation ----------------------------------------------------------------------*/

%* delete the old default &mvar setting ; %unquote(%nrstr(%symdel randlist / nowarn)); /* workaround a SAS bug */

%* basic error checking ; %local macro parmerr options; %let macro = &sysmacroname;

%* do not want to upcase the input data parm or it will mess up any input where clause ; %parmv(data, _req=1,_words=1,_case=N) /* _words=1 allows input ds options */ %parmv(out, _req=0,_words=1,_case=U) /* _words=1 allows output ds options */ %parmv(mvar, _req=0,_words=0,_case=U) %parmv(var, _req=1,_words=0,_case=U); %parmv(pct, _req=0,_words=0,_case=U,_val=POSITIVE) %parmv(min, _req=0,_words=0,_case=U,_val=POSITIVE) %parmv(max, _req=0,_words=0,_case=U,_val=POSITIVE) %parmv(seed, _req=1,_words=0,_case=U,_val=NONNEGATIVE) %if (&parmerr) %then %goto quit;

%* first parse off any dataset options ; %let i = %index(%bquote(&data),%str(%()); %if &i %then %do; %let options = %substr(%bquote(&data),&i); %let data = %substr(%bquote(&data),1,&i-1); %end; %let data = %upcase(&data);

%* more error checking ;

%* does the input dataset exist? ; %if not %sysfunc(exist(&data)) %then %parmv(_msg=Dataset &data does not exist);

%* is var in the input dataset? ; %if not %varexist(&data,&var) %then %parmv(_msg=Variable &var does not exist in dataset &data); %else %let vartype = %varexist(&data,&var,TYPE);

%* are both output dataset and mvar variable blank? ; %if (%bquote(&out) = and &mvar = ) %then %parmv(_msg=Either an output dataset or macro variable must be specified);

%* is percent between 1 and 100? ; %if not (1 <= &pct and &pct <= 100) %then %parmv(_msg=Percent must be an integer between 1 and 100);

%* is max greater than min? ; %if (&min ^= and &max ^= )%then %if (&max <= &min) %then %parmv(_msg=Max (&max) must be greater than min (&min));

%if (&parmerr) %then %goto quit;

%* create a dataset of distinct &var, sorted in random order ; proc sql noprint; create table _temp_ as select *, uniform(&seed) as sort from (select distinct &var from &data &options ) order by sort ; quit;

%* calculate percentage, min, and max ; %nobs(_temp_);

%if (&pct = ) %then %let pct = 100; %if (&min = ) %then %let min = 0; %if (&max = ) %then %let max = &nobs;

%* convert pct to an observation count ; %let pct = %sysfunc(round(&nobs * (&pct / 100)));

%* now calculate the desired item count given pct, min, and max settings ; %let num = %sysfunc(min(%sysfunc(max(&min,&pct)),&max));

%* debugging statement, comment out in production ; %put nobs=&nobs pct=&pct min=&min max=&max num=&num;

%* create random subset and/or item list ; proc sql noprint;

%* do you want an output dataset? ; %if (%bquote(&out) ^= ) %then %do; create table %bquote(&out) as select * from &data &options where &var in (select &var from _temp_ (obs=&num) ) ; %end;

%* do you want an output macro variable? ; %* if you want a sorted, but still random, list then uncomment the "distinct" keyword ; %* but an unsorted list gives more confidence that the list is truly random ; %if (&mvar ^= ) %then %do; %global &mvar;

%if (&vartype = C) %then %do; select /* distinct */ "'" || trim(&var) || "'" into :&mvar separated by ',' from _temp_ (obs=&num) ; %end; %else %do; select /* distinct */ &var into :&mvar separated by ',' from _temp_ (obs=&num) ; %end; %end; quit;

%quit: %mend;

"Zack, Matthew M." <MMZ1@CDC.GOV> wrote in message news:70A9A1413B47DD4FB1F311A7C19B218C038B1DF8@m-nccd-1.nccd.cdc.gov... > What if you randomly select patients and all their visits without PROC

> SURVEYSELECT? > > * Sort patient visits; > * by patient ID; > > proc sort; > by pt; > run; > > * Generate a uniform random number for each patient; > > data two(drop=rnseed); > retain rnseed 6093141 rn; > set; > by pt; > if (first.pt eq 1) > then rn=uniform(rnseed); > output two; > run; > > * Sort patient visits; > * by patient ID, visit ID, and ascending uniform random number; > > proc sort data=two; > by pt visit rn; > run; > > * Select about 50 [+- 2 visits so that range=48 to 52] total patient > visits; > * Add five visits (possibly from different patients) after the 50 > above are selected; > * where SBP=. or DBP=.; > > data visit50(drop=rn lstvisit nmissbp); > retain lstvisit nmissbp 0; > set two; > by pt visit; > select; > when (lstvisit eq 0) do; > if ((ABS(50-_n_) le 2) and > (last.pt eq 1)) > then lstvisit=1; > output visit50; > end; > when (lstvisit eq 1) do; > if ((sbp eq .) or > (dbp eq .)) then do; > nmissbp=nmissbp+1; > if (nmissbp le 5) > then output visit50; > else lstvisit=2; > end; > end; > otherwise stop; > end; > run; > > * Select about 20% of the input data set; > * Add five visits (possibly from different patients) after the above

> 20% are selected; > * where SBP=. or DBP=.; > > data visit20p(drop=rn nmissbp); > retain nmissbp 0; > set two; > by pt visit; > select; > when (rn le 0.20) output visit20p; > otherwise do; > if ((sbp eq .) or > (dbp eq .)) then do; > nmissbp=nmissbp+1; > if (nmissbp le 5) > then output visit20p; > else stop; > end; > end; > end; > run; > > Matthew Zack > > -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > Scott > Sent: Sunday, January 16, 2005 1:12 AM > To: SAS-L@LISTSERV.UGA.EDU > Subject: surveyselect question > > > Hi, > > I've read various posts about SURVEYSELECT and random samples in the > archives, but couldn't find the answer to my problem, thus this > post... > > Say I have a dataset: > > PT VISIT SBP DBP, where > > PT = patient > VISIT = visit number, say 1 - 4, which may be incomplete for a given > PT, i.e. could be 1; 1,2,4; 1,2,3; 1,3; etc. > SBP = systolic blood pressure > DBP = diastolic blood pressure (both BP's could have missing values) > > I'd like to sample this dataset as follows: > > 1. Sample has "around" say 50 observations in total. > > 2. Sample has say 20% of observations from input data set. > > In both of these samples, *** ALL observations for a given PT are > included ***, i.e. if PT 7 is one of the patients randomly selected, > then all visits for that PT are included in the random sample. > > 3. #1 and #2 above, augmented by say 5 random observations where > either SBP, DBP, or both have a missing value. > > For #3, I don't care if I make two passes over the data, but one pass > would be nice. > > IOW, in "pseudocode": > > 1. If each PT had 4 visit records, I would have either 12 (48) or 13 > (52) observations in the sample dataset, since I specified a sample > size of around 50. > > 2. If each PT had 4 visit records, and the total input dataset is > 1000 observations, I would have 200 observations in the sample > dataset, comprised of 50 PTs with 4 visits each. > > 3.(1) 12 or 13 random patients, plus 5 observations where SBP, DBP, or

> both were missing. > > 3.(2) 50 random patients, plus 5 observations where SBP, DBP, or both > were missing. > > I've played with SURVEYSELECT, but can't figure out how to get all > records for a given PT to be included in the output. > > Note that this sampling is for QC tests of code algorithms, not for > further statistical analyses of the resulting sample. > > Thanks, > Scott


Back to: Top of message | Previous page | Main SAS-L page