Date: Thu, 25 Oct 2007 09:40:20 -0700
Reply-To: "Nordlund, Dan (DSHS/RDA)" <NordlDJ@DSHS.WA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Nordlund, Dan (DSHS/RDA)" <NordlDJ@DSHS.WA.GOV>
Subject: Re: Selecting a Random Sample
In-Reply-To: <1193327691.207572.41890@22g2000hsm.googlegroups.com>
Content-Type: text/plain; charset=iso-8859-1
I missed the original post. I don't know if the original poster wants to get a 20% sample of just the IDs or wants all the records for the 20% sample of IDs. Here is one way of getting either.
data sample;
  in_sample = uniform(32751) LT .2;
  do until(last.ID);
    set a;
    by ID;
    ** if you want all records of your 20% sample, output here;
    if in_sample then output;
  end;
  ** if you want only the IDs, then output here;
  ** if in_sample then output;
run;
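Note that the step above keeps each ID independently with probability .2, so the number of IDs selected will vary from run to run. If exactly 20% of the IDs is wanted, something like the following PROC SURVEYSELECT sketch might do it. The data set names ids and picked are made up for illustration, and I am assuming a is sorted by ID so that the final merge works.

```sas
* Hypothetical names: ids = one row per ID, picked = the sampled IDs;
proc sort data=a(keep=ID) out=ids nodupkey;
  by ID;
run;

* Simple random sample of 20% of the distinct IDs;
proc surveyselect data=ids out=picked method=srs samprate=0.2 seed=32751;
run;

* Keep every record of a whose ID was sampled (assumes a is sorted by ID);
data sample;
  merge a(in=in_a) picked(in=in_picked);
  by ID;
  if in_a and in_picked;
run;
```

If only the IDs are needed, the picked data set is already the answer and the last step can be skipped.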
Hope this is helpful,
Dan
Daniel J. Nordlund
Research and Data Analysis
Washington State Department of Social and Health Services
Olympia, WA 98504-5204
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of Shiling Zhang
> Sent: Thursday, October 25, 2007 8:55 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Selecting a Random Sample
>
> On Oct 24, 3:12 am, a...@hotmail.com wrote:
> > Hello,
> >
> > My data has the following format:
> > Data A;
> > ID Year Type
> > 1 1999 A
> > 1 2000 A
> > 1 2001 B
> > 1 2001 C
> > 2 1988 H
> > 3 1989 C
> > 4 2001 G
> > 4 1998 Y
> > 5 2001 B
> >
> > I want to select a random 20% sample of the IDs.
> >
> > So for example,
> >
> > The output could be:
> >
> > 4 2001 G
> > 4 1998 Y
> >
> > or the output could be:
> >
> > 5 2001 B
> >
> > The way I approach it is:
> > Data B;
> > set A;
> > by ID;
> > retain X;
> > if first.ID then X = ranuni(4544);
> > run;
> >
> > Data C;
> > set B;
> > if X < 0.20 then output;
> > run;
> >
> > This way I would extract 20% of the IDs. My question is: is there a
> > better/more efficient way to do this?
> >
> > Thanks.
>
> Here is a one-pass data step solution. I hope someone can come up with a
> "proc surveyselect" version.
>
> data t1;
> do i = 1 to 10;
> do j=1 to mod(i,3)+1;
> output;
> end;
> end;
> run;
>
> proc print data=t1; run;
>
> proc sql noprint;
> select count(distinct i) into :tot_i
> from t1;
> quit;
>
> %put >>>&tot_i<<<;
> **sample percent of by variable;
> %let p=0.4;
>
> data sample;
>   retain p0 p &p tot_i &tot_i;
>   seed=90876;
>   n0=p0*&tot_i;
>   * select this BY group with the current probability p;
>   rate=(ranuni(seed)<p);
>   s+rate;
>
>   do until(last.i);
>     set t1 nobs=n;
>     by i;
>     if rate then output;
>   end;
>
>   * stop rule: quit once the target number of groups is selected;
>   if s>=p0*&tot_i then stop;
>   * update p based on whether the current group was selected;
>   if rate then p=(p*tot_i-1)/(tot_i + (-1));
>   else p=(p*tot_i)/(tot_i + (-1));
>   tot_i + (-1);
>   keep i j p0 seed;
> run;
>
> proc print data=sample; run;
>
> HTH.
>
>