Date: Tue, 30 Jan 2007 03:49:57 -0800
Reply-To: "J. Manuel Picaza" <JMPicaza@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "J. Manuel Picaza" <JMPicaza@GMAIL.COM>
Organization: http://groups.google.com
Subject: Re: sample data segmentation
In-Reply-To: <BAY103-F38A4C3A1C2AD5A41F1AF7EB0A60@phx.gbl>
Content-Type: text/plain; charset="iso-8859-1"
Thank you David for your answer.
What I want to do is to build from a frequency table, a partition of
the probability space in a fixed number or groups (always possible if
number of groups <= number of observations) such that the distribution
of observations in the groups be the most homogen possible.
In mathematical terms i am trying to find the optimal partition of N
different observations in K groups (with each observation belonging to
one and only one group).
Optimal would mean the partition with lowest variance or other measure
for the homogenity of number of observations in each group.
I have not find it in the SAS-L archives (what does not mean it is not
there)
Thank you very much for your help in advance,
Manuel
On Jan 30, 7:54 am, davidlcass...@MSN.COM (David L Cassell) wrote:
> JMPic...@GMAIL.COM wrote:
>
> >Hello all,
> >I have a question for you.
>
> >/*
> > * suppose this input data and that we want 20 groups
> >*/
>
> >data WORK.dist(keep=id other);
> > other='Other variables go here';
> > do i = 1 to 3; id=1;output; end;
> > do i = 1 to 10000; id=2;output; end; do i = 1 to 10; id=3;output;
> >end;
> > do i = 1 to 25; id=4;output; end; do i = 1 to 30; id=5;output;
> >end;
> > do i = 1 to 1; id=6;output; end; do i = 1 to 48; id=7;output; end;
> > do i = 1 to 1; id=8;output; end; do i = 1 to 55; id=9;output; end;
> > do i = 1 to 90; id=10;output; end; do i = 1 to 19; id=11;output;
> >end;
> > do i = 1 to 25; id=12;output; end; do i = 1 to 30; id=13;output;
> >end;
> > do i = 1 to 1; id=14;output; end; do i = 1 to 40; id=15;output;
> >end;
> > do i = 1 to 1; id=16;output; end; do i = 1 to 508; id=17;output;
> >end;
> > do i = 1 to 91; id=18;output; end; do i = 1 to 10; id=19;output;
> >end;
> > do i = 1 to 25; id=20;output; end; do i = 1 to 30; id=21;output;
> >end;
> > do i = 1 to 1; id=22;output; end; do i = 1 to 40; id=23;output;
> >end;
> > do i = 1 to 1; id=24;output; end; do i = 1 to 54; id=25;output;
> >end;
> > do i = 1 to 25; id=40;output; end; do i = 1 to 30; id=41;output;
> >end;
> > do i = 1 to 1; id=42;output; end; do i = 1 to 48; id=43;output;
> >end;
> > do i = 1 to 1; id=44;output; end; do i = 1 to 55; id=45;output;
> >end;
> > do i = 1 to 90; id=46;output; end; do i = 1 to 19; id=47;output;
> >end;
> > do i = 1 to 25; id=48;output; end; do i = 1 to 30; id=49;output;
> >end;
> > do i = 1 to 1; id=50;output; end;
> > do i = 1 to 1; id=70;output; end; do i = 1 to 10; id=71;output;
> >end;
> > do i = 1 to 25; id=72;output; end; do i = 1 to 30; id=73;output;
> >end;
> > do i = 1 to 1; id=74;output; end; do i = 1 to 48; id=75;output;
> >end;
> > do i = 1 to 1; id=76;output; end; do i = 1 to 55; id=77;output;
> >end;
> > do i = 1 to 90; id=78;output; end; do i = 1 to 19; id=79;output;
> >end;
> > do i = 1 to 25; id=80;output; end;
> > do i = 1 to 1; id=90;output; end; do i = 1 to 40; id=91;output;
> >end;
> > do i = 1 to 1; id=92;output; end; do i = 1 to 54; id=93;output;
> >end;
> > do i = 1 to 90; id=94;output; end; do i = 1 to 13; id=95;output;
> >end;
> > do i = 1 to 25; id=96;output; end; do i = 1 to 30; id=97;output;
> >end;
> > do i = 1 to 1; id=98;output; end; do i = 1 to 2; id=99;output;
> >end;
> > ; do i = 1 to 1000; id=100;output; end;
> >run;
>
> >%let numOfGroups=20;
>
> >proc sql noprint;
> > create table WORK.freq_table as select distinct
> > id, count(id) as freq
> > from dist group by id
> > order by id;
> > select distinct sum(freq)/&numOfGroups
> > into :obsPerGroup
> > from WORK.freq_table;
> >quit;
>
> >/* Now i try to clasify the data in groups */
> >DATA WORK.groups;
> > set WORK.freq_table;
> > if _n_=1 then do;
> > group=1;
> > cum=freq;
> > end;
> > else do;
> > cum+freq;
> > if cum gt &obsPerGroup then do;
> > group+1;
> > cum=freq;
> > end;
> > end;
> >RUN;
>
> >/* and as a summary table */
> >proc sql;
> > create table WORK.summary_table as select distinct
> > group,
> > sum(freq) as HowManyInGroup,
> > min(id) as min,
> > max(id) as max
> > from WORK.groups
> > group by group
> > order by group;
> >quit;
>
> >/******
> > ******
> > ******/
>
> >There is only 7 groups (instead of 20) because most of the data is in
> >a
> >single point.
> >For the group 1 we can do nothing (maybe I should join it to the group
> >2)
>
> >My question is: Is there some method to get just 20 groups and
> >the number of observations for each group equal for each one?
> >(or similar at least)
>
> >Best regards,
>
> >ManuelWe were just covering this topic in the last few days in SAS-L. I
> guess you missed it.
>
> An easy way is to use PROC SORT or PROC SQL or a DATA step
> to get rid of duplicates on ID,
> and then use PROC SURVEYSELECT to pick out n=20 groups.
>
> But what is the larger goal of this sort of selection process? That
> should drive some of the mechanics of the selection process and
> the choices of things like inclusion probabilities.
>
> For example, are you eventually going to want selection independent
> of the number of records in each Id, or proportional to the number of
> records in each Id, or something else?
>
> Bear in mind that, when you have data like these and you want
> sampling proportional to 'size', the relative sizes can make it impossible
> to
> get the selection *exactly* as you hoped for. So state your objectives
> clearly, please.
>
> HTH,
> David
> --
> David L. Cassell
> mathematical statistician
> Design Pathways
> 3115 NW Norwood Pl.
> Corvallis OR 97330
>
> _________________________________________________________________
> Valentine's Day -- Shop for gifts that spell L-O-V-E at MSN Shoppinghttp://shopping.msn.com/content/shp/?ctId=8323,ptnrid=37,ptnrdata=240...
|