Date: Thu, 12 Jul 2007 08:53:44 -0700
Reply-To: "Nordlund, Dan (DSHS/RDA)" <NordlDJ@DSHS.WA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Nordlund, Dan (DSHS/RDA)" <NordlDJ@DSHS.WA.GOV>
Subject: Re: Finding parent-child relationship within ID variable
In-Reply-To: <1184237660.003306.69400@m3g2000hsh.googlegroups.com>
Content-Type: text/plain; charset=iso-8859-1
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of Hari
> Sent: Thursday, July 12, 2007 3:54 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Finding parent-child relationship within ID variable
>
> On Jul 12, 3:38 am, Nord...@DSHS.WA.GOV ("Nordlund, Dan (DSHS/RDA)")
> wrote:
> > > -----Original Message-----
> > > From: SAS(r) Discussion [mailto:S...@LISTSERV.UGA.EDU] On
> > > Behalf Of Hari
> > > Sent: Wednesday, July 11, 2007 2:39 PM
> > > To: S...@LISTSERV.UGA.EDU
> > > Subject: Finding parent-child relationship within ID variable
> >
> > > Hi,
> >
> > > I have a SAS data set A with 2 fields. Both are ID fields
> and all the
> > > digits in the ID are "numbers" and stored in $16 format.
> >
> > > The data set A was originally created by using a join of
> a data set B
> > > with itself (B*B) and only those records which met
> certain conditions
> > > where only retained in the output. The original data set B had 80
> > > million obs and right now the final output data set A has
> close to 90
> > > Million observations.
> >
> > > By doing the join operation Im trying to see as to which ID's are
> > > related to each other. (basically, the data B is customer
> level data
> > > which was house-holded using address logic. Presently I want to do
> > > additonal house-holding using email/name etc...hence the
> join was done
> > > to see which customers have same email/name but diff
> house-hold level
> > > ID's)
> >
> > > *21 sample records (for simplicity sake im putting the length of
> > > fields as just 2);
> >
> > > ID_1 ID_2
> > > 11 67
> > > 11 55
> > > 11 98
> > > 67 55
> > > 55 67
> > > 98 11
> > > 98 55
> > > 55 98
> > > 67 98
> > > 55 11
> > > 98 67
> > > 67 11
> > > 23 45
> > > 92 35
> > > 45 23
> > > 92 29
> > > 29 35
> > > 35 92
> > > 29 92
> > > 35 29
> > > 29 35
> >
> > > Now, for me distinction between ID_1 and ID_2 is
> meaningless. As long
> > > as ID 11 is related to ID 67, it means the same as another record
> > > which says ID 67 is related to ID 11. Hence, I want to reduce the
> > > above data set to the below format, such that only unique
> information
> > > is retained
> >
> > > ID_modified1 ID_modified_2
> > > 11 67
> > > 11 55
> > > 11 98
> > > 23 45
> > > 29 35
> > > 29 92
> >
> > > (In above output am retaining the minimum ID in left and the rest
> > > higher ID's in right, but I can keep it anyway as long as
> information
> > > isnt duplicated)
> >
> > > Whats an efficient way to arrive at this output? (I hope
> to learn some
> > > nice techniques!)
> >
> > > Btw, I do know one method which is to Transpose data set A by ID_1
> > > (and do Var ID_2) and after transposing rearranging the
> ID values in
> > > each record individually such that minimum ID is in left
> most variable
> > > and maximum in right most and so on.Finally, do a SQL
> distinct on the
> > > rearranged data set and this distinct data set can again
> be transposed
> > > by the left most ID variable to get the final output.
> >
> > > The steps that I dont really like in my method is
> rearranging because:
> > > -
> > > a) I really dont know in advance as to how many new transposed
> > > variables will be created. It can be even as big as 200
> or even 400! I
> > > can probably do a proc contents and store the number of resulting
> > > fields in a macro variable
> >
> > > b) More bigger problem is rearrangng itself. Since, num of fields
> > > might even be 400 or so, I will usually create that many
> dummy data
> > > set variables, but again I dont know an efficient way to sort all
> > > variables within each record individually.
> >
> > > Please guide me.
> >
> > > regards,
> > > HP
> > > India
> >
> > Hari,
> >
> > You could try something like:
> >
> > data temp / view = temp;
> > set have;
> > if id_1 LT ID_2 then do;
> > t=id_1;
> > id_1=id_2;
> > id_2=t;
> > end;
> > run;
> > proc sql ;
> > create table want as
> > select distinct id_1, id_2
> > from temp
> > order by id_1
> > ;
> > quit;
> >
> > Hope this is helpful,
> >
> > Dan
> >
> > Daniel J. Nordlund
> > Research and Data Analysis
> > Washington State Department of Social and Health Services
> > Olympia, WA 98504-5204- Hide quoted text -
> >
> > - Show quoted text -
>
> Daniel,
>
> That wouldnt work in the present case, because after the first data
> (view) step, the temp would like (I have dropped the variable t as it
> is unneccesary)
>
> ID_1 ID_2
> 11 67
> 11 55
> 11 98
> 55 67
> 55 67
> 11 98
> 55 98
> 55 98
> 67 98
> 11 55
> 67 98
> 11 67
> 23 45
> 35 92
> 23 45
> 29 92
> 29 35
> 35 92
> 29 92
> 29 35
> 29 35
>
> Now when we do distinct using the "Want" data set result would like
>
> ID_1 ID_2
> 11 67
> 11 55
> 11 98
> 23 45
> 29 35
> 29 92
> 35 92
> 55 67
> 55 98
> 67 98
>
> Which is not equivalent to the result Im looking for. if you notice
> above there is a "information" duplication above wherein 11 is mapped
> to 67 and 55 and in a later record we have 55 mapped to 67, which is
> unnecessary for my case.
>
> hp
>
Hari,
If I understand correctly (and we already know that is not a given :-), you can just do an extra SQL query to eliminate rows where id_1 shows up in id_2.
data temp / view = temp;
set have;
if id_2 LT ID_1 then do;
t=id_1;
id_1=id_2;
id_2=t;
end;
drop t;
run;
proc sql ;
create table temp2 as
select distinct id_1, id_2
from temp
order by id_1
;
create table want as
select a.*
from temp2 as a
where a.id_1 not in(select id_2 from temp2)
;
quit;
Hope this is helpful,
Dan
Daniel J. Nordlund
Research and Data Analysis
Washington State Department of Social and Health Services
Olympia, WA 98504-5204
|