LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (July 2007, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 12 Jul 2007 08:53:44 -0700
Reply-To:     "Nordlund, Dan (DSHS/RDA)" <NordlDJ@DSHS.WA.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Nordlund, Dan (DSHS/RDA)" <NordlDJ@DSHS.WA.GOV>
Subject:      Re: Finding parent-child relationship within ID variable
Comments: To: Hari <excel_hari@YAHOO.COM>
In-Reply-To:  <1184237660.003306.69400@m3g2000hsh.googlegroups.com>
Content-Type: text/plain; charset=iso-8859-1

> -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On > Behalf Of Hari > Sent: Thursday, July 12, 2007 3:54 AM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Finding parent-child relationship within ID variable > > On Jul 12, 3:38 am, Nord...@DSHS.WA.GOV ("Nordlund, Dan (DSHS/RDA)") > wrote: > > > -----Original Message----- > > > From: SAS(r) Discussion [mailto:S...@LISTSERV.UGA.EDU] On > > > Behalf Of Hari > > > Sent: Wednesday, July 11, 2007 2:39 PM > > > To: S...@LISTSERV.UGA.EDU > > > Subject: Finding parent-child relationship within ID variable > > > > > Hi, > > > > > I have a SAS data set A with 2 fields. Both are ID fields > and all the > > > digits in the ID are "numbers" and stored in $16 format. > > > > > The data set A was originally created by using a join of > a data set B > > > with itself (B*B) and only those records which met > certain conditions > > > where only retained in the output. The original data set B had 80 > > > million obs and right now the final output data set A has > close to 90 > > > Million observations. > > > > > By doing the join operation Im trying to see as to which ID's are > > > related to each other. (basically, the data B is customer > level data > > > which was house-holded using address logic. Presently I want to do > > > additonal house-holding using email/name etc...hence the > join was done > > > to see which customers have same email/name but diff > house-hold level > > > ID's) > > > > > *21 sample records (for simplicity sake im putting the length of > > > fields as just 2); > > > > > ID_1 ID_2 > > > 11 67 > > > 11 55 > > > 11 98 > > > 67 55 > > > 55 67 > > > 98 11 > > > 98 55 > > > 55 98 > > > 67 98 > > > 55 11 > > > 98 67 > > > 67 11 > > > 23 45 > > > 92 35 > > > 45 23 > > > 92 29 > > > 29 35 > > > 35 92 > > > 29 92 > > > 35 29 > > > 29 35 > > > > > Now, for me distinction between ID_1 and ID_2 is > meaningless. As long > > > as ID 11 is related to ID 67, it means the same as another record > > > which says ID 67 is related to ID 11. Hence, I want to reduce the > > > above data set to the below format, such that only unique > information > > > is retained > > > > > ID_modified1 ID_modified_2 > > > 11 67 > > > 11 55 > > > 11 98 > > > 23 45 > > > 29 35 > > > 29 92 > > > > > (In above output am retaining the minimum ID in left and the rest > > > higher ID's in right, but I can keep it anyway as long as > information > > > isnt duplicated) > > > > > Whats an efficient way to arrive at this output? (I hope > to learn some > > > nice techniques!) > > > > > Btw, I do know one method which is to Transpose data set A by ID_1 > > > (and do Var ID_2) and after transposing rearranging the > ID values in > > > each record individually such that minimum ID is in left > most variable > > > and maximum in right most and so on.Finally, do a SQL > distinct on the > > > rearranged data set and this distinct data set can again > be transposed > > > by the left most ID variable to get the final output. > > > > > The steps that I dont really like in my method is > rearranging because: > > > - > > > a) I really dont know in advance as to how many new transposed > > > variables will be created. It can be even as big as 200 > or even 400! I > > > can probably do a proc contents and store the number of resulting > > > fields in a macro variable > > > > > b) More bigger problem is rearrangng itself. Since, num of fields > > > might even be 400 or so, I will usually create that many > dummy data > > > set variables, but again I dont know an efficient way to sort all > > > variables within each record individually. > > > > > Please guide me. > > > > > regards, > > > HP > > > India > > > > Hari, > > > > You could try something like: > > > > data temp / view = temp; > > set have; > > if id_1 LT ID_2 then do; > > t=id_1; > > id_1=id_2; > > id_2=t; > > end; > > run; > > proc sql ; > > create table want as > > select distinct id_1, id_2 > > from temp > > order by id_1 > > ; > > quit; > > > > Hope this is helpful, > > > > Dan > > > > Daniel J. Nordlund > > Research and Data Analysis > > Washington State Department of Social and Health Services > > Olympia, WA 98504-5204- Hide quoted text - > > > > - Show quoted text - > > Daniel, > > That wouldnt work in the present case, because after the first data > (view) step, the temp would like (I have dropped the variable t as it > is unneccesary) > > ID_1 ID_2 > 11 67 > 11 55 > 11 98 > 55 67 > 55 67 > 11 98 > 55 98 > 55 98 > 67 98 > 11 55 > 67 98 > 11 67 > 23 45 > 35 92 > 23 45 > 29 92 > 29 35 > 35 92 > 29 92 > 29 35 > 29 35 > > Now when we do distinct using the "Want" data set result would like > > ID_1 ID_2 > 11 67 > 11 55 > 11 98 > 23 45 > 29 35 > 29 92 > 35 92 > 55 67 > 55 98 > 67 98 > > Which is not equivalent to the result Im looking for. if you notice > above there is a "information" duplication above wherein 11 is mapped > to 67 and 55 and in a later record we have 55 mapped to 67, which is > unnecessary for my case. > > hp >

Hari,

If I understand correctly (and we already know that is not a given :-), you can just do an extra SQL query to eliminate rows where id_1 shows up in id_2.

data temp / view = temp; set have; if id_2 LT ID_1 then do; t=id_1; id_1=id_2; id_2=t; end; drop t; run; proc sql ; create table temp2 as select distinct id_1, id_2 from temp order by id_1 ; create table want as select a.* from temp2 as a where a.id_1 not in(select id_2 from temp2) ; quit;

Hope this is helpful,

Dan

Daniel J. Nordlund Research and Data Analysis Washington State Department of Social and Health Services Olympia, WA 98504-5204


Back to: Top of message | Previous page | Main SAS-L page