|
On Jul 12, 3:38 am, Nord...@DSHS.WA.GOV ("Nordlund, Dan (DSHS/RDA)")
wrote:
> > -----Original Message-----
> > From: SAS(r) Discussion [mailto:S...@LISTSERV.UGA.EDU] On
> > Behalf Of Hari
> > Sent: Wednesday, July 11, 2007 2:39 PM
> > To: S...@LISTSERV.UGA.EDU
> > Subject: Finding parent-child relationship within ID variable
>
> > Hi,
>
> > I have a SAS data set A with 2 fields. Both are ID fields and all the
> > digits in the ID are "numbers" and stored in $16 format.
>
> > The data set A was originally created by using a join of a data set B
> > with itself (B*B) and only those records which met certain conditions
> > where only retained in the output. The original data set B had 80
> > million obs and right now the final output data set A has close to 90
> > Million observations.
>
> > By doing the join operation Im trying to see as to which ID's are
> > related to each other. (basically, the data B is customer level data
> > which was house-holded using address logic. Presently I want to do
> > additonal house-holding using email/name etc...hence the join was done
> > to see which customers have same email/name but diff house-hold level
> > ID's)
>
> > *21 sample records (for simplicity sake im putting the length of
> > fields as just 2);
>
> > ID_1 ID_2
> > 11 67
> > 11 55
> > 11 98
> > 67 55
> > 55 67
> > 98 11
> > 98 55
> > 55 98
> > 67 98
> > 55 11
> > 98 67
> > 67 11
> > 23 45
> > 92 35
> > 45 23
> > 92 29
> > 29 35
> > 35 92
> > 29 92
> > 35 29
> > 29 35
>
> > Now, for me distinction between ID_1 and ID_2 is meaningless. As long
> > as ID 11 is related to ID 67, it means the same as another record
> > which says ID 67 is related to ID 11. Hence, I want to reduce the
> > above data set to the below format, such that only unique information
> > is retained
>
> > ID_modified1 ID_modified_2
> > 11 67
> > 11 55
> > 11 98
> > 23 45
> > 29 35
> > 29 92
>
> > (In above output am retaining the minimum ID in left and the rest
> > higher ID's in right, but I can keep it anyway as long as information
> > isnt duplicated)
>
> > Whats an efficient way to arrive at this output? (I hope to learn some
> > nice techniques!)
>
> > Btw, I do know one method which is to Transpose data set A by ID_1
> > (and do Var ID_2) and after transposing rearranging the ID values in
> > each record individually such that minimum ID is in left most variable
> > and maximum in right most and so on.Finally, do a SQL distinct on the
> > rearranged data set and this distinct data set can again be transposed
> > by the left most ID variable to get the final output.
>
> > The steps that I dont really like in my method is rearranging because:
> > -
> > a) I really dont know in advance as to how many new transposed
> > variables will be created. It can be even as big as 200 or even 400! I
> > can probably do a proc contents and store the number of resulting
> > fields in a macro variable
>
> > b) More bigger problem is rearrangng itself. Since, num of fields
> > might even be 400 or so, I will usually create that many dummy data
> > set variables, but again I dont know an efficient way to sort all
> > variables within each record individually.
>
> > Please guide me.
>
> > regards,
> > HP
> > India
>
> Hari,
>
> You could try something like:
>
> data temp / view = temp;
> set have;
> if id_1 LT ID_2 then do;
> t=id_1;
> id_1=id_2;
> id_2=t;
> end;
> run;
> proc sql ;
> create table want as
> select distinct id_1, id_2
> from temp
> order by id_1
> ;
> quit;
>
> Hope this is helpful,
>
> Dan
>
> Daniel J. Nordlund
> Research and Data Analysis
> Washington State Department of Social and Health Services
> Olympia, WA 98504-5204- Hide quoted text -
>
> - Show quoted text -
Daniel,
That wouldnt work in the present case, because after the first data
(view) step, the temp would like (I have dropped the variable t as it
is unneccesary)
ID_1 ID_2
11 67
11 55
11 98
55 67
55 67
11 98
55 98
55 98
67 98
11 55
67 98
11 67
23 45
35 92
23 45
29 92
29 35
35 92
29 92
29 35
29 35
Now when we do distinct using the "Want" data set result would like
ID_1 ID_2
11 67
11 55
11 98
23 45
29 35
29 92
35 92
55 67
55 98
67 98
Which is not equivalent to the result Im looking for. if you notice
above there is a "information" duplication above wherein 11 is mapped
to 67 and 55 and in a later record we have 55 mapped to 67, which is
unnecessary for my case.
hp
|