LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (August 2004, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Thu, 19 Aug 2004 11:05:16 -0700
Reply-To:   "Pardee, Roy" <pardee.r@GHC.ORG>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Pardee, Roy" <pardee.r@GHC.ORG>
Subject:   Re: group by first name or last name
Comments:   To: "Richard A. DeVenezia" <radevenz@IX.NETCOM.COM>
Content-Type:   text/plain; charset="us-ascii"

How close does this get you?

data people ; input FirstName $ LastName $ ; datalines ; John Smith George Smith Bill Clinton George Bush ;

proc sql ; select p1.FirstName, p1.LastName, coalesce(p2.LastName, p1.LastName) as Group from people as p1 LEFT JOIN people as p2 on p1.FirstName = p2.FirstName order by 3 ; quit ;

Output:

FirstName LastName Group ============================= Bill Clinton Clinton George Bush Bush George Bush Smith George Smith Bush George Smith Smith John Smith Smith

?

Cheers,

-Roy

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Richard A. DeVenezia Sent: Thursday, August 19, 2004 10:14 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: group by first name or last name

Art wrote: > How to group peole by their first name OR last name, > > 1 John Smith > 2 Goerge Smith > 3 Bill Clinton > 4 George Bush > > > I need to put 1,2,4 into to one group. Is there any way I can do this > in SAS or SQL?

Art:

An interesting problem. Lets formulate:

Give a set of pairs (key1, key2), find the distinct groups of pairs such that each pair in a group has this property: key1 matches key1 of any other pair in the group. -or- key2 matches key2 of any other pair in the group.

Straight up SQL, if possible, would no doubt require a full outer join and perhaps multiple passes.

Here I present an iterative way using version 9 hashes. Two hashes maintain the groupId assigned to each key value. Two additional hashes are used to maintain group mapping paths. When the data can be passed without causing a mapping, then the groups have been fully determined. Then a final pass is done, at which point the groupIds are assigned to each pair.

[ I haven't taken the time to prove if this algorithm is correct (or defeatable) for all data sequences. Since there is no proof, one must assume there is a possibility ever so slight that some data sequence could defeat the algorithm (i.e. the data truly indicates N groups and the algorithm finds >N groups) ]

----- %let seed = %sysfunc(mod(%sysfunc(compress(%sysfunc(constant(e)),.)),2**31));

data pairs; do id = 1 to 1e4; key1 = int (1e4*ranuni(&seed)); key2 = int (1e4*ranuni(&seed)); output; end; run;

/* data pairs; id + 1; input key1 $ key2 $; cards; John Smith George Smith Bill Clinton George Bush run;

data pairs; id + 1; input key1 key2 ; format _numeric_ 4.; cards; 1 2 3 2 4 5 3 6 4 2 run; */

%let dbg = *;

data pairsWithGroupAssignments ;

declare hash one(); one.definekey ('key1'); one.definedata ('key1', 'groupid'); one.definedone();

declare hash two(); two.definekey ('key2'); two.definedata ('key2', 'groupid'); two.definedone();

declare hash map1(); map1.definekey ('from'); map1.definedata ('from', 'to'); map1.definedone();

declare hash map2(); map2.definekey ('from'); map2.definedata ('from', 'to'); map2.definedone();

_groupId = 0; noMappings = 0;

do until (noMappings and outputDone);

doOutput = noMappings; noMappings = 1;

do _n_ = 1 to numberOfPairs;

set pairs nobs=numberOfPairs point=_n_;

rc1 = one.find(); g1 = groupId; rc2 = two.find(); g2 = groupId;

if doOutput then do; output; continue; end;

&dbg. put id= '(' key1 +(-1) ', ' key2 +(-1) ') ' @;

if rc1 ne 0 and rc2 ne 0 then do; /** / addboth: /**/ _groupId + 1; groupId = _groupId; one.add (); two.add (); &dbg. put 'add ' key1= 'and ' key2= 'to ' groupId=; end; else if rc1 ne 0 and rc2 = 0 then do; /** / add1: /**/ groupId = g2; one.add(); &dbg. put 'add ' key1= 'to ' groupId=; end; else if rc1 = 0 and rc2 ne 0 then do; /** / add2: /**/ groupId = g1; two.add(); &dbg. put 'add ' key2= 'to ' groupId=; end; else if g1 > g2 then do; /** / g1g2: /**/

from = g1; to = g2;

* determine groupid by following map1; _to = to; do while (map1.find(key:_to) = 0); _to = to; end;

from = g1; map1.replace();

groupId = to; one.replace();

&dbg. put 'add ' key1= 'to ' groupId= 'mapped from key1 group ' from;

noMappings = 0; end; else if g2 > g1 then do; /** / g2g1: /**/ from = g2; to = g1;

* determine groupid by following map2; to_ = to; do while (map2.find(key:to_) = 0); to_ = to; end;

from = g2; map2.replace();

groupId = to; two.replace();

&dbg. put 'add ' key2= 'to ' groupId= 'mapped from key2 group ' from;

noMappings = 0; end; else do; /** / same: /**/ &dbg. put rc1= rc2= g1= g2=; end;

end;

nPass + 1;

outputDone = doOutput; end;

put 'NOTE: Data iterated ' npass 'times.'; /* two.output(dataset:'g2'); one.output(dataset:'g1'); map1.output(dataset:'map1'); map2.output(dataset:'map2'); */ stop;

keep id key1 key2 groupId; format _numeric_ 8.;

run;

proc sql noprint; select count(distinct groupId) into :ngroups from &syslast; quit;

%put ngroups=&ngroups; -----

Tracking relationships or paths through a network has been discussed in several prior threads over the years.

You might also find www.devenezia.com/downloads/sas/samples#superset interesting or helpful.

-- Richard A. DeVenezia Done being productive? Try SAS Tetris. http://www.devenezia.com/downloads/sas/af?topic=27


Back to: Top of message | Previous page | Main SAS-L page