| Date: | Thu, 19 Aug 2004 11:05:16 -0700 |
| Reply-To: | "Pardee, Roy" <pardee.r@GHC.ORG> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | "Pardee, Roy" <pardee.r@GHC.ORG> |
| Subject: | Re: group by first name or last name |
|
| Content-Type: | text/plain; charset="us-ascii" |
How close does this get you?
data people ;
input FirstName $ LastName $ ;
datalines ;
John Smith
George Smith
Bill Clinton
George Bush
;
proc sql ;
select p1.FirstName, p1.LastName, coalesce(p2.LastName, p1.LastName)
as Group
from people as p1 LEFT JOIN
people as p2
on p1.FirstName = p2.FirstName
order by 3 ;
quit ;
Output:
FirstName LastName Group
=============================
Bill Clinton Clinton
George Bush Bush
George Bush Smith
George Smith Bush
George Smith Smith
John Smith Smith
?
Cheers,
-Roy
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Richard A. DeVenezia
Sent: Thursday, August 19, 2004 10:14 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: group by first name or last name
Art wrote:
> How to group peole by their first name OR last name,
>
> 1 John Smith
> 2 Goerge Smith
> 3 Bill Clinton
> 4 George Bush
>
>
> I need to put 1,2,4 into to one group. Is there any way I can do this
> in SAS or SQL?
Art:
An interesting problem.
Lets formulate:
Give a set of pairs (key1, key2), find the distinct groups of pairs such
that each pair in a group has this property:
key1 matches key1 of any other pair in the group.
-or-
key2 matches key2 of any other pair in the group.
Straight up SQL, if possible, would no doubt require a full outer join
and perhaps multiple passes.
Here I present an iterative way using version 9 hashes.
Two hashes maintain the groupId assigned to each key value.
Two additional hashes are used to maintain group mapping paths. When the
data can be passed without causing a mapping, then the groups have been
fully determined. Then a final pass is done, at which point the groupIds
are assigned to each pair.
[ I haven't taken the time to prove if this algorithm is correct (or
defeatable) for all data sequences. Since there is no proof, one must
assume there is a possibility ever so slight that some data sequence
could defeat the algorithm (i.e. the data truly indicates N groups and
the algorithm finds >N groups) ]
-----
%let seed =
%sysfunc(mod(%sysfunc(compress(%sysfunc(constant(e)),.)),2**31));
data pairs;
do id = 1 to 1e4;
key1 = int (1e4*ranuni(&seed));
key2 = int (1e4*ranuni(&seed));
output;
end;
run;
/*
data pairs;
id + 1;
input key1 $ key2 $;
cards;
John Smith
George Smith
Bill Clinton
George Bush
run;
data pairs;
id + 1;
input key1 key2 ;
format _numeric_ 4.;
cards;
1 2
3 2
4 5
3 6
4 2
run;
*/
%let dbg = *;
data pairsWithGroupAssignments ;
declare hash one();
one.definekey ('key1');
one.definedata ('key1', 'groupid');
one.definedone();
declare hash two();
two.definekey ('key2');
two.definedata ('key2', 'groupid');
two.definedone();
declare hash map1();
map1.definekey ('from');
map1.definedata ('from', 'to');
map1.definedone();
declare hash map2();
map2.definekey ('from');
map2.definedata ('from', 'to');
map2.definedone();
_groupId = 0;
noMappings = 0;
do until (noMappings and outputDone);
doOutput = noMappings;
noMappings = 1;
do _n_ = 1 to numberOfPairs;
set pairs nobs=numberOfPairs point=_n_;
rc1 = one.find(); g1 = groupId;
rc2 = two.find(); g2 = groupId;
if doOutput then do;
output;
continue;
end;
&dbg.
put id= '(' key1 +(-1) ', ' key2 +(-1) ') ' @;
if rc1 ne 0 and rc2 ne 0 then do;
/** /
addboth:
/**/
_groupId + 1;
groupId = _groupId;
one.add ();
two.add ();
&dbg. put 'add ' key1= 'and ' key2= 'to ' groupId=;
end;
else
if rc1 ne 0 and rc2 = 0 then do;
/** /
add1:
/**/
groupId = g2;
one.add();
&dbg. put 'add ' key1= 'to ' groupId=;
end;
else
if rc1 = 0 and rc2 ne 0 then do;
/** /
add2:
/**/
groupId = g1;
two.add();
&dbg. put 'add ' key2= 'to ' groupId=;
end;
else
if g1 > g2 then do;
/** /
g1g2:
/**/
from = g1;
to = g2;
* determine groupid by following map1;
_to = to;
do while (map1.find(key:_to) = 0);
_to = to;
end;
from = g1;
map1.replace();
groupId = to;
one.replace();
&dbg. put 'add ' key1= 'to ' groupId= 'mapped from key1 group ' from;
noMappings = 0;
end;
else
if g2 > g1 then do;
/** /
g2g1:
/**/
from = g2;
to = g1;
* determine groupid by following map2;
to_ = to;
do while (map2.find(key:to_) = 0);
to_ = to;
end;
from = g2;
map2.replace();
groupId = to;
two.replace();
&dbg. put 'add ' key2= 'to ' groupId= 'mapped from key2 group ' from;
noMappings = 0;
end;
else do;
/** /
same:
/**/
&dbg. put rc1= rc2= g1= g2=;
end;
end;
nPass + 1;
outputDone = doOutput;
end;
put 'NOTE: Data iterated ' npass 'times.';
/*
two.output(dataset:'g2');
one.output(dataset:'g1');
map1.output(dataset:'map1');
map2.output(dataset:'map2');
*/
stop;
keep id key1 key2 groupId;
format _numeric_ 8.;
run;
proc sql noprint;
select count(distinct groupId) into :ngroups from &syslast; quit;
%put ngroups=&ngroups;
-----
Tracking relationships or paths through a network has been discussed in
several prior threads over the years.
You might also find www.devenezia.com/downloads/sas/samples#superset
interesting or helpful.
--
Richard A. DeVenezia
Done being productive? Try SAS Tetris.
http://www.devenezia.com/downloads/sas/af?topic=27
|