Date: Wed, 20 Sep 2006 17:03:14 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Fast count distinct / review and remove duplicates
In-Reply-To: <200609201343.k8KAn54w011880@mailgw.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"
Ben:
Just went through this process ....
proc sql;
/* Select set of rows with duplicated ID. */
create table dups as
select * from x
group by id having count(ID)>1
;
/* Count number of IDs with duplicates. */
select count(distinct ID) from test
;
quit;
To a SQLHead, the program looks anything but clunky. To a relational DB
kind of guy, the purpose of identifying primary key integrity violations
isn't to fix the ID's, which could make matters worse, but to detect
ambiguity in the implementation of a data model. Indistinct primary key
values could mean
1. duplication of a row of data for an entity;
2. different values split across rows of data for an entity;
3. contradictory values of data for an entity;
4. misidentification of attributes of at least one entry.
Each of these conditions has a different remedy. In a database these
conditions indicate a disconnect between reality and the way the
database represents reality.
Sig
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of ben.powell@CLA.CO.UK
Sent: Wednesday, September 20, 2006 9:44 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Fast count distinct / review and remove duplicates
Dear List
What is a good (optimized) way to test for uniqueness in a range of,
say, 4.1m 13 char id numbers?
Considerations:
Data not sorted
It would be useful to review duplicate records if they exist - expected
very low rate of duplication e.g. < 0.01%
/*CLUNKY.*/
proc sql;
select count(distinct id)
from x;
quit;/*4158025*/
proc sql;
create table test as
select id, count(*) as qty
from x
group by id
having qty>1;
quit;/*3*/
proc sql;
create table x2 as
select id
from x
where ^ id in(
select isn
from test);
quit;/*4158022*/
/*/CLUNKY*/
Any comments much appreciated,
Rgds.
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of ben.powell@CLA.CO.UK
Sent: Wednesday, September 20, 2006 9:44 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Fast count distinct / review and remove duplicates
Dear List
What is a good (optimized) way to test for uniqueness in a range of,
say, 4.1m 13 char id numbers?
Considerations:
Data not sorted
It would be useful to review duplicate records if they exist - expected
very low rate of duplication e.g. < 0.01%
/*CLUNKY.*/
proc sql;
select count(distinct id)
from x;
quit;/*4158025*/
proc sql;
create table test as
select id, count(*) as qty
from x
group by id
having qty>1;
quit;/*3*/
proc sql;
create table x2 as
select id
from x
where ^ id in(
select isn
from test);
quit;/*4158022*/
/*/CLUNKY*/
Any comments much appreciated,
Rgds.