LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (September 2006, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 20 Sep 2006 17:03:14 -0400
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: Fast count distinct / review and remove duplicates
Comments: To: ben.powell@CLA.CO.UK
In-Reply-To:  <200609201343.k8KAn54w011880@mailgw.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"

Ben: Just went through this process ....

proc sql; /* Select set of rows with duplicated ID. */ create table dups as select * from x group by id having count(ID)>1 ; /* Count number of IDs with duplicates. */ select count(distinct ID) from test ; quit;

To a SQLHead, the program looks anything but clunky. To a relational DB kind of guy, the purpose of identifying primary key integrity violations isn't to fix the ID's, which could make matters worse, but to detect ambiguity in the implementation of a data model. Indistinct primary key values could mean 1. duplication of a row of data for an entity; 2. different values split across rows of data for an entity; 3. contradictory values of data for an entity; 4. misidentification of attributes of at least one entry.

Each of these conditions has a different remedy. In a database these conditions indicate a disconnect between reality and the way the database represents reality. Sig -----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of ben.powell@CLA.CO.UK Sent: Wednesday, September 20, 2006 9:44 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Fast count distinct / review and remove duplicates

Dear List

What is a good (optimized) way to test for uniqueness in a range of, say, 4.1m 13 char id numbers?

Considerations:

Data not sorted It would be useful to review duplicate records if they exist - expected very low rate of duplication e.g. < 0.01%

/*CLUNKY.*/ proc sql; select count(distinct id) from x; quit;/*4158025*/

proc sql; create table test as select id, count(*) as qty from x group by id having qty>1; quit;/*3*/

proc sql; create table x2 as select id from x where ^ id in( select isn from test); quit;/*4158022*/ /*/CLUNKY*/

Any comments much appreciated,

Rgds.

-----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of ben.powell@CLA.CO.UK Sent: Wednesday, September 20, 2006 9:44 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Fast count distinct / review and remove duplicates

Dear List

What is a good (optimized) way to test for uniqueness in a range of, say, 4.1m 13 char id numbers?

Considerations:

Data not sorted It would be useful to review duplicate records if they exist - expected very low rate of duplication e.g. < 0.01%

/*CLUNKY.*/ proc sql; select count(distinct id) from x; quit;/*4158025*/

proc sql; create table test as select id, count(*) as qty from x group by id having qty>1; quit;/*3*/

proc sql; create table x2 as select id from x where ^ id in( select isn from test); quit;/*4158022*/ /*/CLUNKY*/

Any comments much appreciated,

Rgds.


Back to: Top of message | Previous page | Main SAS-L page