LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 2005, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Fri, 9 Dec 2005 10:27:43 -0500
Reply-To:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:   Re: How to find duplicate records
Comments:   To: Anonymous user <Number_42@caramail.com>
Content-Type:   text/plain; charset="us-ascii"

Anon: Once out of the sphere of physical sequencing of data, the ordering of duplicates in a dataset has no importance; as a result, a standard SQL method selects all duplicates on a single or composite key:

select * from test group by x,y having count(*)>1;

To find duplicates in any column variable of a dataset, group on all column variables. The SAS-L archives contain many variations on this theme. Many database programmers use SAS to prepare data for loading into relational database systems. Checks for distinct primary keys occur early in the process.

What one does with the key duplicates depends on the situation. The SQL method at least avoids arbitrary selection of rows to delete. I regard that as a dangerous practice in the case of key duplicates. Sig

-----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Anonymous user Sent: Friday, December 09, 2005 6:09 AM To: sas-l@uga.edu Subject: How to find duplicate records

Hello,

i have no question this time. I found out myself a solution, but since the problem of finding duplicates is so basic and i have not found how to do it on this group, i post my source code for the next one:

option mprint;

/* ** tested on SAS 8.02 ** Description: ** - set_to_search: the set where to search for duplicates ** - key: the key to search duplicates for, can be a list of column names!!!! ** - duplicates_set: the result set, to put only the duplicates ** ** "key" can be composed of several columns (no first/last) ** "set_to_search" does not need a numeric column (like for proc means) */ %macro find_duplicates(set_to_search, key, duplicates_set); %local sorted_set filter_set; %local i key_column;

%let sorted_set = __sorted_set; %let filter_set = __filter_set;

proc sort data = &set_to_search out = &sorted_set; by &key; run;

data &filter_set (keep=&key); set &sorted_set;

if %let i = 1; %let key_column = %scan(&key, &i); %do %while ("&key_column" ne "");

/* first test ? */ %if &i > 1 %then %do; and %end;

(&key_column = lag(&key_column))

%let i = %eval(&i + 1); %let key_column = %scan(&key, &i); %end; then output; run;

/* remove duplicates "key of duplicates" */ proc sort data = &filter_set nodup; by &key; run;

/* get the data with keys */ data &duplicates_set; merge &sorted_set (in=in1) &filter_set (in=in2); by &key;

if in1 and in2; run;

proc datasets nolist; delete &sorted_set; delete &filter_set; run;

%mend find_duplicates;

/* what i have */ data data; input a $ b $ c $; datalines; a1 b1 cat a1 b2 dog a1 b1 . a2 b1 dog ;

/* what i want */ /* a b c

a1 b1 cat a1 b1 . */

%find_duplicates(data, a b, dups);

HTH, regards


Back to: Top of message | Previous page | Main SAS-L page