|
Anon:
Once out of the sphere of physical sequencing of data, the ordering of
duplicates in a dataset has no importance; as a result, a standard SQL
method selects all duplicates on a single or composite key:
select * from test group by x,y having count(*)>1;
To find duplicates in any column variable of a dataset, group on all
column variables. The SAS-L archives contain many variations on this
theme. Many database programmers use SAS to prepare data for loading
into relational database systems. Checks for distinct primary keys occur
early in the process.
What one does with the key duplicates depends on the situation. The SQL
method at least avoids arbitrary selection of rows to delete. I regard
that as a dangerous practice in the case of key duplicates.
Sig
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Anonymous user
Sent: Friday, December 09, 2005 6:09 AM
To: sas-l@uga.edu
Subject: How to find duplicate records
Hello,
i have no question this time. I found out myself a solution, but since
the problem of finding duplicates is so basic and i have not found how
to do it on this group, i post my source code for the next one:
option mprint;
/*
** tested on SAS 8.02
** Description:
** - set_to_search: the set where to search for duplicates
** - key: the key to search duplicates for, can be a list of column
names!!!!
** - duplicates_set: the result set, to put only the duplicates
**
** "key" can be composed of several columns (no first/last)
** "set_to_search" does not need a numeric column (like for proc means)
*/ %macro find_duplicates(set_to_search, key, duplicates_set);
%local sorted_set filter_set;
%local i key_column;
%let sorted_set = __sorted_set;
%let filter_set = __filter_set;
proc sort data = &set_to_search out = &sorted_set; by &key; run;
data &filter_set (keep=&key);
set &sorted_set;
if
%let i = 1;
%let key_column = %scan(&key, &i);
%do %while ("&key_column" ne "");
/* first test ? */
%if &i > 1 %then %do;
and
%end;
(&key_column = lag(&key_column))
%let i = %eval(&i + 1);
%let key_column = %scan(&key, &i);
%end;
then output;
run;
/* remove duplicates "key of duplicates" */
proc sort data = &filter_set nodup; by &key; run;
/* get the data with keys */
data &duplicates_set;
merge
&sorted_set (in=in1)
&filter_set (in=in2);
by
&key;
if in1 and in2;
run;
proc datasets nolist;
delete &sorted_set;
delete &filter_set;
run;
%mend find_duplicates;
/* what i have */
data data;
input a $ b $ c $;
datalines;
a1 b1 cat
a1 b2 dog
a1 b1 .
a2 b1 dog
;
/* what i want */
/*
a b c
a1 b1 cat
a1 b1 .
*/
%find_duplicates(data, a b, dups);
HTH,
regards
|