| Date: | Thu, 1 Jun 2006 18:06:14 -0400 |
| Reply-To: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Subject: | Re: fuzzy deduping |
|
| Content-Type: | text/plain; charset="us-ascii" |
Clinton:
Consider a simpler SAS SQL solution along the lines of
select * from ONE
group by client_ID
having datetime+'01jan1960:00:00:00'dt -
mean(datetime+'01jan1960:00:00:00'dt)< 150
and count(*)>1
;
It involves a sort operation. I don't see much in the way of
optimization going on in the _method SAS SQL is using.
An alternative (assuming no pure duplicates of datetimes within a
client_ID) that does 'optimize' implements a reflexive join:
select distinct t1.* from ONE as t1 inner join ONE as t2
on t1.client_ID=t2.client_ID
where t1.datetime ^=
t2.datetime
and t1.datetime between t2.datetime - 150
and
t2.datetime + 150
;
Sig
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Rickards, Clinton (GE Consumer Finance)
Sent: Thursday, June 01, 2006 9:00 AM
To: SAS-L (E-mail)
Subject: fuzzy deduping
All,
we are looking to identify certain transactions that happen within 5
minutes of each other. The code below is what has been developed thus
far. It works but strikes me as being slow (5-6 minutes for 6.5M obs for
the core data step DEDUPDATE, not including the sort time). The code
sorts the data set ONE, clones it into LOOKUPONE, and then compares each
row to the _n_+1 row in LOOKUPONE. Although CLIENT_ID is not used in
this comparison it is needed for later processing so I have left it in.
Any alternative approaches would be appreciated.
In this example, the first 2 obs are considered duplicates and so
STATUS=D for obs 1. Obs 2 and 3 have different amounts, and obs 4 is a
different account, so STATUS=K for obs 2, 3, and 4.
Thanks,
Clint
data one;
attrib client_id length=$20;
attrib datetime format=datetime. informat=datetime.;
attrib tran_code length=$3;
input client_id
account_key
datetime
tran_code
tran_amt;
cards;
A 1234 01jun2006:12:45:00 123 15.67
A 1234 01jun2006:12:47:00 123 15.67
A 1234 01jun2006:12:48:14 123 1234.56
B 5678 01jun2006:12:45:00 123 19.98
run;
proc sort;
by client_id account_key datetime TRAN_CODE TRAN_AMT;
run;
data lookupone (keep=datetime TRAN_AMT account_key
rename=(datetime=_datetime TRAN_AMT=_TRAN_AMT
account_key=_account_key));
set one;
run;
%let duptime=300;
data dedupdate;
drop _datetime _TRAN_AMT _account_key ;
set one nobs=numobs;
length status $ 1;
mm=_n_+1;
status = "K";
if mm le numobs then
do;
set lookupone point=mm;
if (0 <= abs(datetime-_datetime) <= &duptime)
and TRAN_AMT=_TRAN_AMT
and account_key = _account_key then
status='D';
end;
run;
Clint Rickards
Dual Card Fraud Strategy Manager
GE Consumer Finance
Shelton, CT
(Internal) 8*370-6156
(External) (203) 944-6156
|