Date: Thu, 1 Jun 2006 18:06:14 -0400 Sigurd Hermansen "SAS(r) Discussion" Sigurd Hermansen Re: fuzzy deduping To: "Rickards, Clinton (GE Consumer Finance)" text/plain; charset="us-ascii"

Clinton:

Consider a simpler SAS SQL solution along the lines of

select * from ONE group by client_ID having datetime+'01jan1960:00:00:00'dt - mean(datetime+'01jan1960:00:00:00'dt)< 150 and count(*)>1 ;

It involves a sort operation. I don't see much in the way of optimization going on in the _method SAS SQL is using.

An alternative (assuming no pure duplicates of datetimes within a client_ID) that does 'optimize' implements a reflexive join: select distinct t1.* from ONE as t1 inner join ONE as t2 on t1.client_ID=t2.client_ID where t1.datetime ^= t2.datetime and t1.datetime between t2.datetime - 150 and t2.datetime + 150 ;

Sig

-----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Rickards, Clinton (GE Consumer Finance) Sent: Thursday, June 01, 2006 9:00 AM To: SAS-L (E-mail) Subject: fuzzy deduping

All,

we are looking to identify certain transactions that happen within 5 minutes of each other. The code below is what has been developed thus far. It works but strikes me as being slow (5-6 minutes for 6.5M obs for the core data step DEDUPDATE, not including the sort time). The code sorts the data set ONE, clones it into LOOKUPONE, and then compares each row to the _n_+1 row in LOOKUPONE. Although CLIENT_ID is not used in this comparison it is needed for later processing so I have left it in. Any alternative approaches would be appreciated.

In this example, the first 2 obs are considered duplicates and so STATUS=D for obs 1. Obs 2 and 3 have different amounts, and obs 4 is a different account, so STATUS=K for obs 2, 3, and 4.

Thanks,

Clint

data one; attrib client_id length=\$20; attrib datetime format=datetime. informat=datetime.; attrib tran_code length=\$3; input client_id account_key datetime tran_code tran_amt; cards; A 1234 01jun2006:12:45:00 123 15.67 A 1234 01jun2006:12:47:00 123 15.67 A 1234 01jun2006:12:48:14 123 1234.56 B 5678 01jun2006:12:45:00 123 19.98 run;

proc sort; by client_id account_key datetime TRAN_CODE TRAN_AMT; run;

data lookupone (keep=datetime TRAN_AMT account_key rename=(datetime=_datetime TRAN_AMT=_TRAN_AMT account_key=_account_key)); set one; run;

%let duptime=300;

data dedupdate; drop _datetime _TRAN_AMT _account_key ;

set one nobs=numobs; length status \$ 1; mm=_n_+1; status = "K"; if mm le numobs then do; set lookupone point=mm; if (0 <= abs(datetime-_datetime) <= &duptime) and TRAN_AMT=_TRAN_AMT and account_key = _account_key then status='D'; end; run;

Clint Rickards Dual Card Fraud Strategy Manager GE Consumer Finance Shelton, CT (Internal) 8*370-6156 (External) (203) 944-6156

Back to: Top of message | Previous page | Main SAS-L page