LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2006, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 1 Jun 2006 22:59:21 -0400
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: fuzzy deduping
Content-Type: text/plain; charset="iso-8859-1"

Need I say more? If accomplished and creative programmers, Toby and Venky, cannot agree on nuances of the SAS Data step, doesn't it make sense to exhaust possibilities of SAS SQL before resorting to programs that depend on physical sequencing of rows? <civility mode off> I intend to purge Data step, macro, and abject-oriented programming, especially that lame mutant SCL, from the face of the earth. I will prevail! Logic programming forevermore! Resistance is futile! <civility mode on> Where's my prozac? Have a nice day. Your friend, Sig ________________________________

From: owner-sas-l@listserv.uga.edu on behalf of toby dunn Sent: Thu 6/1/2006 6:17 PM To: swovcc@hotmail.com; SAS-L@LISTSERV.UGA.EDU Subject: Re: fuzzy deduping

Funny Venky, when I run your code against mine on my V8.2 Xp system I get different results,

25 Data Need ( Drop = Last: ) ; 26 set One ; 27 by client_id TRAN_CODE TRAN_AMT ; 28 Retain LastTranCode LastTranAmt LastDateTime ; 29 30 LastTranCode = Lag( Tran_Code ) ; 31 LastTranAmt = Lag( Tran_Amt ) ; 32 LastDateTime = Lag( DateTime ) ; 33 34 If First.Client_Id then Status = 'K' ; 35 36 If Not First.Client_Id then do ; 37 38 If ( 0 <= Abs( DateTime - LastDateTime ) <= 300 ) 39 and 40 ( Tran_Code = LastTranCode ) 41 and 42 ( Tran_Amt = LastTranAmt ) 43 then Status = 'D' ; 44 else Status = 'K' ; 45 46 end ; 47 48 run ;

NOTE: There were 4 observations read from the data set WORK.ONE. NOTE: The data set WORK.NEED has 4 observations and 6 variables. NOTE: DATA statement used: real time 0.07 seconds cpu time 0.01 seconds

49 50 51 Data Need2 (/* Drop = Last: */) ; 52 if 0 then set one ; 53 do _n_ = 1 to 6500000 ; 54 LastTranCode = Tran_Code ; 55 LastTranAmt = Tran_Amt ; 56 LastDateTime = DateTime ; 57 end ; 58 set One ; 59 by client_id TRAN_CODE TRAN_AMT ; 60 61 If First.Client_Id then Status = 'K' ; 62 63 If Not First.Client_Id then do ; 64 65 If ( 0 <= Abs( DateTime - LastDateTime ) <= 300 ) 66 and 67 ( Tran_Code = LastTranCode ) 68 and 69 ( Tran_Amt = LastTranAmt ) 70 then Status = 'D' ; 71 else Status = 'K' ; 72 73 end ; 74 75 run ;

NOTE: There were 4 observations read from the data set WORK.ONE. NOTE: The data set WORK.NEED2 has 4 observations and 9 variables. NOTE: DATA statement used: real time 0.40 seconds cpu time 0.40 seconds

Toby Dunn

From: Venky Chakravarthy <swovcc@HOTMAIL.COM> Reply-To: Venky Chakravarthy <swovcc@HOTMAIL.COM> To: SAS-L@LISTSERV.UGA.EDU Subject: Re: fuzzy deduping Date: Thu, 1 Jun 2006 17:26:44 -0400

If speed is of the essence then even a small modification to Toby's code can result in further improvement. LAG is one of the most resource intensive functions. I do not see the need for a RETAIN statement either (not that it would save a lot in run time). I have mimicked the LAG here using some of the data step properties. For more useful comparisons I used a do loop iterating 6.5 million times on the variables of interest to minimize I/O nuisance in metrics.

In the below it is easy to grab the values of the previous observation before the SET overwrites it with the current observation. That in essence replicates the LAG. Remember, the variables coming from a SET, MERGE, UPDATE etc. are automatically retained and are not cleared at the beginning of a new iteration of the data step.

Data Need (/* Drop = Last: */) ; if 0 then set one ; do _n_ = 1 to 6500000 ; LastTranCode = Tran_Code ; LastTranAmt = Tran_Amt ; LastDateTime = DateTime ; end ; set One ; by client_id TRAN_CODE TRAN_AMT ;

If First.Client_Id then Status = 'K' ;

If Not First.Client_Id then do ;

If ( 0 <= Abs( DateTime - LastDateTime ) <= 300 ) and ( Tran_Code = LastTranCode ) and ( Tran_Amt = LastTranAmt ) then Status = 'D' ; else Status = 'K' ;

end ;

run ;

Here is the LOG:

NOTE: There were 4 observations read from the data set WORK.ONE. NOTE: The data set WORK.NEED has 4 observations and 9 variables. NOTE: DATA statement used: real time 1.15 seconds cpu time 1.14 seconds

And this is the LOG from Toby's original code iterating the LAG 6.5 million times:

NOTE: There were 4 observations read from the data set WORK.ONE. NOTE: The data set WORK.NEED has 4 observations and 9 variables. NOTE: DATA statement used: real time 6.82 seconds cpu time 6.81 seconds

Venky Chakravarthy

On Thu, 1 Jun 2006 16:26:58 -0400, Rickards, Clinton (GE Consumer Finance) <clinton.rickards@GE.COM> wrote:

>Duhh... Sometimes I feel so dumb... :) > >Thanks again, Toby. > >-----Original Message----- >From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU]On Behalf Of toby >dunn >Sent: Thursday, June 01, 2006 4:03 PM >To: SAS-L@LISTSERV.UGA.EDU >Subject: Re: fuzzy deduping > > >Clint , > >If they want the observation with the later time rather than the earlier >time then I think sorting the data set where DateTime is descending should >fix the problem. > > > >Toby Dunn > > > > > >From: "Rickards, Clinton (GE Consumer Finance)" ><clinton.rickards@GE.COM> >Reply-To: "Rickards, Clinton (GE Consumer Finance)" ><clinton.rickards@GE.COM> >To: SAS-L@LISTSERV.UGA.EDU >Subject: Re: fuzzy deduping >Date: Thu, 1 Jun 2006 15:50:04 -0400 > >Toby, > >I like this approach: there is only one data set. The only difference is >that the original version sets STATUS=D for the first obs in a duplicate >pair while your version sets STATUD=K for the second in a duplicate pair. >I'm not sure if that is significant in our application but I will pass it >back to the team for consideration. Your version actually seems better to me >since the first transaction is being duplicated by later ones. > >Thanks for the help... > >Clint > >-----Original Message----- >From: toby dunn [mailto:tobydunn@hotmail.com] >Sent: Thursday, June 01, 2006 3:28 PM >To: Rickards, Clinton (GE Consumer Finance); SAS-L@LISTSERV.UGA.EDU >Subject: RE: fuzzy deduping > > >Clinton , > >See if this meets with your needs: > >data one; >attrib client_id length=$20; >attrib datetime format=datetime. informat=datetime.; >attrib tran_code length=$3; >input client_id > account_key > datetime > tran_code > tran_amt; >cards; >A 1234 01jun2006:12:45:00 123 15.67 >A 1234 01jun2006:12:47:00 123 15.67 >A 1234 01jun2006:12:48:14 123 1234.56 >B 5678 01jun2006:12:45:00 123 19.98 >; >run; > > > >proc sort > Data = One ; > by client_id TRAN_CODE TRAN_AMT datetime ; >run; > >Data Need ( Drop = Last: ) ; >set One ; >by client_id TRAN_CODE TRAN_AMT ; >Retain LastTranCode LastTranAmt LastDateTime ; > >LastTranCode = Lag( Tran_Code ) ; >LastTranAmt = Lag( Tran_Amt ) ; >LastDateTime = Lag( DateTime ) ; > >If First.Client_Id then Status = 'K' ; > >If Not First.Client_Id then do ; > > If ( 0 <= Abs( DateTime - LastDateTime ) <= 300 ) > and > ( Tran_Code = LastTranCode ) > and > ( Tran_Amt = LastTranAmt ) > then Status = 'D' ; > else Status = 'K' ; > >end ; > >run ; > > >Proc Print >data = Need ; >run ; > > > >Toby Dunn > > > > > >From: "Rickards, Clinton (GE Consumer Finance)" ><clinton.rickards@GE.COM> >Reply-To: "Rickards, Clinton (GE Consumer Finance)" ><clinton.rickards@GE.COM> >To: SAS-L@LISTSERV.UGA.EDU >Subject: fuzzy deduping >Date: Thu, 1 Jun 2006 09:00:29 -0400 > >All, > >we are looking to identify certain transactions that happen within 5 minutes >of each other. The code below is what has been developed thus far. It works >but strikes me as being slow (5-6 minutes for 6.5M obs for the core data >step DEDUPDATE, not including the sort time). The code sorts the data set >ONE, clones it into LOOKUPONE, and then compares each row to the _n_+1 row >in LOOKUPONE. Although CLIENT_ID is not used in this comparison it is needed >for later processing so I have left it in. Any alternative approaches would >be appreciated. > >In this example, the first 2 obs are considered duplicates and so STATUS=D >for obs 1. Obs 2 and 3 have different amounts, and obs 4 is a different >account, so STATUS=K for obs 2, 3, and 4. > >Thanks, > >Clint > >data one; > attrib client_id length=$20; > attrib datetime format=datetime. informat=datetime.; > attrib tran_code length=$3; > input client_id > account_key > datetime > tran_code > tran_amt; >cards; >A 1234 01jun2006:12:45:00 123 15.67 >A 1234 01jun2006:12:47:00 123 15.67 >A 1234 01jun2006:12:48:14 123 1234.56 >B 5678 01jun2006:12:45:00 123 19.98 >run; > >proc sort; > by client_id account_key datetime TRAN_CODE TRAN_AMT; >run; > >data lookupone (keep=datetime TRAN_AMT account_key > rename=(datetime=_datetime TRAN_AMT=_TRAN_AMT >account_key=_account_key)); > set one; >run; > >%let duptime=300; > >data dedupdate; > drop _datetime _TRAN_AMT _account_key ; > > set one nobs=numobs; > length status $ 1; > mm=_n_+1; > status = "K"; > if mm le numobs then > do; > set lookupone point=mm; > if (0 <= abs(datetime-_datetime) <= &duptime) > and TRAN_AMT=_TRAN_AMT > and account_key = _account_key then > status='D'; > end; >run; > > >Clint Rickards >Dual Card Fraud Strategy Manager >GE Consumer Finance >Shelton, CT >(Internal) 8*370-6156 >(External) (203) 944-6156


Back to: Top of message | Previous page | Main SAS-L page