Date: Mon, 5 May 2008 09:04:54 -0700
Reply-To: MBWD <xmbwdx@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: MBWD <xmbwdx@GMAIL.COM>
Organization: http://groups.google.com
Subject: Re: Deduplicate tricky "Cancel" data
Content-Type: text/plain; charset=ISO-8859-1
Hi Mark. You are correct -- lines 10-13 were entered incorrectly and
should have been ($21,250.00). Sorry about that. I just got back,
and I am trying your solution. Thanks. - M.
On May 3, 6:39 pm, mkei...@WHARTON.UPENN.EDU ("Keintz, H. Mark")
wrote:
> MBWD, I think you didn't type in all the data values as you intend.
> Shouldn't lines 10 through 13 end up with ($21,250.00), instead of the
> $21,250.00 you have?
>
> If so this program, using a hash table, won't be effected by the order
> of cancels and their corresponding regular transactions, and it will
> take only one pass through the data.
>
> It works by using the hash table to track each transaction profile
> (using all vars except CANCEL), adds 1 to the net count ( _NTRNS) for
> regular trans and subtracts 1 for cancels.
>
> At the end of reading in one person's transaction, the net count
> controls which profiles, and how many, are output.
>
> The only requirement is that all the records for a given person be
> together in your original data set. (Or you could do first.account and
> last.account instead, in which case records for each account should be
> together).
>
> Regards,
> Mark
>
> data want (drop=_:);
> set have;
> by acct name ;
>
> if _n_=1 then do;
> declare hash trns;
> declare hiter hit;
> end;
>
> ** For each person, set up new instance of a hash table. **
> ** Note it has _NTRNS, but not CANCEL **;
> if first.name then do;
> trns = _new_ hash(hashexp:6,ordered:'a');
> trns.definekey ('acct','name','edate','stock','b'
> ,'qty','prc','trans');
> trns.definedata('acct','name','b','qty','stock'
> ,'prc','edate','trans','_ntrns');
> trns.definedone();
> hit= _new_ hiter('trns');
> end;
>
> ** For CANCEL='X', make QTY & TRANS for matching profile **;
> select (cancel);
> when (' ') _plus_minus=+1;
> when ('X') do;
> _plus_minus=-1;
> qty=-1*qty; * Match regular transaction profile;
> trans=-1*trans; * Match regular transaction profile;
> end;
> end;
>
> _rc=trns.check(); ** See if profile already exists ... **;
> if _rc=0 then do; ** If yes ... **;
> _rc=trns.find();
> _ntrns=_ntrns + _plus_minus;
> end;
> else _ntrns=_plus_minus;
>
> _rc=trns.replace(); ** Back into hash, + updated _NTRNS **;
>
> ** At the end of the person, write each profile _NTRNS times **;
> if last.name then do;
> _rc=hit.first();
> do while (_rc=0);
> if _ntrns ^=0 then do _i=1 to _ntrns; output want; end;
> _rc=hit.next();
> end;
> _rc=trns.delete();
> _rc=hit.delete();
> end;
> run;
>
>
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SA...@LISTSERV.UGA.EDU] On Behalf Of
>
> MBWD
> Sent: Friday, May 02, 2008 10:50 PM
> To: SA...@LISTSERV.UGA.EDU
> Subject: Re: Deduplicate tricky "Cancel" data
>
> Thanks so much for your help. Unfortunately, the data example I gave
> does not fully illustrate the problem. Your solution works when
> there is a complete match between the date of two transactions: the
> cancel and its counterpart. But after using your solution it turns
> out that there are sometimes MANY iterations of pairs -- not just
> two. Moreover, sometimes the data wash out after the cancels and
> counterpart are paired, and sometimes not. As an example, I modified
> the data below to give the idea. Finally, the order of the cancel and
> its counterpart are often off (e.g., IBM):
>
> ABC1234 Fred S -175 XOM $89.00 12/1/2007 ($15,575.00) X
> ABC1234 Fred S -175 XOM $89.00 1/5/2008 ($15,575.00)
> ABC1234 Fred S X 175 XOM $89.00 1/5/2008 $15,575.00
> ABC1234 Fred S -175 XOM $89.00 1/5/2008 ($15,575.00)
> ABC1234 Fred S X 175 XOM $89.00 1/5/2008 $15,575.00
> ABC1234 Fred S -175 XOM $89.00 1/8/2008 ($15,575.00)
> ABC1234 Fred S X 175 XOM $89.00 1/8/2008 $15,575.00
> ABC1234 Fred S -175 XOM $89.00 1/8/2008 ($15,575.00) X
> ABC1234 Fred B 105 ATT $25.00 12/1/2007 $2,625.00 X
> ABC1234 Fred B X -850 IBM $25.00 3/1/2007 $21,250.00
> ABC1234 Fred B X -850 IBM $25.00 3/1/2007 $21,250.00
> ABC1234 Fred B X -850 IBM $25.00 3/1/2007 $21,250.00
> ABC1234 Fred B X -850 IBM $25.00 3/1/2007 $21,250.00
> ABC1234 Fred B 850 IBM $25.00 3/1/2007 $21,250.00
> ABC1234 Fred B 850 IBM $25.00 3/1/2007 $21,250.00
> ABC1234 Fred B 850 IBM $25.00 3/1/2007 $21,250.00
> ABC1234 Fred B 850 IBM $25.00 3/1/2007 $21,250.00
>
> As you can see, there are sometimes 2 complete pairs of cancels/
> counterparts -- e.g., 1.5.08 XOM trades.
>
> There are also times where a pair washes out (cancel/counterpart), and
> leaves one sale -- e.g., 1.8.08 XOM trades.
>
> I put an X next to the transactions in the new data set that should
> remain after the wash. But I have no idea how to get this result.
>
> Thank you again for all help. - M.
>
> On May 1, 6:18 pm, ya.hu...@AMYLIN.COM (Ya Huang) wrote:
> > Assumption is that canceled and the counterpart are on the same
> > date for same account, name and stock:
>
> > data have;
> > length net $15;
> > input Account $ Name $ B $ Cancel $19 Quantity Stock $ Price $
> > eDate :mmddyy10. Net $;
> > cards;
> > ABC1234 Fred S -175 XOM $89.00 12/1/2007 ($15,575.00)
> > ABC1234 Fred S -175 XOM $89.00 1/5/2008 ($15,575.00)
> > ABC1234 Fred S X 175 XOM $89.00 1/5/2008 $15,575.00
> > ABC1234 Fred S -175 XOM $89.00 1/8/2008 ($15,575.00)
> > ABC1234 Fred S X 175 XOM $89.00 1/8/2008 $15,575.00
> > ABC1234 Fred B 105 ATT $25.00 12/1/2007 $2,625.00
> > ;
>
> > proc sql;
> > create table need as
> > select *,case when count(edate)=2 and
> > max(quantity)=-min(quantity)
> > then 1 else 0 end as del
> > from have
> > group by account,name,stock,edate
> > ;
>
> > proc print;
> > format edate date.;
> > run;
>
> > Account Cancel Quantity Stock eDate net
> del
>
> > ABC1234 105 ATT 01DEC07 $2,625.00
> 0
> > ABC1234 -175 XOM 01DEC07 ($15,575.00)
> 0
> > ABC1234 X 175 XOM 05JAN08 $15,575.00
> 1
> > ABC1234 -175 XOM 05JAN08 ($15,575.00)
> 1
> > ABC1234 X 175 XOM 08JAN08 $15,575.00
> 1
> > ABC1234 -175 XOM 08JAN08 ($15,575.00)
> 1
>
> > On Thu, 1 May 2008 15:09:45 -0700, MBWD <xmb...@GMAIL.COM> wrote:
> > >I can't figure out how to get the below data to dedupe (sort of) by
> > >deleting the two canceled observations AND their two counterparts (or
> > >alternatively, just creating a binary variably 1,0 for those
> > >observations that match the cancel row).
>
> > >This would delete all but rows 1 and 6. Rows 2 and 3 are
> counterparts
> > >in a cancel transaction. So are rows 3 and 4.
>
> > >Account Name B Cancel Quantity Stock Price Entry Date Net
> > >ABC1234 Fred S -175 XOM $89.00 12/1/2007
> > ($15,575.00)
> > >ABC1234 Fred S -175 XOM $89.00 1/5/2008
> > ($15,575.00)
> > >ABC1234 Fred S X 175 XOM $89.00 1/5/2008
> > $15,575.00
> > >ABC1234 Fred S -175 XOM $89.00 1/8/2008
> > ($15,575.00)
> > >ABC1234 Fred S X 175 XOM $89.00 1/8/2008
> > $15,575.00
> > >ABC1234 Fred B 105 ATT $25.00 12/1/2007
> > $2,625.00
>
> > >In addition, it would be very helpful if someone know how to get the
> > >PROC SORT DUPKEY method to NOT delete any observations, but rather
> > >just mark as "1" and "0" those observations that it would have
> > >deleted.
>
> > >Thanks for any and all help!!
>
> On May 1, 6:18 pm, ya.hu...@AMYLIN.COM (Ya Huang) wrote:
> > Assumption is that canceled and the counterpart are on the same
> > date for same account, name and stock:
>
> > data have;
> > length net $15;
> > input Account $ Name $ B $ Cancel $19 Quantity Stock $ Price $
> > eDate :mmddyy10. Net $;
> > cards;
> > ABC1234 Fred S -175 XOM $89.00 12/1/2007 ($15,575.00)
> > ABC1234 Fred S -175 XOM $89.00 1/5/2008 ($15,575.00)
> > ABC1234 Fred S X 175 XOM $89.00 1/5/2008 $15,575.00
> > ABC1234 Fred S -175 XOM $89.00 1/8/2008 ($15,575.00)
> > ABC1234 Fred S X 175 XOM $89.00 1/8/2008 $15,575.00
> > ABC1234 Fred B 105 ATT $25.00 12/1/2007 $2,625.00
> > ;
>
> > proc sql;
> > create table need as
> > select *,case when count(edate)=2 and
> > max(quantity)=-min(quantity)
> > then 1 else 0 end as del
> > from have
> > group by account,name,stock,edate
> > ;
>
> > proc print;
> > format edate date.;
> > run;
>
> > Account Cancel Quantity Stock eDate net
> del
>
> > ABC1234 105 ATT 01DEC07 $2,625.00
> 0
> > ABC1234 -175 XOM 01DEC07 ($15,575.00)
> 0
> > ABC1234 X 175 XOM 05JAN08 $15,575.00
> 1
> > ABC1234 -175 XOM 05JAN08 ($15,575.00)
> 1
> > ABC1234 X 175 XOM 08JAN08 $15,575.00
> 1
> > ABC1234 -175 XOM 08JAN08 ($15,575.00)
> 1
>
> > On Thu, 1 May 2008 15:09:45 -0700, MBWD <xmb...@GMAIL.COM> wrote:
> > >I can't figure out how to get the below data to dedupe (sort of) by
> > >deleting the two canceled observations AND their two counterparts (or
> > >alternatively, just creating a binary variably 1,0 for those
> > >observations that match the cancel row).
>
> > >This would delete all but rows 1 and 6. Rows 2 and 3 are
> counterparts
> > >in a cancel transaction. So are rows 3 and 4.
>
> > >Account Name B Cancel Quantity Stock Price Entry Date Net
> > >ABC1234 Fred S -175 XOM $89.00 12/1/2007
> > ($15,575.00)
> > >ABC1234 Fred S -175 XOM $89.00 1/5/2008
> > ($15,575.00)
> > >ABC1234 Fred S X 175 XOM $89.00 1/5/2008
> > $15,575.00
> > >ABC1234 Fred S -175 XOM $89.00 1/8/2008
> > ($15,575.00)
> > >ABC1234 Fred S X 175 XOM $89.00 1/8/2008
> > $15,575.00
> > >ABC1234 Fred B 105 ATT $25.00 12/1/2007
> > $2,625.00
>
> > >In addition, it would be very helpful if someone know how to get the
> > >PROC SORT DUPKEY method to NOT delete any observations, but rather
> > >just mark as "1" and "0" those observations that it would have
> > >deleted.
>
> > >Thanks for any and all help!!- Hide quoted text -
>
> - Show quoted text -
|