Date: Tue, 10 Aug 2004 13:41:57 -0400
Reply-To: Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Subject: Re: Comparison across rows
Here's how I would do it. Start with expanded test data:
data sample;
input id name $ price purch_dt;
format purch_dt mmddyy10.;
datalines;
1 Cup 10 16000
1 Cup 10 15980
1 Cup 10 16000
1 Pencil 2 16002
2 Pencil 2 16002
2 Pencil 2 15990
2 Pencil 2 15000
2 Pencil 3 16000
2 Pencil 1 16005
3 Book 6 15999
3 Book 4 16000
3 Book 5 16000
3 Book 4 15900
4 Widget 99 17000
4 Widget 99 17020
4 Widget 99 17040
4 Widget 99 17101
5 Widget 99 17000
5 Widget 99 17040
5 Widget 99 17100
;
I added records to test the boundary between 60 and 61 days, and also to
check results for sequences of three dates when the first and last are
more/less than 60 days apart.
Here's an incidental step just to display the intervals:
data _null_;
set sample;
by id name price notsorted;
interval = purch_dt - lag(purch_dt);
if first.price then interval = .;
put (_all_)(=);
run;
The log shows
id=1 name=Cup price=10 purch_dt=10/22/2003 interval=.
id=1 name=Cup price=10 purch_dt=10/02/2003 interval=-20
id=1 name=Cup price=10 purch_dt=10/22/2003 interval=20
id=1 name=Pencil price=2 purch_dt=10/24/2003 interval=.
id=2 name=Pencil price=2 purch_dt=10/24/2003 interval=.
id=2 name=Pencil price=2 purch_dt=10/12/2003 interval=-12
id=2 name=Pencil price=2 purch_dt=01/25/2001 interval=-990
id=2 name=Pencil price=3 purch_dt=10/22/2003 interval=.
id=2 name=Pencil price=1 purch_dt=10/27/2003 interval=.
id=3 name=Book price=6 purch_dt=10/21/2003 interval=.
id=3 name=Book price=4 purch_dt=10/22/2003 interval=.
id=3 name=Book price=5 purch_dt=10/22/2003 interval=.
id=3 name=Book price=4 purch_dt=07/14/2003 interval=.
id=4 name=Widget price=99 purch_dt=07/18/2006 interval=.
id=4 name=Widget price=99 purch_dt=08/07/2006 interval=20
id=4 name=Widget price=99 purch_dt=08/27/2006 interval=20
id=4 name=Widget price=99 purch_dt=10/27/2006 interval=61
id=5 name=Widget price=99 purch_dt=07/18/2006 interval=.
id=5 name=Widget price=99 purch_dt=08/27/2006 interval=40
id=5 name=Widget price=99 purch_dt=10/26/2006 interval=60
The solution is in steps. Some could be rolled together, but I think it's
easier to follow this way.
proc sql;
First, collapse duplicate records and retain the information in a count
column:
create view one as
select sample.*, count(*) as many
from sample
group by id, name, price, purch_dt;
Next, do a reflexive join to find all row pairs with matching keys and no
more than 60 days of separation:
create view two as
select ref.purch_dt as refdate, one.*
from one as ref inner join one
on ref.id = one.id and
ref.name = one.name and
ref.price = one.price
where one.purch_dt>=ref.purch_dt and
one.purch_dt<=ref.purch_dt+60;
Then subset and filter out the rows which do not meet the criteria of being
part of a multiple:
create view three as
select * from two
group by id, name, price, refdate
having sum(many)>1;
Finally, get rid of duplicates:
create table outcome as
select distinct id, name, price, purch_dt, many
from three
order by id, purch_dt;
quit;
Results:
Obs id name price purch_dt many
1 1 Cup 10 10/02/2003 1
2 1 Cup 10 10/22/2003 2
3 2 Pencil 2 10/12/2003 1
4 2 Pencil 2 10/24/2003 1
5 4 Widget 99 07/18/2006 1
6 4 Widget 99 08/07/2006 1
7 4 Widget 99 08/27/2006 1
8 5 Widget 99 07/18/2006 1
9 5 Widget 99 08/27/2006 1
10 5 Widget 99 10/26/2006 1
On Fri, 6 Aug 2004 18:04:38 -0400, LL Cool Mars <klasikjis@YAHOO.COM> wrote:
>Hi SAS maniacs,
>
>Here is the light problem:
>
>data sample;
>input id name $ price purch_dt;
>format purch_dt mmddyy10.;
>datalines;
>1 Cup 10 16000
>1 Cup 10 15980
>1 Cup 10 16000
>1 Pencil 2 16002
>2 Pencil 2 16002
>2 Pencil 2 15990
>2 Pencil 2 15000
>2 Pencil 3 16000
>2 Pencil 1 16005
>3 Book 6 15999
>3 Book 4 16000
>3 Book 5 16000
>3 Book 4 15900
>;
>run;
>
>proc sort data = sample;
>by id name price purch_dt;
>quit;
>
>would like to get list of items where id, name, and price are equal AND
>pruch_dt within 60 days. I am looking for the solution in one step with
>outcome looking like this:
>1 Cup 10 16000
>1 Cup 10 15980
>1 Cup 10 16000
>2 Pencil 2 16002
>2 Pencil 2 15990
>
>
>if anybody is interested to take a look at what I came up with here it is,
>but I believe this is clumsy:
>
>data outcome;
>set sample;
> if id = LAG(id) and
> name = LAG(name) and
> price = LAG(price) and
> ABS(INTCK('day',purch_dt,LAG(purch_dt)))<60 then output;
>run;
>
>proc sort data = outcome;
>by id name price purch_dt;
>quit;
>
>
>data merged;
>merge outcome (in=in1) sample(in=in2 RENAME =(purch_dt=orig_purch_dt));
>by id name price;
>if in1 and in2 and ABS(INTCK('day',purch_dt, orig_purch_dt))<60 then output
>merged;
>run;
>
>Thanks much,
>LL Cool Mars