Date: Wed, 20 Jun 2007 06:44:45 -0700
Reply-To: Paul <paul.vonhippel@CHASE.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul <paul.vonhippel@CHASE.COM>
Organization: http://groups.google.com
Subject: Re: two-variable deduplication problem
In-Reply-To: <1182341702.410230.33940@o61g2000hsh.googlegroups.com>
Content-Type: text/plain; charset="us-ascii"
Again, I don't think using FIRST. and LAST. will solve the problem.
First, since the best matches are at the top of the file, I want to
preserve the file order during deduplication; I don't want to restore
the file order afterward. Even if sorting the file order was okay,
FIRST. and LAST. still wouldn't work. Consider the sorted example.
ID1 ID2
1 a
1 b
2 a
2 b
The first line of the desired solution is
1 a
which has FIRST.ID1 and FIRST.ID2. The second line of the desired
solution is
2 b
which has LAST.ID1 and FIRST.ID2. In general, I don't see a way to use
FIRST and LAST consistently to get the desired output.
I guess this is a hard problem. I'm glad I posted it, and thankful for
your help.
On Jun 20, 8:15 am, sba...@sbbworks.com wrote:
> On Jun 20, 6:59 am, Paul <paul.vonhip...@chase.com> wrote:
>
>
>
>
>
> > P.S. I should say that my data set contains just a few thousand
> > records, so a dynamic list of unique ID1 and ID2 variables should fit
> > comfortably in memory. I just don't know how to do it.
>
> > On Jun 20, 3:46 am, Paul <paul.vonhip...@chase.com> wrote:
>
> > > Thanks, David, but the usual solution doesn't work here. Remember that
> > > I want to delete an observation if the value of ID1 *or* ID2 has
> > > appeared before. The PROC SORT trick --
>
> > > proc sort data=adataset nodupkey;
> > > by id1 id2;
> > > run;
>
> > > -- deletes an observation if the value-pair (ID1, ID2) has appeared
> > > before. Consider this simplified example:
>
> > > ID1 ID2
> > > 1 a
> > > 1 b
> > > 2 b
> > > 2 a
>
> > > PROC SORT NODUPKEY does not remove any cases, and neither does
> > > deleting cases with FIRST.ID1 and FIRST.ID2. The desired result would
> > > be this:
>
> > > ID1 ID2
> > > 1 a
> > > 2 b
>
> > > I've considered running PROC SORT NODUPKEY twice, first sorting by ID1
> > > and then sorting by ID2. But that deletes all but one case, yielding
> > > this:
>
> > > ID1 ID2
> > > 1 a
>
> > > So I'm afraid need a different approach. I've never used hash tables,
> > > so example code would be helpful. Or maybe someone has a simpler
> > > approach that doesn't involve hashing?
>
> > > Note that the original dataset is not and should not be sorted on
> > > (ID1,ID2). Instead, assume that it's sorted by some other variable
> > > that indicates the desirability of keeping each case.
>
> > > Thanks again,
> > > Paul
>
> > > On Jun 19, 8:47 pm, David <davids...@gmail.com> wrote:
>
> > > > The usual approach is
>
> > > > proc sort data=adataset nodupkey;
> > > > by id1 id2;
> > > > run;
>
> > > > You could do this, as you suggest, by maintaining a hash table with a
> > > > list of ID1 and ID2 pairs, and checking this table for each
> > > > observation in the dataset, but it's harder to code.
>
> > > > However blindling deleteting one of the observations might not be as
> > > > good as finding some business rules to decide which observation should
> > > > be kept: May I suggest you do something like:
>
> > > > proc sort data=adataset;
> > > > by id1 id2;
> > > > run;
>
> > > > data adataset;
> > > > set adataset;
> > > > by id1 id2;
> > > > if not (first.id1 and first.id2) then do;
> > > > if /* insert logic here */ then delete;
> > > > end;
> > > > run;
>
> > > > On Jun 20, 3:12 pm, Paul <paul.vonhip...@chase.com> wrote:
>
> > > > > I have a data set with two ID variables ID1 and ID2. I would like to
> > > > > step through the data set from top to bottom, removing observations if
> > > > > I have already seen their value of ID1 or ID2. What I need, I guess,
> > > > > is some way of appending ID values into a growing list that I can
> > > > > refer to as I step through the data set.
>
> > > > > Suggestions welcome....
>
> > > > > Thanks!
> > > > > Paul- Hide quoted text -
>
> > - Show quoted text -
>
> David's example using the FIRST.varname / LAST.varname approach will
> most definitely work for your request, however I would add that since
> this technique involves sorting the original data, you should add a
> DATA step prior to the first sort to assign a temporary variable
> ORD=_N_; to get the original observation order, and then after
> removing the duplicates with the DATA step approach, re-sort the data
> using the ORD variable in the BY statement, allowing you to put the
> observations back in their original sequence. You can drop the ORD
> variable on the PROC SORT execution to remove it from your dataset.
>
> Scott Barry
> SBBWorks, Inc.- Hide quoted text -
>
> - Show quoted text -
|