LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2007, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 20 Jun 2007 06:44:45 -0700
Reply-To:     Paul <paul.vonhippel@CHASE.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Paul <paul.vonhippel@CHASE.COM>
Organization: http://groups.google.com
Subject:      Re: two-variable deduplication problem
Comments: To: sas-l@uga.edu
In-Reply-To:  <1182341702.410230.33940@o61g2000hsh.googlegroups.com>
Content-Type: text/plain; charset="us-ascii"

Again, I don't think using FIRST. and LAST. will solve the problem. First, since the best matches are at the top of the file, I want to preserve the file order during deduplication; I don't want to restore the file order afterward. Even if sorting the file order was okay, FIRST. and LAST. still wouldn't work. Consider the sorted example.

ID1 ID2 1 a 1 b 2 a 2 b

The first line of the desired solution is 1 a which has FIRST.ID1 and FIRST.ID2. The second line of the desired solution is 2 b which has LAST.ID1 and FIRST.ID2. In general, I don't see a way to use FIRST and LAST consistently to get the desired output.

I guess this is a hard problem. I'm glad I posted it, and thankful for your help.

On Jun 20, 8:15 am, sba...@sbbworks.com wrote: > On Jun 20, 6:59 am, Paul <paul.vonhip...@chase.com> wrote: > > > > > > > P.S. I should say that my data set contains just a few thousand > > records, so a dynamic list of unique ID1 and ID2 variables should fit > > comfortably in memory. I just don't know how to do it. > > > On Jun 20, 3:46 am, Paul <paul.vonhip...@chase.com> wrote: > > > > Thanks, David, but the usual solution doesn't work here. Remember that > > > I want to delete an observation if the value of ID1 *or* ID2 has > > > appeared before. The PROC SORT trick -- > > > > proc sort data=adataset nodupkey; > > > by id1 id2; > > > run; > > > > -- deletes an observation if the value-pair (ID1, ID2) has appeared > > > before. Consider this simplified example: > > > > ID1 ID2 > > > 1 a > > > 1 b > > > 2 b > > > 2 a > > > > PROC SORT NODUPKEY does not remove any cases, and neither does > > > deleting cases with FIRST.ID1 and FIRST.ID2. The desired result would > > > be this: > > > > ID1 ID2 > > > 1 a > > > 2 b > > > > I've considered running PROC SORT NODUPKEY twice, first sorting by ID1 > > > and then sorting by ID2. But that deletes all but one case, yielding > > > this: > > > > ID1 ID2 > > > 1 a > > > > So I'm afraid need a different approach. I've never used hash tables, > > > so example code would be helpful. Or maybe someone has a simpler > > > approach that doesn't involve hashing? > > > > Note that the original dataset is not and should not be sorted on > > > (ID1,ID2). Instead, assume that it's sorted by some other variable > > > that indicates the desirability of keeping each case. > > > > Thanks again, > > > Paul > > > > On Jun 19, 8:47 pm, David <davids...@gmail.com> wrote: > > > > > The usual approach is > > > > > proc sort data=adataset nodupkey; > > > > by id1 id2; > > > > run; > > > > > You could do this, as you suggest, by maintaining a hash table with a > > > > list of ID1 and ID2 pairs, and checking this table for each > > > > observation in the dataset, but it's harder to code. > > > > > However blindling deleteting one of the observations might not be as > > > > good as finding some business rules to decide which observation should > > > > be kept: May I suggest you do something like: > > > > > proc sort data=adataset; > > > > by id1 id2; > > > > run; > > > > > data adataset; > > > > set adataset; > > > > by id1 id2; > > > > if not (first.id1 and first.id2) then do; > > > > if /* insert logic here */ then delete; > > > > end; > > > > run; > > > > > On Jun 20, 3:12 pm, Paul <paul.vonhip...@chase.com> wrote: > > > > > > I have a data set with two ID variables ID1 and ID2. I would like to > > > > > step through the data set from top to bottom, removing observations if > > > > > I have already seen their value of ID1 or ID2. What I need, I guess, > > > > > is some way of appending ID values into a growing list that I can > > > > > refer to as I step through the data set. > > > > > > Suggestions welcome.... > > > > > > Thanks! > > > > > Paul- Hide quoted text - > > > - Show quoted text - > > David's example using the FIRST.varname / LAST.varname approach will > most definitely work for your request, however I would add that since > this technique involves sorting the original data, you should add a > DATA step prior to the first sort to assign a temporary variable > ORD=_N_; to get the original observation order, and then after > removing the duplicates with the DATA step approach, re-sort the data > using the ORD variable in the BY statement, allowing you to put the > observations back in their original sequence. You can drop the ORD > variable on the PROC SORT execution to remove it from your dataset. > > Scott Barry > SBBWorks, Inc.- Hide quoted text - > > - Show quoted text -


Back to: Top of message | Previous page | Main SAS-L page