LISTSERV at the University of Georgia
Date:   Wed, 4 Apr 2007 18:16:40 -0700
Reply-To:   David L Cassell <davidlcassell@MSN.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   David L Cassell <davidlcassell@MSN.COM>
Subject:   Re: setting unique values to missing
Comments:   To: mabel.pennington@GMAIL.COM
In-Reply-To:   <1175645291.617944.307200@p77g2000hsh.googlegroups.com>
Content-Type:   text/plain; format=flowed

mabel.pennington@GMAIL.COM wrote:
>
>Hi,
>I have a data set that looks like the following:
>ID var1 var2 var3 var4 var5 var6 var7 var8 ..........etc.
>
>I would like to de-identify unique observation for each ID based on
>the 8 variables.
>The idea is that for every id, if there is an observation such that it
>is unique across all 8 vars then var1 is set to missing. After setting
>var1 to missing I check again to see if it is still unique. If it is
>then I set var2 to missing. I keep doing this until that obs is no
>longer unique.
>
>After de-identifying all unique observations, the data set should be
>in the same form as the original with the appropriate vars set to
>missing.
>I tried to use proc freq but it does not do what I am trying to do. I
>might need to use arrays but I am not very good with that.
>
>This is a very big dataset about 80 variables and millions of
>observations.
>I appreciate any help
>Mabel

If I read your memo correctly, you only need to do this within each separate value of ID. So let me ask some questions about this.

How many distinct values of ID are there? What is the largest number of records for any single ID? How many of those are likely to be duplicates? And do we need to do this to 8 variables, or 80, or more?

Are the data sorted by ID already? If not, are they indexed on ID? If neither, are they at least 'grouped' by ID, meaning they are not sorted on ID, but all the records for any one ID are clumped together?

If you have K duplicate records, do you want to apply the 'missing' rules to all the records, all but the first record, or what? Is the order of the records relevant for this?
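
To make those questions concrete, here is a minimal sketch of the procedure as I read it. It is in Python rather than SAS, purely for illustration, and it rests on assumptions the original post does not settle: the function name blank_until_not_unique is made up, 'missing' is represented by None, uniqueness is re-checked once per variable for all records at the same time, duplicate records are left untouched, and record order does not matter. A record that was blanked can therefore end up matching another blanked record.

```python
from collections import Counter


def blank_until_not_unique(rows, id_key, var_keys):
    """Blank variables left to right until each record stops being unique.

    Hypothetical helper: rows is a list of dicts, id_key names the ID
    column, var_keys lists the variables in the order they get blanked.
    Returns modified copies; the input rows are left unchanged.
    """
    out = [dict(r) for r in rows]

    def signature(r):
        # A record is compared only against records with the same ID.
        return (r[id_key], tuple(r[v] for v in var_keys))

    for key in var_keys:
        # Count every (ID, values) combination as the data stand now.
        counts = Counter(signature(r) for r in out)
        unique = [r for r in out if counts[signature(r)] == 1]
        if not unique:
            # Blanking can only merge records, so nothing new
            # becomes unique in later passes.
            break
        for r in unique:
            r[key] = None  # assumption: None stands in for 'missing'
    return out


if __name__ == "__main__":
    # Made-up example data with two variables instead of eight.
    data = [
        {"id": 1, "var1": "a", "var2": "x"},
        {"id": 1, "var1": "b", "var2": "x"},
        {"id": 1, "var1": "c", "var2": "y"},
        {"id": 2, "var1": "a", "var2": "x"},
        {"id": 2, "var1": "a", "var2": "x"},
    ]
    for row in blank_until_not_unique(data, "id", ["var1", "var2"]):
        print(row)
```

In this made-up example the first two records for ID 1 lose var1 and then match each other, the third record for ID 1 loses both variables and is still unique, and the two duplicate records for ID 2 are not touched. Whether any of that is what you want depends on the answers to the questions above.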

Finally, why do you need to do this? It may be that your larger goal can be met in a different way, if you just explain what is really going on...

HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
