>I have a data set that looks like the following:
>ID var1 var2 var3 var4 var5 var6 var7 var8 ..........etc.
>I would like to de-identify unique observations for each ID based on
>the 8 variables.
>The idea is that for every id, if there is an observation such that it
>is unique across all 8 vars then var1 is set to missing. After setting
>var1 to missing I check again to see if it is still unique. If it is
>then I set var2 to missing. I keep doing this until that obs is no
>longer unique.
>After de-identifying all unique observations, the data set should be
>in the same form as the original with the appropriate vars set to
>missing.
>I tried to use proc freq but it does not do what I am trying to do. I
>might need to use arrays but I am not very good with that.
>This is a very big dataset: about 80 variables and millions of
>observations.
>I appreciate any help.
If I read your memo correctly, you only need to do this within
each separate value of ID. So let me ask some questions about this.
How many distinct values of ID are there?
What is the largest number of records for any single ID?
How many of those are likely to be duplicates?
And do we need to do this to 8 variables, or 80, or more?
Are the data sorted by ID already?
If not, are they indexed on ID?
If not that either, are they 'grouped' by ID, so they are not
sorted on ID, but all the values of any ID are clumped together?
If you have K duplicate records, do you want to apply the
'missing' rules to all the records, all but the first record, or what?
Is the order of the records relevant for this?
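To make those questions concrete, here is a minimal sketch of the rule as I read it. It is written in plain Python rather than SAS, purely to pin down the logic, and it rests on assumptions the original post does not confirm: a record is compared only against the other records with the same ID; after a variable is blanked, uniqueness is re-checked on the remaining variables only; the other records are compared using their original values; and only the unique record itself is changed. The function name `deidentify` and the sample data are hypothetical.

```python
from collections import defaultdict


def deidentify(records, id_key, var_names):
    """Return copies of `records` with leading variables set to None
    until each record matches some other record in its ID group.

    Assumption: a match means equality on the variables not yet blanked.
    """
    # Collect record positions by ID, so no prior sort order is required.
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec[id_key]].append(idx)

    result = [dict(rec) for rec in records]  # leave the input untouched

    for members in groups.values():
        def has_match(idx, remaining):
            # True if another record in the group agrees on `remaining`.
            return any(
                other != idx
                and all(records[other][v] == records[idx][v] for v in remaining)
                for other in members
            )

        for idx in members:
            # Blank var1, var2, ... in order until the record stops being unique.
            k = 0
            while k < len(var_names) and not has_match(idx, var_names[k:]):
                k += 1
            for v in var_names[:k]:
                result[idx][v] = None
    return result


if __name__ == "__main__":
    sample = [
        {"ID": 1, "var1": "a", "var2": "x", "var3": "p"},
        {"ID": 1, "var1": "a", "var2": "x", "var3": "p"},
        {"ID": 1, "var1": "b", "var2": "x", "var3": "p"},
        {"ID": 2, "var1": "a", "var2": "x", "var3": "p"},
    ]
    for row in deidentify(sample, "ID", ["var1", "var2", "var3"]):
        print(row)
```

Note that this pairwise comparison is quadratic in the number of records per ID, which is why the size of the largest ID group matters. Under these assumptions, an ID with a single record ends up with every variable blanked.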
Finally, why do you need to do this? It may be that your larger
goal can be met in a different way, if you just explain what is really
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330