Date: Mon, 18 Jan 2010 09:35:00 -0800
Reply-To: Ai Hua Wang <aihuawang@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ai Hua Wang <aihuawang@YAHOO.COM>
Subject: Re: assign a unique random integer to each unique id - thank you
In-Reply-To: <2fc7f3341001180859y5abd0a80x960f4147292f9e9d@mail.gmail.com>
Content-Type: text/plain; charset=iso-8859-1
Hi Everybody who helped me:
Please accept my big thank you.
I have tried every solution. They all work very well. You guys are very good at SAS.
Thank you very much again.
Aihua
--- On Mon, 1/18/10, Muthia Kachirayan <muthia.kachirayan@GMAIL.COM> wrote:
From: Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Subject: Re: assign a unique random integer to each unique id
To: SAS-L@LISTSERV.UGA.EDU
Received: Monday, January 18, 2010, 4:59 PM
Aihua,
Your clarification to Dan makes your need understood. My earlier array
solution permutes the observation numbers in random order to get the
urands. There is some chance that both of them may be same. The urands will
suggest the similarity between observation numbers and urands. This can be
rectified by choosing a multiplier to the number of observations to get a
random sequence of urands.
The test data set is:
data have;
do key = 1 to 10;
sat = put(key * 100 + key, z4.);
output;
end;
run;
The data set, HAVE, has KEY as the primary key in ascending order with some
SAT data.
key sat
1 0101
2 0202
3 0303
4 0404
5 0505
6 0606
7 0707
8 0808
9 0909
10 1010
The following program creates new UNIQUE IDs(uid) based on the number of
observations in the data set and note that the KEY is not used in the
process. Let us use a number, NUM_FOLD, to multiply the observation number,
say 1000.
%let num_fold = 1000;
data need;
if _n_ = 1 then do;
declare hash h(hashexp:16);
h.definekey('uid');
h.definedata('RID','key','uid');
h.definedone();
end;
do RID = 1 to num;
set have nobs = num ;
uid = ceil(ranuni(123) * num * &num_fold);
do rc = h.check() by 0 while (rc = 0);
uid = ceil(ranuni(123) * num * &num_fold); ** Try another random
number ;
rc = h.check();
end;
h.add();
output;
end;
h.output(dataset:'LOOKUP');
stop;
drop rc key;
run;
proc print data = need;
run;
The data set, NEED, gives the UID for the corresponding Record ID(RID) and
the KEY is dropped to keep the secrecy of data set.
RID sat uid
1 0101 7504
2 0202 3210
3 0303 1784
4 0404 9061
5 0505 3572
6 0606 2212
7 0707 7865
8 0808 3981
9 0909 1247
10 1010 1877
My earlier array solution is a special case when NUM_FOLD = 1.
This program also gives another data set, LOOKUP, giving links to RID/KEY.
When it is sorted by UID, the reverse process of getting RID/KEY becomes
easy.
There is another possiblity of creating UIDs based on the KEYs and not based
on the observation number. The following program does it based on MOD()
function. However, KEYs, have to nemeric but this restriction can be removed
in some circumstances when the character-type KEYs can be changed to numeric
by the use of functions like, PIBw. For further details refer to Dorfman(Key
indexing, Bitmapping and Hashing).
data need;
if _n_ = 1 then do;
declare hash h(hashexp:16);
h.definekey('uid');
h.definedata('RID','key','uid');
h.definedone();
end;
do RID = 1 to num;
set have nobs = num ;
uid = mod(key, num) + 1;
do rc = h.check() by 0 while (rc = 0);
** Try another random number ;
uid = uid + 1;
if uid > num then uid = 1;
rc = h.check();
end;
h.add();
output;
end;
h.output(dataset:'LOOKUP');
stop;
drop rc key;
run;
Do you find this program useful to solve your issue ? Your feedback will be
useful to SAS-Lers to give alternate solutions.
Kind regards,
Muthia Kachirayan
On Sat, Jan 16, 2010 at 5:30 PM, Ai Hua Wang <aihuawang@yahoo.com> wrote:
> Hi Dan:
>
> Thank you very much for your thoughtful follow up. Please see my answers
> below.
>
> Why does your multiplier need to be proportional to dataset size?
> That is just my thought after I tried. Because when I use the smaller
> multiplier I got much more duplicates. When I increase it I got less.
> Eventually I found that it should be at least propotional to the size of the
> data set.
>
> Why do you want random integers assigned to your data?
> I need to use the assigned random integers as the unique id to allow the
> data users to identify each unique record. I thought it is better to use the
> integer than the decimal numbers.
>
> And why do they need to be unique?
> See above description and plus:
> It is used as the replacement of the sensitive information (unique id) for
> the privacy and confidentialiy concern.
>
> I hope this is helpful when you provide more insightful answers.
>
> Best Regards,
> Aihua
>
>
__________________________________________________________________
Yahoo! Canada Toolbar: Search from anywhere on the web, and bookmark your favourite sites. Download it now
http://ca.toolbar.yahoo.com.