| Date: | Thu, 24 Apr 2003 11:07:12 -0400 |
| Reply-To: | "Gerstle, John" <yzg9@CDC.GOV> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | "Gerstle, John" <yzg9@CDC.GOV> |
| Subject: | Re: Non-sequential unique numbers |
|
| Content-Type: | text/plain |
|---|
Mike,
I understand the method you were suggesting and agree that it would work. I
was just thinking that, given a large enough sample, even an event with a
very small probability of occurring has a decent chance of being observed. I'm
extrapolating loosely from large-sample statistics: with a big enough
sample, you'll almost always find 'significant' results, regardless of whether
they are truly meaningful findings.
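
For what it's worth, here is a rough, untested sketch of the kind of
calculation I have in mind, using the usual birthday-problem approximation
P(dup) = 1 - exp(-n(n-1)/(2k)). The value of k is purely my assumption (IDs
drawn independently from the 10**8 possible 8-digit numbers); it is not taken
from your program, and the variable names are just for illustration.

```sas
/* Sketch only: approximate chance of at least one duplicate when   */
/* n values are drawn independently from k equally likely values.   */
DATA _NULL_;
  n = 90000;      /* number of records in Ralph's file              */
  k = 10**8;      /* ASSUMPTION: IDs limited to 8 decimal digits    */
  p_dup = 1 - EXP(-n * (n - 1) / (2 * k));
  PUT p_dup= 8.6;
RUN;
```

With these values the exponent is about -40.5, so the probability of at least
one duplicate is indistinguishable from 1. Change k to see how quickly the
risk drops as the pool of possible values grows.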
John Gerstle
CDC Information Technological Support Contract (CITS)
Biostatistician
>> -----Original Message-----
>> From: Mike Rhoads [mailto:RHOADSM1@WESTAT.com]
>> Sent: Thursday, April 24, 2003 10:59 AM
>> To: 'Gerstle, John'; SAS-L@LISTSERV.UGA.EDU
>> Subject: RE: Non-sequential unique numbers
>>
>> John,
>>
>> Given that the random numbers generated are floating point, I'm not sure
>> what the probability of duplication is. Note that I was not using the
>> random numbers themselves as the ID, but was just sorting by the random
>> number and then assigning the record number of the re-sorted file as the
>> ID. For that approach, it doesn't matter whether there are duplicates
>> (although it turned out that I had misunderstood what Ralph was really
>> asking for).
>>
>> Mike Rhoads
>> Westat
>> RhoadsM1@Westat.com
>>
>> -----Original Message-----
>> From: Gerstle, John [mailto:yzg9@cdc.gov]
>> Sent: Thursday, April 24, 2003 9:33 AM
>> To: Mike Rhoads; SAS-L@LISTSERV.UGA.EDU
>> Subject: RE: Non-sequential unique numbers
>>
>>
>> Mike,
>>
>> Wouldn't you agree, though, that even if you've created 90,000 random
>> values, each with equal probability, you have a good probability of
>> creating at least one pair of duplicate ID numbers? Seems you'd want to
>> create a list of 90,000 unique random numbers and then assign each,
>> without replacement, to each of the records in the dataset.
>>
>> Just a thought...
>>
>> John Gerstle
>> CDC Information Technological Support Contract (CITS)
>> Biostatistician
>>
>>
>> >> -----Original Message-----
>> >> From: Mike Rhoads [mailto:RHOADSM1@WESTAT.COM]
>> >> Sent: Wednesday, April 23, 2003 6:14 PM
>> >> To: SAS-L@LISTSERV.UGA.EDU
>> >> Subject: Re: Non-sequential unique numbers
>> >>
>> >> Ralph,
>> >>
>> >> If by "non-sequential" you mean that it "loses" the original order of
>> >> the records, I would just assign a random number to each record in a
>> >> DATA step, sort the output by the random number, then read the sorted
>> >> file back in and assign the record number as the ID. Something like
>> >> (untested):
>> >>
>> >> DATA Temp;
>> >> SET OriginalFile;
>> >> RandomNumber = RANUNI(12345);
>> >> RUN;
>> >>
>> >> PROC SORT DATA=Temp;
>> >> BY RandomNumber;
>> >> RUN;
>> >>
>> >> DATA Final;
>> >> SET Temp;
>> >> IDVAR = _N_;
>> >> DROP RandomNumber; * Or don't ...;
>> >> RUN;
>> >>
>> >> Mike Rhoads
>> >> Westat
>> >> RhoadsM1@Westat.com
>> >>
>> >> -----Original Message-----
>> >> From: Ralph [mailto:rpk0524@YAHOO.COM]
>> >> Sent: Wednesday, April 23, 2003 5:15 PM
>> >> To: SAS-L@LISTSERV.UGA.EDU
>> >> Subject: Non-sequential unique numbers
>> >>
>> >>
>> >> I need to create a unique identifier for 90,000 records that is
>> >> non-sequential. So far, the best solution I have come up with is:
>> >>
>> >> a = ranuni(345)+ (ranuni(123)+ int(time()));
>> >> b = int(reverse(ar_seqnum))*a;
>> >>
>> >> Using b as my identifier, I can (most times) come up with unique
>> >> numbers, but the real challenge is this number can be no longer than 8
>> >> bytes. Using this code, my b(s) are 12 bytes. Using SUBSTR of b for
>> >> a length of 8, I get major dups. Can anyone help?
|