LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2004, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Mon, 14 Jun 2004 15:19:56 -0400
Reply-To:   sashole@bellsouth.net
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Paul M. Dorfman" <sashole@BELLSOUTH.NET>
Organization:   Sashole of Florida
Subject:   Re: Need to generate random words/letter configurations in SAS
Comments:   To: "Gerstle, John" <YZG9@CDC.GOV>
In-Reply-To:   <148E8334465C024399DD7B14E61D7D15515432@m-nchstp-2.nchstp.cdc.gov>
Content-Type:   text/plain; charset="US-ASCII"

John,

I believe that Richard meant hashing as a way of looking up the list of words having already been output as an alternative to his (very clever) MODIFY/OUTPUT scheme, where the list of words is checked implicitly via disk search. Whatever the search structure is, it needs to be able to lend tiself to searching and insertion in at least O(log(n)) time, otherwise as the number of words generated grow, the increased time required for the insertions and searches can potentially bring the word-generating process to standstill. Hash table have the property of being both inserted and searched at O(1), i.e. constant time, which actually cannot be beat.

If you have no access to Version 9, your only hashing option is hand-coding; luckily, with the 5000 words to search, the simplest linear-probing algorithm will work. The hash table size below is picked up as a prime about 1.5*5000:

87 %let max_word_len = 15 ;

88 %let n_unique_words = 5000 ;

89 %let hash_size = 14293 ; *prime ~ 1.5*&n_unique_words ;

90

91 data words ( keep = word ) ;

92 array ww [0 : &hash_size] $ &max_word_len _temporary_ ;

93

94 length word $ &max_word_len rnd_letter $ 1 ;

95

96 retain pool 'abcdefghijklmnopqrstuvwxyz' n_unique_words 0 ;

97

98 pool_len = lengthn (pool) ;

99

100 do until ( n_unique_words = &n_unique_words ) ;

101 do word_len = 1 to ceil (ranuni (1) * &max_word_len) ;

102 rnd_letter = substr (pool, ceil (ranuni (1) * pool_len)) ;

103 substr (word, word_len, 1) = rnd_letter ;

104 end ;

105

106 do h = mod (input (word, pib6.), &hash_size) by 3 until ( ww [h] = word ) ; 107 if h => &hash_size then h +- &hash_size ;

108 if ww [h] = '' then do ;

109 n_unique_words ++ 1 ;

110 output ;

111 ww [h] = word ;

112 end ;

113 end ;

114 end ;

115 run ;

NOTE: The data set WORK.WORDS has 5000 observations and 1 variables.

NOTE: DATA statement used (Total process time):

real time 0.06 seconds

user cpu time 0.04 seconds

system cpu time 0.01 seconds

Memory 493k

Above, the hash table is represented by the array WW. The workings of the code at the lines 106-113 have been explained ad nauseam (see, for instance, http://www2.sas.com/proceedings/sugi26/p008-26.pdf).

Now if Version 9 is available for you, you need not rummage for the algorithm because it is canned:

117 data words ( keep = word ) ; 118 dcl hash hw ( ) ; 119 hw.definekey ('word') ; 120 hw.definedone ( ) ; 121 122 length word $ &max_word_len rnd_letter $ 1 ; 123 124 retain pool 'abcdefghijklmnopqrstuvwxyz' n_unique_words 0 ; 125 126 pool_len = lengthn (pool) ; 127 128 do until ( n_unique_words = &n_unique_words ) ; 129 do word_len = 1 to ceil (ranuni (1) * &max_word_len) ; 130 rnd_letter = substr (pool, ceil (ranuni (1) * pool_len)) ; 131 substr (word, word_len, 1) = rnd_letter ; 132 end ; 133 134 if hw.add () ne 0 then continue ; 135 n_unique_words ++ 1 ; 136 output ; 137 end ; 138 run ;

NOTE: The data set WORK.WORDS has 5000 observations and 1 variables. NOTE: DATA statement used (Total process time): real time 0.06 seconds user cpu time 0.04 seconds system cpu time 0.01 seconds Memory 475k

The HW.ADD() method returns a non-zero value if the word having just been concocted is already in the table; else it inserts the word in the table for future referencing in the same manner, and the word is output. Once the number of such words is sufficient, the process stops.

Interestingly, this is a situation where an approach many could perceive as simpler one will work - for the simple reason that such a method has a negligible probability of generating even a single duplicate. Thus, you can generate, say, about twice as many words as needed, dedup the file, just in case, then make the final selection.

Kind regards, ---------------- Paul M. Dorfman Jacksonville, FL ----------------

> -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On > Behalf Of Gerstle, John > Sent: Monday, June 14, 2004 12:16 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Need to generate random words/letter > configurations in SAS > > Ever interested in other ways of skinning the same kittie... > > I come from a non-programming world (entered into using SAS > via the applied statistics door), and so I find the hashing > language a little confusing. I haven't truly made the > cognitive jump to understand what hashing 'is'. > > Since Richard mentioned below that one could use a hashing > method to create generate random unique words/nonwords, I'd > like to see an example of the hashing version of this > program. Do you have one handy to share with the list? > > John Gerstle > CDC Information Technological Support Contract (CITS) Biostatistician > > > >> -----Original Message----- > >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On > Behalf Of > >> Richard A. DeVenezia > >> Sent: Friday, June 11, 2004 10:30 AM > >> To: SAS-L@LISTSERV.UGA.EDU > >> Subject: Re: Need to generate random words/letter configurations in > SAS > >> > >> > >> Here is one way to ensure unique random words. Other ways would > include > >> hashing. > >> > >> proc sql; > >> create table words > >> ( word char(15) unique not missing) > >> ; > >> insert into words values ('.'); > >> quit; > >> > >> data words; > >> > >> modify words; > >> > >> seed = 1; > >> > >> do row = 0 by 0 until (row = 5000); > >> array ch [15] $1 _temporary_; > >> addr = addr(ch[1]); > >> > >> L = 15*ranuni(seed)+1; * random length; > >> do i = 1 to L; > >> ch[i] = byte (rank('a') + 26 * ranuni(seed)); * random > characters > >> a..z; > >> word = peekc (addr, L); > >> end; > >> > >> OUTPUT; * will _error_=1 if word is not unique; > >> > >> row + (_error_ = 0); * count non-error outputs; > >> _error_ = 0; * reset error flag (so next OUTPUT will be > attempted); > >> end; > >> > >> stop; * every MODIFY needs a stop; run; > >> > >> -- > >> Richard A. DeVenezia > >> http://www.devenezia.com/downloads/sas/samples >


Back to: Top of message | Previous page | Main SAS-L page