|
John,
I believe that Richard meant hashing as a way of looking up the list of
words having already been output as an alternative to his (very clever)
MODIFY/OUTPUT scheme, where the list of words is checked implicitly via disk
search. Whatever the search structure is, it needs to be able to lend tiself
to searching and insertion in at least O(log(n)) time, otherwise as the
number of words generated grow, the increased time required for the
insertions and searches can potentially bring the word-generating process to
standstill. Hash table have the property of being both inserted and searched
at O(1), i.e. constant time, which actually cannot be beat.
If you have no access to Version 9, your only hashing option is hand-coding;
luckily, with the 5000 words to search, the simplest linear-probing
algorithm will work. The hash table size below is picked up as a prime about
1.5*5000:
87 %let max_word_len = 15 ;
88 %let n_unique_words = 5000 ;
89 %let hash_size = 14293 ; *prime ~ 1.5*&n_unique_words ;
90
91 data words ( keep = word ) ;
92 array ww [0 : &hash_size] $ &max_word_len _temporary_ ;
93
94 length word $ &max_word_len rnd_letter $ 1 ;
95
96 retain pool 'abcdefghijklmnopqrstuvwxyz' n_unique_words 0 ;
97
98 pool_len = lengthn (pool) ;
99
100 do until ( n_unique_words = &n_unique_words ) ;
101 do word_len = 1 to ceil (ranuni (1) * &max_word_len) ;
102 rnd_letter = substr (pool, ceil (ranuni (1) * pool_len)) ;
103 substr (word, word_len, 1) = rnd_letter ;
104 end ;
105
106 do h = mod (input (word, pib6.), &hash_size) by 3 until ( ww [h]
= word ) ;
107 if h => &hash_size then h +- &hash_size ;
108 if ww [h] = '' then do ;
109 n_unique_words ++ 1 ;
110 output ;
111 ww [h] = word ;
112 end ;
113 end ;
114 end ;
115 run ;
NOTE: The data set WORK.WORDS has 5000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
user cpu time 0.04 seconds
system cpu time 0.01 seconds
Memory 493k
Above, the hash table is represented by the array WW. The workings of the
code at the lines 106-113 have been explained ad nauseam (see, for instance,
http://www2.sas.com/proceedings/sugi26/p008-26.pdf).
Now if Version 9 is available for you, you need not rummage for the
algorithm because it is canned:
117 data words ( keep = word ) ;
118 dcl hash hw ( ) ;
119 hw.definekey ('word') ;
120 hw.definedone ( ) ;
121
122 length word $ &max_word_len rnd_letter $ 1 ;
123
124 retain pool 'abcdefghijklmnopqrstuvwxyz' n_unique_words 0 ;
125
126 pool_len = lengthn (pool) ;
127
128 do until ( n_unique_words = &n_unique_words ) ;
129 do word_len = 1 to ceil (ranuni (1) * &max_word_len) ;
130 rnd_letter = substr (pool, ceil (ranuni (1) * pool_len)) ;
131 substr (word, word_len, 1) = rnd_letter ;
132 end ;
133
134 if hw.add () ne 0 then continue ;
135 n_unique_words ++ 1 ;
136 output ;
137 end ;
138 run ;
NOTE: The data set WORK.WORDS has 5000 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.06 seconds
user cpu time 0.04 seconds
system cpu time 0.01 seconds
Memory 475k
The HW.ADD() method returns a non-zero value if the word having just been
concocted is already in the table; else it inserts the word in the table for
future referencing in the same manner, and the word is output. Once the
number of such words is sufficient, the process stops.
Interestingly, this is a situation where an approach many could perceive as
simpler one will work - for the simple reason that such a method has a
negligible probability of generating even a single duplicate. Thus, you can
generate, say, about twice as many words as needed, dedup the file, just in
case, then make the final selection.
Kind regards,
----------------
Paul M. Dorfman
Jacksonville, FL
----------------
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of Gerstle, John
> Sent: Monday, June 14, 2004 12:16 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Need to generate random words/letter
> configurations in SAS
>
> Ever interested in other ways of skinning the same kittie...
>
> I come from a non-programming world (entered into using SAS
> via the applied statistics door), and so I find the hashing
> language a little confusing. I haven't truly made the
> cognitive jump to understand what hashing 'is'.
>
> Since Richard mentioned below that one could use a hashing
> method to create generate random unique words/nonwords, I'd
> like to see an example of the hashing version of this
> program. Do you have one handy to share with the list?
>
> John Gerstle
> CDC Information Technological Support Contract (CITS) Biostatistician
>
>
> >> -----Original Message-----
> >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of
> >> Richard A. DeVenezia
> >> Sent: Friday, June 11, 2004 10:30 AM
> >> To: SAS-L@LISTSERV.UGA.EDU
> >> Subject: Re: Need to generate random words/letter configurations in
> SAS
> >>
> >>
> >> Here is one way to ensure unique random words. Other ways would
> include
> >> hashing.
> >>
> >> proc sql;
> >> create table words
> >> ( word char(15) unique not missing)
> >> ;
> >> insert into words values ('.');
> >> quit;
> >>
> >> data words;
> >>
> >> modify words;
> >>
> >> seed = 1;
> >>
> >> do row = 0 by 0 until (row = 5000);
> >> array ch [15] $1 _temporary_;
> >> addr = addr(ch[1]);
> >>
> >> L = 15*ranuni(seed)+1; * random length;
> >> do i = 1 to L;
> >> ch[i] = byte (rank('a') + 26 * ranuni(seed)); * random
> characters
> >> a..z;
> >> word = peekc (addr, L);
> >> end;
> >>
> >> OUTPUT; * will _error_=1 if word is not unique;
> >>
> >> row + (_error_ = 0); * count non-error outputs;
> >> _error_ = 0; * reset error flag (so next OUTPUT will be
> attempted);
> >> end;
> >>
> >> stop; * every MODIFY needs a stop; run;
> >>
> >> --
> >> Richard A. DeVenezia
> >> http://www.devenezia.com/downloads/sas/samples
>
|