Date: Mon, 6 Aug 2007 02:12:36 -0400
Reply-To: Paul Dorfman <sashole@BELLSOUTH.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul Dorfman <sashole@BELLSOUTH.NET>
Subject: Re: nway unique counts
Ken,
Philosophically, Howard left nothing to be added in his last post in the
thread. Technically, we may try to tackle the issue in a couple of novel
directions.
First, let's sweep off the table the question of forming all possible
_types_. As Howard has noted, what you want is precisely the opposite of
NWAY, which is delivered by MEANS/SUMMARY omitting the option. Given:
%let class = sta gen mon ;
you can run:
data model ;
array class $ 34 &class ;
do over class ;
class = vname (class) ;
end ;
run ;
proc summary data = model ;
class &class ;
output out = types (drop = _freq_) ;
run ;
Then a cursory look at file TYPES should preclude the need in further
explanations on how it can be used. As you will see below, different
variations on this theme can be suited to a particular purpose.
Now your main issue is performance. Sure you can use brute force and sort
the file by all _TYPE_ combinations followed by SSN and use pretty
straightforward BY-processing to attain the goal, which is what you have
likely already tried, just to get bogged down by I/O. Other solutions
presented in the thread will run pretty much into the same brick wall. SQL
will actually finish the job, but will probably mega-I/O behind-the-scenes,
making the exercise too slow for you to stomach; and SUMMARY solutions are
likely to run out of memory long before they will have a chance to fill out
the AVL tree.
Having observed in the past that the DATA step hash object is much more
memory savvy than MEANS in terms of implementing a dynamic data dictionary,
I naturally wondered if it can be adapted to your task, all the more that
you have mentioned it in your inquiry. Of course, the goal here is to
produce a single-pass solution.
My first shot at the problem went thusly:
data a ;
input sta:$2. gen:$1. mon:$3. ssn: 9. ;
cards ;
NY M Jan 123456789
NY M Feb 123456789
NY M Mar 123456789
NY M Jan 987654321
NY M Feb 987654321
FL M Mar 987654321
NY F Jan 234567890
NY F Feb 234567890
FL F Jan 345678901
run ;
%let class = sta gen mon ;
data model ;
array class $ 34 &class ;
do over class ;
class = quote (strip (vname (class))) ;
end ;
run ;
proc summary data = model ;
class &class ;
output out = types (drop = _freq_) ;
run ;
%let k0 = '_n_' ;
%let dd = '_type_','count' ;
data _null_ ;
set types (firstobs = 2) ;
array class &class ;
call symput ('k' || put (_n_, best.-L), catx (',', of class [*])) ;
run ;
data _null_ ;
%macro dcl (n) ;
%do j = 0 %to &n ;
dcl hash h&j (ordered:'a') ;
h&j..definekey (&&k&j ) ;
h&j..definedata(&&k&j,&dd ) ;
h&j..definedone() ;
dcl hash c&j (hashexp:16) ;
%if &j > 0 %then %do ; c&j..definekey (&&k&j,'pib_ssn') ; %end ;
%else %do ; c&j..definekey ( 'pib_ssn') ; %end ;
c&j..definedone() ;
%end ;
%mend ;
%dcl (7) ;
do until (z) ;
set a end = z ;
pib_ssn = put (ssn, pib4.) ;
%macro cnt (n) ;
%do j = 0 %to &n ;
if h&j..find() = 0 then do ;
if c&j..check() ne 0 then do ;
count ++ 1 ;
c&j..add() ;
end ;
end ;
else do ;
count = 1 ;
_type_ = &j ;
c&j..add() ;
end ;
h&j..replace() ;
%end ;
%mend ;
%cnt (7) ;
end ;
%macro out (n) ;
%do j = 0 %to &n ;
h&j..output(dataset: "out&j") ;
%end ;
%mend ;
%out (7) ;
stop ;
run ;
data out ;
%macro all (n) ;
if 0 then set out&n ;
set %do j = 0 %to &n ; out&j %end ;
%mend ;
%all (7) ;
run ;
I understand that at first, you may have a bit of a difficulty to discern
the pattern, but MPRINT should make it all clear. Run against your sample,
it produces the correct result. Run against your real large file, it will
perhaps run faster than anything else... if it could manage to finish the
job remaining within the confines of the allotted memory.
Unfortunately, I doubt the latter will be the case, unless you have highly
repetitive SSN values on the file. Note that I have already shrunk the
memory footprint roughly twice by populating the C-tables with
put(ssn,pib4.) instead of full SSN. That means storing 4 SSN bytes per hash
key, so in the worst case scenario, a single _TYPE_ would take up
4e8*4~1.6Gb of RAM. If the same fell into 1-, 2-, and 3-part keys, that
would mean that you would need ~6Gb for SSN only to run the step.
A closer look at the code reveals that the main consumers of memory are the
C-tables; H-tables store low-cardinality keys and thus are miniscule in
comparison. The chief problem with the design of the C-tables is their
reliance on the class composite key (sta gen mon) added to every item
storing a unique SSN image. This will surely strike any SAS programmer worth
his semicolon as inefficient. Doubtless we can bit-manage (sta gen mon) with
its 50*2*12=1200 cardinality into mere 2 bytes (and borrow the remaining
2**16-1200 bytes for a digit from SSN), but this still bumps up the RAM to
~10Gb while making the program too data-esoteric.
The whole thing imparts the feeling of coding eclecticism somehow
interfering with the goal. Intuitively, macros do not really mesh with hash
objects accepting SAS variables as constructor parameters, and there should
be a way to dynamically associate each unique CLASS _TYPE_ key with only
those SSN values that fall into the key's domain. That would make it
unnecessary to store the CLASS _TYPE_ keys together with SSN in the C-table,
thus cutting drastically on real storage.
Incidentally, it shows that mixing old and new technologies does not work
all that well. Once I decided to completely focus my mentality on the
dynamic nature of the hash object philosophy (including the ability of hash
object to store other objects), a different way of doing the same thing
sprang out rather naturally:
data test ;
input state:$2. gender:$1. mon:$3. ssn: 9. ;
cards ;
NY M Jan 123456789
NY M Feb 123456789
NY M Mar 123456789
NY M Jan 987654321
NY M Feb 987654321
FL M Mar 987654321
NY F Jan 234567890
NY F Feb 234567890
FL F Jan 345678901
run ;
%let blank_class = state gender mon ;
%let comma_class = %sysfunc(tranwrd(&blank_class, %str( ), %str(,))) ;
%let items_class = %sysfunc(countw(&blank_class)) ;
data model ;
array _cc_ $ 32 _cc_1-_cc_&items_class ;
do over _cc_ ;
_cc_ = scan ("&blank_class", _i_) ;
end ;
run ;
proc summary data = model ;
class _cc_: ;
output out = types (keep = _cc_: _type_) ;
run ;
data count_distinct (keep = &blank_class _type_ count) ;
dcl hash hhh() ;
hhh.definekey ('_type_') ;
hhh.definedata ('hh','hi') ;
hhh.definedone () ;
dcl hash hh() ;
dcl hiter hi ;
dcl hash h() ;
do z = 0 by 0 until (z) ;
set types end = z nobs = n ;
array _cc_ _cc_: ;
hh = _new_ hash (ordered: 'a') ;
do over _cc_ ;
if _type_ = 0 then _cc_ = '_n_' ;
if missing (_cc_) then continue ;
hh.definekey (_cc_) ;
hh.definedata(_cc_) ;
end ;
hi = _new_ hiter ('hh') ;
hh.definedata ('_type_','count','h') ;
hh.definedone () ;
hhh.add() ;
end ;
do z = 0 by 0 until (z) ;
set test end = z ;
pib_ssn = put (ssn, pib4.) ;
do _type_ = 0 to n - 1 ;
hhh.find() ;
if hh.find() ne 0 then do ;
count = 0 ;
h = _new_ hash () ;
h.definekey ('pib_ssn') ;
h.definedone() ;
end ;
if h.check() ne 0 then do ;
count ++ 1 ;
h.add() ;
end ;
hh.replace() ;
end ;
end ;
do _type_ = 0 to n - 1 ;
call missing (&comma_class) ;
hhh.find() ;
do _iorc_ = hi.next() by 0 while (_iorc_ = 0) ;
output ;
_iorc_ = hi.next() ;
end ;
end ;
stop ;
run ;
Now, H-tables contain only 4-byte SSN images. There are as many H-tables per
_TYPE_ as there are unique _TYPE_ keys; in other words, for each new value
of STATE, a separate instance of H-table is created, and it will contain
only SSN values attributed to that key, and the same is true for all other
CLASS key combinations.
HH-tables are keyed by CLASS key combinations, and each key points to
_type_, count, its own H-table and its iterator object as data. This way,
when a CLASS key of any _TYPE_ comes with the input data, the program knows
which H-table, containing its own SSN values, to search. When we need to
output the H-table, we know which iterator to use.
Finally, HHH-table is keyed by _TYPE_, so in this case, it contains 8
HH-tables (in turn, containing H-tables). It enables us to go from one
instance of HH-table to another in a simple DO-loop. In the end, we just use
_TYPE_ as a key to go over each instance of HHH-table and spit out the
content of each HH-table to the data set COUNT_DISTINCT.
Even though the code is rather concise, at first it may seem like a
coding/production nightmare, but it only requires a serious penetration into
the topic to understand how it works in concert. Using a seemingly more
complex code to achieve a substantial gain in efficiency is much more than a
mere exercise in multi-level hashing.
Of course, my expectation that the second solution is more efficient than
the first is based on the assumption that after certain point, the overhead
of maintaining a great number of hash-of-hash-of-hash instances is offset by
being rid of the necessity of storing CLASS _TYPE_ keys alongside with the
variable[s] whose distinct values are being counted. I suspect that for a
rather low SSN cardinality the first solution may be less memory-hungry.
Otherwise, whilst the hash-of-hash overhead of the second solution may be
significant, it will be compensated by the more efficient storage scheme if
SSN cardinality is truly high. Needless to say, it has to be tested on a
machine with prodigious memory far exceeding my ThinkPad's (addressable)
3Gb.
Kind regards
------------
Paul Dorfman
Jax, FL
------------
On Wed, 1 Aug 2007 19:49:58 -0400, Kenneth Karan <posible88-sswug@YAHOO.COM>
wrote:
>I already have a way to do this that uses macros, proc sort, and first. by
>processing. This is horrendously inefficient. The dataset I need to use
>this on is way too large. Can someone recommend an approach that would be
>more efficient? Would indexes help? Data step hash objects?
>
>I could load this into Oracle and use the "count(distinct(SSN))" functions
>with the "group by cube" statement but I am trying to avoid involving IT.
>
>Thanks in advance.
>
>~ Ken
>
>Given the dataset:
>
>Obs State Gender Mon SSN
>
> 1 NY M Jan 123456789
> 2 NY M Feb 123456789
> 3 NY M Mar 123456789
> 4 NY M Jan 987654321
> 5 NY M Feb 987654321
> 6 FL M Mar 987654321
> 7 NY F Jan 234567890
> 8 NY F Feb 234567890
> 9 FL F Jan 345678901
>
>I would like to produce the dataset:
>
>Obs State Gender Mon _type_ People
>
> 1 0 4
> 2 Feb 1 3
> 3 Jan 1 4
> 4 Mar 1 2
> 5 F 2 2
> 6 M 2 2
> 7 F Feb 3 1
> 8 F Jan 3 2
> 9 M Feb 3 2
> 10 M Jan 3 2
> 11 M Mar 3 2
> 12 FL 4 2
> 13 NY 4 3
> 14 FL Jan 5 1
> 15 FL Mar 5 1
> 16 NY Feb 5 3
> 17 NY Jan 5 3
> 18 NY Mar 5 1
> 19 FL F 6 1
> 20 FL M 6 1
> 21 NY F 6 1
> 22 NY M 6 2
> 23 FL F Jan 7 1
> 24 FL M Mar 7 1
> 25 NY F Feb 7 1
> 26 NY F Jan 7 1
> 27 NY M Feb 7 2
> 28 NY M Jan 7 2
> 29 NY M Mar 7 1