LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (August 2007, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Mon, 6 Aug 2007 02:12:36 -0400
Reply-To:     Paul Dorfman <sashole@BELLSOUTH.NET>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Paul Dorfman <sashole@BELLSOUTH.NET>
Subject:      Re: nway unique counts
Comments: To: Kenneth Karan <posible88-sswug@YAHOO.COM>

Ken,

Philosophically, Howard left nothing to be added in his last post in the thread. Technically, we may try to tackle the issue in a couple of novel directions.

First, let's sweep off the table the question of forming all possible _types_. As Howard has noted, what you want is precisely the opposite of NWAY, which is delivered by MEANS/SUMMARY omitting the option. Given:

%let class = sta gen mon ;

you can run:

data model ;

array class $ 34 &class ;

do over class ;

class = vname (class) ;

end ;

run ;

proc summary data = model ;

class &class ;

output out = types (drop = _freq_) ;

run ;

Then a cursory look at file TYPES should preclude the need in further explanations on how it can be used. As you will see below, different variations on this theme can be suited to a particular purpose.

Now your main issue is performance. Sure you can use brute force and sort the file by all _TYPE_ combinations followed by SSN and use pretty straightforward BY-processing to attain the goal, which is what you have likely already tried, just to get bogged down by I/O. Other solutions presented in the thread will run pretty much into the same brick wall. SQL will actually finish the job, but will probably mega-I/O behind-the-scenes, making the exercise too slow for you to stomach; and SUMMARY solutions are likely to run out of memory long before they will have a chance to fill out the AVL tree.

Having observed in the past that the DATA step hash object is much more memory savvy than MEANS in terms of implementing a dynamic data dictionary, I naturally wondered if it can be adapted to your task, all the more that you have mentioned it in your inquiry. Of course, the goal here is to produce a single-pass solution.

My first shot at the problem went thusly:

data a ; input sta:$2. gen:$1. mon:$3. ssn: 9. ; cards ; NY M Jan 123456789 NY M Feb 123456789 NY M Mar 123456789 NY M Jan 987654321 NY M Feb 987654321 FL M Mar 987654321 NY F Jan 234567890 NY F Feb 234567890 FL F Jan 345678901 run ;

%let class = sta gen mon ;

data model ; array class $ 34 &class ; do over class ; class = quote (strip (vname (class))) ; end ; run ; proc summary data = model ; class &class ; output out = types (drop = _freq_) ; run ;

%let k0 = '_n_' ; %let dd = '_type_','count' ;

data _null_ ; set types (firstobs = 2) ; array class &class ; call symput ('k' || put (_n_, best.-L), catx (',', of class [*])) ; run ;

data _null_ ; %macro dcl (n) ; %do j = 0 %to &n ; dcl hash h&j (ordered:'a') ; h&j..definekey (&&k&j ) ; h&j..definedata(&&k&j,&dd ) ; h&j..definedone() ; dcl hash c&j (hashexp:16) ; %if &j > 0 %then %do ; c&j..definekey (&&k&j,'pib_ssn') ; %end ;

%else %do ; c&j..definekey ( 'pib_ssn') ; %end ;

c&j..definedone() ; %end ; %mend ;

%dcl (7) ;

do until (z) ; set a end = z ; pib_ssn = put (ssn, pib4.) ;

%macro cnt (n) ; %do j = 0 %to &n ; if h&j..find() = 0 then do ; if c&j..check() ne 0 then do ; count ++ 1 ; c&j..add() ; end ; end ; else do ; count = 1 ; _type_ = &j ; c&j..add() ; end ; h&j..replace() ; %end ; %mend ;

%cnt (7) ; end ;

%macro out (n) ; %do j = 0 %to &n ; h&j..output(dataset: "out&j") ; %end ; %mend ;

%out (7) ; stop ; run ;

data out ; %macro all (n) ; if 0 then set out&n ; set %do j = 0 %to &n ; out&j %end ; %mend ;

%all (7) ; run ;

I understand that at first, you may have a bit of a difficulty to discern the pattern, but MPRINT should make it all clear. Run against your sample, it produces the correct result. Run against your real large file, it will perhaps run faster than anything else... if it could manage to finish the job remaining within the confines of the allotted memory.

Unfortunately, I doubt the latter will be the case, unless you have highly repetitive SSN values on the file. Note that I have already shrunk the memory footprint roughly twice by populating the C-tables with put(ssn,pib4.) instead of full SSN. That means storing 4 SSN bytes per hash key, so in the worst case scenario, a single _TYPE_ would take up 4e8*4~1.6Gb of RAM. If the same fell into 1-, 2-, and 3-part keys, that would mean that you would need ~6Gb for SSN only to run the step.

A closer look at the code reveals that the main consumers of memory are the C-tables; H-tables store low-cardinality keys and thus are miniscule in comparison. The chief problem with the design of the C-tables is their reliance on the class composite key (sta gen mon) added to every item storing a unique SSN image. This will surely strike any SAS programmer worth his semicolon as inefficient. Doubtless we can bit-manage (sta gen mon) with its 50*2*12=1200 cardinality into mere 2 bytes (and borrow the remaining 2**16-1200 bytes for a digit from SSN), but this still bumps up the RAM to ~10Gb while making the program too data-esoteric.

The whole thing imparts the feeling of coding eclecticism somehow interfering with the goal. Intuitively, macros do not really mesh with hash objects accepting SAS variables as constructor parameters, and there should be a way to dynamically associate each unique CLASS _TYPE_ key with only those SSN values that fall into the key's domain. That would make it unnecessary to store the CLASS _TYPE_ keys together with SSN in the C-table, thus cutting drastically on real storage.

Incidentally, it shows that mixing old and new technologies does not work all that well. Once I decided to completely focus my mentality on the dynamic nature of the hash object philosophy (including the ability of hash object to store other objects), a different way of doing the same thing sprang out rather naturally:

data test ;

input state:$2. gender:$1. mon:$3. ssn: 9. ;

cards ;

NY M Jan 123456789

NY M Feb 123456789

NY M Mar 123456789

NY M Jan 987654321

NY M Feb 987654321

FL M Mar 987654321

NY F Jan 234567890

NY F Feb 234567890

FL F Jan 345678901

run ;

%let blank_class = state gender mon ;

%let comma_class = %sysfunc(tranwrd(&blank_class, %str( ), %str(,))) ;

%let items_class = %sysfunc(countw(&blank_class)) ;

data model ;

array _cc_ $ 32 _cc_1-_cc_&items_class ;

do over _cc_ ;

_cc_ = scan ("&blank_class", _i_) ;

end ;

run ;

proc summary data = model ;

class _cc_: ;

output out = types (keep = _cc_: _type_) ;

run ;

data count_distinct (keep = &blank_class _type_ count) ;

dcl hash hhh() ;

hhh.definekey ('_type_') ;

hhh.definedata ('hh','hi') ;

hhh.definedone () ;

dcl hash hh() ;

dcl hiter hi ;

dcl hash h() ;

do z = 0 by 0 until (z) ;

set types end = z nobs = n ;

array _cc_ _cc_: ;

hh = _new_ hash (ordered: 'a') ;

do over _cc_ ;

if _type_ = 0 then _cc_ = '_n_' ;

if missing (_cc_) then continue ;

hh.definekey (_cc_) ;

hh.definedata(_cc_) ;

end ;

hi = _new_ hiter ('hh') ;

hh.definedata ('_type_','count','h') ;

hh.definedone () ;

hhh.add() ;

end ;

do z = 0 by 0 until (z) ;

set test end = z ;

pib_ssn = put (ssn, pib4.) ;

do _type_ = 0 to n - 1 ;

hhh.find() ;

if hh.find() ne 0 then do ;

count = 0 ;

h = _new_ hash () ;

h.definekey ('pib_ssn') ;

h.definedone() ;

end ;

if h.check() ne 0 then do ;

count ++ 1 ;

h.add() ;

end ;

hh.replace() ;

end ;

end ;

do _type_ = 0 to n - 1 ;

call missing (&comma_class) ;

hhh.find() ;

do _iorc_ = hi.next() by 0 while (_iorc_ = 0) ;

output ;

_iorc_ = hi.next() ;

end ;

end ;

stop ;

run ;

Now, H-tables contain only 4-byte SSN images. There are as many H-tables per _TYPE_ as there are unique _TYPE_ keys; in other words, for each new value of STATE, a separate instance of H-table is created, and it will contain only SSN values attributed to that key, and the same is true for all other CLASS key combinations.

HH-tables are keyed by CLASS key combinations, and each key points to _type_, count, its own H-table and its iterator object as data. This way, when a CLASS key of any _TYPE_ comes with the input data, the program knows which H-table, containing its own SSN values, to search. When we need to output the H-table, we know which iterator to use.

Finally, HHH-table is keyed by _TYPE_, so in this case, it contains 8 HH-tables (in turn, containing H-tables). It enables us to go from one instance of HH-table to another in a simple DO-loop. In the end, we just use _TYPE_ as a key to go over each instance of HHH-table and spit out the content of each HH-table to the data set COUNT_DISTINCT.

Even though the code is rather concise, at first it may seem like a coding/production nightmare, but it only requires a serious penetration into the topic to understand how it works in concert. Using a seemingly more complex code to achieve a substantial gain in efficiency is much more than a mere exercise in multi-level hashing.

Of course, my expectation that the second solution is more efficient than the first is based on the assumption that after certain point, the overhead of maintaining a great number of hash-of-hash-of-hash instances is offset by being rid of the necessity of storing CLASS _TYPE_ keys alongside with the variable[s] whose distinct values are being counted. I suspect that for a rather low SSN cardinality the first solution may be less memory-hungry. Otherwise, whilst the hash-of-hash overhead of the second solution may be significant, it will be compensated by the more efficient storage scheme if SSN cardinality is truly high. Needless to say, it has to be tested on a machine with prodigious memory far exceeding my ThinkPad's (addressable) 3Gb.

Kind regards ------------ Paul Dorfman Jax, FL ------------

On Wed, 1 Aug 2007 19:49:58 -0400, Kenneth Karan <posible88-sswug@YAHOO.COM> wrote:

>I already have a way to do this that uses macros, proc sort, and first. by >processing. This is horrendously inefficient. The dataset I need to use >this on is way too large. Can someone recommend an approach that would be >more efficient? Would indexes help? Data step hash objects? > >I could load this into Oracle and use the "count(distinct(SSN))" functions >with the "group by cube" statement but I am trying to avoid involving IT. > >Thanks in advance. > >~ Ken > >Given the dataset: > >Obs State Gender Mon SSN > > 1 NY M Jan 123456789 > 2 NY M Feb 123456789 > 3 NY M Mar 123456789 > 4 NY M Jan 987654321 > 5 NY M Feb 987654321 > 6 FL M Mar 987654321 > 7 NY F Jan 234567890 > 8 NY F Feb 234567890 > 9 FL F Jan 345678901 > >I would like to produce the dataset: > >Obs State Gender Mon _type_ People > > 1 0 4 > 2 Feb 1 3 > 3 Jan 1 4 > 4 Mar 1 2 > 5 F 2 2 > 6 M 2 2 > 7 F Feb 3 1 > 8 F Jan 3 2 > 9 M Feb 3 2 > 10 M Jan 3 2 > 11 M Mar 3 2 > 12 FL 4 2 > 13 NY 4 3 > 14 FL Jan 5 1 > 15 FL Mar 5 1 > 16 NY Feb 5 3 > 17 NY Jan 5 3 > 18 NY Mar 5 1 > 19 FL F 6 1 > 20 FL M 6 1 > 21 NY F 6 1 > 22 NY M 6 2 > 23 FL F Jan 7 1 > 24 FL M Mar 7 1 > 25 NY F Feb 7 1 > 26 NY F Jan 7 1 > 27 NY M Feb 7 2 > 28 NY M Jan 7 2 > 29 NY M Mar 7 1


Back to: Top of message | Previous page | Main SAS-L page