Date: Sat, 4 Aug 2007 13:29:43 -0400
Reply-To: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM>
Subject: Re: nway unique counts
Content-Type: text/plain; charset=ISO-8859-1
On Wed, 1 Aug 2007 19:49:58 -0400, Kenneth Karan <posible88-sswug@YAHOO.COM>
wrote:
>I already have a way to do this that uses macros, proc sort, and first. by
>processing. This is horrendously inefficient. The dataset I need to use
>this on is way too large.
This comment did not register because around here people are always talking
about having huge files, where the definition of "huge" is: "bigger than I
am accustomed to using". So a guy who usually works with 10K rows thinks
that 0.5M rows is huge.
Kenneth subsequently revealed that he has 400M rows, which is kind of
extreme and which changes the picture completely.
Q for Kenneth: You say "I already have a way to do this". Do you mean you
have successfully processed at full scale (400M)?
More comments and questions interspersed below.
>Can someone recommend an approach that would be
>more efficient? Would indexes help? Data step hash objects?
>
>I could load this into Oracle and use the "count(distinct(SSN))" functions
>with the "group by cube" statement but I am trying to avoid involving IT.
>
>Thanks in advance.
>
>~ Ken
>
>Given the dataset:
>
>Obs State Gender Mon SSN
>
> 1 NY M Jan 123456789
> 2 NY M Feb 123456789
> 3 NY M Mar 123456789
> 4 NY M Jan 987654321
> 5 NY M Feb 987654321
> 6 FL M Mar 987654321
> 7 NY F Jan 234567890
> 8 NY F Feb 234567890
> 9 FL F Jan 345678901
>
>I would like to produce the dataset:
>
>Obs State Gender Mon _type_ People
>
> 1 0 4
> 2 Feb 1 3
> 3 Jan 1 4
> 4 Mar 1 2
> 5 F 2 2
> 6 M 2 2
> 7 F Feb 3 1
> 8 F Jan 3 2
> 9 M Feb 3 2
> 10 M Jan 3 2
> 11 M Mar 3 2
> 12 FL 4 2
> 13 NY 4 3
> 14 FL Jan 5 1
> 15 FL Mar 5 1
> 16 NY Feb 5 3
> 17 NY Jan 5 3
> 18 NY Mar 5 1
> 19 FL F 6 1
> 20 FL M 6 1
> 21 NY F 6 1
> 22 NY M 6 2
> 23 FL F Jan 7 1
> 24 FL M Mar 7 1
> 25 NY F Feb 7 1
> 26 NY F Jan 7 1
> 27 NY M Feb 7 2
> 28 NY M Jan 7 2
> 29 NY M Mar 7 1
The problem in a nutshell is to count observations for each level and
crossing but to ignore repetitions of SSN.
By the way, Kenneth and SAS seem to use the term NWAY in opposite fashions.
In PROC MEANS/SUMMARY, the NWAY option tells SAS to only summarize within
the most granular subsets (that is, the crossings of *all* CLASS variables).
Kenneth wants the opposite, including counts for all of the broader subsets
defined by fewer variables, and for the whole population.
On Thu, 2 Aug 2007 11:40:48 -0400, Kenneth Karan <posible88-sswug@YAHOO.COM>
wrote:
>I tried that solution and it does appear to work. Much more generalizable
>than any I have come up with. Many thanks, Ken
Not clear to which suggestion this comment pertains. In any case I presume
that "does appear to work" refers to a small logic test only, not to
performance at full scale.
On Thu, 2 Aug 2007 14:52:41 -0400, Kenneth Karan <posible88-sswug@YAHOO.COM>
wrote:
>On Thu, 2 Aug 2007 13:33:11 -0400, Howard Schreier <hs AT dc-sug DOT org>
><nospam@HOWLES.COM> wrote:
>>How large? How many categorical variables, and what cardinality?
>
>408,807,823 observations
>5 variables with low-cardinality
>1 variable - the person identifier - with high cardinality and it is alpha-
>numeric.
>
>See http://en.wikipedia.org/wiki/Cardinality_%28SQL_statements%29 for
>definition I used.
>
>What is a stratifier?
A grouping variable.
>
>Thank you for your suggestion. I will try.
>
>~ Ken
On Thu, 2 Aug 2007 15:20:37 -0400, Kenneth Karan <posible88-sswug@YAHOO.COM>
wrote:
>Howard's approach is very clever. While it is a bit bulky to code all the
>nested do-loops the savings could be worth it. The $64K question is, does
>SAS store the results of the ForSql view in hidden tables during
>processing? Does it store at least part of the view?
No. It will feed the rows generated by the view directly to the next
process. That's the good news. The bad news is that the next process is a
sort. So this is not a promising direction to pursue.
>
>If one just looks at the ForSql view, the test dataset grows from 9 to 72
>observations -- a factor of 8. Given the size of my data, that would be
>way too large and generate way too much I/O.
If I understand, you have 5 dimensions in reality (in contrast to 3 in the
example). So that's a factor of 32 instead of 8.
>
>Many thanks for the response.
>
>~ Ken
I suspect that all of the approaches suggested so far will fail. PROC
SUMMARY is memory intensive and will probably crash in the process of
counting repetitions of distinct SSNs. Hash objects are useful for tasks
like this, but are also memory intensive.
You should look at Mike Raithel's "Summarizing Impossibly Large SASĀ® Data
Sets For the Data Warehouse Server Using Horizontal Summarization". It's
been published in different versions over the years, so search for the latest.
I believe that an efficient solution to this problem will involve a
combination of techniques and a careful analysis of the data set to find any
exploitable characteristics.
Questions ...
What are the 5 actual variables?
How long is the time dimension?
Is the existing data set sorted? By which variables?
For each categorical variable, to what extent do SSNs tend to be
concentrated? I would expect a particular SSN to turn up under either M or F
quite consistently. Concentration by State would be less (though people
often move and travel). Concentration by Month is low, I would guess.