| Date: | Wed, 16 Aug 2006 23:51:04 -0400 |
| Reply-To: | Don Henderson <donaldjhenderson@HOTMAIL.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Don Henderson <donaldjhenderson@HOTMAIL.COM> |
| Subject: | Re: Sum-able Summaries |
| In-Reply-To: | <200608170334.k7GMGaav027301@mailgw.cc.uga.edu> |
| Content-Type: | text/plain; charset="US-ASCII" |
Carl,
What you have described is well known in statistics and is referred to as
"sufficiency" or sometimes as a set of "sufficient statistics." My stats
experience is a bit rusty so I may be slightly off in the
terminology/description. And if so, I am sure the statisticians on the list
will (hopefully) nicely correct me.
If you do a Google search on sufficient statistics you will get a lot of
hits. Some of them will be very theoretical; but many will be very
practical. The concept of sufficient statistics is also a key part of OLAP
and the ability to roll up summaries on the fly from a defined set of
"sufficient statistics."
In your case you just need to identify which statistics/summaries you may
eventually need, and from there you can determine what the set of sufficient
statistics is.
You have a good start and maybe even a complete list of what you need by
including the:
- sum
- count
- min
- max
- sum of squares
You do need to be aware of the point that Howard made about the count of
observations vs. the count of non-missing values for a variable. If you can
assume that none of the variables is ever missing, then you only need to
keep the count of the rows (what Howard refers to as _FREQ_).
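To make the row count vs. non-missing count distinction concrete, here is a
minimal sketch in Python rather than SAS. The function name `summarize`, the
dictionary keys, and the use of None for a missing value are all assumptions
made for illustration; none of them come from Carl's data or from SAS itself.

```python
import math

def summarize(values):
    """Collapse raw measurements into one summary record.

    None stands in for a missing value (an assumption for this sketch).
    """
    present = [v for v in values if v is not None]
    return {
        "freq": len(values),   # rows seen, analogous to _FREQ_
        "n": len(present),     # non-missing values, analogous to N
        "sum": math.fsum(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        # Uncorrected sum of squares of the non-missing values.
        "uss": math.fsum(v * v for v in present),
    }

# Four rows, one of them missing: freq and n differ.
record = summarize([10.0, None, 30.0, 20.0])
print(record["freq"], record["n"])
```

If a variable can never be missing, "freq" and "n" always agree and only one
of them needs to be stored.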
I would respectfully disagree with Howard on the sum vs. the mean. If you
have either along with the count, you can calculate the other. But if you
store the mean, you will have to calculate the sum before you can aggregate
further. Since I expect that most of the time you will be further
summarizing your summary data, it might be best to store the sum rather than
the mean. Just my opinion however.
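As a rough illustration of why the sum is convenient, here is a small Python
sketch (not SAS) of rolling two summary records into one and then deriving
the mean and standard deviation afterwards. The record layout, the function
names, and the sample numbers are hypothetical; "uss" is assumed to hold the
uncorrected sum of squares, and "n" the count of non-missing values.

```python
import math

def merge(a, b):
    """Combine two summary records into one covering both groups."""
    return {
        "n": a["n"] + b["n"],
        "sum": a["sum"] + b["sum"],      # sums simply add
        "min": min(a["min"], b["min"]),
        "max": max(a["max"], b["max"]),
        "uss": a["uss"] + b["uss"],      # sums of squares simply add
    }

def mean(s):
    """Mean derived on demand from the stored sum and count."""
    return s["sum"] / s["n"]

def std(s):
    """Sample standard deviation recovered from n, sum, and uss."""
    css = s["uss"] - s["sum"] ** 2 / s["n"]   # corrected sum of squares
    return math.sqrt(max(css, 0.0) / (s["n"] - 1))

# Hypothetical summaries of the raw values 10, 20, 30 and 40, 50.
morning = {"n": 3, "sum": 60.0, "min": 10.0, "max": 30.0, "uss": 1400.0}
evening = {"n": 2, "sum": 90.0, "min": 40.0, "max": 50.0, "uss": 4100.0}

day = merge(morning, evening)
print(day["n"], day["sum"], mean(day), std(day))
```

Had the records stored the mean instead, merge() would first have to rebuild
each sum as mean * n before adding. One caveat with this layout: subtracting
sum**2 / n from uss can lose precision when the mean is large relative to the
spread of the data.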
Regards,
-don h
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Howard
Schreier <hs AT dc-sug DOT org>
Sent: Wednesday, August 16, 2006 11:35 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Sum-able Summaries
On Wed, 16 Aug 2006 16:38:18 -0400, Carl Kyonka <Carl.Kyonka@ENBRIDGE.COM>
wrote:
>I have some fairly large datasets of computer performance information
>(20+GB). Much of the data is collected at 5 minute intervals. I think I
>need this level of detail for one or two months into the past, but further
>than that, it would be better to have a suitable summary of the data. But
>how do I effectively (efficiently?) summarize this data? The goal here
>would be to keep the long-term data small.
>
>It seems to me that for each measure, a summary might include:
>
>Sum of all observations
Perhaps keep the mean instead. It is a little more natural to use in
subsequent processing.
>Count of all observations
This would be the N statistic (number of non-missing values), right?
Elsewhere you have _FREQ_ or some other variable recording the number of
observations in each group; this bit of information is not specific to any
measure.
>Min
>Max
>Sum of squares
>
>For example, if the C: drive of a Windows server is measured for its %
>disk active time, and this is done every five minutes, one summary might
>be over 8 hour intervals. So 96 observations (60 min/5 min * 8 hours = 96
>observations) would be collapsed into one summary with six numerical
>variables and some number of CLASS variables (server name, disk name,
>datetime span, monitoring frequency, Windows object, Windows counter and
>instance).
>
>One other aim in this summarization is to be sum-able. That is, it should
>be possible to further summarize the summary records into even longer
>timespans or other aggregates based on the CLASS variables.
>
>I'm sure this has been done in some contexts (MXG, MICS, cubes, etc.), but
>I am not aware of discussions of which stats to use in the summary. Does
>anyone know of such a discussion or have experience in generating them?
>
>Carl Kyonka
>Capacity & Performance
>Enbridge
>416 495 5076