LISTSERV at the University of Georgia
Date:   Wed, 16 Aug 2006 23:51:04 -0400
Reply-To:   Don Henderson <donaldjhenderson@HOTMAIL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Don Henderson <donaldjhenderson@HOTMAIL.COM>
Subject:   Re: Sum-able Summaries
In-Reply-To:   <200608170334.k7GMGaav027301@mailgw.cc.uga.edu>
Content-Type:   text/plain; charset="US-ASCII"

Carl,

What you have described is well known in statistics and is referred to as "sufficiency" or sometimes as a set of "sufficient statistics." My stats experience is a bit rusty so I may be slightly off in the terminology/description. And if so, I am sure the statisticians on the list will (hopefully) nicely correct me.

If you do a Google search on sufficient statistics you will get a lot of hits. Some of them will be very theoretical; but many will be very practical. The concept of sufficient statistics is also a key part of OLAP and the ability to roll up summaries on the fly from a defined set of "sufficient statistics."

In your case you just need to identify which statistics/summaries you may eventually need, and from there you can determine what the set of sufficient statistics is.

You have a good start and maybe even a complete list of what you need by including the:

- sum
- count
- min
- max
- sum of squares
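To illustrate why these five statistics roll up cleanly, here is a minimal sketch in Python (not SAS, purely for illustration). The `Summary` class and its field names are hypothetical, and it assumes every group has at least one non-missing value. Two summaries merge by adding the sums, counts, and sums of squares and by taking the min of the mins and the max of the maxes; the mean and the sample variance can then be derived from the merged record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    """Hypothetical mergeable summary of one measure (names are illustrative)."""
    n: int         # count of non-missing values
    total: float   # sum of values
    lo: float      # minimum
    hi: float      # maximum
    sumsq: float   # sum of squared values

    @staticmethod
    def of(values):
        # Assumes at least one value; min()/max() raise on an empty list.
        vals = list(values)
        return Summary(len(vals), sum(vals), min(vals), max(vals),
                       sum(v * v for v in vals))

    def merge(self, other):
        # Every field combines without access to the detail rows.
        return Summary(self.n + other.n,
                       self.total + other.total,
                       min(self.lo, other.lo),
                       max(self.hi, other.hi),
                       self.sumsq + other.sumsq)

    def mean(self):
        return self.total / self.n

    def variance(self):
        # Sample variance (divisor n - 1); requires n >= 2.
        return (self.sumsq - self.total ** 2 / self.n) / (self.n - 1)


first = Summary.of([1, 2, 3])
second = Summary.of([4, 5])
merged = first.merge(second)
```

Merging `first` and `second` gives the same record as summarizing all five values at once. One caveat: deriving the variance from the sum and sum of squares can lose precision when the values are large relative to their spread.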

You do need to be aware of the point that Howard made about the count of observations vs. the count of non-missing values for a variable. If you can assume that none of the variables is ever missing, then you only need to keep the count of the rows (what Howard refers to as _FREQ_).
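The distinction between the row count and the per-variable count can be sketched as follows (Python for illustration only; `summarize_rows` and the sample variable names are hypothetical, and missing values are assumed to be represented as `None` or NaN). The row count is stored once per group, while each measure carries its own count of non-missing values.

```python
import math


def summarize_rows(rows, variables):
    """Summarize a group of rows (dicts); None or NaN counts as missing."""
    out = {"_FREQ_": len(rows)}  # one row count per group, not per measure
    for var in variables:
        vals = [r[var] for r in rows
                if r.get(var) is not None and not math.isnan(r[var])]
        out[var] = {"n": len(vals), "sum": sum(vals)}  # per-measure count
    return out


rows = [
    {"cpu": 10.0, "disk": 5.0},
    {"cpu": 20.0, "disk": None},
    {"cpu": 30.0, "disk": float("nan")},
]
result = summarize_rows(rows, ["cpu", "disk"])
```

Here the group has three rows, `cpu` has three non-missing values, and `disk` has only one, so dividing the `disk` sum by the row count would understate its mean.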

I would respectfully disagree with Howard on the sum vs. the mean. If you have either along with the count, you can calculate the other. But if you store the mean, you will have to calculate the sum before you can aggregate further. Since I expect that most of the time you will be further summarizing your summary data, it might be best to store the sum rather than the mean. Just my opinion however.
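A small sketch of the difference (Python for illustration; both function names are hypothetical). Stored sums combine by simple addition, whereas stored means must first be multiplied by their counts to recover the sums. Averaging the means directly gives the wrong answer whenever the counts differ.

```python
def combine_from_sums(parts):
    """parts: (sum, count) pairs. Returns the combined (sum, count)."""
    parts = list(parts)
    return sum(s for s, _ in parts), sum(c for _, c in parts)


def combine_from_means(parts):
    """parts: (mean, count) pairs. Each sum is recovered as mean * count."""
    return combine_from_sums((m * c, c) for m, c in parts)


# Two groups: values [1, 2, 3] and values [4, 5].
total, n = combine_from_means([(2.0, 3), (4.5, 2)])
overall_mean = total / n            # weighted correctly by count
naive_mean = (2.0 + 4.5) / 2        # unweighted average of the two means
```

The overall mean of the five values is 3.0, while the unweighted average of the two group means is 3.25.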

Regards,
-don h

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Howard Schreier <hs AT dc-sug DOT org>
Sent: Wednesday, August 16, 2006 11:35 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Sum-able Summaries

On Wed, 16 Aug 2006 16:38:18 -0400, Carl Kyonka <Carl.Kyonka@ENBRIDGE.COM> wrote:

>I have some fairly large datasets of computer performance information
>(20+GB). Much of the data is collected at 5 minute intervals. I think I
>need this level of detail for one or two months into the past, but further
>than that, it would be better to have a suitable summary of the data. But
>how do I effectively (efficiently?) summarize this data? The goal here
>would be to keep the long-term data small.
>
>It seems to me that for each measure, a summary might include:
>
>Sum of all observations

Perhaps keep the mean instead. It is a little more natural to use in subsequent processing.

>Count of all observations

This would be the N statistic (number of non-missing values), right? Elsewhere you have _FREQ_ or some other variable recording the number of observations in each group; this bit of information is not specific to any measure.

>Min
>Max
>Sum of squares

>
>For example, if the C: drive of a Windows server is measured for its %
>disk active time, and this is done every five minutes, one summary might
>be over 8 hour intervals. So 96 observations (60 min/5 min * 8 hours = 96
>observations) would be collapsed into one summary with six numerical
>variables and some number of CLASS variables (server name, disk name,
>datetime span, monitoring frequency, Windows object, Windows counter and
>instance).
>
>One other aim in this summarization is to be sum-able. That is, it should
>be possible to further summarize the summary records into even longer
>timespans or other aggregates based on the CLASS variables.
>
>I'm sure this has been done in some contexts (MXG, MICS, cubes, etc.), but
>I am not aware of discussions which stats to use in the summary. Does
>anyone know of such a discussion or have experience in generating them?
>
>Carl Kyonka
>Capacity & Performance
>Enbridge
>416 495 5076

