Date: Thu, 19 Jul 2007 10:59:53 +1000
Reply-To: d@dkvj.biz
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David Johnson <d@DKVJ.BIZ>
Subject: Re: reducing length of numeric variables
In-Reply-To: <200707181506.l6IAlGlk029666@mailgw.cc.uga.edu>
Content-Type: text/plain; charset="iso-8859-1"
I shouldn't write messages just before I go to bed. The expression I
intended was "3-digit number".
Be careful when referring to the storage of numbers. While the best OS uses
true floating point, Windows and Unix use IEEE, and the differences can be
found in the tables on numeric representation in the appropriate companion
to the operating system in the online help. Under IEEE, every integer up to
8192 (2**13) is stored exactly at 3 bytes, which makes intuitive sense.
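As a rough check of that figure, here is a minimal Python sketch (not SAS code) of what shortening appears to do on IEEE platforms. It assumes that a numeric of length n keeps the n most significant bytes of the 8-byte double and drops the rest; the names truncate_double and max_exact_integer are made up for illustration.

```python
import struct

def truncate_double(value, nbytes):
    """Keep the nbytes leading bytes of an IEEE 754 double and zero the rest.

    Assumption: this mimics how a shortened numeric is written to disk
    on Windows/Unix (truncation, no rounding).
    """
    if not 3 <= nbytes <= 8:
        raise ValueError("nbytes must be between 3 and 8")
    raw = struct.pack(">d", value)              # big-endian: sign/exponent first
    kept = raw[:nbytes] + b"\x00" * (8 - nbytes)
    return struct.unpack(">d", kept)[0]

def max_exact_integer(nbytes):
    """Largest k such that every integer from 0 to k survives truncation."""
    # 1 sign bit and 11 exponent bits leave 8*nbytes - 12 stored mantissa
    # bits; the implicit leading 1 adds one more significant bit.
    return 2 ** (8 * nbytes - 11)

for n in range(3, 9):
    print(n, max_exact_integer(n))

print(truncate_double(8192.0, 3))   # 8192.0
print(truncate_double(8193.0, 3))   # 8192.0 (odd value above the limit loses 1)
print(truncate_double(8195.0, 3))   # 8194.0
```

Under that assumption the limit at 3 bytes comes out as 2**13 = 8192, and odd values above it come back as the next lower even number.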
As for data movement between disk and memory, it does make a significant
difference with some data and some processes, and is one of a series of
steps one can try to optimise a process. Clearly, one should have exhausted
the long standing optimisations that limit width and depth of data retrieved
first, but when one speaks of very large data sets, sometimes one has to
tackle the efficiencies that lie further down the list.
Bear in mind the implications of queuing theory. If data are stored on a
network resource, then either the network traffic, or the disk access, or
both will be affected by other users. The longer one user takes to retrieve
data, the longer other users will wait, and the longer we will wait when
they finally get their turn. You may not see it much on the autobahn, but a
single slower moving vehicle on a motorway can have very significant effects
for all users. The English motorways pay "tribute" to this.
Kind regards
David
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU]On Behalf Of
Gerhard Hellriegel
Sent: Thursday, 19 July 2007 1:06 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: reducing length of numeric variables
David,
0-999 has nothing to do with a 3-byte number in floating point!
3 byte fp means:
1 byte for the mantissa, 2 bytes for the num. That means 2*8 = 16
positions available for powers of 2 incl. sign. So you have something like
2**15 for the significant digits, which is around 32768 for the biggest
number in theory. For some reason, that is only theory and in practice it
is something around 28000 on mainframe and 8000 on PCs.
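For comparison, on IEEE platforms (Windows, Unix) the fields are counted in bits rather than bytes: 1 sign bit, 11 exponent bits and 52 mantissa bits in the full 8-byte number. A small Python sketch (not SAS; double_fields and stored_mantissa_bits are hypothetical helper names) splits a value into those fields and shows how many mantissa bits remain at a shortened length, assuming the low-order bytes are simply dropped:

```python
import struct

def double_fields(value):
    """Return (sign, biased exponent, mantissa) of an IEEE 754 double."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF          # 11 exponent bits, bias 1023
    mantissa = bits & ((1 << 52) - 1)        # 52 explicit mantissa bits
    return sign, exponent, mantissa

def stored_mantissa_bits(nbytes):
    """Mantissa bits left when only the nbytes leading bytes are kept
    (assumed truncation)."""
    return 8 * nbytes - 12

dropped = 52 - stored_mantissa_bits(3)       # mantissa bits lost at length 3
for x in (999.0, 8192.0, 8193.0):
    sign, exponent, mantissa = double_fields(x)
    # The value fits in 3 bytes only if all dropped mantissa bits are zero.
    fits = mantissa & ((1 << dropped) - 1) == 0
    print(x, sign, exponent - 1023, format(mantissa, "052b"), fits)
```

With 12 stored mantissa bits plus the implicit leading bit, 999 and 8192 fit at length 3 and 8193 does not.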
You don't gain any advantage in the PDV, because in the PDV all nums are
8 bytes long. The only thing is the movement between disk and memory. I
can't imagine that this makes a significant difference. There are so many
other things to tune in programs, I'd not concentrate on such peanuts.
Someone many years ago told me to use only lowercase letters to preserve
disk space. I'm still not sure if he was serious...
But for American people that is so much easier than for Germans! I could
give you some examples where it is definitely NOT possible to use
lowercase words!
On Thu, 19 Jul 2007 00:32:18 +1000, David Johnson <d@DKVJ.BIZ> wrote:
>Space may not be terribly expensive, but processing and reporting time
>certainly is. A 10% reduction in the PDV has clear and measurable outcomes
>for the time to process the data, and it is that I/O process that lies at
>the root of my argument.
>
>I don't disagree with the idea that thoughtless shortening of numbers
>causes problems, and I solved my fair share of them prior to 2000 on
>mainframe systems that were designed with only the 20th century in mind.
>The classic error of which was the tape retention date which at 9900 was
>interpreted as "don't expire" but intuitively was wrong when dates moved
>from 1999 to 2000. Countless similar issues arose and the writing was on
>the wall for those attentive to the issue when a utility company set up
>their billing system in the early 80's to handle account values up to
>99.99, never expecting that within a few years some of their bills would
>increase by an order of magnitude and fail to be printed correctly.
>
>On the other hand, pay attention to specific examples I quote:
>. A time value from -12:00:00 to 12:00:00. The range will always be in a
>SAS table from -( 12 * 60 * 60) to (12 * 60 * 60). That matches now and
>forever the IEEE based storage on Windows and Unix for SAS tables.
>
>. Similarly, a 3 byte positive integer will have values from 0 to 999 and
>without a change in the particular code structure, will function as
>designed.
>
>. Time values up to 1 day fall within 0 to (24 * 60 * 60) and unless we
>have a change to decimal time or stardate, this will not change either.
>
>. And my decision to hold date times as 6 byte numbers will preserve times
>to the second until the 67th century. I dare say I shan't care too much by
>then, and doubt the system will see the 22nd century, never mind the 67th.
>
>
>There are circumstances where shortening of numeric values can be done
>with due caution and consideration.
>
>Kind regards
>
>David
>
>
>-----Original Message-----
>From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU]On Behalf Of
>Gerhard Hellriegel
>Sent: Tuesday, 17 July 2007 6:53 PM
>To: SAS-L@LISTSERV.UGA.EDU
>Subject: Re: reducing length of numeric variables
>
>
>I never thought that so many people do that! I thought that was a relic
>of 10 years ago, when space was expensive. I've always found that
>shortening numbers was a dangerous thing! A source of strange errors, of
>errors which might be invisible for a long time! Old MXG routines did that
>often (you know, some of them are older than 10 years). I once saw a
>customer who accounted printed lines for his host printers. Nobody thought
>that there might be more than 1 billion lines printed. The printers would
>need months for that and they had only one month to count. His print
>center grew and grew and the fast laser printers came - no problem to
>print in 1 day several times the amount which took a month in the old
>days. The counter was too small, and for several months nobody noticed!
>All that trouble to have a reduction of some MB for a dataset with some
>GB?
>If you have a dataset with many numeric variables, better try
>COMPRESS=BINARY and see what that can do for you!
>A few days ago, another thing: an ID number, which cannot get bigger than
>some digits, all stored numeric. The length of 3 for those integer numbers
>is big enough. Ok on mainframe! But the philosophy of that site is that
>all programs can also run on other platforms, e.g. Win or UNIX. Ok on
>mainframe, too short for Windows!
>So many programs had to be changed for much money! Much more than the disk
>space might cost to store an 8-byte variable!
>
>So I'd suggest: think more than 5 times before you reduce numeric lengths!
>
>
>
>
>On Tue, 17 Jul 2007 18:21:50 +1000, David Johnson <d@DKVJ.BIZ> wrote:
>
>>Well put Paul, I put his email aside to tell him to read the manual on
>>numeric representation, but your detailed reply obviates that need. I
>>could have told you that privately, but I'm going to express another point
>>of view to see whether we get a good debate going on the subject.
>>
>>In a (currently unpublished) document I posit the view that if you deal
>>with particularly large data sets, any steps you take in terms of reducing
>>the block size have a benefit. Where the table has both many rows and a
>>wide PDV, the I/O is going to be the major process burden, and pulling a
>>little more CPU for the reconversion of short numeric columns is unlikely
>>to be significant to the overall time. Thus we may structure the data
>>where all the numeric columns are together and parsimonious, and the
>>character columns also adjacent and then the record compressed. I shall
>>have to recheck part of that paradigm because I can't now remember whether
>>the compression of adjacent character columns was a benefit introduced in
>>V8 and obviated in SAS9, or whether it is a more recent SAS suggestion. At
>>the time of writing I don't have web access to search the SAS website for
>>the reference.
>>
>>Let me comment though on the nature of the data. A positive integer
>>column with 3 digit values; a datetime column with values down to integer
>>seconds; a time zone column with values in the range of -12 to +12 hours;
>>three positive integer columns with values up to perhaps 8M; and a handful
>>of others. All would seem to benefit from shortening. As for the quantity
>>of the data, after some 3 to 5 months, there is too much data for the
>>operating system to store in a single file, so one needs to partition the
>>tables into monthly or similar segments, and then read them as a single
>>table through a "Union" view.
>>
>>I hadn't considered using an IB format, but it might be of benefit for
>>the datetimes, which are stored as 6-byte numbers. If I can crib another
>>byte or two off the PDV, that will have benefits both for the daily
>>loading and reporting processes. Unfortunately, if they are stored as
>>character variables, all code will need to make conversions, which may not
>>be a good solution. Considering options is good, even if you eventually
>>stay with an existing method.
>>
>>Kind regards
>>
>>David
>>
>>-----Original Message-----
>>From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU]On Behalf Of Paul
>>Dorfman
>>Sent: Monday, 16 July 2007 5:01 AM
>>To: SAS-L@LISTSERV.UGA.EDU
>>Subject: Re: reducing length of numeric variables
>>
>>
>>Ovidiu,
>>
>>Note that a numeric variable can be stored with a length shorter than 8
>>on disk only. Once it is back to the PDV, it is expanded to the whole 8
>>bytes no matter how it might have been stored, from which point on, its
>>mantissa, exponent and sign (52, 11, and 1 bits, respectively) are
>>represented as usual.
>>
>>If you really want to use less disk space to store integer numbers, save
>>them as character values using binary integer formats IBw./PIBw. and
>>their informat counterparts to turn the string into computable numbers.
>>This way, 3 bytes can store a positive number of up to
>>input('ffffff'x,pib3.)=256**3-1=16,777,215, or half that in the absolute
>>value if signed. Just do not get carried away with the PIB width: the
>>largest integer constant ('exactint',8) corresponds to
>>input('ffffffffffff1f'x,pib7.)-1, so staying within PIB6. guarantees exact
>>back-and-forth conversion. However, at lengths greater than 4 character
>>bytes, there is no real advantage of saving space this way, particularly
>>because conversions tie up computing resources. Perhaps 1-4 character
>>bytes (the equivalent of the maximal positive integer from 255 to
>>256**4-1) would be reasonable.
>>
>>Kind regards
>>-----------------
>>Paul Dorfman
>>Jax, FL
>>-----------------
>>>
>>> From: Ovidiu Negrila <diddy1512@YAHOO.COM>
>>> Date: 2007/07/15 Sun AM 09:09:42 EDT
>>> To: SAS-L@LISTSERV.UGA.EDU
>>> Subject: reducing length of numeric variables
>>>
>>> Hi,
>>>
>>> In Windows the shortest length for a numeric variable is 3 and the
>>> maximum stored value is 8192. I observed that even numbers higher than
>>> 8192 are also stored and the odd numbers are stored as even numbers too
>>> (stored value = odd value - 1).
>>> I want to know what the internal binary representation is: the sign, the
>>> exponent, the mantissa.
>>> I will be really grateful for some examples.
>>>
>>> Thanks,
>>>
>>> Ovidiu
>>>