| Date: | Sun, 15 Jan 2006 17:04:35 -0800 |
| Reply-To: | David L Cassell <davidlcassell@MSN.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | David L Cassell <davidlcassell@MSN.COM> |
| Subject: | Re: Character vs. Numeric Keys |
| In-Reply-To: | <200601151845.k0FI631g032237@mailgw.cc.uga.edu> |
| Content-Type: | text/plain; format=flowed |
|---|
hayflinger@GMAIL.COM wrote:
>I am responsible for bringing a very large data source into my company
>in a SAS format. The variables that act as a key between several
>tables are being stored as a numeric variable. I am trying to convince
>them that they should be stored as a character. There are several
>variables that are a less than 3 positions that are also stored as
>numeric....My argument for these have been accepted because they were
>storing them with a length of 8 and the length of the characters are
>much shorter.
>
>My main complaint is that we will need to join two tables based on this
>variable. I argue against performing a join in SAS using the following
>example -
>
> 48 data _null_;
> 49 put 'Does I ever equal 1?';
> 50 do i=0 to 1.5 by .1;
> 51 if i=1 then put ' YES (' i= ' )' ;
> 52 else put ' NO (' i= ' )' ;
> 53 end;
> 54 run;
>
> Does I ever equal 1?
> NO (i=0 )
> NO (i=0.1 )
> NO (i=0.2 )
> NO (i=0.3 )
> NO (i=0.4 )
> NO (i=0.5 )
> NO (i=0.6 )
> NO (i=0.7 )
> NO (i=0.8 )
> NO (i=0.9 )
> NO (i=1 )
> NO (i=1.1 )
> NO (i=1.2 )
> NO (i=1.3 )
> NO (i=1.4 )
> NO (i=1.5 )
>
>As you can see, clearly 'i' did equal one, but they way SAS stores
>numerics made the condition fail. The numbers that we are using for
>keys are 12 digits. Many of the tables we are talking about are >100
>million records...the largest is approaching 1.75 billion records. If
>it matters, it is being stored in a UNIX box with SAS 8.2....(yeah, I
>know...nothing like moving at the speed of business!)
>
>Is this a valid argument? Also, what are the ramifications of
>indexing a numeric variable?
>
>Can anyone link to some best practice information about how keys should
>be sorted in SAS.
I have a few comments. I agree with Lou, as often happens, so you can start
with his comments and go from there.
[1] The round-off situation you show doesn't matter for integers. An
8-byte
SAS integer can hold about 15 places accurately. If your numbers are not
going to get larger than 12 digits, and they are not going to have decimal
places, then you should still be safe.
[2] The issue of 'small' numbers can be handled in a couple ways. If you
have
satellite data which can be held in auxiliary tables, then SAS formats may
allow
you to hold the necessary data and remove unneeded variables from the
primary
table(s). If the data take on integer values less than a few thousand or so
(the exact numbers can be found in the SAS Online Docs) then you can change
the LENGTH of the numeric variables to something which still preserves the
accuracy but takes up less storage room on disk. For Windows and unix
systems the minimum is length 3, but VMS allows length 2. And, of course,
you can change very small integers that stay between 0 and 9 to 1-character
strings. But this tends to cause a lot of bookkeeping work on your part, as
you
have to keep converting from character to numeric and back in order to work
with them. So you have to trade programmer efficiency and maintenance
efficiency for storage efficiency. I try to avoid this, unless the numbers
are
really just place-holders for other information, like marital status or
something,
which could just as easily be stored as 'A' through 'J' with no loss of
utility.
[3] SAS indices don't care whether you have numeric or character variables
as
your indexing variable. If your data are already sorted on this variable
and you
want to merge data sets, then the index doesn't help much. In fact, you can
be
led astray by the temptation to index instead of planning a more complex way
of joining which might be more (time) efficient. What sort of use are you
planning on getting out of the index or indices? There's a book on indices
by
Mike Raithel which might be a good guide for you. I find that if the goal
is
selection of 'small' subsets from the data, then indices can be a big plus.
If
you're going to end up reading the entire data set anyway, then indexed
access
may be inconveniently slow. It depends. Sometimes you just have to try it
several ways to see what is going to be best for your situation.
HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today - it's FREE!
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/
|