LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (January 2006, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Sun, 15 Jan 2006 17:04:35 -0800
Reply-To:   David L Cassell <davidlcassell@MSN.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   David L Cassell <davidlcassell@MSN.COM>
Subject:   Re: Character vs. Numeric Keys
In-Reply-To:   <200601151845.k0FI631g032237@mailgw.cc.uga.edu>
Content-Type:   text/plain; format=flowed

hayflinger@GMAIL.COM wrote: >I am responsible for bringing a very large data source into my company >in a SAS format. The variables that act as a key between several >tables are being stored as a numeric variable. I am trying to convince >them that they should be stored as a character. There are several >variables that are a less than 3 positions that are also stored as >numeric....My argument for these have been accepted because they were >storing them with a length of 8 and the length of the characters are >much shorter. > >My main complaint is that we will need to join two tables based on this >variable. I argue against performing a join in SAS using the following >example - > > 48 data _null_; > 49 put 'Does I ever equal 1?'; > 50 do i=0 to 1.5 by .1; > 51 if i=1 then put ' YES (' i= ' )' ; > 52 else put ' NO (' i= ' )' ; > 53 end; > 54 run; > > Does I ever equal 1? > NO (i=0 ) > NO (i=0.1 ) > NO (i=0.2 ) > NO (i=0.3 ) > NO (i=0.4 ) > NO (i=0.5 ) > NO (i=0.6 ) > NO (i=0.7 ) > NO (i=0.8 ) > NO (i=0.9 ) > NO (i=1 ) > NO (i=1.1 ) > NO (i=1.2 ) > NO (i=1.3 ) > NO (i=1.4 ) > NO (i=1.5 ) > >As you can see, clearly 'i' did equal one, but they way SAS stores >numerics made the condition fail. The numbers that we are using for >keys are 12 digits. Many of the tables we are talking about are >100 >million records...the largest is approaching 1.75 billion records. If >it matters, it is being stored in a UNIX box with SAS 8.2....(yeah, I >know...nothing like moving at the speed of business!) > >Is this a valid argument? Also, what are the ramifications of >indexing a numeric variable? > >Can anyone link to some best practice information about how keys should >be sorted in SAS.

I have a few comments. I agree with Lou, as often happens, so you can start with his comments and go from there.

[1] The round-off situation you show doesn't matter for integers. An 8-byte SAS integer can hold about 15 places accurately. If your numbers are not going to get larger than 12 digits, and they are not going to have decimal places, then you should still be safe.

[2] The issue of 'small' numbers can be handled in a couple ways. If you have satellite data which can be held in auxiliary tables, then SAS formats may allow you to hold the necessary data and remove unneeded variables from the primary table(s). If the data take on integer values less than a few thousand or so (the exact numbers can be found in the SAS Online Docs) then you can change the LENGTH of the numeric variables to something which still preserves the accuracy but takes up less storage room on disk. For Windows and unix systems the minimum is length 3, but VMS allows length 2. And, of course, you can change very small integers that stay between 0 and 9 to 1-character strings. But this tends to cause a lot of bookkeeping work on your part, as you have to keep converting from character to numeric and back in order to work with them. So you have to trade programmer efficiency and maintenance efficiency for storage efficiency. I try to avoid this, unless the numbers are really just place-holders for other information, like marital status or something, which could just as easily be stored as 'A' through 'J' with no loss of utility.

[3] SAS indices don't care whether you have numeric or character variables as your indexing variable. If your data are already sorted on this variable and you want to merge data sets, then the index doesn't help much. In fact, you can be led astray by the temptation to index instead of planning a more complex way of joining which might be more (time) efficient. What sort of use are you planning on getting out of the index or indices? There's a book on indices by Mike Raithel which might be a good guide for you. I find that if the goal is selection of 'small' subsets from the data, then indices can be a big plus. If you're going to end up reading the entire data set anyway, then indexed access may be inconveniently slow. It depends. Sometimes you just have to try it several ways to see what is going to be best for your situation.

HTH, David -- David L. Cassell mathematical statistician Design Pathways 3115 NW Norwood Pl. Corvallis OR 97330

_________________________________________________________________ Express yourself instantly with MSN Messenger! Download today - it's FREE! http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/


Back to: Top of message | Previous page | Main SAS-L page