LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2001, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Mon, 4 Jun 2001 12:57:51 -0400
Reply-To:     Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Subject:      Re: PRoc Summary 'Internals'

I think the mystery will have been cleared up if Lionel can confirm that the experience he described at the beginning of this thread was under SAS Version 6.

There does seem to be version-specific behavior, as I earlier speculated and which Thomas observed.

Here again is my test code:

data newloans; input acctkey advtp; format acctkey best12.; cards; 123456789012 3 123456789010 4 1234567890123 5 1234567890120 6 ;

PROC SUMMARY DATA=NEWLOANS NWAY ; CLASS ACCTKEY ; VAR ADVTP ; OUTPUT OUT=ADVTP00 SUM= ; run;

proc print; format acctkey 13.; run;

As documented, PROC SUMMARY uses formatted values of CLASS variables. Since two of my ACCTKEY values are 13 digits long, their formatted values (BEST12.) are identical in scientific notation with only 8 significant digits (specifically, 1.2345679E12). So my output data set has 3 observations.

This result is the same under either V. 6.12 or V. 8.1 (Windows).

But when I comment out the FORMAT statement, it's a different story. Under 6.12, PROC SUMMARY as a default applies the BEST12. format to the CLASS variable ACCTKEY, but under 8.1, PROC SUMMARY in the absence of an associated format appears to use internal values, and I get 4 observations.

By the way, I see a couple of problems in the Tech Support note (http://www.sas.com/service/techsup/unotes/V6/A/A358.html):

1. It says "the output data set is created using the formatted values" but the manual (SAS Procedures Guide, V. 6, 3rd ed., p. 376) says "When formats combine several internal values into one formatted value, the lowest internal value is output." and that is the behavior which I observe.

2. There is no "BEST12.2" format, it's "BEST12.".

On Mon, 4 Jun 2001 09:50:58 -0400, Soeder, Thomas <txs32954@GLAXOWELLCOME.COM> wrote:

>This note explains the PROC SUMMARY issue for V6. >http://www.sas.com/service/techsup/unotes/V6/A/A358.html > >In my tests, 6.12 Solaris appears to use best12. while for 8e(8.1) Solaris >it appears to be "fixed", at least for the 13 digit numerics I tried. >As documented, PROC SUMMARY uses the formatted values of CLASS variables in >an analysis. The variables are read in by their formatted values, the >analysis is done using the formatted values, and the output data set is >created using the formatted values. >Numeric variables with very large or very small values may appear to be >changed in the procedure's output data set. This is because the default >format used for the variables is BEST12.2. >This situation can be circumvented by using an appropriate format. >Products: BASE >Component: SUMMARY >Priority: N/A >Status: Usage Issue >Date: Tue, 4 Apr 1995 > >System Release Reported Release Fixed >* 6.06 >* 6.07 >IBM OS/2 6.08 TS404 >VSE/ESA (VSE) 6.08 TS404 >Windows 3.11 6.08 TS404 >VM/ESA (CMS) 6.08 TS404 >OS/390 (MVS) 6.08 TS404 >OpenVMS VAX 6.08 TS404 >AIX/6000 6.09 TS027 >ConvexOS 6.09 TS027 >HP-UX Operating Systems 6.09 TS027 >Solaris 6.09 TS027 >OpenVMS Alpha 6.09 TS027 >Windows NT 6.09 TS027 >DEC Ultrix 6.09 TS027 >Intel ABI 6.10 TS018 >Macintosh 6.10 TS037 >Silicon Graphics Unix 6.10 TS018 >Compaq Tru64 Unix 6.10 TS018 >IBM OS/2 6.10 TS019 >Windows 3.11 6.10 TS019 >No Fixes Available > > > > >Date: Fri, 1 Jun 2001 14:31:15 -0700 >From: "Smith, Curtis, Mr, DCAA" <Curtis.Smith@DCAA.MIL> >Subject: Re: PRoc Summary 'Internals' > >I have not been following this tread, but I can tell you factually, that >proc SUMMARY with a CLASS statement will drop observations that have a >missing value in one or more of the CLASS variables, unless you use the >MISSING option on the proc SUMMARY statement. Try running the proc SUMMARY >with the BY statement and then again with the CLASS statement using the >MISSING option and see if you get the same results. Also, I don't know if it >matters, but I always declare the variables I want to SUM. So, I wouldn't >use SUM=;. > >PROC SUMMARY DATA=NEWLOANS NWAY MISSING; > CLASS ACCTKEY ; > VAR ADVTP ; > OUTPUT OUT=ADVTP00 SUM=ADVTP; >run; > >-----Original Message----- >From: Howard Schreier [mailto:Howard_Schreier@ITA.DOC.GOV] >Sent: Friday, June 01, 2001 1:33 PM >To: SAS-L@LISTSERV.UGA.EDU >Subject: Re: PRoc Summary 'Internals' > > >This has me perplexed. > >There have been three theories. > >A. Missing values for the CLASS variable caused some observations to drop >out. > >Others have pointed out that this is something to always keep in mind when >using PROC SUMMARY, but I don't think it's the culprit this time. Consider >that coding a format (13.) would not turn missing values into non-missing >values. Also, if missing values were the problem, I would expect the PROC >SUMMARY with the BY statement to produce just one additional observation, >whereas it produced many thousands of additional observations. > >B. Resources were exhausted. > >I don't see how declaring a format for the CLASS variable would have >resolved this. > >C. PROC SUMMARY operates on formatted values of the CLASS variables. > >However, there was no format associated with the CLASS variable ACCTKEY; the >PROC CONTENTS output verfies this. > >So none of the three theories is convincing. > >But the one involving formatted values of CLASS variables seems closest. > >Consider the following test code: > > data newloans; > input acctkey advtp; > format acctkey best12.; > cards; > 123456789012 3 > 123456789010 4 > 1234567890123 5 > 1234567890120 6 > ; > > PROC SUMMARY DATA=NEWLOANS NWAY ; > CLASS ACCTKEY ; > VAR ADVTP ; > OUTPUT OUT=ADVTP00 SUM= ; > run; > > proc print; > format acctkey 13.; run; > >Two of the ACCTKEY values have 13 digits and when formatted (BEST12.) become >identical. Consequently, PROC SUMMARY lumps them together and produces three >observations. When I run the same code without the FORMAT statement, >however, I get four observations, as expected. > >Lionel's code does not associate a format with ACCTKEY, and so should not >lump together observations. But perhaps there is some system option, or >version-specific or platform-specific behavior involved. > >Lionel: I suggest running my test code in your environment, with and without >the FORMAT statement. At the least, this might isolate the problem. > >I have never encountered this problem before. That may be due to my practice >of using character rather than numeric data for most ID and categorical >variables, with the exceptions being relatively *small* integers. > >So, my workaround would be something like: > > ACCTKEY = put(BRANCH_N,z7.) || > put(ACCOUNT_,z4.) || > put(LON_NUM ,z2.) ; > >On Thu, 31 May 2001 08:54:34 -0400, teedl <teedl@TDBANK.CA> wrote: > >>I have just come across a small interesting issue while performing some >>summarisation on data using Proc Summary and I was hoping some of you >>familiar with this process, or SAS Internal processes could provide me with >>some insight as to what is happening here. I am pasting the log below. >> >>This particular process reads a DB2 table consisting of all new load >>advances(cheques), selects the desired records based on the date parameter >>passed to it, and then generates a key value. Since a loan can have >multiple >>advances occur within the time period requested, and the desire is to have >a >>single record per loan, the records need to be 'summarized' based on the >>AcctKey. In originally setting this process up I had used the Proc Summary >>NWAY, Class process. However, in reviewing the results I noted that the >>number of records output was much lower than expected. While multiple >>advances do occur, it is not a common occurence. After thinking that the >>uniqueness of my AcctKey was the issue, I spent some time testing that. My >>AcctKey logic proved to be, at least in this case, correct. Lon_Num is >never >>higher than 40, Account_ is a 7 digit number, and Branch is a 4 digit >>number. The resulting Key is, in essence, BBBBAAAAAAANN. While trying to >>determine what was going on, I modified the routine slightly to sort the >>file prior to summarization, and the use the By process. Expecting to get >>the same results I was astonished to find that the By on the Summary >>actually produced the correct results! To make matters even more >>interesting, if I add a Format AcctKey 13. to the DATA step, the Summary >>Class process works fine! So then, it appears that Class is using some form >>of Format unkown to me, while By does not. Can anyone enlighten me on what >>is internally happening here? And is their a way, using Proc Summary >itself, >>that I can produce the same result for both the Class and By, without >>knowing in advance the format structure? >> >>In essence, I am a little surprised that I would get a different result >from >>what appears to be two identical processes. >> >> >> >>Thanks. >> >> >>Lionel >> >> >>PS to Paul Dorfman. This routine is a 'patch' to an existing process. As a >>result, I did not try to use any of your neat Hashing/Direct >>Addressing/Bitmapping stuff. I will, however, be using that stuff in some >of >>my newer projects. >> >>1 >>2 DATA NEWLOANS (KEEP=ACCTKEY ADVTP LON_NUM ACCOUNT_ BRANCH_N >>3 XDATE FROMDATE TODATE >>4 ); >>5 RETAIN FROMDATE TODATE DSTAMP ; >>6 SET LOANSDB2.V_NWLN_L (KEEP=BRANCH_N ACCOUNT_ LON_NUM >>7 LON_TYPE LON_SUBT TRANS_TI PROCEED_ >>8 ); >>9 IF NOT FROMDATE THEN DO ; >>10 INFILE PARMFILE ; >>11 INPUT >>12 %PARMFILE >>13 ; >>14 FROMDATE=INPUT(PUT(BSTAMP,8.),YYMMDD8.) ; >>15 TODATE =INPUT(PUT(DSTAMP,8.),YYMMDD8.) ; >>16 END ; >>17 XDATE =DATEPART(TRANS_TI) ; >>18 IF (FROMDATE LT XDATE LE TODATE) ; >>19 ACCTKEY =BRANCH_N*1000000000+ACCOUNT_*100+LON_NUM ; >>20 ACCTYPE =LON_TYPE*1000+LON_SUBT ; >>21 ADVTP =PROCEED_ ; >>22 >> >>NOTE: THE INFILE PARMFILE IS: >> DSNAME=FSDU.I.M.PARMFILE.VER2.G0035V00, >> UNIT=3390,VOLUME=INFC59,DISP=SHR,BLKSIZE=27960, >> LRECL=40,RECFM=FB >> >>NOTE: 1 RECORD WAS READ FROM THE INFILE PARMFILE. >>NOTE: THE DATA SET WORK.NEWLOANS HAS 136982 OBSERVATIONS AND 8 VARIABLES. >> >>23 PROC CONTENTS ; >>24 >> >>NOTE: THE PROCEDURE CONTENTS PRINTED PAGE 1. >> >>25 PROC SORT ; >>26 BY ACCTKEY ; >>27 >> >>NOTE: THE DATA SET WORK.NEWLOANS HAS 136982 OBSERVATIONS AND 8 VARIABLES. >> >>28 PROC SUMMARY ; >>29 BY ACCTKEY ; >>30 VAR ADVTP ; >>31 OUTPUT OUT=ADVTP00 SUM= ; >>32 >> >>NOTE: THE DATA SET WORK.ADVTP00 HAS 115176 OBSERVATIONS AND 4 VARIABLES. >> >>33 PROC SUMMARY DATA=NEWLOANS NWAY ; >>34 CLASS ACCTKEY ; >>35 VAR ADVTP ; >>36 OUTPUT OUT=ADVTP00 SUM= ; >>NOTE: THE DATA SET WORK.ADVTP00 HAS 18826 OBSERVATIONS AND 4 VARIABLES. >> >>NOTE: SAS INSTITUTE INC., SAS CAMPUS DRIVE, CARY, NC USA 27513-2414 >> >> >> CONTENTS PROCEDURE >> >>DATA SET NAME: WORK.NEWLOANS OBSERVATIONS: >>136982 >>MEMBER TYPE: DATA VARIABLES: 8 >>ENGINE: V609 INDEXES: 0 >>CREATED: 8:05 THURSDAY, MAY 31, 2001 OBSERVATION LENGTH: >64 >>LAST MODIFIED: 8:05 THURSDAY, MAY 31, 2001 DELETED OBSERVATIONS: 0 >>PROTECTION: COMPRESSED: >NO >>DATA SET TYPE: SORTED: >NO >>LABEL: >> >> -----ENGINE/HOST DEPENDENT INFORMATION----- >> >> DATA SET PAGE SIZE: 6144 >> NUMBER OF DATA SET PAGES: 1443 >> FILE FORMAT: 607 >> FIRST DATA PAGE: 1 >> MAX OBS PER PAGE: 95 >> OBS IN FIRST DATA PAGE: 75 >> PHYSICAL NAME: SYS01151.T080551.RA000.UCLN23RT.R0423646 >> RELEASE CREATED: 6.090470 >> RELEASE LAST MODIFIED: 6.090470 >> CREATED BY: UCLN23RT >> LAST MODIFIED BY: UCLN23RT >> SUBEXTENTS: 2 >> TOTAL BLOCKS USED: 1443 >> >> -----ALPHABETIC LIST OF VARIABLES AND ATTRIBUTES----- >> >> # VARIABLE TYPE LEN POS FORMAT INFORMAT LABEL >> >------------------------------------------------------------------------ >> 4 ACCOUNT_ NUM 8 24 9. 9. >ACCOUNT_NUM >> 7 ACCTKEY NUM 8 48 >> 8 ADVTP NUM 8 56 >> 3 BRANCH_N NUM 8 16 7. 7. BRANCH_NUM >> 1 FROMDATE NUM 8 0 >> 5 LON_NUM NUM 8 32 7. 7. LON_NUM >> 2 TODATE NUM 8 8 >> 6 XDATE NUM 8 40 > > >Tom Soeder >Clinformatics, Inc. >txs32954@gsk.com >Sanders 1464B >919-483-6936


Back to: Top of message | Previous page | Main SAS-L page