LISTSERV at the University of Georgia
Date:         Thu, 19 Apr 2001 09:10:56 -0700
Reply-To:     Dale McLerran <dmclerra@MY-DEJA.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Dale McLerran <dmclerra@MY-DEJA.COM>
Subject:      Re: converting ALL data from Character to Numeric
Comments: To: whitloi1@westat.com
Content-Type: text/plain

Ian,

I am not quite sure how to interpret the columns of timings in your results. You show only one column of cpu times. Is that for the Unix system? Are the other two columns total times on the PC and Unix systems? Regardless, I think you need to take another look at the times for proc means with the view, and also at the real time for the whole SAS session. When a view is used, the real time reported for proc means already includes the real time reported for the view, so the two should not be added together. This is not true for the cpu times, which are reported separately for the view and for the means procedure. On my system, proc means using the view that converts the character data to numeric does indeed take less real time than proc means using data saved as numeric. My real time for proc means converting the data on the fly is 2:12.08, not 4:23.09 (2:11.01 + 2:12.08 = 4:23.09, not the 4:33.09 shown in your reply, though this is a moot point).
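To make the bookkeeping concrete, here is a small sketch that redoes the arithmetic with the figures from my log (quoted further down). The variable names are made up for illustration, and the MMSS format is used only to print seconds as minutes:seconds.

```sas
/* Timings copied from the quoted log, converted to seconds. */
data _null_;
  view_real  = 2*60 + 11.01;  /* NOTE: View WORK.TESTVIEW.VIEW, real time */
  means_real = 2*60 + 12.08;  /* NOTE: PROCEDURE MEANS, real time         */
  view_cpu   = 58.31;         /* view cpu time                            */
  means_cpu  = 43.23;         /* proc means cpu time                      */

  /* Real time: the view's elapsed time is already inside the procedure's. */
  total_real = means_real;

  /* CPU time: reported separately, so the two figures are added. */
  total_cpu = view_cpu + means_cpu;

  /* What you get if the two real times are (wrongly) added. */
  double_counted = view_real + means_real;

  put total_real= mmss8.2 total_cpu= mmss8.2 double_counted= mmss8.2;
run;
```

The three values are 132.08, 101.54 and 263.09 seconds, that is, 2:12.08 real, 1:41.54 cpu, and the double-counted 4:23.09.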

Now, think about the argument here. Data can be handled much faster by the processor than they can be moved through disk I/O. In some situations (such as those I described), we can shave time off our processing by minimizing the disk I/O. With 1-byte character data we have one-third or one-eighth the disk I/O of length-3 or length-8 numeric data, at the cost of some additional cpu time. The limiting factor is really moving bytes from disk into the cpu.
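As a rough illustration of where the one-third and one-eighth come from, the sketch below computes the raw data volume for the test data in my program (100,000 observations by 1,000 variables) at each storage length. It assumes uncompressed data and ignores data set page and header overhead, so the real file sizes will differ somewhat.

```sas
/* Raw bytes read per pass over the data, by variable length. */
data _null_;
  nobs  = 100000;  /* observations in the test data */
  nvars = 1000;    /* variables per observation     */
  do len = 1, 3, 8;  /* 1-byte char, length-3 numeric, length-8 numeric */
    mbytes = nobs * nvars * len / 1e6;
    put len= mbytes= comma8.;
  end;
run;
```

That works out to roughly 100, 300 and 800 megabytes per pass.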

A closer look at your results suggests that in your environment (whether PC or Unix), the time for the means procedure is indeed less when using the numeric data of either length. Obviously, the statistics are system dependent. Perhaps you have SCSI disks whereas I have EIDE drives. My processor is a Pentium II 450, so a little dated there. I really do not understand how you could come up with 2:42.40 cpu time for the view approach. Regardless, the word to the wise is to evaluate the various alternatives on whatever system you have. Also, pay very close attention to how SAS reports real and cpu times when a view is used. The procedure (or datastep) in which the view is used absorbs the view's real time into its own report of real time, but does not absorb the view's cpu time into its report of cpu time.
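For anyone who wants to repeat the comparison on their own system, something along these lines should do. It is only a sketch: it assumes the data sets testview, testnum3 and testnum8 from my earlier program already exist, and the macro name timemeans is made up here. The FULLSTIMER system option asks SAS to write more detailed resource statistics to the log than the default real and cpu times.

```sas
/* Ask for detailed timing notes in the log. */
options fullstimer;

/* Hypothetical helper: run the same proc means step against one data set. */
%macro timemeans(ds);
  proc means data=&ds noprint;
    var _all_;
    output out=_null_ mean=mean1-mean1000;
  run;
%mend timemeans;

/* Compare the view against the two stored numeric versions. */
%timemeans(testview)
%timemeans(testnum3)
%timemeans(testnum8)
```

When reading the log for the testview run, take the real time from the PROCEDURE MEANS note alone, but add the view's cpu time to the procedure's cpu time.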

Dale

>Date: Thu, 19 Apr 2001 09:50:24 -0400
>From: Ian Whitlock <whitloi1@WESTAT.COM>
>Subject: Re: converting ALL data from Character to Numeric
>To: SAS-L@LISTSERV.UGA.EDU, Dale McLerran <dmclerra@MY-DEJA.COM>
>
>Dale,
>
>Your test looks pretty good, but remember to add the cost of view plus
>the cost of the means when calculating what it take to do the means.
>
>I ran two tests with your code (except that I made an output stat file).
>Here is a summary of the results.
>
>Means Method   Dale        (cpu)
>view           4:33.09     1:41.54
>length 3       2.19.19       46.61
>length 8       3.04.98       49.79
>
>Means Method   Ian Pc      Ian Hp-Unix   (cpu)
>view           3:07.67     5:25.68       2:42.40
>length 3         38.89       52.43         52.17
>length 8       1:14.85       50.84         50.33
>
>Ian Whitlock <whitloi1@westat.com>
>
>On Wed, 18 Apr 2001 14:31:47 -0700, Dale McLerran <dmclerra@MY-DEJA.COM>
>wrote:
>
>>Jim,
>>
>>I have not followed the entire thread here, so my apologies if this
>>is in any way redundant. However, I believe that there may be a
>>twist to this that might not have been considered previously. If
>>the database was sent as all character, that suggests to me that
>>there may be a lot variables which have integer assignments, and
>>small integer assignments at that. Often such data have values
>>which are stored as 1 byte character. Having seen my share of
>>questionnaire data with possible responses 1-5, such data can be
>>stored conveniently in 1 byte character. If you recode them to
>>be numeric, then you must store them in minimum 3 bytes on Windows
>>platforms.
>>
>>If this situation applies to your data, then you might consider
>>the following: keep the database as character data, but convert
>>the data from character to numeric on the fly as you use the
>>data. What are the advantages of this? First, it takes less
>>storage space for the 1 byte character variables than it does for
>>the numeric representation. Second, it will be faster to access
>>the more compact character data. You have to transfer fewer
>>bytes whenever you use the data. However, the conversion from
>>character to numeric will take a little time. It may take less
>>time to do the character to numeric conversion than it does to
>>load the extra bytes of data. If that is true, then you may get
>>a double bonus by leaving the data as character and using a
>>datastep view to perform the conversion from character to
>>numeric as the data are used.
>>
>>How does this work and what evidence do I have that this is indeed
>>true? Take a look at the following program. I first construct
>>some data with character representation for all the variables.
>>I then create a datastep view which is used to convert from
>>character to numeric as the data are accessed. Two subsequent
>>datasteps construct numeric data representations of the same
>>data, one in which the byte length for the numeric variables is
>>set to 3 and one in which the byte length defaults to 8. I then
>>run proc means using the datastep view and each of the numeric
>>variable datasets. Proc means requires the least time when using
>>the datastep view. When the data are represented as numeric with
>>length 3 bytes, proc means takes only a little more time than
>>working with the datastep view. However, if you allow the byte
>>length to default to 8, then proc means takes considerably longer
>>to run.
>>
>>Below is the program I ran, with real and cpu times noted for the
>>three proc mean steps. Real time for the datastep view was 2:12
>>(minutes:seconds), 2:19 for numeric length 3, and 3:05 for numeric
>>length 8. CPU time is greater for the datastep view (0:58 + 0:43=
>>1:41) than for either of the numeric data representations (0:46
>>and 0:49). For the datastep view, the conversion from character
>>to numeric required 58 seconds, while the computation of the
>>means required 43 seconds. Despite the extra cpu time required
>>when the data remain as character, the means procedure executes
>>faster using the datastep view simply because there is reduced
>>I/O. The operations required in converting from character to
>>numeric require less time than reading an additional 2 bytes and
>>certainly less than reading an additional 7 bytes. It should be
>>noted here that these data were stored on my hard drive. If your
>>data are coming across a network, then you should see even more
>>time savings if your data are saved as character. The I/O across
>>a network will be even more of a bottleneck than if you are
>>reading off your own hard drive.
>>
>>Of course, if your character data are not representations of small
>>integer values, all of these observations are totally irrelevant.
>>But if the character data are representations of small integer
>>values, then you save disk space and real time if you leave the
>>data as character and use the view. The view can be constructed
>>using techniques presented by others on this list (like Ian Whitlock),
>>changing the data step which performs the conversion only to add
>>the view option.
>>
>>
>>data test;
>>  retain test1-test1000 '1';
>>  do i=1 to 100000;
>>    output;
>>  end;
>>  drop i;
>>run;
>>
>>options nomprint;
>>
>>%macro testview;
>>data testview / view=testview;
>>  set test(rename=(%do i=1 %to 1000; test&i=ctest&i %end;));
>>  %do i=1 %to 1000;
>>    test&i=input(ctest&i,best12.);
>>  %end;
>>  drop ctest1-ctest1000;
>>run;
>>%mend;
>>
>>%testview
>>
>>data testnum3;
>>  length test1-test1000 3;
>>  set testview;
>>run;
>>
>>data testnum8;
>>  set testview;
>>run;
>>
>>proc means data=testview noprint;
>>  var _all_;
>>  output out=_null_ mean=mean1-mean1000;
>>run;
>>
>>NOTE: View WORK.TESTVIEW.VIEW used:
>>      real time           2:11.01
>>      cpu time            58.31 seconds
>>
>>NOTE: There were 100000 observations read from the dataset WORK.TEST.
>>NOTE: There were 100000 observations read from the dataset WORK.TESTVIEW.
>>NOTE: PROCEDURE MEANS used:
>>      real time           2:12.08
>>      cpu time            43.23 seconds
>>
>>
>>proc means data=testnum3 noprint;
>>  var _all_;
>>  output out=_null_ mean=mean1-mean1000;
>>run;
>>
>>NOTE: There were 100000 observations read from the dataset WORK.TESTNUM3.
>>NOTE: PROCEDURE MEANS used:
>>      real time           2:19.19
>>      cpu time            46.61 seconds
>>
>>
>>proc means data=testnum8 noprint;
>>  var _all_;
>>  output out=_null_ mean=mean1-mean1000;
>>run;
>>
>>NOTE: There were 100000 observations read from the dataset WORK.TESTNUM8.
>>NOTE: PROCEDURE MEANS used:
>>      real time           3:04.98
>>      cpu time            49.79 seconds
>>
>>
>>Dale
>>
>>>Date: Wed, 18 Apr 2001 09:33:06 -0400
>>>Reply-To: Jim Agnew <Agnew@HSC.VCU.EDU>
>>>From: Jim Agnew <Agnew@HSC.VCU.EDU>
>>>Subject: Re: converting ALL data from Character to Numeric
>>>To: SAS-L@LISTSERV.UGA.EDU
>>>
>>>I really should have been more specific about the platform.. sas version
>>>6, windows95... I'm still digesting all this wonderful
>>>info...
>>>
>>>Jim
>>>
>>>Jim Agnew wrote:
>>>>
>>>> Dear kind fellows....
>>>>
>>>> We have been handed this large database by a company to analyze,
>>>> however it's in character format only. We'd like to convert the
>>>> "numeric data" to numeric from character format where it's stored
>>>> presently.
>>>>
>>>> Other than by hand-coding each and every single variable, is there a
>>>> way to wholesale convert it all?
>>>>
>>>> Thanks, and we will summarize.
>>>>
>>>> Jim and Cren
>>
>>
>>---------------------------------------
>>Dale McLerran
>>Fred Hutchinson Cancer Research Center
>>mailto: dmclerra@fhcrc.org
>>Ph:  (206) 667-2926
>>Fax: (206) 667-5977
>>---------------------------------------
>>
>>------------------------------------------------------------
>>--== Sent via Deja.com ==--
>>http://www.deja.com/

---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@fhcrc.org
Ph:  (206) 667-2926
Fax: (206) 667-5977
---------------------------------------


