LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (November 1998, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Tue, 10 Nov 1998 14:40:06 -0500
Reply-To:   WHITLOI1 <WHITLOI1@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:   WHITLOI1 <WHITLOI1@WESTAT.COM>
Subject:   Re: help with reading a large data set
Comments:   To: Tim Pi <timpi@FMRCO.COM>
Content-Type:   text/plain; charset=US-ASCII

Subject: help with reading a large data set Summary: Try a trailing @ for efficiency. Respondent: Ian Whitlock <whitloi1@westat.com>

Tim Pi <timpi@FMRCO.COM> asked a classic question about subsetting.

> I have a large dataset with 80K+ recoreds. However, I only need less > than 200 records everytime when I access it. I use account number to > control which records to be kept. It takes a long time to read all > 80K+ records and then keep only those I need.

After a number of suggestions about using an indexed SAS data set, Karsten Self <Karsten.Self@SCHWAB.COM> remarked that the file is a flat file.

I think the question is what does the subsetting code look like? I ran a simple test under Windows 95 on a slow machine using

/* generate test data */ filename q "c:\my documents\junk\bigtest.dat" ; data _null_ ; retain other "some kind of data or other" ; file q lrecl = 500 ; do acctno = 1 to 80000 ; put acctno z6. @ ; do i = 1 to 15 ; put other @ ; end ; put ; end ; run ;

data wanted ( drop = i ) ; do i = 1 to 200 ; acctno = put ( ranuni(8568341)*80000 , z6. ) ; output ; end ; run ;

/* do the subsetting */ proc sql noprint ; select distinct quote(acctno) into :wlist separated by ',' from wanted ; %put sqlobs = &sqlobs ; quit ;

proc format ; value $wanted &wlist = "W" other = "N" ; run ;

data w ; infile q lrecl = 500 truncover ; input acctno $char6. @ ; if put ( acctno , $wanted1. ) = "W" ; input remndr $char200. ; run ;

The final step reading the 80k records and choosing 200 took under 10 seconds. This does not seem "long" to me. Perhaps the missing piece of information is to read just what you need to make the decision whether to continue reading the record.

Timothy Berryhill <TWB2@PGE.COM> suggested going to IEBGENER or SYNC SORT to do the subsetting before SAS, but this requires MVS and gives up the convenience of simple specification suggested above. Is the original problem under MVS? It isn't mentioned in the problem statement, but became part of the title when Michael A. Raithel <MICHAEL.RAITHEL@RAITHM49.CUSTOMS.SPRINT.COM> gave an answer, but we all know he thinks all problems are both created and solved by MVS.

I would be interested in seeing an MVS comparison of times with the simple solution given above using SYNC SORT or IEBGENER. Are there any takers or does it take to long to write the code?

Ian Whitlock


Back to: Top of message | Previous page | Main SAS-L page