| Date: | Tue, 10 Nov 1998 14:40:06 -0500 |
| Reply-To: | WHITLOI1 <WHITLOI1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU> |
| From: | WHITLOI1 <WHITLOI1@WESTAT.COM> |
| Subject: | Re: help with reading a large data set |
|
| Content-Type: | text/plain; charset=US-ASCII |
Subject: help with reading a large data set
Summary: Try a trailing @ for efficiency.
Respondent: Ian Whitlock <whitloi1@westat.com>
Tim Pi <timpi@FMRCO.COM> asked a classic question about subsetting.
> I have a large dataset with 80K+ recoreds. However, I only need less
> than 200 records everytime when I access it. I use account number to
> control which records to be kept. It takes a long time to read all
> 80K+ records and then keep only those I need.
After a number of suggestions about using an indexed SAS data set,
Karsten Self <Karsten.Self@SCHWAB.COM> remarked that the file is a
flat file.
I think the question is what does the subsetting code look like? I
ran a simple test under Windows 95 on a slow machine using
/* generate test data */
filename q "c:\my documents\junk\bigtest.dat" ;
data _null_ ;
retain other "some kind of data or other" ;
file q lrecl = 500 ;
do acctno = 1 to 80000 ;
put acctno z6. @ ;
do i = 1 to 15 ;
put other @ ;
end ;
put ;
end ;
run ;
data wanted ( drop = i ) ;
do i = 1 to 200 ;
acctno = put ( ranuni(8568341)*80000 , z6. ) ;
output ;
end ;
run ;
/* do the subsetting */
proc sql noprint ;
select distinct quote(acctno) into :wlist separated by ','
from wanted ;
%put sqlobs = &sqlobs ;
quit ;
proc format ;
value $wanted
&wlist = "W"
other = "N"
;
run ;
data w ;
infile q lrecl = 500 truncover ;
input acctno $char6. @ ;
if put ( acctno , $wanted1. ) = "W" ;
input remndr $char200. ;
run ;
The final step reading the 80k records and choosing 200 took under 10
seconds. This does not seem "long" to me. Perhaps the missing piece
of information is to read just what you need to make the decision
whether to continue reading the record.
Timothy Berryhill <TWB2@PGE.COM> suggested going to IEBGENER or SYNC
SORT to do the subsetting before SAS, but this requires MVS and gives
up the convenience of simple specification suggested above. Is the
original problem under MVS? It isn't mentioned in the problem
statement, but became part of the title when Michael A. Raithel
<MICHAEL.RAITHEL@RAITHM49.CUSTOMS.SPRINT.COM> gave an answer, but we
all know he thinks all problems are both created and solved by MVS.
I would be interested in seeing an MVS comparison of times with the
simple solution given above using SYNC SORT or IEBGENER. Are there
any takers or does it take to long to write the code?
Ian Whitlock
|