Date: Thu, 13 Dec 2001 12:26:54 -0000
Reply-To: "Vyverman, Koen" <koen.vyverman@FID-INTL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Vyverman, Koen" <koen.vyverman@FID-INTL.COM>
Subject: Re: SASFILE efficiency?
LS,
My remarks on SASFILE performance, or the lack thereof, have
sparked the usual bunch of useful and interesting comments.
Thanks are due to -- in no particular order -- Kevin Viel,
Bill Viergever, Paul Dorfman, David Cassell, po' Puddin' Man,
and Soeren Hvidkjaer for their feedback. My original message
is included below the sig.
First, as a general comment, lack of RAM was never an issue.
I kept a close watch on the NT Performance Monitor, and even
with the full 100MB data set loaded in memory, the amount of
available RAM never dipped below 250MB. The system swap file
was never used. The index file is another 11MB, so that would
hardly impact available memory either.
Secondly, my timing measurements did not include the time
required to load/close the SASFILE. And even if they did,
the statements execute in a matter of, say, 10 seconds, so
that's surely negligible compared to the typical 1 hour
run-time of the reporting macro.
The general experience with using SASFILE seems to be that
its efficiency benefits are rather restricted to a certain
class of processing. Evidence given by Kevin shows that
running a PROC MEANS on a sizeable data set certainly benefits
from SASFILE. Paul's eloquent argumentation indicates the
same for direct-access with POINT=. There may be others,
but from what I've seen, subsetting with a WHERE-clause
on an indexed key-variable is not one of them.
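
For reference, a minimal sketch of the two access patterns that were
reported to benefit: a full sequential pass and direct access with
POINT=. The table WORK.BIG and its variables are hypothetical
stand-ins, not the data discussed in this thread, and the sketch makes
no claim about actual timings.

```sas
/* Hypothetical test table, standing in for a large data set */
data work.big;
  do id = 1 to 1000000;
    x = ranuni(1);
    output;
  end;
run;

sasfile work.big load;        /* read the whole file into memory once */

/* Full sequential pass: the kind of step reported to benefit */
proc means data=work.big mean;
  var x;
run;

/* Random direct access with POINT=: also reported to benefit */
data _null_;
  do i = 1 to 10000;
    p = ceil(ranuni(0) * n);  /* n is set from NOBS= at compile time */
    set work.big point=p nobs=n;
    s + x;                    /* accumulate so the reads are used */
  end;
  put 'Sum of sampled X: ' s;
  stop;                       /* required: POINT= never hits end of file */
run;

sasfile work.big close;       /* release the memory */
```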
On the question of whether the index is loaded into memory
along with the data set, opinions are divided. Whether it
is or not may be largely moot, as the expected efficiency
boost with WHERE processing fails to manifest itself. Time
permitting, I will attempt some more rigorous testing and
report back.
Finally, David suggested a re-think of my report process
flow, to see whether some efficiency gains might be achieved
by re-arranging things. So, to satisfy curiosity on one hand
and on the other perhaps solicit some useful strategies that
I may have overlooked, here's an outline of what I'm doing:
The large data set, let's call it PAIRS, has three variables:
TOKEN, NEXT_TOKEN, and PROBABILITY. The exercise is one of
simulation, in that I wish to build strings of tokens based
on the content of PAIRS. This works as follows: I pick a
random TOKEN-value to initialize a string. I then subset
PAIRS to this particular TOKEN-value, which gives me a small
data set containing the possible NEXT_TOKEN values, and their
relative probabilities. Proceeding, this small data set is
fed to Dale McLerran's %RANSAMP macro, which, using the
PROBABILITY variable as the statistical weight, produces
a random sample of size 1. The NEXT_TOKEN becomes the next
token in the output string, and the procedure repeats after
replacing TOKEN by NEXT_TOKEN. This goes on and on in a macro
loop, until either no matching records are found in PAIRS
(i.e. the process stumbles upon a value of NEXT_TOKEN which
does not appear as a TOKEN) or until a predefined maximal
number of tokens has been generated in the output string.
Keeping this structure in mind, the only improvement that
readily presents itself would consist of taking the actual
processing that happens in %RANSAMP out of there, and
including it in the data step where I subset PAIRS on the given
TOKEN-value. This would eliminate the I/O associated with
creating the small TOKEN / NEXT_TOKEN lookup data set, and
I could pass the randomly selected new value on as a macro
variable. Come to think of it, I'll just go ahead and do
that :-)
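
A minimal sketch of how such a combined step might look. This is not
a port of %RANSAMP: it swaps in a plain single-pass weighted pick
(keep the current record with probability weight / running total of
weights), which selects each record with probability proportional to
PROBABILITY. The toy PAIRS values, the macro name %WALK and its
parameters START= and MAXLEN= are all made up for illustration, and
the sketch assumes token values contain no blanks, quotes or macro
triggers.

```sas
/* Toy PAIRS table; the values are invented for illustration only */
data pairs(index=(token));
  length token next_token $8;
  input token $ next_token $ probability;
  datalines;
a b 0.7
a c 0.3
b c 1.0
c a 0.5
c d 0.5
;
run;

%macro walk(start=, maxlen=10);   /* hypothetical macro and parameters */
  %local token next_token string n;
  %let token  = &start;
  %let string = &start;
  %let n      = 1;
  %do %while (&n < &maxlen and %length(&token) > 0);
    %let next_token = ;           /* stays empty if no record matches */
    data _null_;
      length chosen $8;
      retain chosen;
      set pairs end=last;
      where token = "&token";     /* same indexed subset as before */
      total + probability;        /* running sum of weights, auto-retained */
      /* keep the current record with probability weight / running total */
      if ranuni(0) * total < probability then chosen = next_token;
      if last then call symput('next_token', trim(chosen));
    run;
    %if %length(&next_token) > 0 %then %do;
      %let string = &string &next_token;
      %let n = %eval(&n + 1);
    %end;
    %let token = &next_token;     /* empty value ends the loop */
  %end;
  %put Generated string: &string;
%mend walk;

%walk(start=a, maxlen=10)
```

With the toy data the walk ends either when it reaches d, which never
appears as a TOKEN, or after MAXLEN tokens. No intermediate lookup
data set is written; the only thing passed between iterations is the
macro variable.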
Thanks again for your time and thoughts,
Koen.
---------------------------------
Koen Vyverman
Database Marketing Manager
Fidelity Investments - Luxembourg
---------------------------------
> -----Original Message-----
> From: Vyverman, Koen [mailto:koen.vyverman@FID-INTL.COM]
> Sent: Wednesday, December 12, 2001 15:00
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: SASFILE efficiency?
>
>
> LS,
>
> I would be interested to learn whether anyone has adopted the
> SASFILE statement and noted a significant reduction in program
> execution time ...
>
> As it is, I have a 100MB indexed SAS data set here, and
> a reporting macro crunching its way through it by means of
> data steps with subsetting WHERE statements.
>
> Encouraged by what I read about SASFILE, I decided to try
> the following:
> sasfile dataset load;
> %report(...)
> sasfile dataset close;
>
> And see what happens: nothing much. In fact, whereas my %report
> used to take about an hour to run, with the SASFILE statements
> it takes on the average 25% _longer_!
>
> My set-up here is SAS8.2 on WinNT4.0 (SP6), ultra-wide SCSI
> hard disk with lots of space, 512MB of RAM. Using the
> performance monitor, I can see that upon loading the dataset into
> memory, the expected amount of RAM is being eaten away, so
> that part at least works as advertised.
>
> Would it be unreasonable to suspect that the SAS index file
> is actually _not_ being memorized along with the data set,
> thereby still necessitating physical disk-reads of said index,
> as opposed to the supposedly faster memory access?
>
> But even then, I fail to comprehend why the process would
> overall take longer to run, unless my box here uses some
> sort of frighteningly slow RAM ...
>
> Any input/feedback appreciated,
> Koen.
>
> ---------------------------------
> Koen Vyverman
> Database Marketing Manager
> Fidelity Investments - Luxembourg
> ---------------------------------
>