Date: Fri, 4 Oct 2002 18:20:24 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: SAS is slow? (123 mb/sec on pc???)
Content-Type: text/plain; charset="iso-8859-1"
This discussion seems to divide naturally into two threads:
1) compilers vs. database programming environments:
Anyone can understand that an organization would prefer to pay once to
buy or build a program. Buying and maintaining a programming environment
such as SAS, PLSQL/Oracle, MS Office/SQLServer, Focus, or the like takes
substantial resources and time to learn. Many organizations are maintaining
several programming environments concurrently.
To understand why organizations buy and maintain database programming
environments, one only has to consider the useful life of a single compiled
application program. I can think of a few that have remained unchanged and
in place for a number of years. Typically in those cases, users have to
adapt to the programs. In most cases application programs undergo almost
continual modification and updating as requirements, data properties,
operating systems, and hardware change. Most organizations opt for
programming environments that will support application development or
database programming or both. In fact, it seems likely that we will see even
fewer instances of classic compiled programs as server pages, Java, VB/NET,
database access engines, and other late-binding methods further blur the
distinction between compilers and interpreters. The trend toward running
programs under database programming environments rather than as executables
on operating systems, plus the fact that neither program compilation nor
interpretation (except for XML parsing) takes much CPU or clock time, makes
compilation vs. database programming environment a dead issue. Data access
middleware and database objects have become necessary extensions of
traditional operating systems;
2) relative performance of programming environments:
Which database programming systems an organization buys and maintains, as
I see it, has become the central question today. For those of us who work
with large and complex collections of data, the SAS programming environment
offers a full procedural programming language, a complete
implementation of the primary query language, SQL, and a host of hooks and
handles into files, database systems, and other data sources. SAS spans all
major computing platforms and operating systems, and it offers advanced
statistical and mathematical procedures.
During the last 24 hours I've posted a couple of examples at different ends
of the database programming spectrum. The response to 'Compare two datasets
without re-sorting?' explains how the SAS SQL compiler combines dynamic
indexing and scanning behind the scenes to make short work of a common task
involving very large sets of data. None of the RDBMSs makes it that simple
and easy. The response to 'Better way to code this???' demonstrates how to
compile a format from data in an ill-structured and highly repetitive
program file. SAS not only performs this task in a fraction of a second,
it also reports errors in data and provides views of results. (It also shows
another example of the DoW loop construct.) In the vast regions of computing
space outside OLTP databases and PC office environments, SAS rules.
To answer your last question, SAS Mecca isn't anywhere near Hardware
Valhalla. I have a faster machine at home than at the office. That forces us
to program smarter. Don't make us brag about it ;)
Sigurd the SQLizer
From: Mauro Morandin [mailto:my_family_name@LIBERO.IT]
Sent: Thursday, October 03, 2002 8:56 PM
Subject: SAS is slow? (123 mb/sec on pc???)
thanks for the many, many ideas and thoughts.
The topic is really interesting and far too broad to be covered
in an email or two. I surely feel the need to quantify how fast
SAS is compared to other languages, but I want to do it on real
problems. So, I really don't understand how you could be so enthusiastic
about Dorfman running a useless program and showing everyone
that SAS can read a 100MB file in less than a second. Everyone was like,
"Hurray, SAS is really fast" ... fast at what? Reading a file into
its input buffer and throwing it away. So what now?
I already hear you: "But that's what you told us to do?"
But does it make sense just to read it? To measure how fast the interpreter
is, YOU HAVE TO USE THE INTERPRETER (AS MUCH AS YOU CAN, WITH
DIFFERENT INSTRUCTIONS AND LOOPS). This makes sense to me. And then do
the same thing with other languages. This not only makes sure that you
USE the interpreter with a lot of different instructions, but
also makes sure that you don't run into an I/O bottleneck, which would
of course skew your results, because you don't want to measure your
hard disk/memory speed but the speed of your SAS interpreter.
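That benchmarking rule — exercise the interpreter, not the disk — can be sketched with a small timing harness. This is my own minimal illustration in Python (SAS code can't be run here); the loop body and iteration count are arbitrary choices, not from the thread:

```python
import time

def bench(fn, *args, repeat=3):
    """Run fn several times and keep the best wall-clock time."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def cpu_bound(n):
    """Pure interpreter work: arithmetic and branching, no I/O at all."""
    total = 0
    for i in range(n):
        if i % 3 == 0:
            total += i * i
        else:
            total -= i
    return total

elapsed = bench(cpu_bound, 1_000_000)
print(f"best of 3: {elapsed:.3f} s")
```

The point is that the timed region touches only CPU and memory, so two interpreters running the same harness are compared on interpretation speed, not on the speed of the disk subsystem underneath them.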
To all the people who say, "I don't understand why someone should spend
their time writing a program that runs a few seconds faster than that,"
I answer: "Because this is just a 100 Mbyte test program. You see what
happens if you have a 100 Gbyte DW? Those 2 seconds could become 24
hours. And if you're 24 hours late with your reports, they could be useless."
That said, let me explain why I sometimes feel disappointed with the
performance of SAS. I'm now a freelance SAS consultant; some years ago I
was a SAS employee, for several years. I don't like people not
being honest about SAS. And saying that SAS is a compiled language is
not honest, because it makes other people (mostly managers) believe that
a SAS program is as fast as a program written in C.
I have seen SAS "go really fast" with PROC SORT and PROC MEANS. Really
fast for me means hitting the I/O bandwidth limit, which can be around
50-100 Mbyte/s for a server PC/UNIX with 4 disks in RAID0. This is
enough for a lot of application domains, so I don't feel the need to
look for something to speed things up a bit. But SAS is not only PROCs.
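That I/O bandwidth ceiling is easy to sanity-check on any box. A minimal sketch in Python (my own illustration, not from the thread) that measures sequential read throughput of a file:

```python
import time

def read_throughput(path, block=8 * 1024 * 1024):
    """Sequentially read a file and return MB/s. Note: the OS page
    cache will inflate the number on warm runs; use a cold, large
    file for a figure comparable to raw disk bandwidth."""
    nbytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while chunk := f.read(block):
            nbytes += len(chunk)
    elapsed = time.perf_counter() - start
    return nbytes / (1024 * 1024) / elapsed
```

If a job's observed throughput sits well below what this returns for the same volume, the bottleneck is in the software, not the disks.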
The problems I have to deal with are mostly DW problems, like building
fact tables and dimensions with surrogate keys and a lot of computed
variables. The fact tables are big beasts and I find myself looking at
the performance monitor on AIX to see what SAS does. I look at the SAS
log .... hmmmm ... data step ..... I look at the monitor .... less than
10Mbyte/s .... then ... proc sql ... hmmmm ... 6 tables star schema join
.... hmmm .... monitor says .... 5-8Mbyte/s.
My figures on SAS performance on AIX RISC6000 S85 are:
PROC SORT (900 Mbyte) in 2:00 (2 minutes)
DATA STEP (just a SET statement) (900 Mbyte) in 0:20 (20 seconds)
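For scale, those two timings work out to the following throughput — plain arithmetic on the figures quoted above, shown as a Python snippet:

```python
size_mb = 900
sort_seconds = 2 * 60   # PROC SORT: 2:00
step_seconds = 20       # DATA step with a single SET statement: 0:20

print(f"PROC SORT: {size_mb / sort_seconds:.1f} MB/s")  # 7.5 MB/s
print(f"DATA step: {size_mb / step_seconds:.1f} MB/s")  # 45.0 MB/s
```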
These are good figures, but I can't build a DW with PROC SORTs alone.
I can't show you the code, but you can be pretty sure that I know all
the tricks for writing tight SAS code. Moreover, we have five SAS
programmers on the project who look into each other's code.
I love SAS, because it makes my life much easier. It is such a powerful
framework. But sometimes I feel the need to go faster than that, and I
don't like hearing people say that SAS is compiled.
The last thing I did, last Wednesday, was write a SAS program (a data
step) to split a 660 Mbyte XML file into pieces of 100,000 records each.
I can't show you the code, because I don't own the copyright (I'm just
the author), but I can surely rewrite the program in Python. That was the
first thought I had when I saw the disappointing 1.5-2 Mbyte/s of SAS
throughput (on AIX). I wrote a similar program in Python some months
ago, which did more than 5 Mbyte/s (on my laptop).
Anyway, I will surely send a copy of my Python program to SAS-L. By the
way: With Python I have the choice to rewrite part of the code in C if I
need more speed.
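For what it's worth, the streaming approach such a splitter needs can be sketched in Python. This is not Mauro's program — just a minimal illustration assuming the file is one repeated record element under a single root; the tag name and the output naming scheme are invented for the example:

```python
import xml.etree.ElementTree as ET

def split_xml(path, record_tag, records_per_file=100_000):
    """Stream a large XML file and write groups of <record_tag>
    elements to numbered files, never holding the whole tree in memory."""
    part, count, out = 0, 0, None
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag != record_tag:
            continue
        if count % records_per_file == 0:
            if out:
                out.write(b"</root>")
                out.close()
            part += 1
            out = open(f"part{part:04d}.xml", "wb")  # written in the CWD
            out.write(b"<root>")
        out.write(ET.tostring(elem))
        elem.clear()  # free the element we just wrote
        count += 1
    if out:
        out.write(b"</root>")
        out.close()
    return part, count
```

Because iterparse yields elements as their end tags arrive and each one is cleared after writing, memory stays roughly constant regardless of file size — the property that matters at 660 Mbyte.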
I suppose you also never saw a SAS project reduce its scope because
SAS + hardware + software requirements were not chosen appropriately. So
where are you living ... in HARDWARE VALHALLA? :-)
Red Hat Certified Engineer