Date: Wed, 29 Oct 1997 09:30:41 -0800
Reply-To: "kmself@ix.netcom.com" <kmself@ix.netcom.com>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "Karsten M. Self" <kmself@IX.NETCOM.COM>
Organization: Self Analysis
Subject: Re: SAS SQL experiences wanted
Content-Type: text/plain; charset="us-ascii"
Bram:
I'd be very interested in hearing followup reports on your experience with
this system.
70 m x 100 byte recl datasets puts you in the 7GB range for your input
tables. This is most decidedly large file processing. It may indicate
you'd want to look at alternatives to traditional SQL joins. I'll get to
this later after covering some of the basics.
You will need adequate disk and memory for your processes. Actual
requirements will vary depending on the type of processing you do,
benchmarking will be essential. If there is any way to begin modeling
your anticipated processing now, I would encourage you do it with an eye
to the following:
- Collect FULLSTIMER statistics from a similar system (eg: an HPUX box,
other Unix if available, non-Unix platform as third preference), with some
representative samples of your data, say incrementing by factors of 10.
So you might have a primary table of 7000, 70 000, 700 000 records. As
you are subsetting your primary table, look for a stabilization of the
maximum number of returned values as input table size increases. This
would be a good thing.
- I frequently write 'sniffer' Unix shell scripts to watch job
performance. These usually run something like the following, which
displays stats to screen and writes to logfile simultaneously. Replace
text in <brackets> with appropriate values. You can replace the 'while
true' with a test on existence of the process to have the script terminate
when the job completes.
#!/bin/ksh
while true
do
# timestamp
date
# Get job size/runtime -- delete grep process in ps output
ps -af | grep <process id> | grep -v grep
# Get free space on filesystem
df <SASWORK filesystem(s)>
# Get largest working files
du -a <SASWORK directory/ies> | sort -nr | head
# what file(s) are we working on now?
ls -lt <SASWORK directory> | head -<some number 2 - 10>
# blank line to delimit
echo
sleep <usu 60 - 600 seconds>
cls
done | tee logfile
- Look at resource utilization patterns of memory, scratch storage, and
scaling with size. Utilization which increases non-linearly with problem
size will have to be addressed.
- Look also at CPU and i/o requirements. These will impact performance
rather than sizing, but may drive considerations such as job scheduling,
addition of CPU, and load distribution of disk I/O among multiple
controllers.
- Ballpark recommendations: you will want scratchspace available to
satisfy at least 2.5 x the largest temporary working data set you
anticipate creating during typical processing, if you plan on sorting this
data. I would think that 2-4G scratch would be a minimum requirement
unless you will be subsetting your data considerably. More might be
required, and potentially enough to sort your 70m records, which works out
to 7GB x 2.5 plus any additional requirements. SQL will utilize memory,
additional solutions become viable as well. Look to your memory
utilization and swap usage patterns for specific tuning. Ranges from
512MB to several GB are becoming more common, though still somewhat
expensive. Anything less than 254 MB would be decidedly anemic.
On the Unix/systems side, make sure you can utilize the resources
available.
- Check your SAS MEMSIZE option, you'll want it large (or 0 for
unlimited) for your job.
- Check the system 'ulimit' values for memory and data (file) size.
- Depending on your version of HPUX, I believe you will want or require
partitioned libnames, (PARTSIZE option under SAS), particularly for your
SAS work library.
- My recommendation is to use the largest single (rather than multiple)
filesystem possible. The reason for this is that additional partitioned
files are placed on subsequent directories, making space
management/distribution a real headache.
- On the hardware side, several disks striped together will tend to
increase total throughput, especially where multiple jobs, or simultaneous
I/O from scratch (typical under SAS) are required.
- Your SASWORK and other large storage disks should be tuned for large
file performance with large blocks and small inode tables. See a Unix
systems performance tuning text for details (O'Reilly publishes a good
one), or talk to your sysadmin/systems vendor.
- Scratch should be a high-speed (but expendable -- you'll be burning
through it) device. Permanent storage can be lower speed (fewer accesses)
but more dependable. A backup scratch device or rapid replacement
commitment from your vendor is a good policy.
- Look into job scheduling. You probably don't want to have multiple
queries running against this system at the same time. I'm playing with
some 'at' and 'batch' submission shell scripts right now myself. These
would be good to clarify/architect now, rather than later.
- Clarify your data backup needs early. Scratch does not require backup,
permanent archives should have some form of backup-on-demand or
backup-on-request policy, preferably incorporated with your data
acquisition process.
General optimization tips for SQL are available from a number of SUGI
papers by Paul Kent. Reprints are available through the SAS Institute,
contact your service rep or the publications department directly for
information. The one to get is "SQL Joins, the long and the short of it",
which describes the undocumented _METHOD option to PROC SQL. This option
provides coded information on what methods are used by SQL to satisfy your
query.
The paper also discusses optimization methods for queries. In a nutshell,
large tables should be sorted, and small tables sorted or indexed, during
joins.
Your question and application are classic data warehousing examples. If
you haven't already, familiarize yourself with the Star Schema (which is
essentially the data model you've proposed). Several discussions of
implementing this in SAS have been discussed in SUGI and regional user
group conferences.
A method you may want to explore would be substituting the use of SAS
formats for the secondary joins you have indicated. This requires
sufficient memory (or memory plus swap, at greatly reduced performance) to
store your secondary tables ("drugs", possibly a "patients" table?) as SAS
formats. This enables you to use shorter identification keys in your
primary "drugs" table, potentially character hexadecimal representation as
described in Rick Aster's "Professional SAS Programming Secrets" or my
SUGI 22 paper "Unix Large File Processing Secrets" (copy available in
Adobe Acrobat format on request). A two byte character field (16 bits) is
capable of representing 65 536 distinct values, and might significantly
reduce your storage and I/O requirements.
Finally, if it is possible to determine beforehand the values associated
with various subsetting WHERE clauses, you will generally find better
efficiencies with DATA set options WHERE= statements than with SQL WHERE
clauses.
If WHERE restrictions typically evaluate to value lists of less than
32,767 bytes, total (the maximum size of the macrovariable table space --
though this might be extendable, check on it), it might be possible to
create macrovariable representations of these lists from an initial query
and apply them to the main query. You will get the most mileage in
subsetting the main table by the most restrictive WHERE selection.
Example, assuming a format '$mcfmt.' containing drug-translation:
/* Get WHERE criteria */
proc sql noprint;
select insur_id into :insid /* optional: seperated by */
from prescrip
where put(med_code, $mcfmt.) = 'atc2'
;
quit;
proc sql;
create table person as
select insur_id, med_code
from prescrip(where= (insur_id in(&insid)))
;
Miscellaneous comments:
- SQL, thankfully, rarely performs a full Cartesian join. Instead it
tends to create smaller Cartesian joins, apply restrictions, then
continue. This can still create large intermediate working sets.
- There are a couple of bugs associated with SQL and partitioned
libraries. Contact the SAS Institute for the latest information. You may
have to push Tech Support before they understand the issue, but there was
a bug, and a fix, in SAS 6.12, reported/resolved at SUGI time (March
1997), involving SQL temporary files.
And miscellaneous questions:
- Is your permanent data archive on disk or tape?
- What is your typical working set size?
- Is SAS your archive data storage format or are you using a DBMS?
Which?
- What is your current platform?
Best of luck.
On Wednesday, October 22, 1997 10:39 PM, Bram van den Berg
[SMTP:akberg@ibm.net] wrote:
> Subject: SAS SQL experiences wanted
>
> Hello SASusers,
>
> I have planned migrating to a HP Unix machine and I consider doing
> SQL queries and joins on large files (about 70 million obs) on that
> machine; with SAS of course. I would be grateful for any experience
> or useful tips, for instance: memory size, system parameters,
> size of temporary files, construction of query, tools to measure
> performance, etc.
>
> For instance the next, typical, example; can it be run, how will
> it behave, are there memory constraints?
>
> Dataset "prescrip" (about 70 million obs, 29 vars, Obs.Length 100)
> contains data about medicines delivered to insured people.
> Essential vars: "insur_id" and "med_code".
> On an average, each person has 16 prescriptions.
>
> Dataset "drugs" (about 55000 obs, 58 vars, Obs.Length 434) contains
> general data about medicines. Essential vars: "atc_code" and "med_code".
> med_code is a unique variable and each atc_code describes a group of
> med_codes.
>
> The question:
> find ALL medicines (and the persons who used them) of ALL persons
> who used SOME medicine belonging to the group ATC2.
> That is, we also want to see medicines, used by these people,
> that belong to other ATC-groups.
>
> Approach:
> 1. determine all med_code's belonging to group ATC2 (from dataset DRUGS)
> 2. determine all insur_id's having a med_code that's in the set found in
> 1.
> 3. select all obs with the insur_id's found in 2.
> In the program this is coded in the order 3,2,1.
>
> The sample SAS program:
>
> data prescrip;
> /* PRESCRIPTION file.
> To the insured, with identification "insur_id", has been prescribed
> the medicine with identification "med_code".
> */
> INPUT insur_id med_code $;
> cards;
> 123 medici04
> 321 medici02
> 789 medici03
> 111 medici04
> 123 medici05
> 222 medici04
> 123 medici03
> 789 medici08
> 333 medici04
> 321 medici10
> 123 medici11
> * ;
>
> data drugs;
> /* DRUGS file.
> Contains data about each medicine. A unique identification, med_code,
> the ATC-group it belongs to, atc_code, the name and many other
> attributes.
> */
> INPUT atc_code $ med_code $ med_name &$22.;
> cards;
> atc1 medici01 Drug 1 from ATCgroup 1
> atc1 medici02 Drug 2 from ATCgroup 1
> atc1 medici03 Drug 3 from ATCgroup 1
> atc2 medici04 Drug 1 from ATCgroup 2
> atc2 medici05 Drug 2 from ATCgroup 2
> atc2 medici06 Drug 3 from ATCgroup 2
> atc2 medici07 Drug 4 from ATCgroup 2
> atc2 medici08 Drug 5 from ATCgroup 2
> atc3 medici09 Drug 1 from ATCgroup 3
> atc3 medici10 Drug 2 from ATCgroup 3
> atc3 medici11 Drug 3 from ATCgroup 3
> atc3 medici12 Drug 4 from ATCgroup 3
> atc3 medici13 Drug 5 from ATCgroup 3
> atc3 medici14 Drug 6 from ATCgroup 3
> * ;
>
> proc sql;
> create index insur_id on prescrip(insur_id);
> create index med_code on prescrip(med_code);
> create index med_code on drugs(med_code);
> title1 "Listing of persons and medicines used by them";
> title2 "who used at least one drug from group ATC2.";
>
> create table person as
> select insur_id, med_code /* select all obs with insur_id's found */
> from prescrip
> where insur_id in
> (select insur_id /* what insur_id's use a medicine, found with
> next subquery */
> from prescrip
> where med_code in
> (select med_code /* what med_code's belong to group ATC2? */
> from drugs
> where atc_code = 'atc2'))
> order by insur_id, med_code;
> quit;
>
> proc print;
> run;
>
> /* OUTPUT
>
> Listing of persons and medicines used by them
> who used at least one drug from group ATC2.
>
> OBS INSUR_ID MED_CODE
>
> 1 111 medici04
> 2 123 medici03
> 3 123 medici04
> 4 123 medici05
> 5 123 medici11
> 6 222 medici04
> 7 333 medici04
> 8 789 medici03
> 9 789 medici08
>
> */
>
> Thanks in advance
> Bram van den Berg at Ziekenfondsraad, Netherlands
Karsten M. Self (kmself@ix.netcom.com)
What part of "gestalt" don't you understand?
(Welchen Teil von <<Gestalt>> verstehn Sie nicht?)