LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (October 1997, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 29 Oct 1997 09:30:41 -0800
Reply-To:     "kmself@ix.netcom.com" <kmself@ix.netcom.com>
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:         "Karsten M. Self" <kmself@IX.NETCOM.COM>
Organization: Self Analysis
Subject:      Re: SAS SQL experiences wanted
Comments: To: "akberg@ibm.net" <akberg@ibm.net>
Content-Type: text/plain; charset="us-ascii"

Bram:

I'd be very interested in hearing followup reports on your experience with this system.

70 m x 100 byte recl datasets puts you in the 7GB range for your input tables. This is most decidedly large file processing. It may indicate you'd want to look at alternatives to traditional SQL joins. I'll get to this later after covering some of the basics.

You will need adequate disk and memory for your processes. Actual requirements will vary depending on the type of processing you do, benchmarking will be essential. If there is any way to begin modeling your anticipated processing now, I would encourage you do it with an eye to the following:

- Collect FULLSTIMER statistics from a similar system (eg: an HPUX box, other Unix if available, non-Unix platform as third preference), with some representative samples of your data, say incrementing by factors of 10. So you might have a primary table of 7000, 70 000, 700 000 records. As you are subsetting your primary table, look for a stabilization of the maximum number of returned values as input table size increases. This would be a good thing.

- I frequently write 'sniffer' Unix shell scripts to watch job performance. These usually run something like the following, which displays stats to screen and writes to logfile simultaneously. Replace text in <brackets> with appropriate values. You can replace the 'while true' with a test on existence of the process to have the script terminate when the job completes.

#!/bin/ksh while true do # timestamp date # Get job size/runtime -- delete grep process in ps output ps -af | grep <process id> | grep -v grep # Get free space on filesystem df <SASWORK filesystem(s)> # Get largest working files du -a <SASWORK directory/ies> | sort -nr | head # what file(s) are we working on now? ls -lt <SASWORK directory> | head -<some number 2 - 10> # blank line to delimit echo sleep <usu 60 - 600 seconds> cls done | tee logfile

- Look at resource utilization patterns of memory, scratch storage, and scaling with size. Utilization which increases non-linearly with problem size will have to be addressed.

- Look also at CPU and i/o requirements. These will impact performance rather than sizing, but may drive considerations such as job scheduling, addition of CPU, and load distribution of disk I/O among multiple controllers.

- Ballpark recommendations: you will want scratchspace available to satisfy at least 2.5 x the largest temporary working data set you anticipate creating during typical processing, if you plan on sorting this data. I would think that 2-4G scratch would be a minimum requirement unless you will be subsetting your data considerably. More might be required, and potentially enough to sort your 70m records, which works out to 7GB x 2.5 plus any additional requirements. SQL will utilize memory, additional solutions become viable as well. Look to your memory utilization and swap usage patterns for specific tuning. Ranges from 512MB to several GB are becoming more common, though still somewhat expensive. Anything less than 254 MB would be decidedly anemic.

On the Unix/systems side, make sure you can utilize the resources available.

- Check your SAS MEMSIZE option, you'll want it large (or 0 for unlimited) for your job.

- Check the system 'ulimit' values for memory and data (file) size.

- Depending on your version of HPUX, I believe you will want or require partitioned libnames, (PARTSIZE option under SAS), particularly for your SAS work library.

- My recommendation is to use the largest single (rather than multiple) filesystem possible. The reason for this is that additional partitioned files are placed on subsequent directories, making space management/distribution a real headache.

- On the hardware side, several disks striped together will tend to increase total throughput, especially where multiple jobs, or simultaneous I/O from scratch (typical under SAS) are required.

- Your SASWORK and other large storage disks should be tuned for large file performance with large blocks and small inode tables. See a Unix systems performance tuning text for details (O'Reilly publishes a good one), or talk to your sysadmin/systems vendor.

- Scratch should be a high-speed (but expendable -- you'll be burning through it) device. Permanent storage can be lower speed (fewer accesses) but more dependable. A backup scratch device or rapid replacement commitment from your vendor is a good policy.

- Look into job scheduling. You probably don't want to have multiple queries running against this system at the same time. I'm playing with some 'at' and 'batch' submission shell scripts right now myself. These would be good to clarify/architect now, rather than later.

- Clarify your data backup needs early. Scratch does not require backup, permanent archives should have some form of backup-on-demand or backup-on-request policy, preferably incorporated with your data acquisition process.

General optimization tips for SQL are available from a number of SUGI papers by Paul Kent. Reprints are available through the SAS Institute, contact your service rep or the publications department directly for information. The one to get is "SQL Joins, the long and the short of it", which describes the undocumented _METHOD option to PROC SQL. This option provides coded information on what methods are used by SQL to satisfy your query.

The paper also discusses optimization methods for queries. In a nutshell, large tables should be sorted, and small tables sorted or indexed, during joins.

Your question and application are classic data warehousing examples. If you haven't already, familiarize yourself with the Star Schema (which is essentially the data model you've proposed). Several discussions of implementing this in SAS have been discussed in SUGI and regional user group conferences.

A method you may want to explore would be substituting the use of SAS formats for the secondary joins you have indicated. This requires sufficient memory (or memory plus swap, at greatly reduced performance) to store your secondary tables ("drugs", possibly a "patients" table?) as SAS formats. This enables you to use shorter identification keys in your primary "drugs" table, potentially character hexadecimal representation as described in Rick Aster's "Professional SAS Programming Secrets" or my SUGI 22 paper "Unix Large File Processing Secrets" (copy available in Adobe Acrobat format on request). A two byte character field (16 bits) is capable of representing 65 536 distinct values, and might significantly reduce your storage and I/O requirements.

Finally, if it is possible to determine beforehand the values associated with various subsetting WHERE clauses, you will generally find better efficiencies with DATA set options WHERE= statements than with SQL WHERE clauses.

If WHERE restrictions typically evaluate to value lists of less than 32,767 bytes, total (the maximum size of the macrovariable table space -- though this might be extendable, check on it), it might be possible to create macrovariable representations of these lists from an initial query and apply them to the main query. You will get the most mileage in subsetting the main table by the most restrictive WHERE selection.

Example, assuming a format '$mcfmt.' containing drug-translation:

/* Get WHERE criteria */ proc sql noprint; select insur_id into :insid /* optional: seperated by */ from prescrip where put(med_code, $mcfmt.) = 'atc2' ; quit;

proc sql; create table person as select insur_id, med_code from prescrip(where= (insur_id in(&insid))) ;

Miscellaneous comments: - SQL, thankfully, rarely performs a full Cartesian join. Instead it tends to create smaller Cartesian joins, apply restrictions, then continue. This can still create large intermediate working sets. - There are a couple of bugs associated with SQL and partitioned libraries. Contact the SAS Institute for the latest information. You may have to push Tech Support before they understand the issue, but there was a bug, and a fix, in SAS 6.12, reported/resolved at SUGI time (March 1997), involving SQL temporary files.

And miscellaneous questions: - Is your permanent data archive on disk or tape? - What is your typical working set size? - Is SAS your archive data storage format or are you using a DBMS? Which? - What is your current platform?

Best of luck.

On Wednesday, October 22, 1997 10:39 PM, Bram van den Berg [SMTP:akberg@ibm.net] wrote: > Subject: SAS SQL experiences wanted > > Hello SASusers, > > I have planned migrating to a HP Unix machine and I consider doing > SQL queries and joins on large files (about 70 million obs) on that > machine; with SAS of course. I would be grateful for any experience > or useful tips, for instance: memory size, system parameters, > size of temporary files, construction of query, tools to measure > performance, etc. > > For instance the next, typical, example; can it be run, how will > it behave, are there memory constraints? > > Dataset "prescrip" (about 70 million obs, 29 vars, Obs.Length 100) > contains data about medicines delivered to insured people. > Essential vars: "insur_id" and "med_code". > On an average, each person has 16 prescriptions. > > Dataset "drugs" (about 55000 obs, 58 vars, Obs.Length 434) contains > general data about medicines. Essential vars: "atc_code" and "med_code". > med_code is a unique variable and each atc_code describes a group of > med_codes. > > The question: > find ALL medicines (and the persons who used them) of ALL persons > who used SOME medicine belonging to the group ATC2. > That is, we also want to see medicines, used by these people, > that belong to other ATC-groups. > > Approach: > 1. determine all med_code's belonging to group ATC2 (from dataset DRUGS) > 2. determine all insur_id's having a med_code that's in the set found in > 1. > 3. select all obs with the insur_id's found in 2. > In the program this is coded in the order 3,2,1. > > The sample SAS program: > > data prescrip; > /* PRESCRIPTION file. > To the insured, with identification "insur_id", has been prescribed > the medicine with identification "med_code". > */ > INPUT insur_id med_code $; > cards; > 123 medici04 > 321 medici02 > 789 medici03 > 111 medici04 > 123 medici05 > 222 medici04 > 123 medici03 > 789 medici08 > 333 medici04 > 321 medici10 > 123 medici11 > * ; > > data drugs; > /* DRUGS file. > Contains data about each medicine. A unique identification, med_code, > the ATC-group it belongs to, atc_code, the name and many other > attributes. > */ > INPUT atc_code $ med_code $ med_name &$22.; > cards; > atc1 medici01 Drug 1 from ATCgroup 1 > atc1 medici02 Drug 2 from ATCgroup 1 > atc1 medici03 Drug 3 from ATCgroup 1 > atc2 medici04 Drug 1 from ATCgroup 2 > atc2 medici05 Drug 2 from ATCgroup 2 > atc2 medici06 Drug 3 from ATCgroup 2 > atc2 medici07 Drug 4 from ATCgroup 2 > atc2 medici08 Drug 5 from ATCgroup 2 > atc3 medici09 Drug 1 from ATCgroup 3 > atc3 medici10 Drug 2 from ATCgroup 3 > atc3 medici11 Drug 3 from ATCgroup 3 > atc3 medici12 Drug 4 from ATCgroup 3 > atc3 medici13 Drug 5 from ATCgroup 3 > atc3 medici14 Drug 6 from ATCgroup 3 > * ; > > proc sql; > create index insur_id on prescrip(insur_id); > create index med_code on prescrip(med_code); > create index med_code on drugs(med_code); > title1 "Listing of persons and medicines used by them"; > title2 "who used at least one drug from group ATC2."; > > create table person as > select insur_id, med_code /* select all obs with insur_id's found */ > from prescrip > where insur_id in > (select insur_id /* what insur_id's use a medicine, found with > next subquery */ > from prescrip > where med_code in > (select med_code /* what med_code's belong to group ATC2? */ > from drugs > where atc_code = 'atc2')) > order by insur_id, med_code; > quit; > > proc print; > run; > > /* OUTPUT > > Listing of persons and medicines used by them > who used at least one drug from group ATC2. > > OBS INSUR_ID MED_CODE > > 1 111 medici04 > 2 123 medici03 > 3 123 medici04 > 4 123 medici05 > 5 123 medici11 > 6 222 medici04 > 7 333 medici04 > 8 789 medici03 > 9 789 medici08 > > */ > > Thanks in advance > Bram van den Berg at Ziekenfondsraad, Netherlands

Karsten M. Self (kmself@ix.netcom.com)

What part of "gestalt" don't you understand? (Welchen Teil von <<Gestalt>> verstehn Sie nicht?)


Back to: Top of message | Previous page | Main SAS-L page