Date: Wed, 22 Dec 1999 13:13:57 -0800
Reply-To: "William W. Viergever" <wwvierg@IBM.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "William W. Viergever" <wwvierg@IBM.NET>
Subject: Re: Question about improving efficiency in database management
In-Reply-To: <83qs6l$otp$1@nnrp1.deja.com>
Content-Type: text/plain; charset="us-ascii"
A couple of thoughts/questions:
1) what is it that you do want to accomplish with the "header" portion of these data (sounds like HCFA 1500 data or UB-92-like claims)?.
The answer to to this will help guide us in suggesting a method; e.g., if you just wanted to get a count of *all* encounters you could simply do your sort with a NODUPKEY option (back-off purists!! <g>) or something similar via SQL (Sig?, SASHole? <g>), if you want *unique* persons vs. actual encounters then that would require a different approach.
I'm also concerned that you may *think* you don't need anything from the detail segments (er, em: "line items"), but you never know (there is a Dilbertian moral somewhere here!) *and* it's actually quite easy to "collapse" the mutliple records into one per encounter, and still maintain all financial and utilization data that's burried in the detail (e.g., initialize and retain some "buckets" on the first.dot and then examine every detail line and figure out which bucket to increment or fill ... if substr(bil_code,1,2) = '20' then do; icu_chg + chgs; icu_day + units;) Just remember to code a residual or "other" category to make sure everything is mutually exclusive and exhaustive.
2) My last concern is the PROC MEANS taking hours ... what kind of hardware are you running? Here's a snipit from a LOG where I loaded up the 1998, general acute (GAC), Medicaid inpatient claims for California (including all ancillary detail lines). My dataset is over 5 million records big and my FREQ and MEANS flew in a couple of minutes each.
Me thinks something is *not right* in Denmark (didn't want to offend any Danes <g>).
NOTE: The data set SAS1.HIP98GAC has 5682728 observations and 49 variables.
NOTE: The DATA statement used 14 minutes 18.81 seconds.
412 *-- 2nd: do the proverbial edit stats;
413 proc freq data=&dsn1..&file1;
414 tables plancode clmtype adjust ac2 procind accmcode code_sel b_cnty vendcode
415 sex aidcat aidgrp new_mc1 new_mc2 grptyp1 grptyp2 pos tos / missing;
416 format code_sel $cdsel. b_cnty $county. aidgrp $grpnam. grptyp1 grptyp2 $grp_typ.;
417 title1 "Freqs off &file1: &tit1";
418 run;
NOTE: For table location in print file, see
page 1 for PLANCODE
page 1 for CLMTYPE
page 1 for ADJUST
page 2 for AC2
page 2 for PROCIND
page 3 for ACCMCODE
page 11 for CODE_SEL
page 12 for B_CNTY
page 13 for VENDCODE
page 13 for SEX
page 14 for AIDCAT
page 16 for AIDGRP
page 16 for NEW_MC1
page 16 for NEW_MC2
page 17 for GRPTYP1
page 17 for GRPTYP2
page 17 for POS
page 17 for TOS
NOTE: The PROCEDURE FREQ printed pages 1-17.
NOTE: The PROCEDURE FREQ used 2 minutes 40.06 seconds.
419
420 proc means data=&dsn1..&file1 n nmiss sum mean min max maxdec=2;
421 class cross;
422 var tot_pd amt_bld amt_pd age;
423 title1 "Means off &file1: &tit1";
424 run;
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be shifted by the "BEST" format.
NOTE: The PROCEDURE MEANS printed page 18.
NOTE: The PROCEDURE MEANS used 1 minute 43.92 seconds.
At 03:54 PM 12/22/1999 +0000, machellew@MY-DEJA.COM wrote:
>Hello Sas-L'ers!
>
>Being a self-taught SAS person, I learned on an as need basis... What
>this has meant is that many techniques that I use to do something,
>although may be perfectly legitimate from an output point of view, are
>hardly optimal from an efficiency standpoint.
>
>Here's one example of a question I have...
>
>I have a 2 million observation 30 variable administrative dataset which
>contains patient and physician variables and a line for each service of
>utilization. This means that there may be more than one line for each
>patient/physician encounter for a given date.
>
>If I want to reduce this data so that only one observation exists per
>date (that is, I am not interested in the actual service(s) used)
>
>I know of several options:
>
>1)
>proc means sum noprint;
>class doctor patient date;
>var i;
>id {list of variables I don't want to lose or have to re-merge in later}
>output out=x sum=junk noprint; run;
>
>
>2) proc sort; by doctor patient date;
> data a;
>set__;
>by doctor patient date;
>if first.date then do; keep=1; end;
>run;
>
>data b;
>set a;
>where keep=1;
>run;
>
>... there has to be a better way. The proc means method ran for over 26
>hours when I finally had to halt execution. Is there a more efficient
>method using PROC SQL?
>
>
>Thanks,
>
>Machelle Wilchesky
>
>
>Sent via Deja.com http://www.deja.com/
>Before you buy.
----------------------------------------------------------------------------
William W. Viergever Voice : (916) 483-8398
Viergever & Associates Fax : (916) 483-8399
A SAS Institute Quality Partner (USA) E-mail : wwvierg@ibm.net
Sacramento, CA 95825
"Time flies like an arrow. Fruit flies like a banana."
- Groucho Marx
----------------------------------------------------------------------------