LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (October 1997, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 30 Oct 1997 16:20:19 -0500
Reply-To:     WHITLOI1 <whitloi1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:         WHITLOI1 <whitloi1@WESTAT.COM>
Subject:      Re[2]: How to JOIN without SQL
Comments: To: Douglas Dame <dougdame@HPE.UFL.EDU>

Subject: Re: How to JOIN without SQL Summary: Style and efficiency. Respondent: Ian Whitlock <whitloi1@westat.com>

Douglas Dame <dougdame@HPE.UFL.EDU> in response to the question

>I need to calculate the distance between every California zip code (about >2500 of them) and every California hospital (about 500 of them). I have a >zip file with coordinates and a hosp file with coordinates. The ideal way >to do this is to use PROC SQL to join every observation in one file with >every observation in the other file. > > >I would like to be able to use a datastep to do the following: for each >record in the hospital file, go through each record in the zip file, >calculating distance and outputting the record to an output file on disk. I >wanted to do something like: > > DATA DISTFILE; > SET HOSPFILE; > DO OBSNUM=1 TO 2500; > SET ZIPFILE POINT=OBSNUM; > DISTANCE = ....................... > etc. >

suggested an array approach

DD> (code not tested) DD> DD> data _null_; DD> set SASHELP.VTABLE; DD> where LIBNAME='PERM_LIB' and MEMNAME='HOSPFILE'; DD> call symput('NO_HOSPS', put(OBS,8.)); DD> run; DD> DD> data DISTANCE; DD> set PERM_LIB.HOSPFILE (in=HOSP keep=HOSP_ID LAT LONG) DD> PERM_LIB.ZIPFILE (keep=ZIPCODE LAT LONG); DD> *--- Construct array of lat & long values for hosps ---*; DD> *--- Array hosps rather than Zips because it''s a ---*; DD> *--- much shorter (narrower) list. ---*; DD> retain HOSP_1 - HOSP_&NO_HOSPS "________" DD> LAT_1 - LAT_&NO_HOSPS 0 DD> LONG_1 - LONG_&NO_HOSPS 0 ; DD> array _HOSP_ HOSP_1 - HOSP_&NO_HOSPS; DD> array _LAT_ LAT_1 - LAT_&NO_HOSPS; DD> array _LONG_ LONG_1 - LONG_&NO_HOSPS; DD> if HOSP then do; DD> *--- fill the arrays ---*; DD> _I_=_N_; DD> _HOSP_=HOSP_ID; DD> _LAT_=LAT; DD> _LONG_=LONG; DD> end; DD> else do over _HOSP_; DD> *--- process each zip ---*; DD> HOSP_ID=_HOSP_; DD> DISTANCE=...function of ZIP's (LAT & LONG) and DD> _HOSP_'s (_LAT_ & _LONG_); DD> output; DD> end; DD> keep HOSP_ID ZIPCODE DISTANCE; DD> run;

Is it worth it? There are 1.25 million reads in the first program. There are 3000 Douglas's version. I'll go along with Douglas since he pointed it out, but would consider some changes.

I would use _temporary_ arrays since the access is much faster and efficiency seems to be the question. This means no need to explicitly RETAIN or ask the compiler to mark as dropped. It also has the advantage that I don't name the variables and hence don't run into the error caused by

DD> call symput('NO_HOSPS', put(OBS,8.));

putting spaces between the root and index of names. In addition I would not use implicit arrays because of efficiency and because I believed them when they said implicit arrays will go away some time. I also think they are less clear to most current SAS programmers.

Second, I would rearrange the structure of the DATA step to clearly indicate set up and process parts. It probably makes no difference, but I would use

call symput('NO_HOSPS', put(OBS-delobs,8.));

to get the number of records to be processed rather than the number of physical records. (Since I read Mike Murphy's question about that step while writing this, I tried using the "dictionary" file in SQL instead. On my machine the DATA step took 23+ seconds. The SQL too .66 seconds. So maybe I would change that step.)

Here is my revised version of the main step. (also untested)

data DISTANCE ( keep = HOSP_ID ZIPCODE DISTANCE ) ;

*--- Construct array of lat & long values for hosps ---*; *--- Fewer hosps than Zips ---*;

array _HOSP_ $ 8 (&no_hosps) _temporary_ ; array _LAT_ (&no_hosps) _temporary_ ; array _LONG_ (&no_hosps) _temporary_ ;

do while ( not eof ) ; i + 1 ; set PERM_LIB.HOSPFILE (keep=HOSP_ID LAT LONG) end = eof ; _hosp_ ( i ) = hosp_id ; _lat_ ( i ) = lat ; _long_ ( i ) = long ; end ;

*--- Process each zip ---;

set PERM_LIB.ZIPFILE (keep=ZIPCODE LAT LONG);

HOSP_ID=_HOSP_; do i = 1 to dim ( _hosps_ ) ; DISTANCE=...function of ZIP's (LAT & LONG) and _HOSP_'s (_LAT_ & _LONG_); output ; end ;

run ;

After I finished the above, I couldn't leave it without a crude test between the two methods. Run under Windows 3.1 with a Pentium 150 and an average hard disk.

407 data w ; do x = 1 to 500 ; output ; end ; run ;

NOTE: The data set WORK.W has 500 observations and 1 variables. NOTE: The DATA statement used 0.66 seconds.

408 409 data w2 ; do y = 1 to 2500 ; output ; end ; run ;

NOTE: The data set WORK.W2 has 2500 observations and 1 variables. NOTE: The DATA statement used 0.7 seconds.

410 411 data w3 ; 412 set w ; 413 do pt = 1 to nobs ; 414 set w2 point = pt nobs = nobs ; 415 output ; 416 end ; 417 run ;

NOTE: The data set WORK.W3 has 1250000 observations and 2 variables. NOTE: The DATA statement used 43.45 seconds.

418 419 data w3 ; 420 array _x (500) _temporary_ ; 421 do while ( not eof ) ; 422 i + 1 ; 423 set w end = eof ; 424 _x(i) = x ; 425 end ; 426 427 set w2 ; 428 do i = 1 to dim ( _x ) ; 429 output ; 430 end ; 431 run ;

NOTE: The data set WORK.W3 has 1250000 observations and 3 variables. NOTE: The DATA statement used 38.56 seconds.

Ian Whitlock <whitloi1@westat.com>


Back to: Top of message | Previous page | Main SAS-L page