LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (October 1997, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 17 Oct 1997 13:43:43 -0400
Reply-To:     WHITLOI1 <whitloi1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:         WHITLOI1 <whitloi1@WESTAT.COM>
Subject:      Re[2]: Merge and interleave simultaneously?
Comments: To: LPogoda <lpogoda@AOL.COM>, reeve@cwmills.umd.edu

Subject: Re: Merge and interleave simultaneously? Summary: Double merge - test it, View - test it, SQL - test it. Respondent: Ian Whitlock <whitloi1@westat.com>

LPogoda <lpogoda@AOL.COM> writes

>In previous replies under this heading, I suggested a couple of approches >using a three-way merge. Various deficiencies in my "solutions" were >correctly pointed out to me, so I'd like to try again. > >The original question (which I've lost) asked how to interleave two data >sets while simultaneously appending data from a third data set to the >interleaved observations. Because of the size of the original data sets, >the questioner wanted to avoid using intermediate data sets if at all >possible. So first interleaving the two data sets and merging the result >with the third was not an acceptable solution. > >One way to get a result would be by: > >proc sort data = men; > by occ; >proc sort data = women; > by occ; >proc sort data = occ; > by occ; >data all (keep = id occ desc); > merge men (in = men) occ (in = occupy); > by occ; > if men and occupy then output; > merge women (in = women) occ (in = occupy); > by occ; > if women and occupy then output; >run;

My initial reaction to the last DATA step was - Who stops the DATA step first? If it is the women then some men will not go on the file! Here is the test.

data men ; input id $ occ $ ; sex = 'M' ; cards ; 1 1 2 1 3 1 4 2 5 2 ; data women ; input id $ occ $ ; sex = 'F' ; cards ; 6 1 ; data occ ; input occ $ desc $ ; cards ; 1 b 2 c ;

data all (keep = id occ desc); merge men (in = men) occ (in = occupy); by occ; if men and occupy then output;

merge women (in = women) occ (in = occupy); by occ; if women and occupy then output; run;

proc print data = all ; run ;

I removed the sorts (since the data is sorted) and added a PROC PRINT. Here is the print.

OBS ID OCC DESC

1 1 1 b 2 6 1 b 3 2 1 b 4 3 1 c

As you can see some of the men are missing. But look at DESC, why is the value C and not B as it should be? If you understand that you are a long way to understanding how SAS works.

On the second loop of the DATA step there were no women so the women's merge went on to the next OCC value. This clobbered the correct values of OCC and DESC. On the third loop there was still a man with OCC=1. No data was read from the OCC file hence the bad value of DESC. Why was OCC correct? Because this came from the man's record.

If you must use two merges in a DATA step you had better understand very well what is happening and test very thoroughly.

Several answers suggested views. I wanted to see how well they performed on a fast Pentium II with lots of space and memory. What is the best way? We need to bring the men and women together first otherwise we are doomed to handling the OCC data twice.

Here is the log.

313 /* merge3wy.sas */ 314 data men women ; 315 retain a1 - a20 56 ; 316 do occ = 1 to 2000 ; 317 do seq = 1 to 500 ; 318 id + 1 ; 319 sex = 'M' ; 320 output men ; 321 sex = 'F' ; 322 output women ; 323 end ; 324 end ; 325 run ;

NOTE: The data set WORK.MEN has 1000000 observations and 24 variables. NOTE: The data set WORK.WOMEN has 1000000 observations and 24 variables. NOTE: The DATA statement used 53.0 seconds.

326 327 data occ ; 328 do occ = 1 to 2000 ; 329 rate = ceil ( ranuni (8768503 ) * 2000 ) ; 330 output ; 331 end ; 332 run ;

NOTE: The data set WORK.OCC has 2000 observations and 2 variables. NOTE: The DATA statement used 0.22 seconds.

333 334 /* merge with a view */ 335 336 data vpers / view = vpers ; 337 set men women ; 338 by occ ; 339 run ;

NOTE: DATA STEP view saved on file WORK.VPERS. NOTE: The original source statements cannot be retrieved from a stored DATA STEP view nor will a stored DATA STEP view run under a different release of the SAS system or under a different operating system. Please be sure to save the source statements for this DATA STEP view. NOTE: The DATA statement used 0.11 seconds.

340 341 data w ; 342 merge vpers occ ; 343 by occ ; 344 run ;

NOTE: The view WORK.VPERS.VIEW used 4 minutes 34.46 seconds.

NOTE: The data set WORK.W has 2000000 observations and 25 variables. NOTE: The DATA statement used 4 minutes 35.17 seconds.

I was surprised to see how well the view actually did. Could it really beat one step processing? This code will not work unless all OCC codes are realized in each file, so it is not a general solution, but I wanted to compare times.

345 346 /* one step - assuming ever occ in each file */ 347 348 data w ; 349 set occ ; 350 do until ( last.occ ) ; 351 set men ; 352 by occ ; 353 output ; 354 end ; 355 do until ( last.occ ) ; 356 set women ; 357 by occ ; 358 output ; 359 end ; 360 run ;

NOTE: The data set WORK.W has 2000000 observations and 25 variables. NOTE: The DATA statement used 5 minutes 31.53 seconds.

Yes the view does seem to be better. Anybody have a good explanation of why?

At least several suggestions to use SQL have been made. Personally I would avoid SQL when computer efficiency is the issue. Second I would not expect it to save data passing here. Even when there is only one statement, SQL will set up temporary tables. In other words I expect it to take the union and then do the merge. Just what the question asked how to avoid. But let's take a look anyway. The test data is the same as the above, the machine is the same, but in a different session. I used the INOBS= option to control how much processing would take place. Here is the log.

142 proc sql inobs = 200000 ; 143 create table w as 144 select * 145 from ( 146 select * 147 from men 148 union 149 select * 150 from women 151 ) as temp , 152 occ 153 where temp.occ = occ.occ 154 ; WARNING: Variable OCC already exists on file WORK.W. WARNING: Only 200000 records were read from WORK.MEN (alias=) due to INOBS= option. WARNING: Only 200000 records were read from WORK.WOMEN (alias=) due to INOBS= option. NOTE: Table WORK.W created, with 400000 rows and 25 columns.

155 quit ; NOTE: The PROCEDURE SQL used 3 minutes 46.84 seconds.

Not great considering we took only 1/5 of the men and women, but it is working. Now let's double the input.

169 proc sql inobs = 400000 ; 170 create table w as 171 select * 172 from ( 173 select * 174 from men 175 union 176 select * 177 from women 178 ) as temp , 179 occ 180 where temp.occ = occ.occ 181 ; WARNING: Variable OCC already exists on file WORK.W. WARNING: Only 400000 records were read from WORK.MEN (alias=) due to INOBS= option. WARNING: Only 400000 records were read from WORK.WOMEN (alias=) due to INOBS= option. ERROR: File WORK.#TF00040.UTILITY is damaged. I/O processing did not complete. ERROR: Task terminated by user request due to insufficient resources. NOTE: The SAS System stopped processing this step because of errors. WARNING: The data set WORK.W may be incomplete. When this step was stopped there were 0 observations and 25 variables. NOTE: The PROCEDURE SQL used 9 minutes 15.58 seconds.

182 quit ;

Conclusion, one has to be very careful using SQL with large files. Yes SQL has a solid role in SAS processing, but for this question it is the wrong tool.

Ian Whitlock


Back to: Top of message | Previous page | Main SAS-L page