Date: Fri, 17 Oct 1997 13:43:43 -0400
Reply-To: WHITLOI1 <whitloi1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: WHITLOI1 <whitloi1@WESTAT.COM>
Subject: Re[2]: Merge and interleave simultaneously?
Subject: Re: Merge and interleave simultaneously?
Summary: Double merge - test it, View - test it, SQL - test it.
Respondent: Ian Whitlock <whitloi1@westat.com>
LPogoda <lpogoda@AOL.COM> writes
>In previous replies under this heading, I suggested a couple of approches
>using a three-way merge. Various deficiencies in my "solutions" were
>correctly pointed out to me, so I'd like to try again.
>
>The original question (which I've lost) asked how to interleave two data
>sets while simultaneously appending data from a third data set to the
>interleaved observations. Because of the size of the original data sets,
>the questioner wanted to avoid using intermediate data sets if at all
>possible. So first interleaving the two data sets and merging the result
>with the third was not an acceptable solution.
>
>One way to get a result would be by:
>
>proc sort data = men;
> by occ;
>proc sort data = women;
> by occ;
>proc sort data = occ;
> by occ;
>data all (keep = id occ desc);
> merge men (in = men) occ (in = occupy);
> by occ;
> if men and occupy then output;
> merge women (in = women) occ (in = occupy);
> by occ;
> if women and occupy then output;
>run;
My initial reaction to the last DATA step was - Who stops the DATA step
first? If it is the women then some men will not go on the file!
Here is the test.
data men ;
input id $ occ $ ;
sex = 'M' ;
cards ;
1 1
2 1
3 1
4 2
5 2
;
data women ;
input id $ occ $ ;
sex = 'F' ;
cards ;
6 1
;
data occ ;
input occ $ desc $ ;
cards ;
1 b
2 c
;
data all (keep = id occ desc);
merge men (in = men) occ (in = occupy); by occ;
if men and occupy then output;
merge women (in = women) occ (in = occupy); by occ;
if women and occupy then output;
run;
proc print data = all ; run ;
I removed the sorts (since the data is sorted) and added a PROC PRINT.
Here is the print.
OBS ID OCC DESC
1 1 1 b
2 6 1 b
3 2 1 b
4 3 1 c
As you can see some of the men are missing. But look at DESC, why is
the value C and not B as it should be? If you understand that you are
a long way to understanding how SAS works.
On the second loop of the DATA step there were no women so the women's
merge went on to the next OCC value. This clobbered the correct values
of OCC and DESC. On the third loop there was still a man with OCC=1.
No data was read from the OCC file hence the bad value of DESC. Why
was OCC correct? Because this came from the man's record.
If you must use two merges in a DATA step you had better understand
very well what is happening and test very thoroughly.
Several answers suggested views. I wanted to see how well they
performed on a fast Pentium II with lots of space and memory. What is
the best way? We need to bring the men and women together first
otherwise we are doomed to handling the OCC data twice.
Here is the log.
313 /* merge3wy.sas */
314 data men women ;
315 retain a1 - a20 56 ;
316 do occ = 1 to 2000 ;
317 do seq = 1 to 500 ;
318 id + 1 ;
319 sex = 'M' ;
320 output men ;
321 sex = 'F' ;
322 output women ;
323 end ;
324 end ;
325 run ;
NOTE: The data set WORK.MEN has 1000000 observations and 24 variables.
NOTE: The data set WORK.WOMEN has 1000000 observations and 24 variables.
NOTE: The DATA statement used 53.0 seconds.
326
327 data occ ;
328 do occ = 1 to 2000 ;
329 rate = ceil ( ranuni (8768503 ) * 2000 ) ;
330 output ;
331 end ;
332 run ;
NOTE: The data set WORK.OCC has 2000 observations and 2 variables.
NOTE: The DATA statement used 0.22 seconds.
333
334 /* merge with a view */
335
336 data vpers / view = vpers ;
337 set men women ;
338 by occ ;
339 run ;
NOTE: DATA STEP view saved on file WORK.VPERS.
NOTE: The original source statements cannot be retrieved from a stored
DATA STEP view nor
will a stored DATA STEP view run under a different release of the
SAS system or under a
different operating system.
Please be sure to save the source statements for this DATA STEP
view.
NOTE: The DATA statement used 0.11 seconds.
340
341 data w ;
342 merge vpers occ ;
343 by occ ;
344 run ;
NOTE: The view WORK.VPERS.VIEW used 4 minutes 34.46 seconds.
NOTE: The data set WORK.W has 2000000 observations and 25 variables.
NOTE: The DATA statement used 4 minutes 35.17 seconds.
I was surprised to see how well the view actually did. Could it really
beat one step processing? This code will not work unless all OCC codes
are realized in each file, so it is not a general solution, but I
wanted to compare times.
345
346 /* one step - assuming ever occ in each file */
347
348 data w ;
349 set occ ;
350 do until ( last.occ ) ;
351 set men ;
352 by occ ;
353 output ;
354 end ;
355 do until ( last.occ ) ;
356 set women ;
357 by occ ;
358 output ;
359 end ;
360 run ;
NOTE: The data set WORK.W has 2000000 observations and 25 variables.
NOTE: The DATA statement used 5 minutes 31.53 seconds.
Yes the view does seem to be better. Anybody have a good explanation
of why?
At least several suggestions to use SQL have been made. Personally I
would avoid SQL when computer efficiency is the issue. Second I would
not expect it to save data passing here. Even when there is only one
statement, SQL will set up temporary tables. In other words I expect
it to take the union and then do the merge. Just what the question
asked how to avoid. But let's take a look anyway. The test data is
the same as the above, the machine is the same, but in a different
session. I used the INOBS= option to control how much processing would
take place. Here is the log.
142 proc sql inobs = 200000 ;
143 create table w as
144 select *
145 from (
146 select *
147 from men
148 union
149 select *
150 from women
151 ) as temp ,
152 occ
153 where temp.occ = occ.occ
154 ;
WARNING: Variable OCC already exists on file WORK.W.
WARNING: Only 200000 records were read from WORK.MEN (alias=) due to
INOBS= option.
WARNING: Only 200000 records were read from WORK.WOMEN (alias=) due to
INOBS= option.
NOTE: Table WORK.W created, with 400000 rows and 25 columns.
155 quit ;
NOTE: The PROCEDURE SQL used 3 minutes 46.84 seconds.
Not great considering we took only 1/5 of the men and women, but it is
working. Now let's double the input.
169 proc sql inobs = 400000 ;
170 create table w as
171 select *
172 from (
173 select *
174 from men
175 union
176 select *
177 from women
178 ) as temp ,
179 occ
180 where temp.occ = occ.occ
181 ;
WARNING: Variable OCC already exists on file WORK.W.
WARNING: Only 400000 records were read from WORK.MEN (alias=) due to
INOBS= option.
WARNING: Only 400000 records were read from WORK.WOMEN (alias=) due to
INOBS= option.
ERROR: File WORK.#TF00040.UTILITY is damaged. I/O processing did not
complete.
ERROR: Task terminated by user request due to insufficient resources.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.W may be incomplete. When this step was
stopped there were 0
observations and 25 variables.
NOTE: The PROCEDURE SQL used 9 minutes 15.58 seconds.
182 quit ;
Conclusion, one has to be very careful using SQL with large files. Yes
SQL has a solid role in SAS processing, but for this question it is the
wrong tool.
Ian Whitlock