|
Anonymous user <Number_42@CARAMAIL.COM> asked how to do Cartesian by
processing without SQL. This is a valid question and perhaps even
important if one is trying to learn various techniques, so examples of how
easy it is in SQL just doesn't cut it.
In fact, there was an article published in the elite but now defunct
"Observations" showing that the POINT= strategy was far more efficient than
using SAS SQL as presented in some early version of version 6. I think I
posted a suggestion to SAS-L or possibly wrote a coders corner paper
showing how the authors failed to get the most efficient advantage out of
their technique.
Michael posted the solution in the message below. I changed ID=b to ID=d
in ONE and got
339 * merge the POINTER data sets;
340 * use the point information to cycle through observations with
matching
341 IDs;
342
343 data one_two (keep=id fruit color);
344 merge one_point two_point;
345 by id;
346 do i = one_start to one_end;
347 set one point=i;
348 do j = two_start to two_end;
349 set two point=j;
350 output;
351 end;
352 end;
353 run;
ERROR: Invalid DO loop control information, either the INITIAL or TO
expression is missing or
the BY expression is missing, zero, or invalid.
one_start=. id=b one_end=. two_start=2 two_end=3 FIRST.id=1 LAST.id=1 i=.
fruit=apple j=2
color=amber _ERROR_=1 _N_=2
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set WORK.ONE_POINT.
NOTE: There were 3 observations read from the data set WORK.TWO_POINT.
WARNING: The data set WORK.ONE_TWO may be incomplete. When this step was
stopped there were 2
observations and 3 variables.
WARNING: Data set WORK.ONE_TWO was not replaced because this step was
stopped.
A corrected version of the last step is
data one_two (keep=id fruit color);
length fruit color $ 8 ;
merge one_point two_point;
by id;
do i = max(one_start,0) to max(one_end,0);
if i = 0 then fruit = " " ;
else set one point=i;
do j = max(two_start,0) to max(two_end,0);
if j = 0 then color = " " ;
else set two point=j;
output;
end;
end;
run;
In a more complex setting one could create BLANK_ONE and BLANK_TWO with
one record each of the relevan subsidary variables. Then one could
use
retain pt 1 ;
and
if i = 0 then set blank_one point = pt ;
to replace
if i = 0 then fruit = " " ;
The blanking is important becasue SAS does not automatically set variables
from a SAS dataset to missing when either the SET statement fails or is
not executed.
Ian Whitlock
=================
Date: Tue, 8 Nov 2005 10:14:26 -0500
Reply-To: "Michael S. Zdeb" <msz03@HEALTH.STATE.NY.US>
Sender: "SAS(r) Discussion"
From: "Michael S. Zdeb" <msz03@HEALTH.STATE.NY.US>
Subject: Re: How to make repeats in merge-by-group to work?
Comments: To: Number_42@CARAMAIL.COM
In-Reply-To: <1131450900.655635.242990@g43g2000cwa.googlegroups.com>
Content-type: text/plain; charset=US-ASCII
Hi...I agree with Toby Dunn...use PROC SQL. But, for those of us who used
SAS prior to V6, there was no PROC SQL and many-to-many merges still had to
be done with a data step. For those of us who remember NESUG '88 (or for
that matter, 1988), the proceedings contain a paper by Robina Thornton and
Merry Rabb on "Advanced SET and MERGE Processing" that shows how to do
many-to-many merging in a data step merge. The method does what PROC SQL
does in the background, i.e. it ensures that all observations in data sets
one and two with the same ID "see" each other at some point in the data
step processing. So, here it is (from NESUG '88) with the posted data and
it does give the desired result...
data one;
input id $ fruit $;
datalines;
a apple
a apple
b banana
c coconut
c coconut
c coconut
;
proc sort data=one;
by id;
run;
data two;
input id $ color $;
datalines;
a amber
b brown
b black
c cocoa
c cream
;
proc sort data=two;
by id;
run;
* create POINTERS for both data sets - reduces the data step merge to
one-to-one merging;
data one_point (keep=id one_start one_end);
retain one_start;
set one;
by id;
if first.id then one_start = _n_;
if last.id then do;
one_end = _n_;
output;
end;
run;
data two_point (keep=id two_start two_end);
retain two_start;
set two;
by id;
if first.id then two_start = _n_;
if last.id then do;
two_end = _n_;
output;
end;
run;
* merge the POINTER data sets;
* use the point information to cycle through observations with matching
IDs;
data one_two (keep=id fruit color);
merge one_point two_point;
by id;
do i = one_start to one_end;
set one point=i;
do j = two_start to two_end;
set two point=j;
output;
end;
end;
run;
proc print data=one_two;
run;
Gives...
Obs id fruit color
1 a apple amber
2 a apple amber
3 b banana brown
4 b banana black
5 c coconut cocoa
6 c coconut cream
7 c coconut cocoa
8 c coconut cream
9 c coconut cocoa
10 c coconut cream
PS Another posted solution...
DATA test(drop=id2 );
set one;
do i=1 to nn;
set two(rename=(id=id2)) nobs=nn point=i;
if id=id2 then output;
end;
RUN;
also works but requires a complete pass through data set two for each
observation in data set one rather than only looking within observations
with the same ID.
Mike Zdeb
U@Albany School of Public Health
1 University Drive
Rensselaer, NY 12144-3456
(P)518-402-6479
(F)630-604-1475
<snip>
|