Date: Sat, 26 Apr 2003 08:40:36 -0400
Reply-To: "Richard A. DeVenezia" <radevenz@IX.NETCOM.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Richard A. DeVenezia" <radevenz@IX.NETCOM.COM>
Subject: Re: Confusing results from a simple merge
Curt:
The confusing results are due to the repeated by groups and 'special logic'
coincendence.
There are 100 by-groups (id1=1..20 id2=1..5)
In table A, each by-group has 7 observations (id3=1..7)
In table B, each by-gruop has 4 observations (id4=1..4)
When you merge dy id1 id2, the first 4 observations of each tables by group
are lined up.
During the first 4 observations of each by group id3 ^= 5 is met, and thus
b2 is set to missing.
After the 4th observation from table A all values from table B are
effectively 'locked' until explicitly assigned or the next by group starts.
During the 5th observation from table A's by group the id3^=5 is false and
the if falls through and b2 remains missing (since it is 'locked').
During the 6th and 7th observations from table A's by group id3^=5 is again
true and b2 is explicitly set to missing.
Moral: BEWARE REPEATED OBSERVATIONS IN BY GROUPS.
Here is a data step to show more explicitly what happened
data ab; * Merge A and B here;
data foo;
merge a b ;
by id1 id2;
if id3 ^= 5 then do;
b2was=b2;
b2wassetmissing=1;
b2=.;
if id3=5 then put 'ERROR!';
end;
run;
--
Richard A. DeVenezia, http://www.devenezia.com
"Curt Seeliger" <seeliger.curt@epa.gov> wrote in message
news:200304251649.10310.seeliger.curt@epa.gov...
> Folks,
>
> It's late friday afternoon and most of you have returned home safely.
>
> I have two datasets A and B with by-variables id1 and id2 in common, and
> I'd like to copy the variables b1-b3 from B to A, and have them be
> non-missing only at specific records (where id3=5, in this example).
> Code which *should* do this is listed below. I've been able to get it to
> run by splitting the merging data step -- those statements are commented
> out, along with some other operations I'd be doing with the original
> 'real' data.
>
> The problem is that b2 is set to missing regardless of the value of id3.
> Separating the datastep by uncommenting out the line of statements causes
> the code to run as intended. The second odd thing about this is that the
> ERROR message is never printed, indicating that the IF statement is
> working correctly. It's as if b2 is being set to missing outside the IF
> statement.
>
> Maybe the biggest problem is that I'm missing something really obvious.
> Thanks for any clues,
>
> data a; * Two fake data sets;
> do id1=1 to 20;
> do id2=1 to 5;
> do id3=1 to 7;
> a1=id1; a2=id1*id2;
> output;
> end;
> end;
> end;
> run;
>
> data b;
> do id1=1 to 20;
> do id2=1 to 5;
> do id4=1 to 4;
> b1=id1*id2; b2=id1*id2+id4; b3=id4;
> output;
> end;
> end;
> end;
> run;
>
> data ab; * Merge A and B here;
> merge a/*(in=in_a)/**/
> b/* (where=(id4=1)) /**/
> ;
> by id1 id2;
> *if in_a;
> /*run; data ab; set ab; /* Split the data step, or don't */
> *drop id4;
> if id3^=5 then do;
> b2=.; if id3=5 then put 'ERROR!';
> end;
> run;
> proc print data=ab uniform; * wonder why? I do. ;
> run;
>
> The only explaination I'm coming up with is that the DSV isn't been loaded
> with the data from B for records with id3>4 (which is the number of id4
> values in B) until the dataset AB is created. (You might need to run the
> code to see what I mean, checking for id3^=4 instead). This makes the
> problem somewhat simpler: Why isn't the DSV completed until after the
> output dataset is written?
>
> cur
>
> --
> Curt Seeliger, Data Ranger
> CSC, EPA/WED contractor
> 541/754-4638
> seeliger.curt@epa.gov
|