Date: Wed, 2 Oct 2002 19:29:06 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: PROC SQL, union, PROC SORT, nodup
Content-Type: text/plain; charset="iso-8859-1"
The embedded UNION query eliminates duplicates in (ds_a OR ds_b) before the
implicit inner join with ds_k. SET ds_a ds_b does not. Had you replaced
UNION CORR with OUTER UNION CORR, the UNION and SET would produce the same
results. The JOIN will as well if ds_k contains no duplicates of duplicated
keys in the OUTER UNION.
The SORT NODUPLICATES after the MERGE operates on a dataset that may combine
variables and values from two datasets, whereas SAS SQL does not save 2nd or
later instances of the same column variable name. The difficulty of
explaining what actually happens in a MERGE of two datasets that have
duplicates of BY variables and duplicates of variable labels on both sides
has led some SAS data step experts to say 'NEVER do that'.
Sig
-----Original Message-----
From: Talbot Katz [mailto:TopKatz@MSN.COM]
Sent: Wednesday, October 02, 2002 5:31 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: PROC SQL, union, PROC SORT, nodup
Hi, Gang!
Yet another shaggy dog story from y.t. I have three data sets, call them
ds_a, ds_b, and ds_k; ds_a and ds_b have the identical layout, and ds_k
shares a common key with them. I want to create a data set which chooses
some records from ds_a, some from ds_b, and matches them up against ds_k.
Suppose they are already all sorted by the common key; then I could do this
with two data steps :
data ds_c;
set ds_a ds_b;
by common_key;
run;
data ds_m;
merge ds_c (in = c) ds_k (in = k);
by common_key;
if c and k;
run;
When I started this task, the data sets weren't pre-sorted, so I took the
lazy route, and tried to do the task with PROC SQL, as follows:
proc sql ;
create table ds_m as
select * from
(select *
from
ds_a
union corr
select *
from
ds_b) as do,
ds_k as dc
where do.common_key = dc.common_key
;
quit ;
Lo and behold, the second method produced fewer output records than the
first method did. I tried PROC SORT with NODUPLICATES on both output data
sets, but no records were eliminated from either one. After fooling around
for most of the day with little success, I finally found a group of four
records in ds_b which looked identical to my naked eye (although PROC SORT
with NODUPLICATES wouldn't collapse them at all), but showed up in
differing amounts in the two output data sets. They satisfied a condition,
which we'll call condition4; none of the records in ds_a satisfy
condition4. I put these four records in a dataset called ds_4 and I got
the following results :
proc sql ;
create table ds_m4 as
select * from
(select *
from
ds_4) as do,
ds_k as dc
where do.common_key = dc.common_key
and condition4
;
quit ;
This resulted in a four record output data set, ds_m4. Then I tried :
proc sql ;
create table ds_m4 as
select * from
(select *
from
ds_a
union corr
select *
from
ds_4) as do,
ds_k as dc
where do.common_key = dc.common_key
and condition4
;
quit ;
This resulted in a ONE record output data set, ds_m4. Somehow the "UNION
CORR" condition caused duplicate elimination to take place in PROC SQL,
even though PROC SORT could not expose the records as duplicates.
Can anyone tell me what's going on? Is the PROC SORT NODUPLICATES option
defective? Is "UNION CORR" supposed to eliminate duplicates? Is there a
way of doing this with PROC SQL that wouldn't eliminate duplicates?
Thanks!
-- TMK --