|
> I have a question about SQL here. I was recently trying to
> determine the
> best way to eliminate duplicate records and decided to use PROC SORT
> NODUPKEY. There were three key variables I wanted to check,
> but I also
> wanted to keep the remaining variables. In the SQL version I
> was playing
> with, it worked well (select distinct(var1, var2, var3)...)
> but I couldn't
> figure out how to get the other variables included in the
> resulting dataset.
> Could someone post such an example?
John, I can't quite get which way didn't work. Here are two general forms
(syntax not guaranteed):
proc sort data=in out=out nodupkey;
by key1 key2 key3;
run;
<should keep all variables>
proc sql;
create table out as
select distinct(var1,var2,var3),<othervarinfo>
from table in
group by var1 var2 var3;
quit;
The kicker is the othervarinfo. Unless you somehow tell SQL which row to
pick, what can it do?
Example Table In:
State Zipcode
GA 30329
GA 30341
Now I ask SQL to give me distinct(state), zipcode. Which row should it give
back? Who knows. It won't. That is why Peter's example used max(zipcode),
there must be some way to determine which of the rows to return. Note that
this is not a problem if you just select distinct(state), it is only when
you combine this with asking for the nonkey field that is non-unique across
that key.
Basically, it is a logical dilemna. SAS solves this by making it's own
assumptions, which are documented in proc sort. SQL does not make these
assumptions, so the user must supply his/her own assumptions.
Brad
|