LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (October 2008, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 3 Oct 2008 16:14:29 -0500
Reply-To:     Mary <mlhoward@avalon.net>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Mary <mlhoward@AVALON.NET>
Subject:      Re: Comparing 2 data sets (easy one again!)
Comments: To: Toby Dunn <tobydunn@HOTMAIL.COM>
Content-Type: text/plain; charset="iso-8859-1"

Toby,

Is this a SAS problem that the sub-query is so slow? I've tried it in Oracle and it doesn't seem too bad, but wasn't dealing with millions of records; I did use it on millions of records a few years ago doing data warehousing in DB2 and SQL Server; but DB2 was very fast at the left join approach as well.

I can see that the anti-join might be implemented better in SAS in that it is very similar to the merge in a data step. EXCEPT doesn't exist in Oracle SQL; it uses the MINUS instead.

So is this a SAS implementation problem with sub-queries or are sub-queries generally not good performers in any SQL language?

-Mary ----- Original Message ----- From: Toby Dunn To: SAS-L@LISTSERV.UGA.EDU ; Mary H Sent: Friday, October 03, 2008 4:02 PM Subject: Re: Comparing 2 data sets (easy one again!)

Mary,

There are several ways which arent dependent on macro memory size:

/***************/ /** Sub Query **/ /***************/

Proc SQL ; Create Table SubQuery As Select X From One Where X Not In ( Select Y From Two ) ; Quit ;

/***************/ /** Except **/ /***************/

Proc SQL ; Create Table Except As Select X From One Except Select Y From Two ; Quit ;

/***************/ /** Anti-Join **/ /***************/

Proc SQL ; Create Table AntiJoin As Select X From One Left Join Two On X = Y Where Missing( Y ) ; Quit ;

The SubQuery is the absolute loser of these three, its porformance suceeds as sucking to the point I wonder why any one would use this method for the stated problem.

The Except clause is actually only slightly faster than the Anti-Join and I do mean slight, I took my test out to over a million obs and then randomized them before running through the SQL code. HOwever, the Except clause has problems when you want more than just the list of Id's returned as SAS hasnt implemented a way yet to request which variables for the Except clause to use. The Anti-Join allows one to have as many fields as is desired from one or both of the data sets returned and allows the user to specify which variables they want to match or join on. In this type of scenario you have to add extra code to go back and get the extra fields and this makes the Anti-Join actually faster.

On Fri, 3 Oct 2008 15:54:07 -0500, Mary <mlhoward@AVALON.NET> wrote:

>Yes, clearly if one has too many results then my code could have also been >modified to put the results into a data set: > >proc sql noprint; >create table ids as >select ID >from set1 >where ID not in >(select ID from set2); >quit; > >But yours is another way to do it; I wonder if it is faster since it needs >to do a left join and that's generally slower, but it is certainly an >alternative. The user was just asking for ID's but you could select all the >data with the *. > >Thanks. > >-Mary > > > >----- Original Message ----- >From: Toby Dunn >To: SAS-L@LISTSERV.UGA.EDU >Sent: Friday, October 03, 2008 3:32 PM >Subject: Re: Comparing 2 data sets (easy one again!) > > >Mary, > >The method you posted will work only for small data sets where the unique >number of ID's will fit into the maximum length of a macro variable. > > >Proc Compare comes to mind as a solution. > > >If one wanted to do this via SQL then try: > >Proc SQL ; >Create Table Need As > Select One.* > From One As One > Left Join > Two As Two > On One.ID = Two.ID > Where Missing( Two.ID ) ; >Quit ; > >The above is called an Anti-Join and yes the Where clause is correct. >Actually one could have used a Having but in some testing I did the other >day it performed slighty worse than the Where clause. Either way they >both produce the same results. > > > > >On Fri, 3 Oct 2008 10:16:14 -0500, Mary <mlhoward@AVALON.NET> wrote: > >>Using the Not in with a select on on the other data set can do this: >> >>proc sql noprint; >>select ID into: VarList >>separated by " " >>from set1 >>where ID not in >>(select ID from set2); >>quit; >>%put &varlist; >> >>-Mary >> >> >>----- Original Message ----- >>From: Mike Smith >>To: SAS-L@LISTSERV.UGA.EDU >>Sent: Friday, October 03, 2008 9:58 AM >>Subject: Comparing 2 data sets (easy one again!) >> >> >>Hi all SAS experts, >> >>I am trying to manage two big data sets and trying to figure out which >>caseids of one dataset could be found /not found in the other dataset. >> >>Is there like a procedure aside from just sorting both datasets and >>figuring out which ids arent there? >> >> >>Thanks a lot!


Back to: Top of message | Previous page | Main SAS-L page