Date: Wed, 16 Jan 2008 17:33:29 +0000
Reply-To: Paul Dorfman <sashole@BELLSOUTH.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul Dorfman <sashole@BELLSOUTH.NET>
Organization: PDC
Subject: Re: Filter dataset A, if id exists in B,
then keep the record in A?
Marc,
Just how large your data sets? It is not an idle question, for it may well dictate the matching technique choice.
However, it is at least good to know that they are manageable enough to be sorted. One note as to your code (otherwise fine): In the second sort, all you need out of it is unique IDs, hence
proc sort NODUPKEY
data = bad_list
out = bad_list_uniq
;
by id ;
run ;
and then use BAD_LIST_UNIQ thereafter, with no need to drop the satellite variables from it.
Now, even at the risk of appearing grossly battological (as it was once vis-à-vis the DoW-loop), I still must mention hashing. Fancy that the number of unique IDs in the BAD list is small enough and their length is short enough to fit them all in memory. Say for the sake of argument that the ID is numeric (and so is 8 bytes long), and you have 10 million unique IDs in the BAD file. Then you can stick them all in 80 Mb of RAM, which is a trifle nowadays for almost any computer (e.g., the laptop I am using to type this has no problem swallowing 25 times that with 2 gigs to spare), and then simply do:
data bad_employees ;
if _n_ = 1 then do ;
dcl hash bad (dataset: 'bad_list', hashexp: 16) ;
bad.definekey ('id') ;
bad.definedone () ;
end ;
set employees ;
if bad.check() = 0 ;
run ;
without the need to sort anything whatsoever. It is hardly less concise than SQL but is guaranteed to perform quite a bit faster - provided, of course, that the hash fits in memory. That's what hashes are (in small part, among other nifty things) for...
Kind regards
------------
Paul Dorfman
Jax, FL
------------
-------------- Original message ----------------------
From: Marc Weinmann <marc.weinmann.lwed@STATEFARM.COM>
>
> I have dataset A, and dataset B, I want dataset A with only those
> records who's id's appear in dataset B. I keep telling myself this
> should be simple...
>
> Below is a contrived example. I don't like "merging" the datasets as
> below because I don't want any columns from dataset B added into my
> filtered dataset A. To me MERGE seems maybe like the wrong tool for
> the job. Is merge the right tool, are there any other options?
>
> A little added tidbit is my real datasets are large and PROC SQL hasn't
> scaled well for me. I am interested in proc sql solutions if there is
> one, but I kinda shy away from it in general because I've had many jobs
> abort due to work/utility file space limits. (I'm working with people on
> those bottlenecks, i do like proc sql!)
>
> Appreciate any thoughts,
> Marc
>
>
> data employees;
> input name $ id;
> datalines;
> Jerry 1
> Garfield 2
> Heathcliff 3
> run;
>
> data bad_list;
> input id boss_last $ boss_id;
> datalines;
> 1 Fudd 1
> 1 Flinstone 2
> 3 Flinstone 2
> run;
>
> proc sort data=employees out=employees;
> by id;
> run;
>
> proc sort data=bad_list out=bad_list;
> by id;
> run;
>
>
> data bad_employees;
> merge employees (in=Emp) bad_list (in=Bad drop=boss_last boss_id);
> by id;
> if (bad);
> put _all_;
> run;
|