|
> ----------
> From: Alexander Martchenko[SMTP:amartchenko@NETSCAPE.NET]
> Reply To: Alexander Martchenko
> Sent: Wednesday, December 02, 1998 1:08 PM
> To: SAS-L@UGA.CC.UGA.EDU
> Subject: Matching Events
>
> A large (order of 10 million) SAS dataset MASTER has variables DATE
> and TIME and some additional info in a number of other vars but let's
> say there is only 1 extra var MINFO.
>
> I also have a small dataset EVENTS (about 20000 obs) with only DATE
> and TIME. In this small file (but _not_ in big) DATE values are
> unique.
>
> I'm trying to come up with the most efficient way of selecting only
> those events from the big file that are also on the small file.
>
A distinct match is DATE and TIME.
> The big file is refreshed daily and the match has to be run many times
> a day against different small datasets.
>
There are several approaches to this problem. You could use a double
SET statement with KEY= option for the small (keys) file. You could use
a MODIFY statement. You could read the small (keys) file into an array
and do a sequential or binary search.
My preferred MO is a SAS format lookup. It works well for a keys
numbering into the tens of thousands to millions, depending on system
memory. This is usually difficult when two keys are involved, but DATE
+ TIME can be reduced to a single DATETIME value.
I'd suggest:
- normalize the DATE and TIME values of the small (keys) dataset to a
single DATETIME value
datetime= dhms( date, hour(time), minute(time),
trunc(second(time)));
- Sort and unduplicate on this field. Create a CNTLIN dataset (see
PROC FORMAT) documentation and output a SAS format (say, SELECT.).
FMTNAME= "SELECT";
TYPE = "I";
START = DATETIME;
LABEL = 1;
- Subset your large dataset based on this format:
/* untested */
data subset;
set big;
if input( dhms( date, hour(time), minute(time), trunc(second(time)),
select.);
run;
I participated in a thread in comp.soft-sys.sas on various methods of
subsetting data without a SORT/MERGE last week. You can search for it
at Deja News (http://www.dejanews.com/).
Good luck.
> --
> Karsten M. Self (Karsten.Self@schwab.com)
> Trilogy Consulting
>
> What part of "gestalt" don't you understand?
>
> WARNING: All e-mail sent to or from this address will be received by
> the Charles Schwab corporate e-mail system and is subject to archival
> and review by someone other than the recipient.
>
|