LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 1998, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 2 Dec 1998 17:04:16 -0500
Reply-To:   "Self, Karsten" <Karsten.Self@SCHWAB.COM>
Sender:   "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:   "Self, Karsten" <Karsten.Self@SCHWAB.COM>
Subject:   Re: Matching Events
Comments:   To: Alexander Martchenko <amartchenko@NETSCAPE.NET>
Content-Type:   text/plain

> ---------- > From: Alexander Martchenko[SMTP:amartchenko@NETSCAPE.NET] > Reply To: Alexander Martchenko > Sent: Wednesday, December 02, 1998 1:08 PM > To: SAS-L@UGA.CC.UGA.EDU > Subject: Matching Events > > A large (order of 10 million) SAS dataset MASTER has variables DATE > and TIME and some additional info in a number of other vars but let's > say there is only 1 extra var MINFO. > > I also have a small dataset EVENTS (about 20000 obs) with only DATE > and TIME. In this small file (but _not_ in big) DATE values are > unique. > > I'm trying to come up with the most efficient way of selecting only > those events from the big file that are also on the small file. > A distinct match is DATE and TIME.

> The big file is refreshed daily and the match has to be run many times > a day against different small datasets. > There are several approaches to this problem. You could use a double SET statement with KEY= option for the small (keys) file. You could use a MODIFY statement. You could read the small (keys) file into an array and do a sequential or binary search.

My preferred MO is a SAS format lookup. It works well for a keys numbering into the tens of thousands to millions, depending on system memory. This is usually difficult when two keys are involved, but DATE + TIME can be reduced to a single DATETIME value.

I'd suggest:

- normalize the DATE and TIME values of the small (keys) dataset to a single DATETIME value

datetime= dhms( date, hour(time), minute(time), trunc(second(time)));

- Sort and unduplicate on this field. Create a CNTLIN dataset (see PROC FORMAT) documentation and output a SAS format (say, SELECT.).

FMTNAME= "SELECT"; TYPE = "I"; START = DATETIME; LABEL = 1;

- Subset your large dataset based on this format:

/* untested */ data subset; set big; if input( dhms( date, hour(time), minute(time), trunc(second(time)), select.); run;

I participated in a thread in comp.soft-sys.sas on various methods of subsetting data without a SORT/MERGE last week. You can search for it at Deja News (http://www.dejanews.com/).

Good luck.

> -- > Karsten M. Self (Karsten.Self@schwab.com) > Trilogy Consulting > > What part of "gestalt" don't you understand? > > WARNING: All e-mail sent to or from this address will be received by > the Charles Schwab corporate e-mail system and is subject to archival > and review by someone other than the recipient. >


Back to: Top of message | Previous page | Main SAS-L page