Date: Mon, 30 Jan 2006 16:12:47 +0000
Reply-To: iw1junk@COMCAST.NET
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ian Whitlock <iw1junk@COMCAST.NET>
Subject: Re: How Do I De-Duplicate Whilst Importing a Flat File?
Bora,
If the file is sorted by id the answer is trivial, grab the first id and
throw away the rest. This could be implemented by saving the id in a hold
variable
data w ( drop = savid ) ;
retain savid ;
input id @ ;
if id ^= savid then
do ;
input ...;
output ;
savid = id ;
end ;
else
input ;
run ;
An alternative is to make a view and then use FIRST.ID processing to grab
the first record. This maybe the cleaner solution in terms of today's processing
but as the above shows a view is not needed.
If the file is not sorted then either save the id's in hash table. Version
9 makes it easy. For earlier versions use http://lexjansen.com/sugi/ to
search papers and SAS-L for "hash" and "Dorfman" to get details on setting
up the hash. If the ids were say integers between 1 and a couple of
million than a simple array with the index ID could solve the problem. If
there many millions of id's then you will have to do the above in blocks,
sort, and then interleave removing extra duplicates with FIRST.ID
processing.
You might also question the integrity of the IT system that produces
duplicates until it becomes too embarrassing to maintain.
Ian Whitlock
=================
Date: Mon, 30 Jan 2006 17:39:21 +0200
Reply-To: BoraYavuz@HSBC.COM.TR
Sender: "SAS(r) Discussion"
From: Bora Yavuz <BoraYavuz@HSBC.COM.TR>
Subject: How Do I De-Duplicate Whilst Importing a Flat File?
Content-type: text/plain; charset=us-ascii
Hello,
I have this huge flat file which - thanks to the IT guys who spool it! -
contains duplicate records for some IDs. Since all the records for a given
ID are exactly the same on all the columns I'd like keep only one (and any
one) record from each set of duplicates.
My question is: How do I de-duplicate this huge flat file whilst importing
it - and not after having imported and created a SAS data set? The reason
I'm investigating this is to save on I/O time (as you have already
guessed).
There may be two possible scenarios:
--> The flat file is already sorted on "ID".
--> The flat file is not sorted on "ID" (in which case I think there
possibly does not exist a single-pass solution).
Thank you very much in advance,
Bora Y.