LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (January 2006, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Mon, 30 Jan 2006 16:12:47 +0000
Reply-To:     iw1junk@COMCAST.NET
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Ian Whitlock <iw1junk@COMCAST.NET>
Subject:      Re: How Do I De-Duplicate Whilst Importing a Flat File?
Comments: cc: BoraYavuz@HSBC.COM.TR

Bora,

If the file is sorted by id the answer is trivial, grab the first id and throw away the rest. This could be implemented by saving the id in a hold variable

data w ( drop = savid ) ; retain savid ; input id @ ; if id ^= savid then do ; input ...; output ; savid = id ; end ; else input ; run ;

An alternative is to make a view and then use FIRST.ID processing to grab the first record. This maybe the cleaner solution in terms of today's processing but as the above shows a view is not needed.

If the file is not sorted then either save the id's in hash table. Version 9 makes it easy. For earlier versions use http://lexjansen.com/sugi/ to search papers and SAS-L for "hash" and "Dorfman" to get details on setting up the hash. If the ids were say integers between 1 and a couple of million than a simple array with the index ID could solve the problem. If there many millions of id's then you will have to do the above in blocks, sort, and then interleave removing extra duplicates with FIRST.ID processing.

You might also question the integrity of the IT system that produces duplicates until it becomes too embarrassing to maintain.

Ian Whitlock ================= Date: Mon, 30 Jan 2006 17:39:21 +0200 Reply-To: BoraYavuz@HSBC.COM.TR Sender: "SAS(r) Discussion" From: Bora Yavuz <BoraYavuz@HSBC.COM.TR> Subject: How Do I De-Duplicate Whilst Importing a Flat File? Content-type: text/plain; charset=us-ascii Hello, I have this huge flat file which - thanks to the IT guys who spool it! - contains duplicate records for some IDs. Since all the records for a given ID are exactly the same on all the columns I'd like keep only one (and any one) record from each set of duplicates. My question is: How do I de-duplicate this huge flat file whilst importing it - and not after having imported and created a SAS data set? The reason I'm investigating this is to save on I/O time (as you have already guessed). There may be two possible scenarios: --> The flat file is already sorted on "ID". --> The flat file is not sorted on "ID" (in which case I think there possibly does not exist a single-pass solution). Thank you very much in advance, Bora Y.


Back to: Top of message | Previous page | Main SAS-L page