Date: Mon, 10 Jan 2000 09:16:38 -0500
From: Ya Huang <Y.Huang@ORGANONINC.COM>
Sender: "SAS(r) Discussion"
Subject: Re: Weird duplicates
To: johnmegargee@NETSCAPE.NET
Content-Type: text/plain; charset="iso-8859-1"

John,

It looks like the first step in solving your problem is to find a way to calculate the frequency key. The following code uses a very simple algorithm: for each digit 0 through 9, it removes that digit from the 13-character key and takes the drop in length as that digit's count. I don't know whether it is efficient enough, but at least it runs in a single pass. Once you have the frequency key, there are many ways to "dedup" the dataset; the simplest in syntax might be proc sort with the nodupkey option.

-------

data xx;
  input a;
cards;
1090823644557
1705950484263
4001475563289
;

data xx;
  set xx;
  length b c $ 13;
  b = put(a, 13.);
  /* count of digit d = 13 - length of b after removing every d */
  c = put(13 - length(compress(b, '0')), 1.) ||
      put(13 - length(compress(b, '1')), 1.) ||
      put(13 - length(compress(b, '2')), 1.) ||
      put(13 - length(compress(b, '3')), 1.) ||
      put(13 - length(compress(b, '4')), 1.) ||
      put(13 - length(compress(b, '5')), 1.) ||
      put(13 - length(compress(b, '6')), 1.) ||
      put(13 - length(compress(b, '7')), 1.) ||
      put(13 - length(compress(b, '8')), 1.) ||
      put(13 - length(compress(b, '9')), 1.);

proc print;
run;

-------------

The SAS System                08:44 Monday, January 10, 2000   1

OBS        A               B               C

 1     1.0908E12     1090823644557     2111221111
 2     1.706E12      1705950484263     2111221111
 3     4.0015E12     4001475563289     2111221111
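
The dedup step could be sketched roughly as follows. This is only a sketch: it assumes the dataset xx and the frequency key c produced by the code above, and the output dataset name dedup is made up for illustration. With the EQUALS option (the proc sort default), observations that share a key keep their original relative order, so nodupkey should retain the first key in file order for each frequency pattern.

```sas
/* Keep one observation per frequency key c.        */
/* "dedup" is a hypothetical output dataset name.   */
proc sort data=xx out=dedup nodupkey equals;
  by c;
run;

proc print data=dedup;
run;
```

For the three sample keys, which all share c = 2111221111, only the first one (1090823644557) should remain in dedup.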

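One caveat with the code above: the 1. format holds only a single character, so a digit that occurs 10 or more times in a key cannot be represented, and for a key made of one repeated digit compress returns a blank string, whose length SAS reports as 1 rather than 0. A possible variant is sketched below. It assumes a SAS release that has the countc and cats functions (older releases do not), and it assumes that keys shorter than 13 digits should count their leading zeros. The names xx2 and d and the 20-character key layout are made up for this illustration.

```sas
/* Variant sketch: two characters per digit count, so counts up to 13 fit. */
data xx2;                       /* hypothetical output dataset name */
  set xx;
  length b $ 13 c $ 20;
  b = put(a, z13.);             /* assumption: pad short keys with leading zeros */
  c = '';
  do d = 0 to 9;
    /* countc counts how many characters of b equal the digit d */
    c = cats(c, put(countc(b, put(d, 1.)), z2.));
  end;
  drop d;
run;
```

The resulting c can be used as the BY variable for the dedup step in the same way as the 10-character key.
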
HTH

Ya Huang Organon Inc.

> -----Original Message-----
> From: John Megargee [mailto:johnmegargee@NETSCAPE.NET]
> Sent: Sunday, January 09, 2000 8:14 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Weird duplicates
>
> Hi:
>
> I'm asking for an advise regarding an unusual problem. MVS Sas dataset
> 'test' contains 240 million obs. Each obs has a single numeric variable
> 'key' between 0 and 9999999999999 (13 digit integer key). I'm trying to
> identify the keys whose digits have the same frequency. For example I
> may have some observations anywhere in the file like
>
> 1090823644557
> .....
> 1705950484263
> .....
> 4001475563289
> .....
>
> Note that these keys are different but their digits have the same
> frequencies:
>
> digit: 0 1 2 3 4 5 6 7 8 9
> freq:  2 1 1 1 2 2 1 1 1 1
>
> I only need to output the first key, i.e. 1090823644557. In other
> words, if the frequency of digits is the same for some group of keys
> they are considered 'duplicate' and I want to 'dedup' the file in this
> sense, i.e. output only the first key with each particular frequency of
> digits. I principally know how to do it in several passes through the
> file by reshaping, sorting, reshaping, etc. but with this many obs the
> run-times I get are prohibitive. Any ideas of doing this most
> efficiently are greatly appreciated.
>
> Thanks in advance, John
