LISTSERV at the University of Georgia
Date:         Wed, 27 Nov 2002 10:51:15 -0500
Reply-To:     Ian Whitlock <WHITLOI1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Ian Whitlock <WHITLOI1@WESTAT.COM>
Subject:      Re: Delete data set based on max value of a var
Comments: To: PD <sophe88@YAHOO.COM>
Content-Type: text/plain; charset="iso-8859-1"

Paula,

Comments are embedded below.

IanWhitlock@westat.com

-----Original Message-----
From: PD [mailto:sophe88@YAHOO.COM]
Sent: Wednesday, November 27, 2002 9:18 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Delete data set based on max value of a var

All right, let me now provide more info for this case.

1. Existence of the data set is not an issue. They are frequency tables; the column var has value 1 only. The row var has 25 distinct values. It is a given that they are never going to be empty.

*** Why would anyone make cross tabs where the column variable is restricted to value 1?

2. This 'delete based on max value' exercise is not performed after all the data sets are generated. Each of these Freq tables is about 17k. My initial calculation shows there will be about 3 million small Freq tables created by the time the macro completes. Therefore, the goal of this exercise is to 'judge' each data set as soon as the macro spits out a freq table in an iteration. If it does not qualify, I don't want it to stay on the hard drive. I expect about 20 data sets to qualify.

*** 3 million small freq tables to solve a problem sounds like you are asking people on the list to speed up a small difficulty in a Rube Goldberg solution to some problem. You would probably get better results by telling the problem rather than the method of solution.

*** Incidentally I am in the process of writing a macro to do something where the specs said to produce one freq table for each replicate weight. We typically work with 50 to 200 replicate weights. In my macro, the basic counting is done in one DATA step. The 50 to 200 frequency tables are not produced.

*** Do you mean you are looking for 20 freq tables out of 3 million with some property like there is a positive count? Once again I get the feeling that if one knew the problem a good solution might follow.

3. I know we don't have a column-wise Max function in SAS that would let me condition on the data sets directly. The max function in SAS essentially works across variables. This does not seem to be a problem in this application. I sort the freq table by the variable bribe_dollar (it is only 17k, come on), so the max value of bribe_dollar, unique or not, is always on the last observation, and then if last.bribe_dollar ... and so on.

*** If the key count is on the last observation then just read the last observation.

   data _null_ ;
      set myfreq point=nobs nobs = nobs ;
      <check for property and set up action>
      stop ;
   run ;
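A minimal sketch of how the placeholder above might be filled in, under assumptions that are not in the thread: the rule is "delete the table when its maximum bribe_dollar falls below a cutoff", the cutoff lives in a hypothetical macro variable THRESHOLD, and myfreq sits in WORK and is already sorted by bribe_dollar. The actual qualifying rule was never stated, so treat the IF condition as illustrative only.

```sas
%let threshold = 100 ;   /* hypothetical cutoff, not from the thread */

data _null_ ;
   /* NOBS= is filled in at compile time, so POINT=nobs fetches the last observation directly */
   set myfreq point=nobs nobs=nobs ;
   /* myfreq is assumed sorted by bribe_dollar, so this value is the maximum */
   if bribe_dollar < &threshold then
      call execute('proc datasets library=work nolist ; delete myfreq ; quit ;') ;
   stop ;   /* needed with POINT=, otherwise the step never ends */
run ;
```

The code passed to CALL EXECUTE runs after the DATA step finishes, so the table is no longer open when PROC DATASETS deletes it. Only one observation is read, however large the table is.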

*** If you are in fact looking for 20 tables out of 3 million according to some property, my gut guess is that there is some way to eliminate 2.9 million of them before generation rather than after. I probably could not give the method once I found out the problem, but I would be very surprised if no one on this list could provide it. Having frequency tables of 17K for the cross tab of two variables, one of which is always 1, suggests that looking at the appropriate distribution information first might eliminate the need for producing the frequency table.

By the way, the holiday is Thanksgiving. Thanks for giving me so much attention during the past 2 and a half years.

Paula D

John.W@MEDISCIENCE.CO.UK (John Whittington) wrote in message news:<5.1.0.14.2.20021126193034.030174a0@pop3.powernet.co.uk>...
> At 14:00 26/11/02 -0500, Ian Whitlock wrote:
>
> >If you have to know the maximum value, then I agree the only way is to
> >find the maximum value. I doubt if there is any solution that would work in
> >general.
>
> If one has no additional information about the dataset, beyond the
> observations it contains, then there quite clearly can be no way of
> determining the maximum value without 'examining' each and every value at
> least once - be that 'directly' or in the name of a sort.
>
> >I do remember several years ago that Karsten Self expressed surprise that
> >in a data set indexed by a variable, SQL did not at that time use the
> >index information to obtain the maximum value. A birdie indicated he
> >would look into it, but I don't know if anything was implemented for this case.
>
> Yes, I think I remember that. Of course, that index would represent one
> manifestation of the 'additional information' of the data to which I referred.
>
> >If you do not need the precise maximum and know something about the
> >distribution you should be able to get an estimate.
>
> Sure. Having (or taking) some sort of sample, and 'knowing (or assuming)
> something about the distribution' is obviously the very basis of (more
> -or-less a definition of) estimation. Of course, in terms of the issue
> we've been talking about, estimation does not come into it. If one has a
> criterion based on 'any value being >x', then one merely has to look for
> the first instance (which could be the first examined) of a value >x to
> know that the criterion has been fulfilled. As you've said, if there are
> such values within the dataset, then one would be very unlucky to have to
> examine every single one before finding the one 'which was enough'.
>
> >In one of the few good statistics books that I have met, I was impressed
> >by the introduction where the author said that in World War II the allies
> >got a pretty good estimate of the number of German tanks by taking a
> >visible sample of tank numbers and assuming that the tanks had been
> >numbered from one to the maximum.
>
> Yes, that's a very famous story, and you'll find it (or variants of it)
> quoted in the elementary chapters of very many statistical texts. The
> 'answer' expected varies according to the level of the students. At the
> lowest level, most would suggest that the best estimate of the total number
> (assuming consecutive numbering from 1 upwards) would be double the mean of
> the serial numbers of the observed sample (a fairly intuitive solution) -
> but, at a slightly more advanced level, it's easy to show that there is a
> better estimate than that ('double the mean' can obviously give some stupid
> answers - e.g. the estimated total can be less than the highest observed
> serial number!). However, everything obviously depends upon the observed
> tanks being a random sample of those around - if one's sample was based
> upon having stumbled across a collection of 'very early' or 'very late'
> tanks, then the estimate could obviously be drastically wrong!
>
> Kind Regards,
>
>
> John
>
> ----------------------------------------------------------------
> Dr John Whittington,       Voice:    +44 (0) 1296 730225
> Mediscience Services       Fax:      +44 (0) 1296 738893
> Twyford Manor, Twyford,    E-mail:   John.W@mediscience.co.uk
> Buckingham  MK18 4EL, UK             mediscience@compuserve.com
> ----------------------------------------------------------------
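
The tank-numbering point quoted above can be worked through on a made-up sample. The sketch below is only an illustration: the four serial numbers are invented, and the "better" estimate shown is one commonly quoted max-based formula, m + m/k - 1 (m = largest observed serial, k = sample size). John does not say which estimator he has in mind, so that choice is an assumption.

```sas
/* Invented sample of observed serial numbers (not real data) */
data tanks ;
   input serial @@ ;
   datalines ;
1 2 3 60
;
run ;

proc sql ;
   /* double_mean: twice the sample mean                          */
   /* max_based  : m + m/k - 1, assumed here as the better choice */
   select 2 * mean(serial)                            as double_mean ,
          max(serial) * (1 + 1 / count(serial)) - 1   as max_based
   from tanks ;
quit ;
```

For this sample, double the mean gives 33, which is below the largest serial actually seen (60), the kind of 'stupid answer' described above. The max-based formula gives 74 and can never fall below the observed maximum.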

