Date: Wed, 27 Nov 2002 10:51:15 -0500
Reply-To: Ian Whitlock <WHITLOI1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ian Whitlock <WHITLOI1@WESTAT.COM>
Subject: Re: Delete data set based on max value of a var
Paula,
Comments are embedded below.
IanWhitlock@westat.com
-----Original Message-----
From: PD [mailto:sophe88@YAHOO.COM]
Sent: Wednesday, November 27, 2002 9:18 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Delete data set based on max value of a var
All right, let me now provide more info for this case.
1. Existence of the data set is not an issue. They are frequency
tables, the column var has value 1 only. The row var has 25
distinct values. It is a given that they are never going to be
empty.
*** Why would anyone make cross tabs where the column variable is restricted
to value 1?
2. This 'delete based on max value' exercise is not performed after
all the data sets are generated. Each of these Freq tables is about
17k. My initial calculation shows there will be about 3 million small
Freq tables created eventually after the macro is completed.
Therefore, the goal of this exercise is to 'judge' the data set after
the macro spits out a freq table per iteration. If it does not
qualify, I don't want it to stay on the hard drive. I expect to have
about 20 data sets to qualify.
*** 3 million small freq tables to solve a problem sounds like you are asking
people on the list to speed up a small difficulty in a Rube Goldberg
solution to some problem. You would probably get better results by telling
the problem rather than the method of solution.
*** Incidentally I am in the process of writing a macro to do something
where the specs said to produce one freq table for each replicate weight.
We typically work with 50 to 200 replicate weights. In my macro, the basic
counting is done in one DATA step. The 50 to 200 frequency tables are not
produced.
*** Do you mean you are looking for 20 freq tables out of 3 million with
some property like there is a positive count? Once again I get the feeling
that if one knew the problem a good solution might follow.
3. I know we don't have a column-wise Max function in SAS that would let me
condition the data sets directly. The max function in SAS essentially works
across variables. This does not seem to be a problem in this
application. I sort the freq table by the variable bribe_dollar (it is
only 17k, come on), so the max value of bribe_dollar, unique or not,
is always on the last observation, and if last.bribe_dollar.......so
on.
*** If the key count is on the last observation then just read the last
observation.
data _null_ ;
   /* nobs= is set at compile time, so point=nobs reads only the last obs */
   set myfreq point=nobs nobs=nobs ;
   /* <check for property and set up action> */
   stop ;
run ;
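*** As a sketch only: suppose the table is WORK.MYFREQ, already sorted by
bribe_dollar, and a table qualifies when its largest bribe_dollar reaches
some cutoff. The cutoff of 100, the macro variable names CUTOFF and DROP_IT,
and the macro name DROPIF are all made up for illustration. The check and the
delete might then look something like this:

```sas
%let cutoff = 100 ;   /* hypothetical threshold, not from the original post */

data _null_ ;
   /* read only the last observation, which holds the maximum after sorting */
   set myfreq point=nobs nobs=nobs ;
   /* DROP_IT becomes 1 when the maximum fails the test, otherwise 0 */
   call symputx ( "drop_it" , bribe_dollar < &cutoff ) ;
   stop ;
run ;

%macro dropif ;
   /* remove the table from WORK only when it did not qualify */
   %if &drop_it %then %do ;
      proc datasets lib = work nolist ;
         delete myfreq ;
      quit ;
   %end ;
%mend dropif ;
%dropif
```

This assumes the table is never empty, as stated in point 1; with zero
observations the POINT= read would fail.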
*** If you are in fact looking for 20 tables out of 3 million according to
some property, my gut guess is that there is some way to eliminate 2.9
million of them before generation rather than after. I probably could not
give the method when I found out the problem, but I would be very surprised
if no one on this list could provide the method. Frequency tables of
17K for the cross tab of two variables, in which one is always 1, suggest
that looking at the appropriate distribution information first might
eliminate the need for producing the frequency table.
By the way, the holiday is Thanksgiving. Thanks for giving me so much
attention during the past two and a half years.
Paula D
John.W@MEDISCIENCE.CO.UK (John Whittington) wrote in message
news:<5.1.0.14.2.20021126193034.030174a0@pop3.powernet.co.uk>...
> At 14:00 26/11/02 -0500, Ian Whitlock wrote:
>
> >If you have to know the maximum value, then I agree the only way is to
> >find the maximum value. I doubt if there is any solution that would work in
> >general.
>
> If one has no additional information about the dataset, beyond the
> observations it contains, then there quite clearly can be no way of
> determining the maximum value without 'examining' each and every value at
> least once - be that 'directly' or in the name of a sort.
>
> >I do remember several years ago that Karsten Self expressed surprise that
> >in a data set indexed by a variable, SQL did not at that time use the
> >index information to obtain the maximum value. A birdie indicated he
> >would look into it, but I don't know if anything was implemented for
> >this case.
>
> Yes, I think I remember that. Of course, that index would represent one
> manifestation of the 'additional information' of the data to which I
> referred.
>
> >If you do not need the precise maximum and know something about the
> >distribution you should be able to get an estimate.
>
> Sure. Having (or taking) some sort of sample, and 'knowing (or assuming)
> something about the distribution' is obviously the very basis of (more
> or less a definition of) estimation. Of course, in terms of the issue
> we've been talking about, estimation does not come into it. If one has a
> criterion based on 'any value being >x', then one merely has to look for
> the first instance (which could be the first examined) of a value >x to
> know that the criterion has been fulfilled. As you've said, if there are
> such values within the dataset, then one would be very unlucky to have to
> examine every single one before finding the one 'which was enough'.
>
>
> >In one of the few good statistics books that I have met, I was impressed
> >by the introduction where the author said that in World War II the allies
> >got a pretty good estimate of the number of German tanks by taking a
> >visible sample of tank numbers and assuming that the tanks had been
> >numbered from one to the maximum.
>
> Yes, that's a very famous story, and you'll find it (or variants of it)
> quoted in the elementary chapters of very many statistical texts. The
> 'answer' expected varies according to the level of the students. At the
> lowest level, most would suggest that the best estimate of the total
> number (assuming consecutive numbering from 1 upwards) would be double
> the mean of the serial numbers of the observed sample (a fairly intuitive
> solution) - but, at a slightly more advanced level, it's easy to show
> that there is a better estimate than that ('double the mean' can
> obviously give some stupid answers - e.g. the estimated total can be less
> than the highest observed serial number!). However, everything obviously
> depends upon the observed tanks being a random sample of those around -
> if one's sample was based upon having stumbled across a collection of
> 'very early' or 'very late' tanks, then the estimate could obviously be
> drastically wrong!
>
> Kind Regards,
>
>
> John
>
> ----------------------------------------------------------------
> Dr John Whittington, Voice: +44 (0) 1296 730225
> Mediscience Services Fax: +44 (0) 1296 738893
> Twyford Manor, Twyford, E-mail: John.W@mediscience.co.uk
> Buckingham MK18 4EL, UK mediscience@compuserve.com
> ----------------------------------------------------------------