Date: Thu, 27 Jun 1996 01:52:01 +0100
Reply-To: John Whittington <johnw@MAG-NET.CO.UK>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: John Whittington <johnw@MAG-NET.CO.UK>
Subject: Re: Subsetting OBS from a large dataset
On Wed, 26 Jun 1996, W HU <whu@UVIC.CA> wrote:
>I have a large SAS dataset containing about 4 million records. I want to
>subset some records from it in the way that every the one fifth (or
>other proportions) record will be extracted. To illustrate, suppose there
>are 21 obs, I want to extract the 5th, 10th, 15th, and the 20th obs into
>the sub-dataset.
>
>What I do now to solve this problem is that I get the total number of
>OBS first, then use this total number divided by 5 to get the ranking of
>those obs to be extracted. It works well. However, this is not an efficient
>way if the dataset is too large. I am looking for a solution which can do
>the same job with no need for pre-defined total number of obs.
Weimin, I'm not sure that I completely understand what you want to achieve,
and am by no means sure whether either of the solutions I have seen posted
actually corresponds to what you want! In the example you give, 21 obs
divided by 5 gives 4.2, which you presumably round down to 4, but then I'm
not sure how you translate that into the need for the 5th, 10th, 15th and
20th obs to be selected.

My initial interpretation (which I suspect is also wrong!) is that (using
your example of 5) you wanted to select the observations which were one fifth,
two fifths etc. of the way through the dataset - so that you would actually
always end up with 5 observations being selected, with the last one being
between 0 and 4 observations from the end of the dataset. On that basis, the
following code would work:
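
data minitest ;                      /* test dataset of 59 observations */
   do x = 1 to 59 ; output ; end ;
run ;

data subset (drop = num increm) ;
   retain num increm ;
   if _n_ = 1 then do ;
      num = total ;                  /* NOBS= value is set at compile time */
      increm = floor( num / 5 ) ;    /* change '5' as desired, or use macrovar */
   end ;
   /* direct-access read of every increm-th observation */
   do i = increm to num by increm ;
      set minitest nobs=total point=i ;
      output ;
   end ;
   stop ;                            /* needed with POINT=, else the step never ends */
run ;

proc print data=subset ; run ;
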
... which gives output:

   OBS     X
     1    11
     2    22
     3    33
     4    44
     5    55

On the other hand, if you removed the rounding FLOOR function, you would get
as close as possible to those one fifth, two fifth etc. points, with the
last observation selected being the final one in the dataset:

   OBS     X
     1    11
     2    23
     3    35
     4    47
     5    59

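For what it is worth, here is a sketch of that second variant which does the truncation explicitly, rather than relying on how POINT= treats a non-integer value. The names subset2 and j are just ones I have made up for illustration, and it assumes the dataset has at least 5 observations (otherwise the first pointer would be 0):

```sas
data subset2 (drop = j) ;
   do j = 1 to 5 ;                     /* change '5' as desired */
      i = floor( j * total / 5 ) ;     /* obs nearest below the j-th fifth */
      set minitest nobs=total point=i ;
      output ;
   end ;
   stop ;
run ;
```

With the 59-observation test dataset this selects observations 11, 23, 35, 47 and 59, as in the listing above.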
I suspect that neither of these is what you want. If you can clarify your
requirement, I suspect that the above code can be adapted to suit.
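
As one possible adaptation, if the literal reading of your example (the 5th, 10th, 15th, 20th ... obs) turns out to be what is wanted, something along the following lines might do. It is only a sketch: the name every5th is made up, and it assumes an ordinary disk dataset for which NOBS= and POINT= are available. NOBS= is picked up when the step is compiled, so no separate pass is needed to count the observations:

```sas
data every5th ;
   do i = 5 to total by 5 ;            /* change both 5s to take every k-th obs */
      set minitest nobs=total point=i ;
      output ;
   end ;
   stop ;                              /* needed with POINT=, else the step never ends */
run ;
```

A sequential alternative would be an ordinary SET followed by a subsetting IF on mod(_n_, 5) = 0, which reads every record but needs no count at all.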
John
-----------------------------------------------------------
Dr John Whittington, Voice: +44 1296 730225
Mediscience Services Fax: +44 1296 738893
Twyford Manor, Twyford, E-mail: johnw@mag-net.co.uk
Buckingham MK18 4EL, UK CompuServe: 100517,3677
-----------------------------------------------------------