Date: Tue, 25 Mar 1997 15:45:11 +0000
Reply-To: John Whittington <johnw@MAG-NET.CO.UK>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: John Whittington <johnw@MAG-NET.CO.UK>
Subject: Re: How to calculate sum
Content-Type: text/plain; charset="us-ascii"
On Fri, 14 Mar 1997, Ian Whitlock <whitloi1@WESTATPO.WESTAT.COM> posted a
solution (attached below) to the question by Junjia <JUNJIA@MORST.GOVT.NZ>
regarding summation across observations and determination of whether any two
observations accounted for 80% or more of the total for any variable.
Since SAS DATA steps process data one observation at a time, it is usually
much simpler to undertake arithmetic summary activities across variables
(within observations) than across observations. In terms of coding
simplicity, it is therefore often better to TRANSPOSE data for such
exercises. The exercise in question is also aided by the SAS function
ORDINAL(), which returns the kth smallest of its arguments and so yields the
largest and 2nd largest values of an n-variable list when k is n and n-1.
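As a minimal sketch of how ORDINAL() behaves (the names v1-v5, largest and
second, and the five values, are invented purely for illustration), the
following DATA step picks out the two largest of five values. ORDINAL()
counts missing values as the smallest, so the top two positions are
unaffected provided at least two values are non-missing.

```sas
/* illustrative only: v1-v5, largest and second are hypothetical names */
data _null_ ;
   array v(5) v1-v5 (3 9 1 7 5) ;             /* initial values 3 9 1 7 5 */
   largest = ordinal(dim(v),   of v1-v5) ;    /* 5th smallest of 5 = maximum */
   second  = ordinal(dim(v)-1, of v1-v5) ;    /* 4th smallest of 5 = 2nd largest */
   put largest= second= ;                     /* writes largest=9 second=7 to the log */
run ;
```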
The following is much briefer (hence more rapidly written and debugged) code
to achieve identical results to Ian's code. The PROC TRANSPOSE step clearly
adds to overall execution time, but this is more than offset by simpler
arithmetic/logic statements and by determining the number of variables
*within* the same DATA step, rather than by creating a macro variable in a
separate DATA step.
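%let data = w ;
/* one row per original variable, one COLn column per original observation */
proc transpose data=&data out=johns ; run ;
data done (keep = _NAME_) ;
   set johns ;
   array _n(*) _numeric_ ;   /* COL1-COLn, so dim(_n) = number of observations */
   /* keep variables whose two largest values reach 80% of the total */
   if ordinal(dim(_n), of _numeric_)
      + ordinal(dim(_n)-1, of _numeric_) >= .8 * sum(of _numeric_) ;
run ;
proc print data=done ; run ;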
Using Ian's test dataset (and with the same proviso about all values being
positive), this produces identical output to his much more lengthy code -
and, in fact, even executes more quickly than Ian's code on my system:
                     IAN'S                MINE
preliminary step     0.16 secs (DATA)     0.17 secs (TRANSPOSE)
main DATA step       0.55 secs            0.17 secs
PROC PRINT           0.11 secs            0.11 secs
----------------------------------------------------------------
TOTAL:               0.82 secs            0.45 secs
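The transposed logic can also be checked by hand on a tiny dataset. The
sketch below assumes a made-up dataset SMALL with variables x, y and z and
output datasets T and FLAGGED (none of these names or values come from Ian's
test data), and otherwise repeats the same steps as the code above.

```sas
/* illustrative only: SMALL, T, FLAGGED and the data values are hypothetical */
data small ;
   input x y z ;
   datalines ;
90 10  1
 5 10  1
 3 10  1
 2 10 98
;
run ;

/* one row per variable (x, y, z), columns COL1-COL4 hold the four records */
proc transpose data=small out=t ; run ;

data flagged (keep = _NAME_) ;
   set t ;
   array _n(*) _numeric_ ;
   /* x: 90+5  = 95 >= .8*100 = 80    -> kept    */
   /* y: 10+10 = 20 <  .8*40  = 32    -> dropped */
   /* z: 98+1  = 99 >= .8*101 = 80.8  -> kept    */
   if ordinal(dim(_n), of _numeric_)
      + ordinal(dim(_n)-1, of _numeric_) >= .8 * sum(of _numeric_) ;
run ;

proc print data=flagged ; run ;
```

Only the rows for x and z should survive the subsetting IF, which matches the
arithmetic shown in the comments.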
As always, this illustrates the diversity of possible approaches to the same
problem when using SAS.
Regards
John
----------Ian Whitlock's previous solution ----------
> Subject: How to calculate sum
> Summary: Save the two biggest values and check them out.
> Respondent: Ian Whitlock <whitloi1@westat.com>
>
> Junjia <JUNJIA@MORST.GOVT.NZ> asks:
>
> >I have a dataset with 100 variables and 2000 records. I want to calculate
> >the total over the 2000 records for each variable, and would like to check,
> >for each variable, whether any two of the 2000 records account for 80% of
> >the total.
>
> First off I hope all the values are non-negative. With negative values
> even two numbers close to 0 might account for 80% of the sum (100 -100
> 1 1). With non-negative values, the two biggest are the only
> candidates.
>
> It is tempting to sort on each variable and add the top two values but
> that would mean *two-hundred* steps. With arrays one can do it in one
> step. Store the two biggest values and sum each variable. Then at the
> end of file check the condition for each variable and output the names
> of variables meeting the condition.
>
> /* generate test data */
> data w ( drop = i j ) ;
>    array y ( * ) a1 - a10 b1 - b10 c1 - c30 ; /* 50 vars */
>    do j = 1 to dim ( y ) ; y ( j ) = 5 ; end ; output ;
>    do i = 1 to 4 ; /* a little short of 2000 */
>       do j = 1 to dim ( y ) ;
>          y ( j ) = ranuni (2947561) * 4.2 ;
>       end ;
>       output ;
>    end ;
> run ;
>
> %let data = w ; /* setup problem */
> /* get array size */
> data _null_ ;
>    if 0 then set &data ;
>    array y (*) _numeric_ ;
>    call symput ( 'n' , left ( put ( dim ( y ) , 4. ) ) ) ;
>    stop ;
> run ;
>
> data wanted ( keep = name ) ;
>    length name $ 8 ;
>    set &data end = eof ;
>    array _y (*) _numeric_ ; /* the values */
>    array _m (%eval(2*(&n))) ; /* hold top two values */
>    array _s (&n) ; /* hold sum of values */
>    retain _m _s ;
>
>    /* save two biggest values for each variable */
>    do i = 1 to dim ( _y ) ;
>       _s ( i ) + _y ( i ) ;
>       if _y ( i ) >= _m ( 2 * i - 1 ) then
>       do ; /* new maximum */
>          _m ( 2 * i ) = _m ( 2 * i - 1 ) ;
>          _m ( 2 * i - 1 ) = _y ( i ) ;
>       end ;
>       else
>       if _y ( i ) >= _m ( 2 * i ) then /* new sub-maximum */
>          _m ( 2 * i ) = _y ( i ) ;
>    end ;
>
>    /* report at end of file */
>    if eof then
>    do ;
>       do i = 1 to dim ( _y ) ;
>          if _m ( 2 * i - 1 ) + _m ( 2 * i ) >= .8 * _s ( i ) then
>          do ;
>             call vname ( _y ( i ) , name ) ;
>             output ;
>          end ;
>       end ;
>    end ;
> run ;
>
> proc print data = wanted ; run ;
>
> Ian Whitlock
>
--------------------------------------------
Regards,
John
-----------------------------------------------------------
Dr John Whittington, Voice: +44 1296 730225
Mediscience Services Fax: +44 1296 738893
Twyford Manor, Twyford, E-mail: johnw@mag-net.co.uk
Buckingham MK18 4EL, UK CompuServe: 100517,3677
-----------------------------------------------------------