Date: Fri, 5 Jun 1998 04:07:29 -0400
Reply-To: "Kott, Matt" <MKott@MOBILNET.GTE.COM>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "Kott, Matt" <MKott@MOBILNET.GTE.COM>
Subject: Re: Re[5]: sort by format
SAS-Lr's
I'm not much of a SAS person (yet), but I have some experience with sql, and
thought I might offer another suggestion to what I think was the original
issue - being able to sort by state name when the data only included a state
code of some sort. I haven't tested this, but I'm sure you'll all have a
good idea if it would work any more efficiently than the slower running sql
code already offered by a few.
What about creating (if not already available somewhere) another "lookup"
dataset that would contain an obs for each state and contain the variables
"state_cd" and "state_nm" that could then be used by way of a join for
sorting on the name and not code.
Example (states is lookup ds):
proc sql;
create table new1 as
select test.*
from test, states
where test.state_cd = states.state_cd
order by states.state_nm;
This, of course wouldn't include the name of the state in the result, but
could include it if needed. If I'm correct in what I'm thinking, wouldn't
this then leave you with a sorted table in state name order? And - more
importantly - wouldn't this then run much quicker since there is no
reformatting of any vars?
I'm interested to know what you all think.
Matt Kott
mkott@mobilnet.gte.com
> -----Original Message-----
> From: HERMANS1 [SMTP:HERMANS1@WESTAT.COM]
> Sent: Thursday, June 04, 1998 8:11 PM
> To: SAS-L@UGA.CC.UGA.EDU
> Subject: Re: Re[5]: sort by format
>
> John:
>
> In my tests of a SQL "formatted" sort against a data step/proc sort,
> the latter runs about 30% faster on both elapsed time and CPU time.
> Not as great a difference as you found, but it shows a substantial
> difference nonetheless. The Westat SQL side begrudgingly concedes one
> to the Mediscience Datasteppers. Sig
>
> 22 proc sql;
> 23 create table new1 as
> 24 select *
> 25 from test
> 26 order by put(x, tryfmt.)
> 27 ;
> NOTE: The query as specified involves ordering by an item that doesn't
> appear in its SELECT clause.
> NOTE: Table WORK.NEW1 created, with 1000000 rows and 2 columns.
>
> 28 quit;
> NOTE: PROCEDURE SQL used:
> real time 3:30.62
> cpu time 2:04.22
>
>
> 29 data test2 ;
> 30 set test ;
> 31 y = put(x, tryfmt.) ;
> 32 run ;
>
> NOTE: The data set WORK.TEST2 has 1000000 observations and 3
> variables.
> NOTE: DATA statement used:
> real time 28.64 seconds
> cpu time 20.42 seconds
>
>
> 33 proc sort data = test2 out = new2 (drop = y) ;
> 34 by y ;
> 35 run ;
>
> NOTE: The data set WORK.NEW2 has 1000000 observations and 2 variables.
> NOTE: PROCEDURE SORT used:
> real time 1:58.44
> cpu time 1:07.42
>
>
> ______________________________ thread _______________________________
> At 16:10 04/06/98 -0400, HERMANS1 responded to my earlier post with:
>
> >Please remind us why the simple and ordinary SQL solution to this
> >problem of sorting by formatted values does not suffice.
>
> Sig, I'm sure it probably does suffice; at the time I wrote the message
> to
> which you are responding, no-one involved in the thread had suggested the
> SQL method!
>
> >Although
> >this long thread has produced some interesting observations on
> >formatted values in Base SAS, I for one have lost track of the main
> >topic.
>
> Are you sure that you are not confusing different threads? Unless I've
> been
> missing posts, this thread has not been at all 'long' (not by SSAS-L
> standards, anyway!) and hasn't significantly strayed from the initial
> question - which was the one sentence query ... " Is there a simple way to
> sort a dataset by a specified variable's format instead of it's value? "
>
> >Anyone jumping into the discussion at this latest posting would
> >think that the SAS System has a critical defect.
>
> I'm not quite clear why you say that. In response to Jack's comment that
> he
> was surprised that PROC SORT didn't have an option to sort by formatted
> values, I was merely suggesting one possible reason why such an option may
> not have been implemented as yet. Nothing in my post to which you are
> responding mentioned, or even implied, a 'defect', let alone a 'critical
> one'.
>
> >But why would anyone
> >actually need either a sort procedure that works on formatted values
> >(other than a SQL ORDER BY clause) or a user-written program?
>
> I'm not sure what you're asking here. If you want to know the specific
> need
> that resulted in the initial query, you'd obviously have to ask the person
> who wrote it. If you are asking why anyone would need this option when
> they
> can achieve the functionality using SQL, the simply answer is obviously
> that
> they wouldn't 'need' it - but, come to that, the same argument says that
> they don't really need PROC SORT at all! I would also wonderabout the
> performance of the SQL method (see below).
>
> I can but presume that you're not asking why anyone would ever want to
> sort
> by formatted values, since it's something that I find myself wanting to do
> all the time - for example, I often want to PRINT variables ordered by
> formatted values, when they primarily exist as numerical codes. The
> approach I normally take is to create a second variable containing the
> formatted value, and then sort by that.
>
> As for performance, the following log shows the comparison of the SQL and
> DATA/SORT methods for a data set of 1 million observations, using a simple
> 3-level format:
>
>
> 707 proc sql;
> 708 create table new1 as
> 709 select *
> 710 from test
> 711 order by put(x, tryfmt.)
> 712 ;
> NOTE: The query as specified involves ordering by an item that doesn't
> appear in its
> SELECT clause.
> NOTE: Table WORK.NEW1 created, with 1000000 rows and 2 columns.
>
> 713 quit;
> NOTE: The PROCEDURE SQL used 3 minutes 36.24 seconds.
>
> 715 data test2 ;
> 716 set test ;
> 717 y = put(x, tryfmt.) ;
> 718 run ;
>
> NOTE: The data set WORK.TEST2 has 1000000 observations and 3 variables.
> NOTE: The DATA statement used 31.64 seconds.
>
> 719 proc sort data = test2 out = new2 (drop = y) ;
> 720 by y ;
> 721 run ;
>
> NOTE: The data set WORK.NEW2 has 1000000 observations and 2 variables.
> NOTE: The PROCEDURE SORT used 47.07 seconds.
>
> ... so SQL took around 216 seconds and the DATA/SORT about 79 seconds - in
> both cases with 7 'lines' of SAS code. As I said in the message to which
> you're replying, I would suspect that the reduced speed of the SQL method
> (as would be the case for a PROC SORT option which did format lookups 'on
> the fly') is that internal->formatted conversions have to be done every
> time
> a value is 'looked at' during the ordering/sorting process, whereas with
> the
> DATA/SORT approach, the conversion is only undertake once for each
> observation.
>
> Kind Regards,
>
> John
>
> ----------------------------------------------------------------
> Dr John Whittington, Voice: +44 (0) 1296 730225
> Mediscience Services Fax: +44 (0) 1296 738893
> Twyford Manor, Twyford, E-mail: medisci@powernet.com
> Buckingham MK18 4EL, UK mediscience@compuserve.com
> ----------------------------------------------------------------
|