Date: Wed, 13 Jul 2005 18:00:08 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Views and passes (was RE: Output last record of fantom by
group)
Content-Type: text/plain; charset="us-ascii"
Roy (and others entangled in this thread):
So far as I know a view reads and writes data just like a Data step or
query, but only after being invoked by another Data step or query. I've
expanded Paul's examples to illustrate the time required for each
process (in real time seconds):
Data step 1 Data step 2 SQL Query
sort 8:14 8:14
hash order 12.50
select 21.40 13.53 8.00
------- -------- -------
29.54 21.67 20.50
I've fattened up the rows of data by adding a long text variable to each
row. In practical situations, we tend to select more than key values.
A sort orders the dataset prior to the Data steps that rely on ordering
properties. A slightly modified hash routine selects and orders a list
of ID,date pairs that the SQL query uses to select required records. In
general the more rows, the faster the hash routine will work relative to
a physical sort. The Data step that groups by date clearly takes
substantially longer than either the Data step or SQL query that groups
by year.
I would expect to see some variation in the difference between the Data
step 2 and SQL query times. Taking time required for data ordering into
account, a Data step processing a Data step view and the SQL query work
equally fast.
For this type of data subsetting problem, the idea of using a hash
selector to select a subset of key values, then using the subset of key
values to select rows from a dataset, has interesting implications for
SAS SQL query optimization. I wonder about hashing pointers to
observations in SAS datasets ....
I've posted relevant sections of the log below:
294 data have ;
295 do date = -1e4 to 1e4 by 1 ;
296 do id = 1 to 1e2 ;
297 stuff=repeat('X',999);
298 output ;
299 end ;
300 end ;
301 run ;
INFO: Character variables have defaulted to a length of 200 at the
places given by:
(Line):(Column). Truncation may result.
297:12 stuff
NOTE: The data set WORK.HAVE has 2000100 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 54.25 seconds
cpu time 13.67 seconds
302 proc sort data=have;
303 by id date;
304 run;
NOTE: There were 2000100 observations read from the data set WORK.HAVE.
NOTE: SAS sort was used.
NOTE: The data set WORK.HAVE has 2000100 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 8:14.01
cpu time 46.65 seconds
305 data out_g ;
306 format date year4. ;
307 set have ;
308 by id date groupformat ;
309 if first.date ;
310 run ;
NOTE: There were 2000100 observations read from the data set WORK.HAVE.
NOTE: The data set WORK.OUT_G has 5600 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 21.40 seconds
cpu time 5.73 seconds
311
312 data _null_ ;
313 dcl hash q (hashexp:16, ordered:'y') ;
314 q.definekey ('id', 'yr') ;
315 q.definedata ('id', 'yr', 'date') ;
316 q.definedone () ;
317 do until ( eof ) ;
318 set have end = eof ;
319 yr = year (date) ;
320 q.replace () ;
321 end ;
322 q.output (dataset: 'out_h') ;
323 stop ;
324 run ;
NOTE: The data set WORK.OUT_H has 5600 observations and 3 variables.
NOTE: There were 2000100 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 12.50 seconds
cpu time 5.76 seconds
325 proc print data=out_h (obs=10);
326 run;
NOTE: There were 10 observations read from the data set WORK.OUT_H.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.28 seconds
cpu time 0.03 seconds
327
328 proc sql ;
329 create view with_year as
330 select id
331 , year (date) as yr
332 from have
333 quit ;
NOTE: SQL view WORK.WITH_YEAR has been defined.
334
NOTE: PROCEDURE SQL used (Total process time):
real time 0.34 seconds
cpu time 0.00 seconds
335 data out_v ;
336 set with_year ;
337 by id yr ;
338 if last.yr ;
339 run ;
NOTE: There were 2000100 observations read from the data set WORK.HAVE.
NOTE: There were 2000100 observations read from the data set
WORK.WITH_YEAR.
NOTE: The data set WORK.OUT_V has 5600 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 13.53 seconds
cpu time 8.43 seconds
340
341 proc sql ;
342 create table out as
343 select t1.*
344 from have as t1 inner join out_h as t2
345 on t1.ID=t2.ID and t1.date=t2.date
346 ;
NOTE: Table WORK.OUT created, with 5600 rows and 3 columns.
347 quit ;
NOTE: PROCEDURE SQL used (Total process time):
real time 8.00 seconds
cpu time 3.43 seconds
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Pardee, Roy
Sent: Tuesday, July 12, 2005 2:20 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Views and passes (was RE: Output last record of fantom by
group)
So a dstep view doesn't get 'realized' when it's used, so you really do
save a pass through the data?
My expectation would be that a SQL view *would* get realized, so the
savings would be illusory. Anybody know for sure?
<snip>
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Dale McLerran
Sent: Tuesday, July 12, 2005 11:09 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Output last record of fantom by group
Michael,
A data step view is perfect for this sort of problem. You describe year
as a fantom (sic) variable. Well, a view creates phantom
(virtual) data.
/* Add year to view. Note that view does not execute until it */
/* is invoked in the next data step. The view is virtual data */
data addyear / view=addyear;
set have;
year = year(date);
run;
/* Set view by permno and year. Year is created as the data */
/* in have are read by the view but before the data are made */
/* available to the PDV of the current data step. Thus, we */
/* can set the data by year. */
data want;
set addyear;
by permno year;
if last.year;
run;
Dale
--- Michael Murff <mjm33@MSM1.BYU.EDU> wrote:
> Hi SAS-L,
>
>
>
> I want the last record of each id/year as shown below; year is a sort
> of fantom by-group since it only exists within another variable, date.
> The dataset "have" is 6 gigs+ so I would prefer to only pass through
> the data
> once. The dataset "have" is sorted by permno and date. Obviously, I
> could
> create a year variable with one pass and then, in a subsequent step,
> use a
> subsetting if statement to conditionally output last.year, but given
> the
> size of have this is undesirable. Any ideas?
>
>
>
> TIA,
>
>
>
> Michael Murff
>
> Provo, UT
>
>
>
>
>
>
>
>
>
> data have;
>
> input id $ date date9. type $;
>
> format date date9.;
>
> cards;
>
> a 01jan2001 F
>
> a 02jan2001 M
>
> a 03jan2001 L
>
> a 01jan2002 F
>
> a 02jan2002 L
>
> b 01jan2001 F
>
> b 02jan2001 M
>
> b 03jan2001 L
>
> b 01jan2002 F
>
> b 02jan2002 L
>
> b 03jan2003 F
>
> b 04jan2003 M
>
> b 05jan2003 M
>
> b 06jan2003 M
>
> b 07jan2003 L
>
> ;
>
> run;
>
>
>
>
>
> data want;
>
> input id $ date date9. type $;
>
> format date date9.;
>
> cards;
>
> a 03jan2001 L
>
> a 02jan2002 L
>
> b 03jan2001 L
>
> b 02jan2002 L
>
> b 07jan2003 L
>
> ;
>
> run;
>
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
____________________________________________________
Sell on Yahoo! Auctions - no fees. Bid on great items.
http://auctions.yahoo.com/
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Pardee, Roy
Sent: Tuesday, July 12, 2005 2:20 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Views and passes (was RE: Output last record of fantom by
group)
So a dstep view doesn't get 'realized' when it's used, so you really do
save a pass through the data?
My expectation would be that a SQL view *would* get realized, so the
savings would be illusory. Anybody know for sure?
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Dale McLerran
Sent: Tuesday, July 12, 2005 11:09 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Output last record of fantom by group
Michael,
A data step view is perfect for this sort of problem. You describe year
as a fantom (sic) variable. Well, a view creates phantom
(virtual) data.
/* Add year to view. Note that view does not execute until it */
/* is invoked in the next data step. The view is virtual data */
data addyear / view=addyear;
set have;
year = year(date);
run;
/* Set view by permno and year. Year is created as the data */
/* in have are read by the view but before the data are made */
/* available to the PDV of the current data step. Thus, we */
/* can set the data by year. */
data want;
set addyear;
by permno year;
if last.year;
run;
Dale
--- Michael Murff <mjm33@MSM1.BYU.EDU> wrote:
> Hi SAS-L,
>
>
>
> I want the last record of each id/year as shown below; year is a sort
> of fantom by-group since it only exists within another variable, date.
> The dataset "have" is 6 gigs+ so I would prefer to only pass through
> the data
> once. The dataset "have" is sorted by permno and date. Obviously, I
> could
> create a year variable with one pass and then, in a subsequent step,
> use a
> subsetting if statement to conditionally output last.year, but given
> the
> size of have this is undesirable. Any ideas?
>
>
>
> TIA,
>
>
>
> Michael Murff
>
> Provo, UT
>
>
>
>
>
>
>
>
>
> data have;
>
> input id $ date date9. type $;
>
> format date date9.;
>
> cards;
>
> a 01jan2001 F
>
> a 02jan2001 M
>
> a 03jan2001 L
>
> a 01jan2002 F
>
> a 02jan2002 L
>
> b 01jan2001 F
>
> b 02jan2001 M
>
> b 03jan2001 L
>
> b 01jan2002 F
>
> b 02jan2002 L
>
> b 03jan2003 F
>
> b 04jan2003 M
>
> b 05jan2003 M
>
> b 06jan2003 M
>
> b 07jan2003 L
>
> ;
>
> run;
>
>
>
>
>
> data want;
>
> input id $ date date9. type $;
>
> format date date9.;
>
> cards;
>
> a 03jan2001 L
>
> a 02jan2002 L
>
> b 03jan2001 L
>
> b 02jan2002 L
>
> b 07jan2003 L
>
> ;
>
> run;
>
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
____________________________________________________
Sell on Yahoo! Auctions - no fees. Bid on great items.
http://auctions.yahoo.com/