LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (July 2005, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 13 Jul 2005 18:00:08 -0400
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: Views and passes (was RE: Output last record of fantom by
              group)
Comments: To: "Pardee, Roy" <pardee.r@ghc.org>
Content-Type: text/plain; charset="us-ascii"

Roy (and others entangled in this thread): So far as I know a view reads and writes data just like a Data step or query, but only after being invoked by another Data step or query. I've expanded Paul's examples to illustrate the time required for each process (in real time seconds):

Data step 1 Data step 2 SQL Query sort 8:14 8:14 hash order 12.50 select 21.40 13.53 8.00 ------- -------- ------- 29.54 21.67 20.50

I've fattened up the rows of data by adding a long text variable to each row. In practical situations, we tend to select more than key values.

A sort orders the dataset prior to the Data steps that rely on ordering properties. A slightly modified hash routine selects and orders a list of ID,date pairs that the SQL query uses to select required records. In general the more rows, the faster the hash routine will work relative to a physical sort. The Data step that groups by date clearly takes substantially longer than either the Data step or SQL query that groups by year.

I would expect to see some variation in the difference between the Data step 2 and SQL query times. Taking time required for data ordering into account, a Data step processing a Data step view and the SQL query work equally fast.

For this type of data subsetting problem, the idea of using a hash selector to select a subset of key values, then using the subset of key values to select rows from a dataset, has interesting implications for SAS SQL query optimization. I wonder about hashing pointers to observations in SAS datasets ....

I've posted relevant sections of the log below:

294 data have ; 295 do date = -1e4 to 1e4 by 1 ; 296 do id = 1 to 1e2 ; 297 stuff=repeat('X',999); 298 output ; 299 end ; 300 end ; 301 run ;

INFO: Character variables have defaulted to a length of 200 at the places given by: (Line):(Column). Truncation may result. 297:12 stuff NOTE: The data set WORK.HAVE has 2000100 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 54.25 seconds cpu time 13.67 seconds

302 proc sort data=have; 303 by id date; 304 run;

NOTE: There were 2000100 observations read from the data set WORK.HAVE. NOTE: SAS sort was used. NOTE: The data set WORK.HAVE has 2000100 observations and 3 variables. NOTE: PROCEDURE SORT used (Total process time): real time 8:14.01 cpu time 46.65 seconds

305 data out_g ; 306 format date year4. ; 307 set have ; 308 by id date groupformat ; 309 if first.date ; 310 run ;

NOTE: There were 2000100 observations read from the data set WORK.HAVE. NOTE: The data set WORK.OUT_G has 5600 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 21.40 seconds cpu time 5.73 seconds

311 312 data _null_ ; 313 dcl hash q (hashexp:16, ordered:'y') ; 314 q.definekey ('id', 'yr') ; 315 q.definedata ('id', 'yr', 'date') ; 316 q.definedone () ; 317 do until ( eof ) ; 318 set have end = eof ; 319 yr = year (date) ; 320 q.replace () ; 321 end ; 322 q.output (dataset: 'out_h') ; 323 stop ; 324 run ;

NOTE: The data set WORK.OUT_H has 5600 observations and 3 variables. NOTE: There were 2000100 observations read from the data set WORK.HAVE. NOTE: DATA statement used (Total process time): real time 12.50 seconds cpu time 5.76 seconds

325 proc print data=out_h (obs=10); 326 run;

NOTE: There were 10 observations read from the data set WORK.OUT_H. NOTE: PROCEDURE PRINT used (Total process time): real time 0.28 seconds cpu time 0.03 seconds

327 328 proc sql ; 329 create view with_year as 330 select id 331 , year (date) as yr 332 from have 333 quit ; NOTE: SQL view WORK.WITH_YEAR has been defined. 334 NOTE: PROCEDURE SQL used (Total process time): real time 0.34 seconds cpu time 0.00 seconds

335 data out_v ; 336 set with_year ; 337 by id yr ; 338 if last.yr ; 339 run ;

NOTE: There were 2000100 observations read from the data set WORK.HAVE. NOTE: There were 2000100 observations read from the data set WORK.WITH_YEAR. NOTE: The data set WORK.OUT_V has 5600 observations and 2 variables. NOTE: DATA statement used (Total process time): real time 13.53 seconds cpu time 8.43 seconds

340 341 proc sql ; 342 create table out as 343 select t1.* 344 from have as t1 inner join out_h as t2 345 on t1.ID=t2.ID and t1.date=t2.date 346 ; NOTE: Table WORK.OUT created, with 5600 rows and 3 columns.

347 quit ; NOTE: PROCEDURE SQL used (Total process time): real time 8.00 seconds cpu time 3.43 seconds

-----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Pardee, Roy Sent: Tuesday, July 12, 2005 2:20 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Views and passes (was RE: Output last record of fantom by group)

So a dstep view doesn't get 'realized' when it's used, so you really do save a pass through the data?

My expectation would be that a SQL view *would* get realized, so the savings would be illusory. Anybody know for sure? <snip>

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Dale McLerran Sent: Tuesday, July 12, 2005 11:09 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Output last record of fantom by group

Michael,

A data step view is perfect for this sort of problem. You describe year as a fantom (sic) variable. Well, a view creates phantom (virtual) data.

/* Add year to view. Note that view does not execute until it */ /* is invoked in the next data step. The view is virtual data */ data addyear / view=addyear; set have; year = year(date); run;

/* Set view by permno and year. Year is created as the data */ /* in have are read by the view but before the data are made */ /* available to the PDV of the current data step. Thus, we */ /* can set the data by year. */ data want; set addyear; by permno year; if last.year; run;

Dale

--- Michael Murff <mjm33@MSM1.BYU.EDU> wrote:

> Hi SAS-L, > > > > I want the last record of each id/year as shown below; year is a sort > of fantom by-group since it only exists within another variable, date.

> The dataset "have" is 6 gigs+ so I would prefer to only pass through > the data > once. The dataset "have" is sorted by permno and date. Obviously, I > could > create a year variable with one pass and then, in a subsequent step, > use a > subsetting if statement to conditionally output last.year, but given > the > size of have this is undesirable. Any ideas? > > > > TIA, > > > > Michael Murff > > Provo, UT > > > > > > > > > > data have; > > input id $ date date9. type $; > > format date date9.; > > cards; > > a 01jan2001 F > > a 02jan2001 M > > a 03jan2001 L > > a 01jan2002 F > > a 02jan2002 L > > b 01jan2001 F > > b 02jan2001 M > > b 03jan2001 L > > b 01jan2002 F > > b 02jan2002 L > > b 03jan2003 F > > b 04jan2003 M > > b 05jan2003 M > > b 06jan2003 M > > b 07jan2003 L > > ; > > run; > > > > > > data want; > > input id $ date date9. type $; > > format date date9.; > > cards; > > a 03jan2001 L > > a 02jan2002 L > > b 03jan2001 L > > b 02jan2002 L > > b 07jan2003 L > > ; > > run; >

--------------------------------------- Dale McLerran Fred Hutchinson Cancer Research Center mailto: dmclerra@NO_SPAMfhcrc.org Ph: (206) 667-2926 Fax: (206) 667-5977 ---------------------------------------

____________________________________________________ Sell on Yahoo! Auctions - no fees. Bid on great items. http://auctions.yahoo.com/

-----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Pardee, Roy Sent: Tuesday, July 12, 2005 2:20 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Views and passes (was RE: Output last record of fantom by group)

So a dstep view doesn't get 'realized' when it's used, so you really do save a pass through the data?

My expectation would be that a SQL view *would* get realized, so the savings would be illusory. Anybody know for sure?

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Dale McLerran Sent: Tuesday, July 12, 2005 11:09 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Output last record of fantom by group

Michael,

A data step view is perfect for this sort of problem. You describe year as a fantom (sic) variable. Well, a view creates phantom (virtual) data.

/* Add year to view. Note that view does not execute until it */ /* is invoked in the next data step. The view is virtual data */ data addyear / view=addyear; set have; year = year(date); run;

/* Set view by permno and year. Year is created as the data */ /* in have are read by the view but before the data are made */ /* available to the PDV of the current data step. Thus, we */ /* can set the data by year. */ data want; set addyear; by permno year; if last.year; run;

Dale

--- Michael Murff <mjm33@MSM1.BYU.EDU> wrote:

> Hi SAS-L, > > > > I want the last record of each id/year as shown below; year is a sort > of fantom by-group since it only exists within another variable, date.

> The dataset "have" is 6 gigs+ so I would prefer to only pass through > the data > once. The dataset "have" is sorted by permno and date. Obviously, I > could > create a year variable with one pass and then, in a subsequent step, > use a > subsetting if statement to conditionally output last.year, but given > the > size of have this is undesirable. Any ideas? > > > > TIA, > > > > Michael Murff > > Provo, UT > > > > > > > > > > data have; > > input id $ date date9. type $; > > format date date9.; > > cards; > > a 01jan2001 F > > a 02jan2001 M > > a 03jan2001 L > > a 01jan2002 F > > a 02jan2002 L > > b 01jan2001 F > > b 02jan2001 M > > b 03jan2001 L > > b 01jan2002 F > > b 02jan2002 L > > b 03jan2003 F > > b 04jan2003 M > > b 05jan2003 M > > b 06jan2003 M > > b 07jan2003 L > > ; > > run; > > > > > > data want; > > input id $ date date9. type $; > > format date date9.; > > cards; > > a 03jan2001 L > > a 02jan2002 L > > b 03jan2001 L > > b 02jan2002 L > > b 07jan2003 L > > ; > > run; >

--------------------------------------- Dale McLerran Fred Hutchinson Cancer Research Center mailto: dmclerra@NO_SPAMfhcrc.org Ph: (206) 667-2926 Fax: (206) 667-5977 ---------------------------------------

____________________________________________________ Sell on Yahoo! Auctions - no fees. Bid on great items. http://auctions.yahoo.com/


Back to: Top of message | Previous page | Main SAS-L page