LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (March 2003, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 19 Mar 2003 20:57:40 -0800
Reply-To:   shiling zhang <shiling99@YAHOO.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   shiling zhang <shiling99@YAHOO.COM>
Organization:   http://groups.google.com/
Subject:   Re: Help with making SQL/Data Step More Efficient - Was Workaround for No Resources
Content-Type:   text/plain; charset=ISO-8859-1

Doing a Cartesian product is really painful. If your query is a small data set against a big one, then you may consider loading the small data set into memories, and then matching output.

The following is an idea which shows you may save some run-time.

Note: you may combine your second data step into ONE and even save more time.

HTH

344 data t1; 345 do i1 = 50 to 400; 346 output; 347 end; 348 run;

NOTE: The data set WORK.T1 has 351 observations and 1 variables. NOTE: DATA statement used: real time 0.00 seconds

349 350 data t2; 351 do i2 = 3 to 40000; 352 output; 353 end; 354 run;

NOTE: The data set WORK.T2 has 39998 observations and 1 variables. NOTE: DATA statement used: real time 0.05 seconds

355 356 proc sql; 357 create table t3 as 358 select i1, i2 359 from t1, t2 360 where i1<i2 361 ; NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: Table WORK.T3 created, with 13961025 rows and 2 columns.

362 quit; NOTE: PROCEDURE SQL used: real time 33.87 seconds

363 364 data t4; 365 array i(100000) _temporary_; 366 do until(end1); 367 set t1 end=end1; 368 n+1; 369 i(n)=i1; 370 end; 371 do until(end2); 372 set t2 end=end2; 373 do j=1 to n; 374 if i(j) <i2 then output; 375 end; 376 end; 377 drop j n; 378 run;

NOTE: There were 351 observations read from the data set WORK.T1. NOTE: There were 39998 observations read from the data set WORK.T2. NOTE: The data set WORK.T4 has 13961025 observations and 2 variables. NOTE: DATA statement used: real time 26.75 seconds

379 proc sql; 380 create table t3 as 381 select i1, i2 382 from t1, t2 383 where i1<i2 384 ; NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized. NOTE: Table WORK.T3 created, with 13961025 rows and 2 columns.

385 quit; NOTE: PROCEDURE SQL used: real time 36.35 seconds

386 387 data t4; 388 array i(100000) _temporary_; 389 do until(end1); 390 set t1 end=end1; 391 n+1; 392 i(n)=i1; 393 end; 394 do until(end2); 395 set t2 end=end2; 396 do j=1 to n; 397 if i(j) <i2 then output; 398 end; 399 end; 400 drop j n; 401 run;

NOTE: There were 351 observations read from the data set WORK.T1. NOTE: There were 39998 observations read from the data set WORK.T2. NOTE: The data set WORK.T4 has 13961025 observations and 2 variables. NOTE: DATA statement used: real time 23.06 seconds

yzg9@CDC.GOV (Gerstle, John) wrote in message news:<8726523B5547D711A0D500508B6686B257252E@MCDC-ATL-43>... > Puddin', > > >> Why must you run "where t1.lineID < t2.LineID" in the SQL step? > >> > >> "where t1.lineID = t2.LineID" would save your po' system > >> lots of resources. > >> > > This line is in there to make sure that every record is matched to every > other record. It doesn't need to be matched with itself (would be a little > redundant, methinks). > > Thanks for your code. I'd been thinking about using a view but had't the > time to research it enough. The online doc's suggested that a data step > view would not be useful in this situation (at least that's how I understood > it). But I had not read about the SQL view. I seem to remember discussions > about this awhile back (years...) but didn't find anything searching the > archives. > > I'm going to give it a go with the suggestions so far and see what I get... > > John Gerstle > CDC Information Technological Support Contract (CITS) > Biostatistician > > > >> -----Original Message----- > >> From: Puddin' Man [mailto:pudding_man@lycos.com] > >> Sent: Wednesday, March 19, 2003 1:48 PM > >> To: SAS-L@LISTSERV.UGA.EDU; Gerstle, John > >> Subject: Re: Help with making SQL/Data Step More Efficient - Was > >> Workaround for No Resources > >> > >> Untested but straitforward: > >> > >> proc sql; > >> create view matcheSN_BS as > >> select t1.lineID as lineID1, t2.lineID as lineID2, > >> t1.stateno as stateno1, t2.stateno as stateno2, > >> t1.birthsex as birthsex1, t2.birthsex as birthsex2 > >> from temp2 as t1, temp2 as t2 > >> where t1.lineID < t2.LineID > >> order by stateno1,stateno2,lineID1 ; > >> quit; > >> > >> data both matchSN matchbs neither; > >> set matcheSN_BS; > >> by stateno1 stateno2 lineID1 ; > >> if stateno1 = stateno2 then matchSN = 1; > >> if birthsex1 = birthsex2 then matchBS = 1; > >> if matchSN and matchBS then output both; > >> else if matchSN then output matchSN; > >> else if matchBS then output matchBS; > >> else output neither; > >> > >> format matchSN matchBS linked. ; > >> label matchSN = 'STATENO Pairs' matchBS = 'BIRTHSEX Pairs'; > >> run; > >> > > >> Failing that, any other restriction(s) you can additionally put > >> on your SQL join will potentially improve performance > >> tremendously. > >> > >> Hope it hep's ... > >> > >> Puddin' > >> > >> *********************************************************** > >> *** Puddin' Man *** Pudding_Man@lycos.com ******** > >> ***********************************************************; > >> > >> "Gimme back my wig! > >> And, honey, let you head go bald!" > >> Hound Dog Taylor, maybe 1972 > >> > >> > >> On Wed, 19 Mar 2003 11:29:18 > >> Gerstle, John wrote: > >> >Folks, > >> >A couple of weeks ago I ran into a problem where my computer could not > run a > >> >Cartesian match program using SQL. After finding out from SAS Tech > Support > >> >that my hard drive was too small for the creation of the large dataset, > I > >> >was blessed with a brand spanking new computer, 30GB hard drive, 400MB > RAM, > >> >Windows 2000 with NTFS. My program then was able to run - a Cartesian > match > >> >of a 7100 record dataset which created a ~25,000,000 record dataset via > SQL. > >> >It took 51 minutes real time and 14 (!) minutes cpu time. > >> > > >> >Well, I had to cut down the original dataset from 14,000 for this > immediate > >> >run, but will need to re-run with the entire dataset instead of the > 'halved' > >> >dataset ran above. And I expect to have larger datasets that I will > need to > >> >run over the next several weeks/months. So, I set out to make my SQL > and > >> >the data step immediately following it more efficient so that I would > not > >> >use excess memory/time. I removed all non-essential variables and moved > the > >> >PROC SORT into the PROC SQL. (Also changed my default Paging File Option > >> >size to range between 2.5 and 4 GB) > >> > > >> >The most recent run with the 7100 record dataset (Code is at the bottom > of > >> >the email): > >> >************** > >> >NOTE: The execution of this query involves performing one or more > Cartesian > >> >product joins that can not be optimized. > >> >NOTE: Table WORK.MATCHESN_BS created, with 25837266 rows and 6 columns. > >> > > >> >NOTE: PROCEDURE SQL used: > >> > real time 37:18.79 > >> > cpu time 7:48.07 > >> > > >> >NOTE: There were 25837266 observations read from the data set > >> >WORK.MATCHESN_BS. > >> >NOTE: The data set WORK.MATCHES2 has 25837266 observations and 8 > variables. > >> >NOTE: DATA statement used: > >> > real time 6:38.96 > >> > cpu time 57.28 seconds > >> >*********** > >> > > >> >The PROC SQL creates a table of all possible matches of each record in > the > >> >dataset. The following DATA STEP defines whether or not STATENO1 & 2 > match > >> >and BIRTHSEX1 & 2 match. I need the numbers for determining the number > of > >> >concordant and discordant pairs. After this, I then divvy up this large > >> >dataset into four smaller ones based on whether we have matches or not > on > >> >both STATENO and BIRTHSEX (think of the four quadrants of the frequency > >> >table between matchSN and matchBS). It seems to me that I should be able > to > >> >combine all of this into the one SQL step, thereby only creating/reading > the > >> >dataset once. Unfortunately, I'm still grasping the SQL wisdom and > haven't > >> >the best idea on how to do that. > >> > > >> >Suggestions? > >> > > >> >And is there other wisdom I need here or should I expect to be running > this > >> >program over night (which is an option)??? > >> > > >> >Thanks!!! > >> > > >> >*************************** > >> >CODE: > >> > > >> >proc sql; create table matcheSN_BS as > >> > select t1.lineID as lineID1, t2.lineID as lineID2, > >> > t1.stateno as stateno1, t2.stateno as stateno2, > >> > t1.birthsex as birthsex1, t2.birthsex as > birthsex2 > >> > from temp2 as t1, temp2 as t2 where t1.lineID < t2.LineID order by > >> >stateno1,stateno2,lineID1 ; > >> > quit; > >> > > >> >data matches2 ; set matcheSN_BS; by stateno1 stateno2 lineID1 ; > matchSN=2; > >> >matchBS=2; > >> > if stateno1 = stateno2 then matchSN = 1; > >> > if birthsex1 = birthsex2 then matchBS = 1; > >> > format matchSN matchBS linked. ; > >> > label matchSN = 'STATENO Pairs' matchBS = 'BIRTHSEX Pairs'; > >> > run; > >> > > >> > > >> > > >> >John Gerstle > >> >CDC Information Technological Support Contract (CITS) > >> >Biostatistician > >> > > >> > >> > >> _____________________________________________________________ > >> Get 25MB, POP3, Spam Filtering with LYCOS MAIL PLUS for $19.95/year. > >> http://login.mail.lycos.com/brandPage.shtml?pageId=plus&ref=lmtplus


Back to: Top of message | Previous page | Main SAS-L page