| Date: | Wed, 19 Mar 2003 20:57:40 -0800 |
| Reply-To: | shiling zhang <shiling99@YAHOO.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | shiling zhang <shiling99@YAHOO.COM> |
| Organization: | http://groups.google.com/ |
| Subject: | Re: Help with making SQL/Data Step More Efficient - Was
Workaround for No Resources |
| Content-Type: | text/plain; charset=ISO-8859-1 |
Doing a Cartesian product is really painful. If your query is a small
data set against a big one, then you may consider loading the small
data set into memories, and then matching output.
The following is an idea which shows you may save some run-time.
Note: you may combine your second data step into ONE and even save
more time.
HTH
344 data t1;
345 do i1 = 50 to 400;
346 output;
347 end;
348 run;
NOTE: The data set WORK.T1 has 351 observations and 1 variables.
NOTE: DATA statement used:
real time 0.00 seconds
349
350 data t2;
351 do i2 = 3 to 40000;
352 output;
353 end;
354 run;
NOTE: The data set WORK.T2 has 39998 observations and 1 variables.
NOTE: DATA statement used:
real time 0.05 seconds
355
356 proc sql;
357 create table t3 as
358 select i1, i2
359 from t1, t2
360 where i1<i2
361 ;
NOTE: The execution of this query involves performing one or more
Cartesian product joins that
can not be optimized.
NOTE: Table WORK.T3 created, with 13961025 rows and 2 columns.
362 quit;
NOTE: PROCEDURE SQL used:
real time 33.87 seconds
363
364 data t4;
365 array i(100000) _temporary_;
366 do until(end1);
367 set t1 end=end1;
368 n+1;
369 i(n)=i1;
370 end;
371 do until(end2);
372 set t2 end=end2;
373 do j=1 to n;
374 if i(j) <i2 then output;
375 end;
376 end;
377 drop j n;
378 run;
NOTE: There were 351 observations read from the data set WORK.T1.
NOTE: There were 39998 observations read from the data set WORK.T2.
NOTE: The data set WORK.T4 has 13961025 observations and 2 variables.
NOTE: DATA statement used:
real time 26.75 seconds
379 proc sql;
380 create table t3 as
381 select i1, i2
382 from t1, t2
383 where i1<i2
384 ;
NOTE: The execution of this query involves performing one or more
Cartesian product joins that
can not be optimized.
NOTE: Table WORK.T3 created, with 13961025 rows and 2 columns.
385 quit;
NOTE: PROCEDURE SQL used:
real time 36.35 seconds
386
387 data t4;
388 array i(100000) _temporary_;
389 do until(end1);
390 set t1 end=end1;
391 n+1;
392 i(n)=i1;
393 end;
394 do until(end2);
395 set t2 end=end2;
396 do j=1 to n;
397 if i(j) <i2 then output;
398 end;
399 end;
400 drop j n;
401 run;
NOTE: There were 351 observations read from the data set WORK.T1.
NOTE: There were 39998 observations read from the data set WORK.T2.
NOTE: The data set WORK.T4 has 13961025 observations and 2 variables.
NOTE: DATA statement used:
real time 23.06 seconds
yzg9@CDC.GOV (Gerstle, John) wrote in message news:<8726523B5547D711A0D500508B6686B257252E@MCDC-ATL-43>...
> Puddin',
>
> >> Why must you run "where t1.lineID < t2.LineID" in the SQL step?
> >>
> >> "where t1.lineID = t2.LineID" would save your po' system
> >> lots of resources.
> >>
>
> This line is in there to make sure that every record is matched to every
> other record. It doesn't need to be matched with itself (would be a little
> redundant, methinks).
>
> Thanks for your code. I'd been thinking about using a view but had't the
> time to research it enough. The online doc's suggested that a data step
> view would not be useful in this situation (at least that's how I understood
> it). But I had not read about the SQL view. I seem to remember discussions
> about this awhile back (years...) but didn't find anything searching the
> archives.
>
> I'm going to give it a go with the suggestions so far and see what I get...
>
> John Gerstle
> CDC Information Technological Support Contract (CITS)
> Biostatistician
>
>
> >> -----Original Message-----
> >> From: Puddin' Man [mailto:pudding_man@lycos.com]
> >> Sent: Wednesday, March 19, 2003 1:48 PM
> >> To: SAS-L@LISTSERV.UGA.EDU; Gerstle, John
> >> Subject: Re: Help with making SQL/Data Step More Efficient - Was
> >> Workaround for No Resources
> >>
> >> Untested but straitforward:
> >>
> >> proc sql;
> >> create view matcheSN_BS as
> >> select t1.lineID as lineID1, t2.lineID as lineID2,
> >> t1.stateno as stateno1, t2.stateno as stateno2,
> >> t1.birthsex as birthsex1, t2.birthsex as birthsex2
> >> from temp2 as t1, temp2 as t2
> >> where t1.lineID < t2.LineID
> >> order by stateno1,stateno2,lineID1 ;
> >> quit;
> >>
> >> data both matchSN matchbs neither;
> >> set matcheSN_BS;
> >> by stateno1 stateno2 lineID1 ;
> >> if stateno1 = stateno2 then matchSN = 1;
> >> if birthsex1 = birthsex2 then matchBS = 1;
> >> if matchSN and matchBS then output both;
> >> else if matchSN then output matchSN;
> >> else if matchBS then output matchBS;
> >> else output neither;
> >>
> >> format matchSN matchBS linked. ;
> >> label matchSN = 'STATENO Pairs' matchBS = 'BIRTHSEX Pairs';
> >> run;
> >>
>
> >> Failing that, any other restriction(s) you can additionally put
> >> on your SQL join will potentially improve performance
> >> tremendously.
> >>
> >> Hope it hep's ...
> >>
> >> Puddin'
> >>
> >> ***********************************************************
> >> *** Puddin' Man *** Pudding_Man@lycos.com ********
> >> ***********************************************************;
> >>
> >> "Gimme back my wig!
> >> And, honey, let you head go bald!"
> >> Hound Dog Taylor, maybe 1972
> >>
> >>
> >> On Wed, 19 Mar 2003 11:29:18
> >> Gerstle, John wrote:
> >> >Folks,
> >> >A couple of weeks ago I ran into a problem where my computer could not
> run a
> >> >Cartesian match program using SQL. After finding out from SAS Tech
> Support
> >> >that my hard drive was too small for the creation of the large dataset,
> I
> >> >was blessed with a brand spanking new computer, 30GB hard drive, 400MB
> RAM,
> >> >Windows 2000 with NTFS. My program then was able to run - a Cartesian
> match
> >> >of a 7100 record dataset which created a ~25,000,000 record dataset via
> SQL.
> >> >It took 51 minutes real time and 14 (!) minutes cpu time.
> >> >
> >> >Well, I had to cut down the original dataset from 14,000 for this
> immediate
> >> >run, but will need to re-run with the entire dataset instead of the
> 'halved'
> >> >dataset ran above. And I expect to have larger datasets that I will
> need to
> >> >run over the next several weeks/months. So, I set out to make my SQL
> and
> >> >the data step immediately following it more efficient so that I would
> not
> >> >use excess memory/time. I removed all non-essential variables and moved
> the
> >> >PROC SORT into the PROC SQL. (Also changed my default Paging File Option
> >> >size to range between 2.5 and 4 GB)
> >> >
> >> >The most recent run with the 7100 record dataset (Code is at the bottom
> of
> >> >the email):
> >> >**************
> >> >NOTE: The execution of this query involves performing one or more
> Cartesian
> >> >product joins that can not be optimized.
> >> >NOTE: Table WORK.MATCHESN_BS created, with 25837266 rows and 6 columns.
> >> >
> >> >NOTE: PROCEDURE SQL used:
> >> > real time 37:18.79
> >> > cpu time 7:48.07
> >> >
> >> >NOTE: There were 25837266 observations read from the data set
> >> >WORK.MATCHESN_BS.
> >> >NOTE: The data set WORK.MATCHES2 has 25837266 observations and 8
> variables.
> >> >NOTE: DATA statement used:
> >> > real time 6:38.96
> >> > cpu time 57.28 seconds
> >> >***********
> >> >
> >> >The PROC SQL creates a table of all possible matches of each record in
> the
> >> >dataset. The following DATA STEP defines whether or not STATENO1 & 2
> match
> >> >and BIRTHSEX1 & 2 match. I need the numbers for determining the number
> of
> >> >concordant and discordant pairs. After this, I then divvy up this large
> >> >dataset into four smaller ones based on whether we have matches or not
> on
> >> >both STATENO and BIRTHSEX (think of the four quadrants of the frequency
> >> >table between matchSN and matchBS). It seems to me that I should be able
> to
> >> >combine all of this into the one SQL step, thereby only creating/reading
> the
> >> >dataset once. Unfortunately, I'm still grasping the SQL wisdom and
> haven't
> >> >the best idea on how to do that.
> >> >
> >> >Suggestions?
> >> >
> >> >And is there other wisdom I need here or should I expect to be running
> this
> >> >program over night (which is an option)???
> >> >
> >> >Thanks!!!
> >> >
> >> >***************************
> >> >CODE:
> >> >
> >> >proc sql; create table matcheSN_BS as
> >> > select t1.lineID as lineID1, t2.lineID as lineID2,
> >> > t1.stateno as stateno1, t2.stateno as stateno2,
> >> > t1.birthsex as birthsex1, t2.birthsex as
> birthsex2
> >> > from temp2 as t1, temp2 as t2 where t1.lineID < t2.LineID order by
> >> >stateno1,stateno2,lineID1 ;
> >> > quit;
> >> >
> >> >data matches2 ; set matcheSN_BS; by stateno1 stateno2 lineID1 ;
> matchSN=2;
> >> >matchBS=2;
> >> > if stateno1 = stateno2 then matchSN = 1;
> >> > if birthsex1 = birthsex2 then matchBS = 1;
> >> > format matchSN matchBS linked. ;
> >> > label matchSN = 'STATENO Pairs' matchBS = 'BIRTHSEX Pairs';
> >> > run;
> >> >
> >> >
> >> >
> >> >John Gerstle
> >> >CDC Information Technological Support Contract (CITS)
> >> >Biostatistician
> >> >
> >>
> >>
> >> _____________________________________________________________
> >> Get 25MB, POP3, Spam Filtering with LYCOS MAIL PLUS for $19.95/year.
> >> http://login.mail.lycos.com/brandPage.shtml?pageId=plus&ref=lmtplus
|