LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (May 2002, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 3 May 2002 18:18:32 -0400
Reply-To:     "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Subject:      Re: Sorting linked transaction data
Comments: To: Ian Whitlock <WHITLOI1@WESTAT.COM>
Content-Type: text/plain; charset=iso-8859-1

Ian,

It kind of harks back to our past discussion of a similar issue (raised I think by Ludwig Boltzmann circa 1998). Then, too, you used an index, with its instrinsic value of being able to swallow quite serious a number of keys before it may become a resource problem. And, just like before, I wondered whether a single-pass solution is possible if the NEXT column can be kept in a memory-resident lookup table one way or another. Unfortunately, nothing really elegant is bubbling up through the crevices (an excuse: At lunch, I was forced to consume more food than can be digested without washing it down with a liberal amount of wine - making it exceedingly difficult to engage in a rational thought process), just this:

data asorted ( keep = previd currid nextid ) ; array x (0 : 99999) _temporary_ ; do _n_ = 1 to n ; set transactions (rename = (previd = p)) nobs = n ; x (c) = nextid ; if p = . then start = currid ; end ; currid = start ; do _n_ = 1 to n ; nextid = x (currid) ; output ; previd = currid ; currid = nextid ; end ; run ;

Understandably, this performs exceptionally fast, as long as ID remains a narrow-range integer. From the way Richard concocted the test data, it appears that it will cease to be the case already at quite moderate values of &N. This dictates using a hash table - thus limiting the number of distinct IDs that can be searched in O(1) by the amount of available memory. Assuming that the key is still numeric and integer, the code is the same as above in principle, just the simple key-indexing lines

x (c) = nextid ; nextid = x (currid) ;

become more convoluted:

data hsorted (keep = previd currid nextid) ; array x (0 : &h) _temporary_ ; array c (0 : &h) _temporary_ ; do _n_ = 1 to n ; set transactions (rename = (previd = p)) nobs = n ; link hash ; x (j) = currid ; c (j) = nextid ; ; if p = . then start = currid ; end ; currid = start ; do _n_ = 1 to n ; link hash ; nextid = c (j) ; output ; previd = currid ; currid = nextid ; end ; stop ; hash: do j = mod(currid,&h) by 1 until (x(j) = . or x(j) = currid) ; if j = &h then j = 0 ; end ; run ;

To feel the penalty O(log(n)) index search imposes over O(1) search in memory, I ran this step and your 'natural' code against Richard's test data at N = 300000 (on a quite wussy office PC with V8.1). The log is after the sig line.

Kind regards, ================= Paul M. Dorfman Jacksonville, FL =================

26 data trans ( index = ( currid ) ) ; 27 set transactions ; 28 if previd = . then 29 call symput ( "start" , left(put(currid,best8.))) ; 30 run ;

NOTE: There were 299999 observations read from the data set WORK.TRANSACTIONS. NOTE: The data set WORK.TRANS has 299999 observations and 4 variables. NOTE: DATA statement used: real time 7.14 seconds cpu time 1.71 seconds

31 32 data isorted ( drop = next i ) ; 33 retain next &start ; 34 currid = next ; 35 set trans key = currid nobs = nobs ; 36 if _n_ > nobs or _iorc_ then 37 do ; 38 _error_ = 0 ; 39 stop ; 40 end ; 41 next = nextid ; 42 43 run ;

NOTE: There were 300000 observations read from the data set WORK.TRANS. NOTE: The data set WORK.ISORTED has 299999 observations and 3 variables. NOTE: DATA statement used: real time 19.35 seconds cpu time 17.93 seconds

44 45 %let h = 500001 ; 46 47 data hsorted (keep = previd currid nextid) ; 48 array x (0 : &h) _temporary_ ; 49 array c (0 : &h) _temporary_ ; 50 do _n_ = 1 to n ; 51 set transactions (rename = (previd = p)) nobs = n ; 52 link hash ; 53 x (j) = currid ; 54 c (j) = nextid ; ; 55 if p = . then start = currid ; 56 end ; 57 currid = start ; 58 do _n_ = 1 to n ; 59 link hash ; 60 nextid = c (j) ; 61 output ; 62 previd = currid ; 63 currid = nextid ; 64 end ; 65 stop ; 66 hash: do j = mod(currid,&h) by 1 until (x(j) = . or x(j) = currid) ; 67 if j = &h then j = 0 ; 68 end ; 69 run ;

NOTE: There were 299999 observations read from the data set WORK.TRANSACTIONS. NOTE: The data set WORK.HSORTED has 299999 observations and 3 variables. NOTE: DATA statement used: real time 2.78 seconds cpu time 1.29 seconds

> From: Ian Whitlock [mailto:WHITLOI1@WESTAT.COM] > > Richard, > > Fast natural code solution. > > data trans ( index = ( currid ) ) ; > set transactions ; > if previd = . then > call symput ( "start" , left(put(currid,best8.))) ; > run ; > > data sorted ( drop = next ) ; > retain next &start ; > currid = next ; > set trans key = currid /unique nobs = nobs ; > if _n_ > nobs or _iorc_ then > do ; > _error_ = 0 ; > stop ; > end ; > next = nextid ; > run ; > > If you need efficiency please state your requirements and > give the problem > your sort is suppose to solve. > > I found the placement of the NOBS option quite a surprise. > Apparently the > "/ unique" is a part of the KEY= option and not a SET > statement option since > NOBS= can also be put before the KEY= option. NOBS= was used to guard > against cyclic mistakes. > > IanWhitlock@westat.com > -----Original Message----- > From: Richard A. DeVenezia [mailto:radevenz@IX.NETCOM.COM] > Sent: Friday, May 03, 2002 6:19 AM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Sorting linked transaction data > > > I have data where each row has a value that links it with one > other row. I > need to sort this data in a 'follow the link' manner. > > Here is some sample data: > > %let N = 20; > > * make fake data, ordered according to linkage; > * Id is unique, but not necessarily 1<=ID<=N; > > data transactions; > > array use [ &N ] _temporary_; > > seed = 0; > > do i = 1 to &N ; > > prevId = currId; > currId = nextId; > nextId = int( &N * ranuni(seed) ) + 1; > > do while (use[nextId] ne .); > if nextId >= &N then nextId=1; else nextId=nextId+1; > end; > > use[nextId] = 1; > > nextId = nextId * (1 + int (1000*ranuni(seed))); > > output; > end; > > format _numeric_ 8.; > keep prevId currId nextId I; > label i = "I normally would not be here"; > run; > > * disorder the transactions; > > proc sort data=transactions; > by nextId; > where currId; > run; > > * need to sort transactions such that > * > * prevId[1] = . > * nextId[rowN] = currId[rowN+1] > *; > > > > -- > Richard A. DeVenezia >

Blue Cross Blue Shield of Florida, Inc., and its subsidiary and affiliate companies are not responsible for errors or omissions in this e-mail message. Any personal comments made in this e-mail do not reflect the views of Blue Cross Blue Shield of Florida, Inc.


Back to: Top of message | Previous page | Main SAS-L page