Date: Fri, 3 May 2002 18:18:32 -0400
Reply-To: "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Subject: Re: Sorting linked transaction data
Content-Type: text/plain; charset=iso-8859-1
Ian,
It kind of harks back to our past discussion of a similar issue (raised I
think by Ludwig Boltzmann circa 1998). Then, too, you used an index, with
its instrinsic value of being able to swallow quite serious a number of keys
before it may become a resource problem. And, just like before, I wondered
whether a single-pass solution is possible if the NEXT column can be kept in
a memory-resident lookup table one way or another. Unfortunately, nothing
really elegant is bubbling up through the crevices (an excuse: At lunch, I
was forced to consume more food than can be digested without washing it down
with a liberal amount of wine - making it exceedingly difficult to engage in
a rational thought process), just this:
data asorted ( keep = previd currid nextid ) ;
array x (0 : 99999) _temporary_ ;
do _n_ = 1 to n ;
set transactions (rename = (previd = p)) nobs = n ;
x (c) = nextid ;
if p = . then start = currid ;
end ;
currid = start ;
do _n_ = 1 to n ;
nextid = x (currid) ;
output ;
previd = currid ;
currid = nextid ;
end ;
run ;
Understandably, this performs exceptionally fast, as long as ID remains a
narrow-range integer. From the way Richard concocted the test data, it
appears that it will cease to be the case already at quite moderate values
of &N. This dictates using a hash table - thus limiting the number of
distinct IDs that can be searched in O(1) by the amount of available memory.
Assuming that the key is still numeric and integer, the code is the same as
above in principle, just the simple key-indexing lines
x (c) = nextid ;
nextid = x (currid) ;
become more convoluted:
data hsorted (keep = previd currid nextid) ;
array x (0 : &h) _temporary_ ;
array c (0 : &h) _temporary_ ;
do _n_ = 1 to n ;
set transactions (rename = (previd = p)) nobs = n ;
link hash ;
x (j) = currid ;
c (j) = nextid ; ;
if p = . then start = currid ;
end ;
currid = start ;
do _n_ = 1 to n ;
link hash ;
nextid = c (j) ;
output ;
previd = currid ;
currid = nextid ;
end ;
stop ;
hash: do j = mod(currid,&h) by 1 until (x(j) = . or x(j) = currid) ;
if j = &h then j = 0 ;
end ;
run ;
To feel the penalty O(log(n)) index search imposes over O(1) search in
memory, I ran this step and your 'natural' code against Richard's test data
at N = 300000 (on a quite wussy office PC with V8.1). The log is after the
sig line.
Kind regards,
=================
Paul M. Dorfman
Jacksonville, FL
=================
26 data trans ( index = ( currid ) ) ;
27 set transactions ;
28 if previd = . then
29 call symput ( "start" , left(put(currid,best8.))) ;
30 run ;
NOTE: There were 299999 observations read from the data set
WORK.TRANSACTIONS.
NOTE: The data set WORK.TRANS has 299999 observations and 4 variables.
NOTE: DATA statement used:
real time 7.14 seconds
cpu time 1.71 seconds
31
32 data isorted ( drop = next i ) ;
33 retain next &start ;
34 currid = next ;
35 set trans key = currid nobs = nobs ;
36 if _n_ > nobs or _iorc_ then
37 do ;
38 _error_ = 0 ;
39 stop ;
40 end ;
41 next = nextid ;
42
43 run ;
NOTE: There were 300000 observations read from the data set WORK.TRANS.
NOTE: The data set WORK.ISORTED has 299999 observations and 3 variables.
NOTE: DATA statement used:
real time 19.35 seconds
cpu time 17.93 seconds
44
45 %let h = 500001 ;
46
47 data hsorted (keep = previd currid nextid) ;
48 array x (0 : &h) _temporary_ ;
49 array c (0 : &h) _temporary_ ;
50 do _n_ = 1 to n ;
51 set transactions (rename = (previd = p)) nobs = n ;
52 link hash ;
53 x (j) = currid ;
54 c (j) = nextid ; ;
55 if p = . then start = currid ;
56 end ;
57 currid = start ;
58 do _n_ = 1 to n ;
59 link hash ;
60 nextid = c (j) ;
61 output ;
62 previd = currid ;
63 currid = nextid ;
64 end ;
65 stop ;
66 hash: do j = mod(currid,&h) by 1 until (x(j) = . or x(j) = currid) ;
67 if j = &h then j = 0 ;
68 end ;
69 run ;
NOTE: There were 299999 observations read from the data set
WORK.TRANSACTIONS.
NOTE: The data set WORK.HSORTED has 299999 observations and 3 variables.
NOTE: DATA statement used:
real time 2.78 seconds
cpu time 1.29 seconds
> From: Ian Whitlock [mailto:WHITLOI1@WESTAT.COM]
>
> Richard,
>
> Fast natural code solution.
>
> data trans ( index = ( currid ) ) ;
> set transactions ;
> if previd = . then
> call symput ( "start" , left(put(currid,best8.))) ;
> run ;
>
> data sorted ( drop = next ) ;
> retain next &start ;
> currid = next ;
> set trans key = currid /unique nobs = nobs ;
> if _n_ > nobs or _iorc_ then
> do ;
> _error_ = 0 ;
> stop ;
> end ;
> next = nextid ;
> run ;
>
> If you need efficiency please state your requirements and
> give the problem
> your sort is suppose to solve.
>
> I found the placement of the NOBS option quite a surprise.
> Apparently the
> "/ unique" is a part of the KEY= option and not a SET
> statement option since
> NOBS= can also be put before the KEY= option. NOBS= was used to guard
> against cyclic mistakes.
>
> IanWhitlock@westat.com
> -----Original Message-----
> From: Richard A. DeVenezia [mailto:radevenz@IX.NETCOM.COM]
> Sent: Friday, May 03, 2002 6:19 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Sorting linked transaction data
>
>
> I have data where each row has a value that links it with one
> other row. I
> need to sort this data in a 'follow the link' manner.
>
> Here is some sample data:
>
> %let N = 20;
>
> * make fake data, ordered according to linkage;
> * Id is unique, but not necessarily 1<=ID<=N;
>
> data transactions;
>
> array use [ &N ] _temporary_;
>
> seed = 0;
>
> do i = 1 to &N ;
>
> prevId = currId;
> currId = nextId;
> nextId = int( &N * ranuni(seed) ) + 1;
>
> do while (use[nextId] ne .);
> if nextId >= &N then nextId=1; else nextId=nextId+1;
> end;
>
> use[nextId] = 1;
>
> nextId = nextId * (1 + int (1000*ranuni(seed)));
>
> output;
> end;
>
> format _numeric_ 8.;
> keep prevId currId nextId I;
> label i = "I normally would not be here";
> run;
>
> * disorder the transactions;
>
> proc sort data=transactions;
> by nextId;
> where currId;
> run;
>
> * need to sort transactions such that
> *
> * prevId[1] = .
> * nextId[rowN] = currId[rowN+1]
> *;
>
>
>
> --
> Richard A. DeVenezia
>
Blue Cross Blue Shield of Florida, Inc., and its subsidiary and
affiliate companies are not responsible for errors or omissions in this e-mail message. Any personal comments made in this e-mail do not reflect the views of Blue Cross Blue Shield of Florida, Inc.